Reentrant Memory Question (Preallocate / Shared)

Okay, so over the years I've been defaulting to Shared clones in reentrant code. My thought process was to reduce memory usage, with the trade-off being more jitter, since we might need to allocate a new VI clone or we might be able to reuse an existing one.

 

But now I'm doing more real-time work, and the context help for the Shared option specifically states, "If you intend to run your VIs on Real-Time operating systems, select Preallocate instead." I realize this statement is meant to promote less jitter on RT, but what about clones inside of clones inside of clones? Won't Preallocate create tons and tons of copies that aren't necessary?

 

For example, let's say I have a subVI called "Read and Write Data.vi". This VI calls "Read or Write Data.vi" twice, with an input telling that subVI whether it should perform a Read or a Write, so "Read and Write Data.vi" just calls the same VI twice, once to read and once to write. If we set "Read or Write Data.vi" to be a preallocated clone, I realize there will always be two copies of it in memory: one copy used for the read and one copy used for the write. But let's say I put "Read and Write Data.vi" in a parallelized for loop that runs 10 times, and I set "Read and Write Data.vi" to also be a preallocated clone. Now are there 10 copies of "Read and Write Data.vi" and 20 copies of "Read or Write Data.vi"? And what if I put that for loop in a subVI that is itself preallocated reentrant, and it runs 5 times? Now are there 100 copies of "Read or Write Data.vi" in memory? This seems a bit crazy, and as a VI developer I can't know whether a user of my API will choose to do something like this and cause a crazy number of clones.
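(Just to spell out the arithmetic I'm worried about, here is a back-of-the-envelope sketch in plain Python, not LabVIEW, under the worst-case assumption that every preallocated caller instance carries its own preallocated clones of its callees. The counts are just the ones from the example above.)

```python
# Worst-case clone math if everything in the example is set to Preallocate.
# Assumption (the thing I'm asking about): every preallocated instance of a
# caller carries its own preallocated clones of each reentrant subVI it calls.

calls_per_caller        = 2    # "Read and Write Data.vi" calls "Read or Write Data.vi" twice
parallel_loop_instances = 10   # parallelized for loop around "Read and Write Data.vi"
outer_callers           = 5    # that loop sits in a preallocated subVI run 5 times

# One level of nesting (just the parallel loop):
print(parallel_loop_instances)                        # 10 copies of "Read and Write Data.vi"
print(parallel_loop_instances * calls_per_caller)     # 20 copies of "Read or Write Data.vi"

# Two levels of nesting (the loop inside the outer preallocated subVI):
print(parallel_loop_instances * outer_callers * calls_per_caller)   # 100 copies
```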

 

Looking at the majority of NI code I see Preallocate as the option selected.

 

Also, I usually tend to inline subVIs if possible, figuring it will reduce the overhead of each call. So does having that setting affect this clone-inside-a-clone-inside-a-clone issue?

Message 1 of 10
(3,975 Views)

You are correct that the subVIs should be inlined. This does require more memory in the caller, but the clones are allocated in the caller, not dynamically.

 

Now, if your caller is also preallocated, you need to address the size of the clone pool at start-up or risk needing to spawn a new clone during execution. Spawning a new clone takes time, and that time will cause significant jitter.

 

Thankfully there is an allocate-clones method, so you can increase the size of the clone pool (the default is 1 clone per core; hey, it expects to have one instance in a parallelized loop!) and bury the clone spawning in startup time. Just count the clone instances and multiply by the number of processors and parallel processes. Remember, a parallelized loop still can't call more than # processors of simultaneous iterations, so your example with 100 copies really isn't accurate unless you have 64 hyperthreaded cores.
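(A minimal sketch of that sizing rule in plain Python, not LabVIEW; the counts are placeholders you would replace with your own numbers.)

```python
# Rough clone-pool sizing per the rule of thumb above:
# clone call sites per caller x parallel callers x processors.
import os

clone_call_sites = 2                    # reentrant subVI call sites inside one caller
parallel_callers = 4                    # callers you expect to be running at once
processors       = os.cpu_count() or 1  # parallel loop instances default to one per core

clone_pool_size = clone_call_sites * parallel_callers * processors
print(f"Pre-spawn roughly {clone_pool_size} clones at startup to keep the jitter out of run time")
```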


"Should be" isn't "Is" -Jay
0 Kudos
Message 2 of 10
(3,954 Views)

To be precise: you'll have as many clone instances as you specify in "Number of generated parallel loop instances" in the For Loop parallelism configuration.

[Attached image: parallelfor.png, the For Loop Iteration Parallelism configuration dialog]

You can actually check the number of clones by opening the VI and selecting menu View -> Browse Relationships -> Reentrant Items.

 

(There is also a hidden VI property node, Metrics -> Number of clones, if you like the brown color.)

 

Btw.

 


@Hooovahh wrote:

 

For example.  Lets say I have a subVI called "Read and Write Data.vi".  And in this VI calls the VI "Read or Write Data.vi" twice


It took me a moment to understand you're talking about two different VIs; that's a bit of an unfortunate choice of names for the example 😉

Message 3 of 10
(3,931 Views)
@JÞB wrote:

 

 

Remember, a parallelized loop still can't call more than # processors of simultaneous iterations, so your example with 100 copies really isn't accurate unless you have 64 hyperthreaded cores.


PiDi already corrected you a bit, but I know this statement isn't quite true. If I have a single-core PC and a parallelized for loop calling a reentrant subVI, and that reentrant subVI only has a single function in it, a Wait of 1000 ms, then running that for loop 64 times with both P and N set to 64 should finish in around 1 second (you may also need to set the number of generated parallel loop instances to 64). If your statement were true and you couldn't call more simultaneous copies than cores, then this should take 64 seconds. I don't have a single-core machine to test with, but my quad-core also only takes 1 second. I realize a wait function isn't very useful, but I've used similar techniques talking to N serial ports where each will respond at a different rate, or reading from N DAQ cards where each will take roughly the same amount of time to respond with something like a finite sample read.
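(For anyone who wants a text-only version of that benchmark, here's a rough non-LabVIEW analogy in Python; threads stand in for the parallel loop instances and time.sleep for the 1000 ms Wait. Even on one core it finishes in about 1 second, because the waits all give up their thread.)

```python
import threading, time

N = 64  # loop iterations, and number of parallel instances / clones

def iteration():
    time.sleep(1.0)   # stands in for the 1000 ms Wait inside the reentrant subVI clone

start = time.time()
threads = [threading.Thread(target=iteration) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{N} one-second waits finished in {time.time() - start:.2f} s")  # ~1 s, not 64 s
```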

 

@PiDi wrote:

It took me a moment to understand you're talking about two different VIs; that's a bit of an unfortunate choice of names for the example 😉


Yeah it was hard to come up with an example that represented my concern.  I probably could have posted some example code.

 

Using the private method, I'm able to look at all VIs in all contexts and see which ones have the most copies. In a pretty basic project the largest was the Sort 1D Array I32, which had 163 clones. This appears to be because the OpenG Delete From Array uses it. The next was an XNet VI at 108, which also had Preallocate. The next was an XNet VI that was preallocated, which made 33 clones. All of my reuse VIs (which are Shared) have 3 or fewer clones.

 

If my options are either to have 3 copies of a VI (but have jitter when allocating those) or to have no jitter but have tens of clones, up to over 100, then I'd rather have Shared. Of course I don't know how much this jitter is for Shared, but hopefully not too much.

 

On a side note, OpenG really should be inlined, or at least preallocated for the most common VIs. I posted some code a while ago over on LAVA which would inline the OpenG functions that can be inlined, but I don't run that on every machine I use.

Message 4 of 10
(3,911 Views)

Hooovahh wrote: 

Looking at the majority of NI code I see Preallocate as the option selected.

 

Also, I usually tend to inline subVIs if possible, figuring it will reduce the overhead of each call. So does having that setting affect this clone-inside-a-clone-inside-a-clone issue?


If you're already inlining your subVIs, then you're essentially already forcing them to be pre-allocated clones. Inline or not, additional clones shouldn't require much extra memory unless you're storing huge data structures in them. The compiler doesn't need to store duplicate copies of the machine instructions that make up the VI, it only needs to allocate separate storage for any temporary data it uses. The overhead required for an additional clone of a VI that operates in-place on its inputs should be negligible.
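(A loose analogy outside LabVIEW, sketched in Python: clones behave a bit like instances of a class, which all share one copy of the code but each get their own data storage. This is only an illustration of the idea, not how LabVIEW is actually implemented internally.)

```python
class Clone:
    """Each 'clone' gets its own temporary data; the method code exists only once."""
    def __init__(self):
        self.scratch = []           # per-clone storage, the only real memory cost

    def run(self, value):
        self.scratch.append(value)  # touches only this clone's data
        return value

clones = [Clone() for _ in range(100)]   # 100 clones, one copy of the code
# every bound method refers to the same underlying function object:
assert all(c.run.__func__ is clones[0].run.__func__ for c in clones)
```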

Message 5 of 10
(3,910 Views)

@Hooovahh wrote:
@JÞB wrote:

 

 

Remember, a parallelized loop still can't call more than # processors of simultaneous iterations, so your example with 100 copies really isn't accurate unless you have 64 hyperthreaded cores.


PiDi already corrected you a bit, but I know this statement isn't quite true. If I have a single-core PC and a parallelized for loop calling a reentrant subVI, and that reentrant subVI only has a single function in it, a Wait of 1000 ms, then running that for loop 64 times with both P and N set to 64 should finish in around 1 second (you may also need to set the number of generated parallel loop instances to 64). If your statement were true and you couldn't call more simultaneous copies than cores, then this should take 64 seconds. I don't have a single-core machine to test with, but my quad-core also only takes 1 second. I realize a wait function isn't very useful, but I've used similar techniques talking to N serial ports where each will respond at a different rate, or reading from N DAQ cards where each will take roughly the same amount of time to respond with something like a finite sample read.

 


I'd like to see that code with the msec count output as an array (to avoid the obvious compiler optimization). But I'm willing to chalk it up to "You learn something new..." Wait also releases the thread, so you can wait while you are waiting. That is, you start a 1000 ms wait, release the thread, and start another wait while you are waiting. I don't think that is a fair benchmark for how many iterations are active at any time.


"Should be" isn't "Is" -Jay
0 Kudos
Message 6 of 10
(3,872 Views)

Jeff·Þ·Bohrer wrote: 

I'd like to see that code with the msec count output as an array (to avoid the obvious compiler optimization). But I'm willing to chalk it up to "You learn something new..." Wait also releases the thread, so you can wait while you are waiting. That is, you start a 1000 ms wait, release the thread, and start another wait while you are waiting. I don't think that is a fair benchmark for how many iterations are active at any time.


Well, if this is the case then many other NI functions will release the thread too (which is good), like the DAQmx Read, XNet Read, and VISA Read, to name a few. Attached is a demo using the wait. I had a system with 12 COM ports and I wanted to exercise them in parallel. So I had an array of VISA references and used a for loop where I'd send the same write to all 12 at the same time. Then as soon as a port saw so many bytes, I'd need to send several more bytes back, and I couldn't wait for all 12 to reply because some were faster than others. So in my subVI I'd write, then read, waiting for my data, and as soon as I saw it I would write more data to that port. I did this to all 12 ports at once even though it was only a dual-core machine at the time. I mean, it shouldn't be any different from having that subVI called 12 times in parallel; it was just easier to do in a single for loop.
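(Here's roughly what that pattern looks like in text form, as a non-LabVIEW Python sketch; asyncio stands in for the parallelized loop, and the sleeps simulate ports that answer at different speeds. It isn't the attached demo, just an illustration of why a dual-core machine can service all 12 exchanges at once.)

```python
import asyncio, random

async def exchange(port_id):
    """Stand-in for the per-port subVI: write, wait for the reply
    (each port answers at its own rate), then write the follow-up bytes."""
    await asyncio.sleep(random.uniform(0.1, 1.0))  # waiting on the first reply
    await asyncio.sleep(0.05)                      # sending the follow-up bytes
    return port_id

async def main():
    # All 12 "ports" are serviced concurrently regardless of core count,
    # because each exchange yields control whenever it is waiting.
    done = await asyncio.gather(*(exchange(p) for p in range(12)))
    print(done)

asyncio.run(main())
```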

0 Kudos
Message 7 of 10
(3,860 Views)

@Hooovahh wrote:


Well, if this is the case then many other NI functions will release the thread too (which is good), like the DAQmx Read, XNet Read, and VISA Read, to name a few. Attached is a demo using the wait. I had a system with 12 COM ports and I wanted to exercise them in parallel. So I had an array of VISA references and used a for loop where I'd send the same write to all 12 at the same time. Then as soon as a port saw so many bytes, I'd need to send several more bytes back, and I couldn't wait for all 12 to reply because some were faster than others. So in my subVI I'd write, then read, waiting for my data, and as soon as I saw it I would write more data to that port. I did this to all 12 ports at once even though it was only a dual-core machine at the time. I mean, it shouldn't be any different from having that subVI called 12 times in parallel; it was just easier to do in a single for loop.


Only async nodes in LabVIEW, which in fact implement the cooperative multitasking that LabVIEW already did before it learned to be multithreaded, can release the thread for other parallel clumps to execute. VIs that are set to be reentrant can release the current clump at such async nodes and at certain structure borders, which LabVIEW uses as natural clump borders, but not anywhere else.

Most nodes that have some form of wait operation, such as a timeout or explicit wait, are async and able to release the current clump, but for VISA, for instance, it depends on whether you enable asynchronous operation for the VISA Read or Write.

 

But DAQmx Read is a bad example. While it is set to be reentrant by NI and can therefore be called in parallel multiple times, it ultimately calls into the DAQmx library through a Call Library Node, and once the call has been sent off to the DLL, there is nothing LabVIEW can do to reuse the current thread for something else. You can still call the function another time in another thread in any of the execution pools that LabVIEW has, but the number of threads configured for the current execution system ultimately limits how far you can parallelize that in a parallelized loop.
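(A rough way to see that limit outside LabVIEW, sketched in Python: a call that blocks inside a library has to occupy a worker thread for its whole duration, so the thread-pool size caps the concurrency, unlike the cooperative waits discussed above.)

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_call(_):
    time.sleep(1.0)   # stands in for a call that sits blocked inside a DLL

# With only 4 worker threads (think: the execution system's thread count),
# 16 blocking calls take about 4 seconds, not about 1 second.
start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(blocking_call, range(16)))
print(f"16 blocking calls on 4 threads took {time.time() - start:.1f} s")
```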

Rolf Kalbermatter
My Blog
Message 8 of 10
(3,850 Views)

nathand wrote:

Inline or not, additional clones shouldn't require much extra memory unless you're storing huge data structures in them. The compiler doesn't need to store duplicate copies of the machine instructions that make up the VI, it only needs to allocate separate storage for any temporary data it uses.

After re-reading what I wrote, I realized I should clarify. If you inline a subVI, you are explicitly telling the compiler to include a new copy of the machine instructions for that subVI as part of the calling VI, although possibly in reduced form since the compiler can optimize out portions of the subVI that aren't needed for that particular instance. If you DON'T inline, then the compiler only needs to store one copy of the machine instructions regardless of how many clones there are, but that one copy needs to implement the full functionality of the VI because it could be called with any set of inputs, and there's the standard overhead of calling a subVI. In my original comments, I should not have implied that multiple instances of an inlined subVI can all be stored as the same set of machine instructions, since that is almost definitely not the case.
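(A small non-LabVIEW illustration of that "reduced form" point, in Python; the function below is a made-up stand-in for the "Read or Write Data.vi" example from earlier in the thread.)

```python
def read_or_write(data, mode):
    # The shared, non-inlined version must keep both branches,
    # because any caller could pass either mode at run time.
    if mode == "read":
        return list(data)
    data.append(0)
    return data

def caller_with_inlining(data):
    # When the body is copied ("inlined") at a call site where mode is a
    # constant, the dead branch can be dropped: this copy only reads...
    snapshot = list(data)
    # ...and this copy only writes.
    data.append(0)
    return snapshot, data
```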

Message 9 of 10
(3,837 Views)

Thanks nathand. Reminds me that I meant to clarify the difference between calling a reentrant VI in a loop (one clone) and calling a VI asynchronously in a loop with the reentrant flag set from the Open VI Reference primitive (one clone per ref).


"Should be" isn't "Is" -Jay
0 Kudos
Message 10 of 10
(3,826 Views)