LabVIEW


For loop parallelism vs starting multiple asynchronous call and collect

I'm curious about the fundamental difference between using a parallel For Loop and using a loop to start multiple asynchronous calls and collect them (as in the "Benchmarking Asynchronous Calls" example), specifically with respect to how threads and CPU cores are used. A question about the difference was asked on LAVA in mid-2024, but it didn't discuss what I'm curious about: https://lavag.org/topic/23855-parallel-for-loop-versus-async-call-and-collect/

 

In that LAVA thread, someone wrote that the programmer needs to be especially mindful of the following when using For Loop parallelism:

  1. The code needs to be able to run in parallel (e.g., re-entrant subVIs), but I would think that applies to any code that runs in parallel.
  2. Being careful when accessing shared resources, but I would think that applies to any multithreading that accesses shared data and resources.

Beyond that, someone wrote that they only use For Loop parallelism for simple code, although "simple" wasn't defined and is subjective, so I'm not sure what to make of that, and no justification was given that I could see.

 

I could be wrong, but from what I've pieced together, my guess at the fundamental difference is this: For Loop parallelism runs each instance on a different CPU core, so if fewer CPU cores are used than there are iterations, some iterations can't start until others end. An asynchronous call, by contrast, uses multithreading to progress every iteration at the same time (to the degree that CPU scheduling allows), so while it can use multiple CPU cores, it isn't dependent on them to execute iterations concurrently. Is this correct?
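To make my mental model concrete, here is a rough sketch in Go (LabVIEW is graphical, so this is only a text-language analogy I put together; every name in it is mine, not LabVIEW's). The first half caps concurrency at a fixed pool of workers, which is how I imagine the parallel For Loop behaving; the second launches one task per iteration and lets the scheduler time-slice them, which is how I imagine async call and collect behaving:

// Rough analogy in Go (not LabVIEW): a bounded worker pool vs. one
// concurrent task per iteration. All names here are hypothetical.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func work(i int) {
	time.Sleep(10 * time.Millisecond) // stand-in for one loop iteration
	fmt.Println("finished iteration", i)
}

func main() {
	const iterations = 32

	// "Parallel FOR loop" analogy: a fixed pool of P workers (P ~ cores);
	// an iteration cannot start until one of the P workers is free.
	pool := make(chan int) // iterations are handed out over this channel
	var wg sync.WaitGroup
	for p := 0; p < runtime.NumCPU(); p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range pool {
				work(i)
			}
		}()
	}
	for i := 0; i < iterations; i++ {
		pool <- i
	}
	close(pool)
	wg.Wait()

	// "Async call and collect" analogy: every iteration gets its own
	// concurrent task immediately; the scheduler time-slices them across
	// however many cores exist.
	var wg2 sync.WaitGroup
	for i := 0; i < iterations; i++ {
		wg2.Add(1)
		go func(i int) { defer wg2.Done(); work(i) }(i)
	}
	wg2.Wait()
}

If my guess is right, the first model can leave iterations waiting for a free worker while the second cannot.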

 

If of interest (though hopefully not relevant to answering my question), my specific applications of For Loop parallelism involve accessing different resources in parallel (e.g., communicating on different COM ports), similar to an example of its use given in the LAVA thread. In some of my applications the resources are different devices requiring different code, which I've handled with a Case Structure inside the parallel For Loop.

Message 1 of 17

Typically the code inside a parallel FOR loop should not be "simple" (whatever that means). Simple code is fast anyway, and the parallelization overhead will nullify any parallel advantage.

 

Parallel FOR loops can contain critical sections as long as they contribute only a small percentage of the total loop work. For example, in one of my programs, all parallel instances interact with the same map-based cache (a non-reentrant subVI!) to see if a calculation has already been made (or already started in another parallel instance) for the same inputs recently. Still, it scales perfectly with the number of available cores. This cache is not trivial: it is fixed size and also maintains an "age" for each entry that must be updated with each access, so that the oldest entry can be discarded if space runs out.
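In text form, the shape of that cache is roughly the following (a Go sketch, since I can't paste LabVIEW diagrams as text; the key and value types and the size are invented for illustration). The lock is the analogue of the non-reentrant subVI boundary, and it is held only for the brief map work, never for the expensive calculation itself:

// Sketch (Go, not LabVIEW) of a fixed-size, age-tracking cache shared by
// many parallel workers. Details are hypothetical.
package cache

import "sync"

type entry struct {
	value float64
	age   int64 // bumped on every access; smallest age = oldest entry
}

type Cache struct {
	mu    sync.Mutex // the "non-reentrant subVI": one caller at a time
	tick  int64
	limit int
	data  map[string]*entry
}

func NewCache(limit int) *Cache {
	return &Cache{limit: limit, data: make(map[string]*entry)}
}

// Lookup returns a cached result and refreshes its age.
func (c *Cache) Lookup(key string) (float64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tick++
	if e, ok := c.data[key]; ok {
		e.age = c.tick
		return e.value, true
	}
	return 0, false
}

// Store inserts a result, discarding the oldest entry if space runs out.
func (c *Cache) Store(key string, v float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tick++
	if len(c.data) >= c.limit {
		oldestKey, oldestAge := "", c.tick+1
		for k, e := range c.data {
			if e.age < oldestAge {
				oldestKey, oldestAge = k, e.age
			}
		}
		delete(c.data, oldestKey)
	}
	c.data[key] = &entry{value: v, age: c.tick}
}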

 

Not shown in the table is the loop rate when all entries are already cached; it is 1000x+ faster, showing that the cache overhead is <<1%.

 

Message 2 of 17

Thanks, altenbach. I fear I may have asked too general a question.


To ask a more specific question: let's say I absolutely must have a certain number of concurrent threads, but I don't know how many at compile time. That number might exceed the number of logical processors, or I might not want to use all logical processors, to leave resources for other operations (as an extreme case, just to highlight the point, I might prefer multithreading on a single logical processor). How can this be achieved?
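Here is a sketch in Go of the behaviour I'm after (only an analogy, since LabVIEW is graphical; the names and numbers are hypothetical): both the number of tasks and the concurrency cap are plain runtime values, and the cap can sit below the core count (to leave resources free) or above it:

// Go analogy (not LabVIEW): concurrency chosen at runtime via a
// counting semaphore. All names and numbers are made up.
package main

import (
	"fmt"
	"sync"
	"time"
)

func talkToDevice(id int) {
	time.Sleep(50 * time.Millisecond) // stand-in for device I/O
	fmt.Println("device", id, "serviced")
}

func main() {
	numDevices := 12   // discovered at runtime, e.g. by enumerating ports
	maxConcurrent := 3 // also chosen at runtime; may be below or above cores

	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup
	for id := 0; id < numDevices; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire one of maxConcurrent slots
			defer func() { <-sem }() // release
			talkToDevice(id)
		}(id)
	}
	wg.Wait()
}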

 

The specific problem that prompted this is an application interfacing with multiple devices that operate in parallel. I don't know the number of devices at compile time, and they must be communicated with in parallel, for example to prevent hardware and software buffers from filling up (that is, waiting on one device can't be allowed to block operations with the others).


Currently, the runtime methods I am aware of for creating multiple threads are: (1) a parallel FOR loop; (2) opening a VI reference with the 0x40 option and then starting multiple asynchronous calls; (3) opening multiple VI references to the same VI and then starting an asynchronous call on each. There are probably other methods I am not aware of; honestly, I have only used parallel FOR loops in the past and have only just started exploring asynchronous calls.


Forgive the following if it is incoherent or contains misunderstandings; I am only beginning to explore this problem and its solutions:


From what I've read, parallel FOR loops assign each instance to a different logical processor, so if the number of threads I require is greater than the number of logical processors, oversubscribing won't guarantee that additional instances are started unless there are waiting operations; and, as mentioned, I don't necessarily want to use all logical processors. It seems a little complex, but if I undersubscribe with, for example, a chunk size of 1, do wait operations also cause the parallel FOR loop to begin another outstanding iteration? If so, one solution would be to insert waits of zero duration within each iteration to get all iterations started, and then use additional waits to switch back and forth between iterations so that they all progress.
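In a plain worker-pool model the answer would be no, as this Go sketch of such a model shows (whether LabVIEW's parallel FOR loop actually behaves like this is exactly what I'm asking); a sleep inside an iteration merely pauses that worker rather than handing its slot to an outstanding iteration:

// Go analogy (not LabVIEW internals): undersubscribed workers pulling
// iterations one at a time (chunk size 1). Names are hypothetical.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const iterations = 8
	const workers = 2 // undersubscribed: fewer workers than iterations

	queue := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := range queue {
				fmt.Printf("worker %d starts iteration %d\n", w, i)
				time.Sleep(100 * time.Millisecond) // the "wait" inside the iteration
				fmt.Printf("worker %d finishes iteration %d\n", w, i)
			}
		}(w)
	}
	for i := 0; i < iterations; i++ {
		queue <- i
	}
	close(queue)
	wg.Wait()
	// At most 2 iterations are ever in flight, sleeps included: waits do
	// not raise the in-flight count above the worker count.
}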

 

Since my original post (and I may be wrong), I have inferred that parallel asynchronous calls may behave similarly to parallel FOR loops: "Asynchronously Calling VIs - NI" states that "For each VI reference, LabVIEW creates one data space in the asynchronous call pool for each CPU core on the target computer", which to me implies that each parallel call runs on a different logical processor, so the thread count at any given moment may be limited by the number of logical processors.

 

There seems to be a related post, "how can i dynanic create threads in labview - NI Community", where one reply references "Solved: Community Nugget 1/29/2007 - NI Community", which suggests that method (3) above may be the best solution, but I can't be sure.

 

Message 3 of 17

Happy to be proven wrong, but apparently what I was hoping for is not possible (at least in LabVIEW).

 

I developed some basic code to trial different methods of dynamically launching threads, and for every method I tried that ran in parallel, the evidence indicated that each thread ran on a different logical processor. Unrelated to my question, but useful information for me going forward: the parallel FOR loop had much better performance than starting multiple asynchronous calls.

 

So, coming back to my original question: it seems there is no difference between parallel FOR loops and starting multiple asynchronous calls with respect to how threads and logical processors are used.

Message 4 of 17

Starting and stopping asynchronous tasks is not free, but they are the better option if the tasks are unrelated, like communicating with different instruments or IP addresses. If you have plenty of data to process, you'll probably gain from using a parallel loop instead. Sometimes you simply have to test and see.

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer
Message 5 of 17

@banksey255 wrote:

I could be wrong, but from what I've pieced together, my guess at the fundamental difference is this: For Loop parallelism runs each instance on a different CPU core, so if fewer CPU cores are used than there are iterations, some iterations can't start until others end. An asynchronous call, by contrast, uses multithreading to progress every iteration at the same time (to the degree that CPU scheduling allows), so while it can use multiple CPU cores, it isn't dependent on them to execute iterations concurrently. Is this correct?


LabVIEW code executes in parallel even on 1 CPU.

 

LabVIEW has its own task scheduling and runs things in parallel all by itself. Unless you explicitly force it to use a single thread (e.g., in a timed loop), it will simply distribute its load over the available processor power. It doesn't matter at all whether it's a single VI, VIs running in a parallel FOR loop, or dynamically started VIs.
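An analogy in another language (Go here; the principle, not the code, is the point): pinned to a single logical processor, independent tasks still make interleaved progress, because the runtime schedules them itself:

// Go analogy (not LabVIEW): concurrency without more than one CPU.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // allow only one CPU, like a single-core machine

	var wg sync.WaitGroup
	for _, name := range []string{"A", "B", "C"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < 3; i++ {
				fmt.Println("task", name, "step", i)
				time.Sleep(time.Millisecond) // yield; another task runs meanwhile
			}
		}(name)
	}
	wg.Wait() // output from A, B and C arrives interleaved on one core
}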

 

I personally avoid both Call and Forget and Call and Collect. They seem to disagree with my habit of aborting VIs during development; I end up with crashes and/or forced "end process". I've given them second chances a few times over the years; so far I've always regretted it (and refactored).

 

I use parallel loops to start class methods, and that works well for me*. This only solves the "collect" part if you can wait for all the methods to finish, or if you don't care. To collect data from the methods before they all end, you need a mechanism such as user events, queues, or VI Server...

 

*  I hardly use them for performance, though. For me it's just convenient (easier to understand, easier testing, no crashes, less effort to set up).

 

Note that the maximum number of parallel executions is (was?) limited by default. IIRC, it used to be 16.

 

Set ParallelLoop.MaxNumLoopInstances in LabVIEW.ini and in your executable's ini file. E.g.:


ParallelLoop.MaxNumLoopInstances=10000

Message 6 of 17

Thanks Yamaeda and wiebe@CARYA. It seems I'm not alone in suspecting that parallel FOR loops may generally be the better choice compared to asynchronous calls (although I suppose there is code for which asynchronous calls are simpler to implement, but, as you mentioned, wiebe@CARYA, code can probably always be refactored for either).

 

One performance-critical aspect my testing highlighted is the use of wait operations. While explicit waits were not required for more threads than logical processors to execute simultaneously, I did find that, for threads to get an approximately equal share of processor time and to maximise performance, I had to insert explicit waits, with the optimal wait time depending on the code being executed. From this I infer (and am not surprised) that there must be causes other than explicit waits for a thread to lose control of a processor, but perhaps explicit waits should simply be understood as the main (only?) mechanism by which we can influence this.

 

Whether waits need to be added when interfacing with hardware seems to require case-by-case assessment. For example, I think I read somewhere that VISA functions that can time out (e.g., VISA Read) effectively act as wait operations, so I presume the general takeaway is that a wait is anything that puts a thread to sleep until an interrupt wakes it. Unfortunately, it is typically not clear how a driver waits for a hardware event (e.g., I've seen drivers that just create a thread and then poll, keeping the processor relatively busy).
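To spell out the distinction I mean, here is a hypothetical Go sketch (not a real driver, and not how VISA is actually implemented; that part is my assumption): one read blocks until data or a timeout arrives and sleeps in the meantime, the other polls in a tight loop and keeps a core busy:

// Go analogy: blocking wait with timeout vs. busy polling.
package main

import (
	"fmt"
	"time"
)

// blockingRead sleeps until data arrives or the timeout elapses,
// consuming (almost) no CPU while waiting.
func blockingRead(data <-chan byte, timeout time.Duration) (byte, bool) {
	select {
	case b := <-data:
		return b, true
	case <-time.After(timeout):
		return 0, false
	}
}

// pollingRead checks repeatedly in a tight loop, keeping the processor
// relatively busy the whole time, like the drivers I've seen.
func pollingRead(data <-chan byte, timeout time.Duration) (byte, bool) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		select {
		case b := <-data:
			return b, true
		default: // nothing yet; spin and check again
		}
	}
	return 0, false
}

func main() {
	data := make(chan byte, 1)
	go func() { time.Sleep(20 * time.Millisecond); data <- 0x42 }()
	if b, ok := blockingRead(data, 100*time.Millisecond); ok {
		fmt.Printf("got 0x%02X without burning CPU\n", b)
	}
	_, _ = pollingRead(data, time.Millisecond) // burns CPU until deadline
}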

 

The timed loop is interesting: it executes on a single logical processor, which can be specified. Indeed, it might form part of a solution for multithreading on a constrained set of logical processors, which I'll explore. My idea of undersubscribing a parallel FOR loop with waits inside the iterations did not result in outstanding iterations being executed (that isn't documented behaviour, but I thought I'd try), in contrast to oversubscribing.

Message 7 of 17

Yes, a Wait is a Sleep for the thread. Sometimes it's good to include a Wait(0) in a loop to keep the UI more responsive for this reason, but it's pretty rare that it's needed.

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer
Message 8 of 17

@Yamaeda wrote:

Yes, a Wait is a Sleep for the thread. Sometimes it's good to include a Wait(0) in a loop to keep the UI more responsive for this reason, but it's pretty rare that it's needed.


Here is a decade+ old demonstration that clearly shows the effect of a "0ms wait" vs "no wait".

 

(For tight loops, constantly switching threads is of course detrimental to overall performance.)
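A tiny benchmark sketch of that cost, in Go rather than LabVIEW (runtime.Gosched() standing in for a 0 ms wait; the exact numbers will vary by machine):

// Go analogy: the throughput cost of yielding on every iteration.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func spin(n int, yield bool) time.Duration {
	start := time.Now()
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
		if yield {
			runtime.Gosched() // give other work a chance, like a 0 ms wait
		}
	}
	_ = sum
	return time.Since(start)
}

func main() {
	const n = 1_000_000
	fmt.Println("no yield:   ", spin(n, false))
	fmt.Println("yield each: ", spin(n, true)) // noticeably slower
}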

Message 9 of 17

@altenbach wrote:

@Yamaeda wrote:

Yes, a Wait is a Sleep for the thread. Sometimes it's good to include a Wait(0) in a loop to keep the UI more responsive for this reason, but it's pretty rare that it's needed.


Here is a decade+ old demonstration that clearly shows the effect of a "0ms wait" vs "no wait".

 

(For tight loops, constantly switching threads is of course detrimental to overall performance.)


The OP did not mention the LV version in use... just to remind people, there was a parallelism bug introduced in LV2019 that went unnoticed and was fixed in LV2023. Basically, from LV2019 to LV2023, a zero wait was required to get parallelism to work as intended.

https://forums.ni.com/t5/LabVIEW/Question-about-Implicit-Multithreading/m-p/4335568#M1270751

---------------------------------------------
Former Certified LabVIEW Developer (CLD)
Message 10 of 17