12-02-2025 08:35 AM
@altenbach wrote:
the effect of a "0ms wait" vs "no wait".
Yes, it’s quite interesting how a 0 ms wait works internally. I built a simple app and ran it under a debugger. When the value is set to 0, the code effectively skips the wait, but it behaves similarly to Sleep(0), where a zero value causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run:
By the way, both Wait (ms) and Wait Until Next ms Multiple are nearly the same, except for the additional computation required to align Wait Until Next ms Multiple with the millisecond timer (quotient and remainder). If you want to save a few assembly instructions, it seems better to use Wait (ms), because in that case the timer value is not obtained at all — the jump over the wait occurs before the lvrt.Millisecs call (as long as you leave the output of Wait (ms) unconnected; otherwise, the call will be added).
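To make the "quotient and remainder" part concrete, here is a minimal C sketch of the alignment arithmetic (my reconstruction of the idea only; GetTickCount() stands in for lvrt.Millisecs, and this is not LabVIEW's actual code):

#include <windows.h>

// Sleep until the tick counter reaches the next multiple of "multiple" ms
static void wait_until_next_ms_multiple(DWORD multiple)
{
    if (multiple == 0) return;         // mirror the jump over the wait described above
    DWORD now = GetTickCount();        // stand-in for lvrt.Millisecs
    DWORD remainder = now % multiple;  // the quotient/remainder arithmetic
    Sleep(multiple - remainder);       // wait out the rest of the interval
}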
"Parallel" Execution
wiebe@CARYA wrote:
LabVIEW code is executes in parallel even 1 CPU.
This statement is partially correct. Not all LabVIEW code executes in parallel. In some cases, it may appear to run in parallel, but technically it executes sequentially. Let me explain.
Here’s a classic example that can lead to a well-known race condition:
You might think that Add and Multiply run in parallel — but they don’t. They execute sequentially. However, you cannot predict which one will execute first without actually running the code. This results in Undefined Behavior (UB).
From a technical perspective, it makes no sense to execute such a small piece of code in two separate threads because of the overhead involved in thread creation, synchronization, and completion. LabVIEW 2.0 introduced the clumper. The clumping algorithm identifies parallelism in the LabVIEW diagram and groups nodes into “clumps,” which can run in parallel. Refer to NI LabVIEW Compiler: Under the Hood. In this example, both Add and Multiply belong to the same clump.
Let’s check this under the debugger — here are Multiply and Add (machine code below, generated by LabVIEW 2025 Q3 64-bit v25.3.2f2):
As you can see — this is pure sequential execution, no threads involved. The race condition here is caused by Undefined Behavior (UB), not by parallelism. In general, this code is deterministic as long as you don’t modify it, but any recompilation can change the execution order.
It also seems (this is only my guess!) that after a diagram cleanup (Ctrl+U), LabVIEW will place the code that executes first at the top (in our case, Multiply above Add), as shown on the block diagram.
Side note: we can observe that only two copies of the Numeric are created — not four — and the two constants 2 are folded into a single one, stored at the same address. This is how the optimizer works. By the way, it makes absolutely no difference whether you place the 2 inside the loop or outside — the generated code remains exactly the same.
In contrast to the example above, two While loops will generate two clumps, meaning we have two separate threads:
You can verify this using Process Explorer — you’ll see about 10% CPU load, with two threads, each consuming roughly 5%.
wiebe@CARYA wrote:
LabVIEW has it's own task scheduling and runs things in parallel all by itself. Unless you explicitly force it to use a single thread (e.g. in a timed loop), it will simply distribute it's load over available processor power.
When this code runs on a multi-core CPU, you’ll see a “forest” of peaks in the CPU usage graph, but you won’t be able to identify a dedicated CPU core where it executes; CPU usage is more or less uniform across the cores:
We might think LabVIEW distributes this load — but that’s not the case. This behavior is controlled by the operating system, not LabVIEW (technically, LabVIEW could change thread affinity itself, but I don't believe it does). Both Windows and Linux use a preemptive, priority-based, time-sliced scheduling model for threads, and the OS scheduler moves threads between cores (unless CPU affinity is explicitly set, as with Timed Loops).
On Windows, this “ping-pong” switching typically happens every 10–20 milliseconds; on Windows Server or Linux, the interval may differ. This is not LabVIEW-specific. For example, if you create a simple loop in C that fully consumes one CPU core, you’ll observe the same behavior:
int main(void)
{
    while (1);   /* spin forever, fully loading one core */
}
The CPU core where the code executes is easy to obtain using the GetCurrentProcessorNumber() WinAPI function. Even for a single UI thread, the CPU core often changes:
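For the curious, here is a minimal C sketch of that observation (assumption: Windows; the short Sleep simply gives the scheduler an opportunity to migrate the thread):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    for (int i = 0; i < 20; i++) {
        // Ask the OS which logical core is executing us right now
        printf("Running on core %lu\n", GetCurrentProcessorNumber());
        Sleep(50);   // give the scheduler a chance to move the thread
    }
    return 0;
}

Run it a few times and you will usually see the core number hop around.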
Thread switching is not free, of course, but the application (and its threads) does not need to explicitly manage or even be aware of when the scheduler switches execution from one core to another. The operating system automatically handles saving and restoring the CPU context (registers, flags, instruction pointer), so the thread resumes exactly where it left off without any special code from the application.
For high-load applications, it might be useful to "pin" threads to dedicated cores (set CPU affinity with the help of SetThreadAffinityMask()), but in most cases this is not necessary.
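For reference, pinning boils down to a single WinAPI call (the SHA-256 DLL shown later in this thread uses exactly this); a minimal sketch:

#include <windows.h>

// Pin the calling thread to one zero-based logical core
void pin_to_core(int core)
{
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << core);
}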
wiebe@CARYA wrote:
Note that the maximum number of parallel executions is (was?) limited by default. IIRC, it used to be 16.
Yes, this is true: the number of threads is limited. Previously, it was 24 threads (but not fewer than the number of logical cores). Refer to the KB article How Many Threads Does LabVIEW Allocate?
But in LabVIEW 2025 Q3 64-bit v25.3.2f2, this seems to have increased to 30 threads on my 20-core CPU. It’s quite simple to verify: just call Sleep(1000) from kernel32.dll and measure the execution time. This is for 30 parallel instances (each DLL call requires its own thread):
Adding one more iteration increases the time to 2 seconds:
From 31 to 60 instances, the time will be 2 seconds. Starting from the 61st instance, the time increases to 3 seconds. This happens because only 30 threads are reserved for execution. (Currently, I don’t have a CPU with fewer cores — maybe on an 8-core CPU, this limit will revert to the default of 24 threads instead of 30.)
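The arithmetic is easy to reproduce outside LabVIEW. Here is a C sketch that simulates a fixed pool of 30 worker slots with a semaphore (an analogy only; LabVIEW's pool consists of real pre-allocated OS threads, not a semaphore): 31 blocking one-second calls through 30 slots take about 2 seconds.

#include <stdio.h>
#include <windows.h>

static HANDLE pool;                          /* counts free "pool slots" */

static DWORD WINAPI worker(LPVOID arg)
{
    WaitForSingleObject(pool, INFINITE);     /* wait for a free slot */
    Sleep(1000);                             /* the blocking "DLL call" */
    ReleaseSemaphore(pool, 1, NULL);
    return 0;
}

int main(void)
{
    enum { N = 31, T = 30 };                 /* 31 calls, 30 slots */
    HANDLE threads[N];
    pool = CreateSemaphore(NULL, T, T, NULL);
    DWORD t0 = GetTickCount();
    for (int i = 0; i < N; i++)
        threads[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(N, threads, TRUE, INFINITE);
    printf("%lu ms\n", GetTickCount() - t0); /* ~2000 ms: ceil(31/30) seconds */
    return 0;
}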
wiebe@CARYA wrote:
Set ParallelLoop.MaxNumLoopInstances in LabVIEW.ini and in your executable's ini file. E.g.:
ParallelLoop.MaxNumLoopInstances=10000
Well, it's possible (though I will set it to 1000 instead of 10000). With only 30 threads available, this code will obviously take about 1000/30 ≈ 33.3 seconds: the first 990 iterations take 33 seconds, and the remaining 10 take one second more, therefore 34 seconds total:
If you really need to run such a crazy number of threads in parallel, there are two additional configuration keys for LabVIEW.ini:
ESys.StdNParallel=-1
ESys.Normal=1000
(These can be adjusted using threadconfig.vi located in <LabVIEW>\vi.lib\Utility\sysinfo.llb, but I prefer to edit them manually.)
The first key disables the default thread limit. The second key defines 1000 threads for the Standard Execution System at Normal Priority. (Note: This setting applies per Execution System AND Priority.)
Refer to Configuring the Number of Execution Threads Assigned to a LabVIEW Execution System.
Now, all 1000 iterations complete within one second (+ overhead):
And yes — 1000+ threads in Task Manager:
Of course, this doesn’t mean your application will run 1000× faster. This trick is rarely needed and, when used inappropriately, can actually slow down performance. Finally, there’s also the Chunk(C) terminal in the Parallel For Loop, but explaining that would make this comment even longer…
Just one more thing: for everyone who wants to dive deeper into these topics, there are plenty of excellent books available — especially for Windows: "Windows Internals, Part 1: System Architecture, Processes, Threads…" by Mark Russinovich, David Solomon, and Alex Ionescu; "Concurrent Programming on Windows" by Joe Duffy; and of course, the classic: "Modern Operating Systems" by Andrew S. Tanenbaum.
12-03-2025 04:58 AM
Reading a book by Tannenbaum seems fitting this time of year. 🙂
12-03-2025 02:42 PM
Thanks all. I don't have access to a PC with LabVIEW this week, so I haven't been able to continue exploring or to post any code.
For reference, I'm using LabVIEW 2019. It appears the bug mentioned doesn't apply, because I am not doing implicit multithreading (i.e., I'm using a parallel FOR loop).
My original code had each thread process a sufficient volume of data to take about 1 second to complete.
Later, I switched to much simpler code that ran for a set time, because it was easier to infer whether threads were all starting and ending at the same time. That simpler code for each thread was a while loop incrementing a shift register with a wait operation, iterating until the millisecond timer value changed by a set amount. What I found interesting was:
(1) there was no difference between having no wait operation and a 0 ms wait, insofar as the overall execution time was approximately equal to that of a single thread, and the final shift register value varied significantly between threads and on average was relatively small;
(2) to get a uniform final shift register value for all threads (implying equal CPU time) required a wait operation greater than 0 ms;
(3) to maximise the final shift register value required an even greater wait operation, beyond which it would start to decrease.
It seems bizarre that increasing the duration of the wait operation in each while loop iteration increased the average number of while loop iterations.
12-04-2025 08:47 AM
@banksey255 wrote:
(2) to get a uniform final shift register value for all threads implying equal cpu time required a wait operation greater than 0ms;
Well, in general, if you observe a performance boost after adding a Wait (ms), then something could be wrong with the thread scheduling; maybe too many threads were started. Ideally, you should not have any waits in high-performance loops. Possible reasons for this behavior are variable execution time across iterations, or touching the UI thread (in which case Wait (ms) might help), or something else...
Anyway, let’s complete this topic by exploring iteration order and chunking.
To get an understanding of the execution order in a Parallel For Loop, we can use a simple technique with a Queue:
For 10 instances and 10 iterations, all 10 threads start at the same time (the total execution time will be approximately 1 second); so far it's obvious:
+0,0 s Thread 0 started
+0,0 s Thread 2 started
+0,0 s Thread 3 started
+0,0 s Thread 4 started
+0,0 s Thread 5 started
+0,0 s Thread 1 started
+0,0 s Thread 6 started
+0,0 s Thread 7 started
+0,0 s Thread 8 started
+0,0 s Thread 9 started
The order is not sequential because all threads start almost simultaneously; some may begin slightly earlier or later due to overhead.
What happens if we limit the number of parallel instances to only 2?
Result:
+0,0 s Thread 0 started
+0,0 s Thread 2 started
// no more threads available, wait for completion
+1,0 s Thread 1 started
+1,0 s Thread 3 started
//next two finished
+2,0 s Thread 4 started
+2,0 s Thread 5 started
//and so on
+3,0 s Thread 6 started
+3,0 s Thread 7 started
+4,0 s Thread 8 started
+4,0 s Thread 9 started
Now the loop starts in an interleaved order: first 0 and 2, then 1 and 3, followed by 4 and 5, then 6 and 7, and finally 8 and 9. This is how automatic partitioning works.
In more complex scenarios — such as 100 iterations with 5 parallel instances — the execution order becomes even more complicated:
+0,0 s Thread 10 started
+0,0 s Thread 0 started
+0,0 s Thread 19 started
+0,0 s Thread 27 started
+0,0 s Thread 34 started
+1,0 s Thread 11 started
+1,0 s Thread 20 started
+1,0 s Thread 28 started
+1,0 s Thread 35 started
+1,0 s Thread 1 started
+2,0 s Thread 12 started
+2,0 s Thread 36 started
+2,0 s Thread 21 started
+2,0 s Thread 29 started
+2,0 s Thread 2 started
So, we started with 0, 10, 19, 27, and 34; then the next iteration processed 1, 11, 20, 28, and 35; then 2, 12, 21, 29, 36, and so on. A clear pattern is recognizable — the step size decreases each time: 1, 11 (+10), 20 (+9), 28 (+8), and 35 (+7).
In general, this will not significantly affect performance as long as all iterations require roughly the same execution time. However, the important takeaway is that execution order is influenced by both the number of iterations and the number of parallel instances. You can experiment with different combinations to observe the effect.
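For what it's worth, the observed chunk starts (0, 10, 19, 27, 34) match a classic "guided" schedule, where each chunk is the remaining work divided by twice the instance count. Here is a minimal C sketch of that rule (this is only my guess based on the log above, not NI's documented algorithm):

#include <stdio.h>

int main(void)
{
    int N = 100, P = 5;                    /* iterations, parallel instances */
    int start = 0, remaining = N;
    while (remaining > 0) {
        int chunk = remaining / (2 * P);   /* chunks shrink as work runs out */
        if (chunk < 1) chunk = 1;          /* never hand out an empty chunk */
        printf("chunk starts at %3d, size %2d\n", start, chunk);
        start += chunk;
        remaining -= chunk;
    }
    return 0;
}

The first five chunks it prints start at exactly 0, 10, 19, 27, and 34, as seen above.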
The situation changes if processing times vary. Let’s make the code a little more complex: now one iteration will take 1000 ms as before, but the next one will take 1500 ms, then 1000 ms again, and so on:
Here we have 20 iterations running across 10 threads, but take a look at the execution order and timing:
+0,0 s Thread 0 - Begin (1000 ms)
+0,0 s Thread 1 - Begin (1500 ms)
+0,0 s Thread 3 - Begin (1500 ms)
+0,0 s Thread 2 - Begin (1000 ms)
+0,0 s Thread 4 - Begin (1000 ms)
+0,0 s Thread 6 - Begin (1000 ms)
+0,0 s Thread 5 - Begin (1500 ms)
+0,0 s Thread 7 - Begin (1500 ms)
+0,0 s Thread 8 - Begin (1000 ms)
+0,0 s Thread 9 - Begin (1500 ms)
//Threads 0, 2, 4, 6, 8 are finished here; 1, 3, 5, 7 and 9 still running
+1,0 s Thread 10 - Begin (1000 ms)
+1,0 s Thread 11 - Begin (1500 ms)
+1,0 s Thread 14 - Begin (1000 ms)
+1,0 s Thread 13 - Begin (1500 ms)
+1,0 s Thread 12 - Begin (1000 ms)
//now threads 1, 3, 5, 7, 9 have finished as well
+1,5 s Thread 15 - Begin (1500 ms)
+1,5 s Thread 16 - Begin (1000 ms)
+1,5 s Thread 17 - Begin (1500 ms)
+1,5 s Thread 18 - Begin (1000 ms)
+1,5 s Thread 19 - Begin (1500 ms)
The first 10 iterations (0…9) are "mixed": half of them take 1000 ms, and the other half take 1500 ms. After one second, five workers become “free,” so at the +1.0 s time stamp the next five iterations (10, 11, 12, 13, and 14) start (without any gap!). At the end, the last five iterations start with mixed durations. As a result, some iterations finish earlier, while others (15, 17, and 19) require an additional half second. In total, this code takes about 3 seconds.
Here’s how the Chunk Terminal can improve this. Technically, this setting defines the step between iterations. A value of 0 or 1 results in sequential order, but 2 creates an interleaved pattern — exactly what we need:
Now the order changes:
+0,0 s Thread 2 - Begin (1000 ms)
+0,0 s Thread 0 - Begin (1000 ms)
+0,0 s Thread 6 - Begin (1000 ms)
+0,0 s Thread 4 - Begin (1000 ms)
+0,0 s Thread 8 - Begin (1000 ms)
+0,0 s Thread 12 - Begin (1000 ms)
+0,0 s Thread 10 - Begin (1000 ms)
+0,0 s Thread 14 - Begin (1000 ms)
+0,0 s Thread 16 - Begin (1000 ms)
+0,0 s Thread 18 - Begin (1000 ms)
+1,0 s Thread 3 - Begin (1500 ms)
+1,0 s Thread 1 - Begin (1500 ms)
+1,0 s Thread 7 - Begin (1500 ms)
+1,0 s Thread 5 - Begin (1500 ms)
+1,0 s Thread 9 - Begin (1500 ms)
+1,0 s Thread 15 - Begin (1500 ms)
+1,0 s Thread 17 - Begin (1500 ms)
+1,0 s Thread 19 - Begin (1500 ms)
+1,0 s Thread 11 - Begin (1500 ms)
+1,0 s Thread 13 - Begin (1500 ms)
First, the even elements (0, 2, 4, 6…) are processed, then the odd elements (1, 3, 5…). As a result, this code takes about 2.5 seconds — half a second less than before. This gives a good uniform load: at the beginning, 10 instances start with the “1000 ms” elements; they finish at the same time, and immediately the next portion with 1500 ms starts. All threads remain completely and uniformly busy from start to finish.
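To make the reordering concrete, here is a tiny C sketch that reproduces this schedule for a scalar step s (my reading of the observed behavior, not NI's actual code):

#include <stdio.h>

int main(void)
{
    int N = 20, s = 2;              /* iterations and chunk step */
    for (int start = 0; start < s; start++)
        for (int i = start; i < N; i += s)
            printf("%d ", i);       /* evens first, then odds, for s = 2 */
    printf("\n");
    return 0;
}

For s = 2 it prints 0 2 4 … 18 followed by 1 3 5 … 19, matching the log above.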
Note that not only a scalar can be connected to the C terminal, but also an array. Each element defines the step to the next element, and the last element will be used for subsequent iterations, for example:
Result:
+0,0 s Thread 0 - Begin (1000 ms)
+0,0 s Thread 2 (+2) - Begin (1000 ms)
+0,0 s Thread 5 (+3) - Begin (1500 ms)
+0,0 s Thread 9 (+4) - Begin (1500 ms)
+0,0 s Thread 13 - Begin (1500 ms)
+0,0 s Thread 17 - Begin (1500 ms)
+1,0 s Thread 1 - Begin (1500 ms)
+1,0 s Thread 3 - Begin (1500 ms)
+1,5 s Thread 6 - Begin (1000 ms)
+1,5 s Thread 10 - Begin (1000 ms)
+1,5 s Thread 14 - Begin (1000 ms)
+1,5 s Thread 18 - Begin (1000 ms)
+2,5 s Thread 4 - Begin (1000 ms)
...
For [2, 3, 4], you can see how it works now: 0 → 2 (+2) → 5 (+3) → 9 (+4) → 13 (+4) → 17 (+4).
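The first pass is easy to reproduce in a small C sketch of my understanding (the last array element keeps being reused once the array is exhausted; the wrap-around behavior for the remaining indices is more involved and is omitted here):

#include <stdio.h>

int main(void)
{
    int steps[] = {2, 3, 4};        /* the array wired to the C terminal */
    int n_steps = 3, N = 20;        /* number of steps, loop iterations */
    int i = 0, k = 0;
    while (i < N) {
        printf("%d ", i);
        i += steps[k];              /* advance by the current step */
        if (k < n_steps - 1) k++;   /* then keep reusing the last one */
    }
    printf("\n");                   /* prints: 0 2 5 9 13 17 */
    return 0;
}

The pass stops after 17 because the next index, 21, would run past the 20-element range.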
An important point is that inappropriate use can slow down your code. In our case, the next element should be 21 (17 + 4), but we only have 20 elements in total. As a result, the loop actually runs with only six parallel instances instead of ten, and the next batch (iterations 1 and 3) starts later. When used properly, this feature is truly a “Swiss Army Knife” — especially if you have prior knowledge about execution times or need to reorder execution in a specific pattern. However, it’s rarely needed in practice.
As a practical exercise: compute SHA-256 hashes for arrays of variable length (two arrays filled with 'a': one is 100 MB, the other is 150 MB). These arrays are processed in an interleaved manner:
This code takes around 5.3 seconds on my Xeon, but if the C terminal is connected and set to 2, the execution time decreases to 4.7 seconds:
If further improvement is needed, it’s better to wrap intensive computations in a DLL, as this produces more efficient machine code. And the single-threaded DLL approach can be combined with LabVIEW-based threading. Let’s continue with SHA-256: the current LabVIEW implementation is well optimized, but we can do even better. I’ll use this pure C implementation from Brad Conte.
The code is wrapped into a DLL suitable for LabVIEW (arrays passed as "Adapt to Type") with one important modification: we can define the CPU core on which this code will execute. Nothing special — just pure WinAPI:
#include "sha256.h"
#include "include/extcode.h"
#include <windows.h>
SHA256_API void sha256(TD1Hdl buffer, TD1Hdl digest, int CPU)
{
// Optionally set CPU affinity for the current thread
// (CPU is zero-based; a negative value leaves the affinity unchanged)
if (CPU >= 0) {
    HANDLE hThread = GetCurrentThread();
    DWORD_PTR affinityMask = (DWORD_PTR)1 << CPU;
    SetThreadAffinityMask(hThread, affinityMask);
}
// Resize the LabVIEW output array to hold the 32-byte digest
NumericArrayResize(uB, 1, (UHandle*)(&(digest)), SHA256_BLOCK_SIZE);
(*digest)->dimSize = SHA256_BLOCK_SIZE;
// Perform the SHA-256 calculation
SHA256_CTX ctx;
sha256_init(&ctx);
sha256_update(&ctx, &((*buffer)->elt[0]), (*buffer)->dimSize);
sha256_final(&ctx, &((*digest)->elt[0]));
}
I won’t bother you with a comparison of “with vs. without Chunk” (the ratio will be almost the same). Let’s start with this one (affinity is not used yet, therefore -1 is connected to the last parameter):
Now, the execution time for all 20 iterations looks like this:
So, we’ve improved performance from 4.70 seconds down to about 1.45 seconds, which is roughly a 3× speedup for the same amount of data (not bad for LabVIEW, by the way).
It’s important to also check CPU load while the code is running (hence the code is wrapped in a while loop):
CPU usage is around 50%, with all cores uniformly loaded.
Next, we can add some code to set CPU affinity — the first thread will run on the first physical core, the next on the second, and so on. On Windows, logical cores are paired due to Hyper-Threading (which is enabled here), so I’ll use only the even-numbered logical cores:
Here is the result (also note the much smaller time deviations from run to run, because we don't touch the Hyper-Threaded pairs):
We are now well below 1.3 seconds, and the CPU load looks much better:
By the way, these “49%” are not a real 49%, because we’re using only the physical cores. This leaves some performance unused on the Hyper-Threaded siblings — maybe 10–20%, but definitely not half. (If we increase the number of threads to 20, we will obviously not get a 50% boost.) So don’t trust the CPU percentage indicator on Hyper-Threaded CPUs — it’s misleading; it depends on how the logical cores are utilized. If you’d like to dive deeper into Hyper-Threading at the assembly level, here’s a small article for you.
Hopefully this helps build a better understanding.
12-04-2025 10:54 AM - edited 12-04-2025 11:00 AM
On a side note, it is sometimes interesting to look at the output of the parallel instance ID, available if so configured.
For example, I have a non-reentrant Fortran DLL that cannot be called in parallel, so I create a sufficient set of uniquely named copies in the temp folder, and each copy is then assigned to a parallel instance using the instance ID. (It even picks the master based on target bitness and works in 32- or 64-bit LabVIEW without any code changes.)
12-04-2025 11:34 AM
@altenbach wrote:
On a side note, it is sometimes interesting to look at the output of the parallel instance ID,
12-05-2025 04:04 AM
@Andrey_Dmitriev wrote:
You might think that Add and Multiply run in parallel — but they don’t. They execute sequentially. However, you cannot predict which one will execute first without actually running the code. This results in Undefined Behavior (UB).
The Add and Multiply are conceptually executed in parallel. That's why you shouldn't do that (unless you're demonstrating race conditions). But to execute the Add and Multiply in 2 threads would be very expensive.
A big (the only?) factor in LabVIEW's choice to run things in parallel (for real) is the "Is Asynchronous" property of objects: