For loop parallelism vs starting multiple asynchronous call and collect


@altenbach wrote:
the effect of a "0ms wait" vs "no wait".

Yes, it’s quite interesting how a 0 ms wait works internally. I built a simple app and ran it under a debugger. When the value is set to 0, the code effectively skips the wait, but it behaves similarly to Sleep(0), where a zero value causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run:

WAIT vs.png
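
If you want to feel this "0 ms" behavior in plain WinAPI terms, here is a minimal C sketch (my own illustration, not LabVIEW's code): Sleep(0) merely yields the remainder of the time slice and returns almost immediately, while Sleep(1) really blocks for at least one timer tick:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < 1000; i++) Sleep(0);  // yield only, no timed wait
    QueryPerformanceCounter(&t1);
    printf("1000 x Sleep(0): %.3f ms\n",
        1000.0 * (t1.QuadPart - t0.QuadPart) / f.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < 1000; i++) Sleep(1);  // real wait, at least one timer tick each
    QueryPerformanceCounter(&t1);
    printf("1000 x Sleep(1): %.3f ms\n",
        1000.0 * (t1.QuadPart - t0.QuadPart) / f.QuadPart);

    return 0;
}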

By the way, Wait (ms) and Wait Until Next ms Multiple are nearly the same, except for the additional computation required to align Wait Until Next ms Multiple with the millisecond timer (a quotient and remainder). If you want to save a few assembly instructions, it seems better to use Wait (ms), because in that case the timer value is not obtained at all — the jump over the wait occurs before the lvrt.Millisecs call (as long as you leave the output of Wait (ms) unconnected; otherwise, the call will be added).

 

"Parallel" Execution

 

wiebe@CARYA wrote:

LabVIEW code executes in parallel, even on 1 CPU.

This statement is partially correct. Not all LabVIEW code executes in parallel. In some cases, it may appear to run in parallel, but technically it executes sequentially. Let me explain.

Here’s a classic example that can lead to a well-known race condition:

snippet.png

You might think that Add and Multiply run in parallel — but they don’t. They execute sequentially. However, you cannot predict which one will execute first without actually running the code. This results in Undefined Behavior (UB).

From a technical perspective, it makes no sense to execute such a small piece of code in two separate threads because of the overhead involved in thread creation, synchronization, and completion. LabVIEW 2.0 introduced the clumper. The clumping algorithm identifies parallelism in the LabVIEW diagram and groups nodes into “clumps,” which can run in parallel. Refer to NI LabVIEW Compiler: Under the Hood. In this example, both Add and Multiply belong to the same clump.

Let’s check this under the debugger — here are Multiply and Add (machine code below generated by LabVIEW 2025 Q3 64-bit v25.3.2f2):

 

Sequential.png

 

As you can see — this is pure sequential execution, no threads involved. The race condition here is caused by Undefined Behavior (UB), not by parallelism. In general, this code is deterministic as long as you don’t modify it, but any recompilation can change the execution order.
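
To make the point tangible, here is my reconstruction of what this clump amounts to in C (hedged: I'm guessing the diagram is the usual local-variable race demo with one Numeric; this is illustrative, not LabVIEW's generated code):

int numeric = 5;   // the front-panel Numeric, accessed via local variables

void clump(void)
{
    // Both nodes execute back-to-back on ONE thread; only their order is a
    // compiler decision, and that order is exactly what changes the result.
    numeric = numeric * 2;   // Multiply was scheduled first here: (5 * 2) + 2 = 12
    numeric = numeric + 2;   // had Add come first, we would get (5 + 2) * 2 = 14
}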

 

It also seems (this is only my guess!) that after a diagram cleanup (Ctrl+U), LabVIEW will place the code that executes first at the top (in our case, Multiply above Add), as shown on the block diagram.

 

Side notes: We can observe that only two copies of the Numeric are created — not four — and the two constants 2 are folded into a single one, stored at the same address. This is how the optimizer works. By the way, it absolutely doesn’t matter whether you place 2 inside the loop or outside — the generated code will remain exactly the same.

 

In contrast to the example above, two While loops will generate two clumps, meaning we have two separate threads:

SnippetTwoLoops.png

You can verify this using Process Explorer — you’ll see about 10% CPU load, with two threads, each consuming roughly 5%.

image-20251202143917640.png
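
Conceptually, that diagram is equivalent to this plain WinAPI program (a sketch of the idea, not what LabVIEW actually emits): two free-running loops become two threads, each saturating one logical core, i.e. roughly 5% each on a machine with 20 logical cores:

#include <windows.h>

static DWORD WINAPI spin(LPVOID arg)
{
    (void)arg;
    for (;;) ;   // free-running loop, like a While loop without a wait
}

int main(void)
{
    HANDLE h[2];
    for (int i = 0; i < 2; i++)
        h[i] = CreateThread(NULL, 0, spin, NULL, 0, NULL);  // one thread per "loop"
    WaitForMultipleObjects(2, h, TRUE, INFINITE);           // runs until killed
    return 0;
}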

wiebe@CARYA wrote:

LabVIEW has its own task scheduling and runs things in parallel all by itself. Unless you explicitly force it to use a single thread (e.g. in a timed loop), it will simply distribute its load over available processor power.

When this code runs on a multi-core CPU, you’ll see a “forest” of peaks in the CPU usage graph, but you won’t be able to identify a dedicated CPU core where it executes; CPU usage is more or less uniform across the cores:

image-20251202131507781.png

We might think LabVIEW distributes this load — but that’s not the case. This behavior is controlled by the operating system, not LabVIEW (technically, LabVIEW could change thread affinity, but I don’t believe it does). Both Windows and Linux use a preemptive, priority-based, time-sliced scheduling model for threads, and the OS scheduler moves threads between cores (unless CPU affinity is explicitly set, as with Timed Loops).

On Windows, this “ping-pong” switching typically happens every 10–20 milliseconds; on Windows Server or Linux, the interval may differ. This is not LabVIEW-specific. For example, if you create a simple loop in C that fully consumes one CPU core, you’ll observe the same behavior:

#include <stdbool.h>

int main(void)
{
    while (true);   // spins forever, fully loading one CPU core
}

The CPU core where the code executes is easy to obtain using the GetCurrentProcessorNumber() WinAPI function. Even for a single UI thread, the CPU core often changes:

cpu.gif
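
If you want to reproduce this yourself, a minimal sketch is enough:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    for (;;) {
        // With no affinity set, this number typically changes over time,
        // because the OS scheduler is free to move the thread between cores.
        printf("running on logical CPU %lu\n", GetCurrentProcessorNumber());
        Sleep(1000);
    }
}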

Thread switching is not free, of course, but the application (and its threads) does not need to explicitly manage or even be aware of when the scheduler switches execution from one core to another. The operating system automatically handles saving and restoring the CPU context (registers, flags, instruction pointer), so the thread resumes exactly where it left off without any special code from the application.

For high-load applications, it might be useful to "pin" threads to dedicated cores (set CPU affinity with SetThreadAffinityMask()), but in most cases this is not necessary.
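
For completeness, a minimal pinning sketch (assuming, of course, that logical CPU 0 is part of the process affinity mask):

#include <windows.h>

int main(void)
{
    // Bit N of the mask selects logical CPU N; the call returns the previous
    // mask on success, or 0 on failure (e.g. an invalid mask for this system).
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 0);
    if (prev == 0) return 1;
    for (;;) ;   // this busy loop now stays on logical CPU 0
}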

 

Number of Available Threads

wiebe@CARYA wrote:

Note that the maximum number of parallel executions is (was?) limited by default. IIRC, it used to be 16.


Yes, this is true: the number of threads is limited. Previously, it was 24 threads (but not fewer than the number of logical cores). Refer to the KB article How Many Threads Does LabVIEW Allocate?

But in LabVIEW 2025 Q3 64-bit v25.3.2f2, this seems to have increased to 30 threads on my 20-core CPU. It’s quite simple to verify: just call Sleep(1000) from kernel32.dll and measure the execution time. Here it is for 30 parallel instances (each DLL call requires its own thread):

image-20251202135656019.png

Adding one more iteration increases the time to 2 seconds:

image-20251202135912949.png

From 31 to 60 instances, the time is 2 seconds. Starting from the 61st instance, the time increases to 3 seconds. This happens because only 30 threads are reserved for execution; in general, the total time is ceil(N/30) seconds for N instances. (Currently, I don’t have a CPU with fewer cores — maybe on an 8-core CPU, this limit will revert to the default 24 threads instead of 30.)

1000 Threads Experiment

 

wiebe@CARYA wrote:

Set ParallelLoop.MaxNumLoopInstances in LabVIEW.ini and in your executable's ini file. E.g.:


ParallelLoop.MaxNumLoopInstances=10000


Well, it's possible (though I will set it to 1000 instead of 10000). With only 30 threads available, this code will obviously take about 1000/30 = 33.3(3) seconds; more precisely, ceil(1000/30) = 34 batches of one second each: the first 990 iterations take 33 seconds, and the remaining 10 take one more, so 34 seconds in total:

image-20251202140318566.png

If you really need to run such a crazy number of threads in parallel, there are two additional configuration keys for LabVIEW.ini:

ESys.StdNParallel=-1
ESys.Normal=1000

(These can be adjusted using threadconfig.vi located in <LabVIEW>\vi.lib\Utility\sysinfo.llb, but I prefer to edit them manually.)

The first key disables the default thread limit. The second key defines 1000 threads for the Standard Execution System at Normal Priority. (Note: This setting applies per Execution System AND Priority.)

image-20251202145747198.png

Refer to Configuring the Number of Execution Threads Assigned to a LabVIEW Execution System.

Now, all 1000 iterations complete within one second (+ overhead):

image-20251202141959554.png

And yes —  1000+ threads in Task Manager:

image-20251202141826125.png

Of course, this doesn’t mean your application will run 1000× faster. This trick is rarely needed and, when used inappropriately, can actually slow down performance. Finally, there’s also the Chunk(C) terminal in the Parallel For Loop, but explaining that would make this comment even longer…

 

Just one more thing: for anyone who wants to dive deeper into these topics, there are plenty of excellent books available — especially for Windows: "Windows Internals, Part 1: System Architecture, Processes, Threads…" by Mark Russinovich, David Solomon, and Alex Ionescu; "Concurrent Programming on Windows" by Joe Duffy; and of course, the classic: "Modern Operating Systems" by Andrew S. Tanenbaum.

Message 11 of 17

Reading a book by Tannenbaum seems fitting this time of year. 🙂

Message 12 of 17

Thanks all. I don't have access to a PC with LabVIEW this week, so I haven't been able to continue exploring or post any code.

 

For reference, I'm using LabVIEW 2019. It appears the bug mentioned doesn't apply, because I am not doing implicit multithreading (I'm using a parallel FOR loop instead).

 

My original code had each thread process a sufficient volume of data to take about 1 second to complete.

 

Later, I switched to much simpler code that ran for a set time, because it was easier to infer whether the threads were all starting and ending at the same time. That simpler code for each thread was a while loop incrementing a shift register, with a wait operation, iterating until the millisecond timer value changed by a set amount. What I found interesting was:

(1) there was no difference between having no wait operation and a 0 ms wait, insofar as the overall execution time was approximately equal to that of a single thread, and the final shift register value varied significantly between threads and was, on average, relatively small;

(2) getting a uniform final shift register value for all threads (implying equal CPU time) required a wait operation greater than 0 ms;

(3) maximising the final shift register value required an even greater wait, beyond which it would start to decrease.

It seems bizarre that increasing the duration of the wait operation in each while loop iteration increased the average number of while loop iterations.

Message 13 of 17

@banksey255 wrote:

(2) getting a uniform final shift register value for all threads (implying equal CPU time) required a wait operation greater than 0 ms;


Well, in general, if you observe a performance boost by adding a Wait (ms), then something could be wrong with the thread scheduling (perhaps too many threads were started). Ideally, you should not have any waits in high-performance loops. Possible reasons for this behavior are variable execution time across iterations, touching the UI thread (in which case Wait (ms) might help), or something else...

 

Anyway, let’s complete this topic by exploring iteration order and chunking.

To get an understanding of the execution order in a Parallel For Loop, we can use a simple technique with a Queue:

Snippet01.png

For 10 instances and 10 iterations, all 10 threads started at the same time (total execution time will be approximately 1 second); so far it's obvious:

 +0,0 s Thread 0 started
 +0,0 s Thread 2 started
 +0,0 s Thread 3 started
 +0,0 s Thread 4 started
 +0,0 s Thread 5 started
 +0,0 s Thread 1 started
 +0,0 s Thread 6 started
 +0,0 s Thread 7 started
 +0,0 s Thread 8 started
 +0,0 s Thread 9 started

The order is not sequential because all threads start almost simultaneously; some may begin slightly earlier or later due to overhead.

What happens if we limit the number of parallel instances to only 2?

image-20251204105957787.png

Result:

 +0,0 s Thread 0 started
 +0,0 s Thread 2 started
 // no more threads available, wait for completion
 +1,0 s Thread 1 started
 +1,0 s Thread 3 started
 //next two finished
 +2,0 s Thread 4 started
 +2,0 s Thread 5 started
 //and so on
 +3,0 s Thread 6 started
 +3,0 s Thread 7 started
 
 +4,0 s Thread 8 started
 +4,0 s Thread 9 started

Now the loop starts in an interleaved order: first 0 and 2, then 1 and 3, then 4 and 5, and finally 6 & 7 and 8 & 9. This is how the automatic partitioning works.

In more complex scenarios — such as 100 iterations with 5 parallel instances — the execution order becomes even more complicated:

image-20251204110256035.png

 +0,0 s Thread 10 started
 +0,0 s Thread 0 started
 +0,0 s Thread 19 started
 +0,0 s Thread 27 started
 +0,0 s Thread 34 started
 
 +1,0 s Thread 11 started
 +1,0 s Thread 20 started
 +1,0 s Thread 28 started
 +1,0 s Thread 35 started
 +1,0 s Thread 1 started
 
 +2,0 s Thread 12 started
 +2,0 s Thread 36 started
 +2,0 s Thread 21 started
 +2,0 s Thread 29 started
 +2,0 s Thread 2 started

So, we started with 0, 10, 19, 27, and 34; then the next iteration processed 1, 11, 20, 28, and 35; then 2, 12, 21, 29, 36, and so on. A clear pattern is recognizable — the step size decreases each time: 1, 11 (+10), 20 (+9), 28 (+8), and 35 (+7).
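
Just to capture that pattern in code (my guess at the partitioning, deduced purely from the log above; NI doesn't document the exact algorithm), something like this reproduces the observed chunk starts 0, 10, 19, 27, 34:

#include <stdio.h>

int main(void)
{
    // Reproduces only the logged prefix; I haven't verified how the tail
    // of the iteration range is partitioned.
    int N = 100, start = 0, chunk = 10;  // first chunk observed to be 10 wide
    while (start < N && chunk > 1) {
        printf("chunk starts at %d, size %d\n", start, chunk);
        start += chunk;
        chunk--;  // each subsequent chunk is one element smaller
    }
    return 0;
}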

 

In general, this will not significantly affect performance as long as all iterations require roughly the same execution time. However, the important takeaway is that execution order is influenced by both the number of iterations and the number of parallel instances. You can experiment with different combinations to observe the effect.

 

The situation changes if processing times vary. Let’s make the code a little more complex: now one iteration will take 1000 ms as before, but the next one will take 1500 ms, then 1000 ms again, and so on:

Snippet02.png

Here we have 20 iterations running across 10 threads, but take a look at the execution order and timing:

 +0,0 s Thread 0 - Begin (1000 ms)
 +0,0 s Thread 1 - Begin (1500 ms)
 +0,0 s Thread 3 - Begin (1500 ms)
 +0,0 s Thread 2 - Begin (1000 ms)
 +0,0 s Thread 4 - Begin (1000 ms)
 +0,0 s Thread 6 - Begin (1000 ms)
 +0,0 s Thread 5 - Begin (1500 ms)
 +0,0 s Thread 7 - Begin (1500 ms)
 +0,0 s Thread 8 - Begin (1000 ms)
 +0,0 s Thread 9 - Begin (1500 ms)
 //Threads 0, 2, 4, 6, 8 are finished here; 1, 3, 5, 7 and 9 still running
 +1,0 s Thread 10 - Begin (1000 ms)
 +1,0 s Thread 11 - Begin (1500 ms)
 +1,0 s Thread 14 - Begin (1000 ms)
 +1,0 s Thread 13 - Begin (1500 ms)
 +1,0 s Thread 12 - Begin (1000 ms)
 //now threads 1, 3, 5, 7, 9 finished as well
 +1,5 s Thread 15 - Begin (1500 ms)
 +1,5 s Thread 16 - Begin (1000 ms)
 +1,5 s Thread 17 - Begin (1500 ms)
 +1,5 s Thread 18 - Begin (1000 ms)
 +1,5 s Thread 19 - Begin (1500 ms)

The first 10 iterations (0…9) are "mixed": half of them take 1000 ms, and the other half take 1500 ms. After one second, five workers become “free,” so at the +1.0 s time stamp the next five iterations (10, 11, 12, 13, and 14) start (without any gap!). At the end, the last five iterations start with mixed durations. As a result, some iterations finish earlier, while others (15, 17, and 19) require an additional half second. In total, this code takes about 3 seconds.

Here’s how the Chunk Terminal can improve this. Technically, this setting defines the step between iterations. A value of 0 or 1 results in sequential order, but 2 creates an interleaved pattern — exactly what we need:

image-20251204121446306.png

Now the order changes:

 +0,0 s Thread 2 - Begin (1000 ms)
 +0,0 s Thread 0 - Begin (1000 ms)
 +0,0 s Thread 6 - Begin (1000 ms)
 +0,0 s Thread 4 - Begin (1000 ms)
 +0,0 s Thread 8 - Begin (1000 ms)
 +0,0 s Thread 12 - Begin (1000 ms)
 +0,0 s Thread 10 - Begin (1000 ms)
 +0,0 s Thread 14 - Begin (1000 ms)
 +0,0 s Thread 16 - Begin (1000 ms)
 +0,0 s Thread 18 - Begin (1000 ms)
 
 +1,0 s Thread 3 - Begin (1500 ms)
 +1,0 s Thread 1 - Begin (1500 ms)
 +1,0 s Thread 7 - Begin (1500 ms)
 +1,0 s Thread 5 - Begin (1500 ms)
 +1,0 s Thread 9 - Begin (1500 ms)
 +1,0 s Thread 15 - Begin (1500 ms)
 +1,0 s Thread 17 - Begin (1500 ms)
 +1,0 s Thread 19 - Begin (1500 ms)
 +1,0 s Thread 11 - Begin (1500 ms)
 +1,0 s Thread 13 - Begin (1500 ms)

First, the even elements (0, 2, 4, 6…) are processed, then the odd elements (1, 3, 5…). As a result, this code takes about 2.5 seconds — half a second less than before. This gives a good uniform load: at the beginning, 10 instances start with the “1000 ms” elements; they all finish at the same time, and the next portion with 1500 ms starts immediately. All threads remain completely and uniformly busy from start to finish.
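
In index terms, my understanding of what C = 2 does can be written in a few lines of C: deal the indices out with a stride, evens first, then odds:

#include <stdio.h>

int main(void)
{
    int N = 20, step = 2;
    // offset 0 yields 0, 2, 4, ...; offset 1 then yields 1, 3, 5, ...
    for (int offset = 0; offset < step; offset++)
        for (int i = offset; i < N; i += step)
            printf("%d ", i);
    printf("\n");
    return 0;
}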

 

Note that not only a scalar can be connected to the C terminal, but also an array. Each element defines the step to the next element, and the last element is reused for all subsequent iterations. For example:

image-20251204122902154.png

Result:

 +0,0 s Thread 0 - Begin (1000 ms)
 +0,0 s Thread 2 (+2) - Begin (1000 ms)
 +0,0 s Thread 5 (+3) - Begin (1500 ms)
 +0,0 s Thread 9 (+4) - Begin (1500 ms)
 +0,0 s Thread 13 - Begin (1500 ms)
 +0,0 s Thread 17 - Begin (1500 ms)
 +1,0 s Thread 1 - Begin (1500 ms)
 +1,0 s Thread 3 - Begin (1500 ms)
 +1,5 s Thread 6 - Begin (1000 ms)
 +1,5 s Thread 10 - Begin (1000 ms)
 +1,5 s Thread 14 - Begin (1000 ms)
 +1,5 s Thread 18 - Begin (1000 ms)
 +2,5 s Thread 4 - Begin (1000 ms)
...

For [2, 3, 4], you can see how it works now: 0 → 2 (+2) → 5 (+3) → 9 (+4) → 13 (+4) → 17 (+4).
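
The same exercise for the array form (again, just my reading of the behavior, matching the log above): each element of the C array is the step to the next index, and the last element is repeated once the array is exhausted:

#include <stdio.h>

int main(void)
{
    int steps[] = { 2, 3, 4 };
    int nsteps = 3, N = 20;
    int i = 0, k = 0;
    while (i < N) {
        printf("%d ", i);  // prints: 0 2 5 9 13 17
        i += steps[k < nsteps ? k : nsteps - 1];
        if (k < nsteps) k++;
    }
    printf("\n");
    return 0;
}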

An important point is that inappropriate use can slow down your code. In our case, the next element should be 21 (17 + 4), but we only have 20 elements in total. As a result, the loop actually runs with only six parallel instances instead of ten, and the next batch (iterations 1 and 3) starts later. When used properly, this feature is truly a “Swiss Army Knife” — especially if you have prior knowledge of execution times or need to reorder execution in a specific pattern. However, it’s rarely needed in practice.

SHA-256 Experiment

As a practical exercise, let's compute SHA-256 hashes for arrays of two different lengths (both filled with 'a': one is 100 MB, the other 150 MB), processed in an interleaved manner:

Snippet03.png

This code takes around 5.3 seconds on my Xeon, but if the C terminal is connected and set to 2, the execution time decreases to 4.7 seconds:

image-20251204125009908.png

Advanced Topic

If further improvement is needed, it's better to wrap intensive computations in a DLL, as this produces more efficient machine code, and the single-threaded DLL approach can still be combined with LabVIEW-based threading. Let's continue with SHA-256: the current LabVIEW implementation is well optimized, but we can do even better. I'll use this pure C implementation from Brad Conte.

The code is wrapped into a DLL suitable for LabVIEW (arrays passed as "Adapt to Type"), with one important modification: we can define the CPU core on which this code will execute. Nothing special — just pure WinAPI:

#include "sha256.h"
#include "include/extcode.h"

#include <windows.h>

SHA256_API void sha256(TD1Hdl buffer, TD1Hdl digest, int CPU)
{
    // Optionally pin the current thread to the requested zero-based core.
    // The mask is computed only for CPU >= 0: a negative shift count is UB in C.
    if (CPU >= 0) {
        HANDLE hThread = GetCurrentThread();
        DWORD_PTR affinityMask = (DWORD_PTR)1 << CPU;
        SetThreadAffinityMask(hThread, affinityMask);
    }

    // Resize the LabVIEW-managed output handle to hold the 32-byte digest
    NumericArrayResize(uB, 1, (UHandle*)(&digest), SHA256_BLOCK_SIZE);
    (*digest)->dimSize = SHA256_BLOCK_SIZE;

    // Perform the SHA-256 calculation over the whole input buffer
    SHA256_CTX ctx;
    sha256_init(&ctx);
    sha256_update(&ctx, &((*buffer)->elt[0]), (*buffer)->dimSize);
    sha256_final(&ctx, &((*digest)->elt[0]));
}
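
For reference, here is the handle type the wrapper assumes. This is the kind of typedef LabVIEW generates for a 1-D U8 array passed as "Adapt to Type" (a sketch from memory; the real header is produced by the Call Library Function Node):

#include "lv_prolog.h"   /* sets LabVIEW's structure packing */
typedef struct {
    int32 dimSize;       /* number of elements */
    uInt8 elt[1];        /* first element of the U8 array */
} TD1, *TD1Ptr;
typedef TD1 **TD1Hdl;    /* LabVIEW arrays are passed as handles */
#include "lv_epilog.h"   /* restores the default packing */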

I won’t bother you with a “with vs. without Chunk” comparison (the ratio will be almost the same). Let’s start with this one (affinity is not used yet, therefore -1 is wired to the last parameter):

image-20251204143232488.png

Now, the execution time for each of the 20 iterations looks like this:

image-20251204143626534.png

So, we’ve improved performance from 4.70 seconds down to about 1.45 seconds, which is roughly a 3× speedup for the same amount of data (not bad for LabVIEW, by the way).

 

It’s important to also check CPU load while the code is running (hence the code is wrapped in a while loop):

image-20251204143838792.png

 

CPU usage is around 50%, with all cores uniformly loaded.

Next, we can add some code to set CPU affinity — the first thread will run on the first physical core, the next on the second, and so on. In Windows, logical cores are paired due to Hyper-Threading (which is enabled here), so I’ll use only the even-numbered cores:

image-20251204144046667.png

Here is the result (also note the much smaller time deviations from run to run, because we don’t touch the Hyper-Threaded pairs):

image-20251204144317177.png

We are now well below 1.3 seconds, and the CPU load looks much better:

image-20251204144432250.png

By the way, these “49%” are not a real 49%, because we’re using only the physical cores. This leaves some performance unused on the Hyper-Threaded siblings — maybe 10–20%, but definitely not half. (If we increase the number of threads to 20, we will obviously not get a 50% boost.) So, don’t trust the CPU percentage indicator on Hyper-Threaded CPUs — it’s misleading; it depends on how the logical cores are utilized. If you’d like to dive deeper into Hyper-Threading at the assembly level, here’s a small article for you.

Hopefully this helps with a better understanding.

Message 14 of 17

On a side note, it is sometimes interesting to look at the output of the parallel instance ID, available if so configured.

 

altenbach_0-1764866471026.png

 

For example, I have a non-reentrant Fortran DLL that cannot be called in parallel, so I create a sufficient set of uniquely named copies in the temp folder, and each copy is then assigned to a parallel instance using the instance ID. (It even picks the master based on target bitness and works in 32- or 64-bit LabVIEW without any code changes.)

 

 

altenbach_1-1764867063494.png

 

 

Message 15 of 17

@altenbach wrote:

On a side note, it is sometimes interesting to look at the output of the parallel instance ID,

 


That’s a very good point. The last example can then be simplified:
Screenshot 2025-12-04 18.33.35.png
Message 16 of 17

@Andrey_Dmitriev wrote:

[…]

You might think that Add and Multiply run in parallel — but they don’t. They execute sequentially. However, you cannot predict which one will execute first without actually running the code.

From a technical perspective, it makes no sense to execute such a small piece of code in two separate threads because of the overhead involved in thread creation, synchronization, and completion.


The Add and Multiply are conceptually executed in parallel. That's why you shouldn't do that (unless you're demonstrating race conditions). But to execute the Add and Multiply in 2 threads would be very expensive.

 

A big (the only?) factor in LabVIEW's choice to run things in parallel (for real) is the "Is Asynchronous" property of objects:

wiebeCARYA_0-1764929011552.png

Message 17 of 17