02-03-2025 04:54 AM - edited 02-03-2025 04:58 AM
Just to "complete" this parallelization topic — there is one more "fine-tuning" thing called Chunks.
By default, LabVIEW will partition iterations automatically.
To understand chunks, I recommend a very simple experiment: log the beginning and end of each iteration's execution, for example with the help of a Queue.
So, the test thread worker looks like this:
and the main loop looks like this (the sorting works because of the time offset logged at the beginning):
Now you will see the order of execution; the iterations are not started sequentially. This is how "auto partition" works (the first eight iterations start at the same time):
+0,00 s - Start Thread 00
+0,00 s - Start Thread 04
+0,00 s - Start Thread 07
+0,00 s - Start Thread 10
+0,00 s - Start Thread 13
+0,00 s - Start Thread 16
+0,00 s - Start Thread 19
+0,00 s - Start Thread 21
+0,10 s - Finish Thread 00
+0,10 s - Finish Thread 04
...
As you can see, the first step is 4, then it decreases to 3, and at the end to 2. If you change the number of iterations or the number of threads, this pattern will also change.
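By the way, if you want to play with the same experiment outside of LabVIEW, here is a rough OpenMP analogue in C. This is only a sketch under my assumptions: OpenMP's schedule(guided) also hands out shrinking chunks, but its partitioning rules are not identical to LabVIEW's auto-partitioner, so the exact start pattern will differ.

/* build e.g. with: gcc -fopenmp chunk_demo.c -o chunk_demo */
#include <stdio.h>
#include <omp.h>

#define N_ITERATIONS 64

static void busy_wait(double seconds)
{
    /* burn CPU instead of sleeping, so the core stays occupied */
    double t0 = omp_get_wtime();
    while (omp_get_wtime() - t0 < seconds)
        ;
}

int main(void)
{
    double t0 = omp_get_wtime();

    /* guided scheduling: chunk sizes shrink over time,
       loosely resembling LabVIEW's automatic partitioning */
    #pragma omp parallel for schedule(guided) num_threads(8)
    for (int i = 0; i < N_ITERATIONS; i++) {
        printf("+%.2f s - Start Thread %02d\n", omp_get_wtime() - t0, i);
        busy_wait(0.1);
        printf("+%.2f s - Finish Thread %02d\n", omp_get_wtime() - t0, i);
    }

    printf("Total: %.2f s\n", omp_get_wtime() - t0);
    return 0;
}

Replacing the schedule clause with schedule(dynamic, 1) or schedule(dynamic, 2) is, to my understanding, the closest analogue of wiring 1 or 2 to the chunk size terminal described below.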
Now, if you enable the Chunk Size Terminal:
and connect, let's say, "1" to this terminal:
then you will see which elements will be processed first:
+0,00 s - Start Thread 00
+0,00 s - Start Thread 01
+0,00 s - Start Thread 02
+0,00 s - Start Thread 03
+0,00 s - Start Thread 04
+0,00 s - Start Thread 05
+0,00 s - Start Thread 06
+0,00 s - Start Thread 07
+0,10 s - Finish Thread 00
+0,10 s - Finish Thread 01
If you connect "2" instead of "1", then every second element will be used:
+0,00 s - Start Thread 00
+0,00 s - Start Thread 02
+0,00 s - Start Thread 04
+0,00 s - Start Thread 06
+0,00 s - Start Thread 08
+0,00 s - Start Thread 10
+0,00 s - Start Thread 12
+0,00 s - Start Thread 14
+0,10 s - Finish Thread 00
...
if "3", then every third, and so on. So you can control the order of execution. In this case this will not affect overall execution time yet.
Note that you can also connect an array here instead of a scalar; then you can handle more complicated scenarios. For example, if I connect an array with 1, 2, 3, 4, 5, 6, 7:
then you will see the pattern: iterations 0 and 1 start first, then after a gap of one comes iteration 3, after a gap of two comes iteration 6, and so on:
+0,00 s - Start Thread 00
+0,00 s - Start Thread 01 (+1)
+0,00 s - Start Thread 03 (+2)
+0,00 s - Start Thread 06 (+3)
+0,00 s - Start Thread 10 (+4)
+0,00 s - Start Thread 15 (and so on)
+0,00 s - Start Thread 21
+0,00 s - Start Thread 28
+0,10 s - Finish Thread 00
...
Every element in the Chunks array is the size of one chunk; since the chunks are consecutive, it is also the distance between the starting indices of neighboring chunks.
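In other words, the chunk start indices are the running sum of the array (and a scalar chunk size c behaves like a constant array {c, c, c, ...}). A tiny sketch that reproduces the start indices from the log above:

#include <stdio.h>

int main(void)
{
    /* chunk sizes wired to the chunk size terminal */
    int chunks[] = {1, 2, 3, 4, 5, 6, 7};
    int n = sizeof chunks / sizeof chunks[0];

    /* each chunk starts where the previous one ended, so the
       start indices are a running sum: 0, 1, 3, 6, 10, 15, 21 */
    int start = 0;
    for (int i = 0; i < n; i++) {
        printf("chunk %d (size %d) starts at iteration %02d\n",
               i, chunks[i], start);
        start += chunks[i];
    }
    printf("next chunk starts at iteration %02d\n", start);  /* 28 */
    return 0;
}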
In most cases, auto-partitioning works just fine, as long as the execution time of each iteration is more or less the same. But in some special cases, when different iterations take significantly different amounts of time and you know the pattern, you can get better performance.
For example, I will increase the processing time of every 8th iteration from 100 ms to 1 s:
Now, with the default settings, the overall execution takes 2.3 seconds.
Nothing is wrong in the middle of the parallel For Loop's execution, but the end is not efficient: some cores sit idle because a 1 s iteration is parallelized with 0.1 s iterations. You can see it in the log at the end; this is why you get 2.3 seconds:
...
+1,40 s - Start Thread 61
+1,40 s - Start Thread 62
+1,50 s - Finish Thread 58
+1,50 s - Finish Thread 59
+1,50 s - Finish Thread 60
+1,50 s - Finish Thread 61
+1,50 s - Finish Thread 62
+1,50 s - Start Thread 63
+1,60 s - Finish Thread 63
+1,70 s - Finish Thread 40
+2,00 s - Finish Thread 48
+2,30 s - Finish Thread 56
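A quick back-of-the-envelope check: the total work here is 8 × 1 s + 56 × 0.1 s = 13.6 s, so with eight parallel instances the best possible time is 13.6 s / 8 = 1.7 s. The 2.3 seconds above means roughly 0.6 seconds are lost to idle cores in that tail.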
But if I set the chunk size to 8, then each chunk starts with a "heavy" iteration, so the heavy iterations are processed first, one per parallel instance:
The CPU will be continuously occupied until the end, and the overall time will be 1.7 seconds, which is 0.6 seconds better than before:
+0,00 s - Start Thread 00
+0,00 s - Start Thread 08
+0,00 s - Start Thread 16
+0,00 s - Start Thread 24
+0,00 s - Start Thread 32
+0,00 s - Start Thread 40
+0,00 s - Start Thread 48
+0,00 s - Start Thread 56
+1,00 s - Finish Thread 00
...
+1,60 s - Start Thread 63
+1,70 s - Finish Thread 07
+1,70 s - Finish Thread 15
+1,70 s - Finish Thread 23
+1,70 s - Finish Thread 31
+1,70 s - Finish Thread 39
+1,70 s - Finish Thread 47
+1,70 s - Finish Thread 55
+1,70 s - Finish Thread 63
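For completeness, the equivalent change in the OpenMP sketch above would be schedule(dynamic, 8): fixed chunks of eight consecutive iterations, each beginning with one heavy 1 s iteration (again only a sketch, with the timings of this example assumed):

/* build e.g. with: gcc -fopenmp chunk_heavy.c -o chunk_heavy */
#include <stdio.h>
#include <omp.h>

static void busy_wait(double seconds)
{
    double t0 = omp_get_wtime();
    while (omp_get_wtime() - t0 < seconds)
        ;   /* keep the core busy to simulate real work */
}

int main(void)
{
    double t0 = omp_get_wtime();

    /* chunks of 8: each of the 8 chunks starts with the heavy
       iteration (i % 8 == 0), so the heavy work begins first */
    #pragma omp parallel for schedule(dynamic, 8) num_threads(8)
    for (int i = 0; i < 64; i++)
        busy_wait((i % 8 == 0) ? 1.0 : 0.1);

    printf("Total: %.2f s\n", omp_get_wtime() - t0);
    return 0;
}

Each chunk is then 1 s + 7 × 0.1 s = 1.7 s of work, so with eight instances the whole loop should finish in about 1.7 s, matching the LabVIEW measurement.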
I wrote this because, in the case discussed here, the computation time of the Gauss-Lobatto Quadrature depends heavily on the limits. So this could be your situation, where different iterations take significantly different amounts of time.
But I recommend trying to adjust chunks only at the end and only if needed; otherwise, you can get slower execution than the default. Remember, as Donald Knuth said, "Premature optimization is the root of all evil."
02-03-2025 06:56 AM
I've seen the Chunk input but never found a scenario where it was useful. Nice example. So the For Loop isn't truly parallel, or should I say independent? It works/waits according to the number of instances?
02-03-2025 09:16 AM
@Yamaeda wrote:
I've seen the Chunk input but never found a scenario where it was useful. Nice example. So the For Loop isn't truly parallel, or should I say independent? It works/waits according to the number of instances?
It's a good question, really. No, they are truly parallel; there are no gaps in between, even when one thread takes longer than another.
But for visualization, I will need to go back to C.
What I will do is prepare a pretty simple "Thread Simulator" that returns the current thread ID.
As a result, I will know in which thread each chunk is really executed:
#include <Windows.h>
#include <utility.h>        // LabWindows/CVI utility library
#include "ThreadWorker.h"

int Your_Functions_Here (int Delay)
{
    DWORD thr = GetCurrentThreadId();   // ID of the thread executing this call
    Sleep(Delay);                       // simulate "Delay" milliseconds of work
    return (int)thr;                    // return the ID so the VI can log it
}
The idea is to get the real ID of the thread that executes the DLL call (this assumes the Call Library Function Node is configured to run in any thread; if it ran in the UI thread, every call would report the same ID).
Now my worker function will measure the start/stop times and store the thread ID returned from the DLL:
Now I can build a timeline visualization in a 2D Picture (the code inside is slightly messy; sorry about that, my lunch break is too short for a cleanup):
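For those without LabVIEW, the idea of the visualization can be sketched in plain C as well; a hypothetical text-mode timeline with made-up sample records (real records would come from the Queue, with the large OS thread IDs remapped to small row indices):

#include <stdio.h>
#include <string.h>

typedef struct {
    int    row;     /* thread ID remapped to a row index */
    double start;   /* seconds from t0 */
    double stop;
} Record;

int main(void)
{
    /* made-up sample data: row 0 runs one heavy 1 s call,
       row 1 runs short calls with a 100 ms idle gap */
    Record recs[] = {
        { 0, 0.0, 1.0 }, { 0, 1.0, 1.1 }, { 0, 1.1, 1.2 },
        { 1, 0.0, 0.1 }, { 1, 0.1, 0.2 }, { 1, 0.3, 1.3 },
    };
    int n = sizeof recs / sizeof recs[0];

    /* one character per 100 ms slot: '#' = busy, '.' = idle */
    char rows[2][15];
    memset(rows, '.', sizeof rows);

    for (int i = 0; i < n; i++) {
        int from = (int)(recs[i].start * 10.0 + 0.5);
        int to   = (int)(recs[i].stop  * 10.0 + 0.5);
        for (int s = from; s < to && s < 14; s++)
            rows[recs[i].row][s] = '#';
    }

    for (int t = 0; t < 2; t++) {
        rows[t][14] = '\0';
        printf("thread %d: %s\n", t, rows[t]);
    }
    return 0;
}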
Now let's say every 8th call will execute for 1 second, the rest for 100 ms.
This is what happened with default chunks and eight threads in this case:
As you can see, there are no gaps: the first thread is busy for 1 second, but all the others stay active; thread 4 runs longest, the overall time is 2.3 seconds, and now you can clearly see why.
By the way, it's slightly different from run to run (as expected):
And this is with chunks == 8:
Random execution time 50...500 ms:
This is how it works under the hood:
Something like that.
02-03-2025 10:07 AM
Oops, sorry, the Queue here is a kind of Rube Goldberg code.
This is a simplified version; it works exactly the same: