Parallel For Loop Increases Iteration Execution Time


@mrtoad wrote:

I thought root loop was only a problem with opening the reference. Is it still a problem even when the reference is opened outside the loop?


You could try the following to see if it makes a difference: assuming you have N parallel instances, open N references before the loop, one for each call. Right now you have one reference for the whole For loop; I don't think it is opening a new instance every iteration, since there is no Open inside the loop.

Message 11 of 24

@mrtoad wrote:

Wow, thank you so much for this! You managed to solve the problem AND boost the base speed by 2x.

 

Out of curiosity, do you know why call by reference causes problems in parallel even when the reference is reentrant? 

 


You're welcome! Glad to see that it was helpful for you.

The calls by reference themselves are not the problem. They do add some overhead, but that is not the root cause.

It is quite simple to demonstrate.

I will create a simple "Dummy" VI with the following calculations:

dummy.png

This is how it will be called and benchmarked:

demo1.png

Sometimes I use the 0x80 option flag when my SubVI is configured with preallocated clones; in that case it looks like the screenshot below: I open the same number of references as there are parallel instances, then take one from the "pool" for each call:

 

demo.png

This is a typical design pattern for parallel calls by reference (both VIs are attached for experiments).

 

Results:

Single Thread:

1.png

Same performance; so far, so good.

Two threads:

2.png

Roughly half the time; works as expected.

Four threads:

4a.png

In general, it does the job, but a direct call is slightly faster than a call by reference (obviously, because it is inlined). So, if you can avoid a call by reference, avoid it. The real problem lies somewhere within NI Math. It could be that some SubVIs are not marked as reentrant (the original Quadrature Integrand.vi from the Examples, for instance, is not), or not inlined, or perhaps there is some mixture of calls by reference inside reentrant VIs. Who knows? I usually say that every observed behavior has a rational explanation, but in this particular case it was much easier to "close my eyes" and replace all calls with their content (a kind of "forced inlining") to reduce any possible overhead without deeper investigation. Of course, this resulted in a "spaghetti" VI that now needs serious refactoring.

If you need ultimate performance, then in theory, you could recreate all these computations in pure C, compile it with a good compiler, and then call it as a DLL. I expect around a 3x-5x performance boost (at least). However, this would increase the "maintenance cost" of the solution, as you would be mixing two programming languages and adding additional dependencies.

 

PS

By the way, I forgot to change the flag back to 0x40 in the previous benchmark, so the difference is not quite so dramatic, but it is still faster than the original NI math:

Screenshot 2025-02-01 08.10.16.png

 

Message 12 of 24

Until you hit the default LV thread limits, unless you see nearly linear scaling, things aren't actually reentrant, even when you *KNOW* that everything should be. My usual suspects are something not being reentrant in the call chain (or blocking on a shared DVR or some other shared resource), or something like VI Server, which scales non-linearly with parallel access since (based on my observations) it honors caller priority, meaning that every time VI Server becomes available it has to reassess every pending caller to check priority. If the hierarchy calls into DLLs/SOs, it depends entirely on the library implementation and on the library returning quickly enough to allow the LV execution system to shift to other pending code for that thread.

~ G stands for Fun ~
Helping pave the path to long-term living and thriving in space.
Message 13 of 24

Just one more thing.

 

I was too pessimistic when I estimated a 3-5x performance boost after re-implementing this algorithm (which is the Gauss-Lobatto method) in C and compiling it into a DLL.

 

Let's check. This is the C code, as is:

 

//==============================================================================
//
// Title:		QuadratureLib
// Purpose:		Gauss Lobatto implementation.
//
// Created on:	01.02.2025 at 15:36:34 by AD.
// Copyright:	based on https://bnikolic.co.uk/nqm/1dinteg/gausslobatto.
//
//==============================================================================

#include "framework.h"
#include "QuadratureLib.h"


/**
 * Quadrature Function f(x) same as used in LabVIEW's Example
 *
 * terms: bit 1 selects 0: sin(x), 1: cos(x); bit 2 selects 0: cos(0.5*x), 1: sin(0.5*x); bit 3 selects 0: x*x, 1: x
 * @return term1 + term2 - term3
 */
QUADRATURELIB_API double f(double x, int terms)
{
	double Term1 = (terms & 0x1) ? cos(x) : sin(x);
	double Term2 = (terms & 0x2) ? sin(0.5 * x) : cos(0.5 * x);
	double Term3 = (terms & 0x4) ? (x) : (x * x);
	return Term1 + Term2 - Term3;
}


/**
 * Perform a single step of the Gauss-Lobatto integration
 * terms of the Function to integrate
 * a Lower integration limit
 * b Upper integration limit
 * fa Value of function at the lower limit (used to save an evaluation when refinement is used)
 * fb Value of function at the upper limit (same as above)
 * neval Number of evaluations made so far
 * maxeval Maximum number of evaluations which should not be exceeded
 * acc Required accuracy expressed in units of epsilon(). This allows a less-than comparison by using addition and equality.
 * err Returned error code
*/
double GaussLobattoIntStep(const int terms,
	double a, double b,
	double fa, double fb,
	size_t* neval,
	size_t maxeval,
	double acc,
	int* err)
{
	if (*err) return NAN;
	// Constants used in the algorithm
	const double alpha = sqrt(2.0 / 3.0);
	const double beta = 1.0 / sqrt(5.0);

	if (*neval >= maxeval) { //Maximum number of evaluations reached in Gauss Lobatto
		*err = 5001;
		return NAN;
	}

	// Here the abscissa points and function values for both the 4-point and the 7-point rule are calculated
	// (the points at the end of interval come from the function call, i.e., fa and fb.
	// Also note the 7-point rule re-uses all the points of the 4-point rule.)
	const double h = (b - a) / 2.0;
	const double m = (a + b) / 2.0;

	const double mll = m - alpha * h;
	const double ml = m - beta * h;
	const double mr = m + beta * h;
	const double mrr = m + alpha * h;

	const double fmll = f(mll, terms);
	const double fml = f(ml, terms);
	const double fm = f(m, terms);
	const double fmr = f(mr, terms);
	const double fmrr = f(mrr, terms);
	(*neval) += 5;

	// Both the 4-point and 7-point rule integrals are evaluated
	const double integral2 = (h / 6.0) * (fa + fb + 5.0 * (fml + fmr));
	const double integral1 = (h / 1470.0) * (77.0 * (fa + fb) + 432.0 * (fmll + fmrr) + 625.0 * (fml + fmr) + 672.0 * fm);

	// The difference between the 4-point and 7-point integrals is the estimate of the accuracy
	const double estacc = (integral1 - integral2);

	// The volatile keyword should prevent the floating point destination value from being stored in extended precision
	// registers which actually have a very different epsilon(). 
	volatile double dist = acc + estacc;

	if (dist == acc || mll <= a || b <= mrr) {
		if (!(m > a && b > m)) { // Integration reached an interval with no more machine numbers
			*err = 5002;
			return NAN;
		}
		return integral1;
	}
	else {
		return  GaussLobattoIntStep(terms, a, mll, fa, fmll, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, mll, ml, fmll, fml, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, ml, m, fml, fm, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, m, mr, fm, fmr, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, mr, mrr, fmr, fmrr, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, mrr, b, fmrr, fb, neval, maxeval, acc, err);
	}
} // GaussLobattoIntStep


/** Compute the Gauss - Lobatto integral
 * terms of the function to be integrated
 * a The lower integration limit
 * b The upper integration limit
 * abstol Absolute tolerance -- integration stops when the error estimate is smaller than this
 * maxeval Maximum number of evaluations to make. If this number of evaluations is made without reaching the required accuracy, an error is returned.
*/
QUADRATURELIB_API double GaussLobattoInt(const int terms, double a, double b, double abstol, size_t maxeval, int* err)
{ 
	const double tol_epsunit = abstol / (2.22045E-16); // abstol expressed in units of DBL_EPSILON
	size_t neval = 0;
	double f_a = f(a, terms);
	double f_b = f(b, terms);
	return GaussLobattoIntStep(terms, a, b, f_a, f_b, &neval, maxeval, tol_epsunit, err);
}

double Example(int terms)
{
	int err = 0; // must be initialized: GaussLobattoIntStep checks *err on entry
	return GaussLobattoInt(terms, -2.0, 2.0, 1e-5, 1000, &err); //Like NI
}

This is how it is used in the VI:

 

subVI.png

 

And benchmark:

 

bench.png

Result on 64bit — 45 times faster:

1-64.png

And the good point is that this approach is much better "suited" for parallelization, for example, for 8 threads:

8-64.png

LabVIEW's code processing time decreased from 2.9 seconds down to 1.2 seconds, while the DLL's time improved from 64 ms to 16 ms, resulting in a 70x faster execution in 8 threads.

The only problem is that it’s hard to get exactly the same result because floating-point arithmetic is not associative in general. I mean, (a + b) + c is not necessarily equal to a + (b + c). Therefore, a direct comparison will not work. However, the proposed algorithm differs from the original by no more than 10⁻¹⁴ to 10⁻¹⁵, which is acceptable in most applications.

For 32-bit systems, the performance boost is not as significant. The DLL is around 20x faster for a single thread and up to 35x faster for 8 threads.

The provided code was compiled with Visual Studio 2022, so to run the attached code you might need the latest Microsoft Visual C++ Redistributable, if it is not installed yet.

As you stated in the first message above, the initial execution time is around 40 seconds, so it makes sense to give the DLL a try; you can get it below one second. Also remember that by default LabVIEW reserves 24 threads per execution system. If you need to utilize all 32 cores from a parallel for loop, you have to do some "fine tuning" and set the thread limit for the Standard execution system to at least 32; otherwise you will not be able to run more than 24 parallel instances of a VI with the DLL inside. Refer to Configuring the Number of Execution Threads Assigned to a LabVIEW Execution System.

Message 14 of 24

This is absolutely incredible! Just to give you an idea of how much time you are saving me, my total dataset is about 70 files each with 3000 rows. Processing each row takes 40 seconds without parallelization. Something that would take 100 days to complete could be done over a weekend. Really needed this extra time too since I am trying to finish my PhD this semester. Thank you so much for taking the time to do this! I'll let you know the final results once I get it working.

Message 15 of 24

@mrtoad wrote:

This is absolutely incredible! Just to give you an idea of how much time you are saving me, my total dataset is about 70 files each with 3000 rows. Processing each row takes 40 seconds without parallelization. Something that would take 100 days to complete could be done over a weekend. Really needed this extra time too since I am trying to finish my PhD this semester. Thank you so much for taking the time to do this! I'll let you know the final results once I get it working.


All right, you're on the right path towards high-performance numerical methods.

 

One side note: By default, LabVIEW will reserve 24 threads for you. The easiest trick I use to check multithreading settings is the following:

 

t_snip.png

Here, I call Sleep(1000) from kernel32.dll in, for example, 100 threads (by default, LabVIEW allows you to configure at most 64 parallel instances; see below for how to increase this value).

 

When running on LabVIEW "out of the box", this gives me a 5-second execution time. This is because only the first 24 calls are executed in parallel, then the next 24, and so on, with the final 4 in the last batch; hence, 5 seconds.

 

The trick is to add the following lines to LabVIEW.ini (don't forget to close LabVIEW before editing LabVIEW.ini):

 

ParallelLoop.PrevNumLoopInstances=100
ParallelLoop.MaxNumLoopInstances=100
ESys.StdNParallel=-1
ESys.Normal=100

 

The first two lines allow up to 100 instances for the parallel for loop and make that the default. The last two lines increase the thread pool to 100 for the Standard execution system; note that you can do this separately for each execution system, i.e., these (the link to the KB article was provided above):

 

Screenshot 2025-02-02 07.12.32.png

 

Now, the execution time has dropped down to 1 second as expected, even on my 4/8-core CPU:

 

Screenshot 2025-02-02 07.14.42.png

This doesn't mean, of course, that I will get a 100x performance improvement on real computation (I still have only 8 logical processors), but technically I'm no longer limited. Also, a word of warning: additional threads mean additional overhead to create them, switch context, and so on. Use this only when necessary; otherwise you might get slower execution of regular LabVIEW code by trying to run it with a huge number of threads.

Hyperthreading is another point — it makes sense to try with and without, and in many cases, overall performance is better when HT is off.

 

Second side note about compilers: It's not absolutely necessary to use Microsoft Visual Studio; you can use any suitable C compiler, but they are not the same from a performance point of view. My personal favorites are Visual Studio 17.12.4, then Intel OneAPI 2025.0.4, and GCC 14.2.0. NI CVI 2020 is not the best because it's based on the ancient Clang 3.3.

 

Just for fun, I recompiled this using GCC and got almost the same performance (GCC may even be a few percent faster than MSVC). The easiest way to get GCC installed on Windows is through MSYS2, and then you can use VSCode for development.

 

There are some "ready-to-use" numerical-method libraries, but I was unable to find any suitable for this particular case. The method can be implemented in MATLAB, but integrating MATLAB into LabVIEW can be painful. I usually use MATLAB for "cross-checking" results, but never integrate it into LabVIEW.

 

I'm not sure how familiar you are with C, DLLs, and integrating external code into LabVIEW, but it surely makes sense to invest some time in this area rather than waiting for long and slow computations to finish. Some performance improvements are also possible in pure LabVIEW code, but in this case you will never reach the same level of performance that way.

Message 16 of 24

@Andrey_Dmitriev wrote:
ParallelLoop.PrevNumLoopInstances=100
ParallelLoop.MaxNumLoopInstances=100
ESys.StdNParallel=-1
ESys.Normal=100

Should these lines be added to a built EXE also? Thanks

Message 17 of 24

@mcduff wrote:

@Andrey_Dmitriev wrote:
ParallelLoop.PrevNumLoopInstances=100
ParallelLoop.MaxNumLoopInstances=100
ESys.StdNParallel=-1
ESys.Normal=100

Should these lines be added to a built EXE also? Thanks


The last two should be, yes. The first two are only necessary for the LabVIEW IDE to set more than 64 instances from the dialog GUI. Once set for a given for loop, they will persist.

Message 18 of 24

How the F*** can the DLL be that much faster? Is it too much safety measures and memory management in LV?

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer