mcduff
			01-31-2025 03:32 PM
@mrtoad wrote:
I thought root loop was only a problem with opening the reference. Is it still a problem even when the reference is opened outside the loop?
You could try the following to see if it makes a difference. Assuming you have N parallel instances, open N references before the loop, one for each call. Right now you have one reference for the whole For Loop; I don't think it opens a new instance on every iteration, since there is no Open inside the loop.
Andrey_Dmitriev
			02-01-2025 12:54 AM - edited 02-01-2025 01:18 AM
@mrtoad wrote:
Wow, thank you so much for this! You managed to solve the problem AND boost the base speed by 2x.
Out of curiosity, do you know why call by reference causes problems in parallel even when the reference is reentrant?
You're welcome! Glad to see that it was helpful for you.
The calls by reference themselves are not the problem. They do add some overhead, but that is not the root cause.
It is quite simple to demonstrate.
I will create a simple "Dummy" VI with the following calculations:
This is how it will be called and benchmarked:
Sometimes I use flag x80 when my SubVI is configured for preallocated clones. In that case it looks as shown below: I open the same number of references as parallel instances, then take one from the "pool" for each call:
This is a typical design pattern for parallel calls by reference (both VIs are attached for experiments).
Results:
Single Thread:
Same performance, so far so good.
Two threads:
Roughly half time, works as expected.
Four threads:
In general, it does the job, but a direct call is slightly faster than a call by reference (obviously, because it is inlined). So, if you can avoid a call by reference, avoid it. The real problem lies somewhere within NI Math. It could be that some SubVIs are not marked as reentrant (for example, the original Quadrature Integrand.vi from the examples is not) or are not inlined, or perhaps there is some mixture of calls by reference inside reentrant VIs. Who knows? I usually say that every observed behavior has a rational explanation, but in this particular case it was much easier to "close my eyes" and replace all calls with their content (a kind of "forced inlining") to reduce any possible overhead without deeper investigation. Of course, this resulted in a "spaghetti" VI that now needs heavy refactoring.
If you need ultimate performance, then in theory, you could recreate all these computations in pure C, compile it with a good compiler, and then call it as a DLL. I expect around a 3x-5x performance boost (at least). However, this would increase the "maintenance cost" of the solution, as you would be mixing two programming languages and adding additional dependencies.
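To make the "call it as a DLL" idea a bit more concrete, here is a minimal sketch of what an exported C function could look like. The macro name `EXAMPLE_API` and the function `example_sum_of_squares` are hypothetical names made up for illustration; they are not part of the attached code. The point is only that the function touches nothing but its arguments and locals, which is what makes it safe to call from several parallel loop instances.

```c
#include <assert.h>
#include <stddef.h>

/* EXAMPLE_API is a hypothetical export macro, not taken from any real header. */
#if defined(_WIN32)
#define EXAMPLE_API __declspec(dllexport)
#else
#define EXAMPLE_API __attribute__((visibility("default")))
#endif

/* Sum of squares of n doubles. Uses only its arguments and locals,
   so concurrent calls from several threads do not interfere. */
EXAMPLE_API double example_sum_of_squares(const double *values, size_t n)
{
	double sum = 0.0;
	assert(values != NULL || n == 0);
	for (size_t i = 0; i < n; ++i)
		sum += values[i] * values[i];
	return sum;
}
```

Calling `example_sum_of_squares` on the array {1.0, 2.0, 3.0} returns 14.0. A function with global or static state would need extra care before being called in parallel.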
PS
By the way, I forgot to change the flag back to x40 in the previous benchmark, so the difference is not as dramatic, but it is still faster than the original NI math:
IlluminatedG
			02-01-2025 01:11 AM
Until you hit default LV thread limits, unless you *KNOW* that everything is reentrant but you see nearly linear behavior... then things aren't actually reentrant. My usual suspicions are always something not being reentrant in the call chain (or blocking on a shared DVR or some other shared resource) or something like VI server which scales non-linearly with parallel access since (based on my observations) it observes caller priority which means it needs to reassess every caller that is pending to check priority every time VI server becomes available. If the hierarchy calls into DLLs/SOs it entirely depends on the library implementation and the library returning quickly to allow the LV execution system to shift to other pending code for that thread.
Andrey_Dmitriev
			02-01-2025 02:11 PM - edited 02-01-2025 02:22 PM
Just one more thing.
I was too pessimistic when I estimated a 3-5x performance boost from re-implementing this algorithm (which is the Gauss-Lobatto method) in C and compiling it into a DLL.
Let's check. This is the C code as is:
//==============================================================================
//
// Title:		QuadratureLib
// Purpose:		Gauss Lobatto implementation.
//
// Created on:	01.02.2025 at 15:36:34 by AD.
// Copyright:	based on https://bnikolic.co.uk/nqm/1dinteg/gausslobatto.
//
//==============================================================================
#include "framework.h"
#include "QuadratureLib.h"
#include <math.h>   // sqrt, sin, cos, NAN
#include <stddef.h> // size_t
/**
 * Quadrature function f(x), the same as used in LabVIEW's example.
 *
 * terms (bit mask): bit 1 (0x1) 0: sin(x), 1: cos(x); bit 2 (0x2) 0: cos(0.5*x), 1: sin(0.5*x); bit 3 (0x4) 0: x*x, 1: x
 * @return term1 + term2 - term3
 */
QUADRATURELIB_API double f(double x, int terms)
{
	double Term1 = (terms & 0x1) ? cos(x) : sin(x);
	double Term2 = (terms & 0x2) ? sin(0.5 * x) : cos(0.5 * x);
	double Term3 = (terms & 0x4) ? (x) : (x * x);
	return Term1 + Term2 - Term3;
}
/**
 * Perform a single step of the Gauss-Lobatto integration
 * terms Terms of the function to integrate
 * a Lower integration limit
 * b Upper integration limit
 * fa Value of the function at the lower limit (used to save an evaluation when refinement is used)
 * fb Value of the function at the upper limit (the same as above)
 * neval Number of evaluations made so far
 * maxeval Maximum number of evaluations which should not be exceeded
 * acc Required accuracy expressed in units of epsilon(). This allows less-than comparison by using addition and equality.
 * err Returned error code (must be 0 on entry; a non-zero value makes the step return NAN immediately)
*/
double GaussLobattoIntStep(const int terms,
	double a, double b,
	double fa, double fb,
	size_t* neval,
	size_t maxeval,
	double acc,
	int* err)
{
	if (*err) return NAN;
	// Constants used in the algorithm
	const double alpha = sqrt(2.0 / 3.0);
	const double beta = 1.0 / sqrt(5.0);
	if (*neval >= maxeval) { //Maximum number of evaluations reached in Gauss Lobatto
		*err = 5001;
		return NAN;
	}
	// Here the abscissa points and function values for both the 4-point and the 7-point rule are calculated
	// (the points at the ends of the interval come from the function call, i.e., fa and fb.
	// Also note the 7-point rule re-uses all the points of the 4-point rule.)
	const double h = (b - a) / 2.0;
	const double m = (a + b) / 2.0;
	const double mll = m - alpha * h;
	const double ml = m - beta * h;
	const double mr = m + beta * h;
	const double mrr = m + alpha * h;
	const double fmll = f(mll, terms);
	const double fml = f(ml, terms);
	const double fm = f(m, terms);
	const double fmr = f(mr, terms);
	const double fmrr = f(mrr, terms);
	(*neval) += 5;
	// Both the 4-point and 7-point rule integrals are evaluated
	const double integral2 = (h / 6.0) * (fa + fb + 5.0 * (fml + fmr));
	const double integral1 = (h / 1470.0) * (77.0 * (fa + fb) + 432.0 * (fmll + fmrr) + 625.0 * (fml + fmr) + 672.0 * fm);
	// The difference between the 4-point and 7-point integrals is the estimate of the accuracy
	const double estacc = (integral1 - integral2);
	// The volatile keyword should prevent the floating point destination value from being stored in extended precision
	// registers which actually have a very different epsilon(). 
	volatile double dist = acc + estacc;
	if (dist == acc || mll <= a || b <= mrr) {
		if (!(m > a && b > m)) { // Integration reached an interval with no more machine numbers
			*err = 5002;
			return NAN;
		}
		return integral1;
	}
	else {
		return  GaussLobattoIntStep(terms, a, mll, fa, fmll, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, mll, ml, fmll, fml, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, ml, m, fml, fm, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, m, mr, fm, fmr, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, mr, mrr, fmr, fmrr, neval, maxeval, acc, err)
			+ GaussLobattoIntStep(terms, mrr, b, fmrr, fb, neval, maxeval, acc, err);
	}
} // GaussLobattoIntStep
/** Compute the Gauss-Lobatto integral
 * terms Terms of the function to be integrated
 * a The lower integration limit
 * b The upper integration limit
 * abstol Absolute tolerance -- integration stops when the error estimate is smaller than this
 * maxeval Maximum number of evaluations to make. If this number of evaluations is made without reaching the required accuracy, error code 5001 is returned in err.
 * err Error code; the caller must initialize it to 0 before the call
*/
QUADRATURELIB_API double GaussLobattoInt(const int terms, double a, double b, double abstol, size_t maxeval, int* err)
{ 
	const double tol_epsunit = abstol / (2.22045E-16);
	size_t neval = 0;
	double f_a = f(a, terms);
	double f_b = f(b, terms);
	return GaussLobattoIntStep(terms, a, b, f_a, f_b, &neval, maxeval, tol_epsunit, err);
}
double Example(int terms)
{
	int err = 0; // Must start at 0: the step function returns NAN immediately if *err is non-zero
	return GaussLobattoInt(terms, -2.0, 2.0, 1e-5, 1000, &err); //Like NI
}
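A short note on the `acc + estacc == acc` test used in the step function, since it looks odd at first glance. The tolerance is divided by machine epsilon once, and after that "is the error estimate small enough?" becomes "does adding it to acc change acc at all?". Below is a minimal standalone sketch of the same trick; the helper names `tolerance_in_eps_units` and `is_negligible` are made up for this illustration, and I use `DBL_EPSILON` from `<float.h>` instead of the literal 2.22045E-16.

```c
#include <assert.h>
#include <float.h>

/* Express an absolute tolerance in units of machine epsilon. */
static double tolerance_in_eps_units(double abstol)
{
	assert(abstol > 0.0);
	return abstol / DBL_EPSILON;
}

/* Returns 1 when adding est to acc is lost to rounding, i.e. est is
   small compared with the tolerance that acc encodes.
   volatile forces the sum to be rounded to a 64-bit double. */
static int is_negligible(double acc, double est)
{
	volatile double dist = acc + est;
	return dist == acc;
}
```

With abstol = 1e-5, an estimate of 1e-6 is absorbed by rounding (negligible), while 1e-4 is not. The exact threshold sits somewhere between abstol/4 and abstol depending on where acc falls between powers of two, so treat the tolerance as an order of magnitude rather than a sharp limit.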
How it is used in the VI:
And the benchmark:
Result on 64-bit, 45 times faster:
And the good point is that this approach is much better suited for parallelization, for example, for 8 threads:
LabVIEW's code processing time decreased from 2.9 seconds down to 1.2 seconds, while the DLL's time improved from 64 ms to 16 ms, resulting in a 70x faster execution in 8 threads.
The only problem is that it’s hard to get exactly the same result because floating-point arithmetic is not associative in general. I mean, (a + b) + c is not necessarily equal to a + (b + c). Therefore, a direct comparison will not work. However, the proposed algorithm differs from the original by no more than 10⁻¹⁴ to 10⁻¹⁵, which is acceptable in most applications.
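Here is a small sketch of what I mean, assuming IEEE 754 doubles. The helper names `sum_left`, `sum_right` and `nearly_equal` are made up for this example. The two sums use the same three numbers in a different grouping and differ in the last bit, so the comparison has to be done with a tolerance instead of `==`.

```c
#include <assert.h>

/* (a + b) + c, with the intermediate forced to a 64-bit double. */
static double sum_left(double a, double b, double c)
{
	volatile double t = a + b;
	return t + c;
}

/* a + (b + c), with the intermediate forced to a 64-bit double. */
static double sum_right(double a, double b, double c)
{
	volatile double t = b + c;
	return a + t;
}

/* Absolute-tolerance comparison: |x - y| <= tol. */
static int nearly_equal(double x, double y, double tol)
{
	double d = x - y;
	assert(tol >= 0.0);
	if (d < 0.0)
		d = -d;
	return d <= tol;
}
```

For a = 0.1, b = 0.2, c = 0.3, `sum_left` and `sum_right` are not bit-identical, but `nearly_equal` with a tolerance of 1e-14 accepts them. This is the kind of check to use when comparing the DLL against the original VI.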
For 32-bit systems, the performance boost is not as significant. The DLL is around 20x faster for a single thread and up to 35x faster for 8 threads.
The provided code was compiled with Visual Studio 2022, so to run the attached code you might need the latest Microsoft Visual C++ Redistributable if it is not installed yet.
As you stated in your first message above, the initial execution time is around 40 seconds, so it makes sense to give the DLL a try; you can get below one second. Also remember that by default LabVIEW will reserve 24 threads per execution system. If you need to utilize all 32 cores from a parallel For Loop, then you have to perform some "fine tuning" and set the thread limit for the Standard execution system to at least 32; otherwise you will not be able to run more than 24 parallel instances of the VI with the DLL inside. Refer to Configuring the Number of Execution Threads Assigned to a LabVIEW Execution System.
02-01-2025 05:15 PM
This is absolutely incredible! Just to give you an idea of how much time you are saving me, my total dataset is about 70 files each with 3000 rows. Processing each row takes 40 seconds without parallelization. Something that would take 100 days to complete could be done over a weekend. Really needed this extra time too since I am trying to finish my PhD this semester. Thank you so much for taking the time to do this! I'll let you know the final results once I get it working.
Andrey_Dmitriev
			02-02-2025 01:00 AM - edited 02-02-2025 01:09 AM
@mrtoad wrote:
This is absolutely incredible! Just to give you an idea of how much time you are saving me, my total dataset is about 70 files each with 3000 rows. Processing each row takes 40 seconds without parallelization. Something that would take 100 days to complete could be done over a weekend. Really needed this extra time too since I am trying to finish my PhD this semester. Thank you so much for taking the time to do this! I'll let you know the final results once I get it working.
All right, you're on the right path towards high-performance numerical methods.
One side note: by default, LabVIEW will reserve 24 threads for you. The easiest trick I use to check the multithreading settings is the following:
Here, I call Sleep(1000) from kernel32.dll in, for example, 100 threads (by default, LabVIEW will allow you to configure a maximum of 64 instances; see below on how to increase this value).
When running on LabVIEW "out of the box", this gives me a 5-second execution time. This is because only the first 24 calls are executed in parallel, then the next 24, and so on, with the final 4 remaining: five batches of one second each, hence 5 seconds.
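The arithmetic behind this is just a ceiling division. A tiny sketch, under the simplifying assumption that all calls take the same time and run in strict batches (the function names are made up for this illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Number of sequential batches when `calls` blocking calls share `threads` threads. */
static size_t batches_needed(size_t calls, size_t threads)
{
	assert(threads > 0);
	return (calls + threads - 1) / threads; /* ceiling of calls / threads */
}

/* Expected wall-clock time if every call blocks for seconds_per_call. */
static double expected_seconds(size_t calls, size_t threads, double seconds_per_call)
{
	return (double)batches_needed(calls, threads) * seconds_per_call;
}
```

With 100 calls of one second each, 24 threads give 5 batches (24 + 24 + 24 + 24 + 4), so about 5 seconds, while 100 threads give a single batch, so about 1 second.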
The trick is to add the following lines to LabVIEW.ini (don't forget to close LabVIEW before editing LabVIEW.ini):
ParallelLoop.PrevNumLoopInstances=100
ParallelLoop.MaxNumLoopInstances=100
ESys.StdNParallel=-1
ESys.Normal=100
The first two lines allow 100 instances for the parallel For Loop and make the same value the default. The last two lines increase the pool to 100 for the Standard execution system. Note that you can do this separately for each execution system, I mean these (the link to the KB article was provided above):
Now the execution time has dropped to 1 second as expected, even on my 4/8-core CPU:
This doesn't mean, of course, that I will get a 100x performance improvement on real computations (I still have 8 logical processors), but technically I'm no longer limited. Also, remember as a warning that additional threads mean additional overhead to create them, switch context, and so on. Use this only when necessary; otherwise, regular LabVIEW code might run slower when you try to execute it with a huge number of threads.
Hyperthreading is another point: it makes sense to try with and without it, and in many cases overall performance is better when HT is off.
Second side note about compilers: It's not absolutely necessary to use Microsoft Visual Studio; you can use any suitable C compiler, but they are not the same from a performance point of view. My personal favorites are Visual Studio 17.12.4, then Intel OneAPI 2025.0.4, and GCC 14.2.0. NI CVI 2020 is not the best because it's based on the ancient Clang 3.3.
Just for fun, I recompiled this using GCC and got almost the same performance (GCC may be a few percent better than MSVC). The easiest way to get GCC installed on Windows is through MSYS2, and then you can use VSCode for development.
Some "ready-to-use" numeric method libraries can be used, but I was unable to find any suitable for this particular case. This method can be implemented in MATLAB, but integrating MATLAB into LabVIEW can be painful. I usually use MATLAB for "cross-checking" results, but never integrate it into LabVIEW.
I'm not sure how familiar you are with C, DLLs, and integrating external code into LabVIEW, but it certainly makes sense to invest some time in this area rather than wait for long and slow computations to finish. Some performance improvements are also possible in the LabVIEW code, but that way you will never get the same level of performance.
mcduff
			02-02-2025 12:41 PM
@Andrey_Dmitriev wrote:
ParallelLoop.PrevNumLoopInstances=100
ParallelLoop.MaxNumLoopInstances=100
ESys.StdNParallel=-1
ESys.Normal=100
Should these lines be added to a built EXE also? Thanks
Andrey_Dmitriev
			02-02-2025 12:55 PM
@mcduff wrote:
@Andrey_Dmitriev wrote:
ParallelLoop.PrevNumLoopInstances=100
ParallelLoop.MaxNumLoopInstances=100
ESys.StdNParallel=-1
ESys.Normal=100
Should these lines be added to a built EXE also? Thanks
The last two should be, yes. The first two are only necessary for the LabVIEW IDE, to set more than 64 instances from the dialog GUI. Once set for a given For Loop, they will persist.
Yamaeda
			02-03-2025 03:09 AM
How the F*** can the DLL be that much faster? Is it too much safety measures and memory management in LV?
Andrey_Dmitriev
			02-03-2025 03:42 AM - edited 02-03-2025 03:58 AM
@Yamaeda wrote:
How... the DLL be that much faster? Is it too much safety measures and memory management in LV?
Yes, the LabVIEW compiler is far from optimal (and, by the way, the old Visual Studio 2015 is still used by NI for development). Modern C compilers are very efficient: they perform vectorization, actively use SIMD, unroll loops, and so on, and the result is very efficient machine code.
There is also additional overhead caused by graphical programming. For example, you have an array with 10 elements. If you try to access the 11th element in C, you read past the end of the array, which is undefined behavior and at best a crash. In LabVIEW you can index an array outside of its allocated memory and nothing bad happens, but this luxury is not free. Single operations like multiplying an array by a scalar are optimized pretty well, but all together it adds up to a huge overhead, especially in code like this:
And you will not believe what happens on a high-end machine like this one:
Here, with 56 threads, the speedup is a factor of 400x:
Well, well, well, the given LabVIEW code was not optimized and there is a lot of room for improvement, but you have no chance of reaching the same performance.
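To picture the array-indexing point from above in C terms, here is a rough sketch. It is only an analogy and says nothing about how LabVIEW is implemented internally; it assumes the documented behavior that indexing outside the array yields a default value. The names `checked_index`, `sum_checked` and `sum_unchecked` are made up. The checked version pays for a comparison on every access, while the unchecked loop is the form a C compiler can vectorize most easily.

```c
#include <assert.h>
#include <stddef.h>

/* "Safe" read: returns fallback instead of reading outside the array. */
static double checked_index(const double *a, size_t n, size_t i, double fallback)
{
	return (i < n) ? a[i] : fallback;
}

/* Sum the first count elements using the checked read (one compare per element). */
static double sum_checked(const double *a, size_t n, size_t count)
{
	double sum = 0.0;
	for (size_t i = 0; i < count; ++i)
		sum += checked_index(a, n, i, 0.0);
	return sum;
}

/* Plain C loop: no per-element check, the caller guarantees count <= n. */
static double sum_unchecked(const double *a, size_t n, size_t count)
{
	double sum = 0.0;
	assert(count <= n);
	for (size_t i = 0; i < count; ++i)
		sum += a[i];
	return sum;
}
```

For a 3-element array, `checked_index` with index 10 just returns the fallback value, and `sum_checked` can even be asked for 5 elements without harm, whereas `sum_unchecked` relies on the caller getting the count right.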
In the past I also played with well-optimized LabVIEW code, for example, the SHA-256 Byte Array Checksum VI.
The goal was to compare different compilers. The C-code used for replacement is the following:
/******************************************************************************
* Filename:   sha256.c
* Author:     Brad Conte (brad AT bradconte.com)
* Copyright:
* Disclaimer: This code is presented "as is" without any guarantees.
* Details:    Implementation of the SHA-256 hashing algorithm.
              SHA-256 is one of the three algorithms in the SHA2
              specification. The others, SHA-384 and SHA-512, are not
              offered in this implementation.
Algorithm specification can be found here:
http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf
              This implementation uses little endian byte order.
******************************************************************************/
/*************************** HEADER FILES ***************************/
//#include <ansi_c.h>
#include <stdlib.h>
//#include <memory.h>
#include "sha256.hpp"
/****************************** MACROS ******************************/
#define ROTLEFT(a,b) (((a) << (b)) | ((a) >> (32-(b))))
#define ROTRIGHT(a,b) (((a) >> (b)) | ((a) << (32-(b))))
#define CH(x,y,z) (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define EP0(x) (ROTRIGHT(x,2) ^ ROTRIGHT(x,13) ^ ROTRIGHT(x,22))
#define EP1(x) (ROTRIGHT(x,6) ^ ROTRIGHT(x,11) ^ ROTRIGHT(x,25))
#define SIG0(x) (ROTRIGHT(x,7) ^ ROTRIGHT(x,18) ^ ((x) >> 3))
#define SIG1(x) (ROTRIGHT(x,17) ^ ROTRIGHT(x,19) ^ ((x) >> 10))
/**************************** VARIABLES *****************************/
static const unsigned int k[64] = {
	0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
	0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
	0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
	0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
	0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
	0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
	0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
	0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};
/*********************** FUNCTION DEFINITIONS ***********************/
void sha256_transform(SHA256_CTX *ctx, const BYTE data[])
{
	unsigned int a, b, c, d, e, f, g, h, i, j, t1, t2, m[64] = {0};
	for (i = 0, j = 0; i < 16; ++i, j += 4)
		m[i] = (data[j] << 24) | (data[j + 1] << 16) | (data[j + 2] << 8) | (data[j + 3]);
	for ( ; i < 64; ++i)
		m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];
	a = ctx->state[0];
	b = ctx->state[1];
	c = ctx->state[2];
	d = ctx->state[3];
	e = ctx->state[4];
	f = ctx->state[5];
	g = ctx->state[6];
	h = ctx->state[7];
	for (i = 0; i < 64; ++i) {
		t1 = h + EP1(e) + CH(e,f,g) + k[i] + m[i];
		t2 = EP0(a) + MAJ(a,b,c);
		h = g;
		g = f;
		f = e;
		e = d + t1;
		d = c;
		c = b;
		b = a;
		a = t1 + t2;
	}
	ctx->state[0] += a;
	ctx->state[1] += b;
	ctx->state[2] += c;
	ctx->state[3] += d;
	ctx->state[4] += e;
	ctx->state[5] += f;
	ctx->state[6] += g;
	ctx->state[7] += h;
}
void sha256_init(SHA256_CTX *ctx)
{
	ctx->datalen = 0;
	ctx->bitlen = 0;
	ctx->state[0] = 0x6a09e667;
	ctx->state[1] = 0xbb67ae85;
	ctx->state[2] = 0x3c6ef372;
	ctx->state[3] = 0xa54ff53a;
	ctx->state[4] = 0x510e527f;
	ctx->state[5] = 0x9b05688c;
	ctx->state[6] = 0x1f83d9ab;
	ctx->state[7] = 0x5be0cd19;
}
void sha256_update(SHA256_CTX *ctx, const BYTE data[], size_t len)
{
	unsigned int i;
	for (i = 0; i < len; ++i) {
		ctx->data[ctx->datalen] = data[i];
		ctx->datalen++;
		if (ctx->datalen == 64) {
			sha256_transform(ctx, ctx->data);
			ctx->bitlen += 512;
			ctx->datalen = 0;
		}
	}
}
void sha256_final(SHA256_CTX *ctx, BYTE hash[])
{
	unsigned int i;
	i = ctx->datalen;
	// Pad whatever data is left in the buffer.
	if (ctx->datalen < 56) {
		ctx->data[i++] = 0x80;
		while (i < 56)
			ctx->data[i++] = 0x00;
	}
	else {
		ctx->data[i++] = 0x80;
		while (i < 64)
			ctx->data[i++] = 0x00;
		sha256_transform(ctx, ctx->data);
		//memset(ctx->data, 0, 56);
		for(int i = 0; i < 56; i++){
			ctx->data[i] = 0;
		}
	}
	// Append to the padding the total message's length in bits and transform.
	ctx->bitlen += ctx->datalen * 8;
	ctx->data[63] = (BYTE)(ctx->bitlen);
	ctx->data[62] = (BYTE)(ctx->bitlen >> 8);
	ctx->data[61] = (BYTE)(ctx->bitlen >> 16);
	ctx->data[60] = (BYTE)(ctx->bitlen >> 24);
	ctx->data[59] = (BYTE)(ctx->bitlen >> 32);
	ctx->data[58] = (BYTE)(ctx->bitlen >> 40);
	ctx->data[57] = (BYTE)(ctx->bitlen >> 48);
	ctx->data[56] = (BYTE)(ctx->bitlen >> 56);
	sha256_transform(ctx, ctx->data);
	// Since this implementation uses little endian byte ordering and SHA uses big endian,
	// reverse all the bytes when copying the final state to the output hash.
	for (i = 0; i < 4; ++i) {
		hash[i]      = (ctx->state[0] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 4]  = (ctx->state[1] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 8]  = (ctx->state[2] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 12] = (ctx->state[3] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 16] = (ctx->state[4] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 20] = (ctx->state[5] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 24] = (ctx->state[6] >> (24 - i * 8)) & 0x000000ff;
		hash[i + 28] = (ctx->state[7] >> (24 - i * 8)) & 0x000000ff;
	}
}
And the results from different compilers and the OpenSSL library for a single thread:
It is not 400x, of course, but still a 4x improvement.