
CPU update from 4770k to 13900K but not much increase in speed in a data processing VI

Solved!

Does it get any better if you configure your parallel for loops to have more instances?

 

Right click on them (there are two parallelized loops), choose "Configure Iteration Parallelism", and change the number of generated parallel instances to 32 (from 8).

Craig H. | CLA CTA CLED | Applications Engineer | NI Employee 2012-2023
0 Kudos
Message 11 of 39
(668 Views)

For the loop on the left, the optimum parallelization is 4, I guess because this loop runs 4 times. For the loop on the right, I have essentially the same performance if I use 8, 16 or 32 instances (with 4 the performance is lower).

 

 

0 Kudos
Message 12 of 39
(655 Views)

@obarriel wrote:

I just used the same array four times to illustrate the performance. In the real application all five blue arrays are different, but have the same size.

 

Yes, VI debugging was not disabled, but disabling it does not change much because the loops were parallelized, and that already disables debugging.

 

Following some of your suggestions I have reduced the VI by one loop. There is a small increase in performance (but very small).

 

However, I am still stuck at something close to a 2x performance increase between the two CPUs, so I suppose the code is still not programmed efficiently in terms of multicore.


You still have some silly coercions (look at the representation of the "1" diagram constant somewhere in the middle: converting to a U64 waveform (why is it U64???) and then coercing it a femtosecond later to a DBL waveform seems a silly detour, etc.).

You still have not identified the bottlenecks. Don't focus on loop stacks that make little difference.

You need to take timings of

  • the first loop
  • the second loop (incl. the Build Array)
  • the pinv
  • the A x vector

It could be that one of the above steps accounts for 95% of the elapsed time; so far we don't know where to focus our efforts!
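In text form, the measurement is just timestamps around the four stages (in LabVIEW you would do the same by wrapping each stage with Tick Count or High Resolution Relative Seconds in a sequence). A rough C++ sketch of the idea, with the stage bodies left as placeholder comments:

#include <chrono>
#include <cstdio>

// Seconds on a monotonic clock.
static double now_s() {
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

void benchmark_once() {
    double t0 = now_s();
    // ... first loop (build the component waveforms) ...
    double t1 = now_s();
    // ... second loop, including the Build Array ...
    double t2 = now_s();
    // ... pseudoinverse ...
    double t3 = now_s();
    // ... A x vector ...
    double t4 = now_s();
    std::printf("loop1 %.3f s, loop2 %.3f s, pinv %.3f s, AxV %.3f s\n",
                t1 - t0, t2 - t1, t3 - t2, t4 - t3);
}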

 

There is still a lot of memory thrashing. I am sure it could be re-architected so you don't need to insert elements at the beginning of the 1D and 2D arrays. Keep things in place and allocate the final size once!!
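The same point as a generic sketch (hypothetical sizes, not your actual diagram): inserting at index 0 shuffles all existing data on every call, while allocating the final size once and replacing in place touches each element exactly once.

#include <vector>

// Anti-pattern: grow by inserting at the front ("Insert Into Array" at index 0).
std::vector<double> grow_by_front_insert(int n) {
    std::vector<double> a;
    for (int i = 0; i < n; ++i)
        a.insert(a.begin(), double(i));  // shifts everything already stored: ~O(n^2) total
    return a;
}

// Preferred: allocate the final size once and replace elements in place
// ("Initialize Array" + "Replace Array Subset").
std::vector<double> grow_preallocated(int n) {
    std::vector<double> a(n);            // single allocation up front
    for (int i = 0; i < n; ++i)
        a[n - 1 - i] = double(i);        // write in place, no reallocation
    return a;
}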

 

We don't do well with code images. Please attach your actual code once more! You also have race conditions, for example you cannot guarantee that the thread configuration executes before the MASM subVIs are called.

 

Also explain exactly what typical data looks like. I assume that the 1D arrays are somewhat sinusoidal (since you are trying to find the dominant frequency component), but since U64 is nonnegative, there must be a large DC component, right?

 

Where does the data come from and what does it represent? (And why is it U64? I am not aware of any 64-bit digitizers!) What are typical values for the detected components? Can you attach some typical data?

Message 13 of 39
(640 Views)

@obarriel wrote:

For the loop on the right, I have essentially the same performance if I use 8, 16 or 32 instances (with 4 the performance is lower).


Have you tried without parallelization?

0 Kudos
Message 14 of 39
(638 Views)

Thank you very much for the inputs. Tomorrow I will reattach the code with the timings of all 4 main steps (and some of the other suggested changes). I think the pseudoinverse is the one taking the longest, but it is not "winning" over the others by a huge margin. Unfortunately there is no step taking 95% of the time.

 

Yes, in the real system the 1D arrays are not U64.

 

In the real system there is one array (the top one) that is U16 (16-bit digitizer) and somewhat sinusoidal with multiple frequencies, while the other 4 are digital inputs (1 bit) that carry different single frequencies. In this test program I just set them all equal because I thought it was enough as an illustration (my fault! it creates lots of confusion). However, I think that setting all the correct data types will reduce the memory usage but will not really affect the performance of this particular VI much.

 

Disabling all the loop parallelization worsens the performance (but not by much, maybe by a factor of 1.5x). The big improvement in this VI was when I replaced the "standard" pseudoinverse with the pseudoinverse from the MASM toolkit.

0 Kudos
Message 15 of 39
(622 Views)

Hi there,

 

Firstly, as others have said, I would start by removing coercions. If single precision is good enough (and it will help a lot with performance), then make sure you are using it everywhere, including all the constants etc., as there are several places where values are coerced down and then immediately coerced back up to doubles.

 

Working out where the difference is, as Altenbach mentioned, will help. I have a slight theory that the MASM toolkit could be to blame. If I remember correctly, it wraps the Intel MKL library, which has a history of hobbling processors (*cough* AMD) that it doesn't recognise. Given that the last update to the NI toolkit was in 2015, it is possible that it is struggling to get the best out of the processor.

 

LabVIEW also can't always get the best out of the latest processors, for example by not supporting AVX instructions - but both processors support these, so that by itself doesn't explain the big delta between artificial benchmarks and your code.

James Mc
========
CLA and cRIO Fanatic
My writings on LabVIEW Development are at devs.wiresmithtech.com
0 Kudos
Message 16 of 39
(598 Views)

Ah nope - just looked at what it loads and it doesn't appear to load the MKL library so that idea is out. Looks like a custom library built by NI.

 

There probably are faster options than that library now - but I don't see any reason it would be impacting the performance delta between the two processors.

James Mc
========
CLA and cRIO Fanatic
My writings on LabVIEW Development are at devs.wiresmithtech.com
0 Kudos
Message 17 of 39
(593 Views)

Hi

 

I have for many years used LabVIEW with the Vision Development Module, which is all about processing large data blocks efficiently.

 

It is my experience that total performance depends less on the number of cores and more on raw processor clock speed and Lx cache sizes.

 

Why is that so? I guess it boils down to whether a certain image/array processing operation is written to be parallelized, so you are at the mercy of those implementing the algorithms.

 

If you really want performance, you hand-optimize each operation to use the GPU's CUDA libraries, among other options.

 

NI has not written code themselves for anything in this respect for years. Actually they replaced their own math library with the similar Intel library code in LabVIEW 7.1. This was not bad.

 

Writing C++ code with a modern compiler, you have the option to use OpenMP. You insert a pragma before a FOR loop, like:

#pragma omp parallel for

and then your code should magically use all the cores to execute the FOR loop. Often you will be disappointed: the code logic was not written to use more than one processor, so that is what you get.
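For reference, a complete minimal version of that pattern might look like this (hypothetical loop body, just to show where the pragma goes - it only helps when every iteration is independent):

// Build with OpenMP enabled, e.g.  g++ -fopenmp -O2 example.cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> x(n, 1.0), y(n);

    // Each iteration only touches its own element, so OpenMP can split the
    // index range across all available cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i] + 1.0;

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}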

 

= = =

 

It is important to remove all coercion dots from the diagram (using conversion functions). Not that this in itself makes code run faster, but because it shows that you have looked at every case and judged each dot's performance impact.

 

= = =

 

Another performance optimization for continuous data acquisition is using circular buffers. They got forgotten somewhere along the way when the higher-abstraction Queue concept took over.
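As a rough illustration of the concept (a sketch, not any particular NI API): a fixed-size circular buffer allocates once and then overwrites the oldest samples in place, so a continuous acquisition never reallocates.

#include <cstddef>
#include <vector>

class CircularBuffer {
public:
    explicit CircularBuffer(std::size_t capacity) : data_(capacity) {}

    void push(double sample) {
        data_[head_] = sample;                 // write in place, no allocation
        head_ = (head_ + 1) % data_.size();    // wrap around at the end
        if (count_ < data_.size()) ++count_;   // saturate at capacity
    }

    // Oldest-first access to the samples currently held.
    double at(std::size_t i) const {
        std::size_t start = (head_ + data_.size() - count_) % data_.size();
        return data_[(start + i) % data_.size()];
    }

    std::size_t size() const { return count_; }

private:
    std::vector<double> data_;
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};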

 

Regards

0 Kudos
Message 18 of 39
(575 Views)

This won't compare to the AxB magic, but it's still a waste:

 

wiebeCARYA_0-1694170638765.png

And a very simple fix:

wiebeCARYA_1-1694170669812.png

array × (scalar × scalar) will be faster than (array × scalar) × (array × scalar).

 

Also, why not convert to sgl before you build a large array of doubles:

wiebeCARYA_2-1694170786005.png

Again, plain waste and easy fix.

 

Then again, why convert to sgl if the data is converted to dbl anyway:

wiebeCARYA_3-1694170930807.png

Singles will speed things up compared to doubles, but if you keep converting from doubles to singles and back, you'll be a lot better off using doubles.

 

Message 19 of 39
(572 Views)

A cheap boost will be to turn off debugging in your VI.

 

Of course, you'll lose debugging.

 

It will speed things up, but it isn't related to the relative speed between the 2 CPUs.

Message 20 of 39
(569 Views)