
CPU update from 4770k to 13900K but not much increase in speed in a data processing VI

Solved!

Thanks in advance for any help or recommendation,

 

I have updated the computer that I use for processing some large data arrays. Basically, what I have is a VI that does some linear algebra operations on long arrays (around 1,000,000 points each).

 

I was expecting a big boost in performance with the new computer (maybe something like 10 times faster) for this data processing VI, but it turns out it is "only" between 2 and 3 times faster.

 

I attach the VI in question to this message. Ch0 to Ch4 are the long input arrays. I have tried to optimize the calculation as much as possible: I use the Multicore Analysis and Sparse Matrix (MASM) toolkit and I also parallelize the for loops to use all cores. Not sure if I was just expecting too much of an improvement or if I am missing something...

 

Another (partially related) thing that bothers me on both the "old" and "new" computers: when this VI is used standalone, I get CPU usage close to 100% and all cores seem to be in use. However, when I use it inside a queue (this is how I really use it most of the time) to process queued data, the CPU usage remains rather low and not all cores seem to be used, so I am guessing that CPU power is being wasted when the VI is placed in a queue.

 

 

 

Message 1 of 39

I'm not too surprised to hear that using Queues can slow things down. Queues are basically 1D arrays of whatever Queue element you are using, so lots of queues means managing lots of (relatively small) arrays. Where 64-bit processing and fast memory help is in processing large data structures, such as big images and other multi-dimensional items.

 

I wonder if a large DVR structure could be utilized profitably to "roll your own Queue", using the last dimension in the structure as the "Queue pointer". I wasn't really thinking about optimizing for speed, but I am currently working on a LabVIEW-RT project where we are sampling 16 channels of A/D at fairly respectable rates (to me, which means tens of kHz) for moderate periods of time (4-8 seconds). I wasn't thinking that I was creating a "Queue-like" structure with the DVR, but we were sampling in bursts, so we essentially had a 4D DVR passing our data from FPGA to Target memory and eventually to the PC via Network Streams (and on to disk). The DVR certainly got us quick access to memory (to "enqueue" our data), and also got it back for Network Streaming (dequeue) to the Host...
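To make the analogy concrete, here is a rough text sketch of the idea in Python rather than LabVIEW (the channel count, burst length, and capacity are made up for illustration): preallocate one big block once and use an index as the "Queue pointer", instead of letting a Queue manage many small allocations.

```python
import numpy as np

# Hypothetical sizes: 16 channels, bursts of 1000 samples, room for 512 bursts.
N_CH, BURST, CAPACITY = 16, 1000, 512

# One preallocated block plus a write index, analogous to a DVR holding the
# whole structure with the last dimension acting as the "Queue pointer".
buffer = np.empty((CAPACITY, N_CH, BURST), dtype=np.float64)
write_idx = 0

def enqueue(burst):
    """Copy one burst into the preallocated block; no new allocation happens."""
    global write_idx
    buffer[write_idx % CAPACITY] = burst
    write_idx += 1

# Each enqueue reuses memory that already exists; a conventional queue of array
# elements instead allocates (and later frees) one small array per element.
enqueue(np.random.rand(N_CH, BURST))
```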

 

Note that I have not benchmarked this to compare with Queues, partly because I just realized this analogy, and mostly because I'm too swamped with other things.

 

Bob Schor

Message 2 of 39

@obarriel wrote:

I have updated the computer that I use for processing some large data arrays. Basically, what I have is a VI that does some linear algebra operations on long arrays (around 1,000,000 points each).

 


 

So what is your definition of "big"? Are you talking about the 1D size of the various blue input arrays?

 

Did you identify what the slowest element is? (Sorry, I currently don't have the MASM toolkit installed.) Have you, for example, tried NOT parallelizing the loop stack? Can you get accurate timing for each step so we can see where the bottleneck is?

 

Probably won't make a difference, but most of your "To SGL" conversions should go, because everything gets coerced back to DBL a nanosecond later anyway (hopefully the compiler can sort it out 😮). Look at all these coercion dots! That code has the measles. 😄

 

Your "insert into array is just a glorified "built array" and since you are prepending, you probably force a new allocation.

 

That said, a 3x speedup is pretty good.

 

I don't understand your queue comment. Can you explain what you mean?

 

Your new CPU only has 8 performance cores, the rest are slower. Not sure how optimal the scheduling is on such a hybrid.

 

Can you create a simple simulation of typical data and set correct settings for all other controls as default?

Can you point to a website describing the algorithm?

 

Message 3 of 39

Thank you for your inputs!

 

Yes, the blue 1D arrays are the big arrays. Their size is around 1,000,000 elements. I attach a snippet showing how a test could be done.

 

The code itself was optimized some years ago for the "old" processor. Parallelizing the loops made a large improvement in speed. The MASM toolkit also brought a large improvement, especially for the pseudoinverse calculation.

 

I agree that some of the conversions to SGL may be redundant or useless, but some of them speed up the code significantly. If all the "To SGL" conversions are removed, the code becomes slower.
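One likely reason (sketched here in NumPy rather than LabVIEW, with an array size similar to mine) is that SGL data is half the size of DBL, so the same operation moves half as much memory:

```python
import numpy as np, time

n = 1_000_000
a64 = np.random.rand(n)            # DBL: 8 bytes per element
a32 = a64.astype(np.float32)       # SGL: 4 bytes per element

for a in (a64, a32):
    t0 = time.perf_counter()
    for _ in range(200):
        a @ a                      # same dot product, half the memory traffic in SGL
    print(a.dtype, time.perf_counter() - t0)
```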

 

With the old processor (4770K) this VI typically executed in 0.45 s, while with the new processor (13900K) it executes in around 0.20 s, so the speedup is actually closer to 2x than 3x. Considering the benchmarks comparing the two processors, I expected it to be much faster on the new computer, and I was surprised by this only moderate improvement. This is my main concern.

 

The queue comment was maybe not relevant and is slightly off topic. I just noted that when this VI is run continuously the CPU usage gets close to 100%, but when I use it in a queue the CPU usage shown in the task manager is quite a bit lower.

 

 

 

 

 

Message 4 of 39

@obarriel wrote:

I was expecting a big boost in performance with the new computer (maybe something like 10 times faster) for this data processing VI,


Could you please explain on what basis you were expecting a 10x faster computation? Based on a quick search of benchmarks, none of them shows a 10x improvement.

Santhosh
Soliton Technologies

Message 5 of 39

Well, maybe the 10x was a bit optimistic. But, for example, the CPU-Z multithread bench shows 8x faster computation for the 13900K (32 threads) than for the 4770K (8 threads): https://valid.x86.fr/bench/8 . When I run this test on both computers I get values very close to the published ones.

 

But in this LabVIEW VI I only get slightly over a 2x speedup.

 

 

Message 6 of 39

@obarriel wrote:

Well, maybe the 10x was a bit optimistic. But, for example, the CPU-Z multithread bench shows 8x faster computation for the 13900K (32 threads) than for the 4770K (8 threads)

But in this LabVIEW VI I only get slightly over a 2x speedup.


Nothing to do with LabVIEW. A lot to do with the specific problem.

 

This assumes that the problem can be parallelized 100%. For example, you can probably find the upper parallelization limit for the matrix x vector and pseudoinverse operations that you use here. Not all linear algebra can scale linearly with the number of CPU cores; sometimes many steps depend on other steps.
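Amdahl's law gives the upper bound: if only a fraction p of the work runs in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p/n). A quick sketch in Python (the parallel fractions below are hypothetical, not measured from your VI):

```python
# Amdahl's law: upper bound on speedup when a fraction p of the work
# can run in parallel on n cores.
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Hypothetical parallel fractions, not measured from the VI:
for p in (0.5, 0.8, 0.95):
    print(f"p={p}: 8 cores -> {amdahl(p, 8):.1f}x, 32 cores -> {amdahl(p, 32):.1f}x")
```

Notice that unless p is very close to 1, going from 8 cores to 32 barely changes the bound.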

 

For reference, also have a look at my old talk, especially slides 22-23.

 

Always be aware of the parallelization overhead of splitting the problem and reassembling the result compared to the gain due to parallelization. Often, the overhead can be worse than the gain and you actually slow down. Some problems are much more suitable than others. For example, my fitting program scales linearly with the # of cores, even for very high core counts (benchmarks). This is an ideal case because each atomic spectrum computation is independent and expensive, so the parallelization overhead is negligible.
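A quick way to see this overhead effect outside LabVIEW (a Python sketch with deliberately tiny per-item work; the timings are illustrative only):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny(x):
    # Deliberately cheap per-item work, so dispatch and pickling overhead dominates.
    return x * x

if __name__ == "__main__":
    data = list(range(100_000))

    t0 = time.perf_counter()
    serial = [tiny(x) for x in data]
    t_serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(tiny, data))
    t_parallel = time.perf_counter() - t0

    # On most machines the "parallel" version is slower here, because the cost
    # of splitting and reassembling exceeds the work itself.
    print(f"serial: {t_serial:.3f} s, parallel: {t_parallel:.3f} s")
```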

 

In that particular fitting benchmark, the 13900K is almost 9x faster than the 4770K!

 

In any case, even without parallelization at all, I can see quite a few places to significantly improve your current code. There is a lot of glaring memory thrashing due to constant type conversions and array resizing. You have not even disabled debugging! I'll have a more detailed look later...

 

Currently, you are flying blind! As a first step, construct a reliable test harness and try various alternatives. Sometimes even a well-placed "always copy" can speed things up. You need to do detailed timing of all your steps to identify the bottlenecks and see how they scale with input size. As a baseline, replace all MASM VIs with their plain versions to work out the code skeleton, then substitute one MASM VI at a time to see the effect.
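The shape of such a harness, sketched in Python (in LabVIEW you would do the same thing with High Resolution Relative Seconds or tick counts around each step; the step names below are hypothetical):

```python
import time
from contextlib import contextmanager

results = {}

@contextmanager
def timed(label):
    """Accumulate how long one processing step takes."""
    t0 = time.perf_counter()
    yield
    results[label] = results.get(label, 0.0) + time.perf_counter() - t0

# Hypothetical step names; in the VI these would be the pseudoinverse,
# the matrix-vector products, the parallel loops, etc.
with timed("pseudoinverse"):
    pass  # ... step under test ...
with timed("matrix x vector"):
    pass  # ... step under test ...
print(results)
```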

 

Message 7 of 39

Thanks CA, I got halfway through the thread when my magic 8-ball started spinning to ask if debugging was disabled. The debugging hooks really, really hate coercion dots for some reason. I have just seen it enough to avoid combining coercion dots with debugging enabled. I'm sure there is a very reasonable price for the debugger, but Lamborghinis are reasonably priced too (when you can afford them).

 

In this case, with attempts to optimize the algorithms already in place, the performance improvement of the calculations will be swamped by the lack of improvement to the debug hooks.

 

I'd also migrate the die-roll loop in front of the sequence frame. That is a spendy loop there! You CAN parallelize that loop too! (VIA will warn you, but ignore that; you WILL get an array out containing the same values, just maybe not in the exact same order 😀)

 

I can't figure out why you build a 2D array of 4 copies of a 1D array just to calculate the same two values 4 times in the next loop (Rube Goldberg would be proud of that!). Or why you start out with a U64 array that coerces lossily to DBL within the waveform Y[]; a U32 would seem to be better.
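The lossy part is easy to demonstrate in any language: a DBL has a 53-bit mantissa, so U64 values above 2^53 get rounded, while anything in U32 range survives (a quick Python check):

```python
# A DBL (IEEE double) has a 53-bit mantissa, so not every U64 value survives
# the coercion; anything that fits in a U32 does.
big = 2**53 + 1                     # representable as a U64
print(big == int(float(big)))       # False: rounded when coerced to DBL
print(2**31 == int(float(2**31)))   # True: U32-range values stay exact
```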


"Should be" isn't "Is" -Jay
Message 8 of 39
And get that indicator onto the root diagram! Yes, the calculated value is placed on the terminal before the sequence frame exits. In theory, you could put a breakpoint on the sequence output, change the indicator value, and continue to the root diagram, where the changed value is then output to the connector pane. See the "Clear as mud" thread; I'm certain I have that tag, so just borrow my tag cloud from my profile page to find it.

"Should be" isn't "Is" -Jay
Message 9 of 39

I just used the same array four times to illustrate the performance. In the real application all five blue arrays are different, but have the same size.

 

Yes, VI debugging was not disabled, but disabling it does not change much, because the loops were parallelized and that already disables debugging.

 

Following some of your suggestions I have reduced the VI by one loop. There is a small increase in performance (but a very small one).

 

However, I am still stuck at something close to a 2x performance increase between the two CPUs, so I suppose the code is still not programmed efficiently in terms of multicore use.

Message 10 of 39