LabVIEW


Why is RT matrix multiplication so much slower than Windows?

Solved!

Hello,

I ran the attached VI on both a cRIO-9035 and my Windows machine (a Lenovo ThinkPad) running LabVIEW 2020. All it does is find the average timing cost of multiplying a random vector by a random square matrix over 20,000 iterations. With a vector size of 32, the average time on the cRIO was 21.88 microseconds, while on my Windows machine it was 2.04 microseconds. Does anyone know why there's such a huge disparity between the two times, or if there's any way to speed up matrix multiplication on the RT target? The cRIO-9035 has a clock speed of 1.33 GHz while my Windows machine has a clock speed of 2.60 GHz, so certainly it isn't just the clock speeds.

[Attachment: VI.PNG - snippet of the benchmark VI]
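For reference, a rough text-language equivalent of what the benchmark does might look like the naive C sketch below. The 32-element size and 20,000 iterations come from the description above; everything else (including whether the VI regenerates the random data on each iteration) is an assumption, not a transcription of the actual VI.

```c
/* Naive sketch of the benchmark: time 20,000 multiplications of a 32x32
 * matrix with a 32-element vector and report the average cost per multiply.
 * Uses POSIX clock_gettime, which is available on NI Linux RT. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 32
#define ITERATIONS 20000

static void matvec(const double A[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}

int main(void)
{
    static double A[N][N], x[N], y[N];

    /* Fill the matrix and vector with random values, as the VI does. */
    for (int i = 0; i < N; i++) {
        x[i] = (double)rand() / RAND_MAX;
        for (int j = 0; j < N; j++)
            A[i][j] = (double)rand() / RAND_MAX;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < ITERATIONS; k++)
        matvec(A, x, y);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;

    /* Print y[0] so the compiler cannot optimize the work away. */
    printf("average: %.3f us per multiply (y[0]=%g)\n",
           total_us / ITERATIONS, y[0]);
    return 0;
}
```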

Thanks,

Daniel

Message 1 of 8

Hi Daniel,

 


@dncoble wrote:

The cRIO-9035 has a clock speed of 1.33 GHz while my Windows machine has a clock speed of 2.60 GHz, so certainly it isn't just the clock speeds.


Yes, it's not just clock speed!

Which CPU does your Windows computer use? Which kind/type/spec of RAM?

The cRIO uses an old Atom E3825 with just two cores…

Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
Message 2 of 8

Thanks for the response!

 

To answer your question, the Windows machine has an Intel i7-6600U (2 cores), with 16 GB of DDR3 RAM. I expected that the laptop would be faster, but I doubt that hardware could account for a 10x difference in speed.

I'm interested in improving the performance on the real-time system, and it doesn't seem like the Intel Atom E3825 on the cRIO-9035 is performing as fast as I expect it could. So I guess my question is whether there is some library or setting in LabVIEW for Windows that accelerates matrix multiplication and doesn't exist in LabVIEW Real-Time. If so, is there a way to change the real-time target to improve its performance, maybe by installing some module or changing some setting, or is this type of slow-down inherent to the real-time system?
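Whether LabVIEW on Windows benefits from an optimized BLAS (such as Intel MKL) while the RT target does not is exactly the open question here. One way to probe it outside LabVIEW is to time the same 32x32 multiply through a CBLAS routine on both targets and compare it with a naive loop. This is only a sketch: it assumes a CBLAS implementation such as OpenBLAS can be installed or cross-compiled for the NI Linux RT target, which is an assumption, not something confirmed in this thread.

```c
/* Sketch: time the same 32x32 matrix-vector multiply through CBLAS.
 * Link with -lopenblas (or -lcblas). Availability of a CBLAS library
 * on the RT target is an assumption. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

#define N 32
#define ITERATIONS 20000

int main(void)
{
    static double A[N * N], x[N], y[N];
    for (int i = 0; i < N * N; i++) A[i] = (double)rand() / RAND_MAX;
    for (int i = 0; i < N; i++)     x[i] = (double)rand() / RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < ITERATIONS; k++) {
        /* y = 1.0 * A * x + 0.0 * y, row-major, no transpose */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, N, N,
                    1.0, A, N, x, 1, 0.0, y, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("cblas_dgemv average: %.3f us (y[0]=%g)\n",
           total_us / ITERATIONS, y[0]);
    return 0;
}
```

If the CBLAS version is dramatically faster than the naive loop on Windows but not on the cRIO, that would point at the library; if both are similarly slow on the cRIO, it points at the hardware.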

 

Thanks,

Daniel

Message 3 of 8

Atom processors are simpler: they are optimized for low power, and in this case the Atom has only 25% of the cache.

 

Comparing the two shows the huge difference:

 

[Attachment: Screenshot_20221123-204650.png - side-by-side CPU specification comparison]

While I don't have benchmarks for exactly these CPUs, here is my list of benchmarks showing the dramatic differences.

 

 

Message 4 of 8
I wanted to benchmark the performance of my cRIO-9035 by running a 32-element matrix-vector multiply (see the attached picture). Using this code, an average multiply took 22.88 microseconds. However, if the Intel Atom inside the cRIO were running as fast as possible and performing one multiply per clock cycle, at 1.33 GHz a multiply would take 32^2 / 1.33 GHz = 0.77 microseconds. Compared to that bound, the measured result was about 30 times slower (one multiply per 30 clock cycles). I doubt it could ever actually run that fast, but when I ran the same code on my laptop, it was only ~5.6 times slower than the theoretical value computed with the PC's clock speed. I'm wondering what produces this disparity: is there a linear algebra package on LabVIEW for Windows that isn't on LabVIEW Real-Time, or is this slow-down an inherent quality of how NI Linux Real-Time guarantees determinism?
 
This question is a repost for clarity.
[Attachment: f25a76e1-a384-4b2c-8f32-31a0236c75b0.png]
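Restating the back-of-the-envelope arithmetic above as a tiny C calculation, using the 22.88 µs / 1.33 GHz cRIO measurement from this post and the 2.04 µs / 2.60 GHz laptop measurement from the first post (the ~5.6x quoted above presumably comes from a slightly different laptop run):

```c
/* Cycles per multiply-accumulate implied by the measured times quoted in
 * this thread, versus the one-multiply-per-cycle lower bound. */
#include <stdio.h>

int main(void)
{
    const double macs = 32.0 * 32.0;   /* multiply-accumulates per product */

    double crio_cycles = 22.88e-6 * 1.33e9;   /* cRIO-9035 measurement */
    double pc_cycles   = 2.04e-6  * 2.60e9;   /* laptop measurement */

    printf("cRIO:   %.1f cycles per MAC (1-per-cycle bound: %.2f us)\n",
           crio_cycles / macs, macs / 1.33e9 * 1e6);
    printf("laptop: %.1f cycles per MAC (1-per-cycle bound: %.2f us)\n",
           pc_cycles / macs, macs / 2.60e9 * 1e6);
    return 0;
}
```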
Download All
Message 5 of 8

The CPU in the cRIO is a low power Atom chip and quite old (launched 2013). Clock speed is mostly useful for comparing CPUs of the same family or successors to see how instructions per cycle changed.

 

Atom CPUs lack a lot of the Core family CPU features, most notably AVX. If you want fast matrix multiplication, you need vector instructions.
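To illustrate the point about vector instructions: the scalar inner loop of a matrix-vector multiply handles one double per multiply, while AVX2/FMA handles four doubles per instruction. A minimal sketch of the inner dot product written with intrinsics is below; it assumes a CPU and compiler flags (e.g. gcc -mavx2 -mfma) that support these instructions, which the Atom E3825 does not.

```c
/* Sketch: the 32-element dot product at the heart of the matrix-vector
 * multiply, written with AVX2/FMA intrinsics. A Core-class CPU executes
 * four double-precision multiply-adds per instruction here; the Atom
 * E3825 has no AVX at all and must run scalar code instead. */
#include <immintrin.h>

double dot32_avx(const double *restrict a, const double *restrict b)
{
    __m256d acc = _mm256_setzero_pd();
    for (int j = 0; j < 32; j += 4) {
        __m256d va = _mm256_loadu_pd(a + j);
        __m256d vb = _mm256_loadu_pd(b + j);
        acc = _mm256_fmadd_pd(va, vb, acc);   /* acc += va * vb, 4 lanes */
    }
    /* Horizontal sum of the four lanes. */
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```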

 

So until you run your benchmark on a Windows PC with a similar CPU and show that it is faster than the cRIO, I will believe the CPU is to blame.

Message 6 of 8

Did you forget the query with the same title that you posted last week? https://forums.ni.com/t5/LabVIEW/Why-is-RT-matrix-multiplication-so-much-slower-than-Windows/m-p/426...

The facts still are the same!

Rolf Kalbermatter
Message 7 of 8
Solution
Accepted by dncoble

Thankfully somebody combined the duplicate threads.

 

There is nothing magic going on: one of the processors is simply much more powerful. Intel learned the hard way back in the day that clock speed is just a meaningless marketing term. When they came out with the 3 GHz P4, people were wowed by the GHz, but the performance was not that great. Later they came out with the Core architecture, which was significantly more powerful at lower clocks. (Analogy with cars: you cannot say that a 50 cc motorcycle at 12,000 rpm is 3x more powerful than a 7-liter seventies muscle car at 4,000 rpm.)

 

A modern processor takes about 7 CPU cycles for a multiply. But there is much more to it. My benchmarks have a column "S/GHz", which shows the single-core performance normalized to the clock speed.

 

  • An old Intel Atom N450 can do about 3
  • An Intel P4 can do about 5.5
  • Older Intel Core CPUs can do about 12-20
  • Modern Intel and AMD processors can do over 30!
Message 8 of 8