FFT multicore processing on several CPU sockets

lesnah · ‎01-22-2015

Hi all,

I started a discussion about the limitations of multicore processing in LabVIEW here. I do some image processing, and it worked very well on an i7-3770K. I have a lot of images with no dependencies between them, so they can all be processed separately, which is why parallelization should work very well. It does on the i7: the processing time depends on the number of cores as expected (time ~ 1/#cores), and the Task Manager shows 100% for every core.
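For anyone who wants to check the time ~ 1/#cores behaviour on their own numbers outside LabVIEW, here is a minimal Python sketch. It is not taken from the attached VI; the function names `scaling_table` and `first_plateau` and the efficiency threshold are made up for illustration, and the input is whatever timings you measure yourself.

```python
def scaling_table(times_by_cores):
    """Return (cores, speedup, efficiency) rows from measured times.

    times_by_cores maps a core count to a measured processing time.
    It must contain an entry for 1 core, which is used as the baseline.
    """
    t1 = times_by_cores[1]
    rows = []
    for n in sorted(times_by_cores):
        speedup = t1 / times_by_cores[n]
        # Efficiency is 1.0 when time scales exactly as 1/#cores.
        rows.append((n, speedup, speedup / n))
    return rows


def first_plateau(times_by_cores, min_efficiency=0.7):
    """Return the smallest core count whose parallel efficiency falls
    below min_efficiency (an arbitrary, assumed threshold), or None
    if the efficiency never drops that low."""
    for n, _speedup, efficiency in scaling_table(times_by_cores):
        if efficiency < min_efficiency:
            return n
    return None
```

Feeding it a dictionary such as `{1: t1, 2: t2, ...}` of measured times gives an efficiency close to 1.0 for every entry when the scaling is ideal, and `first_plateau` then returns `None`.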

Now we wanted to scale that up to a larger system with more cores, so we built a quad-Xeon system to process even faster. The system consists of four E5-4567L v2 CPUs and a lot of RAM. Each CPU has 12 cores, so in total there are 48 physical cores (or 96 logical cores with hyperthreading activated).

Unfortunately it doesn't work. It doesn't really scale and is only as fast as the i7. The problem might be the distribution between the different sockets, but I would like to discuss it in general. Attached is a minimal example of the code (it reduces the image processing to an FFT step). Every single step of the processing is slow, so I think the problem can be discussed well using that simple example.

I do not recommend running that VI if you don't have enough memory, but you can have a look at the graph. It shows how the processing time depends on the number of cores used. As one can see, it scales up to about 12 cores (remember, that's the number of cores on a single CPU) and then it doesn't improve. I really don't understand why adding more cores has no further effect. Or maybe it's too much data for too little calculation?
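One rough way to test the "too much data for too little calculation" idea is to estimate how many floating-point operations the FFT performs per byte of memory traffic. The Python sketch below uses the common 5·N·log2(N) operation-count approximation for a complex FFT of N points. It also assumes, optimistically, that each image is read from and written to main memory exactly once. The function names, the 16 bytes per complex sample, and any bandwidth or per-core throughput figures you plug in are assumptions, not measurements from this system.

```python
import math


def fft2_arithmetic_intensity(n, bytes_per_sample=16):
    """Estimate FLOPs per byte for a 2D complex FFT of an n x n image.

    Assumes about 5 * N * log2(N) FLOPs for N = n * n samples and one
    read plus one write of the whole image (an optimistic lower bound
    on memory traffic). 16 bytes corresponds to a double-precision
    complex sample.
    """
    samples = n * n
    flops = 5 * samples * math.log2(samples)
    bytes_moved = 2 * samples * bytes_per_sample
    return flops / bytes_moved


def max_useful_cores(intensity, mem_bandwidth_gbs, core_gflops):
    """Number of cores at which memory bandwidth, not compute, becomes
    the limit, for an assumed bandwidth (GB/s) and an assumed sustained
    per-core throughput (GFLOP/s)."""
    return mem_bandwidth_gbs * intensity / core_gflops


# Hypothetical figures, for illustration only:
intensity = fft2_arithmetic_intensity(1024)
print(max_useful_cores(intensity, mem_bandwidth_gbs=50.0, core_gflops=20.0))
```

If the estimated core limit comes out well below the number of cores in use, memory traffic rather than computation is a plausible bottleneck. On a multi-socket machine the usable bandwidth also depends on which socket's memory holds the data, which this sketch ignores.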

Has anyone seen similar problems? Can anybody confirm the problems with parallelizing an FFT across several CPUs?

We use LV2014 and have to use Windows Server 2008 (it supports more than two CPU sockets).

GerdW · ‎01-22-2015

Hi Hansel,

can you repeat the test, but now with a parallelized outer FOR loop in the processing state?

Best regards,
GerdW

using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019

lesnah · ‎01-22-2015

I don't see why. The outer loop only steps through all possible numbers of cores and records the processing time for each. It runs the parallel loop 48 times, each time with a different number of parallel instances.

Nevertheless I tested it and it crashed LV. Maybe the "get tick count VI" doesn't work parallelized, or 48*48 = 2304 threads is too much.

GerdW · ‎01-22-2015

Hi Hansel,

sorry, misinterpreted your BD…

Best regards,
GerdW

