Machine Vision


Parallel Vision Processing?

I'm looking for a definitive answer on a problem I've encountered.

I have an application where I'd like to investigate dramatically speeding up vision processing (currently at 3 Hz!) by using a different acquisition method (which could scale up to 12-15 Hz) and processing the images in parallel.

I wrote a small utility to test the speed-up across multiple parallel threads.  I loaded an array of pictures into memory (about 700 of them) and, after all were loaded, split the array and fed the sub-arrays to a varying number of parallel processing loops.

I started with 2 parallel threads.  My first tests showed no improvement whatsoever.  Then I noticed that the Vision VIs are NOT set to reentrant, so I set them all to reentrant.  After this, I only saw a 30% increase in speed (I was expecting close to 100%).

I tried the same with 4 parallel threads, and things got SLOWER than with one thread (all of this on a dual-core machine).
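LabVIEW is graphical, so here is a rough Python stand-in for the test utility described above, just to make the setup concrete. The dummy workload and all names are my own invention, not the actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def process_image(img):
    """Stand-in for the real analysis (edge detection, filtering, ...)."""
    # Simple gradient-sum placeholder workload on an 8-bit grayscale frame.
    return int(np.abs(np.diff(img.astype(np.int32), axis=1)).sum())

def run_benchmark(images, n_workers):
    """Process every image on n_workers threads; return results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(process_image, images))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A small batch of fake XGA 8-bit frames (the thread uses ~700 real ones)
    batch = [rng.integers(0, 256, (768, 1024), dtype=np.uint8) for _ in range(32)]
    for workers in (1, 2, 4):
        _, elapsed = run_benchmark(batch, workers)
        print(f"{workers} worker(s): {elapsed:.3f} s")
```

Whether extra workers actually help depends on the factors this thread goes on to discuss: core count, cache size, and memory bandwidth.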

I then read some older posts
Here (Problems with reentrant Vision VIs)
Here (Vision being INHERENTLY parallel)
Here (More questions about parallel operation and dual-core support)

I then tried limiting LabVIEW to a single core and running the single-thread variation of my test (thinking that maybe the functions are already utilizing both cores).  I saw no difference whatsoever whether LV was running on a single core or a dual-core machine (simulated).

I now don't know exactly what to think, or what to do.

Is there anyone out there who has some definitive information on the matter?  Is Vision 8.5 any different in this regard?  NI is doing a fair amount of advertising about "multicore" and parallel processing being strengths of LabVIEW, but the one package that should benefit most from this has me puzzled.

Perhaps the problem is the functions I'm using.  I'm fitting a circle (spoke edge detection, not pattern matching or anything as intensive as that) and running a Sobel filter followed by particle analysis.
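For concreteness, here is a rough NumPy stand-in for the Sobel step mentioned above, producing the kind of binary edge image a particle-analysis step would then consume. The function names and threshold value are mine; this is not the IMAQ implementation.

```python
import numpy as np

def sobel_magnitude(img):
    """Sobel gradient magnitude on the image interior (a 1-pixel border is trimmed)."""
    f = img.astype(float)
    # Horizontal response: right column minus left column, weights 1-2-1
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]
          - f[:-2, :-2] - 2 * f[1:-1, :-2] - f[2:, :-2])
    # Vertical response: bottom row minus top row, weights 1-2-1
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]
          - f[:-2, :-2] - 2 * f[:-2, 1:-1] - f[:-2, 2:])
    return np.hypot(gx, gy)

def edge_mask(img, thresh=128.0):
    """Binary edge image, ready for a particle (blob) analysis step."""
    return sobel_magnitude(img) > thresh
```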

Shane.
Using LV 6.1 and 8.2.1 on W2k (SP4) and WXP (SP2)
Message 1 of 13
Hi Shane

Maybe I can answer a subset of your questions. It's not a big surprise to me that you got only a 30% performance improvement with two parallel threads. In image processing algorithms, a lot of the work is traffic to and from RAM, and that's a bottleneck you can't solve with two cores. Moreover, what kind of dual-core processor are you using? We were using both the older Pentium Ds and the newer Core 2 Duos, and we observed a huge parallel performance increase with the Core 2 Duo (our app running approx. 100-120% faster). So not all dual-cores are equivalent.

Anyway, it is a big surprise that using 4 parallel threads makes things slower. From what I know about the multithreading capabilities of LabVIEW 8.2 (I don't know if it's different in 8.5), you only have two physical threads running for each of the so-called execution systems and for each execution priority. Thus, if you run any number of parallel LV threads in just one execution system and priority, you only get two physical parallel threads running. If you want more, you have to assign different execution systems or different priorities to your parallel threads (VIs); you can do that in the VI Properties dialog box. Still, it's a mystery why it's actually slower with 4 threads.

Regarding Vision 8.5, as far as I know there is no internal parallelization of the image processing operators. Yes, there was a lot of hype about multicore support in LV 8.5, but the difference is mainly in the Real-Time module and in some advanced multicore functions, such as the ability to select which core a certain VI runs on. But if you simply run two parallel processing threads, the results should be similar in both LV 8.2 and 8.5.

Lastly, if you want to get from 3 Hz to 12-15 Hz, it looks like the algorithm should be optimized a bit more anyway. Maybe if you post some details, someone will have a brilliant idea for optimization ;-)

Vladimir

Message 2 of 13
Thanks for the answer Vladimir,

I believe I saw your name in one of the threads I referenced.

The algorithm for the vision analysis has already been optimised.  The 3Hz problem comes more from the aspect of capturing the images than the image processing itself.  Still, with a fast enough camera, and a new method of image acquisition it would be possible to increase the picture acquisition rate to well over 30Hz (12-15Hz with existing hardware and algorithm).  I thought I would be able to improve the speed of the analysis by utilising all cores for the picture processing, but it seems I'm wrong.

At the moment, on my machine (2 GHz AMD X2, 1 GB RAM) and with pictures at XGA resolution (8-bit grayscale), I'm able to process a picture every 40 ms or so.  The rest of the time is spent doing other things necessary for the current method of picture acquisition.  This is for a relatively simple case.  The application should remain expandable, meaning there may be more computationally intensive operations in the future.  I was hoping the number of CPU cores would help me work against this.

I still haven't given up on this yet.  This is something which, to put it simply, I expect to be able to run fully in parallel, especially given the two distinct data sets (Different pictures in parallel).

Hmmm.  I'm still hoping for an official word on this from NI.  Re. the internal parallelisation, I wouldn't have come up with that, but one of my previously referenced posts has a reply from an NI employee saying that (at least some) operations are internally parallelised.  I believe this applies to pattern matching and the like, though.  Otherwise I certainly wouldn't have thought so.

Shane.
Message 3 of 13
I did some more testing this evening and found out the following:

Running the code in parallel without setting a single IMAQ VI to reentrant, but setting all of MY VIs to reentrant, I have recorded the following (around 690 pictures in memory):

1 Thread:  19.144 sec
2 Threads: 11.835 sec (62% increase in speed)
3 Threads: 15.613 sec (23% increase in speed)
4 Threads: 19.963 sec (4.3% decrease in speed)
5 Threads: 21.185 sec (11% decrease in speed)
6 Threads: 20.753 sec (8.4% decrease in speed)

As you see, I've actually improved on the previous data by NOT setting the actual IMAQ VIs to reentrant, but all of MY calling VIs.  This apparently gives the threading enough granularity to make some impressive improvements.

What I do NOT understand is the subsequent decrease in speed for each and every thread added to the equation.  And yes, I've run them out of sequence and in reverse sequence to rule out any memory leak / other influences.  The numbers vary by about 0.1 to 0.2 seconds, but they're relatively stable.

Anyone got any idea what's going on?  I'm using LV 8.2.1 with Vision 8.2.1.94 (from MAX).

I gotta get a quad-core machine to do more testing :P

Shane.
Message 4 of 13

My intuition tells me that LabVIEW is switching between the threads and trying to process multiple images simultaneously on each core.  As you are switching between the images, different blocks of memory have to be loaded for each image.  Since moving blocks of memory is fairly slow, the memory management is dragging you down.

Setting the number of threads equal to the number of cores gives you the ideal results.  You don't have thread starvation (1 thread) or excessive memory cycling (>2 threads).

This is actually a good example where LabVIEW's automatic multithreading can slow you down.  Normally, when dealing with very small pieces of data, it isn't an issue that LabVIEW is jumping all over the place and processing whatever is ready next.
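One way to act on this observation (a hypothetical sketch, not anything LabVIEW does for you): hand each worker one contiguous chunk of the image array, so a worker keeps reusing its own data instead of the scheduler interleaving images between cores.

```python
def split_contiguous(items, n_chunks):
    """Split items into n_chunks contiguous slices whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)   # front chunks absorb the remainder
        chunks.append(items[start:start + size])
        start += size
    return chunks

# e.g. 690 images on a dual-core: two chunks of 345, one per worker
```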

Bruce

Bruce Ammons
Ammons Engineering
Message 5 of 13
Hi Bruce,

thanks for the explanation.

I didn't think memory management on a machine which theoretically has over 6 GB/s memory throughput would kick in so quickly.  The pictures are 1024x768 at 8 bits, so not even a megabyte of raw data each.  In the worst case, I'm registering a delay of over 5 ms per image (22 ms compared to 17 ms) when moving from two threads to three.  That just seems like an awful lot to me.  Doing the numbers on the theoretical memory throughput of my machine gives a transfer of 30 MB in those 5 ms; factoring in overhead and thread swaps might just do it.  Hmm.
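Spelling out the back-of-the-envelope numbers above (all figures as quoted in this thread):

```python
image_bytes = 1024 * 768          # one 8-bit XGA frame
print(image_bytes / 1024)         # 768.0 KB -- "not even a megabyte"

bandwidth = 6e9                   # ~6 GB/s theoretical RAM throughput
delay = 0.005                     # the ~5 ms extra per-image delay observed
print(bandwidth * delay / 1e6)    # 30.0 MB movable in that window

# Roughly how many full frames that window could shuttle through RAM
print(bandwidth * delay / image_bytes)   # ~38 frames' worth of raw data
```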

I do remember a post mentioning that the allocation of many IMAQ images doesn't scale linearly with the number of images; there is some internal overhead when dealing with a large number of them.  I might try running the same test on smaller numbers of images to see if the percentage changes shift dramatically.  I'll try underclocking my RAM too, to see if the numbers change then.  That would pretty much be a certain indicator.

Thanks, even though it's not strictly an official NI answer, it's good enough for me at the moment.  Certainly seems plausible.

Thanks again Bruce,

Shane.
Message 6 of 13
Hi guys

I don't quite understand what Bruce means by "different blocks of memory have to be loaded for each image". I don't think the image memory blocks are actually moved around RAM when switching thread contexts. There is just the loading of memory blocks into the processor's cache, but I wouldn't say that can cause such a dramatic performance decrease. Well, maybe...

Actually, I've already seen a similar weird behavior when setting both IMAQ VIs and my processing VIs to reentrant. My application experienced some strange slowdowns when I dynamically opened the same processing VI in two instances and ran them in parallel (with two different input images of course). When I did my own parallelization (dividing my processing VI into two separate parallel parts doing different actions), the performance was much better. It was quite a pain for me, because you are never able to load balance those two separate parts quite well. Running the same code in two instances should IMHO result in much better load balancing.

Vladimir

Message 7 of 13

I was talking about loading the images from RAM into the CPU cache and moving them back when each analysis step is complete.  If you are alternating between two or more images, each image has to be reloaded into the cache every time you use it.  This takes a certain amount of time.  If you have several analysis steps per image, the images end up getting swapped at the end of each step, resulting in several swaps per image.

Shane, do you have any idea how large your CPU cache is?

Bruce

Message 8 of 13
Bruce,

I've only got 512kB cache per core.

I think I see where this is heading......

Shane.
Message 9 of 13
Shane,
 
Your cache is smaller than a single image, which supports my guess.  I suspect that if you had a larger cache that could hold several images, you would see similar speeds anywhere between 2 and 2*N threads, where N is the number of images that fit in each cache.
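Putting those sizes side by side (figures as quoted in this thread):

```python
cache_bytes = 512 * 1024   # per-core L2 cache mentioned above
image_bytes = 1024 * 768   # one 8-bit XGA frame

print(image_bytes > cache_bytes)   # True: even a single frame overflows the cache
print(image_bytes / cache_bytes)   # 1.5: a frame is half again the cache size
```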
 
Bruce
Message 10 of 13