I was benchmarking a VI to see how much time was saved by enabling loop parallelism. The loop just uses select and multiplication on an array of 2048 elements. To my surprise, enabling parallelism increased the execution time! Please see if you can re-create what I am seeing by changing the loop parallelism in the attached SubVI (Calc Central Wavelength.vi), and changing the number of parallel loops allowed. Then run Test Calc Central Wavelength.vi to see how long it takes. The more loops I allow, the longer it takes! Thank you for any insight.
Solved! Go to Solution.
Can that code really be run in full parallel? Looks to me like the compiler basically has to perform the work sequentially since you are reliant on using the loop indicator and combining the results back in order via the tunnels. Given the number of iterations you are doing (thousands) it's liklely that the context switching over-head is being clearly exposed.
Parallelism requires some overhead in order to split and reassemble the data. etc.. If your loop code does not do much, the parallelism overhead is realtively large and can even slow you down.
Thank you everyone for the suggestions.
Altenbach, yours did indeed run much faster. I started changing my function piece by piece to look like yours, running after every change to see which changes gave the best improvement (it seemed to be the usage of shift registers, which I am puzzled about). After all of the changes my function was still a factor of 2 slower, and I realized it was the "Enable Debugging" checkbox that was to blame. Thank you for your help,
Yes, debuging adds some overhead. Also make sure the subVI front panel is closed when testing. If performance is important, you should also try inlining the subVI (not tested).
One big difference between the code versions is that yours uses significantly more memory, creating a boolean array and two DBL arrays, all with the same size as the input array.
Mine operates in place and exclusively uses scalars that operate in place (shift registers). They only arrays are the inputs.
Sometimes array operations can be faster because they are more suitable for SSE instructions, but here creating all these extra huge data structures just to throw them away a nanosecond later seems pointless. My code has an extremely small memory footprint which is desirable even if there is no speed advantage. For extremely large inputs, your's will run out of memory much quicker because it requires contiguous free memory for all these arrays.
In any case, all code versions are probably fast enough, even on a tiny atom processor. What are your performance requirements?
Thank you, I will try to keep in mind these points about memory. There are no strict requirements, the VI Analyzer just told me that I had missed some opportunities to parallelize "for" loops, and I was surprised that enabling the parallelism slowed it down!
A little late to the party, but something just piqued my attention in your post -
'Allow Debugging' enabled actually forces the loop to execute sequentially, as per LabView. Not sure if this tip was shown in your version, but i suspect this was another reason why parallelism was slower.
Sorry to have woken up a sleeping post.