07-06-2010 03:24 AM
Hi
I'm having a bit of bother with FOR loop parallelism - in particular, the dataflow model seems to break in a subtle way if parallelism is enabled. The loop outputs "appear" as soon as the loop starts, even though it hasn't done what it needs to do.
Please see attached VI. Originally I was doing some FFTs but I've simplified the question to this.
The top loop is simple enough and does what I'd expect.
The middle loop shows what I struggled with for ages. Why is the time difference zero?
The bottom loop shows one way to fix it - but I don't understand why it fixes it. Surely the middle loop shouldn't "complete" until all the iterations are complete?
Many thanks
John
07-06-2010 03:36 AM - edited 07-06-2010 03:42 AM
Hi John,
you should use proper structures with forced dataflow for correct time measurement - see the attachment...
You have to ensure the correct sequence of program steps, and that is the one legitimate reason for using a sequence structure: read the time immediately before the loop starts and again after the loop finishes. Don't do this in parallel with the loop. To avoid constant folding, you also should not wire constants into the loop - use a Random Number node instead...
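For what it's worth, the ordering described above translates into textual terms roughly like this (a Python sketch of the same idea, not LabVIEW - the function name and sizes are mine):

```python
import time
import random

def benchmark_loop(n=1_000_000):
    # Read the clock strictly BEFORE the loop begins - the equivalent
    # of the first frame of a sequence structure.
    start = time.perf_counter()

    total = 0.0
    for _ in range(n):
        # Use a varying value rather than a constant so the work
        # cannot be folded away by the optimiser.
        total += random.random()

    # Read the clock strictly AFTER the loop has finished.
    end = time.perf_counter()
    return end - start, total

elapsed, _ = benchmark_loop()
print(f"loop took {elapsed:.3f} s")
```

The key point is that both clock reads are forced to happen sequentially around the work, rather than being free to run in parallel with it.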
07-06-2010 04:52 AM
Hi
Thanks for the reply. Really I'm only using the timings to indicate execution order, not for precise timings per se.
Your fix is equivalent to the bottom of the three loops in my example, and yes, it works. However, what I'm getting at is this: why doesn't the array passed out of the loop in my middle example prevent the sequence from executing - and thus the time value from being read - until the loop is complete? In particular, if loop parallelism is disabled, it then does work as I expect.
I suspect the answer is some subtle optimisation that goes on. Attached is another example. In my view, the two time values should be the same (OK, give or take a millisecond), as the flat sequences shouldn't be able to execute until the associated FOR loops have finished and passed their output arrays on. However, in the bottom case the sequence runs right away. In the top case, adding the extra indicator has forced LabVIEW to wait for the loop to complete before running the sequence - which is what wiring the array to the border of the sequence should have forced it to do anyway, in my opinion.
The point is, changing from a simple FOR loop to a parallel one has changed how dataflow works, which seems a bit wrong to me.
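The dataflow being expected here is exactly what most textual parallel-map constructs guarantee: the output array only becomes available downstream once every iteration has finished, regardless of how many workers ran them. A Python sketch of that contract (my own example, not LabVIEW):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def iteration(i):
    # Stand-in for one loop iteration's work.
    time.sleep(0.05)
    return i * i

with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    # map() only hands back the result array once ALL iterations are done,
    # even though they ran in parallel.
    results = list(pool.map(iteration, range(8)))
    end = time.perf_counter()

# Anything consuming `results` necessarily observes a non-zero elapsed time.
print(results, f"{end - start:.3f} s")
```

That is the behaviour the non-parallel FOR loop shows, and what the parallel one appeared to violate.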
cheers
John
07-06-2010 06:22 AM
07-07-2010 03:55 AM
Hi
Maybe it is one of the "wonders" of optimisation. It means one has to be careful when using the parallelism feature.
To be fair the Profile->Find Parallelizable Loops dialog does come up with:
"This For Loop may or may not be safe to parallelize. Warning(s):
- A node in the For Loop may have side effects."
though it doesn't specify which node, or what side effects. Removing the delay (which I only put in to highlight execution order anyway) removes the warning.
Why did I cast the timer to I32 instead of U32? I had some problem with it in the past, but I forget what it was right now.
Ho hum. I'll be more wary of the handy parallel FOR loop feature in future - maybe just do it the hard way!
John
07-24-2010 03:03 AM
In case anybody is reading: I finally got around to sorting this out.
Apparently it's a known bug in LabVIEW 2009, fixed in the next version.
John
07-24-2010 04:03 AM
@camtest wrote:
Your fix is equivalent to the bottom of the three loops in my example
No, it's not. When benchmarking, you must wrap the initial timer read in a sequence frame that executes before the code you want to benchmark. In your examples it is essentially in parallel with the process under test, so it could be called before that code executes, or even after it.
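To see why the parallel timer read is a problem, here is the same race spelled out in Python (a sketch only - thread pool names and the 0.2 s stand-in workload are mine):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def workload():
    time.sleep(0.2)  # stand-in for the code under test
    return time.perf_counter()

# Wrong: the "start" timestamp is taken concurrently with the workload,
# so the scheduler decides whether it fires before, during, or after it.
with ThreadPoolExecutor() as pool:
    start_future = pool.submit(time.perf_counter)
    end = pool.submit(workload).result()
    start = start_future.result()
# (end - start) is now meaningless as a benchmark.

# Right: force the ordering by plain sequential dataflow - the textual
# equivalent of putting the first timer in its own sequence frame.
start = time.perf_counter()
end = workload()
print(f"elapsed: {end - start:.3f} s")
```

In the wrong version the measured interval depends on scheduling luck; in the right version it is guaranteed to bracket the workload.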
02-04-2011 09:29 AM
Back to trying to optimise parallel for loops, and now with LV2010. It's a bit strange. See attached VIs. They're not useful as such; just a much reduced version of what I'm trying to do.
The "outer for parallelism" just runs the parallelism VI and tells how long it took. This is the VI I run each time.
On my machine, if I open both and also both their diagrams, and then run the "outer", it reports around 530ms.
The times in "parallelism" itself do what I'd expect: i.e. time 1 + time 2 = time 3; also time 3 = that reported by the outer.
Now configure iteration parallelism in the "parallelism" VI, with one instance and one worker. Run the outer VI - it's about the same (good!).
Two instances and one worker - run the outer VI - a bit slower? (say, 610ms)
More interestingly, time 1 + time 2 != time 3? If time 1 and time 2 are reported, what more is there to do in the middle frame before moving on to the next frame in the sequence?
Two instances and two workers - slower again (800ms) and it gets slower the more times I run it (up to 1600ms after ten runs). Sometimes it slows down quite quickly, other times it runs fine for a few runs.
Turn off parallelism, run the outer VI once more. Now it takes 1300ms, but at least the times add up once more.
Save to disk, close LabVIEW (which takes an age - maybe 15 seconds). Reload, rerun, and we're back to 600ms again.
The point of all this? I'm trying to optimise some number crunching. However, I'm struggling to separate the difference I make by changing the code from the difference I get just from running the same code several times. Any suggestions on how to get more repeatable results?
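One common way to tame run-to-run drift like this is to discard a few warm-up runs (which pay one-off costs such as allocation and caching) and then report the minimum and median of many timed repeats rather than a single number. A hedged Python sketch of that harness (function names and counts are mine):

```python
import time
import statistics

def measure(fn, *, warmup=3, repeats=10):
    """Time fn over several runs, discarding warm-up runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    # Minimum and median are far less noisy than the mean when
    # the machine drifts between runs.
    return min(samples), statistics.median(samples)

def crunch():
    # Stand-in for the number crunching under test.
    sum(i * i for i in range(100_000))

best, typical = measure(crunch)
print(f"best {best * 1e3:.1f} ms, median {typical * 1e3:.1f} ms")
```

Comparing the best-of-N figure before and after a code change is usually a much more stable signal than comparing single runs.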
thanks
John
Machine is a dual-core Centrino Dell laptop, 4 GB RAM (3 GB accessible), Win7 32-bit, LV2010.
I've seen similar behaviour on another (desktop) machine: Core i7, 4 GB, Win7 32-bit, LV2010.