11-25-2014 10:40 AM - edited 11-25-2014 10:41 AM
Personally, I doubt that the difference is in parallelization. I definitely would not want the exponential function to parallelize internally; I typically prefer to parallelize at a much coarser level.
I don't think that LabVIEW uses anything special, probably just the Intel MKL libraries. Maybe the way it is called inside the LabVIEW primitive does not take advantage of SSE when operating on arrays (see, e.g., the speed differences here), while Matlab's does. Just wildly guessing...
I'll try to investigate more....
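The gap that per-element calls leave on the table can be illustrated outside LabVIEW. A minimal Python/NumPy sketch (array size arbitrary; the vectorized path goes through an optimized, often SIMD/MKL-backed kernel — this is an illustration of the concept, not LabVIEW's internals):

```python
import math
import time

import numpy as np

N = 200_000
x = np.random.rand(N)

# Element-by-element exp: one scalar library call per element,
# so the runtime never gets a chance to use SIMD across the array.
t0 = time.perf_counter()
scalar_result = np.array([math.exp(v) for v in x])
t_scalar = time.perf_counter() - t0

# Whole-array exp: a single call into an optimized kernel that
# processes the buffer in bulk (often SSE/AVX- or MKL-backed).
t0 = time.perf_counter()
vector_result = np.exp(x)
t_vector = time.perf_counter() - t0

print(f"scalar loop: {t_scalar*1000:.1f} ms, vectorized: {t_vector*1000:.1f} ms")

# Both paths compute the same values; only the call pattern differs.
assert np.allclose(scalar_result, vector_result)
```

On typical machines the whole-array call wins by an order of magnitude or more, which is the kind of gap being discussed here.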
11-25-2014 11:18 AM
Since my parallel test still wasn't as fast as Matlab, I assume there's something more, probably like the SSE you mention. I'll see if I can get the AMD Core Math Library to work; it sounds cool.
/Y
11-25-2014 06:24 PM
I have tested the parallelism as you suggested. It speeds up the computation to around 500 ms, which is twice as fast as before. The computer I used has 4 cores, but I am not sure why the speed is only doubled.
One question regarding parallelism: what happens if, inside the parallel for loop, I nest another for loop with parallelism enabled? When I tried this configuration, I got around 350 ms. I am asking because I previously tried this configuration in another program and the result was worse than a single for loop. The result is unpredictable. =(
11-25-2014 06:49 PM
You should typically only parallelize one loop, preferably the outermost one. I'll do some testing...
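In text-based terms, "parallelize only the outermost loop" amounts to splitting the outer dimension into a few big chunks, one per worker. A hedged Python sketch (chunk and worker counts are arbitrary, and `process_rows` is a stand-in for the loop body, not anyone's actual code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_rows(chunk):
    # Per-chunk work: stands in for the body of the outer loop.
    return np.exp(chunk)

data = np.random.rand(1000, 1000)

# Parallelize only the outermost level: split by rows into as many
# chunks as workers, so each task is large and the per-task overhead
# (scheduling, synchronization) is paid only a handful of times.
n_workers = 4
chunks = np.array_split(data, n_workers, axis=0)
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parallel_result = np.vstack(list(pool.map(process_rows, chunks)))

# Same values as the plain whole-array call; only scheduling differs.
assert np.allclose(parallel_result, np.exp(data))
```

Nesting a second parallel level inside each task multiplies the scheduling overhead without adding any idle cores to use, which is why it usually doesn't help.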
11-26-2014 03:14 AM
@clonzz wrote:
I have tested the parallelism as you suggested. It speeds up the computation to around 500 ms, which is twice as fast as before. The computer I used has 4 cores, but I am not sure why the speed is only doubled.
One question regarding parallelism: what happens if, inside the parallel for loop, I nest another for loop with parallelism enabled? When I tried this configuration, I got around 350 ms. I am asking because I previously tried this configuration in another program and the result was worse than a single for loop. The result is unpredictable. =(
Setting up a parallel loop has some overhead, and if the work per iteration is very small, it'll be a net loss. That's why it's typically easier and safer to parallelize the outermost loop. It's the same reason it's usually better to use the array versions of functions in LabVIEW than to place a scalar version in a loop, though this function seems to be an exception.
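On the "only doubled on 4 cores" question: if a fraction of the runtime is serial, or memory-bandwidth bound, Amdahl's law caps the achievable speedup. A quick sketch with an illustrative (not measured) parallel fraction:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    # Amdahl's law: the serial part limits the overall speedup,
    # no matter how many cores the parallel part gets.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# If only ~67% of the runtime actually parallelizes (the rest being
# serial work or memory traffic), 4 cores give roughly 2x:
print(round(amdahl_speedup(0.67, 4), 2))  # → 2.01
```

So a 2x result on 4 cores is entirely consistent with a computation that is partly bound by something other than the CPU.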
I'll attach a changed test VI. What I notice is that there seems to be some issue with how the function handles arrays, as the fastest configuration is a double loop with the outer one parallel!
See what results you get.
/Y
11-26-2014 12:33 PM - edited 11-26-2014 12:35 PM
Disabling debugging results in an essentially infinitely fast VI (dead-code removal). You really need an indicator on the data.
Matlab has much lower debugging overhead.
11-26-2014 12:43 PM - edited 11-26-2014 12:44 PM
I was always testing with debugging disabled, but I placed an "Add Array Elements" node with a scalar indicator after the sequence structure to prevent dead-code removal.
The times were basically identical, so that's not it! There is no debugging overhead if the bulk of the time is spent inside a single primitive.
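The same guard applies in any benchmarking setup: make the timed result feed a value that depends on every element, so the computation cannot be eliminated. A Python sketch of the "Add Array Elements into an indicator" pattern (array size arbitrary):

```python
import time

import numpy as np

x = np.random.rand(2_000_000)

t0 = time.perf_counter()
y = np.exp(x)
elapsed = time.perf_counter() - t0

# Consume the result (the moral equivalent of wiring an
# "Add Array Elements" node to an indicator): a checksum that
# depends on every element, so nothing can be optimized away.
sink = y.sum()
print(f"exp over {x.size} elements: {elapsed * 1000:.1f} ms, checksum {sink:.3f}")
```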
There is no reason the LabVIEW function could not approach the speed of the Matlab function unless Matlab uses shortcuts that give lower precision. I still believe it is an SSE issue.
11-26-2014 08:28 PM - edited 11-26-2014 08:29 PM
I just disabled the debugging stuff and tested the program with different configurations. Below are the results.
No loop: ~980 ms
Single loop, no parallelism: ~1210 ms
Double loop, no parallelism: ~1340 ms
Single loop with parallelism: ~460 ms
Double loop, outer parallel, inner normal: ~365 ms
Double loop, outer normal, inner parallel: ~485 ms
Double loop, both parallel: ~400 ms
The timings are almost the same as before even with debugging disabled; it doesn't change much.
p/s: Yamaeda, I could not open the program you attached. I use LabVIEW 2011. =P Anyway, thanks a lot everyone for the effort.
11-27-2014 03:43 AM - edited 11-27-2014 03:45 AM
2011 version attached.
For me, the 2D array directly is ~3x slower than the other alternatives, with dual for loops with the outer one parallel being the fastest. 2011 is actually slightly faster on the 2D array than 2014, but the parallel loops are slightly faster in 2014.
You get the same order of speed, though my parallel loops are slightly faster due to 8 cores. I didn't bother with non-parallel loops. 🙂
/Y
11-27-2014 07:40 AM - edited 11-27-2014 07:40 AM
@Yamaeda wrote:
2011 version attached.
For me, the 2D array directly is ~3x slower than the other alternatives, with dual for loops with the outer one parallel being the fastest. 2011 is actually slightly faster on the 2D array than 2014, but the parallel loops are slightly faster in 2014.
You get the same order of speed, though my parallel loops are slightly faster due to 8 cores. I didn't bother with non-parallel loops. 🙂
/Y
Attached is my result using your program. The ranking seems consistent with before, but I wonder why each configuration took significantly more time than with my program.
Also, this still does not solve the issue of the exponential function executing much more slowly than in Matlab. =(