2016 Advanced User Track


Side-notes, comments and taking it further than intended - TS9524 Code Optimization and Benchmarking

Hi all,

This is in no way a suggestion that TS9524 was not complete on its own.  I found it very interesting and useful!  Rather, this is a topic that, for various reasons, has taken (and continues to take) a lot of my time. I presume that if you are reading this, it may be similar for you, so I figured we could discuss random details and share stories in a thread here.

Lastly, I know the VIs and project provided for the demonstration were just that, for demonstration, and the presenter made no claim that they were 'fully optimized'. In fact, he pointed out that they were just simple examples for illustration purposes.

For the points below, when I talk about 'expensive' operations, I mean in comparison to other simple primitives, not to complex functions. These things only matter if every cycle / nanosecond counts!

I was intrigued by the "Efficient Mean" and enhanced it in a few ways (other than removing the monitoring hooks, can we do more?).  I just wanted to share in case someone is unaware of the following simple optimizations (in no particular order):

1) Division (of DBL at least) is more expensive than multiplication (on PowerPC (cRIO) and on my Intel laptop at least).  If the denominator is constant or can be picked out of a lookup table (LUT), you will save quite a bit if you do billions of divides. If you can tolerate incorrect mean values during the 'warm-up', just use the reciprocal of the "N point average", calculated once outside the loop. If you need every value to be correct from the first sample, you would need either a LUT that latches at the last element (which costs a lot of memory), or you would calculate the reciprocal on every iteration while you warm up, then change to a constant after warm-up.

2) Quotient and remainder (Q/R) is a relatively expensive operation, more so than divide, for example.  Creating an in-line "counter with reset", carried on either a loop shift register or a feedback node, makes a notable difference.

3) ?
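
Since LabVIEW diagrams do not paste well into text, here is a rough Python sketch of what points 1 and 2 look like when combined. It assumes the mean is kept as a running sum over a circular buffer, which may not match the demonstration VI exactly; the name `running_mean` and its parameters are made up for illustration. It only shows the structure; the timing differences discussed in this thread apply to LabVIEW primitives and will not carry over to Python.

```python
def running_mean(samples, n):
    """Yield the n-point running mean (values are invalid during warm-up)."""
    buffer = [0.0] * n       # circular buffer holding the last n samples
    total = 0.0              # running sum of the buffer contents
    index = 0                # write position, the "counter with reset"
    reciprocal = 1.0 / n     # point 1: one divide, done outside the loop

    for x in samples:
        total += x - buffer[index]   # add the new sample, drop the oldest
        buffer[index] = x
        index += 1
        if index == n:               # point 2: compare-and-reset instead of
            index = 0                # Quotient & Remainder on a loop counter
        yield total * reciprocal     # multiply instead of divide


if __name__ == "__main__":
    print(list(running_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 4)))
```

With n = 4 the first three outputs are the 'warm-up' values (the sum is scaled by 1/4 before four samples have arrived), which is the trade-off described in point 1.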

Disabling all the other mean test loops, and adding max- and min- rate tracking to the monitoring loop, on my laptop (without taking any special precautions re. AV, background tasks, etc.) I get the below results:

For N = 100k, with a minimum of 103 million loop iterations:

"original" VI (modified as described above)

Min / Max = 4.79MHz / 5.40MHz

VI with 1 implemented (using Q/R to index, using constant reciprocal - invalid until N points accumulated)

Min / Max = 5.41MHz / 5.73MHz

VI with 1 and 2 implemented (using "counter with reset" to index and constant reciprocal - invalid until N points accumulated)

Min / Max = 5.87MHz / 6.11MHz

I don't claim that there are no further improvements that can be made, but if we're going to the level of optimizing rates at the nanosecond scale, it is interesting (to me) to see what sort of seemingly trivial things start to add up. In many cases, it is fairly trivial to replace divide with multiply and/or Q/R with divide and a rounding operation or truncate to integer.
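
As a text illustration of the last sentence, a minimal Python sketch of getting quotient and remainder from a divide plus a truncate might look like this. The function name is hypothetical, and whether it is actually faster depends on the platform and the primitives involved. Truncation and floor only agree for non-negative values, so the sketch assumes non-negative x and positive d.

```python
import math


def quotient_remainder_via_divide(x, d):
    """Quotient and remainder from divide + truncate.

    Assumes x >= 0 and d > 0, with values small enough that the
    floating-point divide is exact enough to truncate correctly.
    """
    q = math.trunc(x / d)   # divide, then truncate to integer
    r = x - q * d           # recover the remainder with a multiply and subtract
    return q, r


if __name__ == "__main__":
    print(quotient_remainder_via_divide(23, 7))
```

For a circular buffer index only the remainder is needed, and the "counter with reset" from point 2 avoids the divide altogether, so this form is mostly useful where the quotient itself is required.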

(I included the two new files for the project. Extract to the ..\02-TS9524BenchmarkArrays\ folder from the presentation zip file.)

Message was edited by: QFang [spelling edit. English is my second language.. more grammar and spelling errors bound to exist, sorry]

Message was edited by: QFang
Edited Title

QFang
-------------
CLD LabVIEW 7.1 to 2016
Message 1 of 3

Thanks for the detailed analysis. I'll have a deeper look at it later. I have intentionally left some doors open for further improvements. Adding more shift registers and subVIs would complicate the code more, potentially slowing down the presentation.

Yes, Q&R is a relatively expensive operation, but it only occurs once per iteration and thus gives only a constant offset, independent of N.

If the warmup period is short, the outside division is probably OK, especially since we have the LED indicator (in the monitor loop) to see if the values are valid yet.

Of course maybe the entire thing could be implemented on FPGA (as long as it fits!), eliminating all bottlenecks.

All great points. Thanks! Keep them coming! (I might add some of them to future versions of the presentation.)

Message 2 of 3

You might want to mention the title of the presentation, for those who have not memorized what "TS9524" is.

Message 3 of 3