


Suggestions for maximizing RT loop performance


I've attached a screenshot of a small portion of a larger RT data acquisition and processing program. This loop processes 1-second blocks of data from an RT FIFO, but very rarely (about once a day) it slows down to the point where it can't keep up with the 1-second data blocks coming from the producer loop (not shown). Normal loop iteration time is around half a second, but when it falls behind, the iteration time slows to almost 2 seconds and doesn't recover.

 

The data coming in for processing is an array of 14 waveforms, 102400 samples each.

 

One of the channels is peeled off early on for tachometer processing. For the remaining analog channels, there is some calibration and light filtering/RMS calculation, then the data is sent off to another RT FIFO. After that, the analog channels are heavily filtered, decimated, and sent to an FFT/unit conversion/integration VI. The results of the FFT are transmitted over the network and then checked for exceedances in a VI connected to the "chimney data" network shared variable. The most time-consuming step of the whole process appears to be the Filter Waveform VI, right after the first RT FIFO write operation.
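To make the data flow easier to follow in text, here's a rough Python/NumPy analogue of one iteration of this loop. The filter designs, gains, and function layout are placeholders I made up for illustration, not the actual subVIs:

import numpy as np
from scipy import signal

FS = 102400                                  # samples per channel per 1-second block

def process_block(block, cal_gains, sos_light, sos_heavy, decim=8):
    """block: (14, FS) array, 1 tach channel plus 13 analog channels."""
    tach = block[0]                          # peeled off for tachometer processing
    analog = block[1:] * cal_gains[:, None]  # per-channel calibration

    # light filtering + RMS; this result goes to the second RT FIFO
    light = signal.sosfilt(sos_light, analog, axis=1)
    rms = np.sqrt(np.mean(light ** 2, axis=1))

    # heavy filtering (the expensive step), then decimation and FFT/unit conversion
    heavy = signal.sosfilt(sos_heavy, analog, axis=1)
    dec = signal.decimate(heavy, decim, axis=1)
    spectra = np.abs(np.fft.rfft(dec, axis=1)) / dec.shape[1]
    return tach, rms, spectra                # spectra get published and checked for exceedances

# example call with made-up filters and unity gains
sos_light = signal.butter(2, 5000, fs=FS, output='sos')
sos_heavy = signal.butter(8, 1000, fs=FS, output='sos')
tach, rms, spectra = process_block(np.random.randn(14, FS), np.ones(13), sos_light, sos_heavy)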

 

I'm looking for suggestions on improving the performance and determinism of this loop. So far, I've only tried turning off debugging and making all the VIs reentrant, inlining the subVIs that support it. Those steps helped, but I'm still seeing the issue come up. I'm willing to try anything, so just point out anything that doesn't look quite right. I'll be glad to provide more details about any of the subVIs if it will help.

 

Message 1 of 8
Solution
Accepted by topic author pjrose

I was able to vastly improve performance by using the In Place Element structure for the filter, calibration, and decimation VIs. Then I replaced every Build Array with a Replace Array Subset. Finally, I pipelined the loop into two distinct stages. Since the RT target only has a single core, I'm not getting a performance boost from multithreading, but it still seemed to work a little better; I'm not sure why.
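For anyone finding this later, here's a rough text-only sketch of the array change (Python-style, just to illustrate the idea, not the actual block diagram):

import numpy as np

N_CH, N_SAMP = 13, 102400

# Before: the equivalent of Build Array inside the loop. Every iteration
# allocates a bigger buffer and copies everything accumulated so far.
def build_array_style(blocks):
    out = np.empty((0, N_SAMP))
    for b in blocks:
        out = np.vstack([out, b[np.newaxis, :]])
    return out

# After: preallocate once and overwrite in place, the equivalent of
# Replace Array Subset (and what the In Place Element structure encourages).
def replace_subset_style(blocks):
    out = np.empty((N_CH, N_SAMP))
    for i, b in enumerate(blocks):
        out[i, :] = b                # no reallocation, no copy of the whole buffer
    return out

blocks = [np.random.randn(N_SAMP) for _ in range(N_CH)]
assert np.array_equal(build_array_style(blocks), replace_subset_style(blocks))

The pipelining change just splits the loop body into two stages with the intermediate result carried in a shift register, so each iteration runs stage 1 on the newest block and stage 2 on the previous block's result.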

Message 2 of 8

Hey pjrose,

 

Thanks for posting on our forums, I hope you're having a good day. I was hoping you could elaborate on your comment "the iteration rate slows down to almost 2 seconds and doesn't recover". How are you quantifying the loop speed of this program? Have you considered using a Timed Sequence structure instead of a While Loop? Both have their benefits, but in your case, if you're interested in measuring the speed of your loop and reaching a more deterministic process, the Timed Sequence may be the better structure. You can find more information about those structures in the "Avoid Jitter" section of the link I placed below.


Have you monitored the CPU and memory usage of your system while this is occurring? I'm wondering what happens to those values as this code progresses.

 

One other recommendation I can make looking at this code has to do with the three For Loops. For the bottom two at least, you can probably unbundle the values from the two different clusters and build the new arrays within a single For Loop; I'm guessing they have the same count value. This may also be possible for the For Loop that builds the waveform t0 array.
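To make that concrete, here's the idea in rough Python terms (the names are invented; in LabVIEW it would just be one For Loop with all of the Unbundle and indexing operations inside it):

# Dummy inputs standing in for the waveform array and the two clusters.
waveforms = [{"t0": i * 1.0}  for i in range(3)]
cluster_1 = [{"a": i * 10}    for i in range(3)]
cluster_2 = [{"b": i * 100}   for i in range(3)]

# Before: three separate loops, each walking the same index range.
t0_arr = [wf["t0"] for wf in waveforms]
a_arr  = [c["a"]   for c in cluster_1]
b_arr  = [c["b"]   for c in cluster_2]

# After: one loop that unbundles all three values per iteration.
t0_arr, a_arr, b_arr = [], [], []
for wf, c1, c2 in zip(waveforms, cluster_1, cluster_2):
    t0_arr.append(wf["t0"])
    a_arr.append(c1["a"])
    b_arr.append(c2["b"])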

 

It also may not hurt to fix the 12x1024 math operation. If the product of these two values never changes, why not wire the product in as a constant? Also, make it a double to avoid the coercion at your array function.

 

Here's a good link on RT programming best practices. I would also review this for other tips. 

http://zone.ni.com/reference/en-XX/help/370622J-01/lvrtbestpractices/rt_portal/

 

I hope this all helps you out!

Tim A.
Message 3 of 8

Thanks for the tips, especially about combining the For Loops; it makes perfect sense and it's easy to do, I don't know why I didn't see that before. I've already gotten rid of the auto-indexed outputs and Reshape Array functions in that area by using Replace Array Subset, so the 12x1024 product is gone, too.

 

How would I go about implementing a Timed Sequence in this case? Will I have to benchmark normal execution times for each subVI, or can I just specify a single deadline for the whole structure?

 

CPU utilization averages around 50%, but it swings between 95% and 2% several times a second. The entire VI uses around 600 MB, fluctuating by about 20 MB, and I have about 500 MB of contiguous memory available.

 

By the way, I'm estimating loop iteration times very roughly; I'll try to put some timing frames inside the loop and get a more accurate handle on how much time it's actually taking. The slowdown is very hard to reproduce, and I can't seem to nail down a specific failure mode.
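Something along these lines is what I have in mind for the timing, sketched in Python terms just to show the bookkeeping (on the target it would be tick-count or frame timing around the stages; the stage functions here are stand-ins):

import time
import numpy as np

def stage_1(block):                    # stand-in for calibration + light filter/RMS
    return np.sqrt(np.mean(block ** 2, axis=1))

def stage_2(rms):                      # stand-in for heavy filter/FFT/publish
    time.sleep(0.01)

iter_times = []
for _ in range(10):                    # a few fake 1-second blocks
    block = np.random.randn(13, 102400)
    t_start = time.perf_counter()
    stage_2(stage_1(block))
    iter_times.append(time.perf_counter() - t_start)

# Watch the maximum, not just the average; the rare 2-second iterations are the problem.
print(f"mean {np.mean(iter_times) * 1000:.1f} ms, max {np.max(iter_times) * 1000:.1f} ms")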

Message 4 of 8

Reading through the article you posted earlier, I noticed that my network published shared variables do not have the RT FIFO enabled. Is it possible that it could be affecting determinism at this relatively slow loop rate?

Message 5 of 8

Hey pjrose, 

 

When it comes to implementing the Timed Sequence structure for your code, you would specify a single deadline for the whole structure. This approach is used when you know the rate at which you want your processes to occur, so you would start with that speed. If you find that your code isn't meeting your timing requirements, then you move on to optimizing the code. Here's some more information regarding benchmarking your code: http://zone.ni.com/reference/en-XX/help/370622J-01/lvrtbestpractices/rt_bp_benchmarking/. This link is also found in the "Get the Benchmarking Right" section of the RT Best Practices Portal I sent you before.
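In rough Python terms, the single-deadline idea looks like the sketch below. On the RT target the timed structure reports late iterations for you, so this is only an analogue of the concept, and the numbers and stand-in function are made up:

import random
import time

DEADLINE_S = 1.0                       # one 1-second block must finish per iteration

def run_iteration():                   # stand-in for the whole processing chain
    time.sleep(random.uniform(0.4, 1.2))

late = 0
for i in range(5):
    t_start = time.perf_counter()
    run_iteration()
    elapsed = time.perf_counter() - t_start
    if elapsed > DEADLINE_S:           # the single deadline check for the whole structure
        late += 1
        print(f"iteration {i} finished late: {elapsed:.2f} s")

print(f"{late} late iterations out of 5")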

 

When it comes to enabling the RT FIFO for your network-published shared variables, I don't think that enabling it would greatly change the determinism of your program. It would, however, definitely help ensure that data isn't lost if the variable is read by multiple subscribers in other VIs. More information on shared variables here: http://zone.ni.com/devzone/cda/tut/p/id/4679

Tim A.
Message 6 of 8

To put some numbers to the improvements mentioned above (namely using pipelining, Replace Array Subset, and In Place Element structures as much as possible):

 

CPU utilization has dropped from 25% to 2%!!!

 

Total iteration time of both loops dropped from 582ms average to 57ms average!!!

 

However, memory utilization went from 50% to 65% (a difference of about 150 MB); I'm not sure what's going on there. It doesn't really matter though, since it stays steady and I still have over 350 MB of contiguous memory available.

Message 7 of 8

Howdy, 

 

That's great news! I'm glad you were able to decrease the CPU usage and the loop time by ten-fold. It does seem that this performance came at the cost of 15% of your memory. As with many things there is always a trade-off, and that's especially true with real-time programs. It's all a matter of knowing which trade-offs are worthwhile for you.

 

Did you also happen to switch your network variables to RT FIFO enabled like you mentioned before? Those will definitely take up more memory. It's also possible that the new array functions you're using are preallocating more array space in memory.

Tim A.
Message 8 of 8