LabVIEW

memory manager optimization

Reply to Bob_Schor:

Yes, that is what I was trying to state.  Also, RT FIFOs require (a bit) more CPU time than queues (according to the NI instructor teaching the integrator's embedded course I attended).  Basically, you trade a fixed memory footprint and determinism for slightly higher CPU overhead, compared to a queue with the same data type.
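For anyone who thinks better in a text language, a very loose analog of that trade-off (a Python sketch; the names and capacity are illustrative only, not NI's implementation):

```python
from collections import deque

# Rough analog of an RT FIFO: capacity is fixed up front, so the memory
# footprint is constant and nothing is allocated after construction; that
# is what buys determinism. The cost is a little per-call bookkeeping.
rt_fifo_like = deque(maxlen=256)

# Rough analog of a LabVIEW queue: unbounded, grows (allocates) on demand.
queue_like = deque()

rt_fifo_like.append(0x42)  # never grows past 256 elements
queue_like.append(0x42)    # may allocate, so timing is less predictable
```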


Reply to nathand (and others still reading):

I did some extensive(?) CPU/time benchmarking on 5 cases of (essentially) the same stripped-down function that most of my discussion has revolved around.  The benchmark VIs and an Excel summary table are attached. (All code was run on a cRIO-9014.)

Some of the results might surprise people here.

Test definitions:
1) Basic: no array operation, just the string converted to a (variable-length) U8 array and bundled into the cluster. (This one differs in that the cluster output has a variable-sized U8 message array.)
2) Replace Array Subset, with a dummy control holding a 256-element U8 array as the input array.
3) Replace Array Subset, with a local feedback node carrying a 256-element U8 array.
4) Reshape Array, with a fixed 256-element output.
5) IPE on a dummy control cluster containing a 256-element U8 array.
(A loose text-language analog of these five cases is sketched below.)
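For anyone following along without LabVIEW open, here is a rough Python/numpy analog of the five cases; the names are mine and purely illustrative, the real VIs are in the attachment:

```python
import numpy as np

msg = np.frombuffer(b"some message", dtype=np.uint8)  # string -> U8 array

# 1) Basic: variable-length array bundled straight into the cluster.
cluster1 = {"data": msg.copy(), "len": msg.size}

# 2)/3) Replace Array Subset into a fixed 256-element buffer (fed from a
#       dummy control in test 2, or carried on a feedback node in test 3).
buf = np.zeros(256, dtype=np.uint8)
buf[:msg.size] = msg                       # overwrite in place, no realloc
cluster23 = {"data": buf, "len": msg.size}

# 4) Reshape Array to a fixed 256-element output: in numpy this is a fresh
#    allocation plus a copy each call.
cluster4 = {"data": np.resize(msg, 256), "len": msg.size}

# 5) IPE analog: modify the array that is already inside the cluster.
cluster5 = {"data": np.zeros(256, dtype=np.uint8), "len": 0}
cluster5["data"][:msg.size] = msg          # operate on the existing buffer
cluster5["len"] = msg.size
```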


I ran each test a total of six times: three times with message length 0 and three times with message length 256.
I collected all the results (see the Excel file in the attached zip).

Summarized Benchmark Results (see attached):


The overall/practical winner is Test 5!

Test 5 has practically the same performance for both 0- and 256-character messages. It is about twice as fast as Tests 1, 2, and 3 for 256-character messages. (It is only beaten by Test 1 with a 0-byte message.)

Test 1 is faster for 0-byte messages (only).  It has performance comparable to Tests 2 and 3 for 256-character messages.

Test 4 has the worst performance by a wide margin, and is especially bad for 0(!)-character messages. It is some 4 to 6+ times slower than Test 5.

If I did the benchmarking wrong, feel free to make suggestions.  I got rid of the 'Get RT CPU' call (replaced with a control wired into the test subVI).  I left the 'Get Date/Time' primitive in the test subVI to hopefully trick the compiler into not optimizing away the whole subVI, since each iteration will be subtly different.

Note that in addition to setting the VI properties, I also turned off RT ping and RT CPU broadcast in the project target options.  I tried to prevent constant folding etc., but I'm not sure whether the compiler can still trash the test through some strange optimization case. (For instance, would it know that there is no way for the control inputs to change once the for loop starts, and then, realizing I'm only looking at the final iteration's output, do some hand-wavy magic to speed things up?)
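The same worry exists when micro-benchmarking in text languages, and the usual defenses look roughly like this Python sketch (function and names illustrative only, not the attached VIs):

```python
import time

def benchmark(test_fn, payload, iterations=100_000):
    """Time test_fn while trying to keep the optimizer honest."""
    sink = 0
    start = time.perf_counter()
    for i in range(iterations):
        # Vary the input each iteration (the role the Get Date/Time call
        # plays) so the call cannot be folded into a constant...
        result = test_fn(payload, i)
        # ...and consume every iteration's result so dead-code elimination
        # cannot discard the work. Keeping only the final iteration's
        # output is exactly the loophole worried about above.
        sink ^= len(result)
    return (time.perf_counter() - start) / iterations, sink
```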

nathand wrote:

Let's say you had the front panel of the VI open - then it's obvious that the code has to make a copy of the 256-element array, because the copy that's displayed on the front panel doesn't change.


I agree, but I almost never think about front-panel controls and indicators in that way anymore, because most (90%) of my code is RTEXE code, so there is no front panel that can be seen (in the UI thread) anymore; my thinking is that this prevents the extra copy.  As for the other points, they seem valid to me.

QFang
-------------
CLD LabVIEW 7.1 to 2016
Message 41 of 50
Any chance you can post your VIs as a more standard .zip rather than .7z (and saved for LV 2013 or earlier, if they aren't already)? If so I'll take a look, although maybe not that quickly due to plans this week.
Message 42 of 50

Certainly!

Attached is a zip with the 2015 VIs and a folder with the 2011 versions.


Note that the benchmark was run on a cRIO-9014 (VxWorks) using LabVIEW RT 2015 and the latest drivers.


Note also that I could have missed a setting or made some stupid mistake in the setup.  Your results may vary.

QFang
-------------
CLD LabVIEW 7.1 to 2016
Message 43 of 50

One change I'd recommend is to make all the subVI inputs required rather than recommended. This is good general practice, and in some cases it improves performance (sorry, can't find the reference in a quick search, but it's on the forum somewhere), although I'm not seeing a change here.


You can get a substantial improvement in all your versions by making the "template" cluster a control, connecting it to a front-panel terminal, and then wiring it through a shift register in the for loop, forcing LabVIEW to reuse the same cluster repeatedly. This shows up in the buffer allocations tool. As far as I can tell, doing this makes the "replace subset from control" version equivalent to the IPE version, which it should be, since they do exactly the same thing. To my surprise, the "basic" version is slower (not sure why; if I have time later I'll investigate), and the Reshape Array version is slower still, as you noted. It might be that Reshape Array is less efficient than I thought. I know Reshape Array makes a copy when the number of dimensions changes (although the buffer allocations tool doesn't show it); it's possible it always does that, regardless of the new size. I didn't modify the feedback node version.
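In text-language terms, the change just hoists the allocation out of the loop so the subVI works in place on the caller's buffer. A rough Python/numpy sketch of the difference (the names are mine, purely illustrative):

```python
import numpy as np

def fill_message(cluster, msg):
    """Analog of the test subVI: writes into the caller's buffer in place."""
    cluster["data"][:msg.size] = msg
    cluster["len"] = msg.size
    return cluster                    # the same object flows back out

msg = np.arange(256, dtype=np.uint8)

# "Template" cluster allocated once, outside the loop: the role the shift
# register plays. Every iteration reuses the same 256 bytes.
cluster = {"data": np.zeros(256, dtype=np.uint8), "len": 0}
for _ in range(100_000):
    cluster = fill_message(cluster, msg)

# Without the shift register, each iteration rebuilds the template,
# paying for a fresh allocation every time around.
for _ in range(100_000):
    fresh = fill_message({"data": np.zeros(256, dtype=np.uint8), "len": 0}, msg)
```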


Here are my changes (LabVIEW 2013); I haven't compiled an Excel sheet as you did.

Message 44 of 50

I've spent a bit more time thinking about this, although I haven't had time to test much. Clearly you're right that the IPE does have some effect where I thought it wouldn't, so my understanding isn't quite right, and I'm sorry for adding to the confusion rather than helping clarify. I did a quick test where I replaced the bundle in the "basic" test with an IPE, and it's not any faster than the standard bundle, so the IPE isn't too magic. I tried moving the message generation into the for loop to see if that makes a difference, but the "basic" version with a simple bundle is still slower than the equivalent "replace array subset" version, and I have no idea what's going on there.

Message 45 of 50

Hi nathand, it just goes to reinforce what Bob_Schor said early on in this discussion: "you have to do the experiment"... 😛 ('Experiment' and 'test'!)  You don't know what you don't know, and so forth... but boy, can it be time-consuming.  (In this case, my justification is that I'm doing some other code changes in this VIP, and those changes will drive some extra work in dependent packages, so I want to take this chance to do some optimizations and tweaks.)

As far as "my" basic version being slower, it is most certainly related to memory allocation. (Sorry, not trying to be a wise guy here!)  As to why it is slower when on a shift register, I wanted to say it's because the size of the message array could change... but since it doesn't change during the test, I'm not sure that holds up. In the other cases, the output size is more clearly constant (to the compiler)... maybe that changes things under the hood?

Your approach of wiring the cluster as an input/output of the 'benchmark' VI itself and carrying it on a shift register in the testing for loop seems like a test optimization more than anything.  I do see a significant speed increase, but I think it might all come from for-loop overhead changes and have nothing to do with the performance of the actual test VI.

I base this on the following test and observations: doing essentially the same thing by putting the test VI cluster on a feedback node inside the test VI (using a dummy control to set the feedback node's default value on first run) shows no (or only very slight) improvement in overall performance. Doesn't that indicate that the performance boost comes from for-loop-level optimization and for-loop interactions, not from the code we are trying to test?

When I checked various IPE variations (internal feedback, controls wired from the VI connector pane into the for loop, with and without a shift register on the for loop, etc.), I found little to no benefit in most cases, and a significant detriment in some (wiring the input/output to tunnels on the for loop is worse than not wiring the input at all).

I also modified my test framework slightly: I turn the RT LED on before the sequence structure and off after it, and in addition, after all timing info has been collected, I compile and write an appending string log to the root of the RT device. This allowed me to run the benchmark as compiled startup code. The results indicate that, from a CPU perspective, a constant (initialized at compile/load) feedback node is about identical in performance to a control (that is not on the connector pane of the subVI). The only (tiny) question in my mind, then, is whether memory space is more static in one case or the other, and I don't think I can determine that with the tools available to me.

As a special-case side note that applies to (compiled) RT code: wouldn't having the 'dummy control' NOT connected to the connector pane allow the compiler more freedom to optimize? After all, in compiled RT code it would know that the control cannot be changed from outside the VI, since there would be no valid references to it from outside, and by its nature as a control it is not constant-folded. I'm not sure if or how 'resetting to default value' would work on each call; I still share your notion on that, and it might prevent ideal usage of that memory block. Which is why I think, performance being near identical, I'll implement the feedback node initialized from a constant at compile time, since it won't need to 'reset' any values on the next call...

...and then some more testing throws things on their head again. Based on multiple runs as an RT EXE, I'm having some difficulty reproducing certain results. Then I tried initializing the feedback node with an empty cluster constant (meaning the constant's message array was empty). This brought another boost, and this time it reproduced across two build operations and four tests. It's time to stop; you can go bat-**bleep** crazy trying to figure out what's going on. In truth, the gain in memory stability and the experience and knowledge gained are valuable, so I'll call it 'good enough' at this point and move on to the next item on my list. 😛

QFang
-------------
CLD LabVIEW 7.1 to 2016
Message 46 of 50

QFang wrote:

Your approach of wiring the cluster as an input/output of the 'benchmark' VI itself and carrying it on a shift register in the testing for loop seems like a test optimization more than anything.  I do see a significant speed increase, but I think it might all come from for-loop overhead changes and have nothing to do with the performance of the actual test VI.

I base this on the following test and observations: doing essentially the same thing by putting the test VI cluster on a feedback node inside the test VI (using a dummy control to set the feedback node's default value on first run) shows no (or only very slight) improvement in overall performance. Doesn't that indicate that the performance boost comes from for-loop-level optimization and for-loop interactions, not from the code we are trying to test?


In my mind, the performance increase from putting a shift register around the for loop isn't a test optimization; I would write actual performance-critical code that way (carry data around in a shift register at the highest hierarchy level possible, to minimize the chances of reallocating it in a subVI). The feedback node probably doesn't work the same way because you're not "done" with the cluster when you save it to the feedback node (when the VI exits), whereas with the shift register, at the end of the loop iteration you essentially tell the compiler, "I'm done with this set of values; you're now free to update it with new values." This would be more apparent if you were chaining together multiple VIs that all acted in sequence on that cluster.
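To make the chaining point concrete, here's a rough Python sketch (illustrative only, not LabVIEW's actual mechanics): each stage mutates the buffer in place and returns the same allocation to the next, so nothing in the chain forces a copy.

```python
import numpy as np

def stage_a(buf):
    buf += 1          # in place: the caller's allocation is reused
    return buf

def stage_b(buf):
    buf *= 2          # still the same allocation, no copy anywhere
    return buf

buf = np.zeros(256, dtype=np.uint8)   # allocated once, at the top level
for _ in range(1_000):
    # The shift-register pattern: when stage_b returns, this iteration is
    # "done" with the values, so the same buffer goes straight back in at
    # the top of the next iteration.
    buf = stage_b(stage_a(buf))
```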

Message 47 of 50

... so I should test whether a while loop with its stop terminal wired to a constant true, or a for loop with N wired to a constant 1, with a shift register, might prove more efficient in my actual code than the feedback node.  Because the test code is a piece of a larger puzzle, that would be the only way to introduce a shift register.  Going back to earlier statements about taking away most often being the better approach, this would then be one of the exceptions.  I have all the test stuff set up... I might as well try it. 🙂  I'll report back after.

QFang
-------------
CLD LabVIEW 7.1 to 2016
Message 48 of 50

@QFang wrote:

... so I should test whether a while loop with its stop terminal wired to a constant true, or a for loop with N wired to a constant 1, with a shift register, might prove more efficient in my actual code than the feedback node.  Because the test code is a piece of a larger puzzle, that would be the only way to introduce a shift register.  Going back to earlier statements about taking away most often being the better approach, this would then be one of the exceptions.  I have all the test stuff set up... I might as well try it. 🙂  I'll report back after.


NO! It's not that the shift register is special versus a feedback node; it's the level of the hierarchy that's important. You want the shift register at the highest level at which that piece of data needs to be available, so you allocate it once and then pass it around. The VI gets to reuse its input as the output (which is what "in place" means). When you store the data within the subVI, in either a feedback node or a shift register, the subVI now has to do its own allocation, because it needs one copy of the data in the feedback node/shift register (for the next iteration) and another copy for the output.
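Roughly, in Python terms (an illustrative sketch, not LabVIEW's actual mechanics):

```python
import numpy as np

_state = np.zeros(256, dtype=np.uint8)  # analog of an internal feedback node

def subvi_with_internal_state(msg):
    """Keeps its own copy of the data, so every call must allocate: the
    stored value (for the next call) and the returned value have to be
    independent."""
    _state[:msg.size] = msg
    return _state.copy()                # fresh allocation on every call

def subvi_in_place(cluster, msg):
    """Caller owns the buffer (shift register at the top level); the input
    is reused as the output and nothing is copied."""
    cluster["data"][:msg.size] = msg
    return cluster
```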


This isn't really about taking away versus adding code on the block diagram; it's about moving the memory allocation up to the highest level. You can see whether a VI is operating in place with the Show Buffer Allocations tool when you look at the calling VI. If you don't see buffers being created at the inputs or outputs, then the subVI can reuse its inputs as outputs without allocating additional memory.

Message 49 of 50

Got it. 

QFang
-------------
CLD LabVIEW 7.1 to 2016
Message 50 of 50