cluster array performance penalty

shoneill · ‎02-22-2007

I'm creating this thread as an off-shoot from here.

This concerns the earlier discussion of a big performance penalty when operating on otherwise in-place arrays inside versus outside a cluster:

Ben correctly stated that arrays within a cluster carry a significant overhead and that handling non-clustered arrays is faster. This has been verified.

I have trouble understanding WHY this is.

I've just repeated some of the tests mentioned in the original post, but in LabVIEW 6.1. Here I see that a clustered array with the cluster carried forward, as in:

[Image: "Method 1" Unbundle 1]

is just as fast as handling the naked arrays (a shift register for each array separately with unbundle / bundle OUTSIDE the loop) in LabVIEW 6.1, regardless of whether the second array contains a load of data or not. In LabVIEW 8.20, it's significantly slower. Something changed between 6.1 and 8.20.

But when unbundling BOTH arrays, as in:

[Image: "Method 1" Unbundle 2]

the clustered arrays are still an order of magnitude slower than naked arrays.

Referring to the version working directly with clustered arrays as "Method 1" and non-clustered arrays as "Method 2", here's a table of results I've just measured with the attached program:
"Small" and "Large" refer to the size of the second array element in the clusters being tested. This array is actually not altered during testing. Times are given in milliseconds.

I used the following settings for the "unbundle both" version and 1000 repeats for the "unbundle 1" option:

Final results were scaled to 1000 repeats to allow direct comparisons.

The plot thickens.

Shane.

Message Edited by shoneill on 02-22-2007 05:11 PM

Using LV 6.1 and 8.2.1 on W2k (SP4) and WXP (SP2)

altenbach · ‎02-22-2007

Interesting. I never ran across this because my sanity always prevented me from resizing arrays inside clusters. 😉

What is surprising is the fact that LabVIEW 8.20 claims that the entire inner loops are "folded" (see image).

Message Edited by altenbach on 02-22-2007 08:43 AM

LabVIEW Champion.

shoneill · ‎02-23-2007

My "Sanity" also prevented me from doing this up until recently.

I learned that variable-length elements of a cluster are stored externally (out-sourced, if you wish), meaning that a dimension change of one of these elements should require no copying or moving of the other cluster elements. They should be handled the same as a plain old unclustered element.
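As a rough illustration of that expectation (an analogy only, not a description of LabVIEW's actual memory manager), here is a minimal Python sketch. The `Cluster` class and `grow_first` helper are hypothetical names invented for this example. The record holds references to two separately stored arrays, so growing the first should leave the storage of the second alone.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Each field holds a reference to separately stored array data,
    # loosely analogous to a variable-length element of a cluster.
    first: list = field(default_factory=list)
    second: list = field(default_factory=list)


def grow_first(cluster, extra):
    """Append elements to the first array only."""
    cluster.first.extend(extra)
    return cluster


c = Cluster(first=[0.0] * 4, second=[1.0] * 1000)
second_id_before = id(c.second)
grow_first(c, [2.0, 3.0])

# The second array is the very same object as before the resize.
second_untouched = id(c.second) == second_id_before
print(len(c.first), len(c.second), second_untouched)
```

If LabVIEW behaved like this model, a cluster of arrays would cost no more than separate arrays, which is exactly what the measurements in this thread call into question.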

I then made a comment that this would be a good idea for state machines, whereupon Ben informed me (and showed me) that there is very much a performance penalty. I'm trying to find out why.

I think it would be really advantageous to be able to operate with clusters mixed with fixed- and variable-sized elements without having to worry about dramatic slowdown as is currently the case.

The fact that things have changed since LV 6.1 could be taken as an indicator that either:

The code to do this was unnecessarily complex and was dumped to reduce errors, or
The code was somehow "forgotten" or left out due to time constraints.

Either way, if we're going to suggest a new feature for handling "sometimes uninitialised shift registers" for state machines, I think making the clustered arrays behave as one would think after reading that the elements are not stored in contiguous memory would be the first step in the right direction. Then a single cluster wire for ALL elements passed through a state machine would clean up a lot of code.

Here's an example of mine which, although somewhat pretty, could do with some cleaning up.....

Shane.

Ben · ‎02-23-2007

Hi Shane,

I spent about 2 hours looking at your example and could not figure it all out.

The difference between 6.1 and later versions I cannot address, since I no longer have 6.1 at home.

To get a proper understanding I will have to compare the performance with the "show buffer allocations" display and work up individual tests where we can compare various methods and get some numbers on each variation.

I will try to return to this Q this week-end if my schedule permits.

This is what I can say now.

We are measuring too many variables in your examples.

In the attached 7.1 VI I have moved the indicator updates to outside the time structure. I can not rule out LV attempting to update the GUI for test 1 while test 2 is running.
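The same idea, sketched in Python as an analogy (the `timed` and `replace_elements` helpers are made up for this illustration and are not part of the attached VI): collect the timings first and do all the reporting only after every timed section has finished, so display work cannot overlap a measurement.

```python
import time


def timed(work, *args):
    """Return (result, elapsed_seconds) for a single call to work."""
    start = time.perf_counter()
    result = work(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def replace_elements(n):
    """Fill a preallocated list by overwriting each element in place."""
    data = [0] * n
    for i in range(n):
        data[i] = i
    return data


results = {}
for name, n in (("test 1", 10_000), ("test 2", 20_000)):
    data, elapsed = timed(replace_elements, n)
    results[name] = (len(data), elapsed)  # store only, no display yet

# Reporting happens after all timed sections are complete.
for name, (length, elapsed) in results.items():
    print(f"{name}: {length} elements in {elapsed * 1000:.3f} ms")
```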

After doing that I get these two rather dramatic effects

The red circles mark the buffer allocations.

To continue, I would also like to test the cluster performance using in-place operations. The Build Array and other non-in-place operators are forcing us to measure the amount of time required for LV to allocate larger buffers, and this is blurring our ability to measure the cluster work alone. I don't even know if the inplaceness algorithm is even involved.

Those are my thoughts for now. I'll post more if I run across any other discoveries.

Just as perplexed as you are,

Ben

Message Edited by Ben on 02-23-2007 06:16 AM

Retired Senior Automation Systems Architect with Data Science Automation LabVIEW Champion Knight of NI and Prepper LinkedIn Profile YouTube Channel

shoneill · ‎02-23-2007

Ben,

I'm at it two DAYS now and I still don't know what's going on.

I've taken your new suggestions to heart and made a new version. I've changed the "Append array" to "Replace array subset" and have moved the indicator update outside the timing structure. I also programmatically generate the clusters involved and have put all relevant settings on the front panel.

In this VI I test 6 different methods.

Method 1 is the "Naked array" method shown below.

Method 2 is a Clustered array with one unbundle output wired and a cluster wire passthrough as shown below.

Method 3 is a clustered array with both unbundle outputs wired and a cluster wire passthrough as shown below.

Method 4 is a clustered array with both unbundle outputs wired and NO cluster wire passthrough as shown below.

Method 5 is a clustered array with one NAMED unbundle output wired and a cluster wire passthrough (obligatory) as shown below.

Method 6 is a clustered array with both NAMED unbundle outputs wired and a cluster passthrough (obligatory) as shown below.

Running these with the VI attached (saved under 8.20 for LV 8.20) and the options shown on the diagram, I get the following results:

NOTE: Columns for Method 3 and 4 are switched. The labelling is correct, "Cluster 2 no Thru" is for the clustered array with both unbundles wired and NO wire passthrough!

I think this is correct, but for the life of me, I can't explain why naked arrays are slower than clustered arrays in LV 6.1. I think I might need to move the "unbundle" and "bundle" for this case outside the timing structure. What do you think?

From my results in LV 8.20, it would seem that it makes no difference whether we are working with clustered arrays or bare arrays, just like the LV documentation says. I never thought of the indicator update issue you mentioned earlier. I was stuck in "data-flow" and I thought it would be finished updating before carrying on.

I appreciate your having a look at this.

VI Included (Version difference to the featured picture is because of the comments on the Block diagram).

Shane.

Message Edited by shoneill on 02-23-2007 03:48 PM

Ben · ‎02-23-2007

Hi Shane,

For the naked array case, try wiring the cluster around the For Loop to the Bundle so the same buffer can be re-used.

I also think that if LV 8 sees a constant wired to the Replace, it may fold the code (see Christian's observation). Replace the array elements with the index (just to defeat constant folding).

I am not going to be able to turn my attention to this riddle for quite a while.

My sister-in-law has gone on to meet the "Supreme Wire-Worker" yesterday so my attention will be demanded elsewhere.

Please share what you find,

Your brother in wire,

Ben

PS The answer is probably staring us in the face when we show buffer allocations.

shoneill · ‎02-23-2007

Sorry to hear about your sister-in-law Ben.

I'll have another look, and I look forward to your input whenever you get around to it.

Shane.

PS I have much too little experience interpreting buffer allocations.......

Ben · ‎02-24-2007

Hi Shane,

Attached is a revised version of your “Cluster pointers 8 6.1.vi” saved as 7.1.

The changes I made were:

1) The GUI updates could happen while other tests were running, so I moved them to happen after all testing was done.

2) Added a default case so case “0” does not get special treatment.

3) Remembered the result (data) from each method so that the output buffer work is the same for all. Note: I believe LV will skip transferring data to an output tunnel data buffer, if indexing is not enabled, until the last iteration.

4) Used the index value as the replacement element to prevent constant folding from clouding the measurements.

5) Wired the cluster around on method 1 to tell LV it was OK to re-use the input buffer as our output.

After a few runs I noticed my no-op was taking about as much time as the best of my other methods. This implied that the control logic (selecting which method) and overhead (filling input and output buffers) were dominating the measurements. I tweaked the test parameters to invoke the control logic less often and beat the code we are trying to characterize harder. I saved my defaults (warning: due to method 4, a test run takes forever).
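A minimal Python sketch of that benchmarking approach, offered as an analogy and not as a translation of the VI (all names here, such as `METHODS` and `measure`, are hypothetical): time a no-op method through the same selection logic and buffer setup, then subtract it as a baseline. Fewer outer iterations with more inner work means the selection overhead is invoked less often.

```python
import time


def noop_loop(data, inner):
    # Baseline: passes the buffer through without touching it.
    return data


def replace_loop(data, inner):
    # Overwrite existing elements in place, using the loop index as the
    # value so the work cannot be reduced to a constant.
    for i in range(inner):
        data[i % len(data)] = i
    return data


METHODS = {"noop": noop_loop, "replace": replace_loop}


def measure(name, outer, inner, size=1000):
    """Time buffer setup plus `outer` method selections of `inner` steps each."""
    start = time.perf_counter()
    data = [0] * size  # filling the input buffer is part of the overhead
    for _ in range(outer):
        method = METHODS[name]  # stands in for the case-structure selection
        data = method(data, inner)
    return time.perf_counter() - start


baseline = measure("noop", outer=10, inner=10_000)
total = measure("replace", outer=10, inner=10_000)
net = total - baseline  # rough estimate of the method's own cost
print(f"baseline {baseline * 1e3:.3f} ms, total {total * 1e3:.3f} ms, net {net * 1e3:.3f} ms")
```

If `baseline` is close to `total`, the harness is mostly measuring itself, which is the situation described above.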

This is how I read it.

All methods use an input Buffer “A” and an output buffer “B”. This includes the default method. The default method required about 300 ms on my machine to fill the input buffer and transfer it to the output buffer.

All methods that only required an input and output buffer ran at about the same speed. I suspect that under my default settings, the measured times are indicative of the time required to fill the input buffer and fill the output buffer. To get a better measurement of the time required to execute each method I will have to tweak my measurement parameters again. Since method 4 is so inefficient I will stop using it in my tests. Before I forget about this method, I will venture some guesses about why it is so bad.

The SR is realized by working in the input buffer. Each iteration copies the contents of all of the buffers in “A” to “C” and back to “A” again. No wonder this takes so long!
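To get a feel for the scale of that guess, here is a small Python model (an assumption-laden sketch, not a trace of what LabVIEW really does; `run_in_place` and `run_with_copies` are hypothetical names). It counts the elements copied when every iteration copies buffer "A" to a scratch buffer "C" and back, compared with replacing elements in place.

```python
def run_in_place(buffer, iterations):
    """Replace one element per iteration directly in the buffer."""
    copied = 0
    for i in range(iterations):
        buffer[i % len(buffer)] = i
    return buffer, copied


def run_with_copies(buffer, iterations):
    """Same result, but copy A -> C and C -> A on every iteration."""
    copied = 0
    for i in range(iterations):
        scratch = list(buffer)  # A -> C
        copied += len(scratch)
        scratch[i % len(scratch)] = i
        buffer[:] = scratch  # C -> A
        copied += len(scratch)
    return buffer, copied


size, iterations = 1000, 100
in_place, in_place_copied = run_in_place([0] * size, iterations)
with_copies, with_copies_copied = run_with_copies([0] * size, iterations)

print(in_place == with_copies)  # same final data either way
print(in_place_copied, with_copies_copied)
```

In this model the copy count grows as 2 × array size × iterations, so large arrays and many iterations multiply each other, which would fit the long run times seen for method 4.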

Now for a surprising issue.

Compare your method #2 and #5

And then #3 and #6

The difference appears to be between Unbundle and Unbundle By Name. In the Unbundle By Name case we pick up an extra buffer copy to fill the buffer that is allocated for the SR.

Q : Why isn’t buffer “A” used to support the SR for the “by name” version?

Summary:

Building a cluster with large arrays (method 4) is costly.

Something weird is happening with Unbundle/Bundle By Name.

Further study will be required to measure the performance of the buffer A-B versions.

I’ll post more when I know more.

Ben

Message Edited by Ben on 02-24-2007 12:01 PM

Ben · ‎02-24-2007

And if you delete method 4 and wire the cluster through, you eliminate the need for the output buffer!

And of course the performance jumps due to less buffer copying.

Ben

Message Edited by Ben on 02-24-2007 12:14 PM

shoneill · ‎02-25-2007

Ben,

Thanks for the analysis. I don't have access to LV 8.2 (or any other version for that matter) until I'm back at work on Monday.

I'll have a look at it then and see what I come up with.

I have to say, the problem (or at least the benchmarking program) has changed dramatically since the beginning of this discussion. As long as we get to a sensible answer at the end of it, it's all sweet.

I hadn't thought of a single operation dominating the results due to the loop-case structure. Good catch. I guess this explains the difference in best-case values between 6.1 and 8.20?

As I said, I'll have another look tomorrow.

Thanks again,

Shane.
