LabVIEW


Using delete from array in a very large array situation

Solved!

Hello,

     I have been a LabVIEW developer for a number of years, and I am always trying to improve the way I think about things. I am an engineer, but not what you might call a "hard core" programmer. In any event, my goal is always to learn, and that brings me to my question.

       No code to post at the moment; I am just getting ready for a project and want to make sure I start the right way. I will be dealing with some very large binary 1D arrays, probably on the order of 1 GB or so. I will be reading the entire thing in at once with the "Read from Binary File" function. I will then need to scan through all of this data to "clean" some of it out. Apparently there are occasional "glitches" in it, so I will be doing some sort of pattern search.

       I am aware of the memory issues that can arise with very large arrays. For example, never use Build Array inside a loop, because every time the array grows, LabVIEW does not simply append to the array in memory; it creates a brand-new array. Do that too many times with a large array and you will bring your computer to its knees pretty quickly. So my plan was to read in the array and then do my pattern searching inside a loop with shift registers. Basically, I would keep track of my array index and maintain the entire array in the shift register, then delete sections of the main array as needed until the array that remains is "clean".

       I looked around the forums a bit, and the answers seem a bit ambiguous to me. When you use Delete from Array this way, does it start creating duplicate huge arrays in memory? I would think that because you are only shrinking the main array, never growing it, and you are maintaining it in the shift register, LabVIEW should just keep reusing the same memory locations. I could be off the mark here, though. Thanks for helping me with this; I am trying to head off a major problem before I actually have one.

0 Kudos
Message 1 of 9
(3,329 Views)

Hi z,

 

simple suggestion: don't read the whole file at once. Read smaller chunks instead...
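
A minimal text sketch of that chunked approach (written in Python only because a block diagram can't be pasted here; the chunk size, file names, and the process_chunk helper are placeholders, not anything from your actual project):

```python
CHUNK_BYTES = 1_000_000  # placeholder chunk size; tune to your file format

def clean_file(in_path, out_path, process_chunk):
    """Stream a large binary file through a cleaning function chunk by chunk."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_BYTES)
            if not chunk:                        # end of file
                break
            dst.write(process_chunk(chunk))      # keep only the "clean" bytes
```

Only one chunk (plus its cleaned output) is in memory at any time.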

Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
0 Kudos
Message 2 of 9
(3,301 Views)

@Z4CRUZN wrote:

 

       No code to post at the moment; I am just getting ready for a project and want to make sure I start the right way. I will be dealing with some very large binary 1D arrays, probably on the order of 1 GB or so. I will be reading the entire thing in at once with the "Read from Binary File" function. I will then need to scan through all of this data to "clean" some of it out. Apparently there are occasional "glitches" in it, so I will be doing some sort of pattern search.

Arrays are always contiguous in memory, so every delete operation needs to touch all higher elements and move them down, even if the memory is re-used. What is the datatype of the array? How big are the "patterns" you are trying to match?

 

1 GB is a lot! Are you using 64-bit LabVIEW? Can't you do it in a few smaller sections instead?

 


@Z4CRUZN wrote:

 

       I am aware of the memory issues that can arise with very large arrays. For example, never use Build Array inside a loop, because every time the array grows, LabVIEW does not simply append to the array in memory; it creates a brand-new array.


LabVIEW is actually a little smarter than that and will allocate in larger preemptive chunks, so it is not "every time it grows". Still, it is not efficient.

 

In any case, I still would not use Delete from Array. Instead, keep the array in the shift register at a fixed size while keeping track of a "write index" in another shift register. Keep reading at index [i] and write each good element back at the write index (which is always <= i) using Replace Array Subset, and simply don't increment or write whenever a section needs to be deleted. At the very end, after the loop, trim the array to the final size.
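
For illustration only, here is the same read-index/write-index idea written out in Python (a sketch, not LabVIEW; is_good stands in for whatever your glitch test turns out to be, and the NumPy array plays the role of the array held in the shift register):

```python
import numpy as np

def compact_in_place(data, is_good):
    """Overwrite bad elements in place instead of deleting them.

    The array keeps its full size during the loop (like the array in the
    shift register); only after the loop is it trimmed to the final length.
    """
    write = 0                          # next position for a good element
    for read in range(data.size):      # the read index [i]
        if is_good(data[read]):        # placeholder for the pattern/glitch test
            data[write] = data[read]   # "Replace Array Subset" at the write index (<= read)
            write += 1
    return data[:write]                # trim to the final size after the loop
```

Each element is touched at most once, no matter how many sections are dropped.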

 

Still,  code a few alternative versions and do some careful benchmarking.

0 Kudos
Message 3 of 9
(3,300 Views)

Thanks for the responses... Interesting... So it would actually be more efficient to do the following?

1. Create a 1D array (Initialize Array) at the largest size of binary file I am likely to encounter, say 1.5 GB. This gets sent into a loop with a shift register on it.

2. Read in 35,000 bytes at a time (one frame of the data file) and search it for glitches. If the data is good, use Replace Array Subset to write it into the main array at the proper index.

3. Once all of the "good" data has been written into the main array with Replace Array Subset, trim off the array space that was not used.

 

So this sort of operation would be more efficient and take up significantly less memory?

 

As a second option, what if I read in the entire array (like I am doing now), but instead of using Delete from Array, I use Replace Array Subset to pick up the remaining data that has not been analyzed yet, advance it to the proper place in the main array, and put it back down again, effectively overwriting the known bad data with data to be looked at the next time through the loop? (I hope that made sense.)

 

0 Kudos
Message 4 of 9
(3,285 Views)

I am also thinking it might be even better, once the entire file is "cleaned", to save it back to a file. Then the LabVIEW program can just read off the data that it needs, and the entire array never actually needs to be in LabVIEW memory at all...

0 Kudos
Message 5 of 9
(3,280 Views)

@Z4CRUZN wrote:

 

As a second option, what if I read in the entire array (like I am doing now), but instead of using Delete from Array, I use Replace Array Subset to pick up the remaining data that has not been analyzed yet, advance it to the proper place in the main array, and put it back down again, effectively overwriting the known bad data with data to be looked at the next time through the loop? (I hope that made sense.)


Yes, that's what I was describing. This way, each array element only gets touched once. If you use Delete from Array instead, the highest elements get moved X times, once for every delete operation, which is much less efficient unless the number of deletions is very small.

 

Yes, I would definitely do this in chunks. I was asking about the length of the patterns (i.e., the number of elements that need to be inspected to find one; in the best case it's just a threshold and each element can stand on its own). You could probably build a nice pipelined loop where one part reads data, one does the matching, and one writes to the output file, all basically in parallel. Just take care to do correct stitching if a deleted portion spans two sections. One idea would be to read overlapping sections and not process them all the way to the end.
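
One way to picture the overlap idea in text form (a Python sketch under stated assumptions: clean_block is a placeholder that returns the bytes it is sure about plus a short raw tail it wants to see again together with the next chunk, which is where the boundary stitching would live):

```python
def clean_stream(in_path, out_path, clean_block, chunk_bytes=35_000):
    """Carry an undecided tail into the next chunk so a glitch spanning
    two sections is still seen in one piece.

    clean_block(buf, is_last) -> (clean_bytes, undecided_tail); the tail is
    prepended to the next chunk and must be empty when is_last is True.
    """
    carry = b""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_bytes)
            is_last = not chunk
            cleaned, carry = clean_block(carry + chunk, is_last)
            dst.write(cleaned)
            if is_last:
                break
```

Here the read, match, and write steps run in sequence; in LabVIEW they could be separate pipelined loops connected by queues.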

 

How big are the deleted sections and what is the average distance between them?

 

Also remember that the original data might be spaced equally in time (for example), so if you delete sections, you might need to retain extra metadata to maintain the correct timing. If the original position is important, you could replace faulty data with NaN, keeping the size constant.

 

 

0 Kudos
Message 6 of 9
(3,269 Views)

You could consider storing the data as bits packed into bytes, not as Booleans. Each Boolean is stored as a byte, so 1G Booleans take 1 GB of data. It could be 1/8th of that if you store those Booleans as bits inside bytes (or U16, U32, or U64, of course).
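
A quick NumPy illustration of the 8x saving (only to show the packing idea; in LabVIEW the packed words would be built with Boolean-to-integer conversion and bit shifts):

```python
import numpy as np

flags = np.random.rand(1_000_000) < 0.5               # 1M Booleans, one byte each
packed = np.packbits(flags)                           # 8 Booleans per byte -> ~125 kB
restored = np.unpackbits(packed, count=flags.size).astype(bool)
assert np.array_equal(flags, restored)                # the round trip is lossless
```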

 

If you read the data in chunks, this will probably just be a waste of CPU, since memory won't be an issue then.

 

Note that this is most likely (much) less efficient on the CPU. But if memory is a problem, or memory is causing speed performance issues, the trick might help.

 

This also depends on what you want to do with the data. Any operation will be more difficult and slower when this trick is used.

 

EDIT: By the way, deleting with this trick will typically be very inefficient unless you're deleting whole chunks, since all the bits need shifting. Maybe you can layer a mask trick on top of it and only delete an entire byte when its whole mask is false (unused), or, when a manual clean-up is triggered, remove all masked-out bits at once by shifting all the bits through the array.

Message 7 of 9
(3,245 Views)
Solution
Accepted by topic author Z4CRUZN

@Z4CRUZN wrote:

Thanks for the responses... Interesting... So it would actually be more efficient to do the following?

1. Create a 1D array (Initialize Array) at the largest size of binary file I am likely to encounter, say 1.5 GB. This gets sent into a loop with a shift register on it.

2. Read in 35,000 bytes at a time (one frame of the data file) and search it for glitches. If the data is good, use Replace Array Subset to write it into the main array at the proper index.

3. Once all of the "good" data has been written into the main array with Replace Array Subset, trim off the array space that was not used.

So this sort of operation would be more efficient and take up significantly less memory?

As a second option, what if I read in the entire array (like I am doing now), but instead of using Delete from Array, I use Replace Array Subset to pick up the remaining data that has not been analyzed yet, advance it to the proper place in the main array, and put it back down again, effectively overwriting the known bad data with data to be looked at the next time through the loop? (I hope that made sense.)

 


Just for the cleaning part, you could read 1 or 10 frames at a time, clean them, and write them out; no need for big memory at all. 🙂

As for reading it all back, there are some white papers on how to handle large data sets. When reading the cleaned file, you can check the file size and initialize an array accordingly. As 1.5 GB can be hard to allocate contiguously, if you run into that problem you could, for example, allocate 100 MB sections and create an array of clusters of 100 MB arrays.

Also, compressing 32 Booleans into an integer as suggested is a good idea. To index a specific value back out, use Quotient & Remainder with 32: the quotient is the array index and the remainder is the bit shift, then AND with 1. 🙂
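
In text form, the lookup is just this (a hypothetical helper, assuming the packed array holds 32 flags per U32 word):

```python
def get_flag(packed_u32, i):
    """Return Boolean i from an array of U32 words holding 32 flags each."""
    word  = packed_u32[i // 32]    # quotient  -> which word
    shift = i % 32                 # remainder -> bit position inside the word
    return (word >> shift) & 1     # shift down and AND with 1
```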

/Y

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer
0 Kudos
Message 8 of 9
(3,240 Views)

wiebe@CARYA wrote:

You could consider storing the data as bits packed into bytes, not as Booleans. Each Boolean is stored as a byte, so 1G Booleans take 1 GB of data. It could be 1/8th of that if you store those Booleans as bits inside bytes (or U16, U32, or U64, of course).

 


I could not make the connection between "binary array" and "Boolean array". All he said was that he's using "Read from Binary File", which could be any datatype. In computing, everything is "binary" one way or another. 😄

 

In my first reply I asked about the datatype, and this has not been answered. It is also not clear what a "frame" is (35,000 bytes is divisible by 8, so the element could be 1, 2, 4, or 8 bytes of almost any datatype... except complex).

 

Obviously, to help further we need more detailed information, maybe even some sample data and sample patterns. What does the data represent? What is a frame? What is the processing unit? Are entire frames skipped if a pattern is matched, or are partial frames sometimes retained?

Message 9 of 9
(3,217 Views)