Fast processing of mixed representation binary files

nemi · ‎12-18-2013

Hi Everyone,

I see multiple ways of tackling this, but I'm looking for the fastest approach because my data set is very large.

The issue:

I have a binary data file holding 2D data.

It encodes 200+ different "columns" that are repeated in time (sampled).

The data contains mixed data representations: a mixture of U8, I8, U16, I16, etc.

They are all regularly repeated in a known file structure (660 bytes per "line").

I'd like to generate the 200+ different 1-D arrays from the file (or a subset of the columns), each using the correct data representation.

I can load the file using Read from Binary File with U8 specified as the data type. I can then redimension to the correct 2D array.

I'm now stuck on the fastest method to process the columns of data (1-2 bytes wide) into 1-D arrays of the correct numeric representation (2x U8 to I16, etc.).

Scanning byte by byte would be very slow.

Any suggestions?

nathand · ‎12-18-2013

I would create a cluster that matches the format of a single line. Use that cluster as the data type input to Read from Binary File, and set the count to as many lines as you have in the file. Then you'll have an array of clusters, one element per line. If you want to extract a single column, loop through the array of clusters and unbundle the desired element.
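A LabVIEW block diagram cannot be shown as text, so here is a rough Python analogue of the cluster idea, offered only as a sketch. It assumes a hypothetical 6-byte, little-endian line layout (U8, I8, U16, I16) standing in for the real 660-byte line; `read_lines` and `column` are illustrative names, not anything from the thread.

```python
import struct

# Hypothetical line layout: U8, I8, U16, I16 (6 bytes), little-endian.
# The real file has 660 bytes per line; this short layout is only an illustration.
LINE = struct.Struct("<BbHh")

def read_lines(data):
    """Return one tuple per line, analogous to an array of clusters."""
    return list(LINE.iter_unpack(data))

def column(lines, index):
    """Pull one field out of every line, analogous to unbundling in a loop."""
    return [line[index] for line in lines]

# Two sample lines built in memory in place of a file read.
sample = LINE.pack(1, -2, 300, -400) + LINE.pack(5, -6, 700, -800)
lines = read_lines(sample)
print(column(lines, 3))
```

The record type is declared once and the reader does the per-field decoding, which is the same division of labour as wiring a cluster into Read from Binary File.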

nemi · ‎12-19-2013

That is an interesting solution. I will give it a go.

One (additional) issue is that the structure of the data could change between sources, so I would ideally like to import the data programmatically. Is there any way to programmatically build a cluster of mixed data types to use as the import data type?

I'm also looking into reading the data in as U8 and converting it to a 2D array of U8, then slicing out 1-, 2-, or 4-byte-wide columns of data corresponding to U8/I8, U16/I16, and U32/I32.

I'm wondering if I can then cast the data to the correct type. Do I flatten the 2D arrays of U8 to string and then convert onwards to 1D arrays of the new data type?

nemi · ‎12-19-2013

I have a working draft solution that can be made programmatic to cope with different (mixed representation) binary file formats:

1) Import the mixed representation binary data as a 1-D U8 array using Read from Binary File.

2) Redimension the 1D U8 array to the correctly sized 2D array representing columns and rows (any data represented by more than 8 bits will now span 2 or more columns).

3) Iterate through the 2D U8 array at the correct column indices, extracting 2D arrays with all rows (samples) and N-wide columns (N=1 for I8/U8, N=2 for I16/U16, etc.).

4) For each extracted array, flatten the data to string (prepend array size = FALSE).

5) Then unflatten the string to the data type you need by supplying an empty 1D array of the correct data type (I8, U8, I16, etc.) and choosing the correct big-/little-endian format for the conversion.

6) The output is a 1D array of data in the correct representation.

I've got some array transposing going on that I think I can eliminate.

I'll try to post some tidy, simplified code soon.
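The steps above can be sketched in Python (used here only because a LabVIEW diagram cannot be shown as text). This is an approximation under stated assumptions, not the VI from the thread: `extract_column` and its parameters are hypothetical names, the sample uses a made-up 4-byte line, and strided slicing plus the `array` module stand in for the 2D slice and the flatten/unflatten pair.

```python
import array
import sys

def extract_column(data, line_size, offset, typecode, big_endian=False):
    """Gather one column's bytes from every line and reinterpret them.

    typecode is an array-module code: 'B' U8, 'b' I8, 'H' U16, 'h' I16.
    """
    out = array.array(typecode)
    width = out.itemsize
    rows = len(data) // line_size
    data = data[:rows * line_size]  # ignore any trailing partial line
    packed = bytearray(rows * width)
    # Each strided slice copies byte k of every sample in one step,
    # so there is no per-byte Python loop over the file.
    for k in range(width):
        packed[k::width] = data[offset + k::line_size]
    out.frombytes(packed)
    # array.array decodes in native byte order; swap if the file differs.
    if big_endian != (sys.byteorder == "big"):
        out.byteswap()
    return out

# Made-up 4-byte line: U8, little-endian I16, one padding byte.
sample = bytes([7, 0xFE, 0xFF, 0, 9, 0x2C, 0x01, 0])
print(list(extract_column(sample, 4, 1, "h")))
print(list(extract_column(sample, 4, 0, "B")))
```

The only loop runs over the bytes within one value (1 or 2 iterations), not over samples, which is the property that makes the flatten/unflatten route attractive for large files.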

Taki1999 · ‎12-19-2013

I don't get why you would flatten to string.

Won't Join Numbers on your N=2 columns work just as well?
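For comparison, the arithmetic behind joining a high byte and a low byte can be sketched in Python. This only illustrates the idea; `join_i16` is a hypothetical helper, and the final sign step stands in for converting the joined U16 to I16.

```python
def join_i16(hi, lo):
    """Combine high-byte and low-byte columns into signed 16-bit values."""
    joined = [(h << 8) | l for h, l in zip(hi, lo)]
    # Reinterpret each unsigned 16-bit result as two's-complement I16.
    return [v - 0x10000 if v & 0x8000 else v for v in joined]

hi = bytes([0xFF, 0x01])  # high bytes of two samples
lo = bytes([0xFE, 0x2C])  # low bytes of the same samples
print(join_i16(hi, lo))
```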

nemi · ‎12-19-2013

Here is an example:

nemi · ‎12-19-2013

>>suggesting to use number join approach

Yes, I just tested that and it also works. Here is a VI code picture for comparison.

I will have to do further testing to see which approach is faster. However, both methods are (probably) much faster than using a FOR loop, so they should be "good enough".

Taki1999 · ‎12-19-2013

Your sequence structures are unnecessary.

I'd bet on Join Number being faster than any string functions.

What sort of mechanism are you going to use for column definitions?

Do you care if 8 bit datatypes get upcast to 16 bit?

nemi · ‎12-19-2013

@Taki1999 wrote:

Your sequence structures are unnecessary.

I'd bet on Join Number being faster than any string functions.

What sort of mechanism are you going to use for column definitions?

Do you care if 8 bit datatypes get upcast to 16 bit?

Agreed, the sequence structures are only present to aid illustration.

I'm not sure Flatten To String is a "true" (slow) string function. I view it more as a container of bytes. I'm going to run some speed tests. The array massaging needed to use the Join Numbers function may be a large overhead.

Column definitions will be sourced from a secondary text file that describes the file structure. The example conversion of 2 x U8 to I16 would be replaced by a FOR loop (over all columns of 1-N bytes) and a case structure (selected by the representation of the column) that processes the data. Ultimately, each 1D array of correctly converted data will be saved to its own binary data file, in an appropriate numeric representation, within the case structure.
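A description-driven import along these lines might look like the Python sketch below. Everything specific in it is an assumption: the thread does not show the description file, so the "name type" line format, the `TYPE_CODES` table, and the function names are hypothetical. A dictionary lookup plays the role of the case structure, and the whole line is decoded at once rather than column by column.

```python
import struct

# Hypothetical description format: one "name type" pair per line, in file order.
TYPE_CODES = {"u8": "B", "i8": "b", "u16": "H", "i16": "h", "u32": "I", "i32": "i"}

def build_line_struct(description, byte_order="<"):
    """Turn the text description into column names and a struct for one line."""
    names, codes = [], []
    for entry in description.splitlines():
        if not entry.strip():
            continue  # skip blank lines
        name, type_name = entry.split()
        names.append(name)
        codes.append(TYPE_CODES[type_name.lower()])
    # "<" or ">" selects standard sizes with no padding between fields.
    return names, struct.Struct(byte_order + "".join(codes))

def split_columns(data, names, line):
    """Decode every line, then transpose rows into one tuple per column."""
    rows = line.iter_unpack(data)
    return dict(zip(names, zip(*rows)))

description = "speed u8\ntemp i16\n"
names, line = build_line_struct(description)
data = struct.pack("<Bh", 7, -2) + struct.pack("<Bh", 9, 300)
columns = split_columns(data, names, line)
print(columns["temp"])
```

Each entry in `columns` could then be written to its own output file, matching the per-column save described above.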

Next I will look at how much FOR loop parallelization I can achieve versus the cost of source array memory copies (again, it is a very big source data file).

Taki1999 · ‎12-19-2013

Sounds like you've got a pretty good handle on it.

Here's how I'd do it, assuming that I'd downcast back to U8 when necessary at a later point.
