12-18-2013 02:57 PM
Hi Everyone,
I can see several ways of tackling this, but I'm looking for the fastest approach because my data set is very large.
The issue:
I have a binary data file holding 2D data.
It encodes 200+ different "columns" that are repeated in time (sampled).
The data contains mixed data representations: a mixture of U8, I8, U16, I16, etc.
They are all regularly repeated in a known file structure (660 bytes per "line").
I'd like to generate the 200+ different 1-D arrays from the file (or a subset of the columns), each using the correct data representation.
I can load the file using binary file read with U8 specified as the data type. I can then redimension to the correct 2D array.
I'm now stuck on the fastest method to process the columns of data (1-2 bytes wide) into 1-D arrays of the correct numeric representation (2x U8 to I16, etc.).
Scanning byte by byte would be very slow.
Any suggestions?
12-18-2013 03:11 PM
I would create a cluster that matches the format of a single line. Use that cluster as the data type input to Read from Binary File, and set the count to as many lines as you have in the file. Then you'll have an array of clusters, one element per line. If you want to extract a single column, loop through the array of clusters and unbundle the desired element.
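For readers outside LabVIEW, the same record-per-line idea can be sketched in Python with the standard `struct` module. The 6-byte layout below (U8, I8, U16, I16, big-endian) is a made-up stand-in for the real 660-byte line, and the function names are hypothetical:

```python
import struct

# Hypothetical 6-byte line layout: U8, I8, U16, I16 (big-endian, no padding).
# The real file has 660-byte lines; this is only a small stand-in.
LINE = struct.Struct(">BbHh")

def read_lines(raw):
    """Decode every complete line in raw into a tuple of typed values."""
    usable = len(raw) - len(raw) % LINE.size  # ignore a trailing partial line
    return list(LINE.iter_unpack(raw[:usable]))

def column(lines, index):
    """Pull one field out of every decoded line (like unbundling in a loop)."""
    return [line[index] for line in lines]

# Two sample lines built in memory instead of read from a file.
raw = LINE.pack(1, -2, 300, -400) + LINE.pack(5, -6, 700, -800)
lines = read_lines(raw)
print(column(lines, 3))  # → [-400, -800]
```

Each decoded tuple plays the role of one cluster, and `column` is the loop-and-unbundle step.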
12-19-2013 08:19 AM
That is an interesting solution. I will give it a go.
One (additional) issue is that the structure of the data could change between sources, so I would ideally like to import the data programmatically. Is there any way to programmatically build a cluster of mixed data types to use as the import data type?
I'm also looking into reading the data in as U8, converting it to a 2D array of U8, then slicing out 1-, 2-, or 4-wide columns of data corresponding to U8/I8, U16/I16, U32/I32.
I'm wondering if I can then cast the data to the correct type. Do I flatten the 2D arrays of U8 to string and then onwards to the new data type (1D) arrays?
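This does not answer the LabVIEW question, but as a point of comparison, here is a rough sketch of how a record type can be assembled at run time in a text language. It uses Python's `struct` module; the type-name table and `build_line_struct` are assumptions made for illustration only:

```python
import struct

# Assumed mapping from representation names to struct format codes.
CODES = {"U8": "B", "I8": "b", "U16": "H", "I16": "h", "U32": "I", "I32": "i"}

def build_line_struct(type_names, big_endian=True):
    """Build a record decoder from an ordered list of representation names."""
    prefix = ">" if big_endian else "<"  # explicit byte order, no padding
    return struct.Struct(prefix + "".join(CODES[n] for n in type_names))

layout = ["U8", "I16", "U32"]          # hypothetical column order
line = build_line_struct(layout)
print(line.size)                        # → 7
print(line.unpack(bytes([1, 0xFF, 0xFE, 0, 0, 0, 2])))  # → (1, -2, 2)
```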
12-19-2013 09:19 AM
I have a working draft solution that can be made programmatic to cope with different (mixed representation) binary file formats:
1) Import the mixed representation binary data as a U8 1-D array using binary file read.
2) Redimension the 1D U8 array to the correctly sized 2D array to represent columns and rows (any data represented by >8 bits will now span 2 or more columns).
3) Iterate through the 2D U8 array at the correct column indices, extracting 2D arrays with all rows (samples) and N-width columns (N=1 for I8/U8, N=2 for I16/U16, etc.).
4) For each extracted array, flatten the data to string (prepend array size = FALSE).
5) Then unflatten the string to the data type you need by inputting an empty 1D array of the correct data type (I8, U8, I16, etc.) and choosing the correct big/little endian format for the conversion.
6) The output is a 1D array of the correct representation data.
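The steps above can be sketched in Python as a rough text-language analogue, not a translation of the VI. Gathering a column's bytes and reinterpreting them with `array.frombytes` stands in for flatten/unflatten; `extract_column` and its parameters are hypothetical names:

```python
import array
import sys

def extract_column(raw, line_size, offset, typecode, big_endian=True):
    """Return one column of a fixed-width record file as a typed array.

    typecode is an array module code, e.g. 'B' (U8), 'b' (I8),
    'H' (U16), 'h' (I16). Wider codes have platform-dependent sizes.
    """
    width = array.array(typecode).itemsize
    rows = len(raw) // line_size
    # Steps 2-3: gather this column's bytes from every row, one byte lane
    # at a time, into a contiguous buffer (strided slices, no Python loop
    # over rows).
    packed = bytearray(rows * width)
    for k in range(width):
        packed[k::width] = raw[offset + k : rows * line_size : line_size]
    # Steps 4-5: reinterpret the raw bytes as the target type.
    col = array.array(typecode)
    col.frombytes(packed)
    if width > 1 and big_endian != (sys.byteorder == "big"):
        col.byteswap()  # file byte order differs from this machine's
    return col

# Two 4-byte lines: U8, I16 (big-endian), U8.
raw = bytes([0x01, 0xFF, 0xFE, 0x07,
             0x02, 0x00, 0x10, 0x08])
print(list(extract_column(raw, 4, 1, "h")))  # → [-2, 16]
```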
I've got some array transposing going on that I think I can eliminate.
I'll try and post some tidy simplified code soon.
12-19-2013 09:23 AM
I don't get why you would flatten to string.
Won't Join Numbers on your N=2 columns work just as well?
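As a rough text-language picture of what joining two U8 columns amounts to, here is a minimal Python sketch using shift and OR. The function name and the explicit sign fix-up are assumptions for illustration; in LabVIEW the signed reinterpretation would be a separate conversion step after the join:

```python
def join_u8_pairs(hi_bytes, lo_bytes, signed=False):
    """Combine matching high and low bytes into 16-bit values."""
    out = []
    for hi, lo in zip(hi_bytes, lo_bytes):
        value = (hi << 8) | lo          # high byte in the upper 8 bits
        if signed and value >= 0x8000:  # reinterpret as two's complement I16
            value -= 0x10000
        out.append(value)
    return out

print(join_u8_pairs([0xFF, 0x00], [0xFE, 0x10], signed=True))  # → [-2, 16]
```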
12-19-2013 09:31 AM
Here is an example:
12-19-2013 09:37 AM - edited 12-19-2013 09:38 AM
>>suggesting to use number join approach
Yes, I just tested that and it also works. Here is a VI code picture for comparison.
I will have to do further testing to see which approach is faster. However, both methods are (probably) much faster than using a FOR loop, so they should be "good enough".
12-19-2013 09:45 AM
Your sequence structures are unnecessary.
I'd bet on Join Numbers being faster than any string functions.
What sort of mechanism are you going to use for column definitions?
Do you care if 8 bit datatypes get upcast to 16 bit?
12-19-2013 09:57 AM
Agreed, the sequence structures are only present to aid illustration.
I'm not sure flatten to string is a "true" (slow) string function. I'm viewing it more as a container of bytes. I'm going to run some speed tests. The array massaging that has to go on to use the join function may be a large overhead.
Column definitions will be sourced from a secondary text file that describes the file structure. The example conversion of 2 x U8 to I16 would be replaced by a FOR loop (over all columns of 1-N bytes) and a case structure (one case per column representation) that processes the data. Ultimately, each 1D array of correctly converted data will be saved off to its own binary data file in an appropriate numeric representation within the case structure.
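A minimal Python sketch of that plan, assuming a description file with one "name type" pair per line (that format, the column names, and both function names are invented here, since the real description file isn't shown). Writing each column to its own file is left out to keep the sketch short:

```python
import struct

# Assumed mapping from representation names to struct format codes.
CODES = {"U8": "B", "I8": "b", "U16": "H", "I16": "h", "U32": "I", "I32": "i"}

def parse_layout(text):
    """Turn 'name type' lines into (name, byte offset, struct code) tuples."""
    columns, offset = [], 0
    for entry in text.splitlines():
        if not entry.strip():
            continue
        name, type_name = entry.split()
        code = CODES[type_name]
        columns.append((name, offset, code))
        offset += struct.calcsize(">" + code)
    return columns, offset  # final offset equals the line size in bytes

def decode_columns(raw, columns, line_size, wanted):
    """Decode only the wanted columns, one typed list per column."""
    end = (len(raw) // line_size) * line_size  # skip a trailing partial line
    result = {}
    for name, offset, code in columns:
        if name not in wanted:
            continue
        field = struct.Struct(">" + code)  # big-endian assumed
        result[name] = [field.unpack_from(raw, pos)[0]
                        for pos in range(offset, end, line_size)]
    return result

layout_text = "status U8\ntemp I16\ncount U16\n"
columns, line_size = parse_layout(layout_text)
raw = struct.pack(">BhH", 1, -2, 300) + struct.pack(">BhH", 4, -5, 600)
print(decode_columns(raw, columns, line_size, {"temp", "count"}))
```

The dictionary lookup on the type name plays the role of the case structure, and the outer loop over `columns` is the FOR loop over column definitions.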
Next I will be looking at what FOR loop parallelization I can achieve vs. source array memory copies (again, it is a very big source data file).
12-19-2013 10:03 AM
Sounds like you've got a pretty good handle on it.
Here's how I'd do it, assuming that I'd downcast back to U8 when necessary at a later point.