slow conversion of base64 to sgl array and back

dpak · ‎07-04-2013

Hi All,

I am working with a particular file format (mzml) which stores spectral data as base64, in an xml file. I need to convert this base64 encoded spectral data to a sgl array, work on it and then convert the result back into the same base64 format to save the file, so the data can be read by other packages designed to read this data file type. (I didn't develop the file type!)

The problem is, the data is pretty big - commonly over 7M points in the sgl array (and there are at least two of these in each file - an x and y array) - so it's pretty slow to convert, in both directions.

My algorithm is:

Read in [string] of base64. For each letter in [string], convert to integer (using a variant for fast lookup of the base64 alphabet). Convert each integer (the letter's place in base64 alphabet) to binary sextet (using Number to boolean array) and reverse each sextet (I'm guessing this is an endian issue), then replace the appropriate section in a preformed boolean array - [binary].

Reverse the order of every octet in [binary], in place - I'm guessing this is also an endian issue

Take 32bit portions of [binary] and typecast to sgl type, and replace appropriate index in pre-created sgl array - [sgl]
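
For comparison with a text language, the steps above amount to decoding the base64 text to raw bytes and then reinterpreting each group of 4 bytes as one float. Below is a minimal Python sketch, not the VI itself. It assumes the array is stored as uncompressed, little-endian 32-bit floats (an assumption; the file's own metadata should confirm this), and the names `decode_sgl` and `encode_sgl` are purely illustrative.

```python
import base64
import struct


def decode_sgl(b64_text):
    """Decode base64 text into a list of 32-bit floats.

    Assumes uncompressed, little-endian float32 data (an assumption).
    """
    raw = base64.b64decode(b64_text)
    count = len(raw) // 4  # number of complete 4-byte floats
    # "<" = little-endian, "f" = 32-bit float, repeated `count` times
    return list(struct.unpack("<%df" % count, raw[:count * 4]))


def encode_sgl(values):
    """Encode a sequence of floats back into base64 text (same assumptions)."""
    raw = struct.pack("<%df" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    text = encode_sgl([1.0, 2.5, -3.0])
    print(text)
    print(decode_sgl(text))
```

Working on whole bytes means no per-bit or per-octet reversal is needed in this sketch; the byte order is handled by the `<` in the format string.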

I know my code works because I get the right spectrum out of the end - I'd add the code, but it contains lots of parallel loops to force parallelisation, so it's not that pretty! But, it's strangely slow in some parts.

I think the typecasting is the problem step which seems to be very slow. I have seen there can be an issue with using typecasting in long arrays.

Does anyone have a better way of extracting floating point arrays from base64 encoded text (without using external libraries) or who can point me in the direction of good style for doing this in labview?

Hope this is an interesting question!

Thanks for your help,

David

altenbach · ‎07-04-2013

We are not good with words. I think it would be significantly more helpful if you could simply attach your VI, then we also have something to validate a modified solution. 😉

LabVIEW Champion.

johnsold · ‎07-04-2013

David,

That does sound like an interesting challenge.

Can you post the mapping from the xml string to base 64 alphabet and a few typical strings along with the base 64 and base 10 equivalents? No 7 M datasets, please.

Lynn

dpak · ‎07-04-2013

Dear Altenbach

Here it is. I've put in only a short portion of the data string - so this version runs pretty quickly.

Thanks for your help,

David

dpak · ‎07-04-2013

Dear Lynn,

The mapping is the standard base64 alphabet: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ (corresponding to values 0...63)

I've uploaded a vi with some test data to the thread - perhaps that will help give some example data?

Thanks,

David

dpak · ‎07-04-2013

Sorry all, I didn't delete all the references in the example - they are there so that I could track the processing progress on sliders in the top level vi (which I didn't include - it reads the xml file, extracts the right elements etc. and displays the final data)

dpak · ‎07-04-2013

Here is a more focussed example which demonstrates the effect - probably better than my last effort.

In this example, you can generate a dummy binary (meaning boolean) array to convert to a float sgl array.

Set the size to somewhere between 1M and 10M and run - there is a timer to measure how long the processing of the array takes.

You can switch between two different methods of generating the floating point number, from 32bit sections of the binary array - trust me, both give the same result.

I just assumed that typecasting would be the more efficient method, but as you can see, it is not.
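
Since a block diagram cannot be pasted as text, here is a rough Python sketch of the two kinds of method being compared: rebuilding the value from its sign, exponent and mantissa bits, versus simply reinterpreting the 32 bits as a float. It assumes standard IEEE 754 single precision and that the manual method decodes the fields arithmetically; the function names are hypothetical and the attached VI may differ in detail.

```python
import struct


def bits_to_float_manual(word):
    """Rebuild a float from the IEEE 754 single-precision fields of `word`."""
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = (word >> 23) & 0xFF      # 8-bit biased exponent
    fraction = word & 0x7FFFFF          # 23-bit mantissa field
    if exponent == 0xFF:                # infinities and NaNs
        return float("nan") if fraction else sign * float("inf")
    if exponent == 0:                   # zero and subnormal numbers
        return sign * fraction * 2.0 ** -149
    # Normal numbers: implicit leading 1, exponent bias of 127
    return sign * (1.0 + fraction / 2.0 ** 23) * 2.0 ** (exponent - 127)


def bits_to_float_cast(word):
    """Reinterpret the same 32 bits as a float (the 'typecast' route)."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


if __name__ == "__main__":
    for word in (0x3F800000, 0xC0200000, 0x00000001):
        print(hex(word), bits_to_float_manual(word), bits_to_float_cast(word))
```

Both functions return the same value for every finite input, which matches the observation that the two methods agree; which one is faster depends on the language and on how the bits are stored beforehand.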

Anyone got an even better method up their sleeve?

Thank you for your interest and guidance,

David

johnsold · ‎07-04-2013

David,

I have been looking at the first example. The make.vi divides "Binary array in" into four segments, of lengths 1440, 1440, 1440, and 3366. The for loop which processes the last segment only works with the first 1440 elements. Is this intentional?

Lynn

dpak · ‎07-04-2013

Dear Lynn,

No, that would be a booboo. I clearly rushed that bit a bit too much, trying to get some example code out.

However, I think the slow processing issue is not related to that mistake - I was trying different types of parallelization in order to improve the run time and clearly that version had other issues too!

In the later version I posted, I've cut it all down much more, to try to make the issue clearer and easier to play with.

Thank you for your help, it is much appreciated,

David

johnsold · ‎07-04-2013

David,

I have some observations for your consideration.

Each time I ran one of the modifications I checked the Float Array Out and String to see whether they had changed (although in most cases I only looked at the first and last 10 elements). The timing measurements I report below were made by using Run Continuously while the Profiler was running. All subVIs were closed.

Your original VI: average time 8584 us over 3541 runs. My revised version: average time 1629 us over 4977 runs.

I found that letting LV deal with the parallelism seems faster than having 4 of everything in the code.

I created an array of the Base 64 alphabet bytes as U8 and used Search 1D Array to find the index. This was faster than using the variant attributes.

In make.vi (old method) you reverse the mantissa binary array inside the for loops. By reversing the Mantissa decoder array outside the loop (or using a reversed constant), you can avoid the reversal. Similarly, rather than using a case structure to select a -1 or +1 to multiply the numeric value, pass the value into the case structure and put Negate in the True case while just wiring straight through the False case.

Generally I try to move outside the loop anything which can be precalculated there. Also multiply is faster than divide and add is faster than multiply when things are done many times. None of these changes speeds things up very much on its own, but the cumulative effect is noticeable. It also seemed to be faster when the subVIs were not reentrant.
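
To illustrate the "precalculate outside the loop" idea in text form, here is a small Python sketch. Note that it swaps in a different technique from the one in the attached VIs: instead of searching the alphabet array for each character, it builds a 256-entry lookup table once and indexes it by byte value, then assembles output bytes with integer shifts rather than boolean arrays. The names `LOOKUP` and `decode_base64` are made up for this example.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

# Built once, outside any loop: byte value -> 6-bit value, -1 if invalid.
LOOKUP = [-1] * 256
for index, char in enumerate(ALPHABET):
    LOOKUP[ord(char)] = index


def decode_base64(text):
    """Decode standard base64 text to bytes using shifts and a lookup table."""
    data = text.rstrip("=").encode("ascii")  # padding carries no data bits
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    for byte in data:
        value = LOOKUP[byte]
        if value < 0:
            raise ValueError("invalid base64 character: %r" % chr(byte))
        acc = (acc << 6) | value
        nbits += 6
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)  # emit the top 8 bits
            acc &= (1 << nbits) - 1            # keep only the leftover bits
    return bytes(out)


if __name__ == "__main__":
    print(decode_base64("AACAPw=="))
```

The per-character work here is one table index, one shift and one OR, with no searching and no bit reversal, which is the same spirit as the suggestions above even though the mechanism differs.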

The attached file has two of my recent versions.

Lynn
