
Performance when writing a large 2D DBL array to a spreadsheet file

Hello!

 

My application saves a lot of numeric DBL data to a TDMS file.  But occasionally, my users want to export that data into a tab-delimited text file.

 

I can do this, but it's very slow.

 

The problem is that it takes a LONG time to convert that array into a spreadsheet string using the "Array to Spreadsheet String" primitive.

 

BTW: I really want to use automatic formatting, as seen below.  This saves a lot of bytes in the resulting file by dropping the trailing zeros.

 

[Image: Capture.JPG - Array to Spreadsheet String wired with the %#g automatic format string]
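In case the image doesn't load, here is the effect I'm after, shown with a rough Python analogue (Python's plain %g drops trailing zeros much like LabVIEW's %#g modifier does; note the # flag means the opposite in C-style printf):

    # Fixed-point formatting keeps trailing zeros; "automatic" drops them.
    print("%.6f" % 2.5)   # prints: 2.500000
    print("%g" % 2.5)     # prints: 2.5
    print("%g" % 0.001)   # prints: 0.001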

 

Any ideas on how to speed this up?

 

I will attach some demo VIs that show the performance metrics.

http://www.medicollector.com
Message 1 of 15

As an example:

 

A typical amount of data to write is an array containing:

     2 million rows

     5 columns

 

Attached is an example where I try to split the array into smaller chunks.  But it still takes a LONG time to write that to a text file using automatic formatting (with no trailing zeros).
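Since LabVIEW diagrams don't paste as text, here is roughly what my chunking VI does, sketched in Python/numpy (file name and chunk size are just illustrative):

    import numpy as np

    # 2 million rows x 5 columns of DBL data, as in the example above.
    data = np.random.rand(2_000_000, 5)
    CHUNK = 100_000  # rows per chunk; tune for memory vs. call overhead

    with open("export.txt", "w") as f:
        for start in range(0, data.shape[0], CHUNK):
            # Format and write one chunk at a time, so the whole
            # spreadsheet string never has to exist in memory at once.
            np.savetxt(f, data[start:start + CHUNK], fmt="%g", delimiter="\t")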

 

Can anyone do this faster than me?

http://www.medicollector.com
Message 2 of 15

@josborne wrote:

A typical amount of data to write is an array containing 2 million rows and 5 columns. [...] Can anyone do this faster than me?


 

Good question!

 

Native LV operators are hard to beat (most of the time), so I would run a quick test: use the TDMS add-in for Excel to open the file, then (still in Excel) save it as a tab-delimited file to get an idea of how fast Excel can do it.

 

If Excel kicks LV's butt, then it may be worth attempting to do it faster by preallocating arrays of strings - but first see if it's worth the trouble.

 

Ben

 

Retired Senior Automation Systems Architect with Data Science Automation | LabVIEW Champion | Knight of NI and Prepper
Message 3 of 15

Here are a couple of suggestions, if you must use a text file.  I don't have time to make these changes right now, but maybe I'll get to it later if no one beats me to it.

 

- Preallocate the file on disk by using "Set File Size."  This can make a big difference.  Make the file too big (determine the maximum number of characters a value will need, add an extra byte per point for the delimiter after each value, then multiply by the number of points).  Resize the file once you finish writing.  (A rough sizing sketch follows this list.)

 

- Try pipelining.  Convert one sub-array to text and store it in a shift register.  On the next iteration, write that text to the file while you convert the next array chunk in parallel.  I don't know if this will help - your computer's I/O system should buffer the write and commit it to disk when it gets a chance - but who knows, maybe pipelining will help.
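To put rough numbers on the preallocation idea above, a minimal Python sketch (sizes and file name are assumptions, and truncate() is only a loose stand-in for LabVIEW's Set File Size - on some filesystems it extends the file sparsely rather than truly reserving space):

    # Worst-case sizing: max characters per value, +1 byte per point
    # for the delimiter that follows each value.
    ROWS, COLS = 2_000_000, 5
    MAX_CHARS_PER_VALUE = 14                             # assumed worst case
    max_size = ROWS * COLS * (MAX_CHARS_PER_VALUE + 1)   # 150 MB here

    with open("export.txt", "wb") as f:
        f.truncate(max_size)       # extend the file up front
        bytes_written = 0          # placeholder: write the real data here
        f.truncate(bytes_written)  # shrink to the actual size when done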

Message 4 of 15

Check the logic in that VI you posted; I'm not sure it does what you intended.  You don't wire the "Index" inputs to Array Subset, so you're always writing the same array section, and you might not be getting the array size you expect.

Message 5 of 15

Thanks, everyone. I was hoping my question would attract the attention of some Knights!

 

Nathand is correct about the bug in my previous post.  Attached is a modified VI that correctly splits the array into chunks.

 

First, I want to clarify that it is not the disk I/O that is slowing me down.  In the attached VI, I have completely removed the file I/O and the process is almost as slow.

 

The part that is slowing me down is the Array to Spreadsheet String conversion.  Because of that, I don't think that pipelining or pre-setting the file size will make any difference. 

 

I do think some iteration parallelism might help a bit, though; a rough sketch of the idea follows.
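Since LabVIEW snippets don't paste as text, here is the gist in Python (chunk count and file name are arbitrary; in LabVIEW this would be a parallelized For loop feeding an ordered write):

    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def format_chunk(chunk):
        # CPU-bound number-to-text conversion: the per-chunk
        # equivalent of Array to Spreadsheet String.
        return "\n".join("\t".join("%g" % v for v in row) for row in chunk) + "\n"

    if __name__ == "__main__":
        data = np.random.rand(2_000_000, 5)
        chunks = np.array_split(data, 16)
        # Format chunks on several cores, then write them back in order.
        with ProcessPoolExecutor() as pool, open("export.txt", "w") as f:
            for text in pool.map(format_chunk, chunks):
                f.write(text)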

http://www.medicollector.com
Message 6 of 15

Please be quantitative.  How long is a long time? Milliseconds? Weeks? What is the time you desire?

 

There are several things in your VI which do not slow it much but certainly do not help:

- Arrays cannot be larger than I32, and i and N are I32, so using U64 forces coercions.

- The Tick Count outside the For loop may execute before the array initialization, distorting your timing measurements.

- Repeating the N-1 calculation inside the loop makes no sense; the result is always the same.  In fact, the value already exists as the quotient - just wire it to the = comparison (see the sketch after this list).

- No need to write Total nr of chunks to the indicator on every iteration when it never changes.  Move the indicator terminal outside the loop.
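That loop-invariant point, as a quick sketch in Python terms (names invented for illustration):

    TOTAL_ROWS, CHUNK = 2_000_000, 100_000
    n_chunks = TOTAL_ROWS // CHUNK   # already available as the quotient
    last = n_chunks - 1              # computed once, outside the loop
    for i in range(n_chunks):
        is_last_chunk = (i == last)  # per-iteration work is just a compare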

 

Lynn

Message 7 of 15

Hey,

 

I just tried out your last VI, and just to add some numbers for the others:

As is, one run takes about 60 seconds.  Without the Array to Spreadsheet String calls, the VI runs in under 1 second.  While the things johnsold mentioned hold true, they are not the major problem here (I think he just gave it a quick look, and the 80/20 rule comes to mind).

As a rather newbie to LabVIEW, I wanted to compare the performance against the VI I use to write spreadsheet files: Write to Spreadsheet file.vi

It runs in 10 seconds when put in the place of your Array to Spreadsheet String call (:

If I skip the chunking entirely, I can pull ahead by almost one more second (though the UI looks frozen during the run, so I would stick with the slightly slower chunked solution).

 

Anyhow, I have no idea why this works faster, as the VI I used obviously uses the same "Array to Spreadsheet String" internally... Sadly, there is no way to look into the VI to see what makes the difference.  The only difference I can tell is: there is another format string in use!

So, I tried your VI with the "standard" format string: %.3f

With this, your VI runs in 8.5 seconds - probably faster than my 10 seconds because it does not write to disk.

 

Thus, the performance of the VI heavily depends on the format string used.  If you need a specific format string, it might be worthwhile to hand-code your own number-to-text conversion; if not, maybe rather stick to a "faster" format string.  I hope this solves your problem and speeds up your VI by several hundred percent 😛
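To make the format-string effect concrete, a rough Python timing sketch (Python's %g is the closest analogue of LabVIEW's trailing-zero-dropping %#g; absolute numbers will differ, but the relative gap is the point):

    import time
    import numpy as np

    data = np.random.rand(100_000, 5)   # smaller array, same idea

    def to_text(a, fmt):
        # Format every value and join with tabs/newlines, like
        # Array to Spreadsheet String does.
        return "\n".join("\t".join(fmt % v for v in row) for row in a)

    for fmt in ("%.3f", "%g"):
        t0 = time.perf_counter()
        to_text(data, fmt)
        print(fmt, "took", round(time.perf_counter() - t0, 2), "seconds")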

 

Regards, Questionmarker

Message 8 of 15

Thanks, Questionmarker!  Very good analysis.

 

The problem definitely appears to be the format string.  I would prefer the "automatic formatting" string:   %#g

 

I want to use this because it automatically drops the trailing zeros in the resulting string.  I guess a good workaround would be to format with the faster string and then find another way to drop the trailing zeros.  I think I will try doing a search & replace to get rid of them afterwards.

 

Anybody have any FAST code for searching through a giant tab-delimited string and removing all the trailing zeros?
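Something along these lines is what I have in mind, sketched in Python regex terms since I can't paste a diagram (helper name invented; whether it is fast enough on a giant string is exactly what I am asking about):

    import re

    # Strip trailing zeros after a decimal point in a tab/newline-delimited
    # block of fixed-format numbers: "1.500" -> "1.5", "2.000" -> "2".
    # Integers without a decimal point ("100") are left alone.
    def strip_trailing_zeros(text):
        return re.sub(r"(\.\d*?)0+(?=[\t\r\n]|$)",
                      lambda m: m.group(1).rstrip("."), text)

    print(repr(strip_trailing_zeros("1.500\t2.000\t0.125\n100.000\n")))
    # prints: '1.5\t2\t0.125\n100\n'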

 

Thanks everyone!

http://www.medicollector.com
Message 9 of 15

Well, I think regular expressions combined with search and replace would be a viable way to do this.  I do not know if that is fast; probably not.  Maybe consider saving the file as-is and replacing the zeros afterwards in a non-time-critical step?

"Search and Replace.vi" offers regular expression input. For the tries I did it went not too good. I searched for: 0*\t and replaced it with 0\t

One chunk takes about 40 seconds on my PC... so better look for something else.  Sadly, regular expressions are powerful, but not fast.

Anyhow, dealing with a surplus of zeros sounds more like an eye candy problem - so maybe just ignore it for the time being?

Message 10 of 15