
read delimited spreadsheet with ~10e6 rows

Solved!
Go to solution

@tyk  - E5-2667v4 Xeon @ 3.20 GHz.  Installed memory (RAM): 256 GB (192 GB usable).  It is a "sexy beast" of a system.

 

The built-in subVI has a data type of 32-bit integer.  It does type conversion at the connector.  I could specify anything I wanted, but past 2^31 - 1 it rolls over to a value near -2^31.  I tried it.
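For anyone who wants to see that coercion outside of G, here is a minimal Python illustration of the same signed 32-bit wraparound (the offset value is arbitrary):

    import ctypes

    offset = 2**31 + 100                    # a byte offset just past 2 GiB
    coerced = ctypes.c_int32(offset).value  # what an I32 terminal ends up holding
    print(offset, "->", coerced)            # prints: 2147483748 -> -2147483548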

 

 Capture.PNG

Message 11 of 24

There may be a way to create a DSN for the file, then use the Database Connectivity Toolkit to read it.
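To make the idea concrete: the Microsoft Text Driver can expose a delimited file as an ODBC "table". A rough sketch in Python/pyodbc, only to show the shape of it; the Database Connectivity Toolkit would issue the same kind of query through the same driver, and the folder and file names here are hypothetical:

    import pyodbc  # assumes the Microsoft Text Driver is installed

    conn = pyodbc.connect(
        r"DRIVER={Microsoft Text Driver (*.txt; *.csv)};"
        r"DBQ=C:\data\chunks;Extensions=asc,csv,tab,txt;"
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM [chunk_001.txt]")   # the 'table' is just the text file
    while True:
        rows = cur.fetchmany(100_000)              # pull the file down in batches
        if not rows:
            break
        # ... process the batch ...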

Message 12 of 24

So you were thinking something like the following?

https://docs.microsoft.com/en-us/sql/odbc/microsoft/text-file-format-text-file-driver

 

That is less than ideal because, as I said, I have ~1 TB split into ~10 GB chunks, so it would mean a lot of manual DSN creation.  It would be a lot easier if I could recursively scan the directory for all of my "txt" files and feed their names into a chunk-process loop that performs the processing I am looking for.
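That directory walk itself is not the hard part; here is a rough Python sketch of the enumerate-and-gulp loop (folder name and gulp size are placeholders):

    from pathlib import Path

    root = Path(r"D:\bigdata")                          # hypothetical chunk folder

    for txt in sorted(root.rglob("*.txt")):             # recurse for every .txt chunk
        with txt.open("r", errors="replace") as f:
            while True:
                lines = f.readlines(64 * 1024 * 1024)   # ~64 MB worth of whole lines
                if not lines:
                    break
                # ... clean up the lines and hand them to the SQLite writer ...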

Message 13 of 24

I guess it's a little late in the development cycle to suggest saving in a binary format rather than ASCII?  (Ducking.)

Bill
CLD
(Mid-Level minion.)
My support system ensures that I don't look totally incompetent.
Proud to say that I've progressed beyond knowing just enough to be dangerous. I now know enough to know that I have no clue about anything at all.
Humble author of the CLAD Nugget.
Message 14 of 24

@bilko - I so dearly wish.  It wasn't my choice.  I advocated for a decent binary format for... 3 years.  Too much organizational inertia/momentum.  ...  Technical debt?

Message 15 of 24

Still waiting for some code. Otherwise I'm just guessing, I'm afraid.

 

Don't forget that by using a 32-bit application (you said you were using LabVIEW 32-bit) you only have access to ~2 GB of memory for the entire process (I don't believe LabVIEW has the Large Address Aware flag compiled in, which would still only take you to a maximum of 3 GB), irrespective of how much physical memory you actually have. Some 32-bit data processing applications solve this by spawning multiple independent processes to increase their effective capacity under Windows; it wouldn't surprise me if 32-bit R does this.
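A rough sketch of that multiple-process idea (Python only because it is compact; each worker runs in its own address space, so the per-process 2 GB ceiling stops mattering; the folder path, worker count, and per-line work are placeholders):

    from multiprocessing import Pool
    from pathlib import Path

    def process_chunk(path):
        """Parse one ~10 GB text chunk inside its own process."""
        total = 0
        with open(path, "r") as f:
            for line in f:
                total += len(line)     # stand-in for the real cleanup/convert work
        return path, total

    if __name__ == "__main__":
        files = sorted(Path(r"D:\bigdata").glob("*.txt"))
        with Pool(processes=4) as pool:
            for path, total in pool.imap_unordered(process_chunk, files):
                print(path, total)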

Message 16 of 24

Does each line have a fixed number of characters? If so, @tyk gave you the answer. It will take some work.

  1. Use the binary read file functions, set to an array of U8, then convert to text. Or just read directly into a string. (An array may be easier to manipulate than a string.)
  2. Read X characters, which equals N lines.
  3. Use the Set File Position VI to move through the file; it takes an I64 as an input.
  4. Rinse and repeat as needed.

If there is not a fixed number of characters, this will still work with the following modification.

  1. Read X characters in, append whatever is left over after the last line terminator to the start of the next read, and repeat until you are through the file (see the sketch below).
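A rough text-language equivalent of those steps (a Python sketch; block size and file name are placeholders, and f.seek()/f.tell() stand in for the Set/Get File Position VIs, both of which take 64-bit offsets):

    CHUNK = 64 * 1024 * 1024                # ~64 MB per gulp; tune to taste

    with open("bigfile.txt", "rb") as f:    # hypothetical file name
        leftover = b""
        while True:
            block = f.read(CHUNK)           # file position advances automatically
            if not block:
                break
            lines = (leftover + block).split(b"\n")
            leftover = lines.pop()          # partial last line: carry into the next read
            for line in lines:
                pass                        # ... convert/clean each complete line here ...
        # 'leftover' now holds any final line without a trailing terminator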

Cheers,

mcduff

Message 17 of 24

@EngrStudent wrote:

It isn't going to work.  The "set file position" is in bytes and is a 16-bit integer.



If you are referring to the "Get File Position" and "Set File Position" functions on the File I/O Palette, you are correct that this is "position in bytes", but it is a 64-bit, not 16-bit, integer.  Thus the maximum file size is on the order of 9 exabytes, probably more than your largest hard drive ...
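For the record, the arithmetic behind that figure:

    >>> 2**63 / 1e18        # largest signed 64-bit byte offset, in (decimal) exabytes
    9.223372036854776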

 

Bob Schor

Message 18 of 24
Solution
Accepted by topic author EngrStudent

@EngrStudent wrote:

Background:

  • I am trying to read in a delimited text file with ~10 million rows, that is ~8 GB in size.
  • The "read text file" chokes on memory.  It can't allocate enough memory.  I am not even using it; it dies "at the gate".
  • I'm cleaning up the data and writing to SQLite3 using the SQLite3 library by Dr. Powell. 
  • After I read the spreadsheet I use "flatten to string" then count characters.

Now I can't see why a pointer in FAT can't make the read part of this work reasonably quickly.

 

I am trying to use a "read delimited spreadsheet" and increment the start of read offset by number of characters.  The idea is that I "gulp" a reasonable number of lines, count characters, increment the character skip value, process the lines, then return to "read delimited spreadsheet" with a larger "characters to skip" number.

 

It gets to ~1.43 million lines and dies. 

 

Question:

  • Is there any reason it should die at that row count?
  • Why might this code be terminating only a small part of the way through the data?

 


Wouldn't it be easier to just use Read from Text File in line mode (e.g. 1E6 lines at a time) and do a String Length instead of Flatten? Repeat until error 4 (EOF).

Maybe something like this, using your Flatten approach:

ReadBigFile.png
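In text-language terms the loop amounts to something like this (a Python sketch; the file name and gulp size are placeholders, and an empty read plays the role of error 4):

    from itertools import islice

    LINES_PER_GULP = 1_000_000              # roughly what the snippet above reads per pass

    total_chars = 0
    with open("bigfile.txt", "r") as f:     # hypothetical file name
        while True:
            gulp = list(islice(f, LINES_PER_GULP))   # up to 1E6 lines
            if not gulp:
                break                                 # end of file
            total_chars += sum(len(line) for line in gulp)
            # ... split each line on the delimiter and process it here ...
    print(total_chars)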

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer
Message 19 of 24
Solution
Accepted by topic author EngrStudent

I am a little late to the game but I will share anyway.

 

As I wrote in this NI Week paper, the Read From Spreadsheet uses an I32 as the byte offset. It will choke when the value goes past 2 Gig and the file offset will appear to be negative.

 

The Read From Spreadsheet is not password protected and it is possible to go through all of the sub-VIs and change them to an I64.

 

It ain't easy but I have done it and it did work.

 

Also (see that paper) be aware that it takes time to convert the strings to numbers for files that large. Be patient!

 

In my case I broke the work up into multiple producer/consumer loops chained together so that I put all of the cores to work at the same time: some reading text from the file, others looking for newlines, and yet others converting text to numbers.
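For anyone who wants the shape of that pipeline in a text language, here is a compact Python sketch with three stages chained by queues (file name and delimiter are assumptions; note that Python would need multiprocessing rather than threads for real parallel speed-up, whereas the LabVIEW loops genuinely run on separate cores):

    import queue, threading

    raw_q  = queue.Queue(maxsize=8)    # reader -> splitter
    line_q = queue.Queue(maxsize=8)    # splitter -> converter
    DONE = object()

    def reader(path):
        with open(path, "rb") as f:
            while block := f.read(32 * 1024 * 1024):
                raw_q.put(block)
        raw_q.put(DONE)

    def splitter():
        leftover = b""
        while (block := raw_q.get()) is not DONE:
            lines = (leftover + block).split(b"\n")
            leftover = lines.pop()
            line_q.put(lines)
        if leftover:
            line_q.put([leftover])
        line_q.put(DONE)

    def converter():
        while (lines := line_q.get()) is not DONE:
            for line in lines:
                values = [float(x) for x in line.decode("ascii", "replace").split(",") if x.strip()]
                # ... hand 'values' to the binary/SQLite writer ...

    threads = [threading.Thread(target=reader, args=("bigfile.txt",)),   # hypothetical name
               threading.Thread(target=splitter),
               threading.Thread(target=converter)]
    for t in threads: t.start()
    for t in threads: t.join()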

 

And since I had to be able to process the same data file possibly multiple times (customer changes mind and wants to take another closer look), I compressed the data and wrote it as an indexed binary file so that I only had to process the huge text file once.
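A minimal sketch of that one-time conversion, assuming a simple offset/length side index (the record layout, compression choice, and file names are made up for illustration):

    import struct, zlib

    def parsed_blocks():
        """Placeholder: yields lists of floats coming out of the text-parsing stage."""
        yield [1.0, 2.0, 3.0]

    with open("data.bin", "wb") as out, open("data.idx", "wb") as idx:
        for rows in parsed_blocks():
            payload = zlib.compress(struct.pack(f"{len(rows)}d", *rows))
            idx.write(struct.pack("<Qi", out.tell(), len(payload)))   # offset + length
            out.write(payload)

    # Later passes read data.idx, seek straight to the block they need in data.bin,
    # and never touch the original text again.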

 

As I said, a little late but it may help someone someday.

 

Ben 

Retired Senior Automation Systems Architect with Data Science Automation | LabVIEW Champion | Knight of NI and Prepper | LinkedIn Profile | YouTube Channel
Message 20 of 24