LabVIEW

Is anyone working with large datasets (>200M) in LabVIEW?

I am working with external bioinformatics databases and find the datasets to be quite large (2 files easily come out at 50M or more). Is anyone working with large datasets like these? What is your experience with performance?
Message 1 of 7
Salutations,

I've used some large data sets, like the ones you've described. I once went through about 1.5 gigs of information; that was pleasant.
It took approximately 45 minutes to complete all the necessary analysis. Operations included resampling, order extraction, statistical mean and standard deviation calculations, feature calculations (kurtosis, crest factor, mean, RMS, and so forth), binning, etc.


I don't know what your task goal is; however, the best technique always involves keeping as little information around as possible. For example, when taking the mean, you only need three numbers:

the current mean, the new value to add in, and the number of samples.

Hence, the more you can lean on techniques that don't require all the data to be in memory at one time, the better off you are. The more data you hold at any one point in time, the more memory you use. Eventually you'll start paging to disk and spend all your CPU managing that, and once that occurs you're boned for the most part; it's best to try again.

If you're acquiring data, you could have it write to a file every so often and clear out the in-memory data set each time you do that, etc.
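As a rough illustration (in Python rather than LabVIEW, with the chunk size and file name invented for the example), that buffer-and-flush idea looks something like this:

# Append samples to a small in-memory buffer and flush it to disk every
# CHUNK samples, so memory use stays bounded. CHUNK and the file path are
# placeholders, not values from any real acquisition setup.
CHUNK = 10_000
buffer = []

def record_sample(value, path="acquired_data.txt"):
    buffer.append(value)
    if len(buffer) >= CHUNK:
        with open(path, "a") as f:
            f.writelines(f"{v}\n" for v in buffer)
        buffer.clear()  # clear out the in-memory data set after each flush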

Without knowing exactly what you're after, the best advice (as I repeat myself for the 252nd time, give or take): use as little data as possible.

Sincerely,
ElSmitho
Message 2 of 7
Thank you for the advice. It sounds like you keep the data (1.5 G) on the hard drive and then read in as little as possible to analyze. This sounds a little cumbersome. Did you develop some LabVIEW tools to help you do this?
Message 3 of 7
Heh, whoops.

Indeed, you are correct: all the data was stored on a hard drive, and it wasn't being analyzed while the collection process was going on. I've seen such programs before; normally the analysis has to be very minimal, otherwise you'll bog down the computer and possibly miss data collection points.

The data was collected using DAQ tools, and then examples from the LabVIEW example area were used to read in the files. As far as avoiding reading the whole file at once, I believe there is an example for DAQmx or traditional DAQ that shows you how to do this.

So really, LabVIEW did pretty much all the work as far as writing and reading the files. I'd check the examples; they'll give you some solid footing.

Sincerely,
ElSmitho
Message 4 of 7
Colby, any environment will give you trouble if you try to deal with more data than there is physical RAM at one time. What Elsmitho is getting at is that it is best to develop your program so that it only works with "comfortable" amounts of data at any given time. For common operations, numerical methods textbooks describe more efficient ways of completing them. In Elsmitho's example of the averaging operation, a very simple algorithm allows memory usage to be minimized. Maybe we could suggest some methods if you tell us a bit more about your application. When graphing, remember that there is no reason to put more points on the screen than there are pixels. For other operations there are other solutions.
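On the graphing point, here is a rough sketch (Python rather than LabVIEW; the function name and point budget are just for illustration) of min/max decimation, which keeps the peaks visible while sending no more points to the display than it can show:

import numpy as np

def decimate_for_display(samples, max_points=1920):
    # Keep the min and max of each bin so the envelope of the signal
    # survives; max_points is an arbitrary screen-width-ish budget.
    samples = np.asarray(samples, dtype=float)
    if samples.size <= max_points:
        return samples
    n_bins = max_points // 2
    usable = (samples.size // n_bins) * n_bins  # drop the ragged tail
    bins = samples[:usable].reshape(n_bins, -1)
    # interleave per-bin min and max for plotting
    return np.column_stack((bins.min(axis=1), bins.max(axis=1))).ravel()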

Chris
Message 5 of 7
Chris,

Thanks for the input. I am trying to download data from six different databases (NCBI, Stanford Microarray database, and several more), integrate the data into one master file, and then analyze the protein interactions. The datasets will vary from 200M to 500M on each run. How much data do you think I can put into LabVIEW and work with comfortably? The analysis is simple ANOVA, standard deviation, and mean, but it's done across a large portion of the data collected.

Colby
Message 6 of 7
Colby, it all depends on how much memory you have in your system. You could be okay doing all that with 1 GB of memory, but you still have to take care not to make copies of your data in your program. That said, I would not be surprised if your code could be written so that it would work on a machine with much less RAM by using efficient algorithms.

I am not a statistician, but I know that averages and standard deviations can be calculated using a few bytes (even on arbitrary-length data sets). Can't the ANOVA be performed using the standard deviations and means (and other information like the degrees of freedom, etc.)? Potentially, you could accumulate all the various pieces that are necessary and do the F-test with that information, without ever needing the entire data set in memory at one time.

The tricky part for your application may be getting the desired data at the necessary times from all those different sources. I am usually working with files on disk, where I grab x samples at a time, perform the statistics, dump the samples, get the next set, and repeat as necessary. I can calculate the average of an arbitrary-length data set by loading only one sample at a time from disk (though it's still more efficient to work in small batches, because the disk I/O overhead adds up).
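To make the F-test idea concrete, here is a sketch (Python, not LabVIEW; the function name is invented and SciPy is only used for the p-value) of a one-way ANOVA computed purely from per-group counts, means, and sample variances, which can all be accumulated while streaming through the data:

from scipy import stats

def one_way_anova_from_summary(ns, means, variances):
    # ns, means, variances: per-group sample counts, means, and sample
    # variances (ddof=1), gathered without holding the raw data in memory.
    k = len(ns)
    n_total = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * v for n, v in zip(ns, variances))
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return f_stat, p_value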

Let me use the calculation of the mean as an example (hopefully the notation makes sense): see the attached jpg. What it boils down to is mean_n = (x_n + mean_(n-1)*(n-1)) / n; in plain English, the mean can be calculated solely as a function of the current data point, the previous mean, and the sample number. For instance, given the data set [1 2 3 4 5], sum it and divide by 5 and you get 3. Or take it a point at a time: the average of [1] = 1, [2 + 1*1]/2 = 1.5, [3 + 1.5*2]/3 = 2, [4 + 2*3]/4 = 2.5, [5 + 2.5*4]/5 = 3. This second method requires far more multiplications and divisions, but it only ever requires remembering the previous mean and the sample number, in addition to the new data point. Using this technique, I can find the average of gigs of data without ever needing more than three doubles and an int32 in memory. A similar derivation can be done for the variance, but it's easier to look it up (I can provide it if you have trouble finding it). Also, I think this functionality is built into the LabVIEW point-by-point (PtByPt) statistics functions.
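For reference, the same running calculation written out in Python (a sketch only; the variance update is Welford's method, which is one standard form of the derivation mentioned above, and the point-by-point VIs give you equivalent functionality inside LabVIEW):

class RunningStats:
    # Running mean and variance using only a handful of scalars in memory.
    # The mean update is exactly the (x_n + mean_(n-1)*(n-1)) / n recurrence
    # worked through above; the variance update is Welford's method.
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n  # same as (x + mean*(n-1)) / n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

rs = RunningStats()
for x in [1, 2, 3, 4, 5]:
    rs.add(x)
print(rs.mean, rs.variance())  # 3.0 2.5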

I think you can probably get the data you need from those db's through some carefully crafted queries, but it's hard to say more without knowing a lot more about your application.

Hope this helps!
Chris
Message 7 of 7