12-13-2016 04:52 PM
Hello,
I am running an analysis on very large data sets (>100 MB) and need to achieve some fairly drastic (hopefully 10x) performance improvements. The bulk of the code is a single while loop in which the TDMS Read function is called iteratively (with a set read size and a moving offset) to extract 2 channels of data. These two channels are fed to a subroutine which, essentially, calculates their cross-correlation. The resulting cross-correlation is then added to the previous cross-correlation computation and stored in a shift register until the file reading is complete.
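For anyone following along, here is a rough plain-Python sketch of the loop structure I described. All the names are made up for illustration (`read_chunk` stands in for the TDMS Read call), and the naive correlation is just to show the accumulation pattern, not the actual MASM routine:

```python
def cross_correlate(a, b, max_lag):
    """Naive cross-correlation of equal-length sequences for lags -max_lag..+max_lag."""
    n = len(a)
    out = []
    for lag in range(-max_lag, max_lag + 1):
        s = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                s += a[i] * b[j]
        out.append(s)
    return out

def accumulate_correlation(read_chunk, total_samples, read_size, max_lag):
    """read_chunk(offset, count) stands in for the TDMS Read call and should
    return (channel_a, channel_b) for that slice of the file."""
    acc = [0.0] * (2 * max_lag + 1)   # plays the role of the shift register
    offset = 0
    while offset < total_samples:
        count = min(read_size, total_samples - offset)
        a, b = read_chunk(offset, count)
        for k, v in enumerate(cross_correlate(a, b, max_lag)):
            acc[k] += v              # sum per-chunk correlations across the file
        offset += read_size
    return acc
```

The key point is that the per-chunk correlations simply sum, which is what makes the shift-register accumulation work.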
After profiling the code I've realized that most of the time is spent reading out data via the TDMS Read function. I am hoping that it is now possible to parallelize the code by splitting the data into 4 different subsections (e.g. http://www.ni.com/white-paper/6421/en/). However, I don't know if this works, since we have a single reference to the TDMS file. It seems this wasn't possible a few years back, but there had been some active development in this direction. If necessary, I imagine I could break the data into smaller TDMS files, but it would be much nicer to work with a single file.
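Since the per-chunk correlations add up, the parallel version I have in mind would just hand each worker a contiguous (offset, count) range and sum the partial results at the end. A small sketch of the partitioning (hypothetical helper name; note that lags spanning a boundary between subsections would be lost, which may or may not matter at my read sizes):

```python
def partition_offsets(total_samples, n_workers):
    """Split [0, total_samples) into n_workers contiguous (offset, count) ranges,
    distributing any remainder one sample at a time to the first workers."""
    base = total_samples // n_workers
    rem = total_samples % n_workers
    ranges = []
    offset = 0
    for w in range(n_workers):
        count = base + (1 if w < rem else 0)
        ranges.append((offset, count))
        offset += count
    return ranges
```

Each worker would then run the same read-and-accumulate loop over its own range, and the four partial correlation arrays would be summed element-wise.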
Also, I was curious whether I should expect a difference in read times between the Advanced and regular TDMS functions. I initially switched from the Advanced TDMS reads to the regular TDMS reads so I could select individual channels without having to unpack all of the data. Contrary to my expectation, this reduced my performance. Is this expected?
Thanks for any help, and please let me know if any clarification is needed.
12-14-2016 04:00 PM
Hi nkmath,
I have a few suggestions to help with the performance of your application:
Why Are My TDMS Files So Large?
http://digital.ni.com/public.nsf/allkb/63DA22F92660C7308625744700781A8D
Application Design Patterns: Producer/Consumer
http://www.ni.com/white-paper/3023/en/
To answer your last question, it is expected that you would see a drop in performance when switching from the Advanced to the Standard TDMS functions. The Advanced functions were developed after the Standard ones and included some performance optimization features. You can read more about the differences here: http://forums.ni.com/t5/LabWindows-CVI-User-Group/LabWindows-CVI-Tip-Write-Data-to-Disk-Faster-with-...
I'm not sure this will give you 10x performance improvements, but these steps should definitely help you get closer to those goals!
12-15-2016 04:09 AM
100 MB should be manageable in memory, so read it all and work in memory.
/Y
12-22-2016 12:44 PM
Hi Alex,
Thanks for the tips, and sorry for the delay in response... the holiday season is getting the best of me. Anyway, I was able to switch back to using the Advanced TDMS Asynchronous functions, which provided a significant reduction in computation time (~3-4x). That was very good to see. Here is a summary of what happened with the other suggestions:
Defragging: The process of defragging a single 500 MB file, corresponding to 1 second of data acquisition, took something like half an hour. The analysis (as it stands) takes 50 seconds to analyze that 1 second of data, so there doesn't seem to be much point in defragging the file to reduce computation time. Additionally, I encountered errors when analyzing the defragged file. Not sure what's going on there, but I'm going to leave it for now as the defrag process is much too long.
Producer/Consumer Architecture: I was able to implement the producer/consumer loops, and it did offer a slight (1.2-1.5x) improvement in computation time. However, this step seemed to introduce a bug which I can't quite seem to figure out. The data is stored in the TDMS file as two interleaved channels (1 sample each). I am extracting the data in the producer loop, passing it into the queue, extracting the data from the queue in the consumer loop, decimating the array to retrieve each channel, and passing those channels to an analysis SubVI (I've attached a screenshot). Then, to test that the code is working properly, I fed a pulsed 1 MHz signal to both channels (same signal) and calculated the cross-correlation between channels as a function of the time delay. With previous implementations of the code, I see a peak in the cross-correlation at zero time delay, and then peaks at +/- 1, 2, 3... microsecond delays, which is expected. However, when I add in the queue handling, it seems that one channel gets delayed with respect to the other, and the peak in the cross-correlation lies away from zero time delay. The shift appears to be exactly one read size (an integer multiple of 512 elements). Let me know if you have any suggestions, as I have run out of ideas.
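For reference, the decimation step itself is straightforward; here is a plain-Python sketch of what I mean by splitting the interleaved array (hypothetical function, just illustrating my understanding). My guess, which I haven't confirmed, is that the decimation is fine and the one-read-size shift comes from the consumer pairing channel data from two different queue elements:

```python
def deinterleave(chunk):
    """Split one interleaved chunk [a0, b0, a1, b1, ...] into two channels.
    If the read size is not a multiple of 2, the next chunk starts mid-sample
    and every subsequent chunk silently swaps the channels."""
    if len(chunk) % 2 != 0:
        raise ValueError("read size must be a multiple of 2 (one sample per channel)")
    return chunk[0::2], chunk[1::2]
```

As long as each queue element is deinterleaved and analyzed on its own, the two channels stay aligned. A constant shift of exactly one read size would instead appear if, say, the channels traveled through separate queues and were dequeued on different loop iterations.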
Include multiple TDMS reads: I have not yet implemented this step due to the issue stated above. I have concerns about using the MASM cross-correlation function, which already makes use of multiple computing cores, in conjunction with a parallel operation on multiple files. I'd be happy to discuss this more once the above is understood.
Nolan
12-27-2016 04:17 PM
Hey Nolan,
I am looking into how the queueing of data could be causing an offset in the cross-correlation time delay. I don't see anything jumping out at me... would it be feasible to correct the peak by the known 512 samples? Or is the offset not a consistent integer number?
I would be interested in seeing how the Asynchronous TDMS Read functions are set up further off to the left of your screenshot. Additionally, I am curious about the "Sub Correlate Cross" VI that you are using here. Is it something that you developed? I would say that, if you have the option, changing the inputs/outputs from doubles to a smaller data type might increase the program's running speed. But if the TDMS Read functions are the limiting factor, this may not be worth looking into just yet.
Sam R.
Applications Engineer
National Instruments