
Help in optimizing cross correlation routine

Hello,

 

I am hoping to decrease the computation time of a cross-correlation routine I have written in LabVIEW. I am performing the cross-correlation between two channels, each sampled at 250 MHz with 8 bits per sample. The data is streamed to disk at 500 MB/s, and typical runs last anywhere from a minute up to an hour or so, which corresponds to roughly 30 GB to 2 TB per file. Currently it takes several hours to analyze just a few minutes of data, which is becoming a huge bottleneck in our research efforts. A 100x improvement would be ideal, but even a factor of 10 would be extremely helpful.
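
(For reference, those numbers are self-consistent; a quick sanity check, written out here as plain Python arithmetic, gives the same figures:)

channels = 2
sample_rate = 250e6          # samples per second, per channel
bytes_per_sample = 1         # 8 bits per sample

rate = channels * sample_rate * bytes_per_sample   # bytes per second
print(rate / 1e6, "MB/s")                  # 500.0 MB/s
print(rate * 60 / 1e9, "GB per minute")    # 30.0 GB
print(rate * 3600 / 1e12, "TB per hour")   # 1.8 TB, i.e. ~2 TB for an hour-long run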

 

I have not yet benchmarked my code, but I believe most of the time is spent in the 1D Cross Correlation VI provided by NI, which relies on C libraries. I have a sub VI, 'subcorrelateClean', which implements the correlation. Because of the large number of samples, we have to extract the data in chunks and perform the correlation iteratively. Also important is the correlation at different lags, which the correlation VI computes automatically. However, if we simply cross-correlated two length-N chunks, there would be edge effects at nonzero time lags because the chunks no longer overlap fully. To counter this I hold onto neighboring chunks to build an array of length 3N for one channel and correlate it against the other channel's current chunk, which spans indexes N to 2N of that array. The output cross-correlation then has length 4N, but there is full overlap over the section from N to 3N, which I extract accordingly. I have tried to make the code as readable as possible, but let me know if there is anything I should clarify.
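
In rough pseudo-code terms (Python/NumPy here purely to illustrate the indexing, not my actual LabVIEW code; the chunk names prevA, currA, nextA, currB are just placeholders, and LabVIEW's output length/indexing differs slightly from NumPy's):

import numpy as np

N = 512                                             # samples per chunk
rng = np.random.default_rng(0)
prevA, currA, nextA = (rng.standard_normal(N) for _ in range(3))
currB = rng.standard_normal(N)

# channel A: previous + current + next chunk, length 3N
ref = np.concatenate([prevA, currA, nextA])

# every possible lag (NumPy 'full' mode gives 4N-1 values; LabVIEW pads to 4N)
full = np.correlate(ref, currB, mode="full")

# keep only the lags where currB overlaps ref completely,
# i.e. the middle region of the padded output
valid = full[N - 1:3 * N]
# equivalently: valid = np.correlate(ref, currB, mode="valid")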

 

I appreciate any and all help. I imagine most of the improvement will come from how the correlation itself is performed, but there could be other gains I am not seeing. I did find this thread, https://forums.ni.com/t5/LabVIEW/Cross-Correlation/m-p/1413388/highlight/true#M548340, which mentions using external libraries to speed up the correlation, but I have not implemented that yet.

 

Best

 

Nolan M. 

Message 1 of 7

You should use LabVIEW's built-in tools to determine exactly which VI is taking the most time and how many times it is called. Go to Tools » Profile » Performance and Memory, check the Timing statistics, Timing details, Profile memory, and Memory usage boxes, click Start, and then run your VI. Stop the profiler once your VI is done executing. I know you said the run times vary, but this would be a good way to benchmark your code. Post the results of the Profile Performance and Memory tool here so we can get a better idea of where to start.

Message 2 of 7

On a side note, the MASM toolkit has a parallelized version of the cross correlation. If you have a computer with many cores, you might get a proportional speedup. Try it! (I think it is now free, but please check)

 

(Sorry, I haven't looked at your code)

Message 3 of 7

@ToeTickler wrote:

You should use LabVIEW's built-in tools to determine exactly which VI is taking the most time and how many times it is called. Go to Tools » Profile » Performance and Memory, check the Timing statistics, Timing details, Profile memory, and Memory usage boxes, click Start, and then run your VI. Stop the profiler once your VI is done executing. I know you said the run times vary, but this would be a good way to benchmark your code. Post the results of the Profile Performance and Memory tool here so we can get a better idea of where to start.


Thanks for the help. I have done so, and I've attached the results below. For clarity, the three most time-consuming VIs are the following:

 

1D Cross Correlation (DBL): 92,148 runs, total time 50.887 s

subCorrelate.VI: 30,716 runs, total time 61.027 s

u32toDBLarray.VI: 30,719 runs, total time 3.681 s

 

The 1D Cross Correlation VI is called inside subCorrelate.VI, so most of that time (~50.9 s) is spent in the cross-correlation algorithm itself. However, a significant amount of time (61.0 - 50.9 = 10.1 seconds) is spent outside of the 1D Cross Correlation VI, where I am splitting and indexing arrays. I imagine there may be some improvement to be found in how the arrays are manipulated.

 

The other VI, 'u32toDBLarray', is a VI I wrote that takes a U32 word and unpacks it into the data in DBL format. We store 4 x I8 samples in each U32. There may be a better way to pack the data, and correspondingly a faster way to unpack it.
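
For illustration only (NumPy, not my LabVIEW code), the unpacking amounts to reinterpreting the raw bytes; the little-endian byte order is an assumption here, since the real FPGA packing may differ:

import numpy as np

# hypothetical packed words; each U32 carries 4 consecutive I8 samples
packed = np.array([0x01020304, 0xFFFE0100], dtype=np.uint32)

# reinterpret the raw bytes as signed 8-bit samples, then widen to DBL
samples = packed.view(np.int8).astype(np.float64)
print(samples)   # 8 samples out of 2 packed words

If the byte ordering cooperates, the LabVIEW analogue would be something like a single type cast of the whole U32 block to an I8 array, rather than per-word bit manipulation, which is usually much cheaper per sample.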

 

Improvements: The first thing I will do is remove the calculation of the auto-correlation for each channel and only compute the cross-correlation between them (although there is useful info in the auto-correlations...). That should give me about a factor of 3 improvement. The next step is to implement the libraries I mentioned in my original post, which should help speed up the cross-correlation algorithm (Edit: or the one altenbach mentions, thank you!). After that, I think looking into the packing/unpacking of the data and the optimal way of approaching it would also be helpful.

 

Let me know if you have any suggestions on what I've mentioned above. 

 

Nolan M. 

 

Message 4 of 7

Could you load a small sample of data into the subcorrelateClean VI?

 

Looking at the code, I imagine a speedup is possible if you avoid constantly resizing arrays. Memory allocation for big data sets is slow.
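
The general idea, sketched in Python rather than LabVIEW just to make the point (chunk sizes are arbitrary placeholders):

import numpy as np

n_chunks, chunk = 1000, 512

# slow: growing the result every iteration forces repeated reallocation and copying
out = np.empty(0)
for i in range(n_chunks):
    out = np.append(out, np.zeros(chunk))            # reallocates each time

# fast: allocate once up front, then overwrite sections in place
out = np.empty(n_chunks * chunk)
for i in range(n_chunks):
    out[i * chunk:(i + 1) * chunk] = np.zeros(chunk)  # no reallocation

In LabVIEW terms this is the usual pattern of Initialize Array once outside the loop and Replace Array Subset inside it, instead of Build Array in the loop.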

 

Cheers,

mcduff

Message 5 of 7

@altenbach wrote:

On a side note, the MASM toolkit has a parallelized version of the cross correlation. If you have a computer with many cores, you might get a proportional speedup. Try it! (I think it is now free, but please check)

 

(Sorry, I haven't looked at your code)


Wow. Thanks for this. Very easy to implement and I got a 3x improvement in the computation time!

Message 6 of 7

OK, so I've implemented the MASM toolkit correlation function, removed the calculation of the auto-correlations, and switched to SGL arrays rather than DBLs, which resulted in a factor of 3-4 reduction in total computation time. This is great, but a typical 2.5-minute run will still take about 2.5 hours to analyze, so I'd like to keep pushing the optimization. I've re-profiled the code and attached the optimized (current) code. The three most time-consuming VIs are the same as before, but now in a different order:

 

u32toSGLarray.VI  - Average: 107 micro-sec/run

subcorrelateCCSGL.VI - Average: 100 micro-sec/run

1D Cross Correlation.VI (MASM version) - Average: 63 micro-sec/run

 

The first VI, 'u32toSGLarray.VI', opens and reads a TDMS file and unpacks the data, which is stored as U32, into 4 x I8. The data is stored this way to satisfy FPGA/FIFO transfer limitations, and I don't see a good way around it, but maybe you all have some ideas.

 

The second VI, 'subcorrelateCCSGL.VI', now only performs the cross-correlation. The correlation function returns the cross-correlation at many different time lags. For example, if we put in two arrays of size N, the output is a 2N array whose Nth element is the zero-lag cross-correlation (see the help file if this is confusing). At nonzero time lags the two chunks do not overlap perfectly, so to correct for this we hold onto the previous iteration's data chunk and feed the correlation algorithm one array of size 3N and another of size N, which overlaps the former array over indexes N to 2N. The output of the correlation function is an array of length 4N, but we only extract the region from N to 3N, where the overlap between the two inputs is full. I know this is probably confusing, but I think it is good to have an idea of what I am doing to ensure the best advice...

 

The cross-correlation algorithm itself seems to be working efficiently now that I've implemented the MASM toolkit version. Where it could still be improved is by calculating the cross-correlation at far fewer time lags. I am forced to extract the data in chunks (arrays) whose number of elements (typically 512) is set by the sector size of the TDMS file, which I believe is fundamentally set by the hardware. The correlation function computes the cross-correlation at every possible time lag, determined by the input array size. The problem is that I only need the cross-correlation at around +/-100 steps, but since the TDMS read returns 512-element chunks, the cross-correlation function always returns the cross-correlation at +/-512 steps. The only way around this would be to resize the arrays before feeding them to the correlation function (probably more time-consuming), or to rewrite the correlation function to compute only what is needed; a sketch of the latter idea is below.
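
For example (again just a NumPy sketch, not LabVIEW; a, b, N and max_lag are placeholder names and values), computing only the +/-100 lags I actually need would look like a handful of dot products per chunk instead of a full correlation:

import numpy as np

N, max_lag = 512, 100
rng = np.random.default_rng(0)
a = rng.standard_normal(3 * N)      # extended chunk of channel A (see earlier post)
b = rng.standard_normal(N)          # current chunk of channel B, aligned with a[N:2N]

# evaluate only the lags we care about instead of every possible lag
lags = np.arange(-max_lag, max_lag + 1)
cc = np.array([np.dot(a[N + k:2 * N + k], b) for k in lags])

Whether ~200 direct dot products per chunk actually beats the FFT-based full correlation would depend on the chunk size and the implementation, so it would need benchmarking.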

 

Anyway, this probably reads more like a stream of consciousness than something organized, but if anyone has further tips they would be much appreciated!

Message 7 of 7