I am writing code to stream images and my main goal is to improve performance.
I have an array of i16s that I need to cast into an array of i32s. In worst case scenarios I have 4096x4096 * 4 (67 million) data points, attempting to stream at 5 FPS.
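For scale, the worst case above works out to 67,108,864 samples per frame: 128 MiB as i16 and 256 MiB as i32. At 5 FPS, a single i16-to-i32 pass therefore reads about 640 MiB/s and writes about 1280 MiB/s. A minimal C sketch of that arithmetic follows; the function names are illustrative only and not part of any NI API.

```c
#include <stdint.h>

/* Samples in one worst-case frame: 4096 x 4096 x 4, as stated in the post. */
uint64_t frame_samples(void)
{
    return (uint64_t)4096 * 4096 * 4;
}

/* Bytes touched per second when every sample of a frame is read
   (or written) exactly once per frame. */
uint64_t bytes_per_second(uint64_t samples, uint64_t bytes_per_sample,
                          uint64_t fps)
{
    return samples * bytes_per_sample * fps;
}
```

Any extra temporary i32 array adds another 256 MiB write and 256 MiB read per frame on top of these figures, which is why the number of copies matters here.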
Is it possible to move this i16 array into a preallocated i32 array?
Currently, I am preallocating arrays and using the replace array subset; however, I am seeing that when I convert the i16 array to i32 to then place it into the preallocated array, there is a buffer allocation happening. This happens whether or not I explicitly cast the array. I know that using the memory manager and allocating dynamic resources can be costly (remember, I only care about timing here, I'm already in 64 bit LV).
Is it possible to simply use the replace array subset without the extra allocation?
Confused by your terminology.
Does the I32 array have only half the number of elements (i.e. what "typecast" would do) or do you simply need to convert the datatype, keeping the array size the same? (... and no, you cannot typecast 2D arrays)
Can you show us some code?
Sorry for the confusion. I need to convert the i16s into i32s (The i32 array has the same number of elements)
The reason I am converting the i16 to i32 is because I need to apply a gain to the data and cannot risk saturating the i16.
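To illustrate why the wider type is needed, here is a small C sketch. It assumes an integer gain purely for illustration (the post does not say what type the gain has), and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Multiply one i16 sample by an integer gain, carrying the result in
   32 bits. The integer gain is an assumption; the post does not state
   the gain's type. */
int32_t apply_gain(int16_t sample, int16_t gain)
{
    return (int32_t)sample * (int32_t)gain;
}

/* True when a scaled value would still fit in a signed 16-bit integer. */
bool fits_in_i16(int32_t value)
{
    return value >= INT16_MIN && value <= INT16_MAX;
}
```

For example, a sample of 20000 with a gain of 4 gives 80000, which is outside the i16 range of -32768 to 32767 but fits comfortably in an i32.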
My question is related to moving the i16s into an i32 buffer without making any extra data copies.
I can't share the code, but attached is a screenshot. You can see that a buffer allocation is happening when I convert to i32 (this was confirmed using the Profile Buffer Allocations tool). Is it possible to simply push the data from the i16 array into the i32 buffer I've preallocated?
The two arrays have different size in memory, so a copy needs to be made no matter what. You also branch the wire.
You have a fixed size I32 array in the shift register, so this is re-used in place. Since you replace all elements, I would just wire to the tunnel on the right, omitting the replace operation entirely. Is that wire in a shift register? Why do you also keep the I16 array around? How about using DVRs for the 2D array?
(We really need to see more of the code than just a microscopically cropped section. What hardware is this running on?)
I like your basic thought process. The one thing I'd consider trying is removing the explicit conversion to i32 and wiring the i16 elements directly into the "Replace Array Subset" node. Let there be a coercion dot and an *implicit* conversion instead.
If forced to bet, I'd expect it not to matter, but hey you never know. It's possible that by deferring the conversion to the "Replace Array Subset" primitive, LabVIEW will handle the element replacement more optimally, without needing to populate a temporary short-lived i32 array. It's conceivable the "replace" algorithm could convert each element from i16 to i32 as it overwrites values one at a time.
P.S. altenbach's answer came in before I finished mine. We've both pointed out that the element values will have to be copied. I think it's *possible* that the explicit conversion to i32 might lead to copying twice (once when filling in the short-lived temporary i32 array, again when that i32 array is used to replace values in the i32 array you carry in your shift register) while the implicit conversion might only need to copy once.
I wouldn't count on it because the LabVIEW compiler is pretty good at figuring out ways to optimize. But I wouldn't entirely rule it out and it may be worth investigating.
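In a textual language, the "copy once" case described above looks like the C sketch below: each i16 element is read and written straight into the preallocated i32 buffer, with no intermediate i32 array. This only illustrates the idea; it says nothing about what the LabVIEW compiler actually generates, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen n samples from src into the caller's preallocated dst.
   Each element is read once and written once; no intermediate
   array is created. */
void widen_i16_to_i32(const int16_t *src, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (int32_t)src[i];  /* sign-extending conversion */
    }
}
```

The "copy twice" case would correspond to calling a routine like this into a temporary array first and then copying that temporary into the final buffer.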
I'm attaching a screenshot that shows more of the acquisition VI (had to do some cleanup).
For reference, the host code is running on a PXIe-8840, and I am acquiring from an NI 5172 using the recon oscope instrument design library (IDL) template.
When I initialize my class I create the i16 and i32 arrays and put them in DVRs.
The de-interleaver (orange block) is part of the IDL and takes in a pre-allocated 2D i16 array to reshape the data from the FPGA. Because I want to scale the data, I want to put that i16 data into the i32 buffer and then scale.
I hang on to the i16 array because this buffer will need to be used on the next acquisition. Recall I am attempting to stream at 5 FPS (5 FPS for the large images, I can hit 20-60 for smaller images) - so this VI is being called inside of an acquisition loop repeatedly. Because I am using preallocated DVRs, I need to put something back in the DVR; this is where the i16 wire branch comes from.
Additionally, I need to use the data outside of this VI for display/logging purposes (I have 8 cards that I am aggregating data from), so that is why we see the i32 wire branch in this screenshot.
You both make a good point about the fact that there will need to be a copy of the data, I just want to make sure I am doing it in the most efficient way possible because I am working with such large data sets.
I have tried the implicit coercion and did not see a difference in performance and the Profile Buffer Allocations tool reported the same allocation happening.
I would question whether you need to do the conversion/scaling to i32 in real time? For logging, there's no reason to write more than i16 (plus gain and offset) which contains exactly the same amount of information. For display, you would be better to move this into a separate loop (using a queue) and process and display there - but even so, if the only processing is a scaling and offset, again there's no additional information that can be visually seen - you could simply adjust the colour map.
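A rough C sketch of the "i16 plus gain and offset" idea: the raw samples stay as i16 and the scaling is applied only when a value is actually needed. The struct, the field names, the use of doubles, and the convention gain * raw + offset are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: raw i16 samples plus the two scaling parameters. */
typedef struct {
    const int16_t *raw;   /* unscaled samples, not copied */
    size_t count;         /* number of samples in raw */
    double gain;          /* assumed convention: gain * raw + offset */
    double offset;
} ScaledFrame;

/* Scale a single sample on demand instead of converting the whole frame. */
double scaled_sample(const ScaledFrame *frame, size_t index)
{
    return frame->gain * (double)frame->raw[index] + frame->offset;
}
```

Stored this way, a logged frame is half the size of its i32 equivalent and can still be scaled exactly after the fact.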
I'm also working at the moment with a high bandwidth (2048x8 U16 @ 25000 fps, ~800MB/s) and having to be careful to do as little as possible to the data in real time to stay within what LabVIEW can do. The main thing I've learnt (this week) is that typecast in LabVIEW is a lot costlier than in C - I'd been trying to cast an array of U64s to U16s (with 4x as many entries) but had to rewrite to use U16s directly from the FPGA - probably a better approach in the end anyway.
"The de-interleaver (orange block) is part of the IDL and takes in a pre-allocated 2D i16 array to reshape the data from the FPGA."
Can you write your own de-interleaver? From the screenshot it looks like it is rearranging elements in the 2D I16 array. If you write your own, you can convert the data from the FIFO directly to I32 and insert it into your I32 array, with no need to keep a copy of both 2D arrays.
I think @GregS has the right idea. (Typecast in LabVIEW always makes a data copy.)
Greg & mcduff,
Thank you for the suggestions, these are really good things to consider.
Not doing the scaling in line: the goal is to stream images at a high rate, so if I were to transfer it to another process I would still face the same issue and the display may no longer be "live". However, you bring up a good point that the gain & offset may not be able to be seen visually. I will bring this to the team and see if this route makes sense for us. I do like the idea of storing the data as i16s and saving the gain and offset as parameters - this could save us a lot of time without losing any data.
Writing my own de-interleaver: I think I still end up in the same situation where it is not guaranteed that the i16s will be copied into the i32 buffer without a temporary buffer being created. For example, let's say I have a preallocated 2D i32 array and the 1D i16 array from the DMA. I am then de-interleaving, moving data from the 1D i16 array to the 2D i32 array with Replace Array Subset. Do we know if there is a temporary i32 buffer used when we see the coercion dot when we replace the i16 into the i32 array? Ideally, it would copy the 16 bits directly into the 32 bit buffer. I'm not sure how LabVIEW works in this regard.
Additionally, the de-interleaver calls a DLL via a call library function node; I have a feeling this call is going to be more efficient than what I can do in LabVIEW. This DLL actually has an i32 version of the de-interleaver that I can use (uncovered this recently when I tried sending i32s from the FPGA, this didn't work for various reasons). However, I think I run into the same problem using this as I still need to convert the i16 data from the DMA into an i32 buffer before de-interleaving. This essentially just moves the current copy from after the de-interleaver to before the de-interleaver.
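If a custom routine were an option (for instance in a DLL called through a Call Library Function Node), the widening and the de-interleaving could in principle be fused into one pass, so the i16 data from the DMA goes straight into the preallocated i32 array. The C sketch below is hypothetical: it is not the IDL's de-interleaver, and the interleaving layout (one value per channel, repeated for each sample) is an assumption, since the thread does not describe the actual layout.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout: fifo holds n_samples groups of n_channels values
   (ch0, ch1, ..., ch0, ch1, ...). out is a preallocated row-major
   [n_channels][n_samples] i32 array. Each value is read once, widened,
   and written once. */
void deinterleave_widen(const int16_t *fifo, size_t n_channels,
                        size_t n_samples, int32_t *out)
{
    for (size_t s = 0; s < n_samples; ++s) {
        for (size_t c = 0; c < n_channels; ++c) {
            out[c * n_samples + s] = (int32_t)fifo[s * n_channels + c];
        }
    }
}
```

Whether this would beat the existing DLL call is something only a benchmark on the target hardware can answer.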
Thanks for all the suggestions everyone, definitely got some things to chew on!
One more thought regarding displaying a live image view: even if you do need to show an i32 image (and I doubt you do), you almost certainly don't need to display a full 4096x4096 image either, unless you have a huuuuge monitor! If you can subsample or crop that image stream to 1024x1024, that's only transferring and displaying 1/16 of the full data.
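As a sketch of the subsampling idea, the C function below keeps every step-th pixel of every step-th row. With step = 4, a 4096x4096 image becomes 1024x1024, i.e. 1/16 of the data. The function name and the row-major layout are assumptions, and simply dropping pixels (rather than averaging them) is just the easiest variant to show.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy every step-th pixel of every step-th row from a row-major image
   into dst. step must be at least 1, and dst must hold
   ceil(rows/step) * ceil(cols/step) elements.
   Returns the number of pixels written. */
size_t subsample(const int16_t *src, size_t rows, size_t cols,
                 size_t step, int16_t *dst)
{
    size_t written = 0;
    for (size_t r = 0; r < rows; r += step) {
        for (size_t c = 0; c < cols; c += step) {
            dst[written++] = src[r * cols + c];
        }
    }
    return written;
}
```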