Bandwidth from FPGA to RT on 9606 and 9651

jiangliang · ‎07-29-2015

I am trying to connect a high-speed ADC to the FPGA, and I want my sbRIO to act as a simple oscilloscope: it continuously gathers data and can be triggered when an event occurs. This is similar to the pre-trigger in DAQmx, so I need to save some samples from before the trigger arrives.

I'm now thinking about using up to two 40 MS/s 8-bit ADCs, and I want to continuously keep about 20k samples from before the event occurs, which raises one question:

Where should I store these samples, in FPGA or in RT?

The FPGA seems to have too little memory. It is almost certainly not feasible on the 9606, and I'm not sure whether it is feasible on the 9651. Does anyone have any ideas?

The RT side, on the other hand, does have enough memory, but the problem would be the bandwidth from FPGA to RT. It looks like NI doesn't mention this parameter in their datasheets?

tannerite · ‎07-29-2015

Hi jiangliang,

The Zynq-7020 on the sbRIO-9651 has much more BRAM than the Spartan-6 LX25 on the sbRIO-9606: 560 KB vs. 117 KB.

If you want 20k samples of pre-trigger data, that equates to 40 KB (20k samples x 1 byte x 2 channels). We have internally benchmarked the DMA FIFO streaming data between LabVIEW FPGA and LabVIEW RT via the AXI bus at ~300 MB/s in either direction; however, NI Linux Real-Time was only reading/writing from the FIFO and not running additional LabVIEW code, so the CPU usage of your application will have some effect. If you are sampling at 40 MSamples/s on two channels, that would equate to 80 MB/s of data needing to be streamed through DMA FIFO(s) for post-sampling, so you are within reason.
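The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and the 300 MB/s figure is the idle-CPU benchmark quoted above, not a guaranteed rate for a real application.

```python
# Back-of-the-envelope check of the figures quoted in the post.
PRE_TRIGGER_SAMPLES = 20_000              # samples kept per channel before the trigger
BYTES_PER_SAMPLE = 1                      # 8-bit ADC
CHANNELS = 2
SAMPLE_RATE_HZ = 40_000_000               # 40 MSamples/s per channel
DMA_BENCHMARK_BYTES_PER_S = 300_000_000   # ~300 MB/s, benchmark with an otherwise idle CPU

pre_trigger_bytes = PRE_TRIGGER_SAMPLES * BYTES_PER_SAMPLE * CHANNELS
stream_bytes_per_s = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
headroom = DMA_BENCHMARK_BYTES_PER_S / stream_bytes_per_s

print(f"pre-trigger buffer: {pre_trigger_bytes / 1000:.0f} KB")
print(f"streaming rate:     {stream_bytes_per_s / 1e6:.0f} MB/s")
print(f"headroom vs. DMA benchmark: {headroom:.2f}x")
```

The headroom factor shrinks as soon as the RT application does real work on the data, so it is best read as an upper bound.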

I created a simple LabVIEW project with a 200,000-byte (8-bit data type) LabVIEW Memory Item and continually wrote data to it in circular-buffer fashion, without triggering or anything fancy, to see if it compiled. It did, using 37.9% of the BRAM (pretty good considering perfect utilization of BRAM would be 195 KB / 560 KB = 35%). This was in a single-cycle timed loop with "Cycles of read latency" set to 3 to lower the resource utilization and increase the pipelining stages.
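For readers who want to see the circular-buffer idea in text form, here is a small software model in Python. It is not the LabVIEW FPGA implementation described above; the class name, the tiny depth, and the snapshot-on-trigger readout are all assumptions chosen to keep the example short.

```python
class PreTriggerBuffer:
    """Software model of a fixed-depth circular buffer (hypothetical name)."""

    def __init__(self, depth):
        self.mem = bytearray(depth)   # stands in for the FPGA memory item
        self.depth = depth
        self.write_addr = 0           # next address to overwrite
        self.count = 0                # number of valid samples stored so far

    def write(self, sample):
        """Store one 8-bit sample, overwriting the oldest once the buffer is full."""
        self.mem[self.write_addr] = sample & 0xFF
        self.write_addr = (self.write_addr + 1) % self.depth
        self.count = min(self.count + 1, self.depth)

    def snapshot(self):
        """Return the stored samples oldest-first, as read out after a trigger."""
        if self.count < self.depth:
            return bytes(self.mem[:self.count])
        return bytes(self.mem[self.write_addr:] + self.mem[:self.write_addr])


buf = PreTriggerBuffer(depth=8)
for value in range(20):        # keep writing; only the last 8 samples survive
    buf.write(value)
history = list(buf.snapshot())
print(history)                 # [12, 13, 14, 15, 16, 17, 18, 19]
```

On a trigger, the write address marks the boundary between the oldest and newest samples, which is why the readout starts there and wraps around.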

I would recommend prototyping your application on your LabVIEW FPGA target to verify that you can fit your design on the fabric. There are many different ways you could potentially architect this application and hopefully I have provided the necessary estimated maximum throughputs and what can potentially fit on the fabric.

- Tanner

Tannerite
National Instruments

Kevin_Hooper · ‎07-29-2015

You do not mention your post-trigger sample depth. This will decide whether the overall system meets your needs.

The sbRIO-9651 with LabVIEW FPGA will easily handle your data bandwidth.

My current application streams and post-processes samples from a device containing 25 12-bit ADCs operating in parallel (30 MB/s continuous data). The post-processing (filtering, thresholding, etc.) uses historical measurement data in FPGA memory. The challenge was closing the timing on this post-processing. I did exercise the FPGA DDR DMA by capturing raw samples from three devices (which exceeds your requirement at 90 MB/s).

Maybe your post-trigger sample depth is small enough to allow you to only use the FPGA to write samples to DDR. I think LabVIEW FPGA and the sbRIO-9651 will easily handle this.

The issue arises if you want to use LabVIEW Real-Time on the sbRIO-9651 ARM processor to handle the data stream. I failed to get the ARM processor to stream 30 MB/s back to a PC and found even simple processing of the 30 MB/s data challenging. The issue is that the LabVIEW Real-Time ARM processor code is inefficient/slow (I have various NI support requests where I have shown MathWorks and/or Xilinx delivering significantly better performance, such as MathWorks Simulink generated code running 8 to 10 times faster than LabVIEW Real-Time code).

My best solution is to keep data logging in the hardware (avoid the ARM processor). I have had 90 MB/s of data streaming back to a PC. The solution was LabVIEW FPGA with the sbRIO-9651 to capture and process data solely in the FPGA gates. The FPGA outputs data to an Orange Tree TCP/IP card. This small card can continuously stream data at 100 MB/s over a TCP/IP socket using Gigabit Ethernet.

jiangliang · ‎07-29-2015

Thank you for this information; it saves me a lot of time verifying it myself.

I did a little test on the 9606: it can stream at about 50 Mbit/s to RT, using a lossy enqueue function on the RT side to receive the data. Anything above that causes the FIFO to overrun.

May I ask, when you benchmarked the 9651 bandwidth, were you using U32 data or U8? I remember the bus from FPGA to RT is 32 bits wide (or maybe 64 bits in AXI?), so can we get better performance when we transport data using the full bus width? Or maybe I should ask: if I use an 8-bit FIFO, would that introduce some overhead?

Thank you very much.

jiangliang · ‎07-29-2015

I noticed LVRT is relatively slow compared to other RT systems based on the same HW, but I never realized the difference could be so big; an 8 to 10 times performance gap is almost like an i7 versus an Atom.

I'm sure LVRT is not slow in every aspect (or not THAT slow), but the overall performance certainly takes a discount when running on the same HW. Let's say a discount of 20~50% is reasonable; an 80% performance penalty would be a big issue for users. Think about it: you would need to pay more for the same performance, design a better heat management system, and consume more energy, which basically says bye-bye to many portable applications that would otherwise fit perfectly.

I also wonder whether the performance penalty comes from the OS or from LabVIEW itself. I noticed NI moved from Phar Lap/VxWorks to Linux; is it because the relatively new OS is not yet perfectly tuned, or is LV code inherently inefficient?

tannerite · ‎07-30-2015

Jiangliang,

A while back I talked to one of our software developers who had worked, in part, on the DMA FIFO transfer mechanisms on the Zynq platform, and I asked the same question about optimizing for 32 or 64 bits. He stated that manually packing the data in LabVIEW FPGA or LabVIEW Real-Time would not really provide a benefit. This is because the DMA FIFO code already packs the data together efficiently under the hood before sending it over the bus, and the copy needed for that does not really impact throughput. Plus, that copy is going to happen either way, whether you do it manually in LabVIEW or the underlying code does it for you.

I have not tested this with different data types to confirm this, however. Again, this is just a conversation I have had with one of the software developers a while back -- I do not have deep, working knowledge of this transfer mechanism.

- Tanner

Tannerite
National Instruments

jiangliang · ‎07-30-2015

Thank you. Even if there is no bandwidth benefit from packing data from FPGA to RT, would packing reduce the overall CPU overhead? After all, if I pack four U8 values into one U32, it seems RT would be less busy fetching data? Or maybe it is all done under the hood.

In my case, I'm not going to unpack on the RT side; that will be done on the PC, so I still want to believe there should be some performance benefit.

tannerite · ‎07-30-2015

Again, I don't have working knowledge of the nuts and bolts of this mechanism. With that said, if you pack 4 U8's into a U32 on the FPGA and send it to RT, I would imagine you would eventually want to split the U32 back into 4 U8s. So again, it is either you splitting the data or the DMA FIFO code splitting the data for you if you want the 4 U8's back. If not and you are going to do that in some kind of post-processing, then yes, I could see that potentially helping out, but I am unsure of the actual CPU overhead reduction.
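To make the pack/split step concrete, here is a minimal Python sketch of packing four U8 samples into one U32 word and splitting them again on the receiving side. The function names and the byte order (first sample in the least significant byte) are assumptions for illustration; nothing here reflects how the DMA FIFO driver arranges data internally.

```python
def pack_u8_to_u32(samples):
    """Pack groups of four U8 samples into U32 words, first sample in the low byte."""
    if len(samples) % 4:
        raise ValueError("sample count must be a multiple of 4")
    return [
        samples[i]
        | (samples[i + 1] << 8)
        | (samples[i + 2] << 16)
        | (samples[i + 3] << 24)
        for i in range(0, len(samples), 4)
    ]


def unpack_u32_to_u8(words):
    """Split each U32 word back into four U8 samples, low byte first."""
    return [(word >> shift) & 0xFF for word in words for shift in (0, 8, 16, 24)]


samples = [1, 2, 3, 4, 255, 0, 128, 7]
words = pack_u8_to_u32(samples)
print([hex(w) for w in words])          # ['0x4030201', '0x78000ff']
print(unpack_u32_to_u8(words) == samples)
```

Packing cuts the number of FIFO elements by four, so whether it lowers CPU load depends on how much of the per-read cost scales with element count rather than byte count. That is something only a measurement on the target can settle.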

If you end up testing that out, feel free to post if you see any kind of decrease in CPU overhead.

- Tanner

Tannerite
National Instruments

Kevin_Hooper · ‎07-30-2015

Here are the answers to your questions:

My application uses FPGA writes configured for thirty two I32 elements.

Past experience (performance profiling of SoC chips) and Xilinx discussions (I looked at their tools) indicated that performance improves when you write frames of data elements rather than individual elements. One disadvantage is that there is some FPGA gate overhead to create and manage the data frames. There are several advantages. I was able to include a frame timestamp with little overhead. I was able to run the FPGA FIFO write while the external interface providing ADC samples was idle.
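As an illustration of the frame idea, the Python sketch below builds and parses a 32-element I32 frame. The layout is an assumption made for the example (one timestamp element followed by 31 sample elements, little-endian); the post does not say how the actual frames are laid out, and the function names are hypothetical.

```python
import struct

FRAME_ELEMENTS = 32                       # thirty-two I32 elements per write, as in the post
SAMPLES_PER_FRAME = FRAME_ELEMENTS - 1    # assumption: one element carries the timestamp
FRAME_FORMAT = f"<{FRAME_ELEMENTS}i"      # assumption: little-endian signed 32-bit words


def build_frame(timestamp_ticks, samples):
    """Return one frame as bytes: timestamp first, then the samples."""
    if len(samples) != SAMPLES_PER_FRAME:
        raise ValueError(f"expected {SAMPLES_PER_FRAME} samples per frame")
    return struct.pack(FRAME_FORMAT, timestamp_ticks, *samples)


def parse_frame(frame):
    """Split a frame back into (timestamp, samples)."""
    fields = struct.unpack(FRAME_FORMAT, frame)
    return fields[0], list(fields[1:])


frame = build_frame(1000, list(range(SAMPLES_PER_FRAME)))
timestamp, samples = parse_frame(frame)
print(len(frame), timestamp, samples[:4])
```

With this layout the timestamp costs one element in 32, roughly 3% of the frame, which matches the remark that it adds little overhead.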

NI Linux looks pretty good.

My main disappointment was that the "NI system" failed to provide a method to continuously log 30 MB/s of data (NI sales wrongly assumed the sbRIO-9651's USB2 or Gigabit Ethernet would meet this requirement). USB2 write is too slow (a common Linux limitation even when using USB sticks capable of this data rate) and Gigabit Ethernet appears to be limited to 100BaseTX performance (this looks like an NI issue, as Xilinx have shown 45 MB/s data throughput on the same FPGA using their Linux distribution and drivers). I have two alternatives working: an FPGA interface to an FTDI USB controller streaming 35 MB/s to a PC, which saves to a TDMS file, and an OrangeTree TCP/IP board streaming 100 MB/s to a PC.
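A quick way to see why these links behave so differently is to compare the 30 MB/s requirement with the raw signaling rate of each link. The Python sketch below uses the nominal rates from the respective standards; real payload throughput is lower because of protocol, driver, and file-system overhead, so a "could fit" result is a necessary condition, not a promise.

```python
REQUIRED_MB_PER_S = 30.0   # continuous logging rate from the post

# Nominal raw signaling rates in Mbit/s (payload throughput is always lower).
RAW_LINK_RATE_MBIT_PER_S = {
    "100BASE-TX": 100,
    "USB 2.0 High-Speed": 480,
    "Gigabit Ethernet": 1000,
}


def raw_mb_per_s(mbit_per_s):
    """Convert a raw link rate from Mbit/s to MB/s."""
    return mbit_per_s / 8


for name, mbit in RAW_LINK_RATE_MBIT_PER_S.items():
    ceiling = raw_mb_per_s(mbit)
    verdict = "could fit" if ceiling >= REQUIRED_MB_PER_S else "cannot fit"
    print(f"{name:20s} raw ceiling {ceiling:6.1f} MB/s -> {verdict} {REQUIRED_MB_PER_S:.0f} MB/s")
```

A Gigabit port that performs like 100BaseTX tops out at 12.5 MB/s raw, well under the 30 MB/s target, which is consistent with the logging failure described above.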

Another disappointment was NI do not support the proven I2C/SPI functions built-in to the FPGA. I have asked NI to support the built-in FPGA I2C/SPI functions rather than using LabVIEW FPGA to make inferior I2C/SPI functions (inferior in terms of performance & being proven/tested for standards compliance).

LabVIEW Real-Time

Most individual LabVIEW functions are efficient and fast. Most issues arise from memory management and from having only one memory manager. I have seen performance limitations when I followed the LabVIEW Real-Time training advice of using two loops (a deterministic loop reading DMA FIFO data and a non-deterministic loop logging/processing data). The end result of several support requests was the recommendation to have one loop containing both the DMA FIFO read and the data post-processing/logging.

My calls with NI developers indicated the Real-Time DMA functions are limited by memory management. For example, the NI LabVIEW Real-Time DMA FIFO read uses ARM code to copy data from allocated FIFO memory into another memory area allocated by the underlying LabVIEW memory manager. Others do not use this "simple" approach as its performance is limited - Xilinx tends to use a large FIFO which allows them to pass data from FIFO to data post-processing/logging by merely passing a pointer (no memory allocation and memory copy needed) and MathWorks use the FPGA built-in DMA engine to rapidly move large volumes of data with little processor overhead.
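The difference between copying FIFO data and handing over a reference can be shown in a few lines of Python. This is only an analogy: a `bytearray` stands in for the FIFO memory, and the two functions (hypothetical names) contrast a copying read with a zero-copy view. It does not represent NI, Xilinx, or MathWorks driver code.

```python
fifo_memory = bytearray(range(16))   # stands in for the allocated DMA FIFO region


def read_with_copy(fifo, offset, length):
    """Copy the requested bytes into a newly allocated buffer."""
    return bytes(fifo[offset:offset + length])


def read_as_view(fifo, offset, length):
    """Return a view onto the same memory; nothing is allocated or copied."""
    return memoryview(fifo)[offset:offset + length]


copied = read_with_copy(fifo_memory, 4, 4)
view = read_as_view(fifo_memory, 4, 4)

fifo_memory[4] = 0xFF                # simulate new data arriving in the FIFO
print(copied[0], view[0])            # 4 255: the copy is stale, the view is live
```

The view avoids the allocation and copy, but the consumer must finish with the data before the producer overwrites it, which is why the zero-copy approach is described above as relying on a large FIFO.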

At the end of the day, LabVIEW Real-Time on a dual-core 667 MHz ARM A9 processor is struggling to run a small subset of the algorithms that run on a single-core 60 MHz AVR-32. The MathWorks tools do generate better ARM code, but they are significantly more complex and expensive compared to LabVIEW Real-Time.

Tiffi · ‎07-30-2015

Hi,

About a year ago I ran some throughput tests on, I believe, a cRIO-9068, which probably has the same Zynq as the SOM. I can't present the details, but the result was that U32 had better throughput than U16, and U16 had better throughput than FXP 28. Since then I have always used U32 for data transfer, and it works great on the SOM and other high-speed devices. It's not a solution for every case, of course, and I don't know why U32 worked better than the other data types (it would be extremely useful to know, so if anyone knows, please share).

I would recommend testing which data type suits your application.

Best regards,

Tifi
