Zynq 7000 performance

The Zynq 7000 used in the newly announced cRIO-9068 has me interested.

 

Especially the statement: "The processor and FPGA fabric communicate over 10,000 internal interconnects, delivering performance between a microprocessor and FPGA fabric that is physically impossible to accomplish between a discrete processor and an FPGA implemented on a printed circuit board."

 

Something I've been wondering for a while is whether this tight coupling of the ARM cores and the FPGA speeds up DMA transfers between "host" and "target".  On a PXIe-8115 coupled with a FlexRIO 7895 I have seen a minimum latency of around 6 microseconds per single DMA transfer, making transfers of small amounts of data relatively time-consuming (meaning that a transfer of 1 element or 100 elements costs about the same amount of time).
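To make the effect of that fixed overhead concrete, here is a toy model of the cost structure described in this thread. It is only a sketch: the 6 us overhead, the 60 ns-per-element rate, and the 100-element knee are the benchmark figures quoted later in the thread, used purely for illustration.

```python
# Toy model of a DMA transfer with a fixed per-transfer overhead.
# The transfer costs a flat 6 us up to ~100 U8 elements, and only
# grows linearly (60 ns/element) beyond that point.

OVERHEAD_US = 6.0       # fixed cost per DMA transfer, in microseconds
PER_ELEMENT_US = 0.06   # incremental cost per U8 element (60 ns)
FLAT_REGION = 100       # elements "free" under the fixed overhead

def transfer_time_us(n_elements):
    """Estimated time for one DMA transfer of n_elements U8 values."""
    return OVERHEAD_US + PER_ELEMENT_US * max(0, n_elements - FLAT_REGION)

def effective_rate_mb_s(n_elements):
    """Effective throughput in MB/s (bytes per microsecond)."""
    return n_elements / transfer_time_us(n_elements)

# 1 element and 100 elements cost the same wall-clock time, so small
# transfers run at a tiny fraction of the bus's achievable rate.
print(transfer_time_us(1), effective_rate_mb_s(1))
print(transfer_time_us(100), effective_rate_mb_s(100))
```

Under this model, a 1-element transfer achieves under 1% of the effective throughput of a 100-element transfer, which is exactly why lowering the fixed overhead matters for small packets.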

 

Is this latency lower in the new design?

 

Shane.

Message 1 of 20

The reason I'm so interested in this is that, according to my limited understanding, such a tight coupling of processor and FPGA should (theoretically) lead to much lower latencies when writing and reading FIFOs.  Whether these are labelled "DMA" or not is irrelevant; the interfaces are there internally to drastically cut latencies, as also shown HERE.

 

Even compared to my previously-mentioned (relatively high-spec) example, this system shows more than an order of magnitude better latency with small transfers.  This is something I look forward to immensely as it will allow for much tighter coupling between control loops on RT and FPGA levels.

 

Shane.

Message 2 of 20

Hi Intaris.

 

 

Sorry for the delay, but you have a very interesting question.

I spoke with some of my colleagues about this. They think that, in general, using controls and indicators would be the better way. So why do you want to use a FIFO instead of indicators and controls?

Could you please describe your application and explain what you want to achieve?

 

 

Best regards

Bernhard

Message 3 of 20

Using a DMA FIFO is the method of choice for passing data between RT and FPGA.  Using controls and indicators is inefficient.  This is nothing new.

 

The efficiency I am referring to is related to the DMA transfer itself between RT and FPGA.

 

Shane

Message 4 of 20

Unless of course I'm misunderstanding something terribly.  I thought the major idea of the Zynq was that the CPU runs the RT system and the FPGA runs (well, obviously) the FPGA subsystem.  The tight coupling of the two offers a more efficient transfer than other solutions, hence my question about the latency of such DMA transfers.

 

If I have misunderstood how the device should work, sorry.

 

Shane.

Message 5 of 20

Shane,

 

I think Bernhard may have misunderstood the question you posed. The Zynq 7000 is a single SoC (system on chip), and the CPU and FPGA now communicate over an AXI bus, as the link you shared mentions. The architecture of the Zynq chips allows for much greater throughput; our benchmarking indicates a max throughput of ~300 MB/s, which is roughly a 3x performance increase.

 

As I understand it, the latency of transfers over the bus with our implementation is similar to the older hardware targets. The main advantage we harnessed in using the Zynq chip is high-throughput data transfer with very minimal load on the CPU. Compared to a 9024, the 9068 uses roughly half as much CPU time on an identical DMA transfer.

 

The data I can find at the moment with regard to bus latency is a benchmark of hardware interrupts over the bus, and the performance is in line with our other CompactRIO controllers.

 

I'm a LabVIEW FPGA module software product support engineer, so this isn't exactly my territory. I'd be happy to ask one of the RIO hardware or software product support engineers to follow up with you for a more in depth discussion/confirmation if you'd like. Just let me know and I'll ask one of them to post back to you here.

 

Hope this helps some, if you need more information I'll get you in touch with the right guys.

Nick C | Software Project Manager - LabVIEW Real-Time | National Instruments
Message 6 of 20

That's fine,

 

We're not planning on actually USING any soon, but it seemed like the latency of the transfers could be massively improved with this type of design, so I was wondering whether or not it was implemented in such a way.

 

So it DOES go over the AXI bus?  I have seen papers showing around 0.6 us latency for small bursts of data.  That's 1/10th of the latency of a PXIe-8115 with a FlexRIO card.  If that is reflected in the LV environment, that would be really cool.

 

So it's basically the throughput for SMALL (up to 128 bytes) packets of data I'm interested in, whereas the 300 MB/s figure is for large packets, I presume.

 

Shane.

Message 7 of 20

Hello Shane,

 

The DMA transfers do go over the AXI bus. However, we have not benchmarked DMA transfer latency for small packets. This is something that we may benchmark in the future, but we have not at this point. In general, we expect the use-case for DMA to be high throughput, extended use applications, where latency does not matter nearly as much. In order to get high throughput you end up with a high latency anyway, because the Real-Time processor needs to read large chunks of data at a time in order to keep up with the FPGA.

 

In general, we assume that interrupts will be used for the extremely low-latency applications. This way the Real-Time application does not need to spin in a loop monitoring the FPGA for incoming data. If you need to transfer a lot of data but also need low latency, you might consider a combination of interrupts and DMA: the interrupt transfers the identifying bits (or whatever needs to run at low latency), and the DMA follows up with the payload.
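The interrupt-plus-DMA handshake described above can be sketched host-side. This is a plain-Python analogue of the synchronization pattern only, not NI API code: a `threading.Event` stands in for the hardware IRQ line, and a `queue.Queue` stands in for the DMA FIFO.

```python
import queue
import threading

irq = threading.Event()   # stands in for the FPGA interrupt line
dma_fifo = queue.Queue()  # stands in for the host-bound DMA FIFO

def fpga_side(payload):
    """FPGA analogue: write the payload into the FIFO, then assert
    the IRQ so the host knows data is ready."""
    for element in payload:
        dma_fifo.put(element)
    irq.set()

def rt_side(timeout_s=1.0):
    """RT analogue: block on the IRQ instead of polling the FIFO,
    then drain the payload that the DMA delivered."""
    if not irq.wait(timeout_s):
        return None  # timed out, no data arrived
    irq.clear()
    data = []
    while not dma_fifo.empty():
        data.append(dma_fifo.get())
    return data

fpga_side([1, 2, 3])
print(rt_side())
```

The point of the pattern is that the host thread sleeps until the interrupt fires rather than burning CPU polling the FIFO, while the bulk payload still moves over the high-throughput DMA path.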

 

Can you explain in more detail your use case (or a generalized example use case) for needing 128 byte packets at extremely low latency? We are definitely interested in knowing more about non-standard use cases.

Colden
Message 8 of 20

Hi Colden,

 

Thanks for the interesting reply.  One thing I don't understand is how interrupts help transfer data (I thought they were the FPGA-RT version of occurrences).  In fact, the help for "Wait on IRQ" on the RT side states that the function will actually hog a CPU and may cause other code to stop executing until it has finished.  This is certainly something I want to avoid.

 

I should perhaps clarify a bit more what I'm referring to.  I should really be using the word "overhead" instead of "latency" when referring to the DMA calls.  Where our control loop is spending significant portions of time waiting for data from the FPGA, this amounts to much the same thing, but it's not strictly the same.  If we wait 6 us for data to arrive and another 6 us before data can be sent (because this is the overhead associated with the DMA transfer), then we have a 12 us delay if we simply loop data back and forth between the FPGA and RT.

 

I have benchmarked DMA FIFOs on a PXIe-8115 with a 7865R and have found that the minimum transfer time for a single DMA is around 6 us (according to RETT).  This is for sending a single U8 over DMA.  For approximately 100 U8s we still need the same time of approximately 6 us.  Only above 100 U8s do we start to see a linear increase in DMA transfer execution time (of 60 ns per element, IIRC).  What this means is that the transfer of data over DMA is only really efficient when we are sending at least 100 elements.  In our application we send less than that, so we're essentially in the region where the DMAs are not operating at full efficiency.

Although we have no plans to incorporate a Zynq into our products, the very low latency of the AXI bus sounded like it could change this behaviour significantly.  If we could lower this DMA overhead of 6 us (equivalent to sending 100 U8s) to, say, 2 us (25 U8s), then our RT loop would actually be able to run faster (we are currently reaching 20 kHz without problems).  If the overhead dropped to 0.4 us (5 U8s), then we could actually consider operating a pipelined DMA transfer within our RT loop, with data being transferred in several sequential packets instead of one "large" monolithic packet.  Between DMA calls we could then execute parts of our RT control loop and thus increase the overall responsiveness (i.e. latency) of our loop.
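The pipelining trade-off above can be put into a back-of-the-envelope sketch. This is a deliberate simplification: it assumes each packet pays the full fixed overhead plus a linear 60 ns-per-element cost, and it ignores the RT work that pipelining would overlap between packets. All numbers are the illustrative figures from this thread.

```python
# Cost of sending n_elements as `packets` sequential DMA chunks:
# each chunk pays the fixed per-transfer overhead, and the payload
# itself costs 60 ns per U8 element (simplified linear model).

PER_ELEMENT_US = 0.06  # 60 ns per U8 element

def split_transfer_us(n_elements, packets, overhead_us):
    """Total bus time for n_elements split into `packets` chunks."""
    return packets * overhead_us + PER_ELEMENT_US * n_elements

# With today's 6 us overhead, splitting 100 bytes into 4 packets is
# far more expensive than one monolithic transfer...
monolithic_now = split_transfer_us(100, 1, 6.0)   # 12 us
pipelined_now = split_transfer_us(100, 4, 6.0)    # 30 us

# ...but at a 0.4 us overhead the splitting penalty nearly vanishes,
# leaving room to interleave control-loop work between packets.
monolithic_axi = split_transfer_us(100, 1, 0.4)   # 6.4 us
pipelined_axi = split_transfer_us(100, 4, 0.4)    # 7.6 us

print(monolithic_now, pipelined_now, monolithic_axi, pipelined_axi)
```

In other words, at 6 us of overhead a four-way split costs 2.5x the monolithic transfer, while at 0.4 us it costs under 20% more, which is what would make the pipelined scheme worth considering.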

 

Other considerations would be using two DMAs in parallel to achieve faster data transfer.  The lower the overhead, the more scenarios would benefit from this.

 

My considerations are mainly theoretical at the moment, but it might come in handy for future developments.

 

Shane.

Message 9 of 20

Hello Intaris,

 

"Wait on IRQ" is just useful for synchronization, so that you can read data from a read/write control only when it is ready to be read. They are definitely similar to occurrences. While an IRQ does transfer a small quantity of data (the IRQ number, if you have multiple), the real payload is in the read/write node that follows. Also, "Wait on IRQ" is blocking (unless it hits a timeout), but the general application requires that you have multiple loops so that it doesn't block other code. "Wait on IRQ" shouldn't hog an entire CPU, although it does use a thread. If you are short on threads then you could get slowed down by Wait on IRQ.

 

In the application you described, the best way to reduce latency of course would be to do everything on the FPGA. But if you do need to get data up to RT, in general you should see improved performance on the cRIO-9068 over similar cRIOs. The dual-core helps quite a bit with this, and my guess is that the AXI bus would help with the DMA latency (although I can't say that for sure because we haven't benchmarked it). That said, PXIe is generally faster than our cRIOs as far as CPU cycles and latency are concerned, partially because you can get a really fast CPU in your PXIe controller.

Colden
Message 10 of 20