LabVIEW


Host to Target DMA transfer timeout

Solved!

Hi All,

 

I've programmed a 7975R FPGA to perform a cross-correlation between data read from two separate disk drives (HDD-8264/8265 RAID arrays). The data sets are quite large (>100 GB), so the FPGA is used instead of standard processors to gain a significant advantage (>100x) in processing time. To perform the stream, I first fill the host buffer for both transfers, start the DMA FIFO transfers, and then manage them on the host end with two DMA FIFO write loops running in separate threads (while loops). The two loops are effectively identical, each using the TDMS Advanced Asynchronous Read VI to perform the transfer.
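For reference, the host side is structured roughly like the sketch below. This is written against the nifpga Python package purely for illustration; the bitfile path, resource name, FIFO names, file paths, and chunk size are placeholders, and the real application is LabVIEW host VIs using TDMS Advanced Asynchronous Reads.

```python
import array
import threading
from nifpga import Session

CHUNK = 1 << 20  # samples per host write; tune against the host buffer size

def stream_file(fifo, path):
    """Keep one host-to-target DMA FIFO fed with 16-bit samples from one drive."""
    with open(path, "rb") as f:
        while True:
            raw = f.read(2 * CHUNK)               # 2 bytes per sample
            if not raw:
                break
            samples = array.array("H")            # unsigned 16-bit samples
            samples.frombytes(raw)
            fifo.write(samples, timeout_ms=5000)  # blocks until buffer space frees up

with Session("xcorr.lvbitx", "RIO0") as session:   # placeholder bitfile/resource
    fifo_a = session.fifos["HostToTarget_A"]       # placeholder FIFO names
    fifo_b = session.fifos["HostToTarget_B"]
    for fifo in (fifo_a, fifo_b):
        fifo.configure(8 * CHUNK)                  # host-side DMA buffer depth
        fifo.start()
    workers = [
        threading.Thread(target=stream_file, args=(fifo_a, "raid_a/set_a.bin")),
        threading.Thread(target=stream_file, args=(fifo_b, "raid_b/set_b.bin")),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```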

 

The problem is that I keep getting FIFO timeouts, which I believe are due to underflow on the FPGA end: the host processor cannot push data to the FPGA through the DMA engine quickly enough. I first compiled the FPGA at 125 MHz, which times out immediately. When I lowered it to 80 MHz, it transferred successfully for a while but consistently timed out (underflowed) after ~1-2 minutes. I lowered the FPGA rate even further (down to 40 MHz) and it performed roughly the same, which I found surprising. I am now trying 10 MHz, but that is too low for our application; it would take far too long to process the data.

 

I am using an 8133 controller. The data is streamed at 16 bits per sample. I have tried changing many of the parameters (FPGA FIFO size, host buffer size, write region size) and there are slight differences in the results, but after a few minutes of running, the application times out. At 40 MHz the transfer rate is 40 M samples/s x 2 (hard drives) x 2 bytes per sample = 160 MB/s, which is well within the specs of the system (roughly < 800 MB/s). The behavior is also confusing in that the timeout rate doesn't improve much going from, say, 80 to 60 to 40 MHz. Other posts I've found generally concern transfers from target to host; I can't find many resources dealing with host-to-target transfers.
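Spelled out, the headroom check is just the following (clock and byte figures from above; the ~800 MB/s is my rough estimate of the system's streaming capability):

```python
# Required host-to-target bandwidth at a 40 MHz processing clock,
# two drives, 2 bytes per sample.
required_mb_per_s = 40e6 * 2 * 2 / 1e6
print(required_mb_per_s, "MB/s needed, vs roughly 800 MB/s available")  # 160.0
```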

 

Happy to post any code, but first it would be nice to hear thoughts on what I just described. 

 

NKM

Message 1 of 11
Solution
Accepted by topic author nkmath

Not sure if I get the entire picture...

 

Underflow is usually the desirable case. Why not simply skip that functionality when there is a timeout?

Message 2 of 11

I agree with Wiebe:

 

Why is a timeout on reading the DMA FIFO on your FPGA target a problem? Just use the timeout boolean output to toggle whatever work the FPGA code is doing.

 

Requiring a gap-less FIFO transfer in order to process your data seems like a very limiting choice.

Message 3 of 11

Hi Wiebe/Intaris,

 

Ah, OK, that is probably the best solution. Part of the issue is that I have not comprehensively built in handshaking, so I think the ideal thing is to work towards that. But given the current size of the code, I think it will take some time to implement, and I am a bit worried about running out of FPGA resources.

 

But to be clear, what you are suggesting is to toggle the rest of the code off the FPGA FIFO timeout indicator through a case structure. I am wondering about the execution of a feedback node within a case structure, all inside an SCTL. Say the case structure selector over three successive clock cycles is T/F/T. My understanding is that the value stored in the feedback node is held constant during the "F" cycles, and only advances (taking in the next value) during the "T" cycles. I will test this out for myself, but at the moment I am away from the LabVIEW workstation I use. If this holds, then I think it is straightforward to bring this into the code and get it working at much higher rates.
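In the meantime, here is a plain-Python model of the behavior I'm expecting from an enable-gated feedback node (purely illustrative; not verified against the LabVIEW compiler):

```python
def gated_register(enables, inputs, init=0):
    """Feedback node inside a case structure in an SCTL: the stored value
    only advances on cycles where the case executes (enable True)."""
    stored = init
    outputs = []
    for enable, value in zip(enables, inputs):
        outputs.append(stored)   # value seen downstream on this cycle
        if enable:               # case executes: feedback node loads the new value
            stored = value
    return outputs

# T/F/T example: the value loaded on cycle 0 is held through the F cycle
# and only advances again on the next T cycle.
print(gated_register([True, False, True], [10, 20, 30]))  # [0, 10, 10]
```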

 

Thanks for the suggestions. I agree that depending on a loss-less transfer is limiting if it can be avoided, and it will save a lot of headache/worry while the application is running as well. I was just surprised by the clock rates I needed for loss-less transfer and got stuck on that. I still think the buffer should be able to keep up at, say, < 80 MHz rates, but avoiding this dependency is a better solution regardless.

 

Best

 

NKM      

 

 

Message 4 of 11

@nkmath wrote:

 

Ah, OK, that is probably the best solution. Part of the issue is that I have not comprehensively built in handshaking, so I think the ideal thing is to work towards that. But given the current size of the code, I think it will take some time to implement, and I am a bit worried about running out of FPGA resources.


As long as the consumer/receiver processes data faster than the producer/sender, there is no need for handshaking.

 


@nkmath wrote:

 

But to be clear, what you are suggesting is to toggle the rest of the code off the FPGA FIFO timeout indicator through a case structure. I am wondering about the execution of a feedback node within a case structure, all inside an SCTL. Say the case structure selector over three successive clock cycles is T/F/T. My understanding is that the value stored in the feedback node is held constant during the "F" cycles, and only advances (taking in the next value) during the "T" cycles. I will test this out for myself, but at the moment I am away from the LabVIEW workstation I use. If this holds, then I think it is straightforward to bring this into the code and get it working at much higher rates.


That sounds right to me. That is how a feedback node works.

 

Not sure if it will help much with the skipping, but the feedback node will be useful for parallelization of the get and process parts.

 


@nkmath wrote:

 

Thanks for the suggestions. I agree that depending on a loss-less transfer is limiting if it can be avoided, and it will save a lot of headache/worry while the application is running as well. I was just surprised by the clock rates I needed for loss-less transfer and got stuck on that. I still think the buffer should be able to keep up at, say, < 80 MHz rates, but avoiding this dependency is a better solution regardless.


As long as the consumer consumes faster than the producer produces, this is loss-less.

 

As I see it, there are three situations: the producer is faster than the consumer, the producer is exactly as fast as the consumer, or the producer is slower than the consumer. In short: P>C, P=C, and P<C.

 

P>C is a problem. This can't be made loss-less; even a large buffer won't help, unless the data is finite and the buffer is large enough to hold all of it.

 

P=C should work, but it's hard to establish. I think at some point you would still need to skip FPGA code. The FPGA will run at a fixed rate, and the rate on the PC is also fixed: the worst of both worlds, AFAIC. Unless they are (always) exactly the same, something has to give, and they won't be exactly the same except when the PC gets its data from a source with exactly the same clock frequency as the FPGA. That might be the case on some PXI configurations (when set up correctly), or when the data comes from the FPGA in the first place.

 

P<C is perfect. No hassle with handshaking, just buffers large enough to absorb timing jitter. Simply push the data fast enough to keep the buffer full, and you know for sure the FPGA can handle it.
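A toy occupancy model makes the three cases concrete (plain Python, purely illustrative rates, nothing measured):

```python
def run(producer_rate, consumer_rate, depth, ticks=1000):
    """Toy buffer model: producer adds, consumer removes, once per tick."""
    level, underflows, overflows = 0.0, 0, 0
    for _ in range(ticks):
        level += producer_rate
        if level > depth:
            overflows += 1            # data would be lost
            level = depth
        if level >= consumer_rate:
            level -= consumer_rate
        else:
            underflows += 1           # consumer has to skip this tick
    return underflows, overflows

print(run(1.2, 1.0, depth=100))  # P > C: overflows once the buffer fills
print(run(1.0, 1.0, depth=100))  # P = C: only works if truly in lockstep
print(run(0.8, 1.0, depth=100))  # P < C: consumer occasionally idles, nothing lost
```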

Message 5 of 11

Some updates... I put the previous FPGA code inside a case structure toggled by the timeout indicator. However, what I didn't think about at the time is that I am reading from two separate DMA FIFOs (data stored on separate disks). I need the two streams to be read out together, without falling out of phase with respect to one another. Obviously, if I key off the timeout for, say, FIFO "1", it is possible that the other, FIFO "2", does not time out, which would lead to values in FIFO "2" being read out anyway, falling out of phase and causing issues downstream.

 

To get around this, I first test whether the "Get Number of Elements to Read" FIFO method returns a value greater than zero for both DMA FIFOs. If both are greater than zero, indicating there are elements in each FIFO to be read, I then enable the rest of the FPGA code. I also added a check for possible FIFO overflow, by comparing the number of elements in each FPGA DMA FIFO against a value close to the size of the buffer (my buffer is currently 65547 elements, and I compare against 65500). If it overflows, a boolean indicator is latched to True, which the host can monitor.
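In text form, the per-cycle logic amounts to something like the sketch below (Python used purely as pseudocode for the dataflow; on the FPGA these are FIFO method nodes and a case structure inside the SCTL, and the deques just stand in for the DMA FIFOs):

```python
from collections import deque

OVERFLOW_THRESHOLD = 65500   # just under the 65547-element FPGA buffer

def sctl_cycle(fifo_a, fifo_b, overflow_latch, process):
    """One iteration of the single-cycle timed loop described above."""
    a_avail = len(fifo_a)    # "Get Number of Elements to Read" method
    b_avail = len(fifo_b)

    # Latch an overflow flag for the host to monitor; once True it stays True.
    overflow_latch |= a_avail > OVERFLOW_THRESHOLD or b_avail > OVERFLOW_THRESHOLD

    # Only advance the cross-correlation when BOTH streams have data, so the
    # two channels can never fall out of phase with each other.
    if a_avail > 0 and b_avail > 0:
        process(fifo_a.popleft(), fifo_b.popleft())

    return overflow_latch

# Example: stream B briefly runs dry, so processing simply pauses for a cycle.
fifo_a, fifo_b = deque([1, 2, 3]), deque([10, 20])
latch = False
for _ in range(3):
    latch = sctl_cycle(fifo_a, fifo_b, latch, lambda a, b: print(a, b))
```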

 

The code compiled successfully at 80 MHz, but unfortunately I am now getting FIFO overflows in at least one of the buffers. I am recompiling the FPGA so that I can monitor each FIFO independently, to see whether the behavior is common to both or specific to one. Given the size of the FPGA buffer, I expect a little over 500 us of waiting on one DMA FIFO before the other fills up. I am not sure of the specs of the DMA engine, but I would imagine it should be able to handle this. I have handled transfers from target to host at 125 MHz with only a 1023-element buffer, but that was with a single DMA FIFO. Maybe there are limitations to using two DMA FIFOs at the same time?

 

NKM

Message 6 of 11

Actually... maybe this is not an issue. The code is now compiled, and from monitoring the number of elements in the FPGA buffer I can see that the DMA engine actively works to keep the FPGA buffer as full as possible. In hindsight, this should have been obvious... (doh!)

 

My impression now is that if one of the buffers gets drained (the underflow condition that started this thread), the other one, even if full, will simply 'sit and wait', so there will be no loss of data. In that case, I imagine I should be able to make the FPGA buffer as small as possible without issue. 

 

OK, now I just need to test on data to ensure that I'm getting the correct result. 

Message 7 of 11

Yes. The actual buffer size required on FPGA is usually much smaller than one would intuitively think.

Message 8 of 11

Also note that the FPGA FIFO buffer is not the only buffer that's there. Even if it's full, the PC might be able to buffer values. The DMA FIFO sizes on the PC side can be changed, IIRC. 

Message 9 of 11

Hi All, 

 

Final update: I got the code compiled at 90 MHz and it seems to be working well! That is a 9x improvement over where it was working before, now that the FPGA code is toggled by monitoring the FPGA FIFOs, so thanks for the help. My computation time just went from ~40 hours down to about 4...

 

Regarding the PC-side FIFO, yes, this can be changed with the FIFO.Configure method of the Invoke Method node. This is of course very important when going from target to host; in our case we have to optimize this buffer to allow continuous streaming at a 250 MHz sampling rate. Also, when I was attempting the loss-less host-to-target transfer without toggling the FPGA code, adjusting the buffer/write sizes relative to this FIFO did affect the timeout rate.
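For anyone who lands here later, the rough equivalent of that host-buffer configuration in the nifpga Python package is below (purely illustrative; the bitfile, resource, and FIFO names are placeholders, and in LabVIEW this is the FIFO.Configure Invoke Method):

```python
from nifpga import Session

with Session("xcorr.lvbitx", "RIO0") as session:
    fifo = session.fifos["HostToTarget_A"]
    # Request a deeper host-side DMA buffer; the driver may round the size up.
    # This host buffer is separate from the FPGA-side FIFO depth.
    fifo.configure(4 * 1024 * 1024)
```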

 

Cheers

 

NKM

Message 10 of 11