As far as throwing errors during the loop, you can accomplish this by moving the "Simple Error Handler.vi" inside the loop and removing the shift registers for the error wire. This will cause the error reporting to occur within the loop.
The differences you are seeing between the two systems are expected. You are comparing a 10Gb LAN to a PCI bus. The PCI bus has a maximum sustained transfer rate of 110MB/s (133MB/s instantaneous) vs. the 10Gb/s of the LAN. The PCI bus is also subject to nondeterministic rates as controlled by your program, so the best-case scenario of 110MB/s will be tough to sustain and dependent on the state of your system at any given time. The fact that Windows is a much heavier operating system and the transfer method is significantly different is more than enough to account for the difference in transfer rate. It is essentially dependent on the software timing of your operating system, and Windows and Linux are VERY different in terms of background processes and general overhead. The difference has nothing to do with the performance of Windows vs. Linux LabVIEW, but rather the characteristics of the operating systems themselves as well as the physical transfer media.
Thanks for the tip on the error handling.
Regarding the rates, the physical interface is PCI express, not PCI. According to Wikipedia (https://en.wikipedia.org/wiki/PCI_Express), a Gen1 PCI express bus has a bandwidth of 250MB/s for each lane and the NI card has 4 lanes. So, this means 1GB/s. The 10Gb Ethernet is about the same (if I just divide by 8 bits, it is 1.25GB/s). The NI website (http://sine.ni.com/nips/cds/view/p/lang/en/nid/213000) has the following statement "High-speed, low-latency PCI Express x4, 800 MB/s connection to the host". The Ettus website (http://www.ettus.com/kb/detail/usrp-bandwidth) advertises 200 MS/s full duplex over the same PCIe link with the same hardware. I don't think the PCIe link is any significant part of the problem.
Perhaps the problem is Windows, but I'm not sure how I can tell if it is Windows or LabVIEW. Or for that matter, perhaps it is my VI (which is just the example rx streaming VI which I modified to remove the GUI displays in order to speed it up). But, with no other example to go on, my current view is that it is a combination of LabVIEW/Windows.
I imagine that someone besides me has tried to run at faster sustained streaming rates than 40 MS/s. I'm wondering if such attempts were successful and if so, how.
As a quick aside:
PCIe gen 1 is 250 MB/s per lane. The USRP RIO supports PCIe gen 1 x4, for 1 GB/s, but that 'G' is 1000, and not 1024. So in terms of bytes, it ends up being ~953 MB/s. Next, we have to remove some of the throughput for packet overhead and lastly the read response packet size. The larger the read response packet, the less packet overhead incurred. In the end, the USRP RIO can stream about 832 MB/s to the host. 800 MB/s is a nice, round number though.
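For reference, the arithmetic above can be sketched like this (note the ~87% efficiency figure is backed out of the quoted 832 MB/s, not an NI specification):

```python
lanes = 4
per_lane_mb = 250                          # PCIe Gen1: 250 decimal MB/s per lane
raw_mb = lanes * per_lane_mb               # 1000 MB/s, but with a decimal 'G'
raw_mib = raw_mb * 1000**2 / 1024**2       # ~953 MB/s in binary megabytes

usable_mb = 832                            # achievable rate after packet overhead
efficiency = usable_mb / raw_mib           # overhead leaves roughly 87%
```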
Each channel has a 16-bit I and a 16-bit Q portion, and there are two channels. Setting the IQ rate (sample rate) to 120 MS/s results in 120 MS/s * 2 channels * 4 bytes/channel = 960 MB/s, which exceeds the total throughput of the PCIe bus. It is impossible to continuously stream both channels at 120 MS/s back to the host. However, lowering the sample rate to 100 MS/s results in 800 MB/s. Or, if you use just one channel, you'll be at 480 MB/s.
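The same throughput calculation, written out (4 bytes per complex sample comes from the 16-bit I + 16-bit Q format above):

```python
def host_link_rate_mb(sample_rate_ms, channels, bytes_per_sample=4):
    """Required host-link throughput in MB/s.
    Each complex sample is 16-bit I + 16-bit Q = 4 bytes."""
    return sample_rate_ms * channels * bytes_per_sample

pcie_budget_mb = 832   # approximate usable PCIe throughput from above
```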
That takes care of the throughput, but there are additional issues. There are other things on the PCIe bus, and so bus hiccups may occur and the memory controller may not service the read requests from the device fast enough. If data is not pulled out of the USRP's FIFOs fast enough, the FIFOs will fill up. You can increase the size of the FPGA's DMA FIFOs in the target's settings in the project. A larger DMA FIFO will make the USRP more resilient to these hiccups.
Next, after the data is transferred from the USRP to the host PC's memory, the host PC has to do something with it. If you have complex math or graphs, the rate at which the host PC can pull data from its FIFO may be too slow. If the host's FIFO fills up, then no more data from the FPGA can be transferred. So, you can (1) increase the host's DMA FIFO buffer depth, or (2) increase the rate at which the host pulls data from its FIFO (i.e., reduce CPU computation).
There are additional tweaks, such as using Zero Copy DMA, too, which can yield large improvements to streaming performance.
So, in conclusion, high throughput streaming is hard, but it's possible. The Sample Project has to cover a wide range of use-cases and care-abouts, so it cannot perfectly solve every problem. But, it can stream at full rate. By default, the FPGA's FIFOs' buffers are around 1000 elements. If you set the sample size to less than 1000, and set the sample rate to 120 MS/s, and do bursty/finite streaming, you will be able to stream at full rate. Your data will not be contiguous though. To do continuous streaming, you'll need to look into the above suggestions.
Hope this helps!
Regarding the max rates through PCIe, I am satisfied that 240 MS/s (120 on each channel) is not possible. You mentioned the possibility of 200 MS/s (100 per channel) but I'm wondering if the decimation rate needs to be an integer such that the max I can expect is 120 MS/s (60 per channel)? Not a big deal though as I would be happy with either one.
Following the suggestions, I have modified the program to increase the size of the host FIFO from 1M to 256M. I am limited to 256M because I get an error if I set it any bigger. I am presently using 32-bit LabVIEW, but I am wondering: if I change to 64-bit LabVIEW, will I be able to increase the size of this FIFO beyond 256M? My PC has 32GB of RAM, so I would like to make the host FIFO as big as possible.
After increasing the size of the host FIFO to 256M and removing all processing from the Rx streaming example while loop (it is doing nothing with the samples, simply fetching and discarding them), I still cannot run forever, even at 60 MS/s (30 per channel). After about 4.5 secs, the FIFO fills up and then an overflow occurs. At 40 MS/s, I can run forever. My question is why the program is not able to keep up at faster rates given that I am discarding all samples and not processing the data in any way. It's clear to me that the problem is not a PCIe bandwidth problem but rather a CPU / host application problem. But, I know that this same PC can work fine in Linux / Ettus UHD with this same USRP at 200 MS/s. Given that I've removed all processing from the VI "fetch" loop, I can't think of any way to speed up the loop. Please let me know if you have any suggestions on this.
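A back-of-the-envelope model of this observation (my framing, assuming the FIFO depth is in elements and production/consumption rates are roughly constant):

```python
def time_to_overflow(fifo_depth, produce_rate, consume_rate):
    """Seconds until a FIFO of fifo_depth elements fills,
    or None if the consumer keeps up (rates in elements/s)."""
    deficit = produce_rate - consume_rate
    return None if deficit <= 0 else fifo_depth / deficit

# Back-calculating: a 256M-element FIFO filling in ~4.5 s implies the
# host loop drained ~57M elements/s slower than the device produced,
# i.e. the fetch loop fell far below the incoming rate.
implied_deficit = 256e6 / 4.5
```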
1) The decimation rate does not need to be an integer as the DDC uses a fractional decimator. You can look into the Calculate Sample Rate.vi to see what sample rates are achievable. This VI will tell you your coerced sample rate based on your sample rate and data rate settings.
2) What is the actual error you get when you increase the FIFO size to 256M? In order to use LabVIEW 64-bit, you would have to develop the code in 32-bit LabVIEW and then interface with it in LabVIEW 64-bit, as only the FPGA interface functions are supported in LabVIEW 64-bit. This may allow you to increase the FIFO size.
3) As far as the FIFO filling up when you run, you can modify the fetch rx data vi to show the number of elements available at every iteration. You should not see this grow if it is in balance, but growing here is a problem. You can also read larger chunks of data from the DMA at a time. Instead of 8196, increase this to around 32784 (4*8196).
Also, you need to increase the DMA FIFO on the FPGA side. You can do this by double-clicking the streams in the project explorer window under FIFOs. In the general settings for the stream, increase the requested number of elements to about 10x the default of 1023. This will reduce susceptibility to PCIe hiccups. You will need to recompile the bitfile and reference it in your application. The FPGA code and bitfile will both need to be developed in 32-bit LabVIEW.
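The effect of the larger fetch size in point 3 is easy to quantify: the host loop must complete fewer reads per second, so fixed per-call overhead is amortized over more data. A rough sketch (60 MS/s is just an example aggregate rate):

```python
def fetch_calls_per_second(sample_rate, samples_per_fetch):
    """How many DMA FIFO reads per second the host loop must complete."""
    return sample_rate / samples_per_fetch

small = fetch_calls_per_second(60e6, 8196)       # ~7300 calls/s
large = fetch_calls_per_second(60e6, 4 * 8196)   # ~1830 calls/s
```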
The exact error I get when I increase the FIFO size to 256M is "Error - 52000 occurred at configure stream.vi. Possible reasons: The specified number of bytes cannot be allocated." Let me know if you think that LabVIEW 64-bit will allow this FIFO to go larger. If we could make this RX FIFO more on the order of 30GB (we have 32 GB PC RAM), this would allow us to stream data for a while before hitting an overflow.
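A possible explanation for the 256M ceiling (assuming 4-byte elements and binary "M"; the element size is my assumption): 256M elements is exactly 1 GiB, and a 32-bit process simply cannot address much more than that.

```python
element_bytes = 4                          # assuming one 4-byte I/Q sample per element
fifo_bytes = 256 * 2**20 * element_bytes   # 256M elements
fifo_gib = fifo_bytes / 2**30              # exactly 1 GiB

# A 32-bit Windows process has at most 2-4 GiB of virtual address space,
# and the FIFO buffer must be a single large allocation, so ~1 GiB is
# already near the practical ceiling. A 30GB buffer would require a
# 64-bit process no matter how much physical RAM is installed.
```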
We modified the VI to show the elements remaining. When we run at 10 or 20 MS/s (2 chan), this number fluctuates but does not grow, and thus we can run continuously without overflow errors. However, when we run at 40 MS/s (2 chan), the number grows until it reaches approx 256M and then the overflow occurs. You mentioned that we should also increase the size of the FPGA FIFO and recompile the FPGA bit file. But, I don't see the point in doing this given that it is the host FIFO that is overflowing rather than the FPGA FIFO. It sure seems like we have reached the streaming limit of the host for some reason.
We also tinkered with the number of samples we fetch at a time - even going as high as 1M. We didn't notice much of a difference as we tried various values for this parameter.
As far as increasing the amount of available virtual memory in LabVIEW, moving to LabVIEW 64-bit (2009 or later) will allow you to access up to 8TB of memory. Because you have 32GB available and the OS will retain some of that for its normal function, I highly doubt you will be able to get 30GB available, but you can certainly increase the virtual memory available to LabVIEW. It's important to note that LVFPGA only partially supports 64-bit operation, so although LabVIEW will have more available virtual memory, the FPGA may not have access to all of it. Please refer to the following document for information on virtual memory capacity in LabVIEW:
As far as the 256M FIFO restriction, I was able to set the requested depth for the FIFO to 4294967247 without error.
Another thing to keep in mind is that the example you are using is a very general example. The example is not optimized for transfer rate. By offloading as much processing as possible to the FPGA and then transferring only final result data to the host, you may be able to improve your effective transfer rate, since you would be transferring less data.
This conversation has already finished, but I feel I should post my information.
To achieve 800MB/s, we need to modify the LabVIEW sample program as listed below.
I achieved 880MB/s (110 MS/s × 2 channels) with a PXIe-8135, PXIe-1082, and PXIe-8374.
You don't need to change the size of the host-side DMA FIFO. (1M is enough.)
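As a sanity check on the quoted figure (using the same 4 bytes per complex sample as earlier in the thread):

```python
rate_ms = 110          # MS/s per channel
channels = 2
bytes_per_sample = 4   # 16-bit I + 16-bit Q
throughput_mb = rate_ms * channels * bytes_per_sample   # MB/s
```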
Application engineer in NI Japan