NI FlexRIO driver and example code for implementing peer-to-peer (P2P) data streams between a National Instruments FlexRIO device and an NVIDIA GPU over a zero-copy, bidirectional DMA communication path using NVIDIA GPUDirect™.
FPGAs and GPUs are becoming the norm in bleeding-edge performance applications. Traditionally, it has been difficult to combine both platforms within the same system; to share data between an FPGA and a GPU, one typically needed to develop a user-space application to bridge the gap:
However, this approach requires many memory copies, leading to high latency, increased CPU usage, and a drastic reduction in available host controller memory bandwidth and space.
Now, with a new Linux kernel driver, NI FlexRIO FPGA modules can perform true peer-to-peer DMA communication with an NVIDIA Tesla/Quadro GPU:
This opens the door to applications that need higher memory bandwidth, increased application performance, or other benefits of direct P2P communication between an open FPGA and a CUDA-enabled GPU.
FPGAs are hugely powerful and have some key benefits:
However, FPGAs are not great for every application and there are some known drawbacks:
Traditionally, the drawbacks of FPGAs and ASICs could only be addressed by host application code targeting a bus-connected CPU, which can execute these more advanced algorithms as a series of instructions defined by a user program. A CPU is generally measured by the number of operations/instructions it can execute per unit of time (in industry, usually expressed as core clock speed), so for operations on large data sets, performance is directly related to how fast we can serially crunch through the data set.
However, as the above shows, there is a large discrepancy in floating-point compute capability between the CPU and the GPU. This is because GPUs are specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about, and how GPUs got their start) and are therefore designed so that more transistors are devoted to data processing rather than data caching and flow control.
This makes GPUs an advantageous platform for data-parallel computations, in which a highly arithmetic algorithm is executed on many data elements in parallel, as opposed to performing many memory-based operations that require sophisticated flow-control silicon (the strong suit of CPUs, where latency and instruction optimization are key). We can leverage the advantages of the GPU hardware platform with CUDA programs.
"…a general purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU" (from the NVIDIA CUDA C Programming Guide)
CUDA C extends the C language by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, rather than only once as with regular, serial C functions. Support also exists for other languages and frameworks, such as Fortran and OpenCL.
For more information on the CUDA language and best practices, see the NVIDIA CUDA C Programming Guide.
Using the included examples as reference, the typical development flow is:
Note on Desktop/Chipset Selection: Both the FlexRIO and the GPU should share the same PCIe root complex and ideally should be separated only by PCIe switches. However, some chipsets and physical slot placements cause the PCIe path to traverse a CPU/IOH or QPI/HT link, which can cause serious performance degradation or even failure of DMA communication. For more information, see here
Currently, NVIDIA's GPUDirect functionality is only supported on Linux operating systems. Furthermore, FlexRIO driver support for Linux requires one of the following distributions:
Note: FlexRIO no longer officially supports x86 systems. See NI_FlexRIO-16.0.0_P2P_GPU_Driver/README.txt for more information.
This example was mainly tested and developed with the most recent version of CentOS 7 available at the time. If using a newer kernel version, it may be necessary to update to a newer version of NI-KAL than is included with the driver. At a minimum, the following packages should be installed (CentOS instructions shown as an example):
$ yum -y groupinstall "Development Tools"
$ yum -y install avahi gcc kernel-devel-$(uname -r) libstdc++.i686
(Optional) Install gnuplot for graphing functionality with:

$ yum -y install gnuplot
To develop a bitfile from LabVIEW FPGA and have C API support, the following should be installed on a Windows development machine:
As mentioned above, the Linux development machine should have no NI drivers installed other than the FlexRIO P2P GPU driver version included with this example. Upgrading to a different version of the FlexRIO driver or installing other NI drivers is likely to break compatibility with P2P GPU functionality.
$ tar xzf FlexRIO_P2P_GPU-0.1.tar.gz
$ cd FlexRIO_P2P_GPU-0.1/NI_FlexRIO-16.0.0_P2P_GPU_Driver/
$ sudo sh INSTALL
Verify the CUDA installation with the `nvcc` compiler. For more information, see the NVIDIA CUDA Installation Guide for Linux.
Use `lsni64` to view all connected NI devices (and the associated RIO handle of the FlexRIO) and `deviceQuery` to view connected CUDA GPU devices. Then compile and run the CUDA example:
$ ./throughput_test -b NiFpga_FPGA_main.lvbitx \
    -s "3D8FA985BF4824A9C2343697C9135C49" -r "RIO0"
...and for the `GPU_FFT` example:
$ ./GPU_FFT -la -b ./NiFpga_FPGA_Main.lvbitx \
    -s "DAA4B54616BF18D27170CFDD9178EF17" -r "RIO0" > SimSignal.dat
$ gnuplot gnuplot_conf
The last line runs gnuplot against the data file `SimSignal.dat` and outputs a PNG file of the power spectrum.
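For reference, a minimal gnuplot configuration for this kind of plot might look like the sketch below. This is only an illustration of the idea; the `gnuplot_conf` file shipped with the example is authoritative, and the output file name and axis labels here are assumptions:

```gnuplot
# Hypothetical sketch of a gnuplot_conf; the shipped file may differ.
set terminal png size 800,600
set output "PowerSpectrum.png"   # assumed output file name
set xlabel "FFT bin"             # assumed axis labels
set ylabel "Power"
plot "SimSignal.dat" with lines title "Power spectrum"
```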