So I'm starting a new thread to discuss the possibility of a presentation covering FPGA 'Best Practices', or common pitfalls/design choices.
This began in the main "I wish there was a presentation on..." thread but is continued here to avoid derailing that thread too much further.
I'd like a presentation on common patterns in FPGA code.
This presentation from 2014(?) is somewhat similar: LabVIEW FPGA Design Patterns and Best Practices (NIWeek 2014?)
I'd like to know about common mistakes and the better way of writing FPGA-based code.
Tom McQuillan: Possible presentation on GoF designs (with Sam Taggart?)
Me: Might not be exactly what I'm imagining - GoF patterns often require dynamic dispatch.
Terry Stratoudakis (Terry_ALE):
I am interested in this but first some comments and questions.
... software optimizations and techniques are mostly single core minded where on an FPGA things are spatial and so forth.
Has a new thread for this been made?
Yes, here now.
I gave a talk in May 2020 https://www.youtube.com/watch?v=i_nC_sGOqUw&t which talks about some of these techniques at a general level. How does this compare to what you are looking for?
I enjoyed that presentation but it mainly focused (as I understand it) on making faster things possible.
My problems are not necessarily related to making things as fast as possible, but rather about making them as readable/conceptually understandable as possible.
There are good LabVIEW FPGA shipping examples that have best practices as well.
Perhaps some review of these best examples could form the beginning of this hypothetical presentation (or if nobody submits this presentation, I'd be happy to receive some pointers here)
Other best practices can be found in the VST2 and RTSA code but they are not openly available. A talk could be made that speaks to those practices without revealing the code.
Also, what is typical application and NI hardware (i.e. cRIO or PXI)?
For me, cRIO, but I'd like to think that the problems I'm facing might not be specific to the hardware or the clock speeds. I guess that as speeds get faster and faster, more sacrifices to readability might need to be made though...
To give a concrete example of what I might mean with regards to pitfalls/design choices, I'll describe some cRIO code I've been recently rewriting.
My system uses some NI-9402 modules to communicate via SPI with a PCB that I designed, which contains an ADC and an "octal switch" (see ADG714). The switch controls various "almost static" inputs to the ADC, for example the shutdown, reset and oversampling digital inputs.
Most of the time, the ADC acquires continuously (this could be controlled by either the RT system, or by a switch using an NI-9344 module). The results are streamed over DMA FIFO to the RT system, which bundles them together in nice packages for communication to a desktop system, for logging, display, further analysis, etc.
Sometimes we might want to change some settings - e.g. oversampling ratio, or the sample rate, etc. To change something like the oversampling rate, the ADC must stop acquiring, the ADG needs to be updated with new values, the ADC must be reset (again requiring a pair of changes to the ADG switches), and then the sampling should resume.
Previously, the code ran in a sort of nested state machine structure. To update the settings, the RT system would change some FPGA controls, then set a boolean ("Requesting Update", or something) to true. The FPGA would poll that control, then go through a series of "Updating", "Finished Updating", "Ready to Acquire" like states, allowing the RT system to wait for the Ready to Acquire, then empty the FIFO, then set "Start" to true, resuming the acquisition.
This required lots of different booleans, and states, and seemingly worked at best "most" of the time. Clearly there were some situations in which the end state was not valid, but digging into this mess was pretty tricky - keeping the changes to state in your head continuously wasn't very practical.
This situation was vastly simplified by a recent change I made - now, the FPGA will always acquire a "block" of data of a certain length, depending on an enum "Sample Rate" value, which also includes the number of channels to sample (e.g. typical values are "10kHz x 8Ch", or "50kHz x 3Ch", or similar).
The DMA sends a 'header' element that conveys the contents of the upcoming block - how many elements, how many channels do they represent?
By promising to always output that number of elements (even if some of them are 0, because the acquisition died due to e.g. power failure to the board, or a broken wire, or whatever), the RT system is much simpler.
Now, a new setting request can simply be enqueued on a FIFO to the FPGA, and when the end of a block is reached, the FIFO can be checked to see if it should continue sampling, or change something.
No complicated handshaking is necessary between RT and FPGA.
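To make the block idea concrete, here is a rough software model of it in Python (the real thing is of course LabVIEW FPGA and RT code). It is only a sketch: the header layout, the `RateSetting` fields and every function name are made up for illustration, and the ADC is replaced by a callback that returns `None` when the acquisition has died.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RateSetting:
    """One value of the "Sample Rate" enum (field names are hypothetical)."""
    name: str
    channels: int
    samples_per_block: int


def encode_header(setting):
    # Hypothetical layout: channel count in the upper 16 bits,
    # sample cycles per block in the lower 16 bits.
    return (setting.channels << 16) | setting.samples_per_block


def decode_header(word):
    return word >> 16, word & 0xFFFF


def fpga_stream(initial, requests, read_adc, n_blocks):
    """Model of the FPGA side: yields the elements written to the DMA FIFO."""
    setting = initial
    for _ in range(n_blocks):
        yield encode_header(setting)
        for _ in range(setting.samples_per_block):
            for ch in range(setting.channels):
                value = read_adc(ch)
                # Keep the promise: always emit the element, 0 if no data.
                yield 0 if value is None else value
        # Settings are only looked at on a block boundary.
        if requests:
            setting = requests.popleft()


def rt_read_block(stream):
    """Model of the RT side: read one header, then exactly that much data."""
    channels, samples = decode_header(next(stream))
    flat = [next(stream) for _ in range(channels * samples)]
    rows = [flat[i * channels:(i + 1) * channels] for i in range(samples)]
    return channels, rows
```

The RT side never has to ask "is the FPGA ready?". It reads a header, reads exactly that many elements, and repeats. A settings change is just an item pushed onto `requests`, and it takes effect at the next block boundary.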
I don't know that this is a common problem, or a common solution (enforcing a block of data rather than individual elements, or e.g. 1 sample cycle with N results (one per channel sampled)), but it wouldn't surprise me in hindsight to learn that it was. If I'd considered this approach a long time ago, I could have saved probably a non-negligible amount of time and effort.
At the same time, modifying various parts of the code to use objects and simpler abstractions makes problems easier to spot. For example, a single VI now carries out "Pulse Reset", rather than setting the ADG switches value to 28, then setting "Update Switches", then waiting for "Finished", then setting the values to 12, then "Update Switches", then... One problem this exposed: the ADC is triggered by a pulse on one line, but the results are actually transferred partially during the next sampling cycle. Previously, if the sample rate increased, the "Conversion Start" line could pulse repeatedly during the transfer of a previous sample, leading to a whole collection of "Start Time" values being put on a FIFO with no accompanying data.
Now, it's clearer that this can be a problem and when the sample rate changes, an additional pause is given between the last CONVST on the previous "block", and the first in the new block at a different rate.
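As a rough illustration of the "Pulse Reset" style of abstraction mentioned above, here is a small Python model. The switch values 28 and 12 are the ones from my description, but which of them asserts the reset is an assumption here, and `SwitchBank`, `update` and `pulse_reset` are hypothetical names rather than the actual VIs.

```python
class SwitchBank:
    """Stand-in for the ADG714 octal switch; records every value written."""

    def __init__(self):
        self.history = []

    def update(self, value):
        # Models "write value, set Update Switches, wait for Finished"
        # as a single blocking call.
        if not 0 <= value <= 0xFF:
            raise ValueError("an octal switch only has 8 bits")
        self.history.append(value)


# Assumption: 28 holds the ADC in reset and 12 releases it.
RESET_ASSERTED = 28
RESET_RELEASED = 12


def pulse_reset(switches):
    """One named operation instead of a sequence of magic numbers."""
    switches.update(RESET_ASSERTED)
    switches.update(RESET_RELEASED)
```

The calling code then reads as `pulse_reset(switches)`, and the two-step sequence lives in exactly one place where it can be checked.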
Thanks for the context and feedback. Really helpful. I have not done many cRIO systems but I understand the (general) challenges faced.
One common principle that we look to apply in our projects is a more pronounced design phase that is outside of the LabVIEW environment. For this we look to UML diagram templates.
Another is simulating the system but not just literally in the FPGA sense. This could help one see integration issues up front and in an environment that is easier to troubleshoot. A few years ago we were working on a cRIO based system where the deployment was overseas and there wasn't a lot of room for back and forth. The FPGA was very simple but the RT had its share of complexity. We made a simulated model in Windows and were able to exercise all known scenarios. We are applying this (in concept) to PXI based FPGA systems. Though they are RF and high bandwidth we do this to test the interfaces and low bandwidth logic. There are modules to help shake out issues where we need to run at higher bandwidths.
Anyway, I think applying design techniques outside of LabVIEW tends to feel counterintuitive in the LabVIEW world (myself included). NI teaches us that "it is easy" and "no coding needed", and even if we consciously ignore these statements, they may still influence us at some level. Simulations are another thing I would count among the best practices.
That said I wonder if this 'talk' could be a panel where there are different perspectives with some questions planned (pre-submitted), on the fly, or combination of the two. I feel like I know quite a bit on the subject but I still see things that keep me humble.
The other general issue is that LabVIEW FPGA has a much smaller and quieter community than LabVIEW. The reason is understandable but the result is that there are fewer resources and discussions happening. I find that with LabVIEW FPGA some projects tend to be more proprietary, which leads to fewer discussions. The best thing would be to decouple the principles from the projects. This is not easy but it is really the only way a success can be repeated and a failure can be avoided.
I assume you know of the cRIO Developers Guide http://www.ni.com/pdf/products/us/fullcriodevguide.pdf. Though dated, I am sure it has good stuff in there. I haven't studied it but I assume what you are looking for goes beyond and maybe some things have changed since it was published.
It's not Best Practices, nor is it Common Design patterns, but I did a presentation on creating a Time Weighted Data Averaging Mechanism in FPGA that covered the challenges of how a simple set of code in Windows could eat too many resources in FPGA until converted from a parallel approach to a state machine approach. If that interests you, I can submit it. It was presented to the LabVIEW Architects Forum user group in February 2017 (link to video recording below).
Averaging data in FPGA
Averaging data can take many forms. For a project, I was requested to implement a time weighted average to smooth data spikes. The time weighting formula could be adjusted to a user specified value (# of averages), and also had to account for situations in which older data did not yet exist. FPGA code requires data to be fixed in length. Although not successfully implemented in FPGA during the project (had to move it up to the RT layer), this presentation will show how I was eventually able to implement the code in the FPGA, and some of the changes made to make it scalable for a large number of channels while using minimal FPGA resources.
I think this is an area in great need of exposure. I'm about to unleash a wall of text. I apologise in advance.
I've been working on completely re-architecting our FPGA code for the last few years (among other things).
We have possibly non-typical code on our FPGA. We have a lot of different modules, individually controllable but with defined APIs for interaction between them. Lockins (16x), Oscilloscopes (1 at 40MHz, 2 at 1MHz), AI (24 Channels at 1MHz, 2 at 40MHz), AO (50 Channels at 1MHz, 2 at 40MHz), Digital IO, PLLs, PI controllers (16x), Function generators and so on. These are fairly feature-rich modules. Each AO for example has a 24-bit Setpoint, customisable resolution, offset compensation, Lock-In assignment, Function generator assignment, Limits, Dithering (for increased resolution when using an output filter), Glitch compensation and so on and so forth. A total of 118 bits of configuration data per channel excluding the functionality for dithering and anti-glitch compensation which itself requires 8192 bits per channel. Despite the extensive customisability, we've found ways of handling these which allow for great flexibility AND great resource efficiency. All of this is on a single Virtex 5 FPGA.
Practically all signals created by all modules are available as inputs for all other modules (for monitoring or acquisition). This idea has been a bit of a game-changer for us because we found a way to do it whilst actually decoupling our individual modules from each other. We have approximately 130 individual signals being handled in this way. The flexibility it gives us is tremendous.
We've recently also added code which actually assumes control of other processes on the FPGA. We allow fully-FPGA driven Spectroscopy which can take control of pretty much any output we want to control (want to control the amplitude of the Lock-in - go ahead. Want to acquire the frequency shift of the PLL, sure). The hand-off between RT and FPGA control has been an interesting topic, and I think we have a really nice solution to this. Our code is organised to allow maximum flexibility for modules controlling other modules, it's almost like an Actor Framework on FPGA. Which is funny, because I don't even use AF... I've re-written some of our core modules as LVOOP modules, something which has proven useful in some situations but which comes with some drawbacks, especially when debugging. LVOOP itself at that level is not necessarily a good thing. But LVOOP for many other things reaps great benefits.
The overall organisation of the code might be interesting for some. It might be too dependent on specifics of implementation though, I don't know.
An additional topic I have dealt with is the abstraction of FPGA code. This greatly aids with debugging and for me at least has made some of the LV options for debugging obsolete. My pet peeve is a static FIFO constant linked to the Project in some sub-VI. I try to have ALL static references on my top-level VI. And the only static references we have are Clocks and DMA channels. This enables us to do side-by-side testing of modules because the internal code is inherently decoupled from static influences in the project. The source code instantiates all of the resources it needs.
A third topic, and the one which took the longest to conceive of and implement so that it works, was the abstraction of pipeline delays in code. Because our code modules are so interconnected, we utilise a lot of dependency injection in our FPGA code (again, not necessarily LVOOP, but same principle), so for some of our components the latency of the code portions is not fixed (especially BRAM latency). We found a way to handle this: we automatically calculate the latencies involved and ensure that all of the data flowing through a module stays synchronised, so that we don't get any mixing of channels when multiplexing functionality over channels. We do this by abstracting the "latency" property for our concrete implementations and utilising constant folding, which is enforced by a little-known utility that throws an error in the early stages of preparing for compilation if a signal connected to a certain VI is not a constant. This led to the creation of a "deadline" for a lot of our functions. It allows us to determine the final latency required for a group of inter-related functions (via constant folding), which we then pass to each function; each function, knowing its own latency, knows how much it needs to pipeline its output to keep all paths aligned. This allows us to perform latency-balancing on a code-for-code basis via constant folding without any extra cost in fabric. It costs time in queuing up the compile unfortunately, but has saved a lot of headaches otherwise. It sounds like a huge amount of complexity, but although it initially has a steep learning curve, it forces us to handle the intricacies of our pipelining delays up-front (which is surely good design practice). Once the latency balancing has been implemented, we can essentially forget all about it for that module.
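For anyone trying to picture the "deadline" idea, here is a toy Python model of it. It is only a sketch of the principle: the class and function names and the example latencies are invented, and in the real code the arithmetic happens at compile time through constant folding, not at run time as it does here.

```python
from collections import deque


class DelayLine:
    """A chain of `depth` pipeline registers, initialised to 0."""

    def __init__(self, depth):
        self.depth = depth
        self.regs = deque([0] * depth, maxlen=depth)

    def step(self, x):
        if self.depth == 0:
            return x
        out = self.regs[0]   # oldest value leaves the chain
        self.regs.append(x)  # newest value enters it
        return out


class Stage:
    """A function with its own latency, padded up to a shared deadline."""

    def __init__(self, latency, deadline):
        if latency > deadline:
            raise ValueError("stage is slower than the deadline")
        self.core = DelayLine(latency)            # the function's own delay
        self.pad = DelayLine(deadline - latency)  # extra registers to align

    def step(self, x):
        return self.pad.step(self.core.step(x))


def deadline_for(latencies):
    """The deadline for a group of related functions is its slowest member."""
    return max(latencies)


# Hypothetical latencies, in clock ticks, for three parallel paths.
latencies = {"bram_lookup": 3, "multiply": 1, "passthrough": 0}
deadline = deadline_for(latencies.values())
stages = {name: Stage(lat, deadline) for name, lat in latencies.items()}
```

Each stage only needs two numbers: its own latency and the deadline for its group. If a BRAM implementation with a different latency is injected later, only the deadline changes and the padding follows from it.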
I'd love to get at least some of this information out there.