So we all use Feedback nodes in FPGA code, right? Delay and inter-path synchronisation are very important when programming FPGAs at high clock speeds.
Typically we use a Feedback node to delay one signal relative to another so that the receiving code is presented with parameters which "belong together". The normal way of doing this is shown below:
The resource utilisation for this is A x B Registers, where A is the bit width of the data and B is the required delay in clock cycles. For a 24-bit value with a delay of 8, this requires 192 Registers. A delay of 16 will cost 384 Registers. Larger delays become prohibitive to implement.
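As a quick sanity check of the A x B figure, here is a small Python sketch which reproduces the numbers above. The function name is made up purely for illustration; it is not part of any LabVIEW or Xilinx tool.

```python
def feedback_node_registers(bit_width: int, delay: int) -> int:
    """Registers used by a plain register-based delay line.

    One register is needed per data bit for every clock cycle of delay.
    """
    if bit_width < 1 or delay < 0:
        raise ValueError("bit_width must be >= 1 and delay must be >= 0")
    return bit_width * delay


if __name__ == "__main__":
    # 24-bit data with the delays discussed in the text.
    for delay in (8, 16):
        print(f"delay {delay}: {feedback_node_registers(24, delay)} registers")
```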
But did you know that most Xilinx chips have built-in delay circuitry within LUTs which can be used as shift registers up to 16 elements deep (SRL16)? These primitives ("Discrete Delay") can be found (weirdly enough) under the High Throughput math palette (on my Virtex 5 target at least).
Resource requirements for this primitive do not grow with every cycle of delay: beyond the first cycle, each started block of 16 cycles costs A Registers + A LUTs, i.e. A x ceil((B-1)/16) of each for B >= 2. For a delay of 1, the example shown utilises 24 registers. Changing the delay in the configuration window for this primitive directly shows the usage of a register for a delay of 1, but a register and an SRL for delays of 2-17. Unfortunately, the tool does not display how many SRLs are needed for larger depths. So for a delay of 2 to 17, the resource utilisation is constant (24 Registers, 24 LUTs), delays of 18-33 require 48 Registers and 48 LUTs, and so on. (A 33-cycle delay using plain registers costs 792 registers.)
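The step-wise cost can be written down as a small Python sketch. Note that this model is only fitted to the figures I observed above (24/24 for delays 2-17, 48/48 for delays 18-33, registers only for a delay of 1); it is an assumption on my part and not taken from Xilinx documentation, and the function name is hypothetical.

```python
def discrete_delay_cost(bit_width: int, delay: int) -> tuple[int, int]:
    """Return (registers, luts) for an SRL-based delay.

    Assumed model, fitted to the observed numbers: a delay of 1 uses one
    register per bit; every further started block of 16 cycles adds one
    register and one LUT (SRL16) per bit.
    """
    if bit_width < 1 or delay < 1:
        raise ValueError("bit_width and delay must both be >= 1")
    if delay == 1:
        return bit_width, 0
    # Delays 2..17 -> 1 stage, 18..33 -> 2 stages, and so on.
    stages = (delay - 2) // 16 + 1
    return bit_width * stages, bit_width * stages


if __name__ == "__main__":
    for delay in (1, 2, 17, 18, 33):
        regs, luts = discrete_delay_cost(24, delay)
        print(f"delay {delay}: {regs} registers, {luts} LUTs")
```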
We can actually instantiate these SRLs by configuring a Feedback node to ignore the reset command:
This code is functionally equivalent to the "Discrete Delay" code shown above and also uses the same resources. Of course depending on your FPGA design, it may not be feasible to disable the reset function of the feedback node. But the ability to implement a deep feedback with SRL16s can be a great way to save resources if you are not LUT starved.
So what are the other differences between Feedback nodes and "Discrete Delay"?
A really cool feature of the "Discrete Delay" primitive is its support for dynamic access to the output data. By configuring the "Discrete Delay" to accept dynamic addressing, we can set the length to 16 but then wire in a selector which will return a specific value in the pipeline. Feedback nodes cannot do this.
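To illustrate the idea of dynamic addressing, here is a minimal behavioural model in Python. It is only a sketch of an addressable shift register in general: the class name is hypothetical, and how the LabVIEW primitive maps its selector input to an actual delay in clock cycles is not modelled here.

```python
from collections import deque


class AddressableDelay:
    """Behavioural sketch of a fixed-depth shift register with a selectable tap."""

    def __init__(self, depth: int = 16, initial: int = 0) -> None:
        # Index 0 holds the newest sample, index depth-1 the oldest.
        self._taps = deque([initial] * depth, maxlen=depth)

    def clock(self, data_in: int, address: int) -> int:
        """Shift in one sample, then return the sample shifted in `address` calls ago.

        address 0 returns the sample just shifted in.
        """
        if not 0 <= address < len(self._taps):
            raise IndexError("address outside the shift register depth")
        self._taps.appendleft(data_in)  # the oldest sample falls off the far end
        return self._taps[address]


if __name__ == "__main__":
    line = AddressableDelay(depth=16)
    # The tap position can be changed on every call without rebuilding the line.
    print([line.clock(sample, 3) for sample in range(10, 16)])
```

The point is that the depth stays fixed at 16 while the selector picks which element is read, which is exactly what a Feedback node with a fixed delay cannot offer.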
Perhaps the most obvious difference is that a "Discrete Delay" node does NOT actually allow feedback. If you want feedback, the aptly named feedback node will be required in addition to the "Discrete Delay".
I personally still prefer to use feedback nodes because this helps with code portability and readability.
Write a VI with a case structure with a Feedback node of a different delay in each case, let's say from 0 to 16 delay. Now, if you use this sub-VI and wire in a constant on your BD, the case corresponding to the constant you have wired will be included whereas all others will be removed (unreachable code elimination). I have double checked, and the code removal is performed by LabVIEW before the code is sent to the Xilinx compiler.
If you want a super easy-to-use version, create a VI macro so that the datatype of the Feedback node can also autoadapt to usage. You can then utilise a single "delay" VI in your FPGA code (included, copy to LabVIEW/user.lib/macros : create the directory if none exists).
The unreachable code removal of the delay cases is done automatically by the LabVIEW compiler. If you have multiple paths of code which may all have different latencies, it is feasible to perform latency calculations to figure out which code path needs which delay to operate properly, and it will all be constant folded when compiling. The complexity of the calculations can sometimes become rather large, though, and care is needed to make sure that the code is foldable (For Loops are not a good idea, for example). An idea born out of this is located HERE.
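The latency calculation itself is simple arithmetic: every path is padded up to the latency of the slowest one. A Python sketch of that arithmetic follows. The path names and latencies are invented for the example, and in LabVIEW FPGA the same calculation would of course be done with constants on the diagram so that it folds away at compile time.

```python
def balancing_delays(path_latencies: dict[str, int]) -> dict[str, int]:
    """Return the extra delay each path needs so all paths arrive together."""
    if not path_latencies:
        return {}
    if any(latency < 0 for latency in path_latencies.values()):
        raise ValueError("latencies must be >= 0")
    longest = max(path_latencies.values())
    # The slowest path needs no extra delay; every other path gets the difference.
    return {name: longest - latency for name, latency in path_latencies.items()}


if __name__ == "__main__":
    # Hypothetical latencies in clock cycles for three parallel code paths.
    latencies = {"setpoint": 2, "filter": 9, "scaling": 5}
    print(balancing_delays(latencies))
```

Each resulting value is what would be wired as the constant into the single "delay" VI on the corresponding path.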
Great post. I do a fair amount of FPGA development and understanding how the Xilinx primitives work is very useful. There are not many tutorials or much other information on using the Xilinx primitives, so I appreciate whatever I can find on them. Thank you.
Thank you very much for your post! I, too, do a lot of high-speed FPGA programming and I look for any information like this. Now I have one more idea of what to do when I run out of registers.
I'm doing some self-necro in order to give a couple of further tips in this direction. Or to be more specific, passing on some problems which users may encounter.
I'll be referencing a XILINX PDF in this post. It describes some of the lower-level details of the 7-Series chips. It can be found HERE. I recommend trying to become in some way familiar with the contents of this document, however superficial that knowledge may be. Nearly everything I mention here retains validity with regard to Virtex 5 targets, where I spend nearly all of my time.
Slices which can act as shift registers are actually different hardware from "normal" Slices. Any given slice on an FPGA chip will be either a SLICEM or a SLICEL (mentioned on Page 10 of the linked document). Functionally, SLICEM is a superset of SLICEL, with precisely the SRLs and Distributed RAM being the difference between the two. More detailed info on the way SRLs work is available from Page 34 of the linked PDF. I'll ignore Distributed RAM because I know nothing on that topic.
Any given chip will have only a portion of its slices available as SLICEM. For example, the 7k420T chip has a total of 65150 Slices; 41400 of them are "normal" SLICEL and only 23750 of them are SLICEM (Page 11 of PDF). So if we're using huge amounts of SRLs to handle delays (for either very wide or very deep delays), we may run into problems. Especially because, even if we have only a delay of 1 (where an SRL is really not required), using the "Discrete Delay" option will REQUIRE an SRL, possibly starving other code portions of the required resources.
So be aware that it is possible to receive an error from the Xilinx compiler even if you're only using 50% of your LUTs. This is because not all LUTs are equal. Some can do more than others. The number of LUTs available as SRL is a lot smaller than the overall number of LUTs available.
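To put rough numbers on this, here is a small Python sketch using the 7k420T slice counts quoted above. It assumes four LUTs per slice (as on 7-Series devices) and, purely for simplicity, one SRL per LUT; the names are made up for the example.

```python
TOTAL_SLICES = 65150
SLICEM_SLICES = 23750
LUTS_PER_SLICE = 4  # assumption: four LUTs per slice, as on 7-Series devices

TOTAL_LUTS = TOTAL_SLICES * LUTS_PER_SLICE
SRL_CAPABLE_LUTS = SLICEM_SLICES * LUTS_PER_SLICE


def srl_demand_fits(srl_luts_needed: int) -> bool:
    """True if the requested number of SRL LUTs fits into the SLICEM LUTs.

    Simplification: one SRL per LUT, nothing else competing for SLICEM.
    """
    return 0 <= srl_luts_needed <= SRL_CAPABLE_LUTS


if __name__ == "__main__":
    share = SRL_CAPABLE_LUTS / TOTAL_LUTS
    print(f"{SRL_CAPABLE_LUTS} of {TOTAL_LUTS} LUTs can be SRLs ({share:.1%})")
    # A design using 50% of all LUTs, every one of them as an SRL, cannot fit.
    print(srl_demand_fits(TOTAL_LUTS // 2))
```

So on this chip only a little over a third of the LUTs can act as SRLs, which is why a design can fail well before the overall LUT count looks critical.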