FPGA Feedback Node Tidbit: Resource Optimisation and usability tip

Intaris · ‎10-25-2016

So we all use Feedback nodes in FPGA code, right? Delay and inter-path synchronisation are very important when programming FPGAs that run at high clock speeds.

Typically we use a Feedback node to delay one signal relative to another so that the receiving code is presented with parameters which "belong together". The normal way of doing this is shown below:

The resource utilisation for this is A x B Registers, where A is the bit width of the data and B is the required delay in clock cycles. For a 24-bit value with a delay of 8, this requires 192 Registers. A delay of 16 will cost 384 Registers. Larger delays become prohibitive to implement.
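As a quick sanity check of the arithmetic above, here is a minimal Python sketch of the A x B rule. The function name is my own invention for illustration, it is not anything from LabVIEW or Xilinx.

```python
def register_delay_cost(bit_width: int, delay: int) -> int:
    """Registers used by a register-only delay: one register per bit per clock cycle."""
    if bit_width < 1 or delay < 0:
        raise ValueError("bit_width must be >= 1 and delay must be >= 0")
    return bit_width * delay


# The 24-bit examples from the post.
for delay in (8, 16):
    print(f"24-bit value, delay {delay}: {register_delay_cost(24, delay)} Registers")
```

The cost grows linearly with both width and depth, which is why deep delays on wide data get expensive so quickly.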

But did you know that most Xilinx chips have built-in delay circuitry within LUTs which can be used as shift registers up to 16 elements deep (SRL16)? These primitives ("Discrete Delay") can be found (weirdly enough) under the High Throughput math palette (on my Virtex 5 target at least).

Going by the figures shown in the configuration window, the resource requirements for this primitive are A Registers for a delay of 1, and A x ceil((B-1)/16) Registers plus the same number of LUTs for a delay of 2 or more. For a delay of 1, the example shown utilises 24 Registers. Changing the delay in the configuration window for this primitive directly shows a register being used for a delay of 1, but a register and an SRL for delays of 2-17. Unfortunately, the tool does not display how many SRLs are needed for larger depths. So for a delay of 2 to 17, the resource utilisation is constant (24 Registers, 24 LUTs), delays of 18-33 require 48 Registers and 48 LUTs, and so on. (For comparison, a 33-cycle delay using registers alone costs 792 Registers.)
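To make the step pattern easier to see, here is a small Python sketch fitted to the numbers quoted above (24/0 for a delay of 1, 24/24 for 2-17, 48/48 for 18-33). The function name and the "one SRL stage per 16 cycles after the first register" rule are my assumptions drawn from those figures, not something the tool documents.

```python
import math


def discrete_delay_cost(bit_width: int, delay: int) -> tuple[int, int]:
    """Return (registers, luts) for an SRL-based delay, fitted to the quoted figures.

    Assumption: the first cycle is covered by a register and every further
    block of up to 16 cycles costs one SRL (LUT) plus one register per bit.
    """
    if bit_width < 1 or delay < 1:
        raise ValueError("bit_width and delay must both be >= 1")
    if delay == 1:
        return bit_width, 0  # a plain register per bit, no SRL reported
    stages = math.ceil((delay - 1) / 16)
    return bit_width * stages, bit_width * stages


for delay in (1, 2, 17, 18, 33):
    registers, luts = discrete_delay_cost(24, delay)
    print(f"delay {delay:2d}: {registers} Registers, {luts} LUTs")
```

Compare the last line with the 792 Registers a register-only 33-cycle delay would need.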

We can actually instantiate these SRLs by configuring a Feedback node to ignore the reset command:

This code is functionally equivalent to the "Discrete Delay" code shown above and also uses the same resources. Of course depending on your FPGA design, it may not be feasible to disable the reset function of the feedback node. But the ability to implement a deep feedback with SRL16s can be a great way to save resources if you are not LUT starved.

So what are the other differences between Feedback nodes and "Discrete Delay"?

A really cool feature of the "Discrete Delay" primitive is its support for dynamic access to the output data. By configuring the "Discrete Delay" to accept dynamic addressing, we can set the length to 16 and then wire in a selector which returns a specific value from the pipeline. Feedback nodes cannot do this.
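For anyone who wants to picture what dynamic addressing does, here is a rough behavioural model in Python. The class name and the address convention (address 0 returns the value written one clock ago) are assumptions made for the sketch. Check the primitive's help for the real convention on your target.

```python
from collections import deque


class AddressableDelay:
    """Behavioural model of a 16-deep shift register with a selectable output tap."""

    def __init__(self, depth: int = 16, initial: int = 0) -> None:
        self._depth = depth
        # Index 0 holds the most recently shifted-in value.
        self._taps = deque([initial] * depth, maxlen=depth)

    def clock(self, value: int, address: int) -> int:
        """Return the value written (address + 1) clocks ago, then shift in `value`."""
        if not 0 <= address < self._depth:
            raise IndexError("address out of range")
        out = self._taps[address]
        self._taps.appendleft(value)  # the oldest value drops off the far end
        return out


line = AddressableDelay()
outputs = [line.clock(sample, address=2) for sample in range(10, 20)]
print(outputs)  # [0, 0, 0, 10, 11, 12, 13, 14, 15, 16]
```

The same storage serves any delay from 1 to 16 just by changing the address, which is the part a Feedback node cannot imitate.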

Perhaps the most obvious difference is that a "Discrete Delay" node does NOT actually allow feedback. If you want feedback, the aptly named feedback node will be required in addition to the "Discrete Delay".

I personally still prefer Feedback nodes because this helps with code portability and readability.

Usage tip:

Write a VI with a case structure with a Feedback node of a different delay in each case, let's say from 0 to 16 delay. Now, if you use this sub-VI and wire in a constant on your BD, the case corresponding to the constant you have wired will be included whereas all others will be removed (unreachable code elimination). I have double checked, and the code removal is performed by LabVIEW before the code is sent to the Xilinx compiler.

If you want a super easy-to-use version, create a VI macro so that the datatype of the Feedback node can also autoadapt to usage. You can then utilise a single "delay" VI in your FPGA code (included, copy to LabVIEW/user.lib/macros : create the directory if none exists).

The unreachable code removal of the delay cases is done automatically by the LabVIEW compiler. If you have multiple code paths which may all have different latencies, it is feasible to calculate which path needs which delay to operate properly, and it will all be constant folded when compiling. The complexity of these calculations can become rather large, though, and care is needed to make sure that the code is foldable (For Loops are not a good idea, for example). An idea born out of this is located HERE.
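The latency calculation itself is simple. As a minimal sketch (in Python rather than LabVIEW, with made-up path names and latencies), each path gets padded up to the latency of the slowest one:

```python
def balancing_delays(path_latencies: dict[str, int]) -> dict[str, int]:
    """Extra delay each path needs so that all paths match the slowest one."""
    if not path_latencies:
        return {}
    longest = max(path_latencies.values())
    return {name: longest - latency for name, latency in path_latencies.items()}


# Hypothetical paths and latencies, in clock cycles.
paths = {"multiply": 4, "filter": 9, "bypass": 0}
print(balancing_delays(paths))  # {'multiply': 5, 'filter': 0, 'bypass': 9}
```

On the FPGA each result would be wired as a constant into the "delay" VI, so only the matching case survives compilation.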

MarkCG · ‎10-27-2016

great post. I do a fair amount of FPGA development and understanding how the Xilinx primitives work is very useful. There is not that much information or tutorials on using the Xilinx primitives so I appreciate whatever I can find on them, thank you.

Riv · ‎11-10-2016

Thank you very much for your post! Me, too, I do lots of programming in FPGA at high-speed and I look for any information like this. Now I have one more idea what to do when I run out of registers.

AL3 · ‎11-24-2016

Very valuable post. Thanks!

nanocyte · ‎12-13-2016

You just saved me 50 very easy slices.

Intaris · ‎06-17-2019

I'm doing some self-necro in order to give a couple of further tips in this direction. Or to be more specific, passing on some problems which users may encounter.

I'll be referencing a XILINX PDF in this post. It describes some of the lower-level details of the 7-Series chips. It can be found HERE. I recommend trying to become in some way familiar with the contents of this document, however superficial that knowledge may be. Nearly everything I mention here retains validity with regard to Virtex 5 targets, where I spend nearly all of my time.

Slices that can act as shift registers are actually different hardware from "normal" slices. Any given slice on an FPGA chip is either a SLICEM or a SLICEL (mentioned on Page 10 of the linked document). Functionally, SLICEM is a superset of SLICEL, with the SRLs and Distributed RAM being precisely the difference between the two. More detailed info on the way SRLs work is available from Page 34 of the linked PDF. I'll ignore Distributed RAM because I know nothing on that topic.

Any given chip will have only a portion of its slices available as SLICEM. For example, the 7k420T chip has a total of 65150 slices: 41400 of them are "normal" SLICEL and only 23750 are SLICEM (Page 11 of the PDF). So if we're using huge amounts of SRLs to handle delays (either very wide or very deep), we may run into problems. This is especially so because, even if we only have a delay of 1 (where an SRL is really not required), using the "Discrete Delay" option will REQUIRE an SRL, possibly starving other code portions of the resources they need.

So be aware that it is possible to receive an error from the Xilinx compiler even if you're only using 50% of your LUTs. This is because not all LUTs are equal. Some can do more than others. The number of LUTs available as SRL is a lot smaller than the overall number of LUTs available.
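To put a number on that, here is a back-of-the-envelope Python sketch using the slice counts quoted above. It assumes four LUTs per slice, which is how the 7-Series slices are organised. The variable names are mine.

```python
LUTS_PER_SLICE = 4  # assumption: four LUTs per slice, as on 7-Series devices

total_slices = 65150   # 7k420T figures quoted above
slicem_slices = 23750  # only these can host SRLs

total_luts = total_slices * LUTS_PER_SLICE
srl_capable_luts = slicem_slices * LUTS_PER_SLICE
share = srl_capable_luts / total_luts

print(f"total LUTs:       {total_luts}")
print(f"SRL-capable LUTs: {srl_capable_luts}")
print(f"share:            {share:.1%}")
```

So only a little over a third of the LUTs on this chip can be used as SRLs, and a design can exhaust them long before overall LUT usage looks alarming.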

winterishere8 · ‎07-02-2020

Amazing post - thank you.

Question. Let's say I have a loop in FPGA with 16 monitored variables. For each variable, I'll need 2 of its previous values, so 2 feedback nodes per variable = 32 feedback nodes in the loop. I don't understand whether all 32 feedback nodes can execute in one tick or whether it takes 32 ticks.

Kudos are the best way to say thanks 🙂

GerdW · ‎07-02-2020

Hi winter,

@winterishere8 wrote:

Let's say I have a loop in FPGA with 16 monitored variables. For each variable, I'll need 2 of its previous values, so 2 feedback nodes per variable = 32 feedback nodes in the loop. I don't understand whether all 32 feedback nodes can execute in one tick or whether it takes 32 ticks.

All feedback nodes will (usually) execute within one tick…

Best regards,
GerdW

using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019

usman66 · ‎02-27-2024

Thanks for the helpful post.

Just to confirm my understanding, if I am short on LUTs, I should use Feedback node instead of discrete delay, is that correct?

In my designs, I usually run out of LUTs before BRAMs or DSPs. Are there any design guidelines for saving LUTs?

Intaris · ‎02-28-2024

If your delay stage is more than 1 cycle, switching out Feedback nodes with delay 6 for a discrete delay with delay 5 will save LUTs (Assuming you haven't used up your SRLs on the target).

Switching out a Feedback Node with delay 1 with a discrete delay will have no benefit.

The best way to save LUTs is by organising your code so that you have as little "branching" or "conditional" execution as possible. Sometimes it's cheaper to implement two independent pathways and choose the result at the end than have multiple case structures peppered throughout the entire chain.

Think of each output tunnel of a case as being as many LUTs as the data is wide. A 32-bit output tunnel requires 32 LUTs. If the case structure has more than 7 elements, it will actually cost 64 LUTs. And so on, there's a certain element of quantised scaling of the resources required. It gets rather complicated. And these values are only valid BEFORE Xilinx starts re-organising and optimising things. Oftentimes code which theoretically should cost less actually costs the same because it's just replicating optimisations that Xilinx does anyway. Working out exactly what is saved still requires a "Do it and check" step to make sure you're comparing apples with apples.
