04-19-2013 05:53 PM
I have a loop executing on my FPGA (7854R card) that is running much slower than I would expect. Looking at the code, I would expect the limiting factor to be an analog output; that block takes around 42-43 ticks to execute on its own, and I have pipelined the loop to isolate that block. I have taken the loop apart and benchmarked many of the separate parts, finding that all of the other parts take much less than 40 ticks to execute. But when I put all of the parts together, the loop takes around 133 ticks to execute. Can you help me figure out what might be causing this slowdown?
I have attached the .vi in question. You may notice that it is incomplete, since I have eliminated the rest of the function of my program to make it as simple to understand as possible.
04-20-2013 08:29 AM
You can put a lot of your logic and math inside a single-cycle timed loop (SCTL): http://zone.ni.com/reference/en-XX/help/371599H-01/lvfpgaconcepts/using_sctl_optimize_fpga/
That should help.
There are other optimisation techniques as well; these links may be useful if you haven't seen them already:
http://www.ni.com/white-paper/3749/en
http://digital.ni.com/public.nsf/allkb/311C18E2D635FA338625714700664816?OpenDocument
http://digital.ni.com/public.nsf/allkb/722A9451AE4E23A586257212007DC5FD
04-21-2013 06:16 PM
Thanks, I am trying the SCTLs right now. I am also taking a look at the links you sent. Will let you know if that helps.
Here's the part that I don't get: I don't see why that logic should take more than the 40-ish ticks required by the analog output block, much less the 133 clock ticks it currently takes. If I break the code down into smaller chunks, they all seem to run at reasonable speed. It seems that it's only when I put them all together that the slowdown happens. Any insight into that behavior would be appreciated too.
Thanks again for the assistance.
04-21-2013 07:23 PM
I read through all of your links. I had already read the first two, but I picked up something new along the way: an SCTL not only makes code run faster, it also saves FPGA fabric. I didn't realize that.
My issue seems to be the divide function: it takes 60+ cycles to execute, and I had two of them cascaded, resulting in 133 cycles for the whole loop. I could swear I had benchmarked the divide function separately, but when I benchmarked it today it came out at 60+ cycles. The high-speed divide function is about twice as fast, so I've switched to it.
I have pipelined the two divide functions (that will require some adaptation of adjacent code, but that's OK). I'm also playing with how much of the other logic I can put in one SCTL; I know I can break it into two SCTLs and it will compile.
My loop is currently running at 44 cycles/iteration, so there's not much room for optimization in speed. I will make a few more attempts with SCTLs to save some fabric, but I'm just about where I wanted to be.
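A rough way to see why pipelining brings the loop down toward the analog output's latency: without pipelining, one iteration waits for every stage in turn, so the period is roughly the sum of the stage latencies; with pipelining, each stage works on a different iteration and the slowest stage sets the period. The Python model below is only a sketch. The ~43-tick analog output figure is from this thread, but the 30-tick divide latencies are hypothetical placeholders, and the model ignores loop overhead.

```python
# Toy model of loop timing; all latencies are in clock ticks.
# Only the ~43-tick analog output figure comes from this thread;
# the divide latencies are hypothetical placeholders.


def loop_period(stage_ticks, pipelined):
    """Ticks per iteration for a chain of dependent stages.

    Unpipelined: each iteration runs every stage in turn, so the
    period is the sum of the stage latencies.
    Pipelined: registers (e.g. feedback nodes) between stages let each
    stage work on a different iteration, so the slowest stage sets
    the period. Loop overhead is ignored.
    """
    if not stage_ticks:
        raise ValueError("need at least one stage")
    return max(stage_ticks) if pipelined else sum(stage_ticks)


if __name__ == "__main__":
    stages = {"divide 1": 30, "divide 2": 30, "analog output": 43}
    ticks = list(stages.values())
    print("unpipelined:", loop_period(ticks, pipelined=False), "ticks")  # 103
    print("pipelined:  ", loop_period(ticks, pipelined=True), "ticks")   # 43
```

The price of pipelining is latency: a result leaves the loop a few iterations after its inputs went in, which is why the adjacent code has to be adapted.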
Thanks for the help!
04-22-2013 03:38 AM
If you want to save fabric the SCTL is definitely a good choice. In addition you can also save some by:
- check the FXP (fixed-point) number representations; for the divisions, narrower types will also save ticks. For example, the 40020000 constant does not need to be 30 bits wide, let alone 50.
- avoid divisions when possible. One of your divisions divides by a constant, which you can easily change into a multiplication by the reciprocal. You can also specify the output type of an FXP calculation.
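One way to model the divide-by-constant suggestion outside LabVIEW: precompute the reciprocal once and replace each division with a multiply and a shift, which needs a multiplier instead of a multi-cycle divider. This Python sketch uses plain integers with `frac_bits` fractional bits to mimic a fixed-point constant. The function names, the divisor 3, and the 24 fractional bits are hypothetical, and the exactness condition comes from the standard rounded-up-reciprocal trick for integer quotients, not from anything LabVIEW-specific.

```python
# Integer model of replacing "x / DIVISOR" with a multiply and a shift.
# DIVISOR and FRAC_BITS are hypothetical; on the FPGA the reciprocal
# would be a precomputed fixed-point constant.


def reciprocal_constant(divisor, frac_bits):
    """Return ceil(2**frac_bits / divisor), computed once (not per sample)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return -(-(1 << frac_bits) // divisor)  # integer ceiling division


def divide_by_constant(x, divisor, recip, frac_bits):
    """Compute x // divisor with one multiply and one right shift.

    With the rounded-up reciprocal above, the result is exact whenever
    0 <= x and x * divisor < 2**frac_bits. The range check exists only
    in this model; in hardware you would size frac_bits for the input range.
    """
    if x < 0 or x * divisor >= (1 << frac_bits):
        raise ValueError("x is outside the range where this is exact")
    return (x * recip) >> frac_bits


if __name__ == "__main__":
    DIVISOR = 3     # hypothetical constant divisor
    FRAC_BITS = 24  # fractional bits kept in the reciprocal
    RECIP = reciprocal_constant(DIVISOR, FRAC_BITS)
    for x in (0, 100, 65535):
        print(x, divide_by_constant(x, DIVISOR, RECIP, FRAC_BITS), x // DIVISOR)
```

Choosing `frac_bits` plays the same role as choosing the FXP type of the reciprocal constant: more fractional bits widen the input range for which the multiply matches the true quotient, at the cost of a wider multiplier.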
Pipelining makes sense, but if you do it, also try to split your critical path (the two divisions). For instance, the second division could probably run in parallel with the first one instead of after it (this might also make sense once you switch to a multiplication).
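The critical-path point can be checked with a small dataflow model: a node can start only when all of its inputs are ready, so the time to produce a result is the longest latency path through the graph. In this Python sketch the node names and tick counts are hypothetical, and the parallel version assumes, purely for illustration, that a chain like (a / b) / c can be rewritten as (a / b) * (1 / c) so that both divisions start at the same time.

```python
from functools import lru_cache


def critical_path(latency, deps):
    """Longest input-to-output latency (in ticks) through an acyclic dataflow graph.

    latency: dict node -> ticks that operation takes
    deps:    dict node -> nodes whose outputs it consumes
    Feedback paths must already be cut (e.g. at shift registers), since a
    cycle here would recurse forever.
    """

    @lru_cache(maxsize=None)
    def finish(node):
        # A node starts once its slowest input is ready.
        start = max((finish(d) for d in deps.get(node, ())), default=0)
        return start + latency[node]

    return max(finish(node) for node in latency)


if __name__ == "__main__":
    # Hypothetical latencies: 30-tick divides, 3-tick multiply.
    serial = critical_path(
        {"div1": 30, "div2": 30},
        {"div2": ["div1"]},  # (a / b) / c: div2 waits for div1
    )
    parallel = critical_path(
        {"div1": 30, "recip": 30, "mul": 3},
        {"mul": ["div1", "recip"]},  # (a / b) * (1 / c): divides overlap
    )
    print("serial:", serial, "ticks; parallel:", parallel, "ticks")  # 60; 33
```

Note that (a / b) * (1 / c) trades the second serial divide for a multiply on the critical path; if c is a constant, 1 / c becomes a precomputed constant and the second divide disappears entirely.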
04-23-2013 10:52 AM
A quick way to reduce the width of constants is to right-click on the constant and select "Adapt to Entered Data". This automatically gives you the smallest fixed-point range that fits the number, and the reduced ranges propagate through the diagram, reducing the bits needed for downstream operations as well.
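To see how much the 40020000 constant mentioned earlier in the thread can shrink, here is a Python sketch of the smallest unsigned fixed-point type that represents a non-negative integer exactly, using LabVIEW's word length / integer word length convention. It only approximates what "Adapt to Entered Data" does; the function name is hypothetical, and LabVIEW's actual choice may differ (for example in signedness).

```python
def min_unsigned_fxp(value):
    """Smallest unsigned fixed-point type holding a non-negative integer exactly.

    Returns (word_length, integer_word_length): integer_word_length bits sit
    left of the binary point and only the top word_length of them are stored,
    so trailing zero bits of the value cost nothing.
    """
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    if value == 0:
        return (1, 1)
    integer_word_length = value.bit_length()
    trailing_zeros = (value & -value).bit_length() - 1  # index of lowest set bit
    return (integer_word_length - trailing_zeros, integer_word_length)


if __name__ == "__main__":
    print(min_unsigned_fxp(40020000))  # (21, 26)
```

For 40020000 this gives a 21-bit word with a 26-bit integer word length, because the value is 1250625 times 2^5 and the five trailing zero bits need not be stored. Even without that trick it fits in 26 bits, well under 30 or 50.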