
FPGA: SCTLs and Block Memory

Does anyone have a good example of using SCTLs and block memory together efficiently in LabVIEW FPGA 2010? I am programming a fairly simple encryption algorithm (Blowfish) using LabVIEW FPGA 2010 and the Spartan-3E Starter Board. The algorithm requires a lot of memory (1042 * uint32), which I was able to achieve using a single block memory in my first attempt at the code. It worked fine, but the speed wasn't nearly what I was hoping for.

 

In my next iteration, I decided to split the single large BRAM into 5 smaller ones (one 18 * uint32 LUT-based memory, and four 256 * uint32 BRAMs), and attempted to access certain things in parallel to make some operations faster. I also tried to use SCTLs in the hope of speeding up some of the logic (XORs, ADDs, really basic things). After synthesis, everything still worked and produced the same results, but it is actually slower than when I just threw some code together for the first pass.

 

It is clear that I do not understand the conversion between LabVIEW code and the FPGA implementation, nor do I really understand when or how to use SCTLs, either alone or in conjunction with BRAMs. If anyone with a much better grasp of this than I have would like to comment on my code or provide their own examples, it would help me a lot. Thanks

 

(Attached are the two different versions of the most basic function in the encryption algorithm.)

Message 1 of 7

I'm curious how you're determining the speed of execution, and what speed you expect.  The FPGA code looks fine to me; either approach should work, but the SCTL will usually be faster.  I don't have any experience with the Spartan board - how are you setting the input and output values?  All my FPGA experience is with NI-RIO, and with those boards there's some startup time to get the FPGA running, so if your FPGA is starting and stopping, that could account for some delay.  If that's what's happening, try letting the FPGA run continuously and use a front-panel Boolean to trigger your algorithm to run.

Message 2 of 7

@HelpMeJebus wrote:

Does anyone have a good example of using SCTLs and block memory together efficiently in LabVIEW FPGA 2010? I am programming a fairly simple encryption algorithm (Blowfish) using LabVIEW FPGA 2010 and the Spartan-3E Starter Board. The algorithm requires a lot of memory (1042 * uint32), which I was able to achieve using a single block memory in my first attempt at the code. It worked fine, but the speed wasn't nearly what I was hoping for.

 

How are you benchmarking your execution speed?  Are you measuring it based on the maximum clock frequency of your synthesized design (40 MHz block diagram top-level clock, constrained to 40 MHz on the board?), or the actual throughput of the implemented algorithm (is the implementation pipelined sufficiently)?  Or is it an issue of LATENCY?  Using while loops rather than SCTLs increases the latency by adding pipeline registers throughout the data flow, breaking up the logic between each register - meaning less work has to be done per clock cycle - meaning a higher maximum clock frequency.  The more pipelining, the longer the latency (the time from the first valid input to the first valid output).

 

In my next iteration, I decided to split the single large BRAM into 5 smaller ones (one 18 * uint32 LUT-based memory, and four 256 * uint32 BRAMs), and attempted to access certain things in parallel to make some operations faster. I also tried to use SCTLs in the hope of speeding up some of the logic (XORs, ADDs, really basic things). After synthesis, everything still worked and produced the same results, but it is actually slower than when I just threw some code together for the first pass.

 

Generally speaking, a blockRAM implementation of a memory will ALWAYS be faster than a LUT-based implementation of that memory.  Please respond if you need details on how FPGAs use LUTs to create logic/memory, compared to BRAM.

 

It is clear that I do not understand the conversion between LabVIEW code and the FPGA implementation, nor do I really understand when or how to use SCTLs, either alone or in conjunction with BRAMs. If anyone with a much better grasp of this than I have would like to comment on my code or provide their own examples, it would help me a lot.

 

Let's do a quick explanation of how FPGA design works using real tools, and why you want to use SCTLs.  We agree that digital logic works on clock edges, right?  A rising edge (usually - sometimes a falling edge, but that's another story) triggers a memory unit called a "flip-flop" or FF to latch its input value.  Basically, an FF is a 1-bit storage unit that takes the value of its input at every clock edge.  A signal is "registered" every time it is stored into a flip-flop.  Everything you do in LabVIEW FPGA is completed between two FFs (spatially and temporally), as in the following example of an adder.

1.  The values Data_A and Data_B (uint8s) are driven from SOMEWHERE.  It could be FPGA I/O, it could be another 'VI' or VHDL module, or it could come from controls driven by the host.  (Clock cycle 0)
2.  Data_A and Data_B are clocked into a register, or a set of FFs, one FF per bit.  (Clock cycle 1)
3.  Data_A and Data_B enter an adder, and the result Result_C is pushed to the input of a second register (set of FFs), still in clock cycle 1.
4.  Result_C is output from the second register.  (Clock cycle 2; a small cycle-by-cycle sketch of this follows below.)
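
For illustration, here is a tiny behavioral sketch of that registered adder in Python (not LabVIEW code, just a made-up model with hypothetical names): the sum of the inputs presented in cycle 0 becomes visible at the output in cycle 2.

# Cycle-by-cycle sketch of the registered adder described above.
# Hypothetical model: reg_a/reg_b latch the inputs, reg_c latches the sum.

def registered_adder(stream):
    """stream: list of (data_a, data_b) pairs, one pair per clock cycle."""
    reg_a = reg_b = 0      # input registers (step 2)
    reg_c = 0              # output register (step 4)
    trace = []
    for cycle, (data_a, data_b) in enumerate(stream):
        trace.append((cycle, reg_c))       # what is visible at the output now
        # On the clock edge, every register latches simultaneously:
        reg_c = (reg_a + reg_b) & 0xFF     # step 3: adder result into the output FFs (uint8 wrap)
        reg_a, reg_b = data_a, data_b      # step 2: fresh inputs into the input FFs
    return trace

# Inputs driven in cycle 0 (1 + 2) show up at the output in cycle 2.
print(registered_adder([(1, 2), (10, 20), (100, 200)]))   # [(0, 0), (1, 0), (2, 3)]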

 

This is what would be inferred (the compiler would create this in HW) for a single operation occurring between your inputs and outputs in BOTH a normal while loop and an SCTL.  The difference shows up when you increase the complexity of what is happening between the input and output of your block.  For each component you drop on your block diagram, a while loop adds a register.  So if your critical path (the data path in your diagram with the largest number of steps before an output) is long, the added registers increase your latency - it takes longer to get data out - but your maximum frequency (the rate at which valid data is produced after the first valid data point) is higher, because there is LESS computation to do between two clock edges (that is, between the time a register outputs a value and the time the downstream register accepts that computation).  The reason is that the time it takes actual electrical signals to propagate through computation circuitry increases as the complexity of that computation increases.  Since so many registers are being created, you will run up against the Spartan FPGA resource limits faster (especially if you waste LUTs instead of using BRAM).  You can think of the while loop as automatically pipelining your design (but it is very wasteful).

 

This while-loop implementation differs from an SCTL in the following way: an SCTL completes EVERYTHING inside the loop in a single clock cycle.  Since there is no automatic 'registering' of all intermediate computations in an SCTL, it is a far more efficient use of FPGA resources.  However, there is a risk with that.  Because an SCTL completes all the operations in the space of one clock cycle (hence 'single-cycle loop'), all the logic that performs your operation is sandwiched between just two sets of registers (inputs and outputs).  That means the critical path your data takes is longer, so the clock frequency will be lower, because there is more work to be done between clocking OUT of the first set of registers and INTO the second set of registers.  The way to mitigate this is to add your own pipeline registers - I'll leave that exercise to the reader (read appendix A of the LabVIEW FPGA intro training).  Basically, you insert "feedback nodes" or "shift registers".  This way you create registers on your intermediate values and break up those long paths for higher throughput.  MAKE DARN SURE that you have the same number of feedback nodes on all parallel data lines in your block diagram, though, or your data won't line up at the same time.  Example: if you are implementing D = (A + B) + C, and you implement it as two steps (A + B first, then that quantity added to C), it must be structurally implemented (pipelined) like this, as sketched below:

(A + B) goes into a feedback node.
C goes into a feedback node.
The outputs of those feedback nodes go into an adder, so the data from C always lines up with (A + B).
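
Here is the same idea as a tiny Python sketch (a hypothetical model, with feedback nodes modelled as one-cycle registers) - note that C gets its own register purely to stay aligned with (A + B):

# Two-stage pipeline for D = (A + B) + C with BALANCED delays.
# Stage 1: A + B goes into a feedback node (register); C goes into its own
# feedback node so it stays lined up with the partial sum.
# Stage 2: the two registered values are added to produce D.

def pipelined_sum(samples):
    """samples: list of (a, b, c) tuples, one tuple per clock cycle."""
    reg_ab = 0   # feedback node holding A + B
    reg_c = 0    # matching feedback node delaying C by the same one cycle
    outputs = []
    for a, b, c in samples:
        outputs.append(reg_ab + reg_c)   # stage 2: D for the PREVIOUS sample
        reg_ab, reg_c = a + b, c         # stage 1: both registers latch together
    return outputs

data = [(1, 2, 3), (10, 20, 30), (100, 200, 300)]
# First output is from the empty pipe; (1 + 2) + 3 = 6 appears one cycle later.
print(pipelined_sum(data))   # [0, 6, 60]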

 

Some final notes:

Increased LUT utilization = lower maximum frequency, because the routing choices are harder to make in the FPGA.

Using LUTs instead of BRAM for memory = slower, because you have more routing delays from stringing together LOTS of LUTs to make one semi-equivalent BRAM.

Using LUTs instead of BRAM for memory = slower, because the RAM control logic is not dedicated high-speed logic.

Using LUTs instead of BRAM wastes LUTs/FFs = slower, because you're wasting logic.

Using while loops instead of SCTLs wastes LUTs/FFs = slower, because you're wasting logic.

 

So read up as much as you can on pipelining your design, and use SCTLs and BlockRAMs.

 

Thanks

 

 

 

(included are the two different versions of the most basic function in the encryption algorithm)


 

 

 

Message 3 of 7

Thank you both for the informative responses. I'm sorry I wasn't very clear in my initial post.

 

As far as how I'm benchmarking: I am just talking about the time it takes to run a 64-bit block through the algorithm (which I obviously want to make as small as possible). I am using the attached VI (PC_timingBlowfish.vi) just to see how quickly it will run through the data. There are probably inaccuracies from the communication between the PC and the FPGA, but because I didn't change the way the data is transferred between the PC and the FPGA, I think the increase in execution time for my "optimized" version with split BRAMs and SCTLs was probably due to faulty coding and not just an inaccurate test bench.

 

The round function is just part of a slightly larger system. I'm attaching the two zips of the first attempt and the "optimized" version in case anyone really wants to help a really bad LabVIEW programmer out!

 

I chose to implement the 18 * uint32 P-Box in LUTs rather than BRAM due to the following I found in the LabVIEW Help: "You are accessing this memory in a single-cycle Timed Loop and need to read data from the memory item during the same cycle as the one in which you give the address."  This gave me the impression that because I could read it in the same cycle, it would be faster. I somewhat understand why it takes more resources, but how can it take more time if I can use it in the same cycle?

 

I think this is probably where I am fundamentally misunderstanding how LabVIEW FPGA works. When I put some functions into an SCTL and leave the default clock for my board (50 MHz), I was under the impression that everything in the loop would execute in exactly 20 ns, and that if there were too many actions to perform (i.e., the critical path was longer than 20 ns), it would generate a compile-time error telling me I'd placed too much in my SCTL. Is this not the case? If you don't wire a clock to the input of the SCTL, does it automatically adjust the frequency to compensate for all the logic inside the loop, so I'm actually not getting all the processing done in 20 ns but rather in the shortest amount of time LabVIEW deems sufficient to execute all the logic? If that is the case, then I indeed see the benefits of pipelining and will be sure to try to implement it effectively.

 

Once again, thanks for taking the time to help me out!

 

(Notes on attachment: contains two folders, (1) Cryptkey, which contains my original attempt, and (2) Cryptkey_v2, which is my attempt at optimization. The main FPGA VI is labeled FPGA_Blowfish, while the main PC host is BlowfishHostTest2.vi.)

Message 4 of 7

@HelpMeJebus wrote:

Thank you both for the informative responses. I'm sorry I wasn't very clear in my initial post.

 

As far as how I'm benchmarking: I am just talking about the time it takes to run a 64-bit block through the algorithm (which I obviously want to make as small as possible). I am using the attached VI (PC_timingBlowfish.vi) just to see how quickly it will run through the data. There are probably inaccuracies from the communication between the PC and the FPGA, but because I didn't change the way the data is transferred between the PC and the FPGA, I think the increase in execution time for my "optimized" version with split BRAMs and SCTLs was probably due to faulty coding and not just an inaccurate test bench.

 

The round function is just part of a slightly larger system. I'm attaching the two zips of the first attempt and the "optimized" version in case anyone really wants to help a really bad LabVIEW programmer out!

 

I chose to implement the 18 * uint32 P-Box in LUTs rather than BRAM due to the following I found in the LabVIEW Help: "You are accessing this memory in a single-cycle Timed Loop and need to read data from the memory item during the same cycle as the one in which you give the address."  This gave me the impression that because I could read it in the same cycle, it would be faster. I somewhat understand why it takes more resources, but how can it take more time if I can use it in the same cycle?

 

BlockRAM is an SRAM, and SRAMs tend to have a 1-clock latency.  A blockRAM read call in your VI will easily make the 20 ns window; however, the data for a read won't be valid until the following clock cycle: a change in address on clock cycle 1 will output the data on clock cycle 2.  You could test out the latency by writing a known value into a block RAM, setting up a counter that starts incrementing (and increments every clock cycle) when you issue a read (i.e., the read memory node is executed in the block diagram), and stops incrementing when you've received your known value.
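
If it helps, here is a rough behavioral sketch in Python (a made-up model, not the LabVIEW memory API) of what that counter experiment should report for a memory with a one-cycle read latency:

# Behavioral model of a synchronous (blockRAM-style) read: the address is
# sampled on one clock edge and the data appears on the next.

def measure_read_latency(mem, addr, expected):
    """Count clock cycles from issuing the read until the expected data is seen."""
    read_data_reg = None   # models the BRAM output register (not yet valid)
    cycles = 0
    while read_data_reg != expected:
        cycles += 1                    # one clock edge passes
        read_data_reg = mem[addr]      # data becomes valid one edge after the read
    return cycles

memory = {0x00: 0xDEADBEEF}            # known value written beforehand
print(measure_read_latency(memory, 0x00, 0xDEADBEEF))   # -> 1 cycle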

 

I think this is probably where I am fundamentally misunderstanding how LabVIEW FPGA works. When I put some functions into an SCTL and leave the default clock for my board (50 MHz), I was under the impression that everything in the loop would execute in exactly 20 ns, and that if there were too many actions to perform (i.e., the critical path was longer than 20 ns), it would generate a compile-time error telling me I'd placed too much in my SCTL. Is this not the case?

 

This should be the case.  The point of a single-cycle Timed Loop is to provide a structure similar to a clocked process in your HDL of choice.  In a clocked process, a line like "D <= (((A + B) << 2)) + C" must complete all those operations in one clock cycle.  The inputs (A, B, C) are sampled at that clock edge, propagated through the circuit, and the value D is output from a register on the next edge (meaning all the computation is done between the two clock edges, i.e., one clock cycle).

 

For the record, all the synthesis, mapping, place-and-route, and timing analysis is done in the Xilinx toolchain.  LabVIEW FPGA simply translates block diagrams into VHDL and creates a set of project inputs for the Xilinx ISE toolset to process.  Xilinx reads the HDL created by LabVIEW and creates a list of low-level structures native to the FPGA to implement your design.  The mapper (part of ISE) looks at those structures and optimizes them with the goal of reducing complexity.  The final step is place-and-route, where the tool looks at the FPGA, attempts to place all of the pieces the mapper says are necessary, and does all the routing.  Once a given placement/routing is done, it then checks all paths from your inputs to their downstream outputs.  Any path between two registers/flops that was placed/routed in a way that violates timing will be flagged as a problem, and another PaR iteration will be performed to try to fix the violation.  While pipelining adds more logic (flip-flops), it provides more PaR flexibility and breaks up those long chains.  The catch is that you can't saturate an FPGA's LUTs and FFs, or it gets harder and harder to find places for all the LUTs/registers that still meet the timing requirements.  This is why MANY real FPGA designs have a requirement like "the FPGA shall be no more than 70% utilized", so you can achieve this 'timing closure'.  So, to the point: LabVIEW is just design entry for those who don't know HDL - it isn't the actual toolset that does the FPGA compilation; that is the Xilinx toolchain.

 

 

In a standard FPGA design flow, your timing constraints would be kept in a .ucf file - a list of settings that apply to clock domains.  Any clocked process that can't perform each of its single-cycle instructions in the time specified in the constraints file will produce an error in synthesis, timing-driven map, or post-PAR static timing analysis.

 

LabVIEW has a few additional checks and balances.  It differentiates clock domains using FPGA base clocks, top-level clocks, derived clocks, or CLIP clocks - and in this manner lets you segregate and easily read your clock domains (the logic that is to be driven at a certain clock frequency).  This LabVIEW segregation, when coupled with the .ucf constraints file, will properly initialize the clocking circuitry to create your derived clocks on the chip, as well as give input to the FPGA tools so they know how to analyze 'critical paths' in the design against the maximum period in order to achieve the frequencies you want.

 

If you don't wire a clock to the input of the SCTL, does it automatically adjust the frequency to compensate for all the logic inside the loop, so I'm actually not getting all the processing done in 20 ns but rather in the shortest amount of time LabVIEW deems sufficient to execute all the logic? If that is the case, then I indeed see the benefits of pipelining and will be sure to try to implement it effectively.

 

There's no really good reason not to wire a clock to the input of the SCTL - clarity is always a good thing.  I believe that if the clock input of the SCTL is not wired, it uses the FPGA base clock; you can find this in the FPGA properties.  My FlexRIO development cards default to 40 MHz.  Simple example of pipelining: laundry.  If your wash takes 30 minutes and your dryer takes 30 minutes, it takes you 1 hour per load if you only use one machine at a time.  If you wash and dry different loads at the same time, you get 0.5 hours per load.  Pipelining works the same way.

 

Your goal is to increase your throughput.  If you have 5 computations and you try to do them all at once (in one clock cycle), you might achieve 20 MHz (an example so the math works out).  If you put in 4 pipeline registers to separate the computations into stages, each cycle now does only 1/5 of the work (assuming each computation is of similar implementation complexity), so you can run at 100 MHz.  Bonus: you get 5x the throughput, at the expense of resources on the FPGA.
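
Putting rough numbers on that (hypothetical delays, just to show the arithmetic):

# Hypothetical numbers: 5 chained computations of ~10 ns each.
per_stage_delay_ns = 10
num_computations = 5

# All 5 in one SCTL iteration: the critical path is the whole 50 ns chain.
unpipelined_fmax_mhz = 1e3 / (per_stage_delay_ns * num_computations)   # 20 MHz

# 4 pipeline registers split it into 5 stages of ~10 ns each.
pipelined_fmax_mhz = 1e3 / per_stage_delay_ns                          # 100 MHz

# Latency grows to 5 cycles, but a new result still finishes every cycle.
latency_ns = num_computations * (1e3 / pipelined_fmax_mhz)             # 50 ns

print(unpipelined_fmax_mhz, pipelined_fmax_mhz, latency_ns)            # 20.0 100.0 50.0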

 

Look at it from another perspective: you have a requirement that you must get 100 MHz performance, but you're only getting 20 MHz - so you look at the longest path in your computation chain and start breaking it up by adding pipeline registers.  A pipeline register is just a regular register, but it is used for pipelining rather than for generic memory storage or for delaying data in time for alignment.

 

The Spartan-3 is a 4-generation-old budget part.  It is advertised as having decent speeds (go look at the old literature), but to achieve that you have to use every tool Xilinx gives you: use BRAMs instead of LUTs, and pipeline the hell out of all your processing chains.

 

Once again, thanks for taking the time to help me out!

 

(Notes on attachment: Contains two folders, (1) Cryptkey which contains my original attempt and (2) Cryptkey_v2, which is my attempt at optimization. The main FPGA vi is labeled FPGA_Blowfish, while the main PC host is BlowfishHostTest2.vi)


 

Message 5 of 7

JJ Montante,

 

Thanks for helping me to understand all this. I'm pretty bullheaded about it, but I am starting to come around. So, in order to implement the following pseudocode:

Loop on i from 1 to 16

xL = xL XOR P[i]

xR = F(xL) XOR xR

swap xL and xR

End of loop

 

where F(xL) also requires reading from 4 different BRAMs.
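
For reference, here is that loop as a small Python model I could use as a host-side golden reference (placeholder P-array and S-box contents, and the final swap/output whitening with the last two P entries is left out to match the pseudocode above):

# Python reference model of the quoted Blowfish round loop (16 Feistel rounds).
# P and S are placeholder tables here; real Blowfish fills them from the digits
# of pi and then key-schedules them.

MASK32 = 0xFFFFFFFF

P = list(range(18))                                                     # placeholder P-box (18 x uint32)
S = [[(k * 256 + j) & MASK32 for j in range(256)] for k in range(4)]   # four 256 x uint32 S-boxes

def F(xL):
    """Round function: one read from each of the four S-box memories."""
    a = (xL >> 24) & 0xFF
    b = (xL >> 16) & 0xFF
    c = (xL >> 8) & 0xFF
    d = xL & 0xFF
    return ((((S[0][a] + S[1][b]) & MASK32) ^ S[2][c]) + S[3][d]) & MASK32

def sixteen_rounds(xL, xR):
    for i in range(16):          # "Loop on i from 1 to 16" (0-based indexing here)
        xL = xL ^ P[i]           # xL = xL XOR P[i]
        xR = F(xL) ^ xR          # xR = F(xL) XOR xR
        xL, xR = xR, xL          # swap xL and xR
    return xL, xR

print(sixteen_rounds(0x01234567, 0x89ABCDEF))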

 

1st Clock Cycle: Read BRAM holding PBox

2nd Clock Cycle: xL XOR P[i] (using the data you just read)

3rd Clock Cycle: Read 4 BRAM's holding SBoxes

4th Clock Cycle: Finish the round function, XOR with xR, and swap them

 

**Repeat those 4 cycles 16 times

 

Because the memories require an additional clock cycle, do I need to divide it up this way? Would it be faster implemented as a flat sequence structure containing 4 SCTLs, or as a single SCTL with a case structure inside? Or something else entirely?

 

As you can see, I am having a very hard time thinking about all this, but I really appreciate the help

 

Message 6 of 7

As mentioned earlier, whether you iterate X times within one loop or unroll the loop into X concurrently running loops depends on what average latency you can accommodate. If you can allow extra cycles per point, then using a single loop with a state machine (case structure) is a great way to conserve resources. Once the system can't get data through fast enough, you have to resort to unrolling the code into multiple concurrent sections (pipelining), which will usually require duplicating/splitting memory blocks and other logic. Unfortunately, there usually isn't a golden rule for any of this; you just use experience and try things out.
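
As a rough illustration of the single-loop-with-case-structure option, here is a behavioral Python sketch of the 4-cycle schedule from the previous post (placeholder tables, memory reads modelled as taking effect on the following cycle; not LabVIEW code):

# One Blowfish round as a 4-state "case structure" inside a single loop,
# one state per clock cycle, so 16 rounds take 64 cycles.

MASK32 = 0xFFFFFFFF
P = list(range(18))                                                     # placeholder P-box
S = [[(k * 256 + j) & MASK32 for j in range(256)] for k in range(4)]   # placeholder S-boxes

def one_round(xL, xR, i):
    p_data = None
    s_data = None
    for state in range(4):                 # 4 clock cycles per round
        if state == 0:                     # cycle 1: read the P-box BRAM
            p_data = P[i]                  # data valid from the next cycle on
        elif state == 1:                   # cycle 2: xL = xL XOR P[i]
            xL = xL ^ p_data
        elif state == 2:                   # cycle 3: read the 4 S-box BRAMs in parallel
            a, b, c, d = (xL >> 24) & 0xFF, (xL >> 16) & 0xFF, (xL >> 8) & 0xFF, xL & 0xFF
            s_data = (S[0][a], S[1][b], S[2][c], S[3][d])
        else:                              # cycle 4: finish F, XOR with xR, swap
            f = ((((s_data[0] + s_data[1]) & MASK32) ^ s_data[2]) + s_data[3]) & MASK32
            xR = f ^ xR
            xL, xR = xR, xL
    return xL, xR

xL, xR = 0x01234567, 0x89ABCDEF
for i in range(16):                        # 16 rounds -> 64 cycles total
    xL, xR = one_round(xL, xR, i)
print(hex(xL), hex(xR))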

Message 7 of 7