LabVIEW FPGA: Multiple SCTL versus one SCTL (same clock domain)

michaeljoseph · ‎10-25-2014

Hello NI forums,

-------------------

Question:

-------------------

See the attached picture from a modified version of the LabVIEW DRAM FIFO example. It probably explains my question more effectively than the paragraphs below.

What is the difference, if any, to the LabVIEW and Xilinx compilers between placing two independent branches of code in the same SCTL and placing them in individual SCTLs (in the same clock domain)?

-------------------

Misc. comments:

-------------------

I have briefly experimented with this concept using the included LabVIEW DRAM FIFO example (example finder >> Hardware Input and Output >> FlexRIO >> External Memory >> Simple External Memory FIFO.lvproj).

I compiled the default example (the read and write interfaces are in separate 40MHz SCTLs) five separate times. Then I put the read and write interfaces in the same 40MHz SCTL and compiled another five times. The result (when both read and write interfaces were in the same SCTL) was a reduction in resource usage (according to the compilation summary).

However, due to my lack of knowledge I'm hesitant to conclude that placing everything in one SCTL is always the best option. For example, I do not know what is created 'behind the scenes' with each SCTL. Perhaps putting independent branches of code in separate SCTLs makes it possible to route clock, reset, implicit enable, etc. signals more effectively.

-------------------

Background information:

-------------------

My task involves acquiring 2 channels of analog data using the NI 5772 and PXIe-7966. Data acquisition takes place in a 200MHz SCTL, and downstream processing is performed in a 100MHz SCTL.

During the vast majority of the 100MHz SCTL processing stages of the FPGA VI, the 2 channels of data do not interact with each other. So it would be easy for me to place them in separate 100MHz loops if doing so would somehow help the design (timing, resource usage, etc.).

--------------

Thanks!

dan_u · ‎10-27-2014

I would also be interested in that.

At NI Week 2 years ago I was told that splitting the code into several timed loops can help if you get timing violations. Everything in one timed loop shares the same enable line for the flip-flops, which might make routing difficult and in turn lead to timing violations because of routing delays.

Can anybody share more insights?

What additional resources are used (if any) when splitting one loop into multiple (with same clock)?

crossrulz · ‎10-27-2014

Based on your example, I see nothing wrong with using a single SCTL. It probably is saving a little bit of resources from clock routing. But I wouldn't think it would be enough to actually worry about unless your FPGA is getting really close to being full.


Dragis · ‎10-27-2014

I would organize the code in a way that makes it easiest to understand, debug, and maintain. There are some cases where breaking it up into multiple loops might fix a timing issue; there are other cases where combining them might reduce resource utilization allowing the design to fit on the chip.

In this particular example, either way would be sufficient so it really depends on whether the code is related or not and if it needs to be tested separately. I would think through those high-level design details first and design the application accordingly.

Intaris · ‎10-28-2014

Just out of interest, what is the resource usage differential between the two versions?

Dragis · ‎10-28-2014

There is some amount of overhead associated with each loop to deal with data flowing into and out of the loop (both via tunnels and through other resources). It's generally minimal compared to the logic within a loop, but it does exist. Depending on how the application is written, combining into a single loop may allow the compiler to share some of these boundary resources between multiple items. Again, this generally is not an issue, so don't architect around the number of loops as a primary concern.

michaeljoseph · ‎10-28-2014

Yes this is exactly why I am now looking into alternative configurations.

In my actual design the FPGA usage is more extensive than the example. I've copied the resource usage summary for the latest version of my FPGA VI (several compiles of the same VI). I have the most resource intensive parts of the design completed, but I still need to add some more things (hopefully they will fit).

If you are curious, I have 8 Xilinx FFT cores configured as 1024-point, streaming I/O, 100MHz, 12-bit input and output width, and using as much block RAM and as many DSP48s as possible. That is what is taking up most of the resources.

Device Utilization
---------------------------
Total Slices: 94.1% (13850 out of 14720)
Slice Registers: 66.6% (39219 out of 58880)
Slice LUTs: 61.7% (36343 out of 58880)
DSP48s: 50.0% (320 out of 640)
Block RAMs: 43.4% (106 out of 244)

Device Utilization
---------------------------
Total Slices: 98.3% (14466 out of 14720)
Slice Registers: 66.6% (39219 out of 58880)
Slice LUTs: 61.7% (36344 out of 58880)
DSP48s: 50.0% (320 out of 640)
Block RAMs: 43.4% (106 out of 244)

Device Utilization
---------------------------
Total Slices: 92.6% (13630 out of 14720)
Slice Registers: 66.6% (39219 out of 58880)
Slice LUTs: 61.7% (36342 out of 58880)
DSP48s: 50.0% (320 out of 640)
Block RAMs: 42.2% (103 out of 244)

Device Utilization
---------------------------
Total Slices: 94.0% (13843 out of 14720)
Slice Registers: 66.6% (39219 out of 58880)
Slice LUTs: 61.7% (36345 out of 58880)
DSP48s: 50.0% (320 out of 640)
Block RAMs: 43.4% (106 out of 244)

Device Utilization
---------------------------
Total Slices: 93.2% (13723 out of 14720)
Slice Registers: 66.6% (39219 out of 58880)
Slice LUTs: 61.7% (36344 out of 58880)
DSP48s: 50.0% (320 out of 640)
Block RAMs: 43.4% (106 out of 244)
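
As a quick sanity check on the five summaries above, here is a minimal Python sketch (not part of the compile flow; the variable and function names are just illustrative) that measures how much each figure moves between compiles of the same VI. The numbers are copied from the summaries. It suggests that Total Slices depends heavily on how each run happens to pack the logic, while the LUT count barely changes.

```python
# Figures copied from the five compile summaries of the same FPGA VI.
TOTAL_AVAILABLE_SLICES = 14720
total_slices = [13850, 14466, 13630, 13843, 13723]
slice_luts = [36343, 36344, 36342, 36345, 36344]


def spread(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


slice_spread = spread(total_slices)
lut_spread = spread(slice_luts)
# Express the slice spread as a percentage of the whole device.
slice_spread_pct = 100.0 * slice_spread / TOTAL_AVAILABLE_SLICES

print(f"Total Slices spread: {slice_spread} ({slice_spread_pct:.1f}% of device)")
print(f"Slice LUTs spread:   {lut_spread}")
```

Slice Registers are identical (39219) in all five runs, so they are left out of the sketch.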

Intaris · ‎10-28-2014

Are these all for one SCTL or all for multiple SCTLs? What is the difference in resource usage between the two approaches?

michaeljoseph · ‎10-28-2014

In response to Intaris's earlier question ("Just out of interest, what is the resource usage differential between the two versions?"):

This is a little embarrassing, but it seems the resource usage is more similar than I initially thought for this particular example. I think the previous compilations that I based my assumption on coincidentally used more resources in the two-SCTL case. I just compiled each version two additional times (see below).

Here's the version with everything in one loop:

Device Utilization
---------------------------
Total Slices: 17.6% (2587 out of 14720)
Slice Registers: 9.5% (5583 out of 58880)
Slice LUTs: 8.2% (4855 out of 58880)
DSP48s: 0.0% (0 out of 640)
Block RAMs: 2.5% (6 out of 244)

Device Utilization
---------------------------
Total Slices: 16.9% (2493 out of 14720)
Slice Registers: 9.5% (5583 out of 58880)
Slice LUTs: 8.3% (4858 out of 58880)
DSP48s: 0.0% (0 out of 640)
Block RAMs: 2.5% (6 out of 244)

Here's the version with the read and write in separate loops:

Device Utilization
---------------------------
Total Slices: 16.4% (2407 out of 14720)
Slice Registers: 9.5% (5583 out of 58880)
Slice LUTs: 8.2% (4852 out of 58880)
DSP48s: 0.0% (0 out of 640)
Block RAMs: 2.5% (6 out of 244)

Device Utilization
---------------------------
Total Slices: 19.4% (2859 out of 14720)
Slice Registers: 9.5% (5583 out of 58880)
Slice LUTs: 8.3% (4859 out of 58880)
DSP48s: 0.0% (0 out of 640)
Block RAMs: 2.5% (6 out of 244)
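
To put the four results above side by side, here is a rough Python sketch (the names are illustrative and the Total Slices values are copied from the summaries). It compares the gap between the two configurations' averages with the run-to-run variation inside each configuration. With only two compiles per configuration this is an informal check, not a statistical test.

```python
from statistics import mean

# Total Slices from the two compiles of each configuration.
one_loop = [2587, 2493]
two_loops = [2407, 2859]


def run_to_run_range(values):
    """Return how far apart repeated compiles of one configuration landed."""
    return max(values) - min(values)


# Difference between the average slice usage of the two configurations.
mean_gap = abs(mean(two_loops) - mean(one_loop))
# Largest variation seen between repeated compiles of the same configuration.
widest_range = max(run_to_run_range(one_loop), run_to_run_range(two_loops))

print(f"Gap between configuration averages: {mean_gap} slices")
print(f"Widest run-to-run range:            {widest_range} slices")
print("Gap is within compile-to-compile noise:", mean_gap < widest_range)
```

On these numbers the gap between the averages is smaller than the variation between two compiles of the same two-loop design, which is consistent with the conclusion that this example shows no clear difference.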

michaeljoseph · ‎10-28-2014

Intaris:

In the version that I just posted, everything that runs at 100MHz is in the same SCTL.

I will try a separate 100MHz loop version after I finish some new additions to the VI. It will probably take me a few days.

Once I have the new version I will compile it in both configurations (one and two SCTLs) and post the results here.