03-21-2018 01:16 PM
I'm new to the community, so my apologies if this is a redundant post. I searched for a while and didn't see anything that was close enough to my challenge.
I have an FPGA system (PCIe-7852R) which does the following:
If the Master LUT is read only (i.e. initialized when I compile the FPGA and never written to later), everything compiles and works. However, as soon as I add a Host-to-target FIFO to asynchronously write data to the Master LUT (so I can re-load correction data and keep the system tuned without having to re-compile every time something changes) I get placer errors (Place543 to be specific). And even more infuriating is the fact that it's always an issue placing 100 FFs and 160 LUTs as if there is some specific resource conflict which is not clearly identified. I've tried this a few ways and it never seems to work.
I've done all the standard optimizations (remove front panel objects, remove memory arbitration where possible, etc.) but it is still failing to compile. The hard-coded equivalent (no FIFO writing to Memory) consumes close to 80% of FFs and LUTs, so it is already pretty full. Resource estimation of the version which includes the FIFO-Memory code is always below 100% in pre-synthesis and synthesis resource utilization.
While I have many questions, here are three related and key questions:
I'm working on a code snippet to show what I am trying to do and to test the baseline functionality (which will probably work), but in the meantime I am hoping to find information on the underlying technical details and best practices for this type of FIFO-Memory interaction. Any help will be appreciated.
03-22-2018 03:40 PM
I don't know of any examples of what you're trying to do but I do something similar in my code (writing block ram fed by a FIFO). I send my address with the data using split/join number but it's more or less the same concept. I haven't run into the same issue that you are but the fact that you're starting at 80% utilization is a concern.
I have one small optimization that I don't know if you're using. Put the FIFO and the MemWrite in a single cycle loop. I use the handshaking terminal from the FIFO to decide if MemWrite executes.
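As a rough software model of that loop (plain Python, not LabVIEW; the 16-bit address / 16-bit data layout and the names `join_u32`, `split_u32` and `sctl_iteration` are assumptions made for illustration), each iteration attempts a FIFO read and lets the handshake decide whether the memory write happens:

```python
from collections import deque

def join_u32(addr, data):
    """Pack a 16-bit address and a 16-bit data word into one U32 (assumed layout)."""
    return ((addr & 0xFFFF) << 16) | (data & 0xFFFF)

def split_u32(word):
    """Inverse of join_u32: return (addr, data)."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def sctl_iteration(fifo, memory):
    """Model one loop iteration; return True if a write happened.

    An empty FIFO stands in for the 'no valid element' handshake,
    in which case the memory is left untouched.
    """
    if not fifo:                       # handshake: nothing valid this cycle
        return False
    addr, data = split_u32(fifo.popleft())
    memory[addr] = data                # the write only runs on a valid element
    return True

memory = [0] * 1024
fifo = deque([join_u32(5, 0x1234), join_u32(900, 0xBEEF)])
writes = sum(sctl_iteration(fifo, memory) for _ in range(4))
```

Only two of the four modelled iterations write anything; the other two see an empty FIFO and do nothing, which is the behaviour the handshaking terminal is meant to give.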
I don't know of a magic guide. It sounds like you have a good idea of what's going wrong based on utilization. I'm hoping there's some slack out there in the PID code. I'm guessing because you have 8 channels, there likely is optimization available (serialize the data and use a for loop?)
03-25-2018 10:12 AM
Thanks Nanocyte.
One of the latest things I did was to move my asynchronous FIFO/memory update into a SCTL. It sounds like your FIFO is maybe a U32 or U64 with ADDR:DATA packed in each FIFO data element? I could see that being effective for non-sequential write operations where the memory is written semi-randomly. Since my data is basically a giant lookup table, I would typically write the entire thing rather infrequently. I'm unsure whether it is more space efficient to use a smaller FIFO element (I16) with an incrementing loop for the FIFO->MEM write, or a larger (U32) FIFO element which includes ADDR & DATA with a case structure tied to the handshaking. I haven't done a ton of sample code & compile testing, but maybe it's time to start? 🙂
I've been going back through my code and trying to commonalize as much of the multi-channel interpolation as possible. I've already found some places where I can replace multiplier/divider functions with bit shifting. I'm re-architecting now to serialize the interpolation code across the 6 channels where I think I can.
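To illustrate the shift substitution in software terms, here is a minimal Python sketch. It assumes the scale factors and the interpolation step are powers of two; the function names and the 8-bit fraction are hypothetical and not taken from the actual design:

```python
def scale_by_shift(x, shift):
    """Multiply (shift >= 0) or divide (shift < 0) by a power of two using shifts."""
    return x << shift if shift >= 0 else x >> -shift

def lerp_pow2(y0, y1, frac, frac_bits):
    """Linear interpolation with frac in [0, 2**frac_bits).

    The division by 2**frac_bits becomes a right shift, so no divider
    is needed. The multiply by frac still remains.
    Note: a right shift rounds toward negative infinity, which differs
    from truncating division for negative values.
    """
    return y0 + (((y1 - y0) * frac) >> frac_bits)
```

For example, `lerp_pow2(100, 200, 128, 8)` lands halfway between the two table entries. The rounding difference for negative values is worth checking against whatever the original divider did before swapping it out.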
Still wish it were easier to trace from VI to VHDL, but I feel like I am limping along.
Thanks for your thoughts, sometimes it's just nice to see someone else is doing something similar to feel better about the path you're on.
03-26-2018 10:28 AM (last edited on 10-17-2025 08:17 PM by Content Cleaner)
Hi DanAllis,
Regarding your debugging inquiry, NI does have a guide for best FPGA debugging practices, and has a couple of FPGA debugging tools to help ease the headaches that FPGA compilation can undoubtedly cause.
03-29-2018 01:28 PM (last edited on 10-17-2025 08:19 PM by Content Cleaner)
Hey DanAllis,
A few things that might be helpful -
03-29-2018 01:36 PM
Thanks Mike. I was looking more for a Compiler debugging tool to help bridge the gap between the NI VI code and the initial VHDL that goes to Xilinx. Ideally, there are some intermediate files I can open in the Xilinx tools and use the VHDL schematic viewer as a way to see the full extent of the compiled code.
In this case, I solved my problem by code generalization/consolidation and forcing those now reusable chunks to be non-reentrant sub-VIs. This seems to be key for space saving on FPGA but gets very little play in the FPGA development guides. It meant going from 7200/7200 slices on my PCIe-7852R to 6500/7200 slices.
03-29-2018 01:44 PM
Thanks Andrew - I hadn't read that white paper before. It is certainly a good read to help understand comm flow between devices.
I was never having issues with the FIFO as much as with fitting all of my code onto my FPGA. It turned out to be a lot of redundant code that was being compiled into N blocks of the same logic. In the NI LabVIEW High-Performance FPGA Developer's Guide there are two sentences in the 94 pages that were relevant to me. In order to force the compiler to treat a block of reused sub-VI code as a single block in the FPGA, you have to make the sub-VI non-reentrant. That forces the logic to be created once, and then inputs are muxed in over the routing fabric in the FPGA. It makes a huge difference to area utilization. I'm running slow enough that I can process everything serially and save the area.
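A loose software analogy of that trade-off, in Python: the `InterpBlock` class, its instance counter and the interpolation formula are all hypothetical stand-ins, and nothing here models real FPGA area or timing. It only contrasts N copies of the same logic with one copy that the channels pass through in turn:

```python
class InterpBlock:
    """Stand-in for one copy of the interpolation logic on the fabric."""
    instances = 0                      # counts how many copies were 'built'

    def __init__(self):
        InterpBlock.instances += 1

    def run(self, y0, y1, frac, frac_bits=8):
        return y0 + (((y1 - y0) * frac) >> frac_bits)

def parallel_copies(channels):
    """One block per channel, analogous to a reentrant sub-VI at N call sites."""
    blocks = [InterpBlock() for _ in channels]
    return [b.run(*ch) for b, ch in zip(blocks, channels)]

def shared_serial(channels):
    """One block, with the channels fed through it one after another."""
    block = InterpBlock()
    return [block.run(*ch) for ch in channels]
```

Both functions return the same results; the difference is that the shared version builds a single block and pays for it in time, since the channels are handled serially.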
Of all the things which would potentially cause issues with over allocation of FPGA resources, this would seem one of the most likely candidates. Not sure why it doesn't get more play in the dev guide.
As for the debugging tools - I like the Xilinx ISE tools and have used them way back when. I'm going to take a look back at the intermediate files and see if I can open them standalone in ISE. The last time I tried, I was probably not trying hard enough 🙂
03-29-2018 01:48 PM
Once I resolved the FPGA bloat issue, I was able to get a FIFO-to-Block Ram Asynchronous update loop to compile just fine (image attached). Still waiting to hear from NI if the coding structure is logical. My biggest concern is whether the timeout boolean is sufficient to ensure that as I read the FIFO, it will properly increment and fill the Block Memory with the address increment feedback node.
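The timeout concern can be reasoned about with a small Python model of the loop. This is a sketch under assumptions rather than a transcription of the attached diagram: `update_lut`, `start_addr` and `cycles` are hypothetical names, an empty queue stands in for a timed-out FIFO read, and no bounds checking on the address is modelled:

```python
from collections import deque

def update_lut(fifo, memory, start_addr, cycles):
    """Model a FIFO-to-block-memory update loop.

    Each cycle tries to read one element. If the read 'times out'
    (FIFO empty), nothing is written and the address held in the
    feedback node stays unchanged; otherwise the element is written
    and the address advances by one.
    Returns the next address that would be written.
    """
    addr = start_addr                  # feedback node initial value
    for _ in range(cycles):
        timed_out = not fifo
        if timed_out:
            continue                   # no write, no increment
        memory[addr] = fifo.popleft()
        addr += 1
    return addr
```

In this model the address only advances on cycles where an element was actually read, so gaps in the host data stream leave no holes in the memory, provided the increment is gated by the same timeout flag as the write.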
I'm also waiting to hear back from NI regarding the best practices for memory access. Maybe digging into the VHDL will shed some light...
03-29-2018 03:04 PM
You might want to consider sending us a snippet rather than image.
How big is the PID_Config array? That is potentially a big resource hog.
Yes, your use of a timeout there is fine. I'm assuming it just feeds through on the true case.
I'm a little curious about your LUT_Update true case, I'm hoping you just set the write address to zero. You really shouldn't need that control. Ideally you'd just rely on the fifo and the length.
I don't understand the wait in the loop. Why is that there? Can you maybe get this all in a single cycle?
Your use of implies is not how most people do it. Without seeing the true case, I can't give you equivalent logic.
Please label the constants going into the shift register.
03-29-2018 03:37 PM
It's only 3 channels. I had considered encoding the values into a block of memory, but only if I couldn't get the design to compile with my other changes. There is an NI example that's effectively the same code that I swiped and used.
When LUT_UPDATE feedback = T (after a complete update occurred or was cancelled) the address counter is reset to a START_ADDR control. This allows for partial writes to the block memory. I can't do everything in a single cycle on this hardware (only the DRAM block memory can execute in a SCTL).
I was looking over resource utilization and was trying to limit the operators by using the Implies function. The idea is to react to the rising edge of the signal, i.e. sample(N) = T AND sample(N-1) = F. I could have used an AND compound arithmetic function with an invert on one input, but Implies provides the same function with the inverse output (effectively an OR with the opposite input inverted, compared with the AND example above).
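The relationship between Implies and the edge detector can be checked exhaustively in a few lines of Python. The function names are illustrative, and the argument order (current sample into the first Implies input) is an assumption about how the diagram is wired:

```python
from itertools import product

def implies(a, b):
    """Boolean implication: false only when a is true and b is false."""
    return (not a) or b

def rising_edge(current, previous):
    """True only on a False -> True transition."""
    return current and not previous

for current, previous in product([False, True], repeat=2):
    # implies(current, previous) is the complement of the rising edge
    assert implies(current, previous) == (not rising_edge(current, previous))
```

So the Implies output is low for exactly one sample at each rising edge, and whatever consumes it has to treat False as the "edge detected" condition.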
Code Snippet is attached.