I'm modifying code running on a Virtex-5 FPGA card (PXIe-7965R).
We are approaching saturation of the device with our current functionality, so I'm spending a little time looking through the code trying to find places where we can juggle resources in order to fit more on the device. One area we are really under-using at the moment is Block RAM: our design uses only 45 of 244 Block RAM units. We also use only 65 of 640 DSPs.
While the DSPs are reserved for very specific operations, I was thinking I could use the Block RAM strategically to implement certain pipelining operations we utilise in a different way, offloading data to Block RAM instead of using registers and LUTs for that. The problem is that such FIFOs would be really shallow (maybe 8 × 48-bit or so), meaning that the overhead of the FIFO function itself is an important factor to bear in mind when considering such a change.
The only document known to me which lists the resource requirements of various functions on this card (http://www.ni.com/white-paper/7727/en/) tells me that a simple Block RAM FIFO costs 300 registers and 331 LUTs (values were measured with LabVIEW 8.5 !!). My problem is that this is already very close to what my data requires when implemented in fabric. I'm also aware that each Block RAM has a limited width, so exceeding it will probably end up using more than one FIFO and multiply the resource requirement accordingly.
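As a quick sanity check on the width concern, here's a back-of-the-envelope sketch. The aspect ratios are my assumptions from what I've read about the Virtex-5 primitives (a FIFO36 handling up to 36 bits × 1024 deep, a FIFO36_72 up to 72 bits × 512 deep); it deliberately ignores the narrower/deeper ratios, so treat it as a rough upper bound, not gospel:

```python
import math

def bram_fifos_needed(width_bits, depth):
    """Rough estimate of how many Virtex-5 built-in FIFO primitives a
    FIFO of the given shape consumes.

    Assumptions (not verified against the silicon): a FIFO36 supports
    up to 36 bits x 1024 deep, and a FIFO36_72 supports up to
    72 bits x 512 deep. Narrower/deeper aspect ratios are ignored.
    """
    if width_bits <= 36 and depth <= 1024:
        return 1                      # fits a single FIFO36
    if width_bits <= 72 and depth <= 512:
        return 1                      # fits a single FIFO36_72
    # Otherwise tile FIFO36_72 primitives in width and depth
    cols = math.ceil(width_bits / 72)
    rows = math.ceil(depth / 512)
    return cols * rows

# A shallow 8-deep, 48-bit FIFO still burns a whole primitive:
print(bram_fifos_needed(48, 8))     # 1
# Tiling kicks in for wide *and* deep FIFOs, e.g. 144 bits x 1024:
print(bram_fifos_needed(144, 1024)) # 4
```

The point of the sketch: even my tiny 8 × 48-bit FIFOs each occupy one whole 36-kbit block, so the interesting question is the fabric overhead around the block, not the block itself.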
I know that the Virtex 5 has built-in FIFO circuitry which can be utilised under certain conditions.
What is the resource requirement for a Block RAM FIFO implemented with and without the built-in FIFO circuitry and how do I make sure my code makes use of this?
Hmm, I'm trying to work up an example so that I can benchmark this but I'm seeing weird results.
I have generated a sub-VI as illustrated below. This should theoretically use 4 BRAM FIFOs (1023 elements, U32, no arbitration). I place up to 32 of these in my main VI and compile.
The iteration counter propagates through all the FIFOs (4 in total) until it gets to a Register which is read on the top-level diagram and all values are put on the FP.
The code compiles, and it runs.
The weird part is the resource usage.
Either it's not counting the BRAM FIFOs as Block RAM, or my nonsensical operations are being optimised away very thoroughly.
I know the .vhd files are being created for the correct number of built-in FIFOs, but the resource usage simply cannot be correct here.
Whoah, I just found something interesting.....
I decided that the most likely cause was that the Xilinx compiler was counting the Block RAMs incorrectly, so I increased the number of instances of my silly FIFO operations to a point where it was actually above the number of available units.
It wouldn't compile. The compilation window says that I have used 2 of 244 Block RAMs, but a peek in the log file says that the compilation failed because there weren't enough Block RAMs:
ERROR: Pack:2310 - Too many comps of type "FIFO36_EXP" found to fit this device.
ERROR:Map:237 - The design is too large to fit the device. Please check the
Design Summary section to see which resource requirement for your design
exceeds the resources available in the device. Note that the number of slices
reported may not be reflected accurately as their packing might not have been
This would seem to be a bug where the number of used Block RAMs does not take all forms of Block RAM usage into account. I'm pretty sure it's not a LabVIEW bug, because I can find the exact same "2 of 244 Block RAM" info in the resource utilisation statistics within the Xilinx log itself.
Oh, I'm using LV 2012 SP1.
Reading the documentation for the LogiCORE FIFO Generator 8.4 (the version supplied with my version of LV FPGA), it references the following document:
where several FIFO configurations are listed with SIGNIFICANTLY lower resource usage than I thought.
For example, a 512-deep, 72-bit wide FIFO with built-in FIFO36 implementation requires a mere 0 LUTs, 2 FFs and 1 Block RAM (versus 300 registers and 331 LUTs plus 1 Block RAM according to the NI document mentioned earlier).
I really wish these resource utilisation statistics were more up to date, because making design choices based on really outdated information is error-prone and really inefficient.
sorry that it took so long to get back to you.
First of all: Thanks for the clearly structured analysis and nice documentation.
I filed a CAR (corrective action request) asking to update the documentation. If you want to get back to that, just reply to the thread and I should be informed automatically about the thread activity.
One thing: Could you upload the test project, where you did the testing in case someone wants to have a look at your code?
Here's the file.
Just try compiling the x64 instance test.
In the sub-VI I had a clock defined in my project which was simply double the base clock, to make sure that the Xilinx compiler wasn't optimising away my FIFOs. Seems like that fear was unwarranted, but still.... It may be necessary to re-assign a clock within the sub-VI.
I tried compiling on a Virtex-5 target with 244 Block RAM units. This failed due to BRAM FIFO overmapping, even though the officially reported Xilinx resource usage was kind of small (not over 100% anywhere).
OK, a bit of thread necro required. Due to other compilation problems I've been having, I started digging into the Xilinx logs for other information and found out that the number of BRAMs used is actually reported correctly in the Xilinx log, but LV displays numbers whose source I can't fathom.
I think this needs to be escalated to a bug, because it does seem to be a mistake of LV's parsing of the Xilinx log after all.
Apparently, the CAR has been rejected because it's not viewed as a bug.... Well, I just got a compilation error telling me that my design couldn't place 3 instances of BRAM. LV wants me to think that only 79 of 244 BRAMs are actually being used. Somehow I'm not sure that's correct.
LV takes values from an XML file produced by Xilinx, and apparently that's where the false values are reported. If, however, I look into the Xilinx.log file (which is displayed during compilation) I find the following:
Given the following, what is my BRAM usage really?
Device Utilization Summary:
  Number of BUFGs                          20 out of    32   62%
  Number of LOCed BUFGs                     1 out of    20    5%
  Number of BUFGCTRLs                       5 out of    32   15%
  Number of DCIRESETs                       1 out of     1  100%
  Number of DCM_ADVs                        7 out of    12   58%
  Number of LOCed DCM_ADVs                  1 out of     7   14%
  Number of DSP48Es                        65 out of   640   10%
  Number of FIFO36_72_EXPs                 14 out of   244    5%
  Number of FIFO36_EXPs                     3 out of   244    1%
  Number of IDELAYCTRLs                     1 out of    22    4%
  Number of LOCed IDELAYCTRLs               1 out of     1  100%
  Number of ILOGICs                        41 out of   800    5%
  Number of External IOBs                 260 out of   640   40%
  Number of LOCed IOBs                    260 out of   260  100%
  Number of External IOBMs                 19 out of   320    5%
  Number of LOCed IOBMs                    19 out of    19  100%
  Number of External IOBSs                 19 out of   320    5%
  Number of LOCed IOBSs                    19 out of    19  100%
  Number of IODELAYs                       18 out of   800    2%
  Number of ISERDESs                       17 out of   800    2%
  Number of OLOGICs                       110 out of   800   13%
  Number of PLL_ADVs                        1 out of     6   16%
  Number of RAMB18X2s                      22 out of   244    9%
  Number of RAMB18X2SDPs                   25 out of   244   10%
  Number of RAMB36SDP_EXPs                 12 out of   244    4%
  Number of RAMB36_EXPs                    43 out of   244   17%
  Number of RAMBFIFO18_36s                  5 out of   244    2%
  Number of STARTUPs                        1 out of     1  100%
  Number of Slices                      12718 out of 14720   86%
  Number of Slice Registers             26342 out of 58880   44%
    Number used as Flip Flops           26339
    Number used as Latches                  0
    Number used as LatchThrus               3
  Number of Slice LUTs                  26218 out of 58880   44%
  Number of Slice LUT-Flip Flop pairs   36820 out of 58880   62%
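If (and this is an assumption on my part) each of the RAMB* and FIFO36* primitives that the map report counts "out of 244" occupies one of the 244 36-kbit blocks, then the real total can simply be summed from the log. A quick parsing sketch over the BRAM lines from the summary above:

```python
import re

# The lines from the Device Utilization Summary that are counted
# against the 244 block RAMs (copied from the log excerpt above).
LOG = """\
Number of FIFO36_72_EXPs 14 out of 244 5%
Number of FIFO36_EXPs 3 out of 244 1%
Number of RAMB18X2s 22 out of 244 9%
Number of RAMB18X2SDPs 25 out of 244 10%
Number of RAMB36SDP_EXPs 12 out of 244 4%
Number of RAMB36_EXPs 43 out of 244 17%
Number of RAMBFIFO18_36s 5 out of 244 2%
"""

def total_bram(log_text):
    # Sum every primitive count reported against the 244 block RAMs,
    # assuming each such primitive occupies one whole 36-kbit block.
    total = 0
    for m in re.finditer(r"Number of \S+ (\d+) out of 244", log_text):
        total += int(m.group(1))
    return total

used = total_bram(LOG)
print(f"{used} of 244 block RAMs")  # 124 of 244
```

By that reckoning the design is sitting at 124 of 244 blocks, roughly 51%, which is a very different picture from the numbers LV has been showing me and would at least make the placement failures plausible.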