FPGA

maheshkumar0459 · ‎02-28-2016

Hi everyone,

I want to use DRAM on the NI 7975 FPGA module. It has a single bank of 2 GB DRAM with a data width of 512 bits. I made 4 different memories of 40 MB each. Initially I assumed I could use DRAM like normal block memory, because in simulation mode I am getting the same output for both DRAM and block memory, but on real hardware DRAM behaves differently.

1) Grant cycles refers to the amount of time, in cycles, that LabVIEW grants access to each partition of DRAM memory. As I am using only one DRAM, I don't need to take care of grant cycles. Is this observation true?

2) My data type is U32 and my address data type is U32. So every clock cycle I am writing a single element into a single address and reading it in the next clock cycle. Do I need to read 512 bits after writing 512 bits? (I read in a forum that if the DRAM width is 128 bits and the data type is U32, then you should write 4 elements and after that read those elements, or use a packing and unpacking technique. Below is the link where I found this.)

https://forums.ni.com/t5/LabVIEW/Labview-FPGA-DRAM-Address-question/td-p/1383624/page/2

3) Another observation: if I run with a 40 MHz clock I am getting output data, but only half of the data, not all of it. Is that a bandwidth effect?

4) Finally, I want to use it as normal memory and as a FIFO to store a single U32 element and read it in the next clock cycle. Is this possible?

The screenshot below shows my experiment, where I tried to compare normal memory with DRAM memory at a 125 MHz clock rate.

David-A · ‎02-29-2016

What version of LabVIEW are you using? LV 2015 gives you access to some IP that makes DRAM a lot easier to use on a FlexRIO target.

Also you should read the white paper on using DRAM effectively. https://www.ni.com/en/support/documentation/supplemental/21/three-steps-to-using-dram-effectively.ht...

1) You said you used 4 different partitions of DRAM, each 40 MB. Since only one partition can access the DRAM at any one time, the memory controller uses round-robin scheduling to switch between the partitions should each of them have a pending transaction. The grant time specifies how much time each partition is granted when the memory controller gives it access to the DRAM before access is revoked and swapped to a different partition. Note that there is overhead associated with swapping access between partitions, so although lower grant times may in certain cases lower latency, they will also decrease throughput.

2) The 797x targets all have a DRAM width of 512 bits. That means that in order to optimize throughput you should perform both reads and writes in 512-bit transactions. You can forgo throughput optimizations by performing a transaction in the data type of your choosing, for example by performing a read and a write using a single U32. You're only using 1/16th of the available throughput by reading and writing a single U32, but it can be a little easier to create an architecture that writes directly to the DRAM using the desired data type rather than packing the data up into optimally sized packets. So to answer your question, you can read and write using whatever data type you want, but there are tradeoffs to doing so.

3) Accessing DRAM is done optimally in the DRAM clock domain. This is a clock that you can add to 797x targets. It runs at 166MHz, which is the same rate that the memory controller runs at.

4) DRAM can't be used to store data one cycle and access it the next cycle. The theoretical minimum, assuming there is only one partition and you are accessing the DRAM in the DRAM clock domain, is something like 10 clock ticks to submit data to the DRAM, request the data, then retrieve the data. In practice you'll see write-request-read times that are much higher than that, especially if you have multiple partitions. If you need to store data and then access it one cycle later, you need to store the data on the FPGA rather than in external memory. So you can use DRAM as a FIFO or as a large block of random access memory, but there is going to be some appreciable amount of latency since you're accessing a resource that is external to the FPGA.

Though I'm curious what you're doing that you need to store 40MB of data and then access all of it 1 clock cycle later.
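To make the packing tradeoff from point 2 concrete, here is a minimal Python sketch of packing sixteen U32 values into one 512-bit word and unpacking them again. It is only a software illustration: the function names are made up, and the element ordering (element 0 in the lowest bits) is an assumption that may not match how LabVIEW FPGA orders elements in a DRAM word.

```python
WORD_BITS = 512   # DRAM data width on 797x targets, as stated in the reply above
ELEM_BITS = 32    # one U32 element
ELEMS_PER_WORD = WORD_BITS // ELEM_BITS   # 16 elements fit in one DRAM word


def pack_u32(elems):
    """Pack 16 U32 values into one 512-bit integer.

    Assumption: element 0 goes into the lowest 32 bits.
    """
    if len(elems) != ELEMS_PER_WORD:
        raise ValueError("need exactly 16 elements")
    word = 0
    for i, e in enumerate(elems):
        if not 0 <= e < 2 ** ELEM_BITS:
            raise ValueError("element out of U32 range")
        word |= e << (i * ELEM_BITS)
    return word


def unpack_u32(word):
    """Split one 512-bit integer back into 16 U32 values."""
    mask = (1 << ELEM_BITS) - 1
    return [(word >> (i * ELEM_BITS)) & mask for i in range(ELEMS_PER_WORD)]


samples = [10 * (i + 1) for i in range(ELEMS_PER_WORD)]   # 10, 20, ..., 160
word = pack_u32(samples)
restored = unpack_u32(word)
```

With this scheme one DRAM transaction carries 16 samples instead of 1, which is where the 1/16th figure comes from. The cost is the extra logic and the buffering delay needed to collect 16 samples before each write.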

maheshkumar0459 · ‎03-03-2016

Thanks David for the detailed explanation. I have understood points 1, 2 and 3, but I failed to understand the 4th point. Let's say I configured 1 partition of 2 GB (as the 7975 has 2 GB in a single bank). I am running the loop at 125 MHz (the optimal clock is 166 MHz) because my application needs to run at this particular clock rate; I am not bothered about throughput, I am bothered about latency. My program uses 98% of the FPGA resources and I don't have any memories left. As DRAM is external to the FPGA, I want to use it like a normal FIFO or normal on-FPGA memory, with a minimum latency of 1 cycle or 10 cycles. Can't I store one element in the present clock cycle and read it in the next clock cycle if the latency is 1 cycle? If the latency is 10 cycles, can't I read from the 11th cycle? The table below gives my requirement if the DRAM latency is 1 cycle. If the DRAM latency is 10 cycles, then after the 11th cycle o/p valid should be true and reading starts from address 1 and continues. Can this be implemented? What are the necessary actions to implement it successfully? What would be the maximum latency? How do I find the latency? Most important: the program executes correctly in simulation, but on the target I am not getting correct data.

Cycle	Writing into DRAM	Requesting from DRAM	Reading from DRAM	o/p Valid	o/p element
1	Address=1 Data=10 (U32)	-	-	False	0
2	Address=2 Data=20 (U32)	Address=1	Data=10 (U32)	True	10
3	Address=3 Data=30 (U32)	Address=2	Data=20 (U32)	True	20
4	Address=4 Data=40 (U32)	Address=3	Data=30 (U32)	True	30
5	Address=5 Data=50 (U32)	Address=4	Data=40 (U32)	True	40
6	Address=6 Data=60 (U32)	Address=5	Data=50 (U32)	True	50
7	Address=7 Data=70 (U32)	Address=6	Data=60 (U32)	True	60

David-A · ‎03-03-2016

DRAM has an internal pipeline that inserts several ticks of latency between the request for data and the retrieval of data. You need to implement a look-ahead architecture that requests the data several clock ticks in advance of when you actually need it.

Please read the article I provided a link to as the Request Pipelining section answers this question.

https://www.ni.com/en/support/documentation/supplemental/21/three-steps-to-using-dram-effectively.ht...

If you would like some examples of how to implement this I would recommend taking a look at the example finder. You can access this in LabVIEW from the help menu. Once the example finder has launched you'll want to navigate to Hardware Input and Output>>FlexRIO>>External Memory and then take a look at a few of the examples in that directory.
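As a rough illustration of what request pipelining means for the table in the question, the Python sketch below simulates an idealized memory with a fixed request-to-data delay. The 10-cycle figure, the function name and the model itself are assumptions for illustration only; a real memory controller has variable latency and read/write turnaround costs that this model ignores.

```python
from collections import deque

LATENCY = 10   # assumed request-to-data delay in cycles; measure the real value on hardware


def run(cycles, latency=LATENCY):
    """Write one element per cycle and request the previously written address.

    Returns a list of (cycle, output_valid, output_element) tuples.
    """
    mem = {}
    in_flight = deque()   # (cycle when data is ready, requested address)
    log = []
    for cycle in range(1, cycles + 1):
        mem[cycle] = cycle * 10                  # write: address = cycle, data = 10 * cycle
        if cycle > 1:
            # Issue a new request every cycle without waiting for earlier data.
            in_flight.append((cycle + latency, cycle - 1))
        if in_flight and in_flight[0][0] == cycle:
            _, addr = in_flight.popleft()
            log.append((cycle, True, mem[addr]))
        else:
            log.append((cycle, False, 0))
    return log


log = run(15)
```

In this model the output is invalid for cycles 1 to 11 and then delivers 10, 20, 30, ... one element per cycle from cycle 12 onward. The stream is the same as in the table, only shifted by the latency, which is why requests have to be issued ahead of time rather than in the cycle where the data is needed.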

maheshkumar0459 · ‎12-30-2016

Thanks for your explanation David.

Sorry for the delay. I have started working on DRAM again.

How can I find the latency between the request and retrieve methods?

Is it not possible to use it as normal memory?

David-A · ‎01-04-2017

Use a counter and count the number of cycles between when a request is issued and when the retrieve returns the data. It's important to note that latency will depend on the access pattern that you use. If the DRAM is being written to and you issue a read request, it will take longer than if the DRAM had no pending actions prior to issuing the read request.
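The counter idea can be sketched in Python as follows: a free-running tick counter, a FIFO holding the counter value captured at each request, and a subtraction when the data comes back. The memory model here is a toy: the base latency of 10 ticks and the 4-tick penalty for a read issued right after a write are invented numbers, chosen only to show that the measured latency depends on the access pattern.

```python
from collections import deque


def measure_latencies(ops, base_latency=10, write_penalty=4):
    """ops is a string with one character per tick: 'W' write, 'R' read request, '-' idle.

    Returns the measured request-to-data latency, in ticks, for each read.
    """
    tick_counter = 0          # free-running counter, as you would build on the FPGA
    issue_ticks = deque()     # counter value captured when each request is issued
    return_ticks = deque()    # toy model: tick at which each request's data returns
    measured = []
    last_return = 0
    prev_op = "-"
    total_ticks = 2 * len(ops) + base_latency + write_penalty   # enough ticks to drain
    for t in range(total_ticks):
        op = ops[t] if t < len(ops) else "-"
        if op == "R":
            issue_ticks.append(tick_counter)
            delay = base_latency + (write_penalty if prev_op == "W" else 0)
            # Data returns in request order, at most one element per tick.
            last_return = max(last_return + 1, t + delay)
            return_ticks.append(last_return)
        if return_ticks and return_ticks[0] == t:
            return_ticks.popleft()
            measured.append(tick_counter - issue_ticks.popleft())
        prev_op = op
        tick_counter += 1
    return measured


idle_read = measure_latencies("R")          # read with nothing pending
read_after_write = measure_latencies("WR")  # read issued right after a write
```

On hardware the same structure applies: capture the counter when the request is issued, and subtract when the retrieve output reports valid data. The numbers you get will differ from this toy model and will vary with the access pattern.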

"Normal memory" can be interpreted a number of ways. I assume you're asking if the DRAM external to the FPGA can be used exactly the same way as the BRAM internal to the FPGA. If this is what you're asking, then the answer is no.

There are inherent physical differences between accessing data storage elements that are located on the same die and accessing data storage elements that are external. The advantage of using external memory on the 797x FlexRIO devices is that you have up to 2 GB of storage, while internal memory provides only a few MB of storage. The disadvantage of external memory is that there is greater latency when accessing the data, since the data now needs to move chip-to-chip.

maheshkumar0459 · ‎02-07-2017

Thank you David,

After doing wide research and going through the examples, I analyzed a few things.

I succeeded in writing to DRAM and reading from DRAM by writing a number of samples equal to the depth of the DRAM. Let me put it this way: I configured the DRAM length to 1 MB (16384 elements, each 512 bits wide, on the 7975 FPGA). Now I write 16384 elements to DRAM, then I am able to read all the elements, then write again and then read again. I am not able to access DRAM simultaneously; that is, writing data continuously in a first loop and reading data in a parallel loop is not possible. What I understood is that simultaneous access of DRAM is not possible. Is my conclusion correct?

David-A · ‎02-07-2017

"I am not able to access DRAM simultaneously; that is, writing data continuously in a first loop and reading data in a parallel loop is not possible. What I understood is that simultaneous access of DRAM is not possible. Is my conclusion correct?"

You're correct when you say that reads and writes can't occur simultaneously. The memory controller takes several cycles to switch between reading and writing. This is the reason why it's more efficient to queue up several writes before performing any read operations.

The first sentence is incorrect though. You can write to DRAM from one clock domain and read from DRAM in another clock domain. Just be careful not to try to read and write at the same time or your access-time latency will go up.
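To see why queueing up writes before reads helps, here is a small Python sketch that counts cycles for two access patterns under a fixed switching cost. The 6-cycle turnaround is an assumed number, not a figure for the 797x memory controller, and the model charges one cycle per operation with no other latency.

```python
TURNAROUND = 6   # assumed cycles lost each time the controller switches direction


def total_cycles(ops, turnaround=TURNAROUND):
    """Count cycles for a string of 'W' and 'R' operations.

    Each operation costs one cycle, plus the turnaround penalty
    whenever the direction changes.
    """
    cycles = 0
    prev = None
    for op in ops:
        if prev is not None and op != prev:
            cycles += turnaround
        cycles += 1
        prev = op
    return cycles


interleaved = "WR" * 8          # write one element, read one element, 8 times
batched = "W" * 8 + "R" * 8     # queue 8 writes, then 8 reads

interleaved_cost = total_cycles(interleaved)
batched_cost = total_cycles(batched)
```

Both patterns move the same 16 elements, but the interleaved one pays the switching cost 15 times and the batched one pays it once. Larger batches amortize the switching cost further, at the price of more delay before the first read.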

maheshkumar0459 · ‎02-08-2017

"The first sentence is incorrect though. You can write to DRAM from one clock domain and read from DRAM in another clock domain. Just be careful not to try to read and write at the same time or your access-time latency will go up."

If I use a different clock domain then I might miss some data. Do you mean I need to read data from DRAM at double the clock rate at which I am writing? If not, then what does the statement below refer to?

"You can write to DRAM from one clock domain and read from DRAM in another clock domain"

Let me attach a picture of my usage of DRAM.

I configured the DRAM as 1 MB (65536 elements, 128 bits wide).

In the first case I am writing into DRAM.

In the second case I am reading from DRAM with a single-cycle delay (feedback node).

I am retrieving data outside the cases, with the input valid of the DRAM retrieve wired to always true, since whenever a request is queued, data will be available for reading from DRAM.

Please ignore the broken wire.

I have made some comments on the figure. If anything more is required, please let me know.

This is the simultaneous approach I mentioned earlier. The output I am getting is wrong data.

After the above conversation I understand that the approach I took is wrong. What is your different-clock-domain approach to simultaneous write and read?

David-A · ‎02-08-2017

There's nothing wrong with reading and writing from the same clock domain. It's probably recommended in most cases. I was just noting that it's possible to do it from different clock domains. In hindsight it was probably confusing to point that out, my apologies for the confusion.

The approach you're using looks fine. Just make sure that the read request is issued after the write request. If I had to guess why you're not getting the correct data, I would say the logic that produces the address and case selection is producing unexpected behavior. Try running in simulation and confirming that the request is issued when you think it is. The LVFPGA sampling probe can also be useful if you're trying to observe cycle-accurate timing in simulation mode.
