04-03-2014 09:04 AM
Implementation details for LV 2012 would suffice.
04-03-2014 12:29 PM - edited 04-03-2014 12:30 PM
Phew. Today was a very frustrating day.
I finally managed to get a version working that sends the values from a 160MHz loop to a 40MHz loop without having to resort to BRAM, or having to scale up the speeds so that they match or so that I could live with the 2 cycles of handshaking.
I can implement a group of Registers for a single variable (instead of a single register as I have previously used). I then write to each one in turn on the producer side. Then, on the consumer side, I read from the registers one at a time, with an index difference of 1 (per cycle: write N, read N+1).
This seems to work. I need to investigate if the index shift is repeatable as there's no way to change this in the end software. The final "aha" moment was writing to the registers in series but reading them all in parallel, but only actually using one value at a time (stepping through the registers essentially, but the "blind" read functions take care of the otherwise annoying handshakes). I feel this may prove unstable in real software but it's worth a try. It's basically a custom FIFO using Registers as elements.......
Shane.
04-03-2014 01:28 PM - edited 04-03-2014 01:29 PM
One thing to keep in mind here is that writing the values separately on each tick of the faster clock cycle can cause the values to show up at different clock cycles on the read side. If it is important that the data be atomic, you'll want to buffer up an array of 4 values on the writer side and then commit that whole block on the last cycle and then the data will show up aligned on the next (aligned) cycle of the slower clock.
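A rough sketch of that block-commit idea, written in Python since a block diagram can't be pasted as text. The function name and the block_size parameter are made up for illustration, and the cross-domain handshake itself is not modelled. It only shows staging four fast-clock samples and releasing them together on the last fast tick.

```python
def block_commit(samples, block_size=4):
    """Stage fast-clock samples and release them as one block.

    Hypothetical helper: one tuple is yielded on the last fast tick of each
    group of block_size ticks, so the reader sees all values change together.
    """
    staged = []
    for tick, sample in enumerate(samples):
        staged.append(sample)
        if (tick + 1) % block_size == 0:
            yield tuple(staged)  # commit the whole block at once
            staged = []


blocks = list(block_commit(range(8)))
print(blocks)  # [(0, 1, 2, 3), (4, 5, 6, 7)]
```

Each committed block would still have to go through the normal register handshake, so this trades throughput for atomicity.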
04-03-2014 04:08 PM - edited 04-03-2014 04:15 PM
The values showing up on different cycles on the read side is exactly the point. This is essential to what I'm trying to achieve. Each register carries one quarter of the total data being transferred, with the handshaking for each register being offset, hence the ability to have a valid new value each and every iteration of the slower loop. The data spread over the four registers is NOT atomic, it's interleaved. The series of data is essentially transferred via Register N, N+1, N+2, N+3, N, N+1, N+2 and so on. The trick is keeping the writing and reading indexes aligned.
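To make the interleaving concrete, here is a toy Python model of it. It is only a sketch under assumptions that are not confirmed by the real register implementation: time is counted in ticks of the loop that paces the transfer, one value is written per tick, and each register needs handshake_cycles = 3 ticks before its value is visible on the read side and before it accepts another write. The function and parameter names are made up for illustration.

```python
def simulate_interleaved(num_values, num_registers=4, handshake_cycles=3):
    """Toy model of round-robin writes over a bank of handshaked registers."""
    busy_until = [0] * num_registers  # tick at which each register accepts a new write
    visible = [None] * num_registers  # value currently visible on the reader side
    pending = {}                      # arrival tick -> (register index, value)
    received = []

    for cycle in range(num_values + handshake_cycles):
        # A value whose (assumed) handshake completes now becomes readable.
        if cycle in pending:
            reg, value = pending.pop(cycle)
            visible[reg] = value

        # Producer: write value number `cycle` to the next register in turn.
        if cycle < num_values:
            w = cycle % num_registers
            if cycle >= busy_until[w]:  # a busy register would drop the write
                busy_until[w] = cycle + handshake_cycles
                pending[cycle + handshake_cycles] = (w, cycle)

        # Consumer: read the register that was written handshake_cycles ago.
        r = (cycle - handshake_cycles) % num_registers
        if visible[r] is not None:
            received.append(visible[r])

    return received


print(simulate_interleaved(8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With four registers and a 3-tick latency, the read index in this model works out to (write index + 1) modulo 4, which happens to match the "write N, read N+1" offset. Whether the offset in the real hardware arises the same way is an assumption.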
The example is for a single parameter being transferred between fast and slow loop. In reality, this would all be duplicated for each unit being multiplexed within the faster loop.
04-04-2014 02:54 AM
I still have some open questions regarding how the registers work. I'm trying to find a bomb-proof solution to my problem and knowing the parameters important to the transfer of data via register is important for that.
At the moment, the 33% throughput characteristic (one new data point every 3 read cycles) when writing and reading at the same rate from different clock domains is hurting me. Although I have a solution (interleaving - see above), I think I may be able to simplify this if I could clear up a few points regarding the register implementation. In all of the following cases, each clock I am referring to is > 40MHz.
I realise this is getting into the gritty details of the implementation but I'm really trying to squeeze as much as possible out of my architecture and "small" details like this are capable of bringing the whole house of cards crashing down.
Shane.
04-04-2014 05:08 AM
Testing would indicate that the first two points are as I assumed. Handshaking occurs ONLY between different clock domains, even if the same register is used for both the same and a different clock domain. There is lost data between the high and low speed loops even though WITH THE SAME REGISTER there is no data loss between reading and writing within the same high speed loop.
Am I right in assuming these are implemented as two independent registers behind the scenes? How can one read work without handshaking and the other require it?
04-04-2014 09:29 AM
Sometimes you just have to get down in the dirt to get things working : )
If you really want to understand this, here is a decent overview of how synchronization is done in hardware: http://www.stanford.edu/class/ee183/handouts/synchronization_pres.pdf. You can skip down to the part about handshaking if you just want to see what hardware is involved. The implementation for LabVIEW FPGA is a slightly modified/optimized version of this.
For your questions ...
You are correct, when the write and read are in the same clock domain there is no synchronization overhead and the value is available on the next clock cycle. When multiple clocks are involved, think of there being one version of that register in each clock domain in which it is accessed. When a write occurs, there is some logic (see pdf above) that moves that value safely to the clones of the register in the other clock domains.
And because of what it takes to make the transfer safe, you can lose data if you push data into the write side more often than once per those 2-3 cycles of latency (in the slowest clock domain) needed to get the data to the other regions. Again, all of this "can" be optimized away in some cases if the two clock domains are related nicely (derived from the same source clock, etc.), but LabVIEW FPGA does not currently do that optimization.
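A minimal Python sketch of why writing too often loses data. This is a toy model, not the actual LabVIEW FPGA logic: it assumes a write is attempted every cycle and that the register simply ignores new writes while a handshake of handshake_cycles = 3 cycles is in flight. The function name is hypothetical.

```python
def simulate_single_register(num_writes, handshake_cycles=3):
    """Return the indices of the writes that actually cross the clock boundary."""
    delivered = []
    busy_until = 0  # cycle at which the (assumed) handshake has finished
    for cycle in range(num_writes):
        if cycle >= busy_until:
            delivered.append(cycle)
            busy_until = cycle + handshake_cycles
        # otherwise the write lands while a transfer is in flight and is lost
    return delivered


delivered = simulate_single_register(12)
print(delivered)                        # [0, 3, 6, 9]
print(f"{len(delivered) / 12:.0%}")     # 33%
```

Under these assumptions only one write in three gets through, which lines up with the 33% throughput figure mentioned earlier in the thread.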
Also, for anyone that cares, note that the same clock domain must be the same exact clock. If you derive two 40 MHz clocks from two different base clocks, that would be a clock crossing because they may not be aligned with one another.
That is a lot of info, so please keep asking questions.
04-04-2014 10:23 AM
OK, It's nice to see that my thoughts on the subject are starting to align with reality, something I've grown to value instead of taking for granted in younger years. 🙂
I'm slowly making progress on my architecture and I think I'll be able to salvage my original architecture with a few tweaks.
Regarding the related clocks, it's important that the clocks are phase locked (exact same base clock) and have frequencies which are whole multiples of each other (40 & 80, 120 & 40 but not 120 & 80) so that the conditions for handshake-free transfers could theoretically be possible, right? As long as the starting points (and end points) of each clock cycle in both domains are always aligned. This is not the case with 120MHz and 80MHz, even though both may be an integer multiple of the base clock (40MHz).
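A quick way to check that alignment condition for a given pair of frequencies, sketched in Python. It assumes both clocks are phase locked and share a rising edge at t = 0, and it reads "aligned" as "every edge of the slower clock coincides with an edge of the faster clock". The function names are made up for illustration.

```python
from fractions import Fraction
from math import gcd


def rising_edges(freq_mhz, window_us):
    """Exact rising-edge times (in microseconds) within [0, window_us)."""
    period = Fraction(1, freq_mhz)
    count = int(window_us / period)
    return {k * period for k in range(count)}


def slow_edges_aligned(a_mhz, b_mhz):
    """True if every edge of the slower clock is also an edge of the faster one."""
    fast, slow = max(a_mhz, b_mhz), min(a_mhz, b_mhz)
    # The relative phase of the two clocks repeats after 1/gcd microseconds.
    window = Fraction(1, gcd(fast, slow))
    return rising_edges(slow, window) <= rising_edges(fast, window)


print(slow_edges_aligned(40, 80))    # True
print(slow_edges_aligned(120, 40))   # True
print(slow_edges_aligned(120, 80))   # False
```

For 120MHz and 80MHz, the 80MHz edge at 1/80 µs falls between two 120MHz edges, so the check fails even though both are integer multiples of 40MHz.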
04-04-2014 10:38 AM
That's right, if you were using 40 MHz and 120 MHz clocks derived from the same clock source the optimization "would" be possible. However, you would have to do it yourself using CLIP for now until LabVIEW FPGA natively supports it.
04-04-2014 10:53 AM
@Dragis wrote:
That's right, if you were using 40 MHz and 120 MHz clocks derived from the same clock source the optimization "would" be possible. However, you would have to do it yourself using CLIP for now until LabVIEW FPGA natively supports it.
I choose to interpret the word "until" as being a promise. That's a nice way to end the week.