LabVIEW


FPGA timing different in execution mode and simulation mode

I have run into similar issues; my interpretation of the end result is that all logic on the FPGA chip, up to but not including the actual hardware device being controlled, is simulated. When I test in simulation mode, I see zero delay between my outputs and inputs, when in reality the delay is approximately 30 µs.

 

Having said that, I program exclusively with SCTLs on FPGA and am only familiar with the simulation timing of that mode of operation.

Message 21 of 28

A little update on the topic: I created an application where I can disable parts of the code through Case structures and check the timing. First, I fully understand that adding ten new Case structures affects execution considerably. Still, the outcome is unexpected:

 

Everything disabled: 10 ticks per loop

Enabling FIFO writes (9 elements): 36 ticks per loop

Also enabling filters: 75 ticks per loop

Also enabling flagging (custom bitwise data manipulation): 105 ticks per loop

Also enabling input reading: 782 ticks per loop

Disabling everything but input reading: 392 ticks per loop

 

After enabling input reading, the loop is timed by the input signal (which runs at 392 ticks per sample). It is as if enabling input reading slows everything down about 6 times. Any ideas now?

(attached image: Thomas444_0-1580739561823.png)

Message 22 of 28

Hi Thomas,

 


@Thomas444 wrote:

After enabling input reading, loop is timed by input signal (which runs at 392 ticks per one sample). It is like enabling input reading slows everything down about 6 times. Any ideas now?

 In your first message you wrote:

I use a while loop to read inputs of 3 x NI 9232 in cRIO 9057 @ 102400 Hz, which means 9 channels.

The FPGA runs at 40 MHz; divided by 102.4 kHz, that gives 390.625 ticks per sample, pretty close to your 392 ticks per while loop iteration. Pure coincidence?
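GerdW's arithmetic can be checked in a couple of lines (the clock and sample rates are the ones stated in the thread):

```python
# Ticks of the 40 MHz FPGA clock elapsing per ADC sample at 102.4 kHz.
fpga_clock_hz = 40_000_000      # cRIO-9057 FPGA base clock
sample_rate_hz = 102_400        # NI 9232 sample rate from the first message

ticks_per_sample = fpga_clock_hz / sample_rate_hz
print(ticks_per_sample)         # 390.625 - close to the observed 392 ticks
```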

Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
Message 23 of 28

Exactly, I pointed that out from the beginning. But the processing by itself takes ~100 ticks in execution mode, so everything should always be done within 392 ticks. It isn't: it takes double that period, ~790 ticks. See?

 

Edit: Now I have an idea. Maybe the data from the multiple channels of each module are not ready at the same instant. That would mean that every sampling cycle is divided into thirds (for 3 channels) and processing starts only in the last third, after all three channels are ready for dispatch. That might explain why ~100 ticks of processing doubles the timing when combined with reading the inputs. I will try making another application where I can disable individual channels. Using a FIFO might also fix this.

Message 24 of 28

Your code does not run in parallel. Your inputs need to complete running before the rest can even start.

 

If this code were ALL in an SCTL (not just portions of it - I do not understand why your FIFO writes are in SCTLs for a single element), it would execute far more efficiently. Each node you add slows your overall execution down by that much: when you are not using SCTLs, only one portion of the code is ever active at once. So if your inputs alone cost 392 ticks, that's the minimum you're going to need, and every other operation you add increases that time (and can't run in parallel, due to dataflow).

I think your IO node may be synchronising with the hardware each time it starts executing. Your loop timing is suspiciously close to exactly 2x the IO node timing itself. Because you only start reading a new value ~100 ticks after the last one completed, you may "miss" a cycle, depending on how the IO node works internally. It's pretty much a guess, but it seems to fit your descriptions.
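The "missed cycle" guess can be written down as a tiny timing model. This is only a sketch of the hypothesis, assuming the IO node blocks until the next sample boundary after the previous iteration finishes; the tick counts are the ones reported in the thread:

```python
import math

SAMPLE_PERIOD = 392   # ticks between new samples (observed IO node timing)

def loop_period(processing_ticks, sample_period=SAMPLE_PERIOD):
    """Hypothetical model: the IO read can only complete on a sample
    boundary, so any processing time after the read pushes the loop
    period up to the next whole multiple of the sample period."""
    busy = sample_period + processing_ticks
    return math.ceil(busy / sample_period) * sample_period

print(loop_period(0))     # 392 - inputs alone fit in one sample period
print(loop_period(100))   # 784 - close to the observed ~790 ticks
```

If the model is right, even 1 tick of serial processing would double the loop period, which matches the "2x the IO node timing" observation.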

 

I recommend reading:

http://download.ni.com/pub/gdc/tut/labview_high-perf_fpga_v1.1.pdf

 

Pages 15 & 16.

Message 25 of 28

Also,

Take the FP boolean controls out of the processing IO loop. Put them in another loop, and use local variables to read the booleans in the time-critical loop.

 

It sounds bizarre, but years ago NI R&D suggested this when we had timing issues with time-critical loops. The reason is that an FP control in the time-critical loop requires overhead to check whether its value has changed, and this chews up processing.

A local variable essentially puts a wire between the two loops and decouples the TC loop from determining whether the boolean value has changed on the FP.

 

Also, any filter will introduce some latency, as filters essentially buffer data and do summing and/or division. It may be better to put an actual filter circuit on the wire instead. I am working on a PID project now where we have a latency issue with the filter and are solving it that way.
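Jack's point about filter latency can be illustrated with the simplest possible case (this is not the poster's actual filter, just a generic example): an N-tap moving average buffers N samples, so a step at the input takes N samples to fully appear at the output.

```python
def moving_average(samples, n):
    """N-tap moving average: each output is the mean of the last n inputs,
    so the output lags the input by (n - 1) / 2 samples on average."""
    return [sum(samples[i - n + 1:i + 1]) / n
            for i in range(n - 1, len(samples))]

step = [0.0] * 5 + [1.0] * 10          # unit step input
out = moving_average(step, 4)          # 4-tap average
print(out)                             # the edge is smeared over 4 samples
```

Any FIR/IIR filter pays a similar price, which is why an analog filter "on the wire" removes the latency from the FPGA loop entirely.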

 

Regards

Jack Hamilton

Message 26 of 28

With appropriate pipelining, I think you should be able to get it done as fast as your acquisition comes in. Of course, the total time for the data to be acquired, processed and output will be the same, but you could acquire new data as fast as your module allows (rather than as fast as your module's acquisition period plus the processing time).
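The throughput-versus-latency distinction in that paragraph can be made concrete with the thread's own numbers (the tick counts are taken from earlier posts; the model is a simplification):

```python
ACQ = 392    # ticks for one acquisition (one sample period)
PROC = 100   # ticks of processing per sample

# Without pipelining, each iteration waits for acquisition AND processing
# before the next read can start:
sequential_period = ACQ + PROC        # 492 ticks between new samples

# With pipelining, sample k is processed while sample k+1 is acquired,
# so the loop rate is set only by the slower stage:
pipelined_period = max(ACQ, PROC)     # 392 - as fast as acquisition

# The end-to-end latency for any single sample is unchanged:
total_latency = ACQ + PROC            # still 492 ticks from input to output
print(pipelined_period, total_latency)
```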

 

Perhaps the simplest (although not necessarily the fastest) way to do this is to put acquisition in one loop and pass the data through a local FIFO to an SCTL (or, if needs must, a While loop), then just check for valid data (handshaking) or a non-timeout (timeout) before processing.
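The two-loop FIFO pattern described above maps onto a standard producer/consumer structure. Here is a minimal Python sketch of the idea (the FPGA local FIFO is stood in for by a thread-safe queue, and the "processing" is a placeholder):

```python
import queue
import threading

fifo = queue.Queue(maxsize=64)   # stands in for the FPGA local FIFO

def acquisition_loop():
    # Producer: push each "sample" as soon as it is acquired.
    for sample in range(5):
        fifo.put(sample)
    fifo.put(None)               # sentinel: acquisition finished

def processing_loop(results):
    # Consumer: block until valid data arrives (the "handshaking" case).
    while True:
        sample = fifo.get()
        if sample is None:
            break
        results.append(sample * 2)   # stand-in for filtering/flagging

results = []
t = threading.Thread(target=acquisition_loop)
t.start()
processing_loop(results)
t.join()
print(results)   # [0, 2, 4, 6, 8]
```

On the FPGA, the consumer side would either block on the FIFO read (handshaking) or poll with a timeout, exactly as the post describes.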

 

Like Intaris said (and I tried to say, perhaps unsuccessfully), it seems like you have lots of separate loops and it's not obvious why this should be the chosen design...


Message 27 of 28

I tried some more measurements and found that the time from the start of a cycle to finishing reading the input data is ~320 ticks, no matter how many channels I read from the 9232 module (and I repeat, the sampling period is ~395 ticks). That means the room for data processing in the same cycle is very small (~80 ticks). Wiring the acquired data to a shift register and processing it in the next cycle did it for me, because then I can use all ~395 ticks of the cycle time.
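The arithmetic behind this fix, using the measurements above (the ~100-tick processing figure is from an earlier post):

```python
SAMPLE_PERIOD = 395   # ticks per 9232 sample
READ_DONE = 320       # ticks from loop start until input data is available
PROCESSING = 100      # approximate processing cost (filters + flagging)

same_cycle_budget = SAMPLE_PERIOD - READ_DONE   # ~75-80 ticks remaining
fits_same_cycle = PROCESSING <= same_cycle_budget   # False: doesn't fit

# A shift register defers processing of sample k to the acquisition
# window of sample k+1, making the whole sample period available:
fits_next_cycle = PROCESSING <= SAMPLE_PERIOD       # True
print(fits_same_cycle, fits_next_cycle)
```

This is exactly the one-stage pipeline suggested earlier in the thread: same latency per sample, but the loop now keeps up with the acquisition rate.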

 

Thank you all, guys. Some of your ideas were very useful and helped me put pieces of knowledge into context. Obviously, timing in a hardware implementation is far more complicated than one can foresee in simulation. I hope this discussion will be useful for others with timing issues, as we covered a wide range of solutions.

Message 28 of 28