SCTL clock domain

kabooom · ‎12-03-2020

Hi

I have a question about SCTL.

I am writing some code for a cRIO 9145. In this code I have to calculate sin(x). To do this I use the High Throughput Sine & Cosine function. According to the help, this function takes 17 cycles to return a valid result.

I ran some tests and measured something like 21 ticks per call, or 26 µs for 50 calls.

To improve the execution speed, I tried putting this call in an SCTL and selected a derived clock of 80 MHz instead of 40 MHz. I was thinking it would still take something like 21 ticks, but since the clock runs twice as fast, my function would execute twice as fast.

Unfortunately I measure 29 µs for 50 calls... it's even worse than before...
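Converting both measurements into clock ticks per call makes the comparison easier. This is only a small Python sketch of the arithmetic, assuming the 26 µs figure was taken with the 40 MHz clock and the 29 µs figure with the 80 MHz SCTL; the function name is made up for illustration.

```python
# Convert a measured wall-clock time into FPGA clock ticks per call.
def ticks_per_call(total_time_s: float, clock_hz: float, calls: int) -> float:
    return total_time_s * clock_hz / calls

# Assumption: 26 us was measured on the 40 MHz clock, 29 us in the 80 MHz SCTL.
at_40mhz = ticks_per_call(26e-6, 40e6, 50)
at_80mhz = ticks_per_call(29e-6, 80e6, 50)

print(f"40 MHz: {at_40mhz:.1f} ticks per call")
print(f"80 MHz: {at_80mhz:.1f} ticks per call")
```

If those assumptions hold, the 80 MHz version spends roughly twice as many ticks per call, so something other than the 17-cycle latency is adding ticks.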

Am I missing something?

Thanks for your help

wiebe@CARYA · ‎12-03-2020

I'm surprised it even compiles. Typically, a function in an SCTL can only take 1 tick... So they probably translated the ticks it takes into a latency.

What is your goal?

Sine waves can often be created with a LUT, approximated with fast formulas, or with a combination. This will take 1 tick...
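To illustrate the LUT idea, here is a minimal Python model of a fixed-point sine table. The table size (1024 entries) and output format (15 fractional bits) are assumptions picked for the example, not values from this thread; on the FPGA the table would sit in memory and a lookup would be a single read.

```python
import math

TABLE_BITS = 10                    # hypothetical: 2**10 entries per full period
TABLE_SIZE = 1 << TABLE_BITS
FRAC_BITS = 15                     # hypothetical: signed output, 15 fractional bits
SCALE = (1 << FRAC_BITS) - 1

# Precompute one full period of sine as integers.
SINE_LUT = [round(math.sin(2 * math.pi * i / TABLE_SIZE) * SCALE)
            for i in range(TABLE_SIZE)]

def lut_sin(phase_index: int) -> int:
    """Fixed-point sine for a phase given as a table index (wraps every period)."""
    return SINE_LUT[phase_index & (TABLE_SIZE - 1)]

# Error at the table points themselves is only the rounding to FRAC_BITS.
max_err = max(abs(lut_sin(i) / SCALE - math.sin(2 * math.pi * i / TABLE_SIZE))
              for i in range(TABLE_SIZE))
print(f"max error at table points: {max_err:.2e}")
```

Phases that fall between table entries need either a bigger table or interpolation, which is where the "combination" comes in.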

Search LabVIEW like a graph!

kabooom · ‎12-04-2020

Hi,

Thanks for your answer.

My goal was to reduce execution time. I had a performance problem: I have a formula in which I have to calculate the sine of each point. I thought this could solve my problem. It didn't work, so I changed my architecture to parallelize some code.

But I still don't understand why.

If I'm correct (I'm quite new to using SCTLs), you're right: the SCTL must execute in one cycle, but if you use High Throughput functions, some of them take longer (e.g. sin, divide), so you only have a valid result after a latency of x cycles. So if I connect the High Throughput sine's "output valid" to the stop terminal of the SCTL, the SCTL will stop after 17 cycles and return a valid value. The idea to save some time was to double the frequency of the SCTL (80 MHz instead of 40 MHz). Unfortunately it doesn't seem to work...

cbutcher · ‎12-04-2020

Hi Kabooom,

So making the SCTL operate at a higher frequency should improve the throughput (sine values calculated per second) as you expect.

However, you shouldn't be stopping the loop after the value is valid.

Instead, you should accept the latency (sounds like you already went through this bit) and then by modifying the setup of the High Throughput Sine function you can set the behaviour of the function.

Attached is a zip file containing a project with a simulation showing this - by setting the throughput to 1 cycle/sample the latency (for a 16 bit input) becomes 16 cycles (17 if reading from a register).

Together with handshaking (should be implemented in case of different input rate or downstream usage, etc), this allows you to continuously stream data.

[Image: Desktop Execution simulation]

[Image: FPGA VI with Handshaking interface setup]

If you run this simulation, you'll see for the first 16 iterations, "Elements to be Read" will be 0 (nothing was output on the FPGA) and the input value will go from 0 to 2pi*15/1000. Then the next value will be input, and "Elements to be Read" will be 1. It will never change (because you're writing one new value per SCTL iteration and reading one value per SCTL iteration), but the returned data value isn't 2pi*16/1000, it's 0. Then the next value is 2pi/1000, and so on. The output will lag the input by 16 cycles, but this is constant, and the throughput becomes 40MHz (SCTL clock frequency) in this example (although since this is a simulation, the throughput is much much lower, and primarily determined by the "Wait time (Simulation)" input 😉 )
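The lag described above can be reproduced with a tiny Python model. It only mimics the timing (a 16-slot delay line with one new input per cycle), not the actual High Throughput Sine implementation; `simulate` is a made-up name.

```python
import math
from collections import deque

LATENCY = 16  # cycles between a sample entering and its result leaving

def simulate(num_cycles: int):
    """Feed one new input per cycle; return per-cycle outputs (None while filling)."""
    pipeline = deque([None] * LATENCY)       # one slot per cycle of latency
    outputs = []
    for n in range(num_cycles):
        x = 2 * math.pi * n / 1000           # same input ramp as in the simulation
        pipeline.append(math.sin(x))         # a new sample enters every cycle
        outputs.append(pipeline.popleft())   # the oldest slot leaves every cycle
    return outputs

out = simulate(20)
for n, y in enumerate(out):
    print(n, "no output yet" if y is None else f"{y:.6f}")
```

The first 16 cycles produce nothing, then one result comes out every cycle, starting with sin(0).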

kabooom · ‎12-04-2020

Hi cbutcher

Thanks a lot for your detailed answer. It sounds a little more complicated than I expected. I think I understand what you are doing, but I'm still a little confused.

If I don't use an SCTL, I just call the High Throughput Sine function. The function takes 17 cycles to execute, i.e. the FPGA code is blocked on this function for 17 cycles.

Now here is what I wanted to try. I put the function in an SCTL. The function returns valid data after 17 cycles (when "output valid" becomes true). I stop my SCTL to retrieve the value. But to save some time, I run the SCTL twice as fast. This is what I was thinking...

Should I really use a FIFO if I want to stay on the FPGA? The input and output of the function come from the FPGA, not from the host... I think I missed something about SCTLs...

Thanks again for your help

cbutcher · ‎12-04-2020

@kabooom wrote:

I put the function in an SCTL. The function returns valid data after 17 cycles (when "output valid" becomes true). I stop my SCTL to retrieve the value. But to save some time, I run the SCTL twice as fast. This is what I was thinking...

I would suggest not stopping the SCTL. Put data in as fast as you acquire it, and take it out as it's ready; leave it running in the background (well, on the fabric... whatever). Only stop it if you're shutting down the FPGA/RT system (and even then, only if you want to do something afterwards, or want to be sure everything was processed right up to the last acquired element).
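A rough cycle count shows why this matters. The Python sketch below compares restarting the calculation for every sample against streaming one sample per cycle, using the 17-cycle latency and 50 samples mentioned in this thread. It ignores any overhead for starting and stopping the loop, so the real restart case would be slower still.

```python
LATENCY = 17   # cycles until "output valid", as quoted from the help
SAMPLES = 50   # number of sine evaluations in the test

def cycles_restart_per_sample(samples: int, latency: int) -> int:
    # Every sample waits for the full latency before the next one starts.
    return samples * latency

def cycles_streaming(samples: int, latency: int) -> int:
    # First result after `latency` cycles, then one more result per cycle.
    return latency + samples - 1

for clock_hz in (40e6, 80e6):
    restart_us = cycles_restart_per_sample(SAMPLES, LATENCY) / clock_hz * 1e6
    stream_us = cycles_streaming(SAMPLES, LATENCY) / clock_hz * 1e6
    print(f"{clock_hz / 1e6:.0f} MHz: restart {restart_us:.2f} us, "
          f"streaming {stream_us:.2f} us")
```

Under these assumptions streaming needs 66 cycles instead of 850, which is a far bigger gain than doubling the clock.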

@kabooom wrote:

Should I really use a FIFO if I want to stay on the FPGA? The input and output of the function come from the FPGA, not from the host... I think I missed something about SCTLs...

You don't have to use a FIFO (and if your source/sink are on the FPGA, then you really would not want to use DMA FIFOs like I did here...).

You only need to know if the input is valid, and the output is "ready", whatever that might mean. (Edit: to be clear, what I mean is that some outputs might always be ready, so you could just wire a True Constant. e.g. if you're putting the output in a global variable/main FPGA VI indicator and you don't care about historic values, only the most recent one).

If you're acquiring from an analog input though, the input won't be valid every cycle of the FPGA SCTL (at least, I don't think there are any AI that fast for cRIO...)

In that case, I feel like a separate loop that acquires from the AI node and sends via FIFO is quite convenient, but it isn't a requirement - you just need to make sure that every time you have a valid input, you set "input valid" true. You can instead allow the node to be valid only once per 16 iterations (I think this removal of pipelining will reduce the fabric requirements?) - this is reasonable if you know the necessary throughput is lower.

What do you want to do with your sin(x) value after you calculate it?

As with the input condition, a FIFO isn't a requirement - you could directly use it maybe, you could use a Register or a Local/Global Variable or Indicator, but presumably you want to put it somewhere.

The FIFO allows you to separate the use of the data from the processing of the data, avoiding any weird issues messing up the throughput of the SCTL.

It might be that your limitation has nothing to do with the processing of the data, and is instead driven by your input/output nodes - maybe you could describe those in more detail?

kabooom · ‎12-04-2020

Wow... that's what I call a fast answer. Again, thanks a lot.

As I said, I changed my architecture a bit because this solution didn't work as I expected.

To briefly summarize, I am trying to calculate an RMS value for 32 channels using sine extrapolation. My new architecture is the following:

1) Data task: I acquire 32 channels on a 9145 and put all the data in a FIFO.

2) Detect task: I read the FIFO, detect the period, calculate the frequency, and write the data to a memory. When a period is complete I write some data (memory index...) to another FIFO.

3) Calculation task: I read the data from memory and calculate the sine, the sum and the RMS values.

To improve performance, I duplicated the last task, i.e. each copy handles 16 channels. But my first thought was to speed up the sine function because it was the bottleneck (I have to calculate the sine for each sample, and each sine took approximately 21 ticks * 50 samples * 32 channels...)
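To put numbers on that bottleneck, here is the same arithmetic as a short Python sketch. It assumes the 21 ticks are counted on the 40 MHz clock and that the sines are computed one after another; the function name is only illustrative.

```python
TICKS_PER_SINE = 21   # measured ticks per call, from the first post
SAMPLES = 50          # samples per channel
CLOCK_HZ = 40e6       # assumption: ticks counted on the 40 MHz clock

def sine_time_us(channels: int) -> float:
    """Time spent on sine calls alone when they run strictly one after another."""
    ticks = TICKS_PER_SINE * SAMPLES * channels
    return ticks / CLOCK_HZ * 1e6

print(f"32 channels in one task: {sine_time_us(32):.0f} us")
print(f"16 channels per task:    {sine_time_us(16):.0f} us")
```

That is 33600 ticks for all 32 channels in one task, and half of that per task after the split.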

I think I understand what you're explaining. But even if my result is wrong, why does the function take the same amount of time if I put it in an 80 MHz SCTL?

cbutcher · ‎12-04-2020

Without seeing your actual code, I can't really do more than guess, but my guess is that the calculation of the sine in the SCTL wasn't actually the key delay once it was factored into the surrounding code. Maybe something caused the start of the SCTL to be synchronized with a slower clock and produced delays? I don't know 😕

The 9145 is an expansion chassis, so I still don't know where your data comes from or how fast it is 😉 but I guess you're acquiring 32 values at once, enqueueing each to the FIFO, then enqueueing the next 32, then the next?

And 'downstream', you're pushing this into a memory block (maybe rearranged by channel?) and storing the address in a separate FIFO. I guess this pushing of data is continuous, but the FIFO enqueue is happening once per period of your input signal. Is it the same for every channel, or do they vary in period/phase?

Finally, you're reading your memory block to get 50 samples from a single channel at a time? and calculating the RMS, sum and sine of each of the 50 samples?

Since you want the sine of each sample, depending on what you do next, you might consider calculating that when you're populating your memory block, and having a second block? You could perhaps also do the same for RMS and sum - I guess you're using the "DC and RMS Measurements" Express VI? (This also has a handshaking implementation for use inside the SCTL, but the latency might be different depending on sample rate etc, so be careful - you might consider using a Delay Node to "resync" your outputs by increasing the latency of the fast one, probably the Sine calculation.)
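The "resync" idea can be modelled in a few lines of Python: add extra delay to the faster path until both paths have the same total latency. The 16-cycle sine latency comes from earlier in this thread; the 20-cycle RMS latency is purely hypothetical, and `delayed` is a made-up helper standing in for a pipelined block or Delay Node.

```python
from collections import deque

def delayed(stream, latency):
    """Model a pipelined block: None for `latency` cycles, then the inputs in order."""
    line = deque([None] * latency)
    for item in stream:
        line.append(item)
        yield line.popleft()

SINE_LATENCY = 16                          # from the earlier post
RMS_LATENCY = 20                           # hypothetical value for illustration
EXTRA_DELAY = RMS_LATENCY - SINE_LATENCY   # what the added delay has to make up

samples = list(range(40))                  # sample indices stand in for data
fast = delayed(delayed(samples, SINE_LATENCY), EXTRA_DELAY)  # sine + added delay
slow = delayed(samples, RMS_LATENCY)                         # slower block
pairs = list(zip(fast, slow))

print(pairs[19], pairs[20], pairs[21])
```

With the extra delay in place, both outputs refer to the same input sample on every cycle.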

wiebe@CARYA · ‎12-04-2020

Have you considered the alternatives?

You're already working with limited resolution (fixed point)... A large enough LUT could provide exactly the same result under the right circumstances (i.e. when done right).

Approximations could be very close to the actual result, maybe even closer than your resolution provides.

And all within 1 clock cycle.
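As a quick way to judge an approximation against a given resolution, the Python sketch below measures the worst-case error of a degree-7 Taylor polynomial on [-pi/2, pi/2] and prints it next to one LSB of a 15-fractional-bit format. Both the polynomial and the bit width are assumptions chosen for illustration; this checks accuracy only and says nothing about whether a given implementation fits in one clock cycle.

```python
import math

def poly_sin(x: float) -> float:
    """Degree-7 Taylor polynomial for sin(x), intended for x in [-pi/2, pi/2]."""
    x2 = x * x
    # Horner form of x - x^3/6 + x^5/120 - x^7/5040
    return x * (1 + x2 * (-1 / 6 + x2 * (1 / 120 + x2 * (-1 / 5040))))

N = 1000
grid = [-math.pi / 2 + math.pi * i / N for i in range(N + 1)]
max_err = max(abs(poly_sin(x) - math.sin(x)) for x in grid)

print(f"max error on [-pi/2, pi/2]: {max_err:.2e}")
print(f"one LSB at 15 fractional bits: {2 ** -15:.2e}")
```

If the error comes out larger than one LSB, a higher degree, fitted coefficients, or a LUT plus a small correction would be the next thing to try.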

Search LabVIEW like a graph!

kabooom · ‎12-04-2020

Your guesses are quite accurate... impressive 🙂

Your remarks are very welcome. Maybe I could calculate the sine before writing to the memory, but the detect task has to run at least as fast as the data rate, otherwise the FIFO will fill up with samples. My acquisition rate is 50 kS/s, which means I have 20 µs to do the period detection, calculate the frequency for the 32 channels, write the data to memory and do some error checks (phase, frequency)...
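The 20 µs budget can also be expressed in clock ticks. This Python sketch assumes the detect task runs on the 40 MHz clock and handles the 32 channels one after another; neither is confirmed in this thread.

```python
SAMPLE_RATE_HZ = 50e3   # acquisition rate from the post
CLOCK_HZ = 40e6         # assumption: detect task runs on the 40 MHz clock
CHANNELS = 32

period_us = 1e6 / SAMPLE_RATE_HZ                       # time between samples
ticks_per_period = CLOCK_HZ / SAMPLE_RATE_HZ           # ticks available per sample
ticks_per_channel = ticks_per_period / CHANNELS        # if channels run in series

print(f"{period_us:.0f} us per sample period")
print(f"{ticks_per_period:.0f} ticks per sample period")
print(f"{ticks_per_channel:.0f} ticks per channel")
```

Under these assumptions a single 21-tick sine call would use most of the per-channel budget on its own, which fits with the loop not being fast enough.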

So at first I tried to do all the calculations in this loop, but it wasn't fast enough; that's why I decided to create a third task to do the calculation.

But as I wrote, the performance issue seems to be fixed since I duplicated the calculation task... my question was only to understand how the SCTL works.

To test the performance of the SCTL, I just tested the High Throughput Sine function like this:

This takes 31 µs...

This is what I expected:

With a 40 MHz clock: (50 × 17) / 40e6 ≈ 21 µs

With an 80 MHz clock: (50 × 17) / 80e6 ≈ 10.6 µs
