03-09-2021 08:27 AM
Hi,
I'm using sample clock synchronization to synchronize 4 SC Express cards (PXIe-4303) and 10 X-Series cards in an 18-Slot Chassis. The cards are used in hardware timed single point mode.
The 4 PXIe-4303 cards are combined in a multi-device task. That task is also the "timing master", exporting its sample clock to PXI_TRIG0 (the 4303 cards can't import the sample clock from another device, hence they have to be the master).
The X-Series cards use PXI_TRIG0 as their sample clock. The read and write functions of all DAQmx tasks are called in a single Timed Loop whose timing source is the multi-device task using the 4 4303 cards.
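For reference, a rough sketch of this clock routing in the DAQmx C API (untested here; the device/terminal names like "/PXI1Slot2/PXI_Trig0" are placeholders, not my actual configuration):

```c
#include <NIDAQmx.h>

/* Sketch of the clock routing described above (device names are
   placeholders): the 4303 multi-device task exports its sample clock
   to PXI_TRIG0, and each X-series task uses that line as its external
   sample clock in hardware-timed single point mode. */
void route_sample_clock(TaskHandle masterTask, TaskHandle xSeriesTask, float64 rate)
{
    /* Master (4 x PXIe-4303 in one task): export the sample clock */
    DAQmxExportSignal(masterTask, DAQmx_Val_SampleClock, "/PXI1Slot2/PXI_Trig0");

    /* Slave (X-series): use PXI_TRIG0 as the sample clock */
    DAQmxCfgSampClkTiming(xSeriesTask, "/PXI1Slot5/PXI_Trig0", rate,
                          DAQmx_Val_Rising, DAQmx_Val_HWTimedSinglePoint, 1);
}
```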
This works well as long as all tasks run at the same rate. However, due to the "long" time it takes to complete the AI conversion on X-Series cards (depending on AI Convert Clock rate and channel count), I decimate the sample clock of those AI tasks by 2 and then only call the Read function every 2nd loop iteration. That way I never have to wait for the AD conversion to finish, leaving more time to do useful work.
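The decimated read schedule boils down to a predicate like this (a simplified sketch; should_read_ai is just an illustrative name, and the phase alignment it glosses over is exactly what this question is about):

```c
#include <stdbool.h>

/* With the AI sample clock decimated by `decimation`, call DAQmx Read
   only on every `decimation`-th iteration of the timed loop, so the
   AD conversion has a full loop period to finish in between. */
bool should_read_ai(int loop_cnt, int decimation)
{
    return loop_cnt % decimation == decimation - 1;
}
```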
My problem is that it sometimes works as expected and sometimes not. By "works as expected" I mean that the Read functions return in < 10µs, indicating that the AD conversion is already complete. By "not working" I mean that it takes several hundred µs for the Read functions to return, indicating that the AD conversion is still ongoing at the moment the function is called.
I can't figure out where this "randomness" comes from. I guess there must be some additional synchronization step that I'm missing? I should also say that there is no drift, so if it works immediately after program start, it will continue working forever. If it doesn't work immediately after program start, it will never work.
Here are the steps I'm doing to synchronize the cards:
Could someone explain what needs to be done to make this work reliably (the X-Series AD Conversion must always be performed during the loop iterations where Read is not called, so Read will return immediately in the iterations where it is called)?
Many thanks for your help!
03-09-2021 03:28 PM
I've long advocated *against* reliance on Hardware-Timed Single Point mode when running under Windows. It's been even longer since I used that mode myself under RT. You didn't mention which you were doing, but the regularity of what you observed leads me to suspect you're running RT.
Here's how I think I recall that it works. You didn't mention a sample rate, but seem to have concerns at the microsec level, so for now I'll just use 1000 Hz sampling as a talking point.
1. Once you start the task, the 1000 Hz sample clock will run for as long as the task lasts.
2. There is no buffer.
3. At whatever *phase* you happen to be within that 1000 Hz cycle when you make the software call to DAQmx Read, the function will wait for the *next* active edge of the sample clock (and then enough convert clocks for all your channels) before returning.
Sometimes the next active edge comes within <100 microsec, sometimes it takes >900. The independence of timing between the free-running sample clock and your software execution will make this wait time look "random." [Edit: if you are in fact running RT and using the sample clock to drive a Timed Loop, there should be quite a bit less randomness]
4. So I don't *think* it's the case that conversions are always happening in the background. I think the call to DAQmx Read *initiates* them. Just like a software-timed read. The difference (as I understand it) is that in software-timed mode, sampling is initiated immediately (but perhaps with more overhead each time) whereas with HWTSP mode, sampling is initiated at the next active edge of the free-running sample clock.
5. I think I recall that inputs and outputs had slightly different behavior when a sample opportunity was missed under HWTSP mode. I don't remember what the exact difference was, but the 4 possibilities that come to mind are: fatal DAQmx error, non-fatal DAQmx error, DAQmx warning, function performs without error or warning.
Ok, all that's a mouthful and then some.
If indeed you're running RT, I'm inclined to venture that your *output* tasks are the most crucial ones to leave running in HWTSP mode. That gets you the real-time control advantage of regular update timing without the disadvantage of latency due to buffering.
I'd be a little inclined to try something along the lines of:
6. Configure the 4303 cards for ~5-10x their present sample rate. Still export to PXI_TRIG0
7. Let your X-series cards use PXI_TRIG0 directly as a sample clock (if they support the higher rate) or else as a "sample clock timebase" if you need to divide it down into the realm they *can* support. Either way, all the cards will stay in sync.
8. Let your Timed Loop wake & execute once every 5-10 cycles of PXI_TRIG0 (to bring it back to the present loop rate)
9. For *input* signals, use the "read recent already-buffered samples" trick illustrated over here. No more waiting for new A/D conversions, you just use the most recent ones that are already available. But because your hardware sampling is now faster than your software loop, you have more info to use to your advantage in the cases where you don't yet have the specific sample that woke up your Timed Loop
10. Since you don't consume any appreciable time on your inputs, you're more likely to get your calcs done and your output values set in time for the next output sample.
11. Oh yeah, the output tasks should use PXI_TRIG0 as a "sample clock timebase" as well, and then divide it down to the desired actual sample rate.
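Steps 7 and 11 might look roughly like this in the DAQmx C API (an untested sketch from memory; the terminal name is a placeholder and the exact property functions should be double-checked against the DAQmx reference):

```c
#include <NIDAQmx.h>

/* Sketch of steps 7/11: use the fast 4303 clock on PXI_TRIG0 as a
   *sample clock timebase* and let DAQmx divide it down to the wanted
   rate (trig0Rate / wantedRate should be an integer divisor). */
void configure_divided_clock(TaskHandle task, float64 trig0Rate, float64 wantedRate)
{
    DAQmxSetSampClkTimebaseSrc(task, "/PXI1Slot5/PXI_Trig0");
    DAQmxSetSampClkTimebaseRate(task, trig0Rate);

    /* NULL source = derive the sample clock from the timebase above */
    DAQmxCfgSampClkTiming(task, NULL, wantedRate, DAQmx_Val_Rising,
                          DAQmx_Val_HWTimedSinglePoint, 1);
}
```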
Nothing real magic about the oversampling, buffering, and retrieval of recent samples. It's just an approach that lets you:
A. acquire with deterministic sample timing and
B. retrieve samples with no possibility of needing to wait for them
At the moment, that seems like it might be a useful tradeoff.
-Kevin P
03-10-2021 02:51 AM
Thanks for your detailed answer Kevin.
You are right, I'm using realtime. Actually I'm porting code from Phar Lap ETS to Linux RT. Strangely, the decimation approach that I described in my initial post has been in production use for years under Phar Lap without problems. But now that I see it doesn't work reliably under Linux RT, I'm starting to think that maybe we were just lucky that it worked under Phar Lap, and in theory it could fail there too. We usually use 1kHz or 2kHz loop rates.
Regarding your points 4 & 5:
I don't think that DAQmx Read initiates the AD conversion. And I also think it doesn't wait for the next sample clock (that might depend on the timeout parameter as well). If DAQmx Read waited for the next sample clock and then additionally waited for the AD conversion to finish, it could never return in less than 10µs. For example, my AI tasks have 128 channels and the AI Convert Clock is set to 500kHz. According to your description the Read would take 128/500e3 = 256µs in the best case.
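That best-case figure is just channel count divided by convert clock rate; as a sanity check (numbers straight from above):

```c
/* Minimum duration of one multiplexed scan in microseconds: all channels
   share one ADC, so a full scan needs channelCount convert clock ticks. */
double scan_time_us(int channelCount, double convertClockHz)
{
    return (double)channelCount / convertClockHz * 1e6;
}
```

For 128 channels at a 500kHz convert clock this gives the 256µs mentioned above.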
My understanding is that the sample clock will trigger the AD conversion on the card, and then you can collect the results with a call to DAQmx read. If you call DAQmx read immediately after the sample clock pulse, it will have to wait for AD conversion to finish before it returns. If you call it after the AD conversion has finished (this is what I'm trying to do), but before the next sample clock appears, it should return immediately. So although they say HWTSP mode doesn't have a buffer, I think there actually is a buffer to hold a single sample for each channel.
I'm happy to be corrected about the above statement. It's a pity this kind of detailed information can't be found in NI documentation.
Regarding your points 6-11:
I'm actually doing something very similar on other PXI systems that don't use the SC Express cards. But in this case I cannot use this approach because the group delay of the SC Express cards is unacceptably high in buffered mode (well, NI calls it "group delay", but actually it's not a group delay but a dead-time). For realtime/control use cases, the SC Express cards can only be used in HWTSP mode, unfortunately.
03-12-2021 05:07 AM
I'll go with your observations about AI -- my experience with HWTSP under RT is limited and very stale. It does sound like the behavior is similar to a task with a buffer size of 1 that's set to allow overwriting.
Perhaps the 4303's can be left in HWTSP mode and the code can be structured so that they read first, *then* you use the "read recent samples" trick to read 1 single past sample from the X-series boards. That means you're looking 1 sample into the past with the X-series data, but the 4303's have *some* amount of group delay as well, so those are looking into the past a bit too.
Note that you'll probably need special handling of the very first 4303 sample clock b/c the X-series won't have finished their first round of convert clocks yet.
All this will roughly have the effect of cutting your control loop rate in half. Cycle i retrieves X-series data from cycle i-1, performs a control law calc and writes to AO. That output value gets generated on the *next* sample clock at cycle i+1. So even if you update the output at 1 kHz (for example), it's reacting to inputs from ~2 msec ago. So it acts more like a 500 Hz control loop.
The upside is that the timing should be *consistent*.
-Kevin P
03-17-2021 04:03 AM
Using the X-series cards in buffered mode for this use case is something I hadn't considered yet. Meanwhile it's no longer necessary for me to try this because I (re-) discovered a solution that works with all cards in HWTSP mode: Looking at the old Phar Lap code, I realized that I had the Start-Trigger commented out. When I remove the Start Trigger setup from the Linux version as well, it works reliably. The formula to call DAQmx read is then:
if (LoopCnt >= 2 * AIDecimation + 1 && LoopCnt % AIDecimation == AIDecimation - 1)
{
    DAQmxRead();
}
So yes, for 1kHz loop rate and AIDecimation == 2 this will effectively be a 500Hz control loop.
For my use case it doesn't matter what happens during the first few loop iterations, so I can live without the start trigger. But I won't mark this as a solution because I think it should also work with a start trigger. The start trigger delay could also play an important role here. By default it is 4 ticks of the sample clock timebase, which is a 100MHz clock when not using a decimation (so a negligible delay) but (in my case) a 1kHz clock when using a decimation.
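For concreteness, the default delay of 4 timebase ticks works out like this (simple arithmetic matching the numbers above):

```c
/* Start trigger delay in seconds: delayTicks ticks of the sample clock
   timebase. With the 100 MHz onboard timebase that's 40 ns (negligible);
   with a decimated 1 kHz timebase it becomes 4 ms, i.e. 4 full loop
   periods at a 1 kHz loop rate. */
double start_trigger_delay_s(double timebaseHz, double delayTicks)
{
    return delayTicks / timebaseHz;
}
```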
I tried all sorts of things to reflect this start trigger delay in the above formula, but it just doesn't produce reliable results (sometimes it works, sometimes not).