04-03-2023 04:14 AM
Hi all,
I've been using a sequencing framework based on the QMH architecture to drive various automated test setups (i.e. using control hardware). Among other things, the framework separates the sequencing of test steps from the actual test logic, and (theoretically) supports any number of sequencer instances sharing any number of test logic modules (or just "modules").
In practice, both sequencers and modules are controlled by their respective message queues. I'm using a notifier to keep the sequencer from running away from the module: once the sequencer has sent the current test step to the module's queue, it waits on that module's notifier to be set, which optionally carries test outcome data as well.
I cannot share the actual construct for various reasons, so I've cobbled together a functional analogue in this snippet (yes, I am aware that I'm not cleaning up the references as I should; the actual block diagrams do so):
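Since the snippet is an image, here's a rough textual stand-in for the pattern (Python, all names made up). Note that a Python `Event` latches a `set()` that happens before the `wait()`, so this analogue happens to be immune to the race described further down:

```python
import queue
import threading

step_q = queue.Queue()             # the module's message queue
hold_sequencer = threading.Event() # stands in for the HoldSequencer notifier
outcome = {}                       # optional test-outcome data

def module_loop():
    while True:
        step = step_q.get()        # dequeue the module command
        if step is None:           # shutdown sentinel
            break
        # ... pre-processing and the actual test step would run here ...
        outcome[step] = "pass"
        hold_sequencer.set()       # "send notification" to the sequencer

mod = threading.Thread(target=module_loop)
mod.start()

for step in ("step1", "step2"):
    hold_sequencer.clear()         # arm before enqueueing
    step_q.put(step)               # sequencer sends the current test step
    hold_sequencer.wait()          # sequencer holds until the module is done

step_q.put(None)
mod.join()
print(outcome)                     # → {'step1': 'pass', 'step2': 'pass'}
```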
Now, this has worked great, from operations where a single test cycle takes close to half a minute to applications where the same cycle is performed every 20ms. In some cases, the control program is distributed as a built executable, in other cases it runs in the development environment.
Now, various details of the implementation are constantly evolving, and all of a sudden I've had test units (running as built executables) get stuck in their sequence. I've figured out that I've fallen victim to a race condition I'm unsure how to resolve. Additionally, it only seems to be a problem when using built executables, so far at least:
In some cases, for whatever reason, the module command sent down the module message queue is dequeued, goes through its pre-processing step, and is performed, AND the HoldSequencer notifier is sent, all before execution in the sequencer itself reaches the Wait on Notification function.
Clearly a race condition.
As a result, the notifier is missed and the sequencer is stuck waiting for it. This usually occurs hours into the test unit's operation, and occurs more frequently when two instances of the built executable are running.
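This lost-wakeup pattern can be reproduced in a textual language. The sketch below (Python, names hypothetical) uses `threading.Condition`, whose `notify()` is simply discarded when no thread is waiting yet, much like the missed notification here:

```python
import threading

cond = threading.Condition()  # plays the notifier; notify() has no memory

# The "module" finishes and notifies before anyone is waiting,
# i.e. the fast module loop beats the sequencer to the wait.
with cond:
    cond.notify()             # lost: no thread is waiting yet

result = []

def sequencer_wait():
    with cond:
        # Condition.wait(timeout) returns False if it timed out
        result.append(cond.wait(timeout=0.2))

t = threading.Thread(target=sequencer_wait)
t.start()
t.join()
print(result[0])  # → False (the wakeup was lost; without the timeout,
                  #   the "sequencer" would hang, as described above)
```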
I'd hoped that placing the Enqueue Message and Wait on Notification blocks in a flat sequence frame would mitigate the issue (see snippet below), since that would at least allow the initial condition for both blocks to be fulfilled simultaneously. Surely, I thought, the program would not advance through the Enqueue Message block and do everything in the module loop, including setting the notifier, before starting the Wait on Notification block, but to no avail.
As a desperate, exploratory measure, I've included a wait function that puts a 10ms pause on the error wire leading into the Enqueue Message block. This does prevent the race condition from occurring, but it wouldn't be compatible with the application where the sequence command stack executes every 20ms; waiting 10ms for every step would defeat the purpose. Still, here's the analogue snippet.
Now, on the current branch where I'm developing something, I've had to include an administrative notifier sent by the sequencer. It is the last thing the block diagram does before the error wire is split to its two recipients, but sending that notifier isn't "slow" enough to mitigate the issue. Further testing has shown that even a 1ms wait isn't enough. I've included that snippet for good measure as well.
So at this point I'm a bit at a loss. The affected test systems wouldn't suffer much from a 10ms delay before each step is executed, so I do have a functioning workaround for the time being, but I need to solve this issue sustainably. Does anybody have a suggestion to either a) ensure the Wait on Notification block is actually waiting before the command is enqueued, or b) hold up the sequencer some other way altogether?
Any input leading towards a solution would be greatly appreciated.
Tom
04-03-2023 05:45 AM
Tom, why are you using notifiers when you clearly need "lossless" data communication? Notifiers are inherently lossy. Replace them with queues.
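The difference is easy to see in a textual stand-in (Python, names hypothetical): a reply queue buffers the "done" message, so even if the module finishes before the sequencer starts its dequeue, nothing is lost:

```python
import queue
import threading

cmd_q = queue.Queue()    # the module's command queue
done_q = queue.Queue()   # lossless replacement for the HoldSequencer notifier

def module_loop():
    while True:
        step = cmd_q.get()
        if step is None:            # shutdown sentinel
            break
        # ... actual test logic would run here ...
        done_q.put(("done", step))  # buffered, so it can never be missed

mod = threading.Thread(target=module_loop)
mod.start()

for step in ("init", "measure", "teardown"):
    cmd_q.put(step)
    status, echoed = done_q.get()   # blocks until the module replies
    assert status == "done" and echoed == step

cmd_q.put(None)
mod.join()
print("sequence completed")         # → sequence completed
```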
04-03-2023 06:03 AM - edited 04-03-2023 06:17 AM
Yeah I was afraid that was going to be one of the answers.
The original choice was based on the idea that n sequencer instances could access m module instances. The N-to-1 nature of queues seemed to contradict that; the N-to-M nature of notifiers seemed more appropriate. In retrospect, I suppose a dedicated queue acting like a semaphore (or just using semaphores, period), where the reference is passed along with the module command, would've made way more sense.
E: And I seem to recall the LabVIEW Core 3 course using notifiers in at least a similar fashion.
04-03-2023 06:59 AM
@Kabatom wrote:
The original choice was based on the idea that n sequencer instances could access m module instances. The N-to-1 nature of queues seemed to contradict that; the N-to-M nature of notifiers seemed more appropriate. In retrospect, I suppose a dedicated queue acting like a semaphore (or just using semaphores, period), where the reference is passed along with the module command, would've made way more sense.
Sounds like you should be using user events, then. Have a good look at the Delacor QMH (DQMH); that is how they implemented it.
04-03-2023 07:45 AM
Aren't user events N to 1 as well, though?
04-03-2023 09:11 AM - edited 04-03-2023 09:12 AM
@Kabatom wrote:
Aren't user events N to 1 as well, though?
No, they can be N to N. You can have multiple Event Structures register for the same event (don't share the event registration, just the event reference). The attached example VI was last saved in 2016.
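For anyone more at home in a text language, a per-registration mailbox sketches the same N-to-N behavior (Python, class and variable names made up):

```python
import queue

class UserEvent:
    """Loose analogue of a LabVIEW user event: every registration gets
    its own mailbox, so one event can fan out to N event structures."""
    def __init__(self):
        self._registrations = []

    def register(self):              # ~ Register For Events
        mailbox = queue.Queue()
        self._registrations.append(mailbox)
        return mailbox

    def generate(self, payload):     # ~ Generate User Event
        for mailbox in self._registrations:
            mailbox.put(payload)     # each registration gets its own copy

evt = UserEvent()
reg_a = evt.register()   # two separate registrations sharing only
reg_b = evt.register()   # the event reference, as described above
evt.generate("stop")
a, b = reg_a.get(), reg_b.get()
print(a, b)              # → stop stop
```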
04-03-2023 09:59 AM
@crossrulz wrote:
@Kabatom wrote:
Aren't user events N to 1 as well, though?
No, they can be N to N. You can have multiple Event Structures register for the same event (don't share the event registration, just the event reference). The attached example VI was last saved in 2016.
The only real caveat with registering multiple event structures for the same event comes in when you have a filter event and discard it in some structures. The event will be discarded by the first event structure that sees it, in event registration order, which can be difficult to determine. Essentially, the first event structure that starts execution has the first chance to execute and discard a filter event. With LabVIEW's native parallel execution this cannot always be determined by visual inspection; the Event Inspector window is needed to assist in debugging those edge-case problems.
04-04-2023 02:30 AM
Ah, that was the mistake I made way back when I tried using Events for something else but ended up sharing the event registration between multiple event structures!
Either way, due to the effort associated with refactoring, going the queue-or-semaphore route is likely the most efficient option, even if not necessarily the best. Much obliged.
04-04-2023 06:50 AM
@Kabatom wrote:
Ah, that was the mistake I made way back when I tried using Events for something else but ended up sharing the event registration between multiple event structures!
Either way, due to the effort associated with refactoring, going the queue-or-semaphore route is likely the most efficient option, even if not necessarily the best. Much obliged.
I could argue that a QMH should always have a queue specific to the consumer loop, since each consumer should be handling a limited scope of work. In some cases where I use dynamic events, the several event structures simply coordinate the multiple queued consumer loops in response to system-wide events like stop, start cycle, temperature steady, etc...
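In a textual stand-in, that combination might look like this (Python; consumer names are invented for illustration): each consumer loop keeps its own command queue, and a broadcast "system event" is fanned out by translating it into a message for every consumer.

```python
import queue

# Consumer-specific command queues, one per QMH consumer loop
oven_q = queue.Queue()
dmm_q = queue.Queue()

def broadcast(system_event):
    # Stands in for several event structures each reacting to the same
    # user event and translating it into a message for "their" consumer
    for q in (oven_q, dmm_q):
        q.put(("system", system_event))

broadcast("stop")
m1, m2 = oven_q.get(), dmm_q.get()
print(m1, m2)   # → ('system', 'stop') ('system', 'stop')
```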
04-04-2023 08:38 AM
Fixing bad code with more bad code doesn't seem very maintainable. You'd better be sure to ultra-document this deviation from LabVIEW best practices so that whoever picks up the code next will understand what's going on.