How can a Real-Time Host detect if Target has died?

Bob_Schor · ‎05-18-2022

I'm developing Control software for an Instrument using LabVIEW Real-Time in the "conventional" way -- the Host handles the User Interface and File I/O, the RT-Target does all of the DAQ stuff and timing (I'm using a myRIO and taking advantage of the FPGA for precise timing and "flexible" device implementation).

One question we worried about was "What if the RT code got hung up while the Instrument was 'active' and failed to say 'OK, now stop'?" We implemented a Watch Dog Timer -- the RT (LabVIEW) code toggles a DIO line in the FPGA at 2 Hz (on 250 ms, off 250 ms), and a simple circuit that my Engineering colleagues built kills the power to the Instrument if it doesn't "see" any transitions (I understood this at the time ...).

But we were running a demo today, everything was going well, and someone asked "What happens if we turn off the Power to the box containing the myRIO and the Target side?" So we "did the experiment". The PC (running the Host) suddenly sprouted a Pop-Up (Modal Dialog Box?) saying something like "The myRIO has lost power" -- the only way to get rid of this was to close the window (the X in the upper right corner). Meanwhile, the Host just sat there, probably waiting for a Message from its Target saying "I'm shutting down now, please exit ..." (which is part of the "planned shutdown sequence" initiated by the Target, for example, when it encounters a Fatal Error and needs to close itself and try to close its Host).

One thing I could do (which I just dreamed up while writing this Message) would be to have a Host/Target "Watch Dog" -- a pair of Network Streams originating in the Host, perhaps once a second sending "elapsed time in seconds" to the Target, which echoes it back to the Host. The traffic would be very light, so maybe even if the Target was "busy" sending "Data I've Collected" back to the Host as fast as Network Streams could push it, we could still "sneak" an "echo" response from the Host/Target Watch Dog.
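For readers who want the Host-side bookkeeping of such an echo Watch Dog spelled out, here is a minimal sketch in Python (the real thing would be LabVIEW Network Streams, which this does not model). The class name `HeartbeatMonitor`, the 3-second timeout, and the injectable `clock` are all assumptions made for illustration, not anything from the project.

```python
import time


class HeartbeatMonitor:
    """Host-side bookkeeping for a ping/echo watchdog (hypothetical helper)."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s          # assumed: silence longer than this means "dead"
        self._clock = clock                 # injectable so the logic can be tested without waiting
        self._start = clock()
        self._last_echo_time = self._start
        self._last_sent = None

    def make_ping(self):
        """Return the payload to send: whole seconds elapsed since start."""
        self._last_sent = int(self._clock() - self._start)
        return self._last_sent

    def on_echo(self, payload):
        """Call when the Target echoes a payload back; stale echoes are ignored."""
        if payload == self._last_sent:
            self._last_echo_time = self._clock()

    def target_alive(self):
        """True while the most recent valid echo is newer than the timeout."""
        return (self._clock() - self._last_echo_time) <= self.timeout_s
```

The Host loop would call `make_ping()` about once a second, pass whatever comes back to `on_echo()`, and start its own shutdown when `target_alive()` turns false. A timeout of a few ping periods tolerates an echo that is delayed behind bulk data.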

The other thing, however, that occurred to me was to understand (!?!) and capture the message that pops up on the Host when I kill the power to the Target. Hmm, let me see if I can recreate this situation, and describe the result better: So I start the myRIO, then pull its power cord. This is what pops up:

And when I plug it back in, I get another Dialog box with a title "myRIO USB Monitor" that gives me a choice of options (from "Launch the Getting Started Wizard" to "Do Nothing"). Needless to say, I'd also like to understand where this comes from (and make it stop popping up, already!).

Can anyone shed some light on these behaviors, and whether or not I can take advantage of (or, perhaps a better phrase is "pervert") their behavior?

Bob Schor

Rodney314 · ‎05-18-2022

Well, I'm not an expert on any of that, but my (hopefully relevant) ideas & guesses are:

1. The myRIO devices seem to have an associated NI Device Monitor-like daemon, "ni_usbmon.exe"

https://forums.ni.com/t5/Academic-Hardware-Products-myDAQ/myRio-NI-usb-monitor/td-p/3079925

2. That service probably watches the Windows OS hardware resources for USB devices being connected or disconnected. It might use the WinUSB.dll? But it just needs to check if the connected/disconnected device's USB vendor id and product id match.

3. Besides the Network Streams you mentioned, perhaps you could disable ni_usbmon.exe and create your own method for checking USB connections/disconnections? Here's an idea:

DevCon : check links in last post of: https://forums.ni.com/t5/LabVIEW/Check-USB-Device-Connected-to-a-computer/td-p/3036347

But I did see that scanning the registry is also an option (https://forums.ni.com/t5/Example-Code/Check-if-a-USB-Device-is-Connected-to-the-System-using-Windows...) - but I guess this likely wouldn't be very fast,

and there's also an older (2017) C# library that may be useful (https://sourceforge.net/projects/libusbdotnet/).
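Whichever enumeration method is used, the vendor id / product id check from point 2 comes down to matching a Windows hardware-ID string of the form `USB\VID_xxxx&PID_xxxx`. A small Python sketch of just that matching step follows; the function names and the id values in the usage note are made up for illustration and are not the real myRIO ids.

```python
import re

# Windows USB hardware IDs embed the ids as four hex digits each.
_HWID_PATTERN = re.compile(r"VID_([0-9A-Fa-f]{4})&PID_([0-9A-Fa-f]{4})")


def parse_vid_pid(hardware_id):
    """Return (vid, pid) as integers, or None if the string has no USB ids."""
    match = _HWID_PATTERN.search(hardware_id)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def is_target_device(hardware_id, vid, pid):
    """True if the hardware ID belongs to the device with the given ids."""
    return parse_vid_pid(hardware_id) == (vid, pid)
```

For example, `is_target_device(r"USB\VID_1234&PID_ABCD\5&2F", 0x1234, 0xABCD)` is true for the placeholder ids shown. Polling the device list and comparing against the previous result would then give connect/disconnect events.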

Rodney

rolfk · ‎05-19-2022

Hmmm, you know that cRIOs can get corrupted if the power is cut off at the wrong moment? It happens rarely but regularly enough with some customers we have that it was not an option for their production environment. Their systems all run now behind a small dedicated UPS for the E cabinet they are situated in. My suspicion would be that the myRIO may be even more susceptible to such problems. It's the same hardware design but specifically not tested nor designed for industrial applications.

Our watchdog is actually located in the FPGA fabric. The FPGA has a way of resetting the entire chassis without having to cut off power. And I have yet to see a situation where that FPGA is hanging itself so that its operation just stops out of the blue, unless you initiate such a stop from the real-time controller side by stopping, unloading or otherwise disabling the FPGA bitfile. A simple register/front panel control needs to be toggled regularly from the RT side to avoid the watchdog from pulling the reset line. And of course there is another register that can be used to inhibit that watchdog altogether, you do not want to pull the reset while you are busy setting up things to make everything work. And you can even tie in an extra external reset from a digital line that allows the FPGA to be reset by the operator. Some customers feel very empowered to know they have that capability. 😀
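The watchdog logic described above (a register toggled from the RT side, an inhibit register, and an external reset input) can be modelled in a few lines. This is only a software model in Python to make the behaviour concrete, not FPGA code and not the actual implementation; `WatchdogModel`, `timeout_ticks`, and the choice to start inhibited are assumptions.

```python
class WatchdogModel:
    """Software model of a toggle-kicked watchdog (illustrative only)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.inhibit = True            # assumed: start inhibited so setup cannot trip it
        self.reset_asserted = False
        self._last_kick = None         # last observed value of the toggled register
        self._ticks_since_change = 0

    def tick(self, kick_register, external_reset=False):
        """Advance one clock tick; return True once the reset line is asserted."""
        if external_reset:
            # Operator-driven reset from a digital line overrides everything.
            self.reset_asserted = True
            return True
        if self.inhibit:
            # While inhibited, keep tracking the register but never count down.
            self._last_kick = kick_register
            self._ticks_since_change = 0
            return self.reset_asserted
        if kick_register != self._last_kick:
            # The RT side toggled the register: restart the countdown.
            self._last_kick = kick_register
            self._ticks_since_change = 0
        else:
            self._ticks_since_change += 1
            if self._ticks_since_change >= self.timeout_ticks:
                self.reset_asserted = True
        return self.reset_asserted
```

Watching for a change in the register, rather than a particular level, means an RT application that hangs with the register stuck high is caught just like one stuck low.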

Unexpected power loss should be avoided at all costs if you don't want to have to unbrick the system occasionally. So far it was always possible to do that ourselves by taking the unit out of the system and doing a complete reformat of it in NI-MAX, occasionally with a small standalone tool from NI, but I have heard of incidents where the unit had to be RMAed to NI for them to restore the firmware image before it wanted to work again.

As to your problem, how do you communicate with the myRIO application? Using standard network shared variables or streams? We usually do it all ourselves with basic TCP communication. That avoids an opaque software layer whose behaviour you can't really control. Most NI software returns errors when it loses a connection, and maybe that dialog is actually your own doing by being a little bit too zealous and unforgiving about such an API returning an error. Usually once those APIs have lost the connection, you need to close the session and reopen it for the connection to work again. Higher level APIs sometimes have no proper error indications; the original DataSocket implementation was quite notorious in that respect. If it worked it worked, and if it didn't you had a really tough time.

But most NI APIs do have such error handling nowadays and you need to use it properly. Your problem of a nasty dialog popping up in the middle of the operation and not being able to get rid of it is certainly something I know from my own network protocol handlers during development time, when being a little bit too uptight about error handling and not allowing the control loop to exit. The quick and dirty fix in those cases is to allow at least a two button dialog with the choice of "Retry" and "Abort". But that "Retry" typically should consist of a disconnect and reconnect of the resource in question, or you risk looping on the same error over and over again.

If you use a high level API that does that dialog handling itself you are probably out of luck if your control loop can't detect the error somehow and then simply quit.
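The "Retry means disconnect and reconnect" point above can be sketched with plain TCP sockets. This is a Python illustration under assumptions, not the code used in those systems: `ReconnectingClient`, the `retries` parameter, and the injectable `connect` function are all hypothetical.

```python
import socket


class ReconnectingClient:
    """TCP sender that drops and reopens the session after any socket error."""

    def __init__(self, host, port, timeout_s=2.0, connect=socket.create_connection):
        self._addr = (host, port)
        self._timeout_s = timeout_s
        self._connect = connect        # injectable for testing without a network
        self._sock = None

    def _ensure_connected(self):
        if self._sock is None:
            self._sock = self._connect(self._addr, timeout=self._timeout_s)

    def close(self):
        """Close the current session, if any, and forget it."""
        if self._sock is not None:
            try:
                self._sock.close()
            finally:
                self._sock = None

    def send(self, data, retries=1):
        """Send bytes; on error, close, reopen, and retry up to `retries` times."""
        attempt = 0
        while True:
            try:
                self._ensure_connected()
                self._sock.sendall(data)
                return
            except OSError:
                self.close()           # the old session is unusable, so drop it
                if attempt >= retries:
                    raise              # caller decides: abort, or ask the user
                attempt += 1
```

Because the error propagates once the retries are used up, the control loop stays in charge and can exit cleanly or show its own "Retry"/"Abort" choice.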

Rolf Kalbermatter My Blog

DEMO, Electronic and Mechanical Support department, room 36.LB00.390

wiebe@CARYA · ‎05-19-2022

@Bob_Schor wrote:
The other thing, however, that occurred to me was to understand (!?!) and capture the message that pops up on the Host when I kill the power to the Target. Hmm, let me see if I can recreate this situation, and describe the result better: So I start the myRIO, then pull its power cord. This is what pops up:

I think that's the LabVIEW development environment, that monitors the connection.

If you make an exe, that dialog won't be a problem.

You can also decouple your cRIO development and Host development. Then make the cRIO code run at startup, and the host project won't have anything to do with the cRIO (except that your code communicates with it).

There might also be a right click option in the cRIO target to disconnect. Then monitoring should also stop.

It's been a while, and the dialog is probably just a minor distraction.

Search LabVIEW like a graph!

Bob_Schor · ‎05-19-2022

Many thanks for the thoughtful and speedy replies from Rodney, Rolf, and Wiebe. Will do more investigating and incorporating.

Bob Schor

JÞB · ‎05-19-2022

I'll take a swing at the secondary question about what causes pop-ups on device reconnections.

There USED TO BE a document in the example code section that appears not to have migrated well. You may need to go to the Wayback Machine.

Referred to here

If NI Devmon has not dramatically changed, the Registry Hive contains Keys that launch applications based on device type VID UID values.

Some uber searching will show examples of how I used those registry HKEYS in past projects (try searching hardware boards for posts by me containing devmon)

Of course I don't recall all of the details! That's why I have a devmon tag!

"Should be" isn't "Is" -Jay

paul.r.r · ‎05-19-2022

@rolfk wrote:

Hmmm, you know that cRIOs can get corrupted if the power is cut off at the wrong moment? It happens rarely but regularly enough with some customers we have that it was not an option for their production environment. Their systems all run now behind a small dedicated UPS for the E cabinet they are situated in. My suspicion would be that the myRIO may be even more susceptible to such problems. It's the same hardware design but specifically not tested nor designed for industrial applications.

Our watchdog is actually located in the FPGA fabric. The FPGA has a way of resetting the entire chassis without having to cut off power. And I have yet to see a situation where that FPGA is hanging itself so that its operation just stops out of the blue, unless you initiate such a stop from the real-time controller side by stopping, unloading or otherwise disabling the FPGA bitfile. A simple register/front panel control needs to be toggled regularly from the RT side to avoid the watchdog from pulling the reset line. And of course there is another register that can be used to inhibit that watchdog altogether, you do not want to pull the reset while you are busy setting up things to make everything work. And you can even tie in an extra external reset from a digital line that allows the FPGA to be reset by the operator. Some customers feel very empowered to know they have that capability. 😀

Unexpected power loss should be avoided at all costs if you don't want to have to unbrick the system occasionally. So far it was always possible to do that ourselves by taking the unit out of the system and doing a complete reformat of it in NI-MAX, occasionally with a small standalone tool from NI, but I have heard of incidents where the unit had to be RMAed to NI for them to restore the firmware image before it wanted to work again.

As to your problem, how do you communicate with the myRIO application? Using standard network shared variables or streams? We usually do it all ourselves with basic TCP communication. That avoids an opaque software layer whose behaviour you can't really control. Most NI software returns errors when it loses a connection, and maybe that dialog is actually your own doing by being a little bit too zealous and unforgiving about such an API returning an error. Usually once those APIs have lost the connection, you need to close the session and reopen it for the connection to work again. Higher level APIs sometimes have no proper error indications; the original DataSocket implementation was quite notorious in that respect. If it worked it worked, and if it didn't you had a really tough time.

But most NI APIs do have such error handling nowadays and you need to use it properly. Your problem of a nasty dialog popping up in the middle of the operation and not being able to get rid of it is certainly something I know from my own network protocol handlers during development time, when being a little bit too uptight about error handling and not allowing the control loop to exit. The quick and dirty fix in those cases is to allow at least a two button dialog with the choice of "Retry" and "Abort". But that "Retry" typically should consist of a disconnect and reconnect of the resource in question, or you risk looping on the same error over and over again.

If you use a high level API that does that dialog handling itself you are probably out of luck if your control loop can't detect the error somehow and then simply quit.

What is the proper method of shutting down a RIO device? Maybe I'm not looking in the right spots, but I haven't seen much official guidance beyond simply pulling the power.

Regarding the question, we've got an open source networking API with injectable callbacks for connection state changes. You start the session, and a background worker attempts to make a connection - when one is made, the callback object you injected on startup is called. Similarly, when a connection is lost, the injected callback object is called again. I did some major refactoring of the code recently and it might be a little bit of a mess, but feel free to check it out or use as is.

https://bitbucket.org/composedsystems/composed-network-interface/src/master/
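To show the general shape of "injectable callbacks for connection state changes", here is a small Python sketch. It is not the API of the linked library (which is LabVIEW); `ConnectionWatcher`, `probe`, and the callback names are assumptions chosen for illustration.

```python
import threading


class ConnectionWatcher:
    """Background worker that reports connection state changes via callbacks."""

    def __init__(self, probe, on_connected, on_disconnected, interval_s=1.0):
        self._probe = probe                      # callable: truthy when the link is up
        self._on_connected = on_connected
        self._on_disconnected = on_disconnected
        self._interval_s = interval_s
        self._connected = False
        self._stop = threading.Event()
        self._thread = None

    def poll_once(self):
        """Probe the link once and fire a callback only when the state changes."""
        try:
            up = bool(self._probe())
        except OSError:
            up = False                           # treat a failed probe as "link down"
        if up and not self._connected:
            self._connected = True
            self._on_connected()
        elif not up and self._connected:
            self._connected = False
            self._on_disconnected()

    def start(self):
        """Run poll_once() repeatedly on a daemon thread until stop() is called."""
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval_s)

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

On the Host, the `on_disconnected` callback is where a "Target has died" shutdown would be triggered, without the main loop having to poll for it.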
