04-08-2020 02:32 PM
I am using LabVIEW to communicate with an Ethernet IO rack. Occasionally, I briefly lose the Ethernet connection to the IO rack, which shuts down the process I am controlling. If the rack does not receive a message over Ethernet every 100 ms, the watchdog on the rack shuts everything down.
There are only two nodes on the Ethernet cable: the IO rack and the computer running a LabVIEW program. I am using the following NI VIs: UDP Open, UDP Close, UDP Write and UDP Read. The computer runs Windows 10 and the LabVIEW version is 15. The program is a built exe and runs 24/7.
At first I opened the port, wrote and read the data, and then closed the port on each iteration. This worked, except that it would occasionally lose the connection. It would run OK for days, then there would be a shutdown.
I know there are things like Windows Update that attempt to use the Ethernet port. Is there any way I can guarantee an uninterrupted connection?
Then I tried opening the port when the program starts, and keeping it open until the program shuts down. There is a bizarre undesirable behavior from time to time that I have yet to understand.
Do you have any ideas of what I can do to solve this problem?
04-08-2020 02:41 PM
Hi Frank,
@FrankHS wrote:
This worked except that it would occasionally lose connection.
I know there are things like Windows Update that attempt to use the Ethernet port. Is there any way I can guarantee an uninterrupted connection?
There is no way to guarantee this type of communication within your 100ms watchdog timeout…
Yes, Windows can interfere by using the port for other communication, but in general you can have many communication connections open over the same Ethernet port.
@FrankHS wrote:
Then I tried opening the port when the program starts, and keeping it open until the program shuts down. There is a bizarre undesirable behavior from time to time that I have yet to understand.
What kind of "bizarre"? Do you get any errors from your LabVIEW program?
04-08-2020 03:26 PM - edited 04-08-2020 03:28 PM
Your entire problem description is insufficient to troubleshoot, and some of your comments are "bizarre" in their own right.
Who wrote the code running on each side? Are both written by you in LabVIEW or is the other device a "black box"?
Why is the watchdog so picky? Can you make the watchdog timer much longer for testing, then do a statistical analysis of the actual idle times? If both sides are LabVIEW, can you show us some code? Is there anything that potentially slows down the communication loops? What else are the programs doing?
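To illustrate the kind of statistical analysis meant here, below is a minimal Python sketch (the program in this thread is LabVIEW, so this is only a textual analogue). It assumes you have logged one timestamp in seconds per message sent or received; the function name `gap_stats` and the field names are made up for this example.

```python
from statistics import mean

def gap_stats(timestamps, watchdog_s):
    """Summarize the idle gaps between consecutive message timestamps (seconds)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        raise ValueError("need at least two timestamps")
    return {
        "count": len(gaps),
        "mean": mean(gaps),
        "max": max(gaps),
        # Number of gaps that would have tripped a watchdog of watchdog_s seconds.
        "over_watchdog": sum(1 for g in gaps if g > watchdog_s),
    }
```

Running this over a few days of logged timestamps with the watchdog relaxed would show how often, and by how much, the 100 ms limit is really exceeded.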
Maybe it's a hardware error. Have you tried a different cable? Have you blown some compressed air into the connectors to clean out dust? Did you try updating the Ethernet drivers for your hardware? Do you connect through a dedicated Ethernet port and crossover cable, or does any of this travel through shared external hardware (routers, switches, etc.)?
If this is a time-critical process, you should not be running on Windows anyway. At the very least, turn off updates, virus scans, etc.
You say you are using UDP read and write. Does that mean that the communication is two-way? Or does one side only write and the other side only read? UDP is connection-less and there is absolutely no guarantee that any particular message makes it to the other side. How big is each transmission in bytes? What is the configured sending rate? Do you use a specific local port or an ephemeral port?
04-09-2020 01:41 PM
Opening and closing the port on every iteration is not good practice.
You said that opening the port at program start and closing it at shutdown produces "bizarre undesirable behavior". I think this is where you should focus. I suspect that timeouts are not being handled properly at that point.
It would be really helpful if you could attach your code, so that a better solution can be offered!
04-10-2020 09:24 AM - edited 04-10-2020 09:26 AM
The choice to use a watchdog in combination with UDP communication is already bizarre in its own right. Making that a 100 ms watchdog would be considered bizarre even with TCP/IP.
A device that shuts down after it hasn't received a network message within 100ms? What sort of atomic nuclear control device is that?
You might be able to guarantee 100 ms repetitive messages from a real-time controlled device, but never from something like Windows or any other desktop operating system for that matter. If your device doesn't tolerate message dropouts from the remote side of at least several seconds, it is doomed to run into shutdown situations eventually. This is especially true with UDP messages, as there is no way to guarantee that a UDP message ever arrives, or even to detect that it didn't arrive, other than implementing your own application-level acknowledgement messaging on top of UDP.
Also, does your device really need to shut down completely after a timeout? Wouldn't it be smarter to return to a default operating state that allows it to continue to receive new messages and react to them?
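As a rough illustration of application-level acknowledgement on top of UDP, here is a minimal Python sketch (not LabVIEW, and not the protocol of any particular device). It assumes the remote side answers every request with one reply datagram; the function name `request_with_retry`, the 2048-byte buffer and the default timeout and retry count are arbitrary choices for this example.

```python
import socket

def request_with_retry(sock, addr, payload, timeout_s=0.05, retries=3):
    """Send payload over an already open UDP socket and wait for a reply.

    The request is resent whenever no reply arrives within timeout_s.
    Returns (reply_bytes, attempts_used); raises TimeoutError if all
    attempts go unanswered.
    """
    sock.settimeout(timeout_s)
    for attempt in range(1, retries + 1):
        sock.sendto(payload, addr)
        try:
            reply, _ = sock.recvfrom(2048)
            return reply, attempt
        except socket.timeout:
            continue  # no reply in time: resend on the next pass
    raise TimeoutError(f"no reply after {retries} attempts")
```

The socket is opened once and reused for every call. Note that timeout_s multiplied by retries has to stay below the device's watchdog time, and that a late reply to an earlier attempt could be mistaken for the current one unless the messages carry a sequence number.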
04-11-2020 12:39 PM
Thank you for the reply.
I wrote the LabVIEW code and the other device is a black box. The Ethernet cable is connected directly from the computer to the black box, an AutomationDirect Ethernet base controller. There are no routers etc.
I have changed the watchdog timer to 1 second. Do you think that one second would be a good choice for Ethernet? The process is not time critical, though it needs to shut down if there is no communication for more than 5 seconds. I would prefer to keep this down to a second or two.
The communication is two-way. I send a packet of data to the controller and it replies. The replies are verified with a checksum. The false case is not being used now but it closes and opens and closes the Ethernet port. This runs in a loop that runs at 25 ms.
I just added code that will let me know if the loop time exceeds 50 ms, and that captures all IO errors and logs them to disk, in the hope of identifying this problem.
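For readers without the block diagram, the idea looks roughly like the following Python sketch (my real code is LabVIEW; every name here, such as `run_loop` and `step`, is made up for illustration). It times each pass of the IO loop, records any pass that takes longer than 50 ms, and logs IO errors instead of letting them stop the loop.

```python
import logging
import time

log = logging.getLogger("io_loop")

def run_loop(step, iterations, period_s=0.025, overrun_s=0.050,
             clock=time.monotonic, sleep=time.sleep):
    """Call step() once per period; return a list of (iteration, seconds) overruns."""
    overruns = []
    for i in range(iterations):
        start = clock()
        try:
            step()  # one write/read exchange with the IO rack
        except OSError as exc:
            log.error("iteration %d: IO error: %s", i, exc)
        elapsed = clock() - start
        if elapsed > overrun_s:
            log.warning("iteration %d took %.1f ms", i, elapsed * 1000.0)
            overruns.append((i, elapsed))
        # Sleep for whatever is left of the nominal period, if anything.
        sleep(max(0.0, period_s - elapsed))
    return overruns
```

Attaching a `logging.FileHandler` to the `io_loop` logger sends these records to disk.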
When I said that there was bizarre behavior, I meant that with the limited information available, I was unable to explain it or even describe it well. This only happened when the port was opened at program start and closed at exit. The cause was that, when an error occurred, a reset function closed and reopened the port. The reset function added a 1 second delay, which caused the watchdog to trip. This has been removed.
The program usually runs for a few days without issues. But every few days there is a shutdown.
04-11-2020 01:10 PM
Thanks for the details. We don't work well with pictures, though. Too many ambiguities. An actual VI would be significantly more helpful.
04-11-2020 01:39 PM
I don't actually need the 100 ms watchdog. I just changed it to 1 second. What do you think is the minimum that I can set the watchdog to when using Ethernet with nothing on the cable except the controlling computer and the IO rack? The communication uses a CRC checksum to guarantee that the data is correct. See my last message for the block diagram and more info.
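To show what I mean by the CRC check, here is a small Python sketch. It is only an illustration: it assumes a CRC-32 appended as four big-endian bytes, which is not necessarily the format the controller uses, and the names `add_crc` and `check_crc` are made up.

```python
import struct
import zlib

def add_crc(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as 4 big-endian bytes."""
    return payload + struct.pack(">I", zlib.crc32(payload))

def check_crc(frame: bytes) -> bytes:
    """Return the payload if the trailing CRC-32 matches, else raise ValueError."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a CRC")
    payload = frame[:-4]
    (received,) = struct.unpack(">I", frame[-4:])
    if zlib.crc32(payload) != received:
        raise ValueError("CRC mismatch")
    return payload
```

A check like this catches a corrupted reply, but it says nothing about a packet that never arrives, so the watchdog question still stands.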
The device doesn't really shut down. The IO resets and continues. If there is an IO error that can't be corrected by retrying, the process is stopped until an operator resets it.
04-12-2020 06:35 AM
Do you know of any rule of thumb for how many seconds the watchdog setting should be?
04-12-2020 08:57 AM
@FrankHS wrote:
Do you know of any rule of thumb for how many seconds the watchdog setting should be?
As with most things, the answer is: it depends!
1s is usually enough to let a Windows process respond in time but there is absolutely no guarantee. Windows is not a real-time operating system and therefore you can make absolutely no guarantees about how long a process might get suspended because Windows decides to start a virus scan or something, or the network communication might get suspended because Windows decides the network stack or network interface needs to be reset.
The solution is usually not to depend on the Windows application to never fail to send a message within a specific time, but to configure/program the device to not fall into an unrecoverable state after it detected that there was a certain amount of inactivity. Your device should at most change critical outputs into a failsafe state and then return to an idle state that the remote application can detect, in order to reconfigure the device and return to full operation if that is safe to do.
So you may have to modify your Windows program to periodically query whether the device has gone into failsafe shutdown, and then reconfigure it to return to normal operation. That is robust programming, as opposed to trying to ensure that the system never fails to send a message to the device within a certain amount of time. Of course, this all might be different if your device controls something that might cause physical harm to a person, or if it could cause your factory to blow up. Then other standards apply, and you should definitely consult an expert in these matters to look at your system and probably seek certification of it.
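A minimal sketch of that polling logic, in Python rather than LabVIEW, could look like this. The `device` object with its `read_state()` and `reconfigure()` methods is purely hypothetical, as is the `safe_to_resume` callback; a real device will expose this differently, if at all.

```python
from enum import Enum

class DeviceState(Enum):
    RUNNING = "running"
    FAILSAFE = "failsafe"

def supervise_once(device, safe_to_resume):
    """One polling step: recover from failsafe only when it is safe to do so.

    Returns "ok" if the device is running, "recovered" if it was in
    failsafe and has been reconfigured, or "held" if it stays in failsafe.
    """
    if device.read_state() is DeviceState.FAILSAFE:
        if safe_to_resume():
            device.reconfigure()  # restore outputs and watchdog settings
            return "recovered"
        return "held"
    return "ok"
```

Calling `supervise_once` from the main communication loop turns a missed message into a short, recoverable interruption instead of a process shutdown.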