Note: The reference example uses the PID toolkit. If your control algorithm is not a PID loop and you do not have the PID toolkit, you can still use this example: ignore the warning messages about loading the PID VI, then delete the PID VI and replace it with your own control algorithm.
When controlling dangerous or critical machinery, it is necessary to implement fail-safes to ensure that the machine operates safely even when elements of the control hardware or software fail. The NI RIO architecture is ideal from this standpoint because most I/O is channeled through the FPGA, which is also the most reliable component of the system. By defining a safe state for all control outputs within the FPGA itself, you can create a control system with a high degree of immunity from hardware or software problems in the HMI, Real-Time controller, or input modules. To maintain all outputs at a safe state, the only requirements are that the FPGA itself and any output modules are functioning.
Note: This design pattern demonstrates sound practices for developing critical applications on CompactRIO. It does not represent any specific certification or guarantee by National Instruments. The actual levels of fault-tolerance or safety in a system must be validated for the system as a whole, of which fault-tolerant software design is one piece. System integrators are responsible for defining the safety requirements and desired system behaviors under fault conditions, and validating those requirements for their specific application.
The reference example was designed for CompactRIO, but with minor modifications it applies to PXI RIO or any other FPGA-based control application.
The FPGA should implement a simple state machine in every loop that produces a critical output. At a minimum, the state machine should have a primary safe state and a state for normal operation. The reference example uses a single safe state to respond to all failures. Multiple safe states that respond differently to different failures are also possible, but you should still define a primary safe state that represents the most basic operation. The primary safe state should be the default state of the state machine, so that the system boots into a safe state.
All safe states should define a safe value or algorithm for each output. The example uses a simple static value for each output, but you can define more complex algorithms, such as ramping down an output, by using shift registers or memory to store the current output value. In the primary safe state, you should not rely on inputs from other modules or from the real-time controller. Other safe states can use inputs as long as those inputs are verified to be functioning correctly.
In each iteration of your output loop, you should check all possible failure conditions. If any have occurred, transition the state machine to a safe state in the next iteration.
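The state machine described above is implemented graphically in LabVIEW FPGA, so it cannot be shown as text directly. The following Python sketch is only a model of the behavior, not the reference example's code. The names OutputLoop, SAFE_OUTPUT, and ramp_step are hypothetical, the automatic return to normal operation once all checks pass is an assumption, and the ramp variant assumes the output never needs to go below the safe value.

```python
from enum import Enum


class State(Enum):
    SAFE = 0    # primary safe state, also the default at boot
    NORMAL = 1  # normal operation


SAFE_OUTPUT = 0.0  # hypothetical static safe value for one output


class OutputLoop:
    """Model of one critical output loop as a two-state machine."""

    def __init__(self, ramp_step=None):
        self.state = State.SAFE          # the system boots into the safe state
        self.last_output = SAFE_OUTPUT   # plays the role of a shift register
        self.ramp_step = ramp_step       # None means "use the static safe value"

    def step(self, all_conditions_ok, control_output):
        """Run one iteration and return the value written to the output.

        The output depends on the state held at the start of the iteration.
        The failure check only selects the state for the next iteration.
        """
        if self.state is State.NORMAL:
            output = control_output
        elif self.ramp_step is None:
            # Static safe value: no dependence on other modules or on RT.
            output = SAFE_OUTPUT
        else:
            # Ramp the stored output down toward the safe value.
            output = max(self.last_output - self.ramp_step, SAFE_OUTPUT)
        self.last_output = output
        # Assumption: return to NORMAL as soon as every check passes again.
        self.state = State.NORMAL if all_conditions_ok else State.SAFE
        return output
```

In this model a failure detected in one iteration takes effect on the output in the following iteration, which matches the "next iteration" transition described above. A real system may instead require an explicit operator reset before leaving the safe state.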
The reference example defines four failure conditions:
RT Safe - indicates that the real-time system is ready. The reference example uses this input only for indicating that the RT system has booted and is ready to execute. However, this input is also useful for responding to critical software errors in the RT system.
Emergency Safe - reflects an external digital input that represents an emergency shut-off switch or other external failure detection mechanism. The input loop latches this value so that an emergency stop cannot be missed. In the reference example, this failure condition is also used to communicate an error in the digital module that reads the emergency stop, so the default behavior is an emergency stop whenever the input cannot be read.
Watchdog Safe - monitors the Real-Time system via the RT Watchdog loop. If the RT controller program fails to communicate with the watchdog loop for longer than the RT Timeout period, this failure condition is triggered.
Control Inputs Valid - monitors the health of the inputs to the control algorithm. The reference example triggers this failure condition if the input module reports an error. For example, it triggers if the module is removed from the chassis during run-time. You can also perform a valid range check on each input instead of, or in addition to, the error check.
You can define additional failure conditions as necessary.
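As a rough illustration of how the four conditions listed above might be combined, the following Python sketch models the latching emergency input, an input validity check with an optional range check, and the final "all conditions OK" decision. It is a sketch under assumptions, not the reference example's code: the names Conditions, EmergencyLatch, and inputs_valid and the range limits are hypothetical, and because the text does not describe how the latch is cleared, no reset is modeled.

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    """The four flags from the list above; True means 'safe' or 'valid'."""
    rt_safe: bool
    emergency_safe: bool
    watchdog_safe: bool
    control_inputs_valid: bool

    def all_ok(self):
        # Any single False value should send the output loop to a safe state.
        return (self.rt_safe and self.emergency_safe
                and self.watchdog_safe and self.control_inputs_valid)


class EmergencyLatch:
    """Latches an emergency stop so that a short pulse is not missed."""

    def __init__(self):
        self.tripped = False

    def update(self, stop_requested, module_error):
        # A read error on the digital module is treated as an emergency stop.
        if stop_requested or module_error:
            self.tripped = True
        return not self.tripped  # value of the "Emergency Safe" flag


def inputs_valid(value, module_error, low, high):
    """Module error check combined with a valid range check on one input."""
    return (not module_error) and (low <= value <= high)
```

Additional failure conditions can be modeled by adding fields to Conditions and including them in all_ok.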
The reference example includes multiple watchdogs. The FPGA monitors the Real-Time system as a whole via the RT Watchdog Loop. The Real-Time control loop in turn monitors every other RT loop using a watchdog, and it services the FPGA watchdog only if all loops have responded within the defined period. This ensures that the system is placed into a safe state if any loop on the Real-Time system becomes unresponsive. Each watchdog operates by checking the time elapsed since it was last serviced (the term "pet" is occasionally used in place of "service"). If the elapsed time exceeds the specified timeout period, the watchdog notifies its monitor.
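The layered watchdog arrangement can be modeled in a few lines. The Python sketch below is an illustration only, not the reference example's code: Watchdog and RTSupervisor are hypothetical names, time is passed in as an integer tick count rather than read from a hardware timer, and the loop names in the usage example are invented.

```python
class Watchdog:
    """Reports a timeout when it has not been serviced recently enough."""

    def __init__(self, timeout, now=0):
        self.timeout = timeout
        self.last_service = now

    def service(self, now):
        # Called by the monitored party to show that it is still alive.
        self.last_service = now

    def is_safe(self, now):
        # Unsafe once the elapsed time exceeds the timeout period.
        return (now - self.last_service) <= self.timeout


class RTSupervisor:
    """Model of the RT control loop that monitors the other RT loops."""

    def __init__(self, fpga_watchdog, loop_watchdogs):
        self.fpga_watchdog = fpga_watchdog
        self.loop_watchdogs = loop_watchdogs  # dict: loop name -> Watchdog

    def run_once(self, now):
        # Service the FPGA watchdog only if every RT loop has responded.
        if all(w.is_safe(now) for w in self.loop_watchdogs.values()):
            self.fpga_watchdog.service(now)
            return True
        return False


# Usage: one stalled RT loop eventually trips the FPGA-side watchdog.
fpga = Watchdog(timeout=100)
loops = {"logging": Watchdog(timeout=50), "comms": Watchdog(timeout=50)}
supervisor = RTSupervisor(fpga, loops)
loops["logging"].service(40)
loops["comms"].service(40)
serviced_first = supervisor.run_once(60)    # both loops responded in time
loops["logging"].service(100)               # "comms" has stalled
serviced_second = supervisor.run_once(120)  # FPGA watchdog is not serviced
```

Because the supervisor stops servicing the FPGA watchdog as soon as any monitored loop goes quiet, a single stalled loop propagates to the FPGA after at most the FPGA timeout period.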
Note: Be sure to disable the FPGA watchdog before installing software or re-imaging a CompactRIO system. Otherwise, the controller will be rebooted, causing the installation or re-imaging process to fail.
Example code from the Example Code Exchange in the NI Community is licensed with the MIT license.