
LabVIEW Architects Forum


Designing Reliable Embedded Systems Presentation

Please find the presentation from last night attached.  I want to thank everyone for the opportunity to present this. 

There was an unanswered question last night about hard drive performance caveats with cRIO controllers.  It turns out that once the drive gets to about 70-75% full, you should expect a sudden and significant decrease in disk write speed.  This is likely due to the wear-leveling algorithms on the controller's storage and, as far as I know, it only occurs on VxWorks-based controllers.  The new 9068, for example, does not appear to have this issue.  The following article gives some good benchmarks on expected disk throughput for a few of our controllers.

http://www.ni.com/white-paper/9272/en
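For anyone who wants to reproduce those throughput numbers on their own target, here's a minimal sketch of a sequential-write benchmark. It's plain Python rather than LabVIEW (treat it as pseudocode for the equivalent File I/O loop), and the target path and block size are assumptions of mine, not values from the article:

```python
import os
import time

CHUNK = 4 * 1024 * 1024   # write in 4 MiB blocks (assumed size, tune to taste)
TARGET = "/c/bench.dat"   # hypothetical data path on the controller

def write_throughput(path, chunk_size=CHUNK, chunks=64):
    """Time a sustained sequential write and return MiB/s."""
    data = b"\x00" * chunk_size
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the data all the way to the storage device
    elapsed = time.monotonic() - start
    return (chunk_size * chunks) / (1024 * 1024) / elapsed
```

Running this repeatedly as the disk fills should make the drop-off around 70-75% capacity visible as a step change in the returned rate.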

Let me know if there are any other questions I can help with!

Message 1 of 25

Hi Burt,

In your slides, you talked about how the FPGA could reset the RT side of things. What are the advantages of using the FPGA to do the reset vs using the RT Watchdog VIs to do the reset? In one of my current systems, I pet the RT watchdog as part of the same loop that pets the FPGA watchdog line (and all other loops provide feedback to the watchdog loop so if a different loop is stuck, the watchdog loop knows about it).

Thanks,

Aaron

Message 2 of 25

Does the RT Watchdog work if the entire application hangs? What if the app crashes on Out Of Memory? What if the Run-Time Engine crashes? The FPGA will still be running in all these scenarios.

Message 3 of 25

The cRIOs are equipped with a hardware watchdog, a separate hardware chip on the cRIO, which also continues running in all of those scenarios. Rewriting your own watchdog in FPGA is also a viable (though arguably unnecessary) option.

Message 4 of 25

As Chad mentioned, the RT watchdog is in fact a hardware watchdog, so it will protect you in all of the cases that David asked about.  That said, there are a few good reasons to use the FPGA instead (or in addition to it).

The main reason is that you generally want the FPGA to know if something has gone wrong on the RT side, so it's good to have RT check in with the FPGA anyway.  If the FPGA knows something is wrong, it can take additional action beyond just a reboot, like setting outputs to safe values.  In general, I find implementing the watchdog on the FPGA worth the additional effort.
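To make the check-in pattern Aaron described concrete, here's a rough sketch of the structure: every loop reports to one monitor, and the watchdog loop only pets the (hardware or FPGA) watchdog when all loops have checked in recently. It's Python shorthand for the equivalent LabVIEW loops; pet_hardware_watchdog, the loop names, and the timeout are all hypothetical stand-ins:

```python
import threading
import time

def pet_hardware_watchdog():
    """Hypothetical stand-in for the FPGA/hardware watchdog 'pet' call."""
    pass

class LoopMonitor:
    """Central check-in point: pet the watchdog only if every loop is alive."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}
        self._lock = threading.Lock()

    def check_in(self, loop_name):
        # Each worker loop calls this once per iteration.
        with self._lock:
            self._last_seen[loop_name] = time.monotonic()

    def all_alive(self):
        # True only if every registered loop has checked in recently.
        now = time.monotonic()
        with self._lock:
            return all(now - t < self.timeout_s for t in self._last_seen.values())

def watchdog_loop(monitor, period_s=0.1):
    while True:
        if monitor.all_alive():
            pet_hardware_watchdog()  # miss this long enough and the chip reboots the target
        time.sleep(period_s)
```

The point of the indirection is that any single stuck loop stops the pets, so the hardware watchdog fires even though the watchdog loop itself is still healthy.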

Message 5 of 25

Great presentation. I figured I'd throw out a shameless (but hopefully helpful) plug there that JKI's VI Tester is a great unit and integration testing framework. Also, since it's implemented purely in LabVIEW, you can run the tests directly on the embedded (LabVIEW RT) system (not sure if that's possible with the LVUTF, but maybe it is).  We've had great success using this at JKI for a variety of performance and reliability testing on RT.

Message 6 of 25

In my application, I was petting a watchdog implemented in the FPGA from the RT side, and on the FPGA side was safeguarding the outputs (also monitoring external ESTOP inputs). However, I was using the RT watchdog to reboot the controller rather than the FPGA call that you mentioned, so I did have effectively the same functionality that you mention, just using a different API to handle the RT reboots on watchdog failure.

Message 7 of 25

Forgive my naiveté, but help me understand the need for watchdogs.  What is this talk of "something has gone wrong"?  I have worked with over a dozen cRIO systems over the last three years and have never had one inexplicably, spontaneously "lock up".  Are watchdogs intended as a safeguard against hardware/firmware failure (a Real-Time OS crash?) or software "bugs" (memory leaks, race conditions, etc.)?  I have never experienced the former, and I quickly discover and eliminate the latter prior to "mission critical" operations.  I think, if it's software bugs you're worried about, and if you can recognize the areas of your code where petting the watchdog may be necessary, you can instead design the code so that "something has gone wrong" isn't a possibility.  Am I wrong?

Message 8 of 25

Perhaps my systems are too "simple" -- there aren't any unexpected scenarios, only unlikely ones, and the RT code is designed to catch these and transition to a safe state with configured outputs.  Can someone describe a specific application where a watchdog is necessary?

Message 9 of 25

Michael,

The watchdog is a safety feature against unanticipated problems (any of the ones you mentioned). It will allow the system to reboot itself if it appears to be in trouble. Ideally it will never trigger.

I have seen several cases where, had it been implemented properly, the pain to the end user could have been drastically reduced. One particular system was a cFP system that had undergone a LabVIEW 7 to 7.1 upgrade, and an extremely difficult-to-find race condition had slipped into a very low-level hardware driver. As the developer doing the upgrade, what I was able to isolate (slowly and painfully) after a month or more of effort was that the DataSocket calls to the second cFP chassis were failing and hanging (despite a timeout having been specified). The problem did not occur on the first DataSocket call, nor the second; it would typically occur somewhere between the one hundred thousandth and the four hundred thousandth call, or approximately 24 to 48 hours after the system was started.

The system had not been designed to take advantage of the watchdog, nor did it save current state information that would have allowed it to reboot and resume where it left off. Had the watchdog been part of the original design paradigm, with the state information preserved, we could have recovered the system a lot more gracefully. As it was, the host system had the timing information the operators needed to finish the run (~1-8 hours) before rebooting the system, but this came at an enormous cost in lost productivity. NI was eventually able to isolate and fix the race condition (albeit from a different problem they were observing; they never could reproduce it for us, but the fix they created for that other problem also fixed our customer's system).
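The "save current state so a reboot can resume where it left off" piece of that design can be sketched simply. This is Python shorthand for the equivalent RT file I/O, with a hypothetical checkpoint path; the key detail is the atomic rename, so a watchdog reboot that lands mid-write never leaves a torn state file:

```python
import json
import os

CHECKPOINT = "/c/run_state.json"  # hypothetical persisted-state file on the target

def save_checkpoint(state, path=CHECKPOINT):
    """Atomically persist run state so a watchdog reboot can resume it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes hit the disk before the rename
    os.replace(tmp, path)         # atomic swap: readers see the old file or the new one

def load_checkpoint(path=CHECKPOINT):
    """Return the last saved state, or None on a cold start / corrupt file."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        return None
```

On startup the application calls load_checkpoint; a non-None result means this boot is a recovery, and the run can pick up at the saved step instead of starting over.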

Message 10 of 25