Designing Reliable Embedded Systems Presentation

Daklu · ‎10-28-2013

I'm still a newbie to embedded systems having done less than a dozen cRIO projects, none of which used the watchdog feature. I'm a little confused about the various watchdogs listed on slide 39.

The FPGA Watchdog shown in the slide implies there are no built-in watchdog capabilities for the FPGA. I assume we have to build our own using the System Reset function?

As others have mentioned, the RT Watchdog is implemented in hardware on the cRIO. If this watchdog expires, I assume it will only reset the RT controller, and not automatically reset the FPGA? That said, if the FPGA code is started as part of the RT initialization process (such as by checking the "Run the FPGA VI" option in the Configure Open FPGA VI Reference dialog box), would resetting the RT controller effectively reboot the entire system, or does the "Run the FPGA VI" vi return a reference to the still running FPGA VI?

The Thread Watchdog appears to be an API for monitoring multiple loops. This would be used as the mechanism for "all other loops (to) provide feedback to the watchdog loop" Aaron mentioned in the first post. Yes, no? (FWIW I think the inclusion of a Shutdown mode makes this API more complicated than necessary and less useful than it would be without it.)

I understand the need for the FPGA code and RT code to monitor each other when both code bases are stateful; I'm still trying to figure out the different ways to accomplish that and determine what the tradeoffs are.

Aaron_G · ‎10-28-2013

Daklu,

I have not looked at the watchdog API yet, and I am only guessing at what the shutdown functionality does, but there is a good reason to have a shutdown/cleanup type functionality. Suppose you are running the code in the debug environment and then stop it (either programmatically or via an abort) to make changes. You can be sitting there about to make a change and then see "Connection to Target has been lost". If you do not shut down the watchdog properly, it will assume your code hung and reboot the controller. Oops.

Daklu · ‎10-28-2013

Aaron_G wrote:
Suppose you are running the code in the debug environment and then stop it (either programmatically or via an abort) to make changes. You can be sitting there about to make a change and then see "Connection to Target has been lost". If you do not shut down the watchdog properly, it will assume your code hung and reboot the controller. Oops.

I assume by "running the code in the debug environment" you're talking about clicking the Run button on an RT vi. To be honest, most of the RT apps I've written were intended to be run from within LabVIEW and I'm not familiar with all the background magic LV does to make that easy.

For instance, I assumed the hardware watchdog was automatically disabled when user code stopped running in the debug environment. If the hardware watchdog is not automatically disabled...

1. There's no way to prevent a reset if we use the Abort button to stop an RT vi.

2. If we use the RT Watchdog and want to avoid disconnecting, we have to create and use a shutdown process that disables the watchdog. This seems like wasted work on RT systems that are only intended to execute a single application. Why create shutdown code for an application that is never intended to stop? Resetting the controller accomplishes the same thing as cleanup code. Is there enough value in avoiding the disconnect to justify the added complexity?

Hmm... seems like those are fairly negative user experiences and I'd be surprised if NI allowed them to persist. Are you sure the watchdog isn't automatically disabled?

(Later: In the Watchdog Configure vi's Expiration Actions input cluster, there is a boolean switch to "disable watchdog on vi exit" which defaults to True. Presumably this should prevent the disconnections you were referring to.

Unfortunately the help file doesn't mention it at all, so discovering a note in the "SW Watchdog on RT.vi" example saying,

***HW Watchdog Note:
Make sure to set the "disable watchdog on VI exit" flag to FALSE in deployed application.

leaves me more confused than I was before.)

Aaron_G wrote:
I have not looked at the watchdog API yet, and I am only guessing at what the shutdown functionality does, but there is a good reason to have a shutdown/cleanup type functionality.

Even if there is a reason for shutdown/cleanup functionality in an RT app, that doesn't seem like what the API is trying to enable with this mode. The Watchdog.ShutdownMode description is, "Set the Pound to be in shutdown mode. In shutdown mode, the Pound will check to make sure that each process stops executing within the timeout," and the Pound.ShutdownMode description is "Puts the Pound in shutdown mode where it expects all of the watchdog to stop executing." I read that as a verification that each watchdog is closed within a certain amount of time.

At the very least the descriptions are misleading and the "SW Watchdog on RT.vi" example doesn't do a very good job of illustrating how it should be used. At worst the API includes incomplete aspects of loop control and communication mechanisms instead of limiting it to what its purpose is--to make sure each loop is still functioning correctly and hasn't hung.

I appreciate the work the Systems Engineering group puts into releasing code, but I sure wish they would quit adding kitchen sink features to APIs in an effort to make them easier to use... that strategy usually has the opposite effect.

Burt_S · ‎10-28-2013

Hello Daklu,

You are correct that there is no built-in watchdog for the FPGA and that you must build your own using the System Reset function. There are some examples available like the Fail-safe Reference Design mentioned in the slides as well as the FPGA Control Sample Project that started shipping in LabVIEW 2012.
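As a rough illustration of the "build your own" idea (this is not NI code, and a real implementation would be an FPGA VI rather than text code), the sketch below simulates a counter-based watchdog in Python. The class name `CounterWatchdog`, the tick-based timeout, and the safe output value of 0.0 are all assumptions made for the example; it does not model the System Reset function itself.

```python
class CounterWatchdog:
    """Tick-driven watchdog: expires if it is not petted within timeout_ticks."""

    def __init__(self, timeout_ticks):
        if timeout_ticks <= 0:
            raise ValueError("timeout_ticks must be positive")
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def pet(self):
        """Called by the monitored side (e.g. the RT loop) to prove it is alive."""
        if not self.expired:  # expiry latches; a pet cannot undo it
            self.remaining = self.timeout_ticks

    def tick(self):
        """Called once per loop iteration by the watchdog side (e.g. the FPGA loop)."""
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True

    def output(self, commanded):
        """Pass the commanded value through while healthy, else a safe value."""
        return 0.0 if self.expired else commanded  # 0.0 is a hypothetical safe state


wd = CounterWatchdog(timeout_ticks=3)
wd.tick()
wd.tick()
wd.pet()              # counter reloaded before it reached zero
wd.tick()
wd.tick()
healthy = not wd.expired
wd.tick()             # third tick without a pet: the watchdog expires
print(healthy, wd.expired, wd.output(5.0))
```

Latching the expired state is a design choice in this sketch: once the monitored side has missed its deadline, the outputs stay in the safe state until something deliberately restarts the system.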

Your questions about how and when the FPGA gets reset are all great. Unfortunately this is an extremely confusing topic. My goal in the presentation was to introduce the concept of a watchdog and encourage people to use them, but I purposefully avoided getting into the details. Now let me try to clear some of this up.

Generally speaking, rebooting RT will not reset the FPGA program. Obviously if you happen to close the reference to the FPGA before you reboot then that is a different story. There is one exception to this in the form of a configuration option that I will talk about later, but its use-cases are extremely limited.

The Open FPGA VI Reference will not change the state of a currently running FPGA unless you check the 'Run the FPGA VI' option in its configuration. This option will essentially reset and then run the FPGA VI. Leaving this unchecked will connect to an existing VI if it is the same as the VI that is configured.

For a fail-safe architecture, I generally recommend downloading your FPGA bitfile to flash using the RIO Device Setup while also configuring the flash to 'Autoload on device powerup'. The 'Autoload on device reboot' option is the exception I mentioned that can cause your FPGA to reset along with RT. Using 'Autoload on device powerup' along with leaving the 'Run the FPGA VI' flag unchecked in your open function allows your FPGA to function before RT even boots up. Once RT does boot up, you can connect to the FPGA VI that is already running without disturbing it.

The last piece missing here is how you get the FPGA VI to run in the first place. This is done with the 'Run when loaded to FPGA' configuration option in your FPGA Build Specifications.

I did my best to make sure this got documented in our user manual for the FPGA Control Sample Project. The relevant details are at the end of the PDF.

https://decibel.ni.com/content/docs/DOC-23262
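The reset rules described above can be condensed into a small decision function. This is only a paraphrase of this post in Python, not an NI API: the names `Autoload` and `fpga_disturbed_by_rt_reboot` are made up for the example, and it assumes the RT startup code calls Open FPGA VI Reference after every reboot with the same bitfile that is already on the FPGA.

```python
from enum import Enum


class Autoload(Enum):
    """Flash autoload settings named in this thread (RIO Device Setup)."""
    NONE = "Do not autoload VI"
    ON_POWERUP = "Autoload on device powerup"
    ON_REBOOT = "Autoload on device reboot"


def fpga_disturbed_by_rt_reboot(autoload, run_flag_checked):
    """Return True if an RT reboot resets an FPGA VI that was already running.

    Assumes RT reopens the FPGA reference after the reboot (hypothetical setup).
    """
    if autoload is Autoload.ON_REBOOT:
        return True   # the bitfile is reloaded from flash along with RT
    if run_flag_checked:
        return True   # 'Run the FPGA VI' resets and then runs the VI on open
    return False      # RT simply reconnects to the VI that is still running


for mode in Autoload:
    for flag in (False, True):
        print(f"{mode.value!r}, run flag {flag}: "
              f"disturbed={fpga_disturbed_by_rt_reboot(mode, flag)}")
```

Under these assumptions, the recommended fail-safe combination ('Autoload on device powerup' with the run flag unchecked) is one of the two cases that leaves the FPGA undisturbed.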

Again I realize all of this is extremely complicated. Hopefully my explanation along with the PDF will help clear some things up. Let me know if you have any additional questions though. I will respond to your Thread Watchdog questions in your discussion with Aaron below.

Burt_S · ‎10-28-2013

The purpose of the Software Watchdog component is to have a loop that can monitor the execution of any other critical loops in your system. Without this functionality, any FPGA or RT watchdog is only really checking that one loop on the system is still executing. This still protects you from a lot of failure modes like a complete system hang, but doesn't protect you from the case where a single loop stops unexpectedly. I wanted to bring up this component as an example of how to solve this problem.

As for the shutdown mode behavior, this mode is used to invert the behavior of the watchdog. Under normal operating conditions, the watchdog is used to make sure that every loop is still running properly. Once you switch to shutdown, it is instead looking for any loops that haven't stopped appropriately. You can use this functionality to log any improper shutdown behavior and to still reboot even if a loop has somehow gone rogue. You could consider this a kitchen sink feature, but I'd have to respectfully disagree.

I would agree that this feature isn't documented very well, but again if nothing else this component can serve as an example for how to do this in your own applications.
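For anyone who wants to try the idea outside LabVIEW, here is a minimal Python sketch of a loop monitor with the inverted shutdown behavior described above. It is not the Software Watchdog component's actual API: `LoopMonitor`, its method names, and the single shared timeout are all assumptions for illustration.

```python
import time


class LoopMonitor:
    """Tracks check-ins from several loops; shutdown mode inverts the check."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = {}           # loop name -> time of last check-in
        self.stopped = set()          # loops that reported a clean stop
        self.shutdown_started = None  # None means normal mode

    def register(self, name):
        self.last_seen[name] = self.clock()

    def check_in(self, name):
        """Called by each monitored loop once per iteration."""
        self.last_seen[name] = self.clock()

    def report_stopped(self, name):
        """Called by a loop as its last action when it exits."""
        self.stopped.add(name)

    def begin_shutdown(self):
        self.shutdown_started = self.clock()

    def offenders(self):
        """Names of loops that violate the current mode's expectation."""
        now = self.clock()
        if self.shutdown_started is None:
            # Normal mode: flag loops that have been silent for too long.
            return sorted(n for n, t in self.last_seen.items()
                          if now - t > self.timeout_s)
        if now - self.shutdown_started <= self.timeout_s:
            return []  # loops still have time left to stop
        # Shutdown mode: flag loops that never reported stopping.
        return sorted(n for n in self.last_seen if n not in self.stopped)


monitor = LoopMonitor(timeout_s=0.05)
monitor.register("acquisition")
monitor.register("logging")
monitor.check_in("acquisition")
print(monitor.offenders())  # nothing has timed out yet, so this prints []
```

A watchdog loop would call `offenders()` periodically and pet the hardware watchdog only while the list is empty; what to do with a non-empty list (log, reboot, both) is left to the application.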

maxwellb · ‎10-28-2013

Hi Burt,

I understand that the performance degradation at low disk space levels is documented in the white paper you showed above, but is it something that NI is able to improve? I have an application that requires high data throughput to disk (no network connection available) and I want to use every bit of that hard drive space in order to maximize acquisition time. I also need all of the streaming rate in order to keep up. The performance issues below 20% free space effectively reduce my maximum acquisition time by 20%. My customer was not pleased when I discovered this rather late in the development process.

Max
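To put numbers on that 20% loss, acquisition time is just usable capacity divided by streaming rate. The short sketch below works this through; the 4000 MB capacity and 5 MB/s rate are hypothetical figures chosen for easy arithmetic, and `acquisition_seconds` is an illustrative helper, not part of any NI API.

```python
def acquisition_seconds(capacity_mb, rate_mb_s, reserved_free_fraction=0.0):
    """Seconds of streaming before the usable part of the disk is full."""
    if rate_mb_s <= 0:
        raise ValueError("rate_mb_s must be positive")
    if not 0.0 <= reserved_free_fraction < 1.0:
        raise ValueError("reserved_free_fraction must be in [0, 1)")
    usable_mb = capacity_mb * (1.0 - reserved_free_fraction)
    return usable_mb / rate_mb_s


CAPACITY_MB = 4000.0  # hypothetical disk size
RATE_MB_S = 5.0       # hypothetical sustained write rate

full = acquisition_seconds(CAPACITY_MB, RATE_MB_S)
derated = acquisition_seconds(CAPACITY_MB, RATE_MB_S, reserved_free_fraction=0.20)
print(f"whole disk: {full:.0f} s, keeping 20% free: {derated:.0f} s")
```

With these made-up figures, filling the whole disk would give 800 s of acquisition, while stopping at 20% free space gives 640 s.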

Daklu · ‎10-28-2013

Burt,

Thanks for the explanation and links about resetting. I'll need some time to absorb this but a few questions come to mind immediately:

First, what does it mean for an fpga bitfile to be loaded but not running? I've always assumed loading a bitfile (not to be confused with downloading a bitfile to flash memory) alters the fpga fabric so it will execute the logic I've defined in the vi. I've also assumed the fpga fabric is always exposed to clock signals and executing its logic with whatever information is in the registers/block RAM and on the I/O lines. These two assumptions naturally lead to a situation where loading a bitfile is equivalent to running it, which according to NI documentation doesn't appear to be correct.

- Is "loaded but not running" a state generally built into all fpga hardware, or does NI add some overhead code when the fpga vi is compiled to allow this state?

- What is actually going on in the fpga when the bitfile isn't running?

Second, there don't appear to be many options available to explicitly load a bitfile onto the fpga fabric. As near as I can tell our only options are to:

1. use the Open FPGA Reference vi in RT code, which embeds the bitfile in the vi and loads it when the vi executes, or,

2. use the "Download" option on an fpga build spec right-click menu, or,

3. use the RIO Device Setup utility to download the bitfile to flash memory and apply one of the autoload options.

- Does the right-click Download option put the bitfile on the flash memory, or does it load it directly to the fpga fabric? If it puts it on flash memory, what autoload behavior does it apply? If it applies the bitfile directly to the fpga fabric, does it persist through power cycles?

The "Do not autoload VI" option in RIO Device Setup implies there's a way to load a bitfile onto the fpga fabric other than the three I listed here, since neither option 1 nor option 2 uses the bitfile located on flash. How would I load a bitfile from flash if I used the "Do not autoload" option?

It's pretty obvious from my questions that I don't have a clue how everything works together. Until I am able to create a better mental model in my head I'll keep being surprised when things don't work how I expect they would, and that frustrates both me and my customers.

----------

Burt_S wrote:
You could consider this a kitchen sink feature, but I'd have to respectfully disagree.

I'm not saying the shutdown feature isn't useful in certain situations, I'm saying its inclusion in an API named "Watchdog" is not intuitive and is unnecessary baggage. A watchdog is generally understood to be a device that executes an action unless it is reset (or pet) at regular intervals. Checking to make sure all the loops have stopped isn't behavior one would normally associate with a watchdog--that behavior belongs to a shutdown monitoring algorithm. If you want a component that both monitors loops while live and makes sure they stop correctly, better to call it a "Thread Monitor" than a "Thread Watchdog."

Since the shutdown feature operates outside of normal communication channels often established between loops, using it (the shutdown feature) creates multiple communication channels and adds considerable complexity to one's code. The only time I can think of where it makes sense (IMO) to use the shutdown feature is in applications that don't use any messaging between loops--either all information is shared via references or no information sharing is necessary. I could use a Thread Watchdog; I have much less use for a Thread Monitor.

I do agree it serves as a good example of how to implement this in my own applications. It's just a bit disappointing to see so many APIs that would be useful in a wider range of applications if they were more concise. As a developer I can write my own code to add features I need; I can't remove unnecessary features from an existing API.

Burt_S · ‎10-29-2013

Hi Max,

I honestly am not sure whether or not this is something that could be improved. Benchmarking on the cRIO 9068 does not appear to show this slow-down so I am optimistic that this will no longer be an issue for any new cRIOs that are released. If your goal is maximizing your acquisition time, have you considered adding an SD Card Module to your system for additional storage space? You could also add additional storage space using the USB port (although I know this approach is usually and understandably unpopular).

Burt_S · ‎10-29-2013

Loading the bitfile is not equivalent to running it because of the implicit enable signal. You can find more information about this here.

http://zone.ni.com/reference/en-XX/help/371599J-01/lvfpgaconcepts/fpga_routing_congestion/

The following presentation may help give you a better view of everything that goes into an FPGA. The implicit enable signal is just one of many pieces that exists external to what is shown on your block diagram.

https://decibel.ni.com/content/docs/DOC-23675

For the right-click 'Download' functionality, I don't know that I have used this feature, but the following Help article seems to imply that this loads the bitfile the same way that the Open FPGA Reference function would. As far as I know, you can only download the bitfile to flash using the RIO Device Setup or with its corresponding APIs.

http://zone.ni.com/reference/en-XX/help/371599J-01/lvfpgahelp/compiling_fpga_vis_howto/

maxwellb · ‎10-29-2013

Thanks Burt. I have considered adding an SD card module, but my concern is that the 9802 says it supports write rates of up to 2 MB/s. I'm typically writing at 4-5 MB/s.

LabVIEW Architects Forum