Long term stand-alone stability of Linux RT

Mads · 08-23-2019

Is Linux RT set up to handle running for a long time without any interaction? I would assume so based on the intended use case (PACs), but (how much) would I be wrong? My only experience with Linux is really what I encounter when using Linux RT on PACs, and this part I have not studied sufficiently yet.

If I set up a cRIO with Linux RT and start a LabVIEW RT application on it, I can make sure that application does its own housekeeping so it does not fill up the drive or memory. Will the Linux RT setup do the same out of the box? Does it limit the size of its logs, for example? If not, what would be the minimum steps necessary to make it probable that the device could run for many years without interaction? All ways to harden the solution / make it likely to keep on running its application would be nice to know about. I guess I can find a lot of this information on general Linux forums, but maybe Linux RT has already been designed / configured to handle many of these issues?

Needing to restart the device to get it back up and running is acceptable in most of my cases. If there are maintenance commands that need to be run regularly, that is fine as long as I can add them to my application. Needing to log on regularly, though, is not an option.

Mads Toppe
Check out our Modbus Test Master - developed in LabVIEW

GatorChomp · 08-23-2019

Hi Mads,

I'm not sure I fully understand what you're asking. You differentiate several times between a "cRIO with Linux RT" and a "Linux RT setup." Can you explain what you mean? Are you comparing the OS to an application on the OS?

We intend the OS to run for long periods without interaction, and I'm not aware of anything you would need to do outside of your own application to accomplish this. I know some people have run Linux Real-Time systems for months on end with no interaction required depending on the design of their system. The OS shouldn't fill up the system memory and I believe we do limit the log size via configuration files at /etc/logrotate.d/. For example, the messages log is limited to 1M.
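One rough way to check this on a target is a script that lists any log file larger than an expected size. This is only a sketch: it assumes Python is installed on the target, and the function name `oversized_logs` and the 2 MiB threshold are illustrative choices, not NI defaults.

```python
import os


def oversized_logs(log_dir, limit_bytes):
    """Return (path, size) pairs for files under log_dir larger than limit_bytes."""
    found = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file was rotated away or is unreadable
            if size > limit_bytes:
                found.append((path, size))
    # Largest offenders first.
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical threshold: anything above 2 MiB is worth a look.
    for path, size in oversized_logs("/var/log", 2 * 1024 * 1024):
        print(f"{size:>12d}  {path}")
```

An empty result after a few weeks of uptime suggests rotation is doing its job; a non-empty one points at the file to investigate.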

Obviously, regardless of what I say here, you should validate this on your own system. If you're more comfortable doing periodic restarts, that's definitely something that can be handled headlessly through the NI System Configuration API.

Charlie J.
National Instruments

Mads · 08-23-2019

Hi,

The cRIO was just an example; I mean Linux RT in general (or at least for ARM targets in my case). So yes, what I was after was whether you had put limits in place that ensure the system is not bound to crash just from running for a long time (several years). The specific event that triggered my interest in this was a cRIO that was using 70 MB more memory than before, even when not running an application. I have not figured out why that is yet, so it might be unrelated (a corruption issue perhaps), but it got me worrying that there could be an accumulation issue just from the way the OS itself is set up.
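To narrow down where an increase like that comes from, one option is to compare two snapshots of /proc/meminfo taken some time apart. The sketch below assumes Python is available on the target; `parse_meminfo` and `meminfo_delta` are illustrative names, and the 1 MiB reporting threshold is arbitrary. The kernel's `Shmem` field includes tmpfs pages, so growth there points at files held in RAM rather than at a process.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])  # most fields are in kB
    return info


def meminfo_delta(before, after, min_kb=1024):
    """Return fields whose value changed by at least min_kb between snapshots."""
    delta = {}
    for key in before.keys() & after.keys():
        change = after[key] - before[key]
        if abs(change) >= min_kb:
            delta[key] = change
    return delta


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            snapshot = parse_meminfo(f.read())
    except OSError:
        snapshot = {}  # not running on Linux
    for key in ("MemTotal", "MemFree", "Buffers", "Cached", "Shmem"):
        if key in snapshot:
            print(f"{key:<10}{snapshot[key]:>10d} kB")
```

Saving one snapshot right after boot and another weeks later, then passing both to `meminfo_delta`, shows which counters moved.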

Mads Toppe

rtollert · 09-19-2019

About the simplest way to answer your question is that there isn't anything up our sleeves AFAIK.

Regarding leaks:

Log sizes are capped with logrotate, configured in the usual way. Lack of log rotation is a major bug, and gets fixed in the next release, but generally is not believed to merit a patch. This process happens reactively, not proactively.
OTOH, /var/log is symlinked to a tmpfs, so if you do encounter an unrotated logfile, chances are that it will manifest itself as running out of memory. This can result in some rather humorous investigations if you are not aware of it.
LabVIEW's memory management is rather baroque. Many internal data structures grow with increasing utilization but do not shrink with decreasing utilization. Large changes in quiescent memory usage, in and of themselves (i.e. when they aren't leaks) are not tracked or prioritized.
If you can ensure that there is a host PC that can interact with the target long-term, you could plausibly automate redeploying a new system image via NI System Configuration periodically. This would be very challenging to accomplish on a standalone ARM target, for an internal NI developer, let alone a customer, but one could imagine alternative approaches which could work nearly as well. For instance, write a daemon which stops the lvrt service if free disk space falls below some threshold, blows away specific directories, re-extracts them from a tarball, then restarts lvrt.
Most OS hackery will interoperate very poorly with the RT install process, i.e. any configuration files you modify will get rewritten the next time you reinstall/upgrade from MAX.
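The daemon idea above could look roughly like the following. This is a sketch under stated assumptions, not a tested NI procedure: Python is assumed to be installed on the target, all function names are made up for illustration, and the commands that stop and start lvrt are deliberately left to the caller as `stop_app` and `start_app`, since the right way to do that depends on the target.

```python
import shutil
import tarfile
from pathlib import Path


def usage_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def restore_directory(target_dir, tarball):
    """Delete target_dir, then re-extract it from a trusted tarball.

    The tarball is assumed to contain target_dir's name as its top-level entry.
    """
    target = Path(target_dir)
    if target.exists():
        shutil.rmtree(target)
    with tarfile.open(str(tarball)) as archive:
        archive.extractall(str(target.parent))


def maintain(mount_point, threshold, target_dir, tarball,
             stop_app, start_app, usage=usage_fraction):
    """Restore target_dir if the filesystem is fuller than threshold.

    Returns True if a restore was performed, False otherwise.
    """
    if usage(mount_point) < threshold:
        return False
    stop_app()
    try:
        restore_directory(target_dir, tarball)
    finally:
        start_app()  # bring the application back even if the restore failed
    return True
```

Calling `maintain` from a loop with a sleep, or from a cron job, gives the periodic check. Keeping the tarball outside the directory being restored matters, because that directory is deleted first.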

Regarding long-term execution:

My personal opinion is that I would strongly recommend defining a regular maintenance interval to ensure that the system gets rebooted periodically. A large part of this simply reflects that a lot more testing is done in the weeks-to-low-months range of uptime than in the months-to-years range; it is not due to any concrete issue I am aware of (and I am not aware of any at the moment).
Reboots and power cycles can be surprisingly stressful to any computer (not just NI controllers) and you should not do them willy-nilly. An average of once or twice a day, 365 days a year, over 10 years, is fine. An average of once every hour is not. Stay under 100k.
SSD wear tends to be overblown as a rule, but it is a thing. NI support should be able to hook you up with controller-specific figures if you are concerned about this.
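For a sense of scale, the reboot counts behind those figures are simple arithmetic. The snippet below is only an illustration: the function names are made up, and the 100k ceiling is the rough figure quoted above, not a controller specification.

```python
def total_reboots(per_day, years, days_per_year=365):
    """Number of reboots accumulated at a steady rate over a number of years."""
    return round(per_day * days_per_year * years)


def years_until(ceiling, per_day, days_per_year=365):
    """Years of operation before the reboot count reaches ceiling."""
    return ceiling / (per_day * days_per_year)


if __name__ == "__main__":
    rates = (("once a day", 1), ("twice a day", 2), ("once an hour", 24))
    for label, per_day in rates:
        print(f"{label:<13}{total_reboots(per_day, 10):>8d} cycles in 10 years, "
              f"{years_until(100_000, per_day):6.1f} years to reach 100k")
```

Once or twice a day for 10 years comes to 3,650 or 7,300 cycles, far below the ceiling. Hourly reboots come to 87,600 cycles in the same period and would reach 100k after about 11.4 years.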
