LabVIEW

What can trigger cRIO app to restart randomly?

Judging from my posts on the NI forum, one can easily see that I deal a lot with random restarts and memory leaks. But here I go again.

I am using a cRIO-9067 running a fairly intense data acquisition application (120 MB of data per hour uploaded to our server) for days at a time, and it randomly restarts. One reliable source of restarts is running out of memory; that happens quite regularly. The cRIO can be expected to run for 7-10 days until it restarts, which is roughly long enough for our application.
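For context, memory usage over time can be tracked with a small helper along these lines (a hypothetical Python sketch, not our actual code; it assumes Python is installed on the target and the log path is a placeholder), so a restart can later be lined up against the memory curve:

#!/usr/bin/env python
# Hypothetical memory logger for an NI Linux RT target (illustration only).
# Appends available system memory and the lvrt resident set size to a CSV
# once a minute, so a restart can later be correlated with memory exhaustion.
import os
import time

LOG_PATH = "/home/lvuser/mem_log.csv"   # placeholder location; adjust as needed

def mem_available_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return -1

def lvrt_rss_kb():
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                if f.read().strip() != "lvrt":
                    continue
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        return int(line.split()[1])
        except IOError:
            continue
    return -1

while True:
    with open(LOG_PATH, "a") as log:
        log.write("%d,%d,%d\n" % (time.time(), mem_available_kb(), lvrt_rss_kb()))
    time.sleep(60)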

 

But random restarts obviously occur as well: the lvrt process memory is not high, yet the app stops responding, and after 10 minutes one of the three watchdogs (RT Watchdog, FPGA Watchdog, or a custom Linux script) successfully restarts the cRIO.

 

The RT application can of course restart the cRIO when needed, but we log that thoroughly and I am sure it does not happen. Nothing in the application can shut itself down in a way that would later force a watchdog to restart the cRIO. All of the watchdogs are set to restart the complete system, so a watchdog malfunction cannot be the cause.
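To be clear about the third watchdog: the actual script is not shown here, but the idea is roughly the following (a simplified, hypothetical Python sketch; the heartbeat path and intervals are placeholders). The RT application periodically touches a heartbeat file, and if the file goes stale for more than 10 minutes the script reboots the whole system:

#!/usr/bin/env python
# Hypothetical sketch of a "custom Linux script" watchdog (illustration only).
# The RT application is assumed to touch /home/lvuser/heartbeat periodically;
# if the file stays stale for longer than 10 minutes, the system is rebooted.
import os
import time

HEARTBEAT = "/home/lvuser/heartbeat"   # placeholder path, written by the RT app
TIMEOUT_S = 600                        # 10 minutes, matching the timeout above

while True:
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT)
    except OSError:
        age = TIMEOUT_S + 1            # a missing file counts as stale
    if age > TIMEOUT_S:
        os.system("reboot")            # restart the complete system
    time.sleep(30)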

 

The LabVIEW error logs are not useful either. I will demonstrate with an example. The cRIO had been running since 13th March. On the 23rd at 2:04 it ran out of memory and restarted at 2:13; later, on the 27th at 0:14, it restarted randomly. The error log (13th to 23rd March) says:

 

####
#Date: Fri, Mar 13, 2020 09:50:36 AM
#OSName: Linux
#OSVers: 4.14.87-rt49-cg-7.1.0f0-xilinx-zynq-41
#OSBuild: 265815
#AppName: lvrt
#Version: 19.0
#AppKind: AppLib
#AppModDate: 


InitExecSystem() call to GetCurrProcessNumProcessors() reports: 2 processors
InitExecSystem() call to GetNumProcessors()            reports: 2 processors
InitExecSystem()                                      will use: 2 processors
starting LV_ESys1248001a_Thr0 , capacity: 24 at [3666937837.99443293, (09:50:37.994433000 2020:03:13)]
starting LV_ESys2_Thr0 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr1 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr2 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr3 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr4 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr5 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr6 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
starting LV_ESys2_Thr7 , capacity: 24 at [3666937838.69144917, (09:50:38.691449000 2020:03:13)]
Thread consumption suspected: 1 Try starting 1 threads
starting LV_ESys2_Thr8 , capacity: 24 at [3666945683.47829294, (12:01:23.478293000 2020:03:13)]
Thread consumption suspected: 5 Try starting 1 threads
starting LV_ESys2_Thr9 , capacity: 24 at [3667773845.43022680, (02:04:05.430227000 2020:03:23)]

 

Note the thread consumption message that happened a few minutes before the restart, probably at app shutdown. This is the expected result.

 

The newer log file (23rd to 27th) looks the same, but there is no thread consumption message.

 

kern.log shows the same output on the 23rd and the 27th; these errors are normal for every startup. Nothing else is there:

2020-03-23T02:13:00.807+00:00 NI-cRIO-9067-01cc3b38 kernel: [    2.567323] Warning: unable to open an initial console.
2020-03-23T02:13:00.810+00:00 NI-cRIO-9067-01cc3b38 kernel: [    6.664650] ubi0 error: ubi_open_volume: cannot open device 0, volume 2, error -16
2020-03-23T02:13:00.810+00:00 NI-cRIO-9067-01cc3b38 kernel: [    6.705092] ubi1 error: ubi_open_volume: cannot open device 1, volume 0, error -16 
2020-03-27T00:14:26.872+00:00 NI-cRIO-9067-01cc3b38 kernel: [    2.536430] Warning: unable to open an initial console.
2020-03-27T00:14:26.875+00:00 NI-cRIO-9067-01cc3b38 kernel: [    7.158970] ubi0 error: ubi_open_volume: cannot open device 0, volume 2, error -16
2020-03-27T00:14:26.875+00:00 NI-cRIO-9067-01cc3b38 kernel: [    7.205540] ubi1 error: ubi_open_volume: cannot open device 1, volume 0, error -16 

All of this happens repeatedly on both a 9067 and a 9064; both were formatted and reinstalled and the problem still persists. What are the ways to debug such a problem, or what might cause it? Any ideas? Thank you.

Message 1 of 6

Since I have to guess, probably your "quite intense data acquisition" routine is getting a little bit "out of whack".  I'm going to assume it is coded as something resembling a State Machine (so a QMH, a series of interacting loops, etc.).

 

Working with an older (and much slower) PXI system using the LabVIEW Real-Time Module, I also encountered situations where the system errored out and forced a PXI shutdown/restart.  As part of my data logging (on the Host), I was saving "Events", time-stamped as "milliseconds since Start of Program" (so I had about 7 weeks before the clock "rolled over"), from both the Host and the RT Target.  I initially logged (among other things) State Machine Transitions, so I at least got a clue where the RT code was around the time of the "unexpected behavior".
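The code itself was LabVIEW, but the logging idea is simple enough to sketch in text form (a hypothetical Python analogue, not the actual implementation): every State Machine transition is written out with a millisecond timestamp measured from Program start, so the last entry in the log tells you roughly where the code was when the trouble started.

# Hypothetical Python analogue of the event logging described above (the real
# code was LabVIEW).  Each state-machine transition is recorded with a
# millisecond timestamp relative to program start, so the last logged state
# hints at where the code was when the "unexpected behavior" occurred.
import time

_start = time.time()

def log_event(log_path, source, event):
    ms = int((time.time() - _start) * 1000)   # milliseconds since start of program
    with open(log_path, "a") as f:
        f.write("%d\t%s\t%s\n" % (ms, source, event))

# Example: record a state transition on the RT Target
log_event("/home/lvuser/events.log", "RT", "state: Idle -> Acquire")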

 

In one case, I added some additional "Debug logs" (I don't remember whether they were passed to the Host or saved on the Remote), designed to "trap" details that seemed to be related to the anomalous response.  As often as not, it was a "wiring Error" of mine that had code where the wires "looked OK", but the wire didn't really go to the terminal where it was apparently connected.  [A simple example is putting a VI with Error In/Error Out on an existing Error Line and wiring (only) Error In to the input terminal -- if this VI throws an Error and produces nonsense, no Error Out is recorded, and the rest of the code is "uninformed"].

 

Bob Schor

Message 2 of 6

@Thomas444 wrote:

What are the ways to debug such problem, or what might cause it? Any ideas? Thank you.


Turning features on/off (to locate the problem; see the sketch after this list).

Speeding up processes to see if you can reproduce the crash faster.

Shrinking/growing buffers to see if you can reproduce the crash faster.

Using the Real-Time Execution Trace Toolkit (to spot leaks and to notice oddities).
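As a rough illustration of the first and third points (a hypothetical Python sketch standing in for the LabVIEW application; the file name and keys are invented), the idea is to drive feature enables and buffer sizes from a small config file, so subsystems can be switched off and buffers shrunk or grown between runs without rebuilding the application:

# Hypothetical sketch: read feature flags and a buffer-size scale factor from a
# config file at startup, so parts of the application can be disabled and
# buffers resized between runs while hunting for the crash.
import json

CONFIG_PATH = "/home/lvuser/debug_config.json"   # invented file, for example:
# {"enable_upload": false, "enable_fpga_read": true, "buffer_scale": 0.25}

with open(CONFIG_PATH) as f:
    cfg = json.load(f)

BUFFER_SIZE = int(65536 * cfg.get("buffer_scale", 1.0))  # scaled acquisition buffer

if cfg.get("enable_upload", True):
    pass  # start the server-upload loop here
if cfg.get("enable_fpga_read", True):
    pass  # start the FPGA read loop here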

Message 3 of 6

If you try the RT Execution Trace Toolkit (and enable the detailed logs), I'd be cautious about reading too much into a bunch of green flags for memory management.

 

I still haven't managed to track down what an acceptable number/frequency/coverage of flags is, but I was initially very concerned by my number of calls to the Memory Management system.

However, after finding the real problem, I still have lots of green flags but no problems.

So, if you see a lot of green flags, it may or may not be a huge problem.

 

If someone else has some guidance, I'm happy to be corrected...


Message 4 of 6

Did you manage to solve your problem?

 

I get the same error messages on an sbRIO-9627 and would appreciate some leads as to where to start looking for this.

 

Thanks.

Message 5 of 6

In cases like this I usually try to come up with the most minimal project that reproduces the issue and then send that to NI.  Historically this is followed up by NI saying the project is too complicated for them to understand, and I keep trimming it down... anyway.  With a crash that appears to depend on processing data and takes a long time to occur, this type of technique isn't as useful.

 

The cause of this is either your code, or the NI tools you write your code on top of.  That includes the run-time engine, the cRIO OS, things like the DAQmx or VISA drivers, or anything else.  NI probably gets support calls all the time for programs that crash due to poorly written code, so they are probably more inclined to think the issue isn't with their stuff unless there is solid proof otherwise.  I have had several examples of projects that crashed on Linux RT where it took far too long to convince NI the problem was on their end.

Message 6 of 6