05-16-2016 01:44 AM
Out of the 3 projects that my company has done that use Linux RT cRIOs, 2 periodically crash with error 0x661 at runtime (while running a deployed application, as opposed to running from the IDE):
[error] LabVIEW: (Hex 0x661) The LabVIEW Real-Time process encountered an unexpected error and restarted automatically. LabVIEW Real-Time process restarted
It seems that, if error 0x661 occurs twice without power cycling, the cRIO goes into some kind of safe mode, ceasing all functionality.
There have been a few reported cases over the past year or so:
The links above mainly talk about error 0x661 during deployment, but again, we're seeing it happen during runtime.
There's an open support ticket with NI. The AE looked through our project code but couldn't identify anything obviously wrong, so I'm posting here in hopes that a Linux RT guru has more insight.
Questions:
Thanks for your time.
05-16-2016 09:24 AM
You noted that the system goes into a safe mode when this happens twice without power cycling. You can override that behavior if you'd like; see http://digital.ni.com/public.nsf/allkb/41E8E448E2547CEF86256CFD00678340
Of course, that doesn't fix the actual crash. It just lets your application restart again if the crash happens a second time.
As far as your specific questions:
It's worth noting that in the links you cited reporting the error "during deployment", the actual problem still occurred at runtime. The error is being reported during deployment because that's the first opportunity after a crash that the customer would get to see the error on the host (Windows) UI, since you have to "deploy" to reconnect to the target and have the IDE learn of the error. However, again, the fact that the error code is the same and it happens at runtime doesn't mean that the cause is the same as yours, because that error is used for any crash of the LabVIEW application.
05-16-2016 10:47 PM
Thanks Scot, you've provided a ton of useful info. So basically, 0x661 is a generic crash message, similar to "<Application> has encountered a problem and needs to close" on Windows.
Disabling boot-into-safe-mode and allowing LabVIEW to recover is a slight improvement, but we'd need to look for a way to prevent those crashes in the first place. From what you've said, logging memory usage would be a good place to start.
I'll work with my AE to get those logs. In the event that the logs aren't enough, do you have any recommendations on how else we can monitor/probe our LabVIEW program to capture events leading up to a crash?
We're not calling any CLFNs ourselves. However, our latest project that suffers from 0x661 does use the following APIs:
The OPC UA VIs contain CLFNs. I'm not sure about the Modbus and email VIs, as they are password-protected.
05-17-2016 12:28 PM
There are a variety of low-level tools I could recommend that I would use myself, especially if you are comfortable with Linux, but the AEs will be better for advice at the LV level. Either way, if you could use diagram disable or other refactoring tools to narrow down which of those VIs seem connected to the crash, that would be super helpful.
05-18-2016 01:39 AM
05-18-2016 09:01 AM
Nice. How long does it take to reproduce the issue, can you trigger it pretty reliably?
You can install GDB (opkg install gdb) and attach from the debugger (gdb -p `pidof lvrt`). Without symbols it may not be enlightening, but it's easy and sometimes provides useful information even without symbols.
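As a rough illustration of the attach step, here is a minimal Python sketch that builds a non-interactive gdb invocation. It assumes Python 3.7+ and `pidof` are available on the target, which the thread does not confirm. The helper names `gdb_backtrace_command` and `find_pid` are made up for this example. Note that a batch attach like this only captures a snapshot of the threads at the moment of attaching. Catching the crash itself would mean keeping gdb attached until the signal arrives.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, prints state, and exits.

    Hypothetical helper: only the gdb options themselves (-p, -batch, -ex)
    are real; the function name and structure are illustrative.
    """
    return [
        "gdb", "-p", str(pid),
        "-batch",                      # run the -ex commands, then exit (detaching)
        "-ex", "info sharedlibrary",   # list the libraries loaded in the process
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def find_pid(process_name="lvrt"):
    """Return the first PID reported by pidof, or None if nothing matches."""
    result = subprocess.run(
        ["pidof", process_name], capture_output=True, text=True
    )
    pids = result.stdout.split()
    return int(pids[0]) if pids else None


# Example use on the target (not executed here):
#   pid = find_pid()
#   if pid is not None:
#       out = subprocess.run(gdb_backtrace_command(pid),
#                            capture_output=True, text=True)
#       print(out.stdout)
```

Running the resulting command pauses the process while gdb collects the backtraces, so it may disturb timing on a real-time target.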
Another tool I'd check is strace (opkg install strace). It's not as useful if you can't reliably trigger the crash though. If you need to leave it running for a while, you probably can't have it log to a file, and it'll slow down execution a lot whether you're logging to a file or not.
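One possible way around the log-size problem is to keep only the most recent trace lines in memory and write them out once the traced process dies. The sketch below assumes Python 3.7+ is available on the target and that strace has been installed as described above. The function names and the 5000-line default are illustrative choices, not a tested recipe. It does nothing about the slowdown strace itself causes.

```python
import collections
import subprocess


def keep_last_lines(lines, max_lines):
    """Consume an iterable of lines, keeping only the most recent max_lines."""
    ring = collections.deque(maxlen=max_lines)  # old entries fall off the front
    for line in lines:
        ring.append(line.rstrip("\n"))
    return list(ring)


def trace_until_exit(pid, max_lines=5000):
    """Attach strace to pid and return the tail of the trace once it ends.

    Hypothetical helper. strace writes its trace to stderr by default;
    -f follows threads and child processes, -tt adds timestamps.
    """
    proc = subprocess.Popen(
        ["strace", "-f", "-tt", "-p", str(pid)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    tail = keep_last_lines(proc.stderr, max_lines)  # blocks until strace exits
    proc.wait()
    return tail


# Example use on the target (not executed here):
#   with open("/tmp/lvrt_strace_tail.txt", "w") as f:
#       f.write("\n".join(trace_until_exit(pid)))
```

The ring buffer bounds memory use to roughly `max_lines` lines, at the cost of losing everything earlier in the run.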
The version of top that is included in the base distribution is very stripped-down. You can get a more full-featured one from the procps package (opkg install procps) and then use that to check for memory leaks, among other possible issues.
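For an unattended leak check over several hours, an alternative to watching top is to sample the process's resident memory from `/proc` at a fixed interval. This is a minimal sketch assuming Python 3.6+ on the target. The function names, the CSV layout, and the 60-second interval are assumptions made for illustration. `VmRSS` is the standard resident-set-size field in `/proc/<pid>/status` on Linux.

```python
import time


def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from /proc/<pid>/status text.

    Returns None if the field is not present.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def log_rss(pid, log_path, interval_s=60.0, samples=None):
    """Append 'timestamp,rss_kb' rows to log_path until the process is gone.

    Hypothetical helper: stops early after `samples` rows if that is given.
    """
    count = 0
    while samples is None or count < samples:
        try:
            with open(f"/proc/{pid}/status") as f:
                rss = parse_vmrss_kb(f.read())
        except (FileNotFoundError, ProcessLookupError):
            break  # the process has exited or crashed
        with open(log_path, "a") as log:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp},{rss}\n")
        count += 1
        time.sleep(interval_s)
```

A steadily climbing `rss_kb` column in the hours before a crash would point toward a leak. A flat one would suggest looking elsewhere.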
You mentioned in the original post that the problem only afflicts 2 of your 3 controllers. Still true? Any idea what's different about the 3rd one?
05-18-2016 10:59 AM
ScotSalmon wrote:
Nice. How long does it take to reproduce the issue, can you trigger it pretty reliably?
You can install GDB (opkg install gdb) and attach from the debugger (gdb -p `pidof lvrt`). Without symbols it may not be enlightening, but it's easy and sometimes provides useful information even without symbols.
...
Even without LVRT symbols, if libraries are being used that do have symbols, this can help (even if only to exonerate "suspects"). Even if that's not the case either, the symbol-less backtrace will at least show which libraries/binaries are in play, and something as simple as the type of signal sent can give you clues as to what went wrong.
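To make the point about signal types concrete, here is a small Python lookup from a signal number to its name and a general Unix rule of thumb. The hints are generic guidance, not LabVIEW-specific diagnoses, and the table and function names are made up for this example.

```python
import signal

# General Unix rules of thumb (assumed typical causes, not certainties).
SIGNAL_HINTS = {
    signal.SIGSEGV: "invalid memory access (bad pointer, use after free, stack overflow)",
    signal.SIGABRT: "abort() was called, often after a failed assertion or detected heap corruption",
    signal.SIGBUS: "bad memory mapping or misaligned access",
    signal.SIGFPE: "arithmetic fault such as integer divide by zero",
    signal.SIGILL: "illegal instruction (corrupted code or a jump to a bad address)",
    signal.SIGKILL: "killed from outside the process, for example by the kernel OOM killer",
}


def describe_signal(signum):
    """Return 'NAME: hint' for a signal number or signal.Signals member."""
    sig = signal.Signals(signum)
    hint = SIGNAL_HINTS.get(sig, "no rule of thumb listed")
    return f"{sig.name}: {hint}"
```

For example, `describe_signal(signal.SIGSEGV)` returns a string starting with `SIGSEGV:`, which would shift suspicion toward native code such as the libraries behind CLFNs.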
05-19-2016 12:46 AM
05-19-2016 01:37 AM
ScotSalmon wrote:
You mentioned in the original post that the problem only afflicts 2 of your 3 controllers. Still true? Any idea what's different about the 3rd one?
We've used 3 Linux RT cRIOs in 3 projects. Let's call them Projects X, Y1, and Y2 (in chronological order). None of them call CLFNs directly.
Project Y1 uses a cRIO-9030 + LV 2014 SP1 and doesn't crash.
Project Y2 also uses a cRIO-9030 + LV 2014 SP1, but crashes. Y1 and Y2 have very similar architectures and share a fair bit of code. Y2 has the following additions on top of Y1:
Project X used a cRIO-9067 and crashed, like Y2. Its architecture is completely different from Y2's, though. Project X doesn't use Modbus/OPC UA/Embedded UI. The only similarity I can see is a large number of NSVs (Project X has 400+ variables). Y2 has few libraries, the biggest of which contains 468 NSVs, while X has many libraries that each contain only a few NSVs.
Project X's crashes started this post: https://forums.ni.com/t5/LabVIEW/cRIO-9067-Real-time-unexpected-error-restart-Hex-0x661/td-p/3105450. When we switched from cRIO-9067 to cRIO-9024 without changing our code, the crashing stopped.
05-19-2016 11:38 AM
With it needing several hours to reproduce, strace might not be very helpful, but gdb might still tell us something.
So "more NSVs" is a lead, it sounds like. Is there any practical way to test that lead, by removing some from Y2 or X, or adding some to Y1?
Embedded UI has significant implications internally and would be a good variable to rule out. Can you try either removing it from Y2 or adding it to Y1?