NI Linux Real-Time Discussions

What are the possible causes of error 0x661?

Of the 3 projects my company has done that use Linux RT cRIOs, 2 periodically crash with error 0x661 at runtime (while running a deployed application, as opposed to running from the IDE):

[error] LabVIEW:  (Hex 0x661) The LabVIEW Real-Time process encountered an unexpected error and restarted automatically. LabVIEW Real-Time process restarted

It seems that, if error 0x661 occurs twice without power cycling, the cRIO goes into some kind of safe mode, ceasing all functionality.

There have been a few reported cases over the past year or so:

The links above mainly talk about error 0x661 during deployment, but again, we're seeing it happen during runtime.

There's an open support ticket with NI. The AE looked through our project code but couldn't identify anything obviously wrong, so I'm posting here in hopes that a Linux RT guru has more insight.

Questions:

  1. What are some possible causes of error 0x661?
  2. I read that CAR 523471 was created to investigate this error. Has NI made any findings?

Thanks for your time.

Certified LabVIEW Developer
0 Kudos
Message 1 of 12
(5,513 Views)

You noted that the system goes into a safe mode when this happens twice without power cycling. You can override that behavior if you'd like, see http://digital.ni.com/public.nsf/allkb/41E8E448E2547CEF86256CFD00678340

Of course, that doesn't fix the actual crash, just makes it so your application will keep running more than twice if it happens.

As for your specific questions:

  1. It just means LabVIEW crashed, and doesn't by itself tell us anything about the cause. If you're using a Call Library Function node to call external code like a C library, the crash could have the same causes as a crash of any C code. Some common examples would be running out of memory, stack overflow, uncaught exception, divide by zero, etc. Normally LabVIEW will catch some of those at the CLFN and turn them into a runtime error instead of crashing, but that's not always possible, and sometimes it's appropriate to disable that feature of the CLFN for performance reasons. If you're not using a CLFN, it's still possible that the LabVIEW VI has a memory leak, and of course a bug in LabVIEW itself cannot be ruled out -- you should continue to work with your AE on that. There are logs on the target that could help narrow down the issue if that's what is happening, and the AE should be able to help you retrieve and review them.
  2. That specific CAR was for a specific customer's application. That customer apparently worked around or otherwise resolved their crash, so the CAR has been closed.

It's worth noting that in the links you cited reporting the error "during deployment", the actual problem still occurred at runtime. The error is reported during deployment because that's the first opportunity after a crash for the customer to see the error on the host (Windows) UI, since you have to "deploy" to reconnect to the target and have the IDE learn of the error. However, again, the fact that the error code is the same and it happens at runtime doesn't mean that the cause is the same as yours, because that error is used for any crash of the LabVIEW application.

Message 2 of 12

Thanks Scot, you've provided a ton of useful info. So basically, 0x661 is a generic crash message, similar to "<Application> has encountered a problem and needs to close" on Windows.

Disabling boot-into-safe-mode and allowing LabVIEW to recover is a slight improvement, but we'd need to look for a way to prevent those crashes in the first place. From what you've said, logging memory usage would be a good place to start.

I'll work with my AE to get those logs. In the event that the logs aren't enough, do you have any recommendations on how else we can monitor/probe our LabVIEW program to capture events leading up to a crash?

We're not calling any CLFNs ourselves. However, our latest project that suffers from 0x661 does use the following APIs:

The OPC UA VIs contain CLFNs. I'm not sure about the Modbus and email VIs, as they are password-protected.

Certified LabVIEW Developer
Message 3 of 12

There are a variety of low-level tools I could recommend that I would use myself, especially if you are comfortable with Linux, but the AEs will be better for advice at the LV level. Either way, if you could use diagram disable or other refactoring tools to narrow down which of those VIs seem connected to the crash, that would be super helpful.

Message 4 of 12

Yes, I'm comfortable with Linux and have used GDB to capture stack traces of crashing programs before (these were C++ code compiled with debugging symbols enabled though, not LabVIEW code). I'm gathering all possible leads at this point, so I'm keen to hear what low-level tools you find useful.

Certified LabVIEW Developer
Message 5 of 12

Nice. How long does it take to reproduce the issue, can you trigger it pretty reliably?

You can install GDB (opkg install gdb) and attach from the debugger (gdb -p `pidof lvrt`). Without symbols it may not be enlightening, but it's easy and sometimes provides useful information even without symbols.
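As a rough sketch of the attach step for a crash that takes hours to appear: gdb can be run non-interactively so that it attaches, lets the process continue, and dumps backtraces for all threads once the process stops on a signal. The helper below only builds that command line. The function name is hypothetical, Python being available on the target is an assumption, and so is the idea that lvrt stops only on the fatal signal (it may use other signals internally, in which case extra `handle` commands would be needed).

```python
import shlex


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, waits for the next
    stop (e.g. a fatal signal), then prints a backtrace of every thread.

    Hypothetical helper; only standard gdb options are used:
    -p (attach to PID), -batch (exit after the commands), -ex (run a command).
    """
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "set pagination off",    # never pause output waiting for a key
        "-ex", "continue",              # resume; returns when the process stops
        "-ex", "thread apply all bt",   # backtrace of all threads at the stop
    ]


def as_shell(cmd):
    """Render the argument list as a single copy-pasteable shell line."""
    return shlex.join(cmd)
```

For example, `print(as_shell(gdb_backtrace_command(1234)))` prints a line that can be run on the target with the real PID from `pidof lvrt`, with output redirected to a file so the trace survives the restart.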

Another tool I'd check is strace (opkg install strace). It's not as useful if you can't reliably trigger the crash though. If you need to leave it running for a while, you probably can't have it log to a file, and it'll slow down execution a lot whether you're logging to a file or not.

The version of top that is included in the base distribution is very stripped-down. You can get a more full featured one from the procps package (opkg install procps) and then use that to check for, among other possible issues, memory leaks.
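If watching top by hand for hours isn't practical, the same memory check can be scripted. This is a minimal sketch, assuming a Python 3 interpreter is available on the target (the thread doesn't establish that). It reads the `VmRSS` field from `/proc/<pid>/status`, which Linux reports in kB, and estimates a growth rate from timestamped samples. All function names are made up for illustration.

```python
def parse_vmrss_kib(status_text):
    """Extract the resident set size (kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads and exited processes have no VmRSS line


def read_rss_kib(pid):
    """Read the current resident set size of a running process."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kib(f.read())


def growth_kib_per_hour(samples):
    """Least-squares slope of (seconds, kB) samples, scaled to kB/hour.

    A steadily positive value over several hours is consistent with a
    leak; it is not proof of one.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return 3600.0 * num / den if den else 0.0
```

A logging loop would call `read_rss_kib` with the lvrt PID every minute or so, append `(time.time(), rss)` to a list or file, and look at `growth_kib_per_hour` after the next crash.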

You mentioned in the original post that the problem only afflicts 2 of your 3 controllers. Still true? Any idea what's different about the 3rd one?

Message 6 of 12

ScotSalmon wrote:

Nice. How long does it take to reproduce the issue, can you trigger it pretty reliably?

You can install GDB (opkg install gdb) and attach from the debugger (gdb -p `pidof lvrt`). Without symbols it may not be enlightening, but it's easy and sometimes provides useful information even without symbols.

...

Even without LVRT symbols, if any of the libraries in use do have symbols, this can help (even if only to exonerate "suspects"). Even if that's not the case either, the symbol-less backtrace will at least show which libraries/binaries are at play, and something as simple as the type of signal sent can give you clues as to what went wrong.
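To illustrate the point about signal types, here is a small lookup that turns a signal number into its name plus the usual generic meaning. The hint text is general POSIX/Linux knowledge, not anything specific to lvrt, and the table and function name are hypothetical.

```python
import signal

# Generic meanings of common fatal signals (not LabVIEW-specific).
HINTS = {
    signal.SIGSEGV: "invalid memory access (bad pointer, stack overflow)",
    signal.SIGABRT: "abort() called, often a failed assertion or uncaught C++ exception",
    signal.SIGFPE: "arithmetic fault such as integer divide by zero",
    signal.SIGBUS: "misaligned or otherwise invalid memory access",
    signal.SIGKILL: "killed from outside, possibly by the kernel out-of-memory killer",
}


def describe_signal(signum):
    """Return 'NAME: hint' for a signal number, or a fallback if unknown."""
    try:
        sig = signal.Signals(signum)
    except ValueError:
        return f"signal {signum}: unknown"
    return f"{sig.name}: {HINTS.get(sig, 'no hint in this table')}"
```

Calling `describe_signal` with the number gdb reports when the process stops gives a first guess at which of the causes mentioned earlier in the thread (out of memory, stack overflow, uncaught exception, divide by zero) is worth chasing.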

Message 7 of 12

Thanks again, Scot and Brad. I won't have access to the cRIO for several days at least, but once I do I'll investigate the system with high- and low-level tools and report back. Unfortunately, we can't trigger the crash on demand -- the cRIO needs to run for several hours at least.

Certified LabVIEW Developer
Message 8 of 12

ScotSalmon wrote:

You mentioned in the original post that the problem only afflicts 2 of your 3 controllers. Still true? Any idea what's different about the 3rd one?

We've used 3 Linux RT cRIOs in 3 projects. Let's call them Projects X, Y1, and Y2 (in chronological order). None of them call CLFNs directly.

Project Y1 uses a cRIO-9030 + LV 2014 SP1 and doesn't crash.

Project Y2 also uses a cRIO-9030 + LV 2014 SP1, but crashes. Y1 and Y2 have very similar architectures and share a fair bit of code. Y2 has the following additions on top of Y1:

  • Y2 has more Network Shared Variables than Y1 (Y2 has ~700, Y1 has ~100)
  • Y2 has more Modbus TCP masters in parallel loops than Y1 (Y2 has 6, Y1 has 3)
  • Y2 has a Modbus Serial master, Y1 doesn't
  • Y2 runs an OPC UA server, Y1 doesn't
  • Y2 runs an Embedded UI, Y1 doesn't

Project X used a cRIO-9067 and crashed, like Y2. Its architecture is completely different from Y2 though. Project X doesn't use Modbus/OPC UA/Embedded UI. The only similarity I can see is a large number of NSVs (Project X has 400+ variables). Y2 has few libraries where the biggest one contains 468 NSVs, while X has many libraries where each only contains a few NSVs.

Project X's crashes started this post: https://forums.ni.com/t5/LabVIEW/cRIO-9067-Real-time-unexpected-error-restart-Hex-0x661/td-p/3105450. When we switched from cRIO-9067 to cRIO-9024 without changing our code, the crashing stopped.

Certified LabVIEW Developer
Message 9 of 12

With it needing several hours to reproduce, strace might not be very helpful, but gdb might still tell us something.

So "more NSVs" is a lead, it sounds like. Is there any practical way to test that lead, by removing some from Y2 or X, or adding some to Y1?

Embedded UI has significant implications internally and would be a good variable to rule out. Can you try either removing it from Y2 or adding it to Y1?

Message 10 of 12