Error 'Hex 0x661' on sbRIO-9607

JoeLesker · ‎08-27-2018

After running stably for several years, I've had 3 instances of error 0x661 on 2 separate systems/configurations. The hardware is sbRIO-9607, built against LabVIEW 2015.

I see one previous thread that discusses this issue (https://forums.ni.com/t5/LabVIEW/cRIO-9067-Real-time-unexpected-error-restart-Hex-0x661/m-p/3105450), but it never really reaches a conclusion or a debugging strategy.

Has anyone else run into this problem, and hopefully had a successful resolution?

Thanks.

-Joe

JoeLesker · ‎08-27-2018

Sorry, forgot to attach error log.

MichaelBalzer · ‎08-27-2018

Welcome to the frustrating world of 0x661. I posted the original thread and am still seeing the issue to this day. With the internal help of NI support and R&D, I'm still no closer to a resolution after a couple of years. There are a few things you can try which may help.

Check your application's memory usage. If it continues to allocate memory or you have a leak, it'll eventually run out and cause the 0x661. See this KB article for the 'correct' way to monitor memory usage under Linux RT. Try logging the memory once every 30 seconds or so and let it run until it crashes, then check the memory usage over time.
If your CPU usage is below 50%, try reducing the app to run on a single core of the CPU. Follow these steps from NI to reduce to single core mode:
1. Check the current state of the cRIO (You should see two CPUs reported):
  cat /proc/cpuinfo
2. Set an environment variable to only use one CPU:
  fw_setenv othbootargs nosmp
  This can be confirmed using the fw_printenv command.
3. Reboot the controller (using the "reboot" command or the reset button)
4. Confirm that only one core of the cRIO is running:
  cat /proc/cpuinfo
5. To remove the single core limitation:
  fw_setenv othbootargs
6. Reboot
If you can handle application reboots, you can try setting the YouOnlyLiveTwice flag in your controller's ni-rt.ini / ni-rt.conf file. This won't stop the crash/reboots, but will stop the system going into safe mode after two unexpected reboots.
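
The memory logging suggested above could be sketched roughly as follows. This is only an illustration, not the method from the KB article: it assumes that reading /proc/meminfo (MemAvailable where the kernel reports it, otherwise MemFree + Buffers + Cached as a rough estimate) is good enough for spotting a trend. The function names, the log path under /home/lvuser, and the use of pidof to find lvrt are all assumptions made for the example.

```shell
# Sketch only: names, paths, and the 30 s interval are illustrative assumptions.

# Print an estimate of available memory (kB) from a meminfo-style file.
# Prefers MemAvailable; falls back to MemFree + Buffers + Cached.
mem_available_kb() {
    awk '
        $1 == "MemAvailable:" { avail = $2; have = 1 }
        $1 == "MemFree:"      { free = $2 }
        $1 == "Buffers:"      { buf = $2 }
        $1 == "Cached:"       { cached = $2 }
        END { if (have) print avail; else print free + buf + cached }
    ' "${1:-/proc/meminfo}"
}

# Print the resident set size (kB) of a PID, or 0 if it cannot be read.
proc_rss_kb() {
    status="/proc/$1/status"
    [ -r "$status" ] || { echo 0; return; }
    awk '$1 == "VmRSS:" { rss = $2 } END { print rss + 0 }' "$status"
}

# Append "timestamp,available_kb,lvrt_rss_kb" once per interval until killed.
log_memory() {
    logfile="${1:-/home/lvuser/memlog.csv}"   # assumed location
    interval="${2:-30}"
    while :; do
        pid=$(pidof lvrt | awk '{ print $1 }')
        echo "$(date '+%Y-%m-%d %H:%M:%S'),$(mem_available_kb),$(proc_rss_kb "${pid:-0}")" >> "$logfile"
        sleep "$interval"
    done
}
```

Started in the background with something like `log_memory /home/lvuser/memlog.csv 30 &`, the file survives the crash, and a steadily falling second column or rising third column before each 0x661 would point toward a leak.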

From experience I've only seen this issue occur on ARM based cRIOs, and not Intel based ones (cRIO-9067 = crash, cRIO-9068 = OK). I know that doesn't help with your sbRIO setup, but if I had to guess, I'd say the root of my particular 0x661 issue is a multi-threading / multi-CPU bug on ARM based controllers which will likely never be resolved.

Is your code very complex? If you monitor the sbRIO's resources, is there anything strange going on (CPU spikes, memory stuff, etc)?

Are you running a built and deployed RT application, or running from source using LabVIEW? I have run into other random 0x661s when running from source, and sometimes when deploying, but those don't seem to affect other applications once properly deployed (only this one app).

I hope your issue has a resolution!

Unless otherwise stated, all code snippets and examples provided
by me are "as is", and are free to use and modify without attribution.

MichaelBalzer · ‎08-27-2018

Also if you poke around in the log files located in /var/local/natinst/log/ (specifically LabVIEW_Failure_Log.lvuser.txt) it might shed some light on the crash reason. In my case the app receives a fatal SIGSEGV.

####
#Date: Mon, Jun 27, 2016 09:35:58 AM
#Desc: LabVIEW caught fatal signal
14.0.1 - Received SIGSEGV
Reason: address not mapped to object
Attempt to reference address: 0x0x4
#RCS: unspecified
#OSName: Linux
#OSVers: 3.2.35-rt52-2.10.0f0
#OSBuild: 197155
#AppName: lvrt
#Version: 14.0.1
#AppKind: AppLib
#AppModDate:
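
If the log has accumulated many entries, a quick summary can make patterns easier to spot. The sketch below assumes every entry follows the layout shown above (a "####" separator, a "#Date:" line, a "Received SIG..." line, and an "Attempt to reference address:" line); the function name summarize_failures is made up for the example.

```shell
# Sketch only: assumes the entry layout shown in the log excerpt above.
# Prints one "date | signal | address" line per logged fatal signal.
summarize_failures() {
    awk '
        function emit() {
            if (sig != "") print date " | " sig " | " addr
            date = ""; sig = ""; addr = ""
        }
        /^####/                           { emit(); next }      # new entry starts
        /^#Date: /                        { date = substr($0, 8) }
        / - Received SIG/                 { sig = $NF }
        /^Attempt to reference address: / { addr = $NF }
        END                               { emit() }            # last entry
    ' "${1:-/var/local/natinst/log/LabVIEW_Failure_Log.lvuser.txt}"
}
```

Running `summarize_failures` with no argument reads the default log path mentioned above; passing a path lets you run it against a copy pulled off the controller.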

JoeLesker · ‎08-29-2018

Thanks for the replies.

I wasn't aware of the memory reporting differences on Linux RT. I'll switch over to the method in the KB.

The CPU usage is regularly above 50%, so running a single core isn't possible.

The code is complex, and very configurable. Depending on the hardware that is being controlled/communicated with, many different parts of the code can be accessed. I will keep a closer track of CPU and memory across different configurations.

It is always run as a compiled and deployed RT application.

Are you still on LV 2014? Is it possible that upgrading to, say, LV 2018 could resolve this?

The LabVIEW_Failure_Log.lvuser.txt file does show a similar SIGSEGV error. What exactly does the reference address refer to? What am I able to glean from this:

####
#Desc: LabVIEW caught fatal signal
#Date: Fri, Aug 24, 2018 02:55:13 PM
15.0 - Received SIGSEGV
Reason: address not mapped to object
Attempt to reference address: 0x0xc
#RCS: unspecified
#OSName: Linux
#OSVers: 3.14.40-rt37-ni-3.0.0f2
#OSBuild: 200232
#AppName: lvrt
#Version: 15.0
#AppKind: AppLib
#AppModDate:

Thanks so much for your time.

-Joe

MichaelBalzer · ‎08-29-2018

Have we been writing the same project?

I've seen the issue on the same project across LabVIEW 2014, 2015 and 2016. I have a project now which is a stripped back version of the original project running in LabVIEW 2017 on a cRIO 9066, and so far I haven't run into the 0x661 issue (save for one during a deployment). It is less complex and has far fewer channels, so it could just be that the 0x661 takes x times as long to rear its head, or it might actually be fixed. I don't have the hardware setup to verify the original project in 2017/2018 so can't confirm.

The address in the crash log doesn't reveal much, but the fact it's a SIGSEGV means Linux RT has told the LabVIEW runtime (lvrt) to abort because it's doing something incorrect with a memory location.
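
One general observation, not specific to LabVIEW: fault addresses only a few bytes above zero (0x4 and 0xc in the two logs here) are what you typically see when code follows a null pointer or handle and reads a field at a small offset. That does not say which component is responsible, but it is at least consistent between the two logs. A small sketch for checking logged addresses, assuming a 4096-byte page size and a made-up function name near_null:

```shell
# Sketch only: the 4096-byte page size and the function name are assumptions.
# Reports whether a logged fault address falls inside the first memory page.
near_null() {
    addr=${1#0x}       # drop a leading 0x
    addr=${addr#0x}    # the log doubles the prefix ("0x0xc"), so drop it again
    value=$(printf '%d' "0x$addr" 2>/dev/null) || { echo "unparseable"; return 1; }
    if [ "$value" -lt 4096 ]; then
        echo "near-null (offset $value)"
    else
        echo "not near-null"
    fi
}
```

For example, `near_null 0x0xc` reports an offset of 12 and `near_null 0x0x4` an offset of 4.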

Is your code calling any 3rd party libraries / .so files? I've used network shared variables, EtherNet/IP, Current Value Table, Scan Engine, all used in regular LabVIEW + LVOOP. No individual component seems to cause the issue, but in combination the crash occurs. You might try stripping code sections out to see if it has any effect.

How frequently does the 0x661 occur? I saw it maybe once every 1-5 days, sometimes within hours.