I'm working on narrowing down the previously mentioned 0x661 crashing issue, and have noticed some odd behaviour and error messages on a cRIO-9067.
The latest error/warning message is from the Linux kernel, which reports NOHZ: local_softirq_pending 08. Is this something to be concerned about?
The real-time application appears to continue running after these events occur without any obvious issue. Poking around some Linux message boards seems to indicate this message is closely associated with networking calls. Coincidentally, one cause of the 0x661 crash seems to be network related, so I'm beginning to wonder whether the two are related. For what it's worth, I've never seen this message and 0x661 appear in the same log.
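As a rough way to check the networking association, the number in the NOHZ message can be read as a bitmask of pending softirqs. The sketch below assumes the value is printed in hexadecimal and that the bit order matches the softirq enum in the mainline 3.x kernel headers (include/linux/interrupt.h); the NI real-time kernel may differ, and the function name is only illustrative.

```python
# Softirq bit positions, assuming the mainline 3.x ordering from
# include/linux/interrupt.h (an assumption for the NI RT kernel).
SOFTIRQ_NAMES = [
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
    "BLOCK_IOPOLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
]


def decode_softirq_pending(mask_hex):
    """Return the names of the softirqs whose bits are set in the hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(SOFTIRQ_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    # Value taken from "NOHZ: local_softirq_pending 08"
    print(decode_softirq_pending("08"))
```

Under those assumptions, 08 has only bit 3 set, which corresponds to the network receive softirq. That would be consistent with the message boards linking this warning to networking activity, though it does not by itself say anything about the 0x661 crash.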
The cRIO is running Real-Time 14.5, with firmware version 3.5.0f0. See the attached log for more info.
Thank you for linking this update to the previous thread. From the messages you are getting, it sounds like it may be a networking-related issue. In the previous thread it seemed like you were not able to reliably get the error. Are we able to crash/error consistently now? Can you elaborate a bit more on your work narrowing down the issue? Do we have a smaller project or code to reproduce this issue?
Thank you and looking forward to your update.
Hi Clemens, thanks for responding.
I can't reproduce the 0x661 error on command. It's more a case of deploying the RT application and letting it run for several days. The error usually presents itself within 24h, but may take longer. The quickest I've seen is within about 15 minutes from a cold boot, and twice within the hour. Other times the same RT app will run for 2-3 days before crashing.
I've been narrowing down the cause, or rather eliminating code which isn't the cause, using the diagram disable structure and trial and error. Removing certain components seems to stop the error (or at least uptime exceeded 3-4 days), but testing those same components in isolation also failed to cause a crash. So at this stage it's looking like no single piece of the code is at fault. Rather, the error only seems to occur when multiple code modules are running together.
I'm running a few more isolation tests over the coming days. If they don't prove useful, I'll try to pare the code down to something minimal which can still cause the crash and go from there.
If it's any help, some typical error messages logged to /var/local/natinst/log/LabVIEW_Failure_Log.lvuser.txt are below. As you can see, each crash is a result of a SIGSEGV signal.
####
#Date: Mon, Jun 27, 2016 09:35:58 AM
#Desc: LabVIEW caught fatal signal 14.0.1 - Received SIGSEGV
Reason: address not mapped to object
Attempt to reference address: 0x0x4
#RCS: unspecified
#OSName: Linux
#OSVers: 3.2.35-rt52-2.10.0f0
#OSBuild: 197155
#AppName: lvrt
#Version: 14.0.1
#AppKind: AppLib
#AppModDate:

####
#Date: Mon, Jun 27, 2016 09:48:38 AM
#Desc: LabVIEW caught fatal signal 14.0.1 - Received SIGSEGV
Reason: address not mapped to object
Attempt to reference address: 0x0xc
#RCS: unspecified
#OSName: Linux
#OSVers: 3.2.35-rt52-2.10.0f0
#OSBuild: 197155
#AppName: lvrt
#Version: 14.0.1
#AppKind: AppLib
#AppModDate:
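Since the crashes are spread over days, it may help to pull the timestamp, signal and faulting address out of each log entry and compare them across runs. The following is a minimal sketch that assumes every entry follows the layout of the two entries above; the function name and regular expression are illustrative, not part of any NI tooling. Date parsing also assumes an English locale.

```python
import re
from datetime import datetime

# Matches one entry: the #Date field, the signal name and the faulting address.
# The address is logged with a doubled prefix ("0x0x4"), so allow repeated "0x".
ENTRY_RE = re.compile(
    r"#Date:\s*(?P<date>\w{3}, \w{3} \d{1,2}, \d{4} \d{2}:\d{2}:\d{2} [AP]M)"
    r".*?Received (?P<signal>SIG\w+)"
    r".*?Attempt to reference address:\s*(?P<addr>(?:0x)+[0-9a-fA-F]+)",
    re.DOTALL,
)


def parse_failure_log(text):
    """Return a list of (timestamp, signal, address) tuples, one per entry."""
    crashes = []
    for match in ENTRY_RE.finditer(text):
        when = datetime.strptime(match.group("date"), "%a, %b %d, %Y %I:%M:%S %p")
        # Strip every "0x" prefix before converting the hex digits.
        addr = int(match.group("addr").replace("0x", ""), 16)
        crashes.append((when, match.group("signal"), addr))
    return crashes


# The two entries quoted in this thread, trimmed to the relevant fields.
SAMPLE_LOG = """\
####
#Date: Mon, Jun 27, 2016 09:35:58 AM
#Desc: LabVIEW caught fatal signal 14.0.1 - Received SIGSEGV
Reason: address not mapped to object
Attempt to reference address: 0x0x4
####
#Date: Mon, Jun 27, 2016 09:48:38 AM
#Desc: LabVIEW caught fatal signal 14.0.1 - Received SIGSEGV
Reason: address not mapped to object
Attempt to reference address: 0x0xc
"""

if __name__ == "__main__":
    crashes = parse_failure_log(SAMPLE_LOG)
    for when, signal, addr in crashes:
        print(when.isoformat(), signal, hex(addr))
    gap = crashes[1][0] - crashes[0][0]
    print("seconds between crashes:", gap.total_seconds())
```

To run it against the real file, read /var/local/natinst/log/LabVIEW_Failure_Log.lvuser.txt and pass its contents to parse_failure_log. One possible reading of the two entries, offered only as a guess: both addresses (0x4 and 0xc) are very close to zero, which is often what a null pointer dereferenced at a small field offset looks like, rather than a wild pointer.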
Thank you for the update and plan of action. If we are getting SIGSEGV, it sounds like something in the code is trying to access an invalid memory address. Do we have a sense of where we may be in the code when we are crashing? Is there a specific call/function or series of functions/calls that may be causing this behavior?