Real-Time Measurement and Control


Debug RT Application Crashing

I have a cDAQ 9132 running RT Linux. I wrote an application in LabVIEW 2017 32-bit on a Windows 7 x64 PC. I've been making incremental updates to the software for several weeks now, and I started to notice that once in a while during development I would get a message stating that LabVIEW lost its connection to the controller. I didn't think too much of this because I thought it was just a crash of the IDE, since I was running from source. Once I moved to building the RT application the issue seemed to go away. Starting last week, however, the built application, which runs on startup, is exhibiting some of the same issues it had when running from source.

 

After I restart my controller my application runs, and I am able to run a couple of tests by commanding it to start over Network Streams from my host. After about 5 minutes of testing, my host application no longer talks to the remote application over Network Streams, and my sequence running on the RT stops. All I/O stays at its last commanded value. Other functionality of the controller seems to still work; I can FTP into it, for example. So it really does seem as if someone pressed the stop button on my application, or killed its task, while other parts of the OS keep going.

 

The application is pretty basic, and doesn't have many timed loops. It talks to 3 devices over TCP, and it has two high-speed CAN XNet ports, one XNet LIN port, one DIO card, and one AI card. It uses Network Streams to communicate with the host application. I rolled back the software using SCC and it still seems to be happening on a version of the code I was sure was stable. I've formatted the controller and started again. I've logged CPU and memory usage to a text file periodically and it seems stable.
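For readers who want a comparable periodic CPU and memory log outside of LabVIEW, here is a minimal Python sketch. It assumes a Linux target that exposes the standard `/proc/stat` and `/proc/meminfo` files and has a Python interpreter available; the function names, log format, and sampling interval are illustrative assumptions, not what the original application used.

```python
import time


def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields (values in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in stat_text.splitlines()[0].split()[1:]]
    fields = fields[:8]  # user..steal; guest times are already counted in user
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    total = sum(fields)
    return total - idle, total


def cpu_percent(prev, curr):
    """CPU load in percent between two (busy, total) samples."""
    busy = curr[0] - prev[0]
    total = curr[1] - prev[1]
    return 100.0 * busy / total if total > 0 else 0.0


def log_usage(path, interval_s=5.0, samples=3):
    """Append timestamp, CPU %, and MemFree (kB) to a tab-separated text file."""
    with open("/proc/stat") as f:
        prev = cpu_times(f.read())
    with open(path, "a") as log:
        for _ in range(samples):
            time.sleep(interval_s)
            with open("/proc/stat") as f:
                curr = cpu_times(f.read())
            with open("/proc/meminfo") as f:
                mem = parse_meminfo(f.read())
            log.write(f"{time.time():.0f}\t{cpu_percent(prev, curr):.1f}\t{mem.get('MemFree', -1)}\n")
            log.flush()  # keep the last sample on disk if the process dies
            prev = curr
```

Flushing after every sample matters here: if the process stops unexpectedly, the last line in the file still shows the resource state shortly before the stop.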

 

So what suggestions are there for debugging a crashing application on RT?  This is easy to reproduce, but only in a full application.  I doubt running just one of the code modules on it will cause the issue.  Any suggestions?  Thanks.

Message 1 of 19
(5,618 Views)

Sounds like a good time to try the RT trace toolkit. Wild guess here, but I have noticed that the various ways of doing instrument communication on RT don't always mesh well together, so maybe that's the thing to look at. For example, in LVRT 2014 I have noticed weirdness when using multiple VISA sessions where one of the sessions was over Ethernet and the others were over serial devices. In fact, the VISA session over Ethernet would block the serial session! If the VISA Ethernet session was waiting on a timeout to open a connection, the loop polling an instrument over a serial connection would hang on its VISA read and show highly irregular timing when it shouldn't.

 

 

Message 2 of 19
(5,569 Views)

Well that didn't help.  I started a trace at the start of my test sequence, and then logged a user event at the start of every subsequence, then hit the stop and save (to the RT) when the sequence was done which was about 3 minutes.  I then copied the log over to the host and opened it where I saw time go from -1us to 0us.  No useful data was recorded.

 

Is the RT Trace toolkit one of those debugging tools that only works when you have an extremely simple VI? This is a multi-thousand-VI project with about 15 parallel asynchronous modules running, and I doubt anything meaningful can come out of the RT trace log.

 

I've tried rolling the software back and reformatting the controller. I know I improved performance a lot in later versions of my code, and going back is painful, especially when it still seems to lock up randomly.

 

I had the Distributed System Manager open a couple of times when it stopped running, and one of two things happens. In both cases the LabVIEW IDE reports that it can't communicate with the target anymore and the VI appears to just stop. But sometimes the Distributed System Manager will hold at the last read value (relatively high because code was running) and then never drop, making me think the controller is still reporting the same value. Attached is what I mean.

 

Loss Of Connection Error.png

 

Then there are times when the software stops running but the Distributed System Manager reports the freed-up resources and CPU usage drops to about 1%. The system really just seems unstable no matter what version of my code I put on it. It often disconnects, claiming an unexpected restart. Attached are some of the logs from the RT. Anyone from NI, I'd be glad to throw all of my reuse and project at you so you can run it and tell me if I'm crazy. Most of the hardware can be simulated, but the CAN traffic is relatively high, and at the moment I'm simulating that with a Vector CANCase replaying a file I recorded.

Message 3 of 19
(5,539 Views)

So I may have figured it out.  I believe this sudden crash is caused by using the new Read-Only DVR.  It could be possible this feature is causing the crash, or it could be that as a result of using the Read-Only feature, more copies of the memory need to be made, which is causing the problem.

 

In either case, turning off the Read-Only feature in the one place it was used in my project (a reentrant VI called in a few places) has brought my system back to being stable. It hasn't died in the several dozen tests I've run, where before it would crash after just a couple of runs.

Message 4 of 19
(5,530 Views)

Okay NI the ball is officially in your court.  I have a very reproducible crash on RT in 2015 and 2017.  I know the fix, but the fix shouldn't be needed.  The system appears to crash (lost connection with the RT and must disconnect) if I attempt to dequeue from an invalid queue reference.  Attached is the code needed to reproduce this crash.  I tried minimizing it more but the issue would sometimes go away.  Open the project and run the Main RT Code.vi.  For me this is being deployed to an NI cDAQ-9132.  Also attached is the MAX report generated for the RT target.  The software that needs to be installed is the LabVIEW run-time 2017, and NI-XNet 17.0.  No XNet hardware is actually needed.  All other dependencies are included so other people can run the source and see the RT application crash too.

 

The init function will create a bunch of DVRs and queues, then spawn off an asynchronous VI that uses these references. In the "\Code\SVN\LabVIEW_Reuse\Package Source\Automotive\Package Source\Drivers\Generic CAN Drivers\CAN Class\Helper\Read Signals Parallel Loop.vi" VI there is a state that the state machine goes to first, which attempts to dequeue from a queue reference that we neglected to obtain earlier. If the queue is obtained properly, or if we choose not to attempt the dequeue from this reference, the crash goes away. What should happen is that the dequeue generates an error but execution continues. Instead the application just stops running on the RT. If you attempt to reconnect you get the following error.

 

Spoiler
Errors were detected in the target log when connecting to the target:

LabVIEW: (Hex 0x661) The LabVIEW Real-Time process encountered an unexpected error and restarted automatically.


Select "Apply" to ignore these errors and continue with deployment.

 

I believe my original problem is somewhat related to this issue. I was constantly reading from Read-Only DVRs, and I believe that one of these DVRs somehow became invalid as I was reading it, and it crashed. What should happen is that the IPE returns an error while execution keeps going.
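The behavior expected from the dequeue (an error comes out, but the loop keeps executing) can be sketched in Python as a rough analogy. This is not LabVIEW's implementation; `try_dequeue` and `run_loop` are hypothetical names, and Python's `queue.Queue` merely stands in for a LabVIEW queue reference.

```python
import queue


def try_dequeue(ref):
    """Return (element, error). An invalid reference yields an error value
    instead of taking down the caller."""
    if not isinstance(ref, queue.Queue):
        return None, "invalid queue reference"
    try:
        return ref.get_nowait(), None
    except queue.Empty:
        return None, "queue empty"


def run_loop(command_queue, iterations=3):
    """Poll for commands; record errors and keep iterating rather than stopping."""
    events = []
    for i in range(iterations):
        command, error = try_dequeue(command_queue)
        if error is not None:
            events.append(f"iteration {i}: {error}")
        else:
            events.append(f"iteration {i}: got {command}")
    return events
```

Calling `run_loop(None)` models the never-obtained queue: every iteration reports "invalid queue reference", yet the loop still runs to completion, which is the outcome the post argues the RT target should produce.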

 

When someone at NI confirms this crash, I would like them to reply with the CAR. I have a working system, but this seems like something that might happen to me again.

 

The host is LabVIEW 2017 32-bit on Windows 7 x64.

Message 5 of 19
(5,497 Views)

Hi,

Thanks for reporting this behavior. I will try reproducing this on my end and will get back to you if I have any questions.

Message 6 of 19
(5,488 Views)

Got it reproducing on my end on an NI Linux RT cRIO-9066 for version 17. 

 

Just so that I have all the context I need for escalating and filing a CAR, the dequeue you're referring to is the one in the "Check Queue Command" case in that state machine for the Read Signals Parallel Loop.vi, correct? 

 

If so, what modification did you make to not attempt to dequeue so that I can verify the crash behavior disappears? Did you just diagram disable the dequeue, remove that case, or actually remove the problematic DVR?

Message 7 of 19
(5,481 Views)

@vipillai wrote:

 

Just so that I have all the context I need for escalating and filing a CAR, the dequeue you're referring to is the one in the "Check Queue Command" case in that state machine for the Read Signals Parallel Loop.vi, correct? 


Yes that is the one.

 


 If so, what modification did you make to not attempt to dequeue so that I can verify the crash behavior disappears? Did you just diagram disable the dequeue, remove that case, or actually remove the problematic DVR?


If I put a diagram disable around the dequeue it wouldn't crash, and if I put a diagram disable around the whole Read Signals Parallel Loop.vi, found in Main CAN Class Helper.vi, it wouldn't crash either. Also, if I properly created the queue in Open Device Protected.vi and bundled it into the Read Signals Loop Request in the CAN Class, the crash also went away.

 

I'm hoping this bug is related to a similar crash when accessing a DVR with an IPE which also crashed in a similar but non-reproducible way when using the Read-Only feature in 2017.

Message 8 of 19
(5,476 Views)

I can confirm the crash goes away when disabling the dequeue.

Could you attach your revised Open Device Protected.vi?

 

Message 9 of 19
(5,445 Views)

Attached is the updated Open Device Protected which obtains a queue, and then bundles it into the Read Signals Loop Request inside the class data type.  Using this the RT application doesn't crash.
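As a rough textual analogy for that fix, the Python sketch below obtains the queue during initialization and stores it in the request record before any loop uses it. The class and function names only mirror the VI names mentioned in this thread; the field name `command_queue` and the whole structure are assumptions for illustration, not the contents of the attached VI.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadSignalsLoopRequest:
    # Hypothetical field: None models a queue reference that was never obtained.
    command_queue: Optional[queue.Queue] = None


def open_device_protected(request):
    """Obtain the queue if it is missing and bundle it into the request,
    so the parallel loop never sees an invalid reference."""
    if request.command_queue is None:
        request.command_queue = queue.Queue()
    return request
```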

Message 10 of 19
(5,441 Views)