
RT target crashing randomly - 97% ISRs

I am running an application on a single-core RT target (PXIe-8101) that does data acquisition and instrument control. There are two to three timed loops running, plus several parallel while loops for data processing, and the acquired data is sent to a PC using network streams.

What I observe is that the load without a client connected (but with everything still running) is ~50% (timed loops 35%, ISRs 6%, other 6%). When a client connects, the load increases slightly to ~60% (timed loops 36%, ISRs 8%, other 16%). Everything is fine as long as no client is connected. Once a client connects and starts receiving data, after ~5 min the RT target "crashes": the ISR load jumps to 97%, the total load goes to 100%, and the connection to the target is lost from the PC. Only restarting the target helps.

Of the timed loops, one runs constantly at 1 kHz with priority 50000, one is triggered by a sample clock at 1 kHz, and two run only occasionally to step an instrument in given increments at 200 Hz with priority 60000. In my tests, I never ran the latter two (since they require user interaction).

The RT error log doesn't show anything under "RTLog" or "Exception Log".

So my question is: What could cause the RT target to "crash" in the described way? What would be good approaches to debug this? Could this have something to do with the network streams? Might connecting the RT target directly to the PC via Ethernet, instead of going through the building network, help?
Any advice is appreciated.

Message 1 of 6

Ouch! Having a two-core controller and dedicating one core to the Timed Loop really improved the performance of my LabVIEW RT remote code. We do connect the RT target directly (via a $30 switch) to a second NIC on the host (we are running four Network Streams: two for bidirectional messages, two for streaming data), but I don't think this is the cause of your problem -- I think your one core is being "worn out".

 

Bob Schor

Message 2 of 6
I tried connecting the RT target directly to the host PC, but that didn't change anything. I tend to agree that it is a problem of too many high-priority loops on a single core. It is just hard to be sure that a multi-core RT target would actually solve the problem before making the investment. But thanks for your thoughts!
Message 3 of 6

@ehrlich wrote:

So my question is: What could cause the RT target to "crash" in the described way?


Memory leaks, resource leaks, or doing things with the hardware or OS that should not be done...
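Not LabVIEW, but as a rough sketch of what a typical resource leak looks like (hypothetical names, Python only to show the pattern): a reference opened inside the loop and never closed eventually exhausts handles or memory, even though each single iteration looks harmless.

import socket

def leaky_loop(host, port, blocks):
    """Anti-pattern: a new connection is opened every iteration and never closed."""
    for block in blocks:
        conn = socket.create_connection((host, port))  # new reference each pass
        conn.sendall(block)
        # missing conn.close() -> handles pile up until the target falls over

def clean_loop(host, port, blocks):
    """Open the reference once, reuse it, close it exactly once."""
    conn = socket.create_connection((host, port))
    try:
        for block in blocks:
            conn.sendall(block)
    finally:
        conn.close()

The LabVIEW equivalent would be opening a reference (queue, VISA session, stream endpoint) inside a loop instead of before it, without ever closing it.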

 


@ehrlich wrote:

What would be good approaches to debug this? 


Elimination. Try disabling half the code and see if it still happens, then swap which half is enabled. Obviously you want the enabled parts to function as normal, so you might need simulation code.

 

You could also try increasing or reducing the loop speeds to see whether the crashes happen more or less often.

 


@ehrlich wrote:

 Could this have something to do with the network streams?


Sure. At least make sure all references are closed properly.

 


@ehrlich wrote:

Might connecting the RT target directly to the PC via Ethernet instead of going through the building network help?


Only one way to find out... Network Streams should work without crashing, but you might be using them wrong (hard to tell without seeing code). Replacing them will either rule them out or identify them as the problem. You could even replace them with a dummy, so all the code executes as if they were still there. My guess would be that it will still crash... It's all about elimination and deduction.
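To sketch the "dummy" idea (not LabVIEW, and the names are made up, but the shape is the same): give the rest of the application a writer with the identical interface that simply discards the data, so every other code path runs exactly as before.

class StreamWriter:
    """Stand-in for the real network-stream writer endpoint."""
    def __init__(self, endpoint_url):
        self.endpoint_url = endpoint_url

    def write(self, block):
        # the real network-stream write would happen here
        ...

class NullStreamWriter:
    """Dummy with the same interface; swallows everything it is given."""
    def write(self, block):
        pass  # discard the data, keeping the producer side identical

def acquisition_loop(writer, blocks):
    # The acquisition/processing code neither knows nor cares which writer it got.
    for block in blocks:
        writer.write(block)

If the target still locks up with the dummy swapped in, the streams are off the hook; if it runs clean, you have your suspect.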

Message 4 of 6

Could you have some resource contention issues? Deadlock perhaps?

 

DMA FIFOs with a read timeout of -1 (if the FPGA stops sending data), or RT FIFOs set to polling mode, can lead to this.
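Not LabVIEW, but the CPU-side difference between a polled and a blocking read looks roughly like this (made-up queue and function names, just to illustrate why polling plus a stalled producer can peg a single core):

import queue

data_q = queue.Queue()

def process(block):
    pass  # placeholder for the real processing

def polled_reader():
    """Spins on the queue; if the producer stalls, this burns 100% of a core."""
    while True:
        if not data_q.empty():
            process(data_q.get_nowait())
        # no wait/yield here, so an empty queue means a busy-wait loop

def blocking_reader():
    """Sleeps inside get() until data arrives or the timeout expires;
    a stalled producer costs essentially no CPU."""
    while True:
        try:
            process(data_q.get(timeout=1.0))
        except queue.Empty:
            break  # producer stopped; bail out instead of spinning forever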

Message 5 of 6

For the RT FIFOs, I set them all to blocking for write access and polling for read access, which has worked well so far. I also tested setting them to blocking for both, but that didn't change anything.

 

Concerning memory leaks etc.: memory consumption is stable throughout the run time, and all network streams are properly closed when the client disconnects.

Reducing the loop speeds helps a bit, but the problem still occurs; it just takes longer to happen.

If I disable one of the timed loops, things seem to be fine, which is why I started thinking it could be a scheduling problem (or too much load in the timed loops).
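To put a rough number on the "too much load" idea: at 1 kHz each timed loop has a 1 ms budget, and everything the high-priority loops execute per millisecond has to fit inside that before the lower-priority threads (network streams, TCP, the OS) get any CPU. A back-of-the-envelope check, with assumed per-iteration execution times (~180 µs each, which would roughly match the ~36% timed-loop load I'm seeing), could look like this:

# Assumed per-iteration execution times; the real ones would have to be
# measured on the target.
loops = [
    {"name": "control loop @ 1 kHz",      "rate_hz": 1000, "exec_s": 180e-6},
    {"name": "sample-clock loop @ 1 kHz", "rate_hz": 1000, "exec_s": 180e-6},
]

# Utilization of the single core by the high-priority timed loops alone.
utilization = sum(l["rate_hz"] * l["exec_s"] for l in loops)
headroom = 1.0 - utilization

print(f"Timed-loop utilization: {utilization:.0%}")     # -> 36%
print(f"Headroom for everything else: {headroom:.0%}")  # -> 64%

If whatever starts running when a client connects (network streams, ISRs, TCP) needs more than that headroom, the lowest-priority threads get starved first, which would fit the behavior I'm seeing.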

The one test I haven't done yet is to see what happens when no network streams are involved at all (no client connected) but the load is otherwise unchanged. I'll probably try that overnight.

Message 6 of 6