NI sbrio 9632 restart unpredictable - how to debug?

technixtp · 04-24-2016

Hello,

We have used the sbRIO 9632 in various test systems for several years and never had a problem like this.

In our newest systems, however, we have this issue: everything works, and then after somewhere between 2 and 28 hours the restart occurs.

The application runs as an rtexe, and we use LabVIEW 2015.

What we have done so far:

- Reinstalled the complete sbRIO (software and application)

- Logged CPU, free RAM and disk usage (CPU is between 50% and 95%; RAM and disk usage stay stable over time and do not increase, apart from the growing log file, and there is still over 100 MB of disk space left)

- Found that running the application in idle mode does not lead to the restart, but running it in operational mode does (command sent from host to sbRIO via a string shared variable every 200 ms)

The problem now is that I have no idea how to debug this further, as there is nothing conspicuous in the logs, such as rising memory consumption.

Is there a way to turn on logging mechanisms in VxWorks to find out what triggered the restart?

regards,

Tobias

Stephan_D · 04-25-2016

Hey Tobias,

If the restart is due to a problem with the RTOS, one option may be to enable the console output. If a software package crashes on the system, it will often write a stack trace to the serial output. This may help to nail down the error.

You could also take a look at rtlog.txt, which is located at /ni-rt/system/.

Hope this helps,

Stephan

technixtp · 04-26-2016

hi,

thanks for the advice.

Now I have a trace of the console output (see attachment).

What I see is that there are restarts after:

Exception code: 0x00000300

Thread name: tNetTask

and another one after

interrupt: PCI Error: initiator aborted due to timeout

Is it possible to set a higher log level to get more information about why the exception occurs?
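
With restarts hours apart, it may help to tally which signatures show up in a long console capture. Below is a minimal Python sketch for doing that on the PC that records the serial output. The regular expressions are built only from the two messages quoted above, so they are assumptions about the line format, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Patterns derived from the two messages quoted in the post; the real
# console output may format these lines differently (assumption).
EXC_RE = re.compile(r"Exception code:\s*(0x[0-9A-Fa-f]+)")
THREAD_RE = re.compile(r"Thread name:\s*(\S+)")
PCI_RE = re.compile(r"PCI Error:\s*(.+)")


def summarize_console_log(text):
    """Count (exception code, thread) pairs and PCI error messages."""
    events = Counter()
    pending_code = None  # exception code waiting for its thread name
    for line in text.splitlines():
        m = EXC_RE.search(line)
        if m:
            pending_code = m.group(1)
            continue
        m = THREAD_RE.search(line)
        if m and pending_code is not None:
            events[(pending_code, m.group(1))] += 1
            pending_code = None
            continue
        m = PCI_RE.search(line)
        if m:
            events[("PCI", m.group(1).strip())] += 1
    return events


# Sample text assembled from the lines quoted in the post.
sample = """\
Exception code: 0x00000300
Thread name: tNetTask
interrupt: PCI Error: initiator aborted due to timeout
"""

summary = summarize_console_log(sample)
for key, count in summary.items():
    print(key, count)
```

Running this over the full capture would show whether one signature dominates or whether both occur about equally often.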

Stephan_D · 04-26-2016

Hi Tobias,

I could not find any command on the ni.com site to enable a higher debugging level. It may exist, but I couldn't find any information about this topic.

The thread "tNetTask" is part of the network stack of the RTOS. I found this page about this thread, but I don't know why the sbRIO is only crashing in the operational mode and not in the idle mode.

http://www.vxdev.com/docs/vx55man/vxworks/netguide/c-aboutStack.html

Stephan

Marcelo_I · 05-03-2016

Hi Tobias,

Those errors mean the CPU is trying to access memory that doesn't exist. That could be happening because of different things, the main ones being:

Running out of memory
Stack overflow
Memory corruption

Since you're already monitoring memory, you could try running chkStack from the serial console periodically to see if anything bad is happening. The output may be a bit obtuse to parse so feel free to post it and I'll try to help.

technixtp · 05-04-2016

hi,

By now I have isolated the problem. Part of our application is a TCP/IP connection between the sbRIO and a host PC. Among other things, we use this connection to transfer data sampled from the analog inputs to the host. There is a ring buffer on the host side, and the data transfer is never stopped (except during automatic reconnects caused by connection loss).

When all parts of the application except the data transfer are running (acquisition still active), there is no restart. The next step was to put only the "TCP Write" function, which sends out the data packet, into a disable structure, with all other parts of the connection (handshakes etc.) still active => no restarts for 7 days, until I stopped the test.

It seems that there is a new memory allocation every time the data string is passed to the TCP Write function, and that the memory is getting fragmented. I don't know how it is implemented behind the scenes, but my theory is that something like malloc is called and the data is copied. After hours, the function fails to get a memory block as large as requested, and the TCP Write function tears down the whole application.

In theory this would explain why I see no increasing memory consumption or CPU load. With the console command "memShow" I see that the "cumulative alloc" value increases very fast while the data transfer is active.
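
The theory above, that total free memory can look stable while no single free block is large enough for the request, can be illustrated with a small Python sketch. The free lists are invented numbers for illustration only; they are not measurements from the sbRIO, and this does not model the real VxWorks allocator. Only the 40028-byte request size comes from this thread.

```python
REQUEST_BYTES = 40028  # payload size mentioned in this thread


def heap_summary(free_blocks):
    """Return (total free bytes, largest contiguous free block)."""
    return sum(free_blocks), max(free_blocks, default=0)


def can_allocate(free_blocks, request):
    """A contiguous allocation needs one single block that is big enough."""
    return any(size >= request for size in free_blocks)


# Hypothetical free lists, invented for illustration only.
fresh_heap = [2_000_000]         # one large free region, e.g. after boot
fragmented_heap = [32_000] * 60  # similar total, split into small holes

for name, blocks in (("fresh", fresh_heap), ("fragmented", fragmented_heap)):
    total, largest = heap_summary(blocks)
    ok = can_allocate(blocks, REQUEST_BYTES)
    print(f"{name}: total free={total}, largest block={largest}, "
          f"request of {REQUEST_BYTES} fits={ok}")
```

If this is what happens on the target, the number to watch would be the largest free block rather than the total free bytes, assuming the memShow output on this system reports it.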

While researching I found this article

VxWorks Network Buffer Fills Causing Loss of Communication

But my send and receive queues are not growing, so this is not the problem.

I used the VIs of this article

Do LabVIEW TCP Functions Use the Nagle Algorithm?

to get the raw net object and access the socket options, but I'm not experienced enough in this topic to know which screw to turn.
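
For orientation, the socket options that usually come up in this context are TCP_NODELAY (which disables the Nagle algorithm) and SO_SNDBUF (the send buffer size). The sketch below reads and sets them with Python's standard socket module on a desktop PC. It is not LabVIEW or VxWorks code, the 65536-byte buffer size is an arbitrary example value, and whether either option has any effect on the restarts is an open question.

```python
import socket

# Plain desktop TCP socket, used only to show the option names; this is
# not the raw net object obtained from the LabVIEW VIs.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Read the current settings.
    nodelay_before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    sndbuf_before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

    # Disable the Nagle algorithm (send small segments immediately).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # Request a send buffer size; 65536 is an arbitrary example value and
    # the operating system may round or adjust it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)

    nodelay_after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    sndbuf_after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
finally:
    sock.close()

print("TCP_NODELAY:", nodelay_before, "->", nodelay_after)
print("SO_SNDBUF:", sndbuf_before, "->", sndbuf_after)
```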

I will monitor the stack with checkStack.

But I cannot change the TCP Write Function.

By the way, the string passed to the function is 40028 bytes long.
