Help on Compact-Rio 9068 CPU Lockup Issue

WillsG · ‎04-10-2017

I am looking for some help and getting suggestion about a critical issue we saw on one of the CompactRio-9068 platform running with NI Real Time Linux. CPU SW implementation was done using C/C++ Eclipse IDE provided from NI.

Background on SW implementation:

CPU core-1: We have two threads affiliated to CPU core-1: (both threads are running with SCHED_FIFO scheduler policy)
1. 1^st thread runs with priority 50: This runs a 1 KHz loop which interacts with FPGA logic to get/send data from/to I/O modules (it is an ISR which waits for IRQ from FPGA).
2. 2^nd thread runs with priority 7: It runs with low priority background loop with 50mSec sleep every iteration of the loop.
CPU core 0: We have around 18 threads affiliated to CPU core-0: (all threads are running with SCHED_FIFO scheduler policy)
1. All threads in CPU core-0 runs with SCHED_FIFO priority varying from 2 to 7. These threads are generally not active and not running all the time. Few of these threads executes some logic wherever there is an external command trigger through socket communication.
We also have a NFS mounted drive. We write some log data from running SW application (stdout messages from printf() in SW) to log files on NFS mounted drive.

Issue seen while running SW:

It appeared that somehow SW threads were not running at all. We did not see any log messages from running SW for more than an hour on NFS mounted drive.
We did check “/var/log/messages” and found that it has some exceptions logged in. I have attached a sample of the system log messages along with this message. Can we get some help on understanding these exception messages? We are looking for help to get some reference on how to interpret those log messages and what they are complaining about?
Are these exception log messages indicating CPU lock up scenario?
After the event, SW application did recover itself after in an hour or so. We do see exception log messages during the time SW appears to be not running.
Occurrence rate of issue is very low. We have seen this issue in past like once in a month or so.

One of the /var/log/message file is attached on this message. Part of it is shown below for a quick reference:

Apr 1 13:45:01 lc kernel: [1967263.177879] [<c0013f08>] (unwind_backtrace+0x0/0xec) from [<c03a4e64>] (dump_stack+0x20/0x24)

Apr 1 13:45:01 lc kernel: [1967263.177907] [<c03a4e64>] (dump_stack+0x20/0x24) from [<c0076870>] (__rcu_pending+0x98/0x3e8)

Apr 1 13:45:01 lc kernel: [1967263.177933] [<c0076870>] (__rcu_pending+0x98/0x3e8) from [<c00781d8>] (rcu_check_callbacks+0x194/0x24c)

Apr 1 13:45:01 lc kernel: [1967263.177964] [<c00781d8>] (rcu_check_callbacks+0x194/0x24c) from [<c003b138>] (update_process_times+0x4c/0x58)

Apr 1 13:45:01 lc kernel: [1967263.177991] [<c003b138>] (update_process_times+0x4c/0x58) from [<c005b268>] (tick_sched_timer+0xa4/0xdc)

Apr 1 13:45:01 lc kernel: [1967263.178021] [<c005b268>] (tick_sched_timer+0xa4/0xdc) from [<c004f344>] (__run_hrtimer+0x160/0x260)

Apr 1 13:45:01 lc kernel: [1967263.178046] [<c004f344>] (__run_hrtimer+0x160/0x260) from [<c0050328>] (hrtimer_interrupt+0x148/0x2b0)

Apr 1 13:45:01 lc kernel: [1967263.178069] [<c0050328>] (hrtimer_interrupt+0x148/0x2b0) from [<c00134d4>] (twd_handler+0x30/0x38)

Apr 1 13:45:01 lc kernel: [1967263.178094] [<c00134d4>] (twd_handler+0x30/0x38) from [<c0073ccc>] (handle_percpu_devid_irq+0x110/0x1c0)

Apr 1 13:45:01 lc kernel: [1967263.178124] [<c0073ccc>] (handle_percpu_devid_irq+0x110/0x1c0) from [<c006ffc0>] (generic_handle_irq+0x30/0x40)

Apr 1 13:45:01 lc kernel: [1967263.178151] [<c006ffc0>] (generic_handle_irq+0x30/0x40) from [<c000e7c4>] (handle_IRQ+0x88/0xc8)

Apr 1 13:45:01 lc kernel: [1967263.178174] [<c000e7c4>] (handle_IRQ+0x88/0xc8) from [<c0008438>] (asm_do_IRQ+0x18/0x1c)

Apr 1 13:45:01 lc kernel: [1967263.178195] [<c0008438>] (asm_do_IRQ+0x18/0x1c) from [<c000d8bc>] (__irq_svc+0x3c/0xb8)

Apr 1 13:45:01 lc kernel: [1967263.178209] Exception stack(0xdee83e58 to 0xdee83ea0)

Apr 1 13:45:01 lc kernel: [1967263.178222] 3e40: dee82020 dee83ea0

Apr 1 13:45:01 lc kernel: [1967263.178243] 3e60: de1742d8 dee82000 de1742d0 de392ac0 dee83fb0 de174280 8000000f 00000318

Apr 1 13:45:01 lc kernel: [1967263.178263] 3e80: de1742d0 dee83eb4 00000000 dee83ea0 c0022bbc c03a9ed8 a0000113 ffffffff

Apr 1 13:45:01 lc kernel: [1967263.178287] [<c000d8bc>] (__irq_svc+0x3c/0xb8) from [<c03a9ed8>] (rt_mutex_trylock+0x40/0x80)

Apr 1 13:45:01 lc kernel: [1967263.178312] [<c03a9ed8>] (rt_mutex_trylock+0x40/0x80) from [<c005fdb0>] (rt_down_read_trylock+0x3c/0x68)

Apr 1 13:45:01 lc kernel: [1967263.178337] [<c005fdb0>] (rt_down_read_trylock+0x3c/0x68) from [<c00155d4>] (do_page_fault+0x88/0x210)

Apr 1 13:45:01 lc kernel: [1967263.178361] [<c00155d4>] (do_page_fault+0x88/0x210) from [<c00083b8>] (do_PrefetchAbort+0x44/0xa8)

Apr 1 13:45:01 lc kernel: [1967263.178384] [<c00083b8>] (do_PrefetchAbort+0x44/0xa8) from [<c000dd3c>] (ret_from_exception+0x0/0x10)

Apr 1 13:45:01 lc kernel: [1967263.178398] Exception stack(0xdee83fb0 to 0xdee83ff8)

BradM · ‎04-14-2017

Hi WillsG,

Those types of kernel messages are indicative of some higher-priority loop preventing RCU callbacks from being run. Note, also, the time in the log that you'd posted: there are distinct, 5 minute jumps between the various chunks of the log messages. You have a lot running (with FIFO policy) on core 0, and until the upstream PREEMPT_RT patchset get core isolation to a good place (meaning normal kernel-type stuff moves off the core as well), you need to allot some time on each core to do kernel-type housekeeping, otherwise you'll see the RCU-related warning that you do.

It would be helpful to see a fuller log (what you attached is only noise after the original issue occurred) as well as to know what version of the kernel that you're using on this controller.

WillsG · ‎04-17-2017

Thank you, BradM for providing some insights on the issue.

We are using Labview RT 2013 version. Following output from "uname -a" cmd provides all details about Kernel we are using:

admin@lc:/LC/bin# uname -a
Linux lc 3.2.35-rt52-1.0.0f1 #1 SMP PREEMPT RT Wed Jun 12 09:57:26 CDT 2013 armv7l GNU/Linux

We noticed the issue after system was recovered from exception during which all /var/log/message files were rotated and its content was only left with this after-issue log messages. I believe that the 5-minute jumps in exception logs are because of cron-jobs running is background. I have attached a cron-job log file here for reference. There are 2 cron-jobs which runs in background: one runs at 5 minute interval and other is at 15 minute interval. The exception logs also show this time characteristics.

Does kernel level house keeping runs with normal scheduling policy (SCHED_OTHER)? or if with lower priority than our listed core-0 threads? In that case, I might have to change scheduling policy for some threads in core-0. Any specific suggestions on this?

BradM · ‎04-24-2017

WillsG,

As I recall, for 3.2 (2013 or sp1), the threaded irq model used for almost all HW is half of the max RT prio (50), while some other kernel threads are lower prio (may not even use the RT scheduling classes).

You can check the priority of the various different tasks on the system, mapping some of the kernel threaded tasks to PIDs to check the /proc pseudofs (since the included ps is the fairly limited busybox implementation, you can't simply use the ps -eo fields to get formatted output of the tasks running on the system). You can check fields 40 and 41 of the /proc/$PID/stat file to see the RT priority as well as the scheduling policy being used for a task. On a fairly recent target (2016 stack, I think), I see that the rcu_callback task is priority 1, policy SCHED_FIFO. This means if you run a core at 100%, even at RT priority 2, rcu_callback won't get a chance to run, resulting in posted writes to rcu datastructures within the kernel not completing, resulting in a warning after some timeout (default is 60 seconds).

Putting this all together, it makes sense that you're seeing the kernel spamming at 5 minute intervals with an offset of 1 minute: cron runs on the core that actually has some breathing room, it seems like it's hitting a RCU-protected variable, on the :05, :10, …, and a minute later, the RCU stall detection times out, spamming the kernel log.

Since it seems that this occurs when the app gets hung up or otherwise is not working correctly, I'd take a nice, hard look at the loops within your C++ app to see if there's a way that one (or more) of the loops can hard-loop on a core, resulting in the that thread (at a higher-than-rcu_callback prio) filling that core's spare time, preventing the app from making progress, from rcu_callback from running, and some other important kernel threads (e.g. I think that will preempt the page cache flushing from happening, meaning that posted writes to disk hung up in the VFS won't make it to disk)

Taking another step back, this is why RT is not easy: you'll need to get your hands dirty and dig into the system, figuring out RT-safe ways to debug what's going on. Or, you know, use LabVIEW RT 🙂

WillsG · ‎04-28-2017

Thank you, BradM for the insights.

I will take another look at the threads with RT priority in my application.

NI Linux Real-Time Discussions

Help on Compact-Rio 9068 CPU Lockup Issue

Help on Compact-Rio 9068 CPU Lockup Issue

Re: Help on Compact-Rio 9068 CPU Lockup Issue

Re: Help on Compact-Rio 9068 CPU Lockup Issue

Re: Help on Compact-Rio 9068 CPU Lockup Issue

Re: Help on Compact-Rio 9068 CPU Lockup Issue