10.4 Detecting Deadlocks and CPU Stalls in the Kernel
At the end of the previous section, we talked about the "red line" in the Panic handler: don't make it too complex, or you won't even be able to leave a "dying message." But sometimes, the kernel dies in a more insidious way—it doesn't crash immediately or scream for help; it just suddenly goes silent.
This is the "hung" crime scene we'll tackle in this section.
What Exactly is a Deadlock?
Here the meaning is straightforward: the system, or a specific CPU core, becomes unresponsive for an extended period. This is more troublesome than an outright Panic because it can happen in the dead of night in a production environment—the machine appears to be running, but it has long since frozen.
To catch these ghost-like failures, we need more advanced monitoring mechanisms. Before diving into the specific kernel detection mechanisms, let's take some time to clarify the underlying "watchdog" concept—it's the foundation on which all of the detection mechanisms that follow are built.
A Brief Note on Watchdogs
A watchdog is essentially a monitoring mechanism. Its logic is very simple: you must send me a heartbeat ("feed" me) at regular intervals; if I don't hear from you within the specified time, I assume you're dead and forcibly reboot the system.
In the Linux ecosystem, watchdogs come in a few forms:
- Hardware Watchdog: This is an independent chip or module soldered directly onto the board. It's connected to the system's reset circuit, and once triggered, it physically pulls the reset pin. Driving this usually depends heavily on the specific board, and the kernel provides a generic framework to make it easier for driver developers to interface with all sorts of messy hardware chips.
- Software Watchdog: This is the softdog driver in the kernel. It's not as brutal as hardware, but it's sufficient in many scenarios. Its location in the kernel configuration is Device Drivers | Watchdog Timer Support, and the corresponding config option is CONFIG_SOFT_WATCHDOG.
However, a kernel module alone isn't enough. We usually need a userspace daemon to work with it. This daemon's job is to periodically feed the dog (usually by writing something to /dev/watchdog, or issuing an ioctl). If this daemon stops working because the system is hung, the watchdog feeding times out, and the system is rebooted.
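To make "feeding the dog" concrete, here is a minimal sketch of a feeder's core loop, using the standard /dev/watchdog interface (documented in the kernel's Documentation/watchdog/watchdog-api.rst). It's illustrative only—the real watchdog(8) daemon does far more (config parsing, health checks, scheduling priority):

/* feeder.c: minimal watchdog-feeding sketch; error handling trimmed */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
	int timeout = 60;                       /* seconds before the dog bites */
	int fd = open("/dev/watchdog", O_WRONLY);

	if (fd < 0)
		return 1;                       /* no watchdog device (or no perms) */
	ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* negotiate the bite timeout */
	for (;;) {
		ioctl(fd, WDIOC_KEEPALIVE, 0);  /* feed the dog */
		sleep(timeout / 2);             /* feed well before the deadline */
	}
	/* Never reached. Incidentally, writing 'V' and then closing the fd
	 * performs a "magic close" that disarms the dog—unless the kernel
	 * was built with CONFIG_WATCHDOG_NOWAYOUT. */
}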
In many production kernels (like the custom 5.10.60 kernel we build in this book), we compile softdog as a module:
CONFIG_SOFT_WATCHDOG=m
Once loaded, you'll see a module named softdog in the system.
Hands-on: Running softdog and the Userspace Watchdog
Talk is cheap. Let's fire up these components in an x86_64 Ubuntu VM and see for ourselves.
First, load the kernel's softdog module, then manually start the userspace watchdog daemon (for demonstration purposes, I'll run it in --verbose mode):
$ sudo modprobe softdog
$ sudo watchdog --verbose &
[...]
watchdog: String 'watchdog-device' found as '/dev/watchdog'
watchdog: Variable 'realtime' found as 'yes' = 1
watchdog: Integer 'priority' found = 1
[1]+ Done watchdog --verbose
Now let's confirm they're actually running:
$ ps -e | grep watch
111 ? 00:00:00 watchdogd
10106 ? 00:00:00 watchdog
There are two lines here:
The first line, watchdogd, is a kernel thread (part of the softdog driver).
The second line, watchdog, is the userspace daemon we just started.
By the way, if your system uses systemd, it has built-in watchdog functionality as well (configured in /etc/systemd/system.conf).
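For instance, a single (illustrative) line in that file is enough to have PID 1 open /dev/watchdog and feed it on your behalf:

RuntimeWatchdogSec=30s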
⚠️ Debugging Warning: Watchdogs are great on servers, but they're a nightmare when debugging the kernel. Imagine you're single-stepping code in an interactive debugger like KGDB, and because you're stepping too slowly, you trigger the watchdog and the system just reboots on you—who could stand that? So, when you're about to do deep kernel debugging, remember to turn these watchdogs off.
Alright, now that we've laid the watchdog foundation, we can look at how the kernel leverages this mechanism (especially NMI) to build more advanced detectors—specifically for hard and soft lockups.
Using the Kernel's Hard and Soft Lockup Detectors
Software (and even hardware) is imperfect. I'd bet you've encountered that kind of "mysterious hang": the system doesn't Panic, the logs aren't updating, the mouse won't move, the keyboard is unresponsive, and the screen is frozen.
This is a Lockup.
The kernel has ways to detect these kinds of problems. The watchdog we just mentioned plays a key role here. The kernel uses the NMI Watchdog (Non-Maskable Interrupt Watchdog) and the perf subsystem to implement hard and soft lockup detection.
The relevant config options are hidden in the Kernel hacking | Debug Oops, Lockups and Hangs menu. Looking at Figure 10.8 (assuming you can see that pile of config options), you might ask: since you said this is a "production kernel," why aren't panic_on_oops or panic on soft/hard lockup selected?
That's a good question.
The "production kernel" in this book is primarily for demonstration purposes; it's not actually running business workloads in a server room. If you're building a real product, whether to enable automatic reboot is an architectural decision: When the system hangs, do you want it to reboot immediately to heal itself, or do you want it to stay frozen in place so you can investigate the corpse? If it's the former, turn on the panic_on related options and pair them with the panic=n boot parameter (which means automatically reboot n seconds after a Panic).
In our configuration, the detection functionality is enabled, but it won't immediately Panic. Table 10.3 summarizes the key configurations, boot parameters, and sysctl controls involved here.
You can use sysctl to check your current system's settings (note that nmi_watchdog corresponds to hard lockups and soft_watchdog to soft lockups—the latter, despite the name, has nothing to do with the softdog module):
$ sudo sysctl -a | grep watchdog
kernel.nmi_watchdog = 0
kernel.soft_watchdog = 1
kernel.watchdog = 1
kernel.watchdog_cpumask = 0-5
kernel.watchdog_thresh = 10
In this example, nmi_watchdog is 0 because this is a VM: the NMI-based hard lockup detector is driven by the hardware PMU (performance counters), which hypervisors typically don't expose to guests. soft_watchdog, however, is always available.
Now, let's clarify what these terms actually mean.
What is a Soft Lockup?
A soft lockup is a specific type of kernel bug.
Imagine a task getting stuck in an infinite loop in kernel mode, or refusing to leave the CPU for some reason, and doing so for a long time. The result is that other tasks have no chance to be scheduled and run on that CPU core. This is a soft lockup.
Time Limit:
The default timeout for a soft lockup is 20 seconds. How is this value calculated? It's twice the watchdog_thresh value.
The hard lockup timeout is the watchdog_thresh value itself, defaulting to 10 seconds.
You can view and modify this value:
$ cat /proc/sys/kernel/watchdog_thresh
10
If you want to change it to 5 seconds (meaning a soft lockup then triggers at 10 seconds and a hard lockup at 5), just write an integer to it; writing 0 disables the detection entirely.
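For example, via sysctl:

$ sudo sysctl -w kernel.watchdog_thresh=5
kernel.watchdog_thresh = 5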
What Happens When a Soft Lockup is Detected?
This depends on a few config options:
- If the kernel.softlockup_panic sysctl is 1, or the softlockup_panic=1 boot parameter is present, the kernel will Panic directly.
- If Panic is not enabled (the default), the kernel prints a warning message and dumps the stack of the stuck task.
⚠️ Note: In the second case (warning only, no Panic), the buggy task that caused the problem will continue to sit there and hog the CPU. The system will not recover automatically.
Hands-on: Triggering a Soft Lockup on x86_64
Reading the definition alone doesn't give you a real feel for it, so let's artificially create a disaster.
The idea is simple: pick an unlucky CPU core, put it into kernel mode, and then run a highly CPU-intensive infinite loop on that core.
I modified a demo module from the book Linux Kernel Programming Part 2 (kthread_simple) and added some malicious code to it. The specific code is in the ch10/kthread_stuck directory.
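Purely for orientation, here's a condensed sketch of what the heart of such a module looks like. This is not the repo's exact code (the real one also handles signals so the thread can be stopped cleanly); the lockup_type parameter semantics are taken from how we use it later in this section:

/* Condensed sketch of a "stuck kthread" module -- NOT the repo's exact code */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(stuck_lock);
static struct task_struct *kt;
static int lockup_type = 1;	/* 1: soft lockup demo; 2: hard lockup demo */
module_param(lockup_type, int, 0);

static int simple_kthread(void *arg)
{
	if (lockup_type == 2)
		spin_lock_irq(&stuck_lock);	/* preemption off AND hard irqs off */
	else
		spin_lock(&stuck_lock);		/* preemption off only */
	while (1)
		;	/* spin in kernel mode forever, never yielding this CPU */
	return 0;	/* never reached */
}

static int __init kt_stuck_init(void)
{
	kt = kthread_create(simple_kthread, NULL, "lkd/kt_stuck");
	if (IS_ERR(kt))
		return PTR_ERR(kt);
	kthread_bind(kt, 1);	/* pin it to one unlucky core (CPU 1 here) */
	wake_up_process(kt);
	return 0;
}

static void __exit kt_stuck_exit(void)
{
	kthread_stop(kt);	/* blocks forever here: this is how rmmod hangs! */
}

module_init(kt_stuck_init);
module_exit(kt_stuck_exit);
MODULE_LICENSE("GPL");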
Load this module (without parameters, it defaults to testing a soft lockup):
[ ... module-loading output elided ... ]
After it has been running for a bit over 20 seconds, the kernel's soft lockup watchdog catches on and spits out a BUG message! Looking at Figure 10.9, you'll see the console flooded with BUG: soft lockup ... messages at the KERN_EMERG log level.
Besides that prominent BUG alert, the watchdog will also call routines like dump_stack() to dump the entire current state:
- The list of modules in memory
- Context information
- CPU register snapshots
- Machine instructions
- Most importantly: the kernel-mode call stack
If you used our handy PRINT_CTX() macro, you'll see output similar to this:
002) [lkd/kt_stuck]:3530 | .N.1 /* simple_kthread() */
Notice the .N.1 field—it mimics ftrace's four-character latency format.
The first character is . (a dot), which means hardware interrupts are enabled; the N means the need-resched flag is set (the scheduler wants in, but can't get in); and the trailing 1 is the preemption depth—exactly one spinlock held.
The interrupts-enabled part matches our expectation: in this test, we used a regular spin_lock() and didn't disable interrupts. This stands in stark contrast to the hard lockup we'll discuss next.
Don't Forget the Spinlock!
You might ask: since it's an infinite loop, why bother acquiring a spinlock? This is the essence of the test.
A spinlock doesn't just spin-wait: plain spin_lock() disables kernel preemption on the local core, and the _irq/_irqsave variants additionally disable hardware interrupts. With preemption and interrupts both off, once you're in, essentially nothing can interrupt you (except an NMI).
This is exactly the condition needed to simulate a hard lockup.
But there's an interesting contradiction here: Since interrupts are disabled, how can the kernel watchdog detect that it's stuck?
The answer lies in NMI (Non-Maskable Interrupt). The very definition of NMI is "even if interrupts are disabled, I'm coming in." It uses hardware performance counters to periodically check whether the CPU is still ticking. So, even if you disable all interrupts in your code and enter an infinite loop, NMI can still break down the door and catch you slacking.
(Of course, as we emphasized in Chapter 8, you must be quick when holding a spinlock. We're deliberately doing the exact opposite here purely for educational demonstration.)
If you want to dive into the soft lockup detection source code, you can check out the watchdog_timer_fn() function in kernel/watchdog.c.
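If you just want the gist, here's a toy userspace model of the division of labor between the two detectors—the names here are invented and this is not kernel code; the real logic lives in watchdog_timer_fn() and, for the NMI side, watchdog_overflow_callback():

#include <stdio.h>

static unsigned long hrtimer_interrupts;  /* ++'d by the timer-IRQ side */
static unsigned long last_seen_by_nmi;    /* NMI's snapshot of the above */
static unsigned long watchdog_task_ts;    /* last time the watchdog task ran */

static void report(const char *what) { printf("BUG: %s detected\n", what); }

/* Timer-IRQ side: can only run if hard irqs still work on this CPU */
static void timer_irq_side(unsigned long now, unsigned long thresh)
{
	hrtimer_interrupts++;        /* evidence for the NMI that irqs are alive */
	if (now - watchdog_task_ts > 2 * thresh)
		report("soft lockup");   /* irqs fire, but nothing gets scheduled */
}

/* NMI side: fires even with hard irqs disabled (perf counter overflow) */
static void nmi_side(void)
{
	if (hrtimer_interrupts == last_seen_by_nmi)
		report("hard lockup");   /* not even the timer IRQ got through */
	last_seen_by_nmi = hrtimer_interrupts;
}

int main(void)
{
	timer_irq_side(25, 10);  /* watchdog task last ran 25s ago: soft lockup */
	nmi_side();              /* first NMI: just takes a snapshot */
	nmi_side();              /* no timer IRQ in between: hard lockup */
	return 0;
}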
Also, if you try to rmmod this malicious module without properly sending a signal to let the kernel thread exit first, after about 2 minutes, the rmmod process itself will be detected as a "Hung Task." We'll cover this detection mechanism shortly.
What is a Hard Lockup?
A hard lockup is more severe than a soft lockup.
If a soft lockup is "I won't let anyone else run, but I can still respond to interrupts," then a hard lockup is "shutting the door completely." A CPU core gets stuck in an infinite loop in kernel mode and has interrupts disabled. This means that on this core, not even hardware interrupts can be handled.
Time Limit:
The default timeout for a hard lockup is 10 seconds (watchdog_thresh).
What Happens When a Hard Lockup is Detected?
Again, there are two scenarios:
- Panic: If the nmi_watchdog=1 boot parameter is set, or if kernel.hardlockup_panic is 1, the kernel will Panic directly.
- Warning: By default, it only prints a warning and a stack trace. If the hardlockup_all_cpu_backtrace=1 boot parameter is present, it also prints stack traces for all CPUs.
Similarly, if it doesn't Panic, the buggy code will continue to freeze that core.
RCU and CPU Stalls
Another common source of "hangs" comes from the RCU (Read-Copy Update) mechanism.
You may have heard of RCU; it's a powerful lockless synchronization mechanism in the kernel. But when using RCU, if a CPU falls into a prolonged state of disabled interrupts or unpreemptibility, it will cause an RCU CPU Stall.
It's like the RCU mechanism is waiting for you to finish reading data, but you just won't let go, so it times out and sounds the alarm.
Quickly Understanding the Core Logic of RCU:
Imagine several readers (R1, R2, R3) are reading a piece of shared data. Along comes a writer. The writer doesn't modify the data directly; instead, it makes a copy, modifies the copy, and then atomically points the pointer to the new data. What happens to the old data? It can only be freed after all the readers who started reading (R1, R2, R3) have confirmed they are done.
How do we confirm the readers are done?
RCU's implementation approach is: wait for all current readers to voluntarily yield the CPU (e.g., by calling the scheduler).
The interval the writer must wait out is called a grace period, and RCU won't let one drag on silently forever: if a grace period fails to complete within the stall timeout—CONFIG_RCU_CPU_STALL_TIMEOUT, which defaults to a lengthy 60 seconds here—because readers won't finish, or because a CPU simply isn't scheduling at all, the kernel prints an RCU Stall warning.
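To ground the mental model, here's a minimal sketch of the two sides in kernel code—illustrative only (the cfg type and names are made up, writer-vs-writer locking is omitted, and gcfg is assumed to have been initialized):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cfg { int val; };
static struct cfg __rcu *gcfg;

/* Reader side: cheap and non-blocking; must not sleep in the read section */
static int reader(void)
{
	int v;

	rcu_read_lock();
	v = rcu_dereference(gcfg)->val;	/* safe snapshot of the current pointer */
	rcu_read_unlock();		/* "I'm done": lets the grace period end */
	return v;
}

/* Writer side: copy, update, publish, then wait out pre-existing readers */
static void writer(int newval)
{
	struct cfg *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
	struct cfg *oldc = rcu_dereference_protected(gcfg, 1); /* writer context */

	if (!newc)
		return;
	newc->val = newval;
	rcu_assign_pointer(gcfg, newc);	/* atomically point at the new copy */
	synchronize_rcu();	/* returns only once all prior readers are done */
	kfree(oldc);		/* now nobody can still see the old copy */
}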
Hands-on: Triggering a Hard Lockup / RCU Stall
This step is a bit more troublesome than the soft lockup, as the conditions are stricter:
- Must be physical hardware: Hard lockup detection relies on NMI, which VMs usually don't have.
- Enable NMI: You need to add nmi_watchdog=1 to the boot parameters.
- Configuration check: CONFIG_RCU_CPU_STALL_TIMEOUT needs to be between 3 and 300 (seconds).
Once configured, you should be able to see this with sysctl:
# sysctl -a | grep watchdog
kernel.nmi_watchdog = 1
kernel.soft_watchdog = 1
...
kernel.watchdog_thresh = 10
Now, let's load that malicious module ch10/kthread_stuck again, this time passing the parameter lockup_type=2.
This parameter makes the kernel thread take the spinlock with spin_lock_irq()—disabling both hardware interrupts and preemption on that core—and then spin in an infinite loop while holding it (just as in the sketch shown earlier).
After a while, the kernel logs will explode. You might see backtraces generated by NMI interrupts, or RCU CPU Stall warnings.
The logs will look roughly like this:
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 3-...0: (1 GPs behind) idle=462/1/0x4000000000000000 softirq=60126/60127 fqs=6463
(detected by 2, t=15003 jiffies, g=127897, q=1345272)
Sending NMI from CPU 2 to CPUs 3:
NMI backtrace for cpu 3
CPU: 3 PID: 16351 Comm: lkd/kt_stuck Tainted: P W OEL 5.13.0-37-generic #42~20.04.1-Ubuntu
[...]
This is the RCU mechanism roaring after discovering that a CPU has been stuck with interrupts disabled for too long.
You can use the kernel.panic_on_rcu_stall sysctl to make the kernel Panic directly in this situation.
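It's off by default; to arm it:

$ sudo sysctl -w kernel.panic_on_rcu_stall=1
kernel.panic_on_rcu_stall = 1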
Finally, the kernel actually provides a more professional test module, test_lockup (CONFIG_TEST_LOCKUP, source in lib/test_lockup.c), specifically for exercising these detection mechanisms. It's much more comprehensive than our homemade module.
Table 10.4 summarizes all the key parameters and configurations related to hard and soft lockups.
Using the Kernel's Hung Task and Workqueue Stall Detectors
Besides CPU infinite loops, another common type of "hang" is a stuck task.
For example, a process enters TASK_UNINTERRUPTIBLE (D state) and refuses to wake up from its sleep, persisting for more than the default 120 seconds. This is a Hung Task.
Configuring Hung Task Detection
Still in the Kernel hacking | Debug Oops, Lockups and Hangs menu:
[*] Detect Hung Tasks
(120) Default timeout for hung task detection (in seconds)
[ ] Panic (Reboot) On Hung Tasks
Once enabled, the kernel will periodically scan all tasks to see if anyone is in the D state and has timed out.
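The per-task check boils down to something like this—heavily simplified from kernel/hung_task.c's check_hung_task(), using 5.10-era field names:

/* Called only for tasks found in TASK_UNINTERRUPTIBLE (D) state */
static void check_hung_task_simplified(struct task_struct *t,
				       unsigned long timeout)
{
	unsigned long switch_count = t->nvcsw + t->nivcsw;

	/* Scheduled at least once since the last scan? Then it's fine */
	if (switch_count != t->last_switch_count) {
		t->last_switch_count = switch_count;
		return;
	}
	/* Still in D state with zero context switches across an entire
	 * timeout window: it never woke up. Sound the alarm */
	pr_err("INFO: task %s:%d blocked for more than %lu seconds.\n",
	       t->comm, t->pid, timeout);
}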
The relevant parameters are as follows:
- CONFIG_DEFAULT_HUNG_TASK_TIMEOUT: Compile-time default for the timeout (in seconds).
- kernel.hung_task_timeout_secs: Runtime sysctl; set to 0 to disable.
- kernel.hung_task_panic: Whether to Panic when a hung task is detected.
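For example, shortening the window at runtime (illustrative value):

$ sudo sysctl -w kernel.hung_task_timeout_secs=60
kernel.hung_task_timeout_secs = 60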
⚠️ Debugging Pitfall:
We mentioned earlier that if you try to rmmod that malicious module without first sending a signal to stop the kernel thread, the rmmod process itself might get stuck (because it's waiting for the module's reference count to drop to zero), and then get caught red-handed by the Hung Task detector. This is a classic pitfall: the tool used to debug a deadlock gets deadlocked itself.
Besides stuck tasks, the kernel's Workqueues can also stall: if a work item sits in the queue for too long without executing, the workqueue stall watchdog (CONFIG_WQ_WATCHDOG) raises an alarm as well.
Chapter Echoes
In this chapter, we built a complete "fault monitoring system."
From the most brutal Panic, to hard lockup detection that uses NMIs to break through closed doors, to hung task detection that spots tasks suffocating in their sleep—essentially, we have been installing more and more sensors into the kernel.
Do you remember the question we posed in the chapter introduction—The system didn't crash, but it's hung. What do we do? Now the answer is clear:
- If the CPU is running wild without scheduling, it's a Soft Lockup;
- If the CPU has interrupts disabled and isn't responding, it's a Hard Lockup;
- If a task is deadlocked waiting for a resource and won't wake up, it's a Hung Task;
- If the RCU grace period can't complete, it's an RCU Stall.
In the next chapter, we'll shift our focus from "the kernel dying on its own" to "the kernel being killed by external forces"—hardware faults and memory corruption—and the heavy-duty tools for debugging such nuclear-level problems.