10.5 Leveraging the Kernel's Hung Task and Workqueue Stall Detectors

At the end of the previous section, we mentioned that the system might experience tasks getting stuck: the so-called "Hung Task."

But there's an interesting detail here: how does the kernel know a task is stuck? After all, if the CPU is busy running an infinite loop, who has the time to act as the "referee"?

The answer lies in a kernel thread called khungtaskd. It's a "watchdog" the kernel keeps specifically to catch offenders in the act — it wakes up periodically, looks around, and checks if any task has been sitting in the D state (uninterruptible sleep) for too long.

Now, we're going to tear this mechanism wide open. Just like fine-tuning a race car, the kernel provides a full set of sysctl parameters that let you adjust this detector's sensitivity and behavior.

Assuming you're on an x86_64 virtual machine, let's look at the default values for these parameters:

$ sudo sysctl -a | grep hung_task
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 120
kernel.hung_task_warnings = 10
$

Behind this string of output, every switch is worth a deep dive. Let's break them down one by one and see how they dictate the system's fate.


Thresholds and Trade-offs: Configuring the Hung Task Detector

All of the following parameters depend on the kernel configuration option CONFIG_DETECT_HUNG_TASK being enabled. If you're running a kernel on an embedded device, this section is especially critical — with limited resources, you have to strike a balance between "detection accuracy" and "system overhead."

hung_task_timeout_secs

This is the central parameter: the time threshold for declaring a task "stuck."

When a task remains in uninterruptible sleep (the D state you see in the ps command) for longer than this number of seconds, the kernel determines it's a Hung Task and triggers a warning.
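
Incidentally, you don't have to wait for khungtaskd to spot them: a quick way to eyeball D-state tasks yourself is to filter the ps STAT column. A minimal sketch (the column selection and widths are just a formatting choice):

$ ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'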

The default value is 120 seconds.

This is a "qualitative" parameter. If your system has extremely high response time requirements — for example, if a real-time task must respond within 10ms — you obviously can't wait 120 seconds to sound the alarm. You can turn it down, even to 1 second. But be careful: setting it too low might cause normal I/O waits to be falsely reported as hangs.

Its valid range is {0:LONG_MAX/HZ}.
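
Tuning it is a one-liner. For example, to tighten the threshold to 30 seconds at runtime (an illustrative value; persist it via /etc/sysctl.d/ if you want it to survive reboots):

$ sudo sysctl -w kernel.hung_task_timeout_secs=30
kernel.hung_task_timeout_secs = 30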

hung_task_warnings

Maximum number of warnings.

Even when a hang is detected, the kernel doesn't necessarily want to keep spamming logs. This parameter defines the maximum number of warnings the system will report. The default value is 10. Each time a hung task is detected, this counter decrements by 1.

When it reaches 0, the kernel goes silent — even if there are still tasks stuck in the system, it won't print new warnings.

It's like a car's check engine light that flashes a few times and then automatically turns off so it doesn't annoy you. But this has a hidden danger: in the case of a persistent deadlock, you might only see the first few log entries and then mistakenly assume the system has quieted down.

If you don't want it to go silent, you can set it to -1, which allows unlimited warnings.
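
On a machine you're actively debugging, that silence is usually the last thing you want, so:

$ sudo sysctl -w kernel.hung_task_warnings=-1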

hung_task_panic

Whether to escalate from a mere warning to summary execution.

The default value is 0. This means that when a Hung Task is detected, the kernel merely prints an alarm message (KERN_WARNING) and lets the task continue sitting in the D state.

But if you set it to 1, the nature changes: as soon as a Hung Task is discovered, the kernel immediately calls panic(), halting the machine on the spot (or rebooting it after a delay, if kernel.panic is set to a nonzero timeout).

This switch is typically used in High Availability (HA) cluster environments: rather than letting a server linger half-dead, it's better to just reboot it and restore service.
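
In that HA scenario, it pairs naturally with kernel.panic, which turns the halt into an automatic reboot. A runtime sketch (the 10-second delay is just an example; this can also be set at boot via the hung_task_panic=1 kernel command-line parameter):

$ sudo sysctl -w kernel.hung_task_panic=1
$ sudo sysctl -w kernel.panic=10   # reboot 10 seconds after the panic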

hung_task_check_count

Upper limit of the detection scope.

When khungtaskd scans, it doesn't actually iterate over tens of thousands of tasks — that would be too slow. It only checks the number of tasks specified by this parameter.

This is actually a performance optimization. On resource-constrained embedded systems, traversing the task list is heavy work in itself. This value allows you to limit the detector's workload.

Interestingly, this value is architecture-dependent.

  • On my x86_64 virtual machine, it's 4,194,304 (about 4.2 million).
  • But on a Raspberry Pi (ARM-32), it might only be 32,768.

This reflects different platforms' tolerance for "traversal overhead."
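
If even the platform default feels heavy for your board, it's tunable like the rest. Keep the trade-off in mind: tasks beyond the limit simply aren't examined in that scan pass (1024 below is purely illustrative):

$ sudo sysctl -w kernel.hung_task_check_count=1024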

hung_task_check_interval_secs

Detection frequency.

Usually, this value is 0. This means the detection interval simply follows hung_task_timeout_secs: khungtaskd wakes up once per timeout period.

But if you set it to a positive number (like 5), the kernel will scan every 5 seconds instead. A task is still only reported once it has sat in the D state for longer than hung_task_timeout_secs; only the scan cadence changes.

The valid range is {0:LONG_MAX/HZ}. Generally, keeping the default of 0 is fine; letting the timeout logic set the pace is more reasonable.
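
If you do want a fixed cadence, it's the same one-liner pattern (5 is an illustrative value):

$ sudo sysctl -w kernel.hung_task_check_interval_secs=5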

hung_task_all_cpu_backtrace

Full-scene snapshot capability.

The default value is 0. When set to 1, as soon as a Hung Task is detected, the kernel sends an NMI (Non-Maskable Interrupt) to all CPU cores, forcing each CPU to print its current stack backtrace.

This is quite aggressive, but also extremely useful.

Imagine a thread is deadlocked, but you don't know who's holding the lock. With this option enabled, you can not only see the victim's (Hung Task's) stack, but also see what all the other CPUs are doing — it's very likely that one of them is clinging to the lock and refusing to let go.

This requires CONFIG_SMP and CONFIG_TRACE_IRQFLAGS_SUPPORT support.
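
Enabling it follows the same pattern; just be prepared for a burst of console output on a many-core machine:

$ sudo sysctl -w kernel.hung_task_all_cpu_backtrace=1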


More Hidden Deadlocks: Workqueue Stall Detection

Does solving the task hang problem mean we're all set?

Not necessarily.

There's another trouble spot in the kernel: Workqueues. Driver developers love using them to defer time-consuming work to process context. The kernel internally maintains a bunch of kernel threads (Worker Threads) to silently digest these work items.

The question is: if a work item you submitted just lies in the queue without being executed, or executes too slowly, does that count as a fault?

Yes, and it can be fatal. This could lead to storage device response timeouts, network packet loss, or you might watch the fans spin wildly while the system is completely unresponsive.

To catch this kind of "slacking" behavior, the kernel provides Workqueue Stall Detection.

Enabling Detection: CONFIG_WQ_WATCHDOG

This requires enabling the configuration option when compiling the kernel:

CONFIG_WQ_WATCHDOG=y

You can find it under make menuconfig: Kernel hacking -> Debug Oops, Lockups and Hangs -> Detect Workqueue Stalls
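
Before compiling anything, check whether your distribution kernel already ships with it enabled. On most distros the running kernel's config is readable under /boot (the exact path varies by distribution):

$ grep WQ_WATCHDOG /boot/config-$(uname -r)
CONFIG_WQ_WATCHDOG=y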

Once enabled, if a work queue's worker pool gets stuck while processing a work item, or if progress is absurdly slow, the kernel will print a KERN_WARNING level alarm, accompanied by the work queue's internal state information.

Threshold: workqueue.watchdog_thresh

This timeout threshold defaults to 30 seconds.

It's controlled by the kernel boot parameter workqueue.watchdog_thresh, and can also be dynamically modified through the corresponding sysfs file.

If a work item hasn't finished executing within this time, an alarm will be triggered.

If you want to disable this detection (for example, if you legitimately have a long-running task), just set this value to 0.
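
Since the workqueue code is built into the kernel, the "corresponding sysfs file" should be its module-parameter node (assuming CONFIG_WQ_WATCHDOG=y; the 60 below is illustrative):

$ cat /sys/module/workqueue/parameters/watchdog_thresh
30
$ echo 60 | sudo tee /sys/module/workqueue/parameters/watchdog_thresh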


Intentionally Breaking Things: Testing Workqueue Stall in Practice

Talk is cheap. Let's manually trigger a Workqueue Stall and see how the system reacts.

⚠️ Experiment Warning Never run this code in a production environment! This will instantly push one of your CPU cores to 100% in an infinite loop, and could directly freeze the system. Make sure to run this experiment on a test machine with a multi-core CPU, leaving at least one core free to maintain system responsiveness.

We've prepared a simple kernel module, with the core logic right in the work queue's handler function.

The code logic is simple: we submit a task to the default kernel work queue (system_wq), and then intentionally write an infinite loop inside this task to max out the CPU.

Let's look at this code (ch10/workq_stall/workq_stall.c):

/* [ ... header includes omitted ... ] */

static void workq_func(struct work_struct *work)
{
	pr_info("%s: workqueue handler start\n", KBUILD_MODNAME);

	/* Do a little "normal" work first... */
	mdelay(100);	/* simulate some legitimate processing time */

	/* --- the BUG code begins --- */
	/* An infinite loop, purpose-built to trigger a Workqueue Stall */
	while (1)
		cpu_relax();	/* spin idly, pinning this CPU at 100% */
	/* --- the BUG code ends --- */

	/* Never reached: the "done" message will not show up in the log */
	pr_info("%s: workqueue handler done\n", KBUILD_MODNAME);
}

/* [ ... module init code, which submits the work, omitted ... ] */

The key here is that while (1) loop. cpu_relax() merely tells the CPU this is a busy wait; it doesn't yield. The worker spins here indefinitely, never reaching the end of the function, and therefore never processing the next item in the queue.
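
A typical session looks like the following (assuming the book's standard kbuild Makefile sits next to the source; keep a second terminal open on the kernel log):

$ cd ch10/workq_stall
$ make
$ sudo insmod ./workq_stall.ko
$ sudo dmesg -w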

After you insmod this module, within a few seconds (depending on your watchdog_thresh setting), your console will start outputting red alarm messages like crazy:

BUG: workqueue lockup - pool cpu0, time=30024ms, last=42s ago!
...
WARNING: CPU: 0 PID: 1234 at kernel/workqueue.c:5806 check_flush_dependency+0x...
...
Workqueue: events workq_func
RIP: 0010:check_flush_dependency+0...
...
Call Trace:
check_flush_dependency
process_one_work
worker_thread
...

At this point, if you open top or htop in another terminal, you'll see a kernel thread named something like kworker/0:1 hogging 100% of a CPU.

This is exactly the result we expected:

  1. We submitted work to the system's default work queue (events).
  2. One of the work queue's kernel threads (kworker) took on this work.
  3. The work function entered an infinite loop, causing this kworker thread to get stuck.
  4. The kernel's Watchdog detected that this kworker had entered a "stalled" state.
  5. The kernel angrily printed the stack trace, telling you exactly who's causing trouble (workq_func).

This is the power of the Workqueue Stall Detector — it can precisely pinpoint faults where "the thread is still alive but just isn't doing any work."


Chapter Reflection

In this chapter, we've finally assembled the last piece of the kernel fault monitoring puzzle.

From the initial Kernel Panic (total crash), to Lockup (a CPU spinning without responding), and now to Hung Task (tasks sleeping to death) and Workqueue Stall (work queues slacking off), we've essentially been building a multi-dimensional monitoring system:

  • Panic addresses the problem of "kernel data structures are corrupted, it can't keep living";
  • Soft/Hard Lockup addresses the problem of "the CPU is still spinning, but the logic is already stuck";
  • Hung Task addresses the problem of "a task is waiting for a resource it will never get";
  • Workqueue Stall addresses the problem of "the task queue is blocked, with severe backlog".

Remember the scenario we mentioned in the chapter introduction? — The system looks like it's up, the network still responds to pings, but it's completely unresponsive otherwise. Now you should know that this is most likely a Soft Lockup or Hung Task, rather than a simple Panic. You're no longer someone who stares blankly at a black screen; you have tools (sysctl, kexec, Magic SysRq) to deeply probe the kernel's final state before death.

This is very important. Because only by knowing how it died can you, before the next reboot, find that bug, fix it, and prevent it from happening again.

In the next chapter, we'll push these monitoring techniques to their limits. We'll shift our perspective from "the kernel messing up on its own" to "external sabotage": hardware faults, memory bit flips, and how to use ultimate weapons like GDB and Kdump to perform an on-site autopsy on the kernel's corpse. It's about as hardcore as kernel debugging gets.

Are you ready? Let's continue.


Exercises

Exercise 1: understanding

Question: In a standard kernel Panic output, the message 'Kernel panic - not syncing: ...' contains the phrase 'not syncing'. What specific operational behavior does this phrase represent in the kernel source code? Why is this behavior adopted when a fatal error occurs?

Answer and Analysis

Answer: This phrase indicates that the kernel intentionally stops flushing buffered data (syncing) to disk. This is because when the kernel panics, the system is in an unstable state, and attempting disk I/O operations could cause further data corruption or erroneous overwrites.

Analysis: This question tests the understanding of a fundamental concept. According to the section 'Why Is the Phrase "not syncing" in the Kernel Panic Message?': when the system detects an unrecoverable error, the in-memory filesystem buffers may contain inconsistent data. If disk syncing is forcibly executed at this point, it could compromise filesystem integrity. 'not syncing' explicitly informs the user that the kernel, to protect data safety, has abandoned saving data not yet written to disk.

Exercise 2: application

Question: Suppose you are writing a kernel module that needs to perform some emergency handling when the system crashes (such as recording specific hardware state to specific registers), but this handling process might contain sleep operations. Should you use atomic_notifier_chain_register() or blocking_notifier_chain_register() to register your callback function? Please explain why.

Answer and Analysis

Answer: Neither should be used directly, or they require extremely careful handling, but architecturally, Panic typically demands non-blocking operations.

More precisely: if you must register to the Panic chain, you must ensure your code contains no sleep operations. Because when a Panic occurs, the system disables local interrupts and preemption, running in an atomic context. Although panic_notifier_list itself is a ATOMIC_NOTIFIER_HEAD, any attempt to sleep could cause the system to deadlock or experience even more severe errors.

Analysis: This question tests scenario analysis. Although 'Application' difficulty usually encourages actual coding, this is a critical trap question.

  1. The text explicitly states that panic_notifier_list is an 'Atomic' type notifier chain (ATOMIC_NOTIFIER_HEAD).
  2. The text mentions that Atomic type callbacks run in an atomic context and cannot block.
  3. The text notes that the Panic function internally disables interrupts and stops scheduling. Therefore, if a callback function attempts to sleep, it will never be woken up, causing the system to hang.
  4. For Panic handling, best practice is to be fast and non-blocking.

Exercise 3: application

Question: Suppose your Linux server occasionally experiences kernel Oops in a production environment, causing service interruptions. For debugging, you decide to trigger a Panic and reboot when an Oops occurs. At the same time, to get more detailed context, you want to automatically print backtraces of all active tasks and memory information when a Panic happens. Please write the shell commands to configure the required /proc/sys/kernel parameters and their corresponding values.

Answer and Analysis

Answer: Configure Oops to trigger Panic: echo 1 > /proc/sys/kernel/panic_on_oops

Configure Panic to print task and memory information (panic_print mask calculation): echo 17 > /proc/sys/kernel/panic_print (or 0x11)

Analysis: This is a practical application question.

  1. Triggering Panic: According to knowledge point panic_on_oops, setting it to 1 allows Oops to escalate to Panic.
  2. Printing Information: According to Table 10.2 in the text (panic_print):
    • Show all tasks info: bit 4 (value 16)
    • Show memory info: bit 0 (value 1)
    • To get both pieces of information simultaneously, a bitwise OR operation is needed: 16 | 1 = 17.
  3. Combining these gives the two parameter settings.

Exercise 4: thinking

Question: The text mentions the difference between 'Hard Lockup' and 'Soft Lockup', mainly regarding whether interrupts are disabled and the role of the NMI Watchdog. Combining the kdump (Dump-capture kernel) mechanism, analyze the following scenario:

If a CPU disables interrupts and falls into an infinite loop, triggering the NMI Watchdog and causing a Hard Lockup warning, the main kernel can no longer schedule and run normally. In this extreme situation, how does the kdump mechanism work? (Note: focus on the role of NMI)

Answer and Analysis

Answer: Even if the main kernel is stuck in an infinite loop with interrupts disabled (Hard Lockup), an NMI (Non-Maskable Interrupt) still has higher priority than normal interrupts and can forcibly interrupt the CPU's execution flow. The NMI Watchdog uses NMI precisely to detect hard lockups. When kdump is configured, the Panic trigger path (even when entered from NMI context) ultimately executes the kexec mechanism, exploiting the fact that the CPU still responds in NMI context to jump into the capture kernel that was loaded in advance, thereby preserving the memory image of the main kernel at the time of the crash.

Analysis: This is a deep-thinking question that requires synthesizing knowledge of Hard Lockup, NMI, and Kdump.

  1. Nature of Hard Lockup: The CPU is in an infinite loop with interrupts disabled. Normal interrupts (including timer interrupts) cannot execute, so task scheduling cannot occur, and normal software watchdogs cannot work.
  2. Special nature of NMI: NMI is a Non-Maskable Interrupt. Even when interrupts are disabled in the CPU's flags register (IF=0), NMI will still be serviced by the processor. The NMI Watchdog uses performance counters to generate NMIs.
  3. Kdump's intervention: After the NMI triggers the Panic routine, although the main kernel's normal logic is dead, the Panic handling flow (specifically the kexec jump) needs only the CPU's ability to keep executing low-level instructions in NMI context, transferring control to the pre-loaded capture kernel. This allows the system to bypass the deadlocked main kernel and complete the memory dump.

Key Takeaways

This chapter focuses on establishing "post-mortem" and real-time monitoring capabilities for kernel faults. First, Panic is the kernel's act of complete surrender when encountering unfixable errors. The system stops scheduling and enters an infinite loop through the panic() function. During this time, it not only prints a last testament containing registers, stack traces, and KASLR offsets, but may also trigger visual alerts like keyboard LED blinking (panic_blink). The kernel parameters panic and panic_print can control the reboot timing and the granularity of detailed information output, respectively.

To obtain critical diagnostic information when the system freezes, using netconsole combined with a network receiver (like netcat) to remotely capture logs is standard practice. This solves the problem of local terminals being unable to output logs because the system is frozen. Additionally, the Magic SysRq mechanism (via /proc/sysrq-trigger or key combinations) can forcibly trigger a crash or execute emergency operations (like syncing disks, dumping task information) when the system is unresponsive, serving as a powerful backdoor for debugging deadlocks or hung states.

For more hidden system hangs, the kernel provides a Lockup detection mechanism based on NMI and watchdogs. Soft Lockup refers to a task spinning in an infinite loop in kernel space, preventing other processes from being scheduled (triggered after 20 seconds by default), while Hard Lockup involves an extreme infinite loop with interrupts disabled (triggered after 10 seconds by default). Both can be configured via kernel parameters to directly trigger a Panic or merely print a warning, helping developers distinguish between scheduler failure and interrupt masking.

Beyond CPU-level deadlocks, the Hung Task detector is specifically responsible for discovering tasks that have been stuck in uninterruptible sleep (D state) for too long (120 seconds by default), usually caused by waiting on I/O or lock resources. Meanwhile, the Workqueue Stall Detection mechanism monitors whether tasks in kernel work queues are indefinitely delayed in execution, preventing tasks deferred by drivers or subsystems via work queues from blocking the entire queue's operation due to infinite loops.

Developers can also register with the panic_notifier_list atomic notifier chain to insert custom handling logic at the exact moment of a kernel crash (such as lighting up a fault LED or recording critical hardware state). But it's important to note that such Panic handling callbacks run in a highly unstable atomic context, and executing any operation that might cause sleep or blocking is strictly prohibited — otherwise, it could prevent the system from even generating a crash dump.