8.4 Real-World Lock Defect Cases
In the previous section, we covered KCSAN, which acts like a tireless night watchman, helping us monitor those fleeting data races.
But that's just a tool. Real-world bugs are often far more cunning than the textbook deadlock models—they hide deep within complex call chains, disguised as perfectly reasonable code. In this section, we won't introduce any new mechanisms. Instead, we'll zoom in and examine a few actual kernel bugs.
You'll see that even experienced developers can stumble on fundamental issues like "calling a blocking function while holding a spinlock." You'll also see how a seemingly insignificant lock error can be exploited by security researchers into a system privilege escalation vulnerability.
This isn't just about reading stories; it's about building an intuition: When you write a line of code in a critical section, do you know the cost behind it?
8.4.1 LDV Rules: Errors Written in Blood and Sweat
The Linux Driver Verification (LDV) project maintains a rule set for kernel development. Rather than just "rules," it's more of a "history of blood and tears," paid for with countless kernel panics by those who came before us. Let's look at a few core rules regarding locks.
Rule 1: Double-Locking a Mutex
This rule sounds simple: Never try to lock the same mutex twice.
This isn't just a code style issue; it's dictated by the kernel's mechanisms. The kernel's mutex implementation does not support Recursive Locking. If you write this:
mutex_lock(&my_mutex);
// ... do some work ...
mutex_lock(&my_mutex); // attempt to lock it again
The consequences are catastrophic: the second lock acquisition will directly result in a (self-)deadlock.
There's a subtle historical difference here: user-space POSIX threads actually support recursive locks, as long as you set the type to PTHREAD_MUTEX_RECURSIVE during initialization. But in the kernel, this option was ruthlessly stripped out. Why? Because kernel developers believe recursive locks usually mask design flaws: if you need recursive locking, it often means your lock responsibilities aren't clearly separated.
Back to the mechanism: a true mutex doesn't just prohibit double-locking; it also prohibits unlocking a lock you don't hold. This sounds like common sense, but in complex error-handling paths it's very easy for unlock and lock calls to become mismatched, ultimately leading to logic corruption.
Real-World Case: There was once a bug in the EDAC (Error Detection and Correction) driver (see the original text for the Commit link). The code logic was as follows:
- The function edac_device_reset_delay_period() first acquired a mutex.
- It then called edac_device_workq_teardown().
- Here's the trap: inside the teardown function, it attempted to acquire the same mutex.
The result was a deadlock. The fix was simple: adjust the call order and invoke teardown after releasing the lock.
Rule 2: Blocking Inside a Spinlock
This is the pitfall most easily fallen into by beginners, and occasionally by veterans as well.
The rule is clear: When holding a spinlock, you must absolutely never perform memory allocations that might sleep.
This means that when you call kmalloc() within a critical section protected by a spinlock, you must pass the GFP_ATOMIC flag, not the default GFP_KERNEL.
spin_lock(&my_lock);
// ❌ Dangerous! This may trigger page reclaim and put the process
// to sleep, scheduling it away
// ptr = kmalloc(size, GFP_KERNEL);
// ✅ Correct: tell the kernel this is atomic context, so don't sleep
ptr = kmalloc(size, GFP_ATOMIC);
spin_unlock(&my_lock);
Real-World Case:
In a wireless network driver, someone did exactly this (Commit: 5b0691508aa9). While holding a spinlock, they called kzalloc() with GFP_KERNEL.
The result? The kernel configuration option CONFIG_DEBUG_ATOMIC_SLEEP caught this error, threw a warning, and clearly pointed out the culprit in the Call Trace. Without this debug option, your system might inexplicably freeze, because the scheduler tries to schedule away a "not-allowed-to-sleep" process, resulting in a completely messed-up CPU state.
Rule 3: Spinlock Lock/Unlock Pairing
This rule is essentially the "spinlock version" of Rule 1:
- You cannot lock the same spinlock twice.
- You cannot release a spinlock you don't hold.
- You must unlock before exiting.
Although these rules sound like common sense, remember: Common sense is often the easiest thing to violate. Especially in error-handling branches, or at 3 AM when you think "I'll just finish this one function."
8.4.2 Local Locks: Making Intent Clearer
Before we look at more bugs, we need to mention a new synchronization primitive introduced in kernel 5.8: Local Locks.
This isn't magic; it's actually a wrapper.
Background: In the kernel, we frequently need to protect "per-CPU" data. The traditional approach is to directly disable kernel preemption or hardware interrupts.
preempt_disable(); // disable preemption
// ... access per-CPU data ...
preempt_enable(); // re-enable preemption
The problem arises:
When you see only preempt_disable() in the code, it's hard to tell at a glance: "Is this code meant to prevent preemption, or is it protecting a specific variable?"
Local Locks exist to solve this problem. They wrap these low-level operations into a true "lock" API.
// Using a Local Lock
local_lock(&my_lock);
// ... access the protected data ...
local_unlock(&my_lock);
Where's the value?
- Clear intent: Code readers explicitly know that a lock is being acquired, not just disabling preemption.
- Debugging friendly: When used with Lockdep, Local Locks can be tracked just like normal locks. If you forget to unlock, or if the lock dependency logic has a deadlock risk, Lockdep can catch it just as easily as it catches Mutex errors.
This is a great example of something brought from the PREEMPT_RT (Real-Time Linux) project into the mainline. It proves that: Good encapsulation doesn't just reduce lines of code; it also reduces the cost of understanding.
8.4.3 "Autopsy Reports" from Bugzilla
If LDV rules are the textbook, then the kernel Bugzilla is the hospital's pathology department. It's full of real cases.
Tip: Go to Bugzilla and search for specific warning strings, such as the classic error thrown by Lockdep:
possible circular locking dependency detected
(Figure 8.7 in the original text shows the search results).
When you see this output, you know there's a deadlock risk. Lockdep doesn't just tell you "there's a deadlock"; it also prints a complex lock dependency graph, pointing out exactly which two locks formed a cycle.
Additional advice:
Besides Lockdep, enabling the kernel configuration option CONFIG_DEBUG_ATOMIC_SLEEP is also a powerful tool for catching bugs. It will immediately report an error when code attempts to sleep in an atomic context, rather than leaving you to guess after the system deadlocks.
8.4.4 In-Depth Analysis from Community Blogs
Just reading Bugzilla titles isn't enough. Let's dive into a few classic blog posts to see what the bugs that drove developers crazy actually look like.
Case 1: How a Simple Memory Bug Leads to System Compromise
Source: Jann Horn (Google Project Zero), Oct 2021.
This is a chilling story. It tells us: A trivial lock usage error can be the crack that brings down the entire fortress.
Vulnerability Background:
The problem lay in the TTY (pseudo-terminal) driver code (drivers/tty/tty_jobctrl.c:tiocspgrp()).
Nature of the Bug:
The code used the wrong spinlock. It should have used a specific spinlock to protect the struct pid structure, but it mistakenly used a lock from another structure that could be arbitrarily specified by user space.
What did this lead to?
- Data race: Attackers could exploit this vulnerability to manufacture race conditions on struct pid across different CPU cores.
- Reference count tampering: Through carefully constructed timing, attackers could corrupt the reference count of struct pid.
- Privilege escalation: Jann Horn leveraged this corruption to build a complete exploit chain, ultimately gaining a Root Shell on a Debian Linux system.
Fix: Take a look at the screenshot of that Commit (Figure 8.7 in the original text)—the fix was literally just one line: Use the correct spinlock.
This is what we've emphasized repeatedly in this chapter: Locks are the final line of defense for security. When you choose the wrong lock, or forget a lock, you aren't just creating a concurrency bug; you might be leaving the door wide open for hackers.
Case 2: Disabling Interrupts Too Long Paralyzes the Network
Source: Alibaba Cloud, Jan 2020.
Symptom: Severe network jitter on the system.
Troubleshooting process: What does this have to do with concurrency? Everything.
Tying back to the mechanism: Why must we treat spinlocks with caution?
Remember we mentioned earlier that APIs like spin_lock_irq() and spin_lock_irqsave() disable local CPU hardware interrupts while acquiring the lock.
It's like telling the receptionist answering the phone: "Don't take any calls, no matter who it is, until I'm done with this." If this task only takes a few microseconds, no problem. But what if it takes 50 milliseconds?
Imagine a network packet processing scenario:
- The NIC receives a packet and raises a hardware interrupt.
- The CPU should respond immediately and read the packet.
- But if the CPU is executing code that holds a spin_lock_irqsave(), interrupts are disabled.
- The NIC's interrupt handler is delayed.
The real pitfall:
Alibaba Cloud engineers discovered that the network jitter was caused by Slab statistics code. This code iterated through a linked list of dentry objects in a loop, and it held the lock, with interrupts disabled via spin_lock_irq(), the entire time.
spin_lock_irq(&lock); // disable interrupts
// walk a huge linked list
list_for_each_entry(...) {
// ... statistics work ...
}
spin_unlock_irq(&lock); // re-enable interrupts
When the system has a massive number of dentry objects, this loop runs for a long time. This results in hardware interrupts being disabled for an extended period, severely delaying network packet processing, which ultimately manifests as packet loss and jitter.
Lesson:
When you use spin_lock_irq(), your critical section must be extremely short and lean.
Never iterate over long linked lists while holding a lock, and never perform any O(n) complexity operations.
How to measure?
If you suspect there are "long-held lock" situations in your system, you can use the eBPF tool criticalstat to catch them.
# Measure code paths that disable preemption for more than 5000 microseconds (5 ms)
sudo criticalstat-bpfcc -p -d 5000 2>/dev/null
This tool will directly print the call stack of the function that "refuses to release the lock," leaving it nowhere to hide.
Case 3: Sleeping in Atomic Context, Leaking Reference Counts
Source: Ryan Eberhardt, Nov 2020.
This article is titled "My First Kernel Module: A Debugging Nightmare," and the title honestly expresses the author's pain.
Bug 1: Sleeping in an RCU Critical Section
rcu_read_lock(); // enter the RCU read-side critical section (atomic context)
// ...
msleep(10); // 💥 Boom! You must not sleep here
rcu_read_unlock();
Analysis:
msleep() triggers process scheduling. But inside an RCU read-side critical section, the current task is not allowed to be scheduled away. This violates the kernel's atomicity rules.
Fix: If you must delay, use udelay() or mdelay(), which are busy-waits and won't trigger scheduling.
Bug 2: Accessing a Global Structure Without Taking a Reference Count
Global data structures in the kernel (like the task_struct or file structures) can be freed by another process at any time. If you don't increment its reference count, the pointer you just obtained might be freed right away—a classic Use-After-Free.
// ❌ Wrong: use it directly
printk("%d\n", task->pid);
// ✅ Correct: take a reference first, drop it when done
get_task_struct(task);
printk("%d\n", task->pid);
put_task_struct(task);
The author's debugging methodology: Ryan mentioned a very "crude" but highly effective method in the article: Binary comment-out method. Since you don't know where it's crashing, comment out all the code blocks and run it—no problems? Okay, uncomment half of it and run it again—crashed? Then the problem is in that half. In the face of extremely complex concurrency bugs, this clumsy approach is often faster than high-tech tools.
Chapter Echoes
We've come a long way in this chapter.
From the most basic definition of data races, to the implementation principles of dynamic analysis marvels like KCSAN, to several painful real-world cases. If we were to distill all of this into a single sentence, it would probably be:
There are no "small matters" in concurrent programming.
A misused spinlock, a memory allocation while holding a lock, or a forgotten reference count—these are just logical flaws under a single-threaded mindset, but in a multi-core kernel, they are the source of system crashes, data leaks, and even privilege escalation vulnerabilities.
Do you remember the question we asked at the beginning of this chapter? — "Why are kernel concurrency bugs so hard to track down?"
Now you have the answer. Because they are often not logic errors, but timing errors. The code order hasn't changed; it's just that the speed of CPU execution varied a little, and the world collapses. This is why we need KCSAN, Lockdep, and all these debugging tools—because the human brain is inherently bad at simulating concurrent timing.
In the next chapter, we will enter another dimension of debugging: Tracing kernel execution flow. If this chapter was about teaching you how to patch a leaky roof, the next chapter will be about teaching you how to install cameras to see exactly where the water is seeping in.