7.2 Generating a Simple Kernel Bug and Oops
Now that our tools are in place, it's time to break things.
There's an old saying about fighting fire with fire. To catch kernel bugs, we first need to learn how to create them. Don't worry—this shouldn't be hard for anyone here. In fact, it might even be a bit exciting.
The classic introductory bug in education is, without a doubt, the infamous NULL pointer dereference. You've probably heard of it, and maybe even experienced its power in user space (remember Segfault?). But this time, we're moving the battlefield to kernel mode.
The plan is simple:
- Write a kernel module that deliberately dereferences a NULL pointer. We'll call it oops_tryv1.
- Watch it crash and observe the kernel's reaction.
- Level up to oops_tryv2 and blow up the kernel in three different ways.
But first, we need to answer two fundamental questions: What exactly can the procmap mentioned in the previous section do? And what on earth is this so-called NULL trap page?
The procmap utility — Visualizing Memory
We installed procmap in the previous section. Now let's see what it's actually good for.
Simply put, it visualizes a process's Virtual Address Space (VAS)—including both user and kernel space—as a map. While /proc/<pid>/maps can also show this, it's purely text-based and exhausting to read, especially when you're staring at hundreds of mapping lines. Your brain easily overloads. procmap helps you see at a glance where the mountains are (mapped memory) and where the pits are (sparse regions).
Its GitHub page states it clearly:
procmap is a command-line tool for visualizing the complete memory mapping of a Linux process, including the kernel and user space Virtual Address Space (VAS).
It outputs a vertical chart ordered from high to low virtual addresses. Most importantly, it identifies sparse regions (holes), and on 64-bit systems it can show you the massive non-canonical region: on x86_64 with 48-bit addressing, this "hole" covers all of the 16 EB space except the 256 TB that is actually addressable, about 16,383.75 PB.
The tool is still under active development with a few caveats, but it's more than sufficient for our experiments.
What's this NULL trap page anyway?
Alright, let's get to the point. To trigger a NULL pointer dereference, we first need to know exactly where a NULL pointer points.
On all Linux-based systems (and virtually all modern virtual memory operating systems), the kernel splits a process's available virtual memory into two halves: the User Space VAS and the Kernel Space VAS. This is commonly known as the VM split.
On x86_64, the full VAS size for each process is $2^{64}$ bytes. That sounds like an astronomical number, and it is: 16 EB (Exabytes), where 1 EB = 1,024 PB, or roughly 1 million TB. This space is so large it's impossible to exhaust.
In reality, the kernel only uses a tiny fraction of this space on x86_64 by default:
- Kernel VAS: 128 TB, anchored at the top of the VAS (from 0xffffffffffffffff down to 0xffff800000000000).
- User VAS: 128 TB, anchored at the bottom of the VAS (from 0x00007fffffffffff down to 0x0).
Even combined, that's only 256 TB. Relative to the total capacity of 16 EB (16,384 PB), we're only using about 0.0015%.
Think about This
The 64-bit address space is luxuriously large. We haven't even used 0.002% of it. Where did the rest go? It forms massive address "holes"; any access to these holes results in a memory access violation.
Back to our main topic: the NULL pointer.
At the very bottom of the user VAS, the page from virtual address 0x0 to 0xfff (bytes 0 through 4095, assuming 4 KB pages) is known as the NULL trap page.
Let's run procmap to see what it looks like (assuming your shell process PID is 1076):
$ procmap --pid=1076
[...]
You'll see a map, and the tiny sliver at the very bottom is it. If you look closely at the figure from the previous section (Figure 7.1), you'll notice an area marked as [NULL trap page] below the bash process's mappings.
How this area works is simple: its permissions are all --- (no read, no write, no execute).
This means that any process (or thread) attempting to read from or write to this area will be rejected by the MMU.
How does it "trap"?
Let's break this process down—it's important:
- Attempted access: The process tries to read from or write to address 0x0 (or any byte within this page, like 0x100).
- MMU inspection: The CPU's MMU (Memory Management Unit) receives this virtual address and tries to translate it into a physical address by consulting the page tables.
- Permission denied: The MMU finds no usable translation for this page: nothing here may be read, written, or executed.
- Exception triggered: The MMU raises an exception to the operating system. On x86, a failed translation like this one is reported as a page fault (#PF).
- Kernel intervenes: The OS's fault handler wakes up. This function runs in the context of the offending process.
- Verdict: The kernel discovers: "Ah, it's a user-mode process misbehaving, accessing an address it shouldn't."
- Execution: The kernel sends a fatal signal to the process: SIGSEGV.
- Outcome: If you've written C, you should be very familiar with this result: the process receives the signal, its default action terminates the process, and the shell prints that classic Segmentation fault (core dumped).
Of course, a process can install a signal handler to "catch" this signal and do some cleanup, but ultimately, it still has to die.
Now we understand the mechanism of the NULL trap page. Next, we're going to do something outrageous: in kernel mode, ignore this rule and forcibly read from or write to a NULL address.
A simple Oops v1 — Dereferencing the NULL pointer
Here is our first victim, oops_tryv1. Its logic is simple: read from or write to a NULL address.
As mentioned earlier, any access to the NULL trap page triggers an MMU error. This holds true in kernel mode as well.
Here is the core code (you can clone the full code from GitHub):
// ch7/oops_tryv1/oops_tryv1.c
[...]
static bool try_reading;
module_param(try_reading, bool, 0644);
MODULE_PARM_DESC(try_reading,
"Trigger an Oops-generating bug when reading from NULL; else, do so by writing to NULL");
We define a boolean module parameter try_reading.
- If set to 1 (yes), the code attempts to read from the NULL address.
- If left at the default 0, the code attempts to write the character 'x' to the NULL address.
Take a look at the init function:
static int __init try_oops_init(void)
{
	size_t val = 0x0;

	pr_info("Lets Oops!\nNow attempting to %s something %s the NULL address 0x%p\n",
		!!try_reading ? "read" : "write",
		!!try_reading ? "from" : "to",
		NULL);
	if (!!try_reading) {
		val = *(int *)0x0;
		/*
		 * This comment matters. If we only read and never use the
		 * result, a smart compiler will optimize the read away
		 * entirely, because the value is never used.
		 * To force the compiler to emit the load, we print val
		 * with pr_info(); that guarantees the faulting access happens.
		 */
		pr_info("val = 0x%lx\n", val);
	} else /* try writing to NULL */
		*(int *)val = 'x';

	return 0; /* success */
}
Note: There's a pitfall here. If you merely read *(int *)0x0 without using the result, the modern GCC optimizer will simply treat it as dead code and delete it. Then you won't see the Oops, and you might mistakenly think reading from NULL doesn't cause an error. So we must use pr_info to actually use the variable val, forcing the generation of that memory access instruction.
The logic here is straightforward. Whether reading or writing, as long as you touch the NULL trap page, the MMU will flag an error.
What happens at this moment?
At this point, the kernel module code is running in the context of the insmod process (specifically, the context of a system call initiated by the process). When the faulting access executes, the outcome depends on which mode the code was running in:
- User mode misbehaves: The kernel sends a SIGSEGV to kill the process.
- Kernel mode misbehaves: The kernel realizes: "Wait a minute, it's my own code that's broken." This means the kernel itself has a bug. The kernel tolerates no such ambiguity, so it triggers an Oops.
What is the !!<boolean> syntax? This is a little C language trick. !5 is 0, and !0 is 1. So !!5 ultimately evaluates to 1. No matter what non-zero integer you pass in, it gets normalized to a strict 1; if it's 0, it remains 0.
Examining the scene of the crash
When we load the module and attempt to write to NULL, the kernel log spits out a massive amount of text. Figure 7.2 below shows the first few lines:
(Figure 7.2: Partial screenshot of an Oops)
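Look for the line that begins with Oops: near the top of the output. This is the kernel screaming: "I fell down!" On x86 it is usually preceded by a BUG: line naming the kind of fault and the faulting address.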
A lifeline for the restless developer: Reloading without rebooting?
Here's a very practical problem: once a module triggers an Oops, it becomes very difficult to unload.
Try rmmod? It probably won't work: when the Oops occurred, the insmod process was killed outright, and the module's reference count never dropped to 0. You can check with lsmod:
$ lsmod |grep oops
oops_tryv1 16384 1
The 1 on the far right stands there like a tombstone, blocking you from running rmmod.
At this point, if you want to tweak the code and retry, do you have to reboot the machine? For "restless" developers, rebooting is a waste of time. Here is a very dumb but highly effective hack:
- Run make clean to clean things up.
- Rename the source file (e.g., to oops_try_v1b.c).
- Modify the Makefile to compile with the new filename.
- insmod the new module.
This gives you a module with a new name, and the kernel will treat it as a newcomer, even if the old one's "corpse" is still lying around. This can be a lifesaver during frequent debugging.
Doing a bit more of an Oops — Our buggy module v2
Just reading from or writing to 0x0 is too boring. In oops_tryv2, we're going to do something fancier. This module provides three ways to crash the kernel:
- Case 1: Write to a random address within the NULL trap page.
- Case 2: Specify an invalid kernel space address (a sparse region) yourself and write to it.
- Case 3: In a kernel workqueue function, attempt to write to a member of a structure through a pointer that was never allocated (the scenario closest to a real bug).
Case 1 — Randomly crashing into the NULL page
Similar to v1, except this time we use the get_random_bytes() kernel API to generate a random number and then reduce it modulo PAGE_SIZE (usually 4096).
// ch7/oops_tryv2/oops_tryv2.c
static int __init try_oops_init(void)
{
	unsigned int page0_randptr = 0x0;
[...]
	} else { /* no parameter passed: pick a random address in the NULL trap page */
		pr_info("Generating Oops by attempting to write to a random invalid kernel address in NULL trap page\n");
		get_random_bytes(&page0_randptr, sizeof(unsigned int));
		bad_kva = (page0_randptr %= PAGE_SIZE);
	}
	pr_info("bad_kva = 0x%lx; now writing to it...\n", bad_kva);
	*(unsigned long *)bad_kva = 0xdead;
[...]
Whatever the random number is, the modulo confines it to the range 0 to PAGE_SIZE - 1 (0 to 4095 with 4 KB pages), so it always falls inside the NULL trap page. The final write operation is guaranteed to trigger an Oops.
Case 2 — Attacking the "holes" in the kernel VAS
This is a bit more interesting. We define a module parameter mp_randaddr that lets you pass in a kernel address yourself.
static unsigned long mp_randaddr;
module_param(mp_randaddr, ulong, 0644);
MODULE_PARM_DESC(mp_randaddr, "Random non-zero kernel virtual address; deliberately invalid, to cause an Oops!");
The code logic is simple: if you pass this parameter, we write 0xdead to that address.
But here's the question: How do I know which address is invalid?
If you just guess an address and happen to hit the kernel's code or data, the write may land in live kernel memory; instead of a clean Oops, you might corrupt the running system or cause a Panic outright. We need a safe but invalid target.
This is exactly why we spent so much effort introducing procmap in the previous section.
Run procmap, looking only at kernel space:
$ procmap --pid=1 --only-kernel
You'll see output similar to Figure 7.3. Pay attention to the areas marked as <... K sparse region ...>. These are sparse regions, also known as Holes. No physical memory is mapped in these places; they are completely empty.
On x86_64, there's usually a large hole like this between the module loading area (modules) and the vmalloc area.
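For example, in the figure, the address range from 0xffffffffc0000000 down to 0xffffda377fffffff is pure void.
Let's pick an address that procmap reports as unmapped, say 0xffffffffc000dead. The name sounds auspicious enough. Check the address against your own procmap output first, since the kernel's layout can differ between systems and between boots.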
Let's run the experiment:
$ modinfo -p ./oops_tryv2.ko
mp_randaddr:Random non-zero kernel virtual address; deliberately invalid, to cause an Oops! (ulong)
bug_in_workq:Trigger an Oops-generating bug in our workqueue function (bool)
$ sudo insmod ./oops_tryv2.ko mp_randaddr=0xffffffffc000dead
Killed
$
Killed. Same recipe, same flavor.
This time, the kernel also triggered an Oops because it wrote to an unmapped address. The underlying mechanism is the same as with the NULL trap page: MMU translation fails -> throws a Page Fault -> the kernel's Page Fault Handler discovers it's kernel mode performing an illegal write -> Oops.
Case 3 — Digging a pit in a Workqueue
The first two cases both occurred in the process context of insmod. But the kernel isn't just about process contexts; there are many asynchronous execution scenarios, such as interrupts, Tasklets, and Workqueues.
Real kernel bugs often hide in places like this: you get a structure pointer, forget to check if it's NULL, and then try to access it in some asynchronous callback.
This is very realistic. We'll use the v2 module's second parameter, bug_in_workq, to demonstrate this third case.
We define a structure and initialize a Workqueue:
struct st_ctx {
	int x, y, z;
	struct work_struct work;
	u8 data;
} *gctx, *oopsie; /* careful, neither pointer has memory allocated yet! */
Note the global pointer oopsie. It is never assigned, so as a global it defaults to NULL.
When bug_in_workq=1, we set up a workqueue task:
static int setup_work(void)
{
	gctx = kzalloc(sizeof(struct st_ctx), GFP_KERNEL);
[...]
	gctx->data = 'C';
	/* initialize the work item */
	INIT_WORK(&gctx->work, do_the_work);
	/* submit it to the kernel's default workqueue */
	schedule_work(&gctx->work);
[...]
}
The actual work function do_the_work will be called shortly after by one of the kernel's worker threads. The bug hides right here:
static void do_the_work(struct work_struct *work)
{
	struct st_ctx *priv = container_of(work, struct st_ctx, work);
[...]
	if (!!bug_in_workq) {
		pr_info("Generating Oops by attempting to write to an invalid kernel memory pointer\n");
		oopsie->data = 'x'; /* boom: oopsie is NULL */
	}
	kfree(gctx);
}
See that line oopsie->data = 'x';? oopsie is a null pointer. We're trying to write data into a structure pointed to by a null pointer.
Load this module:
sudo insmod ./oops_tryv2.ko bug_in_workq=yes
Notice that this time, the console does not print Killed. That's because the one that died isn't the insmod process, but an innocent kernel worker thread.
Figure 7.5 shows the Oops log at this point. You'll notice it looks a bit different from the previous ones, but the essence is exactly the same: illegal memory access.
A kernel Oops and what it signifies
Now that we can skillfully crash the kernel, let's pause for a moment and think about what an Oops actually means.
- An Oops is not a Segfault: A Segfault is a user-mode error signal sent by the kernel to a process. An Oops is the kernel's own diagnostic message, indicating that the kernel code itself has a problem. Although an Oops might cause certain processes to receive a SIGSEGV or even die, that's just a side effect.
- An Oops does not equal a Kernel Panic: A Panic means the kernel has completely given up and the system stops running. An Oops simply indicates that the kernel encountered an error, but in many cases, the kernel can stubbornly stay alive and keep running (albeit possibly wounded).
  - Of course, you can configure the kernel to Panic immediately upon an Oops. Check /proc/sys/kernel/panic_on_oops; if it's 1, any Oops is escalated to a Panic and the system goes down.
Whether the kernel keeps running or goes down, an Oops is fundamentally a kernel-level bug. It must be detected, interpreted, and fixed.
Alright, now we not only have the ability to generate Oopses, but we also have several vivid case studies in hand. Next comes the most hardcore part: we're going to dissect those cryptic Oops logs line by line, like forensic pathologists examining a body.
Buckle up.