Chapter 7: When the Kernel Roars
There is a class of problems that can instantly send a chill down a systems engineer's spine.
It's not high CPU usage, nor is it a memory leak—those are annoying, but at least the system is still alive. You can still SSH in, still read the logs, still observe it like a doctor examining a patient.
The truly terrifying scenario is sudden, dead silence. The cursor on the screen stops moving, heartbeat packets stop arriving, or the monitor goes completely black. The system isn't just broken; it has failed in the ugliest way possible, seemingly without leaving any last words.
In this chapter, we tackle exactly this nightmare scenario: kernel crashes, from an Oops all the way to a full kernel panic.
When kernel code goes wrong, there is no user-space-style "exception handling" to catch the fall. A crashing process harms only itself, but a bug in the kernel can bring down the entire system. Fortunately, the kernel rarely dies quietly. It usually leaves behind an angry diagnostic message in the kernel log: a kernel Oops.
Reading an Oops is like deciphering a coded message left at a crime scene. It is obscure, packed with hexadecimal addresses, and littered with hardware register jargon; at first glance you will want to close the terminal. But once you master the method, this seemingly garbled text tells you precisely which pointer was NULL, which machine instruction triggered the illegal access, and which function, and ultimately whose code, caused the disaster.
Even better, the Linux kernel community and its wider ecosystem offer a complete toolchain to make life a little easier for us unlucky engineers: procmap, which lays out a process's memory map; decodecode, a script in the kernel source tree (scripts/decodecode) that plays detective by disassembling the machine code bytes recorded in an Oops back into assembly instructions; and netconsole, which sends kernel log messages over the network so that they survive even if the system hangs completely.
In this chapter, we won't just learn how to read the kernel's "dying words." We will set up a hands-on environment, deliberately trigger a crash, and use these tools to dissect it clearly.
Ready to enter the ICU? Let's begin.
7.1 Technical Preparation
Before we start crashing the kernel, we need to get our toolbox ready.
The good news is that our base workspace configuration is exactly the same as in Chapter 1. If you've already followed the previous chapters to set up your development environment and configure the kernel source tree, you can skip this part—no need to spin up a new virtual machine or reinstall the system.
All the example code is available in this book's GitHub repository. You can clone it directly:
git clone https://github.com/PacktPublishing/Linux-Kernel-Debugging.git
Since we are going to dive deep into the kernel's memory layout and crash scenes, there is one new tool that is very important—procmap.
You already saw /proc/<pid>/maps in Chapter 1. It is a plain-text pseudo-file: it contains all the information, but it is hard to read, and your eyes glaze over quickly when you are staring at hundreds of lines of memory mappings. procmap is a dedicated visualization tool that draws a clear map of a process's address space, covering both user space and kernel space. In the "NULL pointer dereference" and "kernel-mode Oops" cases we are about to handle, this map will be your navigator, helping you see at a glance which region a given virtual address falls into.
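To make the raw format concrete, here is a minimal Python sketch that splits one line of /proc/<pid>/maps into its fields. It is not how procmap itself works; the names MapEntry and parse_maps_line are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass
class MapEntry:
    start: int    # first virtual address of the mapping
    end: int      # first address past the end of the mapping
    perms: str    # e.g. "r-xp": read/write/execute plus private or shared
    offset: int   # offset into the backing file
    dev: str      # device as major:minor
    inode: int    # inode of the backing file (0 for anonymous mappings)
    path: str     # backing file or a pseudo-name such as "[heap]"; may be empty


def parse_maps_line(line: str) -> MapEntry:
    # The first five fields are whitespace-separated; the optional path
    # comes last and may itself contain spaces, so split at most 5 times.
    parts = line.split(None, 5)
    addr_range, perms, offset, dev, inode = parts[:5]
    path = parts[5].strip() if len(parts) > 5 else ""
    start_s, end_s = addr_range.split("-")
    return MapEntry(int(start_s, 16), int(end_s, 16), perms,
                    int(offset, 16), dev, int(inode), path)


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    # Print this process's own mappings with their sizes in KB.
    with open("/proc/self/maps") as f:
        for raw in f:
            e = parse_maps_line(raw)
            size_kb = (e.end - e.start) // 1024
            print(f"{e.start:016x}-{e.end:016x} {e.perms} {size_kb:>8} KB  {e.path}")
```

Even with the fields separated and sizes computed, the output is still a long flat list, which is exactly the gap a visual tool like procmap is meant to fill.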
When we analyze the NULL trap page later, you'll thank this tool—it gives you an intuitive look at how the kernel "locks down" the very first page at the bottom of a process's address space.
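For a rough user-space glimpse of that locked-down region, the sketch below checks whether any mapping in a maps listing covers address 0 and, on Linux systems that expose it, prints the vm.mmap_min_addr sysctl (the lowest address an unprivileged process is allowed to map). The helper name covers_address is hypothetical, and this is only an illustration of the idea, not a substitute for procmap.

```python
import os


def covers_address(maps_text: str, addr: int) -> bool:
    """Return True if any mapping in maps-format text contains addr."""
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        # The first field of each line is "start-end" in hexadecimal.
        start_s, end_s = line.split()[0].split("-")
        if int(start_s, 16) <= addr < int(end_s, 16):
            return True
    return False


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        maps_text = f.read()
    print("address 0 is mapped:", covers_address(maps_text, 0))

    # Assumption: the sysctl is exposed at this path on the running kernel.
    sysctl = "/proc/sys/vm/mmap_min_addr"
    if os.path.exists(sysctl):
        with open(sysctl) as f:
            print("vm.mmap_min_addr:", int(f.read()))
```

On a typical system, nothing is mapped at address 0, which is why dereferencing a NULL pointer faults instead of silently reading data.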
Beyond that, to ensure the following experiments run smoothly, double-check that the standard compilation toolchain is installed on your system. We'll be doing most of our work on x86_64, but if you want to try another architecture, such as reproducing a crash on an ARM board, make sure your cross-compilation toolchain is ready as well.
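A quick way to run that check is a small Python sketch like the one below, which looks for each tool on PATH. The tool lists are assumptions: gcc and make stand in for "the standard toolchain", and aarch64-linux-gnu-gcc is a Debian/Ubuntu-style name for a 64-bit ARM cross-compiler that may differ on your distribution or board. The helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the tools from names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Assumption: these names represent the native build toolchain.
    native = ["gcc", "make"]
    # Assumption: Debian/Ubuntu-style prefix for a 64-bit ARM cross-compiler.
    cross = ["aarch64-linux-gnu-gcc"]

    print("missing native tools:", missing_tools(native) or "none")
    print("missing cross tools:", missing_tools(cross) or "none")
```

A missing cross-compiler only matters if you plan to follow along on a non-x86_64 target.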
Tools in place, code in place. Next up, we're going to do something mischievous: deliberately set a trap for the kernel and watch it fall in.