Chapter 10: The Undying Echo: When the Kernel Gives Up
10.1 Technical Preparation and Site Inspection
Debugging some problems feels like searching for the truth at an aircraft crash site with no black box.
As an engineer who frequently works with the kernel, you will eventually face that moment: the screen stops scrolling, the keyboard lights flicker erratically, and the last line of logs coldly displays "Kernel panic - not syncing". At this moment, the operating system that usually covers for you and silently recovers from errors suddenly "shows its hand"—it refuses to continue executing because it realizes that moving forward will only cause greater destruction.
At this point, the most terrifying thing isn't that the system crashed, but that your hands are empty. You didn't capture a memory image at the time of the crash, you don't know which driver crossed the line last, and you don't even know if this was a hardware failure or a vicious software logic deadlock.
The core mission of this chapter is to build a "post-mortem" capability. We will explore what the kernel goes through when it hits an unrecoverable error (a panic), how the various "deadlock" detectors work, and how to capture the crash scene using kexec and kdump. But before we dive into this tough battle, we need to confirm that our tools are complete and our ammunition is sufficient.
Technical Preparation
If you have read Chapter 1, you are already equipped. The environment for this chapter is exactly the same as the one set up there, so we won't reinvent the wheel.
To ensure you can reproduce these low-level mechanisms along with me, you need to meet the following conditions:
- Development and Build Environment
  - A modern Linux distribution (Ubuntu 20.04+ or Fedora are recommended choices, as newer toolchains offer better support for new kernel features).
  - A complete kernel compilation toolchain (gcc, make, bc, etc.).
  - Sufficient disk space: if you plan to enable kdump, you need at least as much free disk space as you have physical memory in order to save the crash dump.
- Experimental Environment
  - We strongly recommend using a QEMU/KVM virtual machine. Why? Because some experiments in this chapter actively trigger a kernel panic or freeze the CPU. In a virtual machine, recovering is just a matter of closing a window; on a physical machine, you are looking at a forced reboot. More importantly, debugging on physical hardware is often limited by the hardware itself (for example, if the serial port baud rate isn't fast enough, debug messages will get dropped).
  - If you must play on physical hardware, make sure you have a JTAG debugger or at least a reliable serial connection.
- Code Retrieval
All example code, patches, and test kernel modules have been packaged up. You can pull them locally with the following command:
git clone https://github.com/PacktPublishing/Linux-Kernel-Debugging
cd Linux-Kernel-Debugging
This isn't just a bunch of code; it's the various detonators we'll use to "blow up" the kernel in this chapter.
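Before going further, it can help to turn the checklist above into a quick pre-flight check. The script below is only a minimal sketch, not something shipped in the book's repository: the helper names (missing_tools, dump_space_ok, and so on) are made up for illustration, and /var/crash is merely an assumed dump location, so adjust it to wherever you intend to save crash dumps.

```shell
#!/bin/sh
# Pre-flight check sketch (hypothetical helper, not part of the book's repo).

# Print each named tool that is NOT found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed when free disk space (KiB) is at least the size of RAM (KiB).
# Usage: dump_space_ok MEM_KB FREE_KB
dump_space_ok() {
    mem_kb=${1:-0}
    free_kb=${2:-0}
    [ "$free_kb" -ge "$mem_kb" ]
}

# Total physical memory in KiB, as reported by /proc/meminfo.
mem_total_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null
}

# Free space in KiB on the filesystem that holds the given directory.
free_space_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Assumed dump location; fall back to / if it does not exist yet.
dump_dir=${DUMP_DIR:-/var/crash}
[ -d "$dump_dir" ] || dump_dir=/

missing=$(missing_tools gcc make bc)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "toolchain: ok"
fi

if dump_space_ok "$(mem_total_kb)" "$(free_space_kb "$dump_dir")"; then
    echo "disk space on $dump_dir: ok"
else
    echo "disk space on $dump_dir: less than physical memory"
fi
```

The disk-space comparison follows the rule of thumb from the list above (free space at least equal to RAM). It is deliberately conservative and does not account for dump compression or filtering.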
💡 Translator's Note / Author's Reminder: The "Technical Preparation" here actually corresponds to the first section of the original book. As a continuation of the hands-on style of the entire book, although this section is short, it serves as a "safety checkpoint". In the field of kernel debugging, if the environment isn't right, all reproduction attempts turn into pure guesswork.
Please make sure your kernel source tree is ready. When we modify panic.c and configure kexec later, we will need to compile the kernel ourselves and replace the running kernel image. Don't be lazy and settle for the distribution's default kernel: that one is meant for "users", not for "kernel hackers".
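Since we will be building our own kernel, it is worth checking early that its configuration has the crash-capture options switched on. The following is a rough sketch under a few assumptions: it is run from the top of the kernel source tree (so the configuration lives in ./.config), the helper names are invented for illustration, and the four options listed are the ones the upstream kdump documentation commonly calls out. Your architecture or kernel version may need more.

```shell
#!/bin/sh
# Sketch: report whether kdump-related options are set in a kernel .config.
# The helper names here are illustrative, not taken from the book's repo.

# Succeed if OPTION is set to y or m in CONFIG_FILE.
# Usage: config_enabled CONFIG_FILE OPTION
config_enabled() {
    grep -Eq "^$2=(y|m)\$" "$1"
}

# Print one status line per option of interest.
report_kdump_options() {
    cfg_file=$1
    for opt in CONFIG_KEXEC CONFIG_CRASH_DUMP CONFIG_PROC_VMCORE CONFIG_DEBUG_INFO; do
        if config_enabled "$cfg_file" "$opt"; then
            echo "$opt: enabled"
        else
            echo "$opt: not set"
        fi
    done
}

# Default to ./.config; pass another path as the first argument if needed.
cfg=${1:-.config}
if [ -r "$cfg" ]; then
    report_kdump_options "$cfg"
else
    echo "no readable config at $cfg"
fi
```

Any option reported as "not set" can be enabled through make menuconfig before you rebuild the kernel.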
If you're ready, let's officially enter the kernel's darkest moment—starting with Panic.