
Chapter 12: The Debugging Arsenal: No Silver Bullet

There is a class of problems that appear to be about tools, but are actually philosophical in nature.

In this chapter, we tackle exactly such a problem: how much can you really trust your tools?

Many beginners (and even experienced engineers) share an obsession: finding a "universal debugger" — just run it, and it highlights all bugs in red, points out every memory leak, and refactors your code as a bonus.

Frankly, this obsession is dangerous. It breeds a false sense of security.

The mission of this chapter is to shatter that illusion. We take static analysis, dynamic analysis, kernel crash dumps (kdump), fuzzing, and code coverage — tools that seem unrelated — and piece them together into a complete "debugging map." You will see that no single tool can cover all scenarios alone — and that is exactly why we need to combine them.

There is no silver bullet. But if you melt all those silver fragments and forge them into a sword, that is an entirely different story.


Let's begin.

12.1 The Remaining Weapons: From Static to Post-Mortem

This chapter is a bit of a "mixed bag" — but do not underestimate these miscellanea. We previously covered many dynamic tracing and live debugging techniques, but they all share a limitation: you need the system to be alive, or you need to be able to reproduce the problem.

What if the issue happens at 3 AM while you are asleep? What if the bug is hiding in some extremely obscure code path?

In this section, we fill in the final pieces of the puzzle.

⚫️ kdump: Taking an Autopsy Photo of the Kernel

First, let's talk about that blood-pressure-spiking scenario: Kernel Panic.

When a production server suddenly freezes, the screen locked on that dreaded "Kernel panic - not syncing: Attempted to kill init!", what do you do? Reboot? And then pray it does not happen again?

No, what you need is an autopsy report.

That is exactly what kdump is for.

You can think of kdump as the kernel's "black box." When the airplane (the main kernel) goes down, the black box (the capture kernel) is ejected, recording all the data at the crash scene (a memory image).

But the "black box" analogy is slightly inaccurate: a real black box is an independent device, whereas kdump's "capture kernel" is actually another stripped-down Linux kernel that you preload into memory. Normally it does nothing, sitting dormant in its reserved memory region; only when the main kernel crashes is it "woken up."

Behind this lies a mechanism called kexec. kexec allows the kernel to bypass the sluggish BIOS/firmware boot process and jump directly to the entry address of another kernel. It is like playing a parkour game where you do not need to return to the starting point — you are instantly teleported to the next level's entrance.

Back to the "black box": crashkernel=size@offset is a boot parameter that tells the main kernel, "Please reserve this much space (size) at this location (offset) for me; I am going to place my black box there." (In practice you can usually omit the offset and let the kernel pick a suitable base address.)

Configuring and using it typically involves a few steps (a command-level sketch follows the list):

  1. Reserve memory: Add crashkernel=256M to the main kernel's grub configuration (or adjust the size to match your physical memory).
  2. Load the capture kernel: Usually done by kdumpctl or your distribution's kdump service, which uses kexec -p to load a specific vmlinuz (plus its initramfs) into the reserved region, where it stands by for a panic.
  3. Trigger a crash: Never do this in production, but in a test environment you can use echo c > /proc/sysrq-trigger to manually simulate a crash.
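
Here is what those steps look like at the command line. This is a minimal sketch for an RHEL/Fedora-style system with the kdump service installed; package names, image paths, and extra kernel command-line options vary by distribution.

# 1. Reserve memory: add to GRUB_CMDLINE_LINUX, regenerate the grub config, reboot
#      crashkernel=256M
# 2. Let the kdump service load the capture kernel (it calls kexec -p under the hood)
sudo systemctl enable --now kdump
sudo kdumpctl status
# 3. On a scratch machine ONLY: force a panic via Magic SysRq
echo c | sudo tee /proc/sysrq-trigger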

Once the capture kernel boots, you will find that although it is tiny, it can access what the main kernel left behind: /proc/vmcore.

This file is not an ordinary file; it is an ELF-format pseudo-file representing the entire physical memory at the time of the main kernel's crash. You can cp it out, and that becomes your vmcore file.
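
In practice, you rarely do a raw cp: an image of all physical memory can be enormous. Most kdump setups instead run makedumpfile, which compresses the dump and filters out pages you almost never need (the output path below is illustrative):

makedumpfile -c -d 31 /proc/vmcore /var/crash/vmcore   # -c: compress, -d 31: drop zero/cache/user/free pages
cp /proc/vmcore /var/crash/vmcore                      # the brute-force alternative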

Once you have the vmcore, the real forensic work begins. You will need the crash utility.

crash is a powerful user-space analysis tool. Given the dump plus the matching vmlinux (built with debug symbols), it decodes that massive blob of seemingly garbled memory into readable kernel data structures.

$ crash vmlinux vmcore
...
crash> bt    # the call stack at the moment of the panic
crash> ps    # the process list at that moment
crash> log   # the kernel ring buffer (what dmesg would have shown)

This is the power of Post-mortem analysis. Although you cannot change the past, you can at least see how it happened. For those eerie bugs that "only crash once a month," this is almost the only lifeline.

⚫️ Static Analysis: Finding Bugs Without Running Code

Next, let's shift our gaze from "runtime" to "the code itself."

Static Analysis sounds boring — like nitpicking at compile time. But frankly, it might offer the best return on investment of any debugging technique.

Think about it: if the compiler could tell you "this will divide by zero" while you are writing the code, would you still need to wake up at 3 AM to look at dmesg?

The kernel community has a few familiar faces here: sparse, smatch, and Coccinelle (invocation examples follow the list).

  • Sparse: Written primarily by Linus Torvalds, it checks for type errors. For example, if you directly dereference a __user pointer, sparse will warn you immediately.
  • Smatch: Built on top of sparse, it goes a step further by building control flow graphs to check for complex logic errors, such as forgetting to unlock (mutex_unlock).
  • Coccinelle: This is a heavy hitter. It does not just check — it can modify code. It uses a scripting language called Semantic Patches. If you want to replace one API with another across the entire kernel tree, Coccinelle can finish in minutes what would take humans days.
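
All three can be driven straight from the kernel's build system. A minimal sketch, assuming the tools are installed and on your PATH:

make C=1                            # sparse: check files being recompiled (C=2 rechecks everything)
make C=1 CHECK="smatch -p=kernel"   # smatch instead of sparse
make coccicheck MODE=report         # run the semantic patches under scripts/coccinelle/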

Do not ignore these tools. What they catch often starts as mere code smell — things that do not affect compilation but will rot into real bugs over time.

Think of Uninitialized Memory Reads (UMR), or Use-After-Return (UAR) bugs where a function hands out the address of its own stack variable. Some of these are invisible to the human eye no matter how many times you review the code, but a tool spots them instantly.

⚫️ Code Coverage and Fault Injection: Testing Those "Impossible" Paths

If static analysis can find so many problems, why do we still need runtime testing?

Because static analysis does not understand "logic." It knows you forgot to initialize a pointer, but it does not know whether your algorithm will deadlock under network congestion.

Two concepts must be mentioned here: Code Coverage and Fault Injection.

Code Coverage is simple: when you run a test, how much of the code actually gets executed?

If the coverage report shows that line 452 (the error handling path) of your drivers/foobar.c is red (never executed), you should feel uneasy. That path is like a sealed-off corridor during a home renovation — there might be a monster living inside.

In the kernel, we typically use gcov (gcc's native tool) paired with lcov (a graphical frontend) to generate nice HTML reports. The kernel also has a dedicated kcov, which is a low-level interface designed for fuzzing.
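
A sketch of that workflow — lcov was originally written for kernel coverage, and (at least in my setups) capturing with no --directory argument pulls the kernel data straight from debugfs:

# Kernel config (rebuild required):
#   CONFIG_GCOV_KERNEL=y
#   CONFIG_GCOV_PROFILE_ALL=y   # or mark individual dirs/files in their Makefiles
ls /sys/kernel/debug/gcov       # raw data appears here after you exercise the code
lcov -c -o kernel.info          # capture kernel coverage data
genhtml kernel.info -o html/    # render the nice HTML report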

But merely getting a line executed once is not enough. You also need to force the code down those "error paths."

Fault Injection does exactly that. The kernel's fault-injection framework lets you do deliberately nasty things, such as: "make the 100th kmalloc() call fail and return NULL."

This is incredibly powerful.

The vast majority of code runs flawlessly when malloc succeeds, but might crash outright when malloc fails. Through fault injection, you can force code into those pesky and possibly buggy error code paths.
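
The framework is driven through debugfs. A minimal sketch using the failslab fault point (requires CONFIG_FAILSLAB and CONFIG_FAULT_INJECTION_DEBUG_FS; the knobs are documented under Documentation/fault-injection/):

cd /sys/kernel/debug/failslab
echo 10  > probability               # fail ~10% of eligible slab allocations
echo 100 > times                     # stop after 100 injected failures
echo 1   > verbose                   # log a stack trace for each injection
echo Y   > task-filter               # only hit processes that opt in...
echo 1   > /proc/self/make-it-fail   # ...like this shell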

⚫️ Kernel Testing Self-Discipline: Kselftest and KUnit

Kernel development is no longer a free-for-all; we have formal testing frameworks.

Located at tools/testing/selftests is kselftest. It primarily tests the kernel from user space through system calls and filesystem interfaces. Its output format follows TAP (Test Anything Protocol).

KUnit, on the other hand, is a "unit testing" framework inside the kernel. It allows you to directly test internal kernel functions during the kernel build or at runtime — without even booting a full operating system.
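
Both are easy to try from a kernel source tree. A sketch — the selftest target and the KUnit filter are just examples:

make -C tools/testing/selftests TARGETS=sysctl run_tests   # one kselftest collection
./tools/testing/kunit/kunit.py run                         # all KUnit tests, in a throwaway UML kernel
./tools/testing/kunit/kunit.py run 'example*'              # only suites matching a glob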

If you write drivers without writing KUnit tests, it is like driving without a seatbelt. You might be fine, but if something goes wrong, you are on your own.

⚫️ Fuzzing: Firing at Chaos

If testing is using a ruler to measure your code, then Fuzzing is using a machine gun to spray it.

The principle is simple: feed the program massive amounts of random data, malformed data, and garbage data, and see if it crashes.

Sounds stupid? No, this is currently the most effective method for discovering security vulnerabilities.

  • syzkaller / syzbot: syzkaller is Google's coverage-guided kernel fuzzer; syzbot is the automation that runs it around the clock and reports findings upstream. Together they are currently the Linux kernel's nightmare generator (and also its guardian), continuously throwing random system call sequences at the kernel and surfacing countless hard-to-reproduce bugs.
  • AFL (American Fuzzy Lop): The classic coverage-guided fuzzer for user-space programs (a minimal invocation is sketched after this list).
  • Trinity: A fuzzer specifically targeting Linux system calls.
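
To get a feel for the workflow, here is the canonical AFL invocation; the target binary and directories are illustrative, and AFL substitutes each mutated input file for the @@ placeholder:

afl-fuzz -i corpus/ -o findings/ -- ./my_parser @@   # crashes and hangs land in findings/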

If you want to go far in this domain, learning to use fuzzing tools is a required course.

⚫️ Logging and Assertions: The Last Line of Defense

Finally, we return to the most fundamental things: logging and assertions.

On modern Linux systems (systemd), stop using that relic cat /var/log/messages. You need to master journalctl.

# View the kernel log for the current boot
journalctl -k -b 0

# Follow it live
journalctl -k -f

This is not just convenient; it is structured data.

As for Assertions and the BUG()/WARN() macros:

We may have glossed over them in previous chapters, but we must emphasize them here. Adding BUG_ON(condition) or WARN_ON(condition) in your code is like leaving a note for yourself.

WARN_ON() simply prints a warning plus a call trace and the system keeps running; BUG_ON() kills the offending context with a full oops — and, if panic_on_oops is set or it fires in interrupt context, takes the whole machine down.

They are your hypothesis tests on the system state. If BUG_ON(!ptr) triggers, it means your assumption (that ptr cannot be NULL here) was wrong — spectacularly wrong — and the system is no longer trustworthy. It is better to die early.
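
A related knob worth knowing: kernel.panic_on_warn promotes every WARN to a full panic. Fuzzing rigs and strict CI setups often enable it so warnings cannot quietly scroll past:

sysctl kernel.panic_on_warn                        # 0 by default on most distributions
echo 1 | sudo tee /proc/sys/kernel/panic_on_warn   # opt in to "better to die early"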


⚫️ Echoes: No Silver Bullet, But a Sword

At this point, the core message of this chapter — and arguably this entire book — should be crystal clear.

Remember the question we opened with: how much can you really trust your tools?

The answer: do not fully trust any single one, but use every single one.

Fred Brooks has a famous quote that has been overused, but I am going to use it here anyway: No Silver Bullet.

No tool is a magic bullet. -Wall and -Wextra can only tell you about compile-time suspicions; KASAN can find memory errors, but it cannot catch logical deadlocks; kdump lets you see the crash scene, but it cannot prevent the crash from happening; fuzzing can stumble upon vulnerabilities, but it cannot prove your code is free of them.

A true debugging master is not someone who uses one hammer to hit every nail.

They hold a Swiss Army knife:

  • Use compiler warnings as the first line of defense;
  • Use Sparse/Smatch for static checking;
  • Use KASAN/KCSAN for dynamic memory detection;
  • Use ftrace/trace-cmd for behavioral analysis;
  • Use KUnit/kselftest for automated testing;
  • Use kdump/crash for post-mortem autopsies.

Our so-called "better Makefile" (you can find ch3/printk_loglevels/Makefile in this book's GitHub repository) is really an attempt to enforce this discipline. It has multiple targets corresponding to different checking tools. Do not be lazy — put them to use.

Although you have read this far, this is not the end.

It is more like a new beginning.

You are no longer the novice staring blankly at a Kernel Panic. You have the code, the tools, the theory, and more importantly, a methodology for "how to think about low-level problems."

Go forth, tinker, hunt down those bugs lurking deep in memory, and kill them.

That is our sincere hope!