2.3 The Kernel Debugging Panorama — When to Use What
Alright, we've classified bugs and figured out why they make the system misbehave.
Now the question is: with so many bugs and so many scenarios, which weapon do we actually reach for?
If you take away only one thing from this chapter, let it be this: there is no universal debugging tool.
It's like being a doctor. When a patient comes in, you don't just order an MRI right off the bat. If the patient just scraped their arm, an MRI is a massive waste of resources and won't show anything useful. Conversely, if the patient has intracranial bleeding, slapping a Band-Aid on them is malpractice.
Debugging is no different. KGDB is powerful, but using it in a production environment is asking for trouble. printk is rudimentary, but at the scene of a kernel panic, it might be your only hope.
In this section, we'll do two things:
- Map out a panorama showing which tools are available at different stages and in different scenarios.
- Create a matrix mapping these tools to bug types, so you know exactly "which row to look at to fix this kind of bug."
Phase 1: Development Stage — The God's-Eye View
When you're writing kernel code or driver modules, this is when you have the maximum privileges.
At this point, the system is usually running on a development board or in your own virtual machine. You can crash it anytime and reboot anytime. You can oversee the entire system's running state like a god.
In this phase, what we pursue is visibility.
1. Code-Level Instrumentation
This is the most primitive, but often the most effective method.
- `printk()` and its family: the `printf` of the kernel.
  - You can use `pr_info()` to print routine information and `pr_debug()` for debug information.
  - Tip: use log levels (like `KERN_DEBUG`) to control console output so your key messages don't get drowned in the noise.
- Dynamic Debug:
  - Sometimes you don't want to recompile the kernel (too slow); you just want to temporarily flip the debug switch on a specific module.
  - By mounting `debugfs` and writing rules to its dynamic debug control file (typically `/sys/kernel/debug/dynamic_debug/control`), you can enable or disable `pr_debug()` statements at runtime. It's like installing an "adjustable valve" in your code.
- Stack Dumps:
  - When a certain condition triggers, you can manually call `dump_stack()`.
  - It prints which task and CPU you are on, plus the function call chain.
  - Use case: when you don't know "who called this function" or "where exactly the code went," this trick is extremely useful.
- Assertions:
  - Use `BUG_ON()` or `WARN_ON()`.
  - If the condition evaluates to true, `BUG_ON()` directly triggers a panic (system halt), while `WARN_ON()` only prints a warning along with a stack trace.
  - Warning: never abuse `BUG_ON()` in production environments; it will kill the system outright.
  - A combined sketch of these instrumentation calls follows this list.
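To make these calls concrete, here is a minimal sketch (not taken from any real driver) showing how `pr_info()`, `pr_debug()`, `dump_stack()`, and `WARN_ON()` might sit together. The module name `mydrv`, the function, and the queue-length threshold are hypothetical; the kernel APIs are the ones discussed above.

```c
/* Minimal instrumentation sketch for a hypothetical driver "mydrv". */
#define pr_fmt(fmt) "mydrv: " fmt	/* prefix every message from this file */

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/bug.h>
#include <linux/errno.h>

static int mydrv_handle_request(int queue_len)
{
	pr_info("handling request, queue_len=%d\n", queue_len);

	/* Compiled out unless DEBUG is defined or dynamic debug enables it. */
	pr_debug("detailed state: queue_len=%d\n", queue_len);

	if (queue_len > 64)
		dump_stack();	/* who called us with such a long queue? */

	/* Non-fatal assertion: prints a warning and a stack trace, then continues. */
	if (WARN_ON(queue_len < 0))
		return -EINVAL;

	return 0;
}
```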
2. Debug Hooks
Sometimes you don't want to keep printing logs; you just want to intervene at a critical moment.
- debugfs interface:
  - Create a file in `debugfs`.
  - A userspace program just needs to run `echo 1 > /sys/kernel/debug/my_module/debug_trigger` to trigger a specific kernel function. This is much more convenient than `ioctl`. (A sketch of such a trigger file follows this list.)
- ioctl hooks:
  - A traditional approach: write a dedicated character device driver paired with a private `ioctl` command.
  - Although it's a bit more tedious to write, it's very reliable if you need to pass complex parameters to the kernel for testing.
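Here is a minimal sketch of such a debugfs trigger. The module name `my_module`, the `debug_trigger` file, and the `run_selftest()` function mirror the hypothetical example above; `debugfs_create_dir()`, `debugfs_create_file()`, and the `file_operations` wiring are the standard kernel interfaces.

```c
/* Sketch of a debugfs "trigger" hook for a hypothetical module. */
#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/uaccess.h>

static struct dentry *my_dir;

static void run_selftest(void)
{
	pr_info("my_module: selftest triggered from debugfs\n");
}

static ssize_t trigger_write(struct file *file, const char __user *buf,
			     size_t count, loff_t *ppos)
{
	char cmd = 0;

	if (count && get_user(cmd, buf))
		return -EFAULT;

	if (cmd == '1')			/* echo 1 > .../debug_trigger */
		run_selftest();

	return count;			/* consume everything that was written */
}

static const struct file_operations trigger_fops = {
	.owner = THIS_MODULE,
	.write = trigger_write,
};

static int __init my_module_init(void)
{
	my_dir = debugfs_create_dir("my_module", NULL);
	debugfs_create_file("debug_trigger", 0200, my_dir, NULL, &trigger_fops);
	return 0;
}

static void __exit my_module_exit(void)
{
	debugfs_remove_recursive(my_dir);
}

module_init(my_module_init);
module_exit(my_module_exit);
MODULE_LICENSE("GPL");
```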
3. Source-Level Debugging — KGDB
If you're a developer coming from userspace, the thing you probably miss the most is GDB.
The good news is that the kernel also has a stub called KGDB.
- How to play: you need two machines (or one machine running two VMs). One runs the kernel, the other runs GDB.
- Connection method: via serial port or network.
- What you can do:
  - Set breakpoints.
  - Single-step execution.
  - Inspect variables.
  - Modify variables (this is some serious dark magic).
- The cost: when you hit a breakpoint, the entire kernel pauses. This means all tasks stop, network interrupts drop packets, and the watchdog might bite. Therefore, this is strictly for the development stage.
Phase 2: Testing and QA Stage — Automation and Checkups
The code is written. Next come unit tests, integration tests, and the QA team.
The goal of this stage is to uncover as many hidden dangers as possible before handing the code to users. Manual intervention decreases here, and the demand for tool automation increases.
1. Dynamic Analysis
Let the program run and watch it with specialized tools.
- Memory Checkers:
  - Such as KASAN (Kernel Address Sanitizer).
  - What can it find? Out-of-bounds accesses, use-after-free, double-free.
  - Principle: it catches illegal accesses by placing "poisoned" redzones around allocations and tracking their validity in shadow memory.
  - Cost: it significantly slows down the system and doubles memory usage. Turn it on during development and testing, but remember to turn it off before going live.
  - An example of the kind of bug it catches appears after this list.
- Undefined Behavior Checkers:
  - UB (Undefined Behavior) is the root of all evil in C.
  - Examples: signed integer overflow, out-of-range shift operations.
  - The kernel has a dedicated UBSAN (Undefined Behavior Sanitizer) to catch these things.
- Lock Debugging Tools:
  - Lockdep: a god-tier tool built into the kernel.
  - It doesn't require you to use special lock types; it dynamically tracks lock dependencies and acquisition order at runtime.
  - What can it find? Deadlocks, potential deadlock risks, and self-deadlocks (recursively acquiring a lock you already hold).
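To ground this, here is the kind of bug a KASAN-enabled test kernel flags immediately. The structure and function are hypothetical; the point is the use-after-free pattern, which KASAN reports together with the allocation and free stack traces.

```c
/* Hypothetical use-after-free that KASAN reports as a slab-use-after-free. */
#include <linux/slab.h>

struct session {
	int id;
	char name[32];
};

static int session_teardown_buggy(struct session *s)
{
	int id;

	kfree(s);		/* the object is returned to the allocator here... */
	id = s->id;		/* ...but read again afterwards: use-after-free */

	return id;
}
```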
2. Static Analysis
Find faults just by looking at the source code without running it.
- Tools like Sparse (checks for semantic errors, like mixing userspace/kernel pointers) and Coccinelle (semantic patches). An example of the pointer-mixing case appears after this list.
- Purpose: Although there can be false positives, running them before code is merged into mainline can save you a lot of midnight wake-up calls. They can uncover security vulnerabilities and coding standard violations.
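As a taste of what Sparse catches, the sketch below misuses the kernel's `__user` address-space annotation. The handler and its name are hypothetical; `__user` and `copy_from_user()` are the real mechanisms Sparse checks, and running `make C=1` invokes Sparse during the build.

```c
/* Hypothetical ioctl-style handler; build with `make C=1` to run Sparse. */
#include <linux/uaccess.h>
#include <linux/errno.h>

static long set_limit(void __user *arg)
{
	int limit;

	/* Wrong (shown commented out): dereferencing userspace memory directly.
	 * Sparse flags the address-space mismatch for code like:
	 *	limit = *(int *)arg;
	 */

	/* Right: cross the user/kernel boundary explicitly. */
	if (copy_from_user(&limit, arg, sizeof(limit)))
		return -EFAULT;

	return limit;
}
```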
3. Code Coverage Analysis
- Tools: Gcov / Lcov.
- Purpose: Confirm exactly how much of your code your tests are actually hitting.
- Why it matters: If 30% of your code branches have never been executed by tests, the bugs hiding in there are ticking time bombs.
- Goal: Achieving 100% coverage is hard, but it should be the direction we strive for.
Phase 3: Production and Runtime — Look, Don't Touch
The system has been delivered to the customer or is running critical online business.
At this point, your overriding principle is: do not stop the business. Unless the system has already crashed, you cannot use system-pausing methods like KGDB.
What you need is non-intrusive monitoring.
1. Monitoring and Tracing Tools
This tier of tools is extremely rich and represents one of the most powerful aspects of the modern Linux kernel.
- Tracing Infrastructure:
  - Ftrace: the kernel's built-in tracer; it can show function call graphs and latencies.
  - Perf: the holy grail of performance analysis; it can show CPU cache hit rates and context-switch counts, and can be used to identify hot functions.
  - eBPF: the current rising star. It lets you run small sandboxed programs inside the kernel to do almost anything: monitor network packets, measure filesystem latency, or even intercept system calls.
  - LTTng: suited to very large-scale tracing.
- Kprobes (Kernel Probes):
  - It's like burying a "landmine" in your code.
  - Kprobe: intercepts function entry. (The older Jprobe interface served this purpose but has been removed from recent kernels.)
  - Kretprobe: intercepts function return.
  - When the program executes to that point, it triggers your pre-written handler. When you're done, just remove the probe; no source code modifications needed. (A minimal kprobe sketch follows this list.)
- Watchdogs and Soft/Hard Lockup Detectors:
  - The kernel has built-in detection mechanisms called softlockup and hardlockup detectors.
  - If a CPU gets stuck in a kernel-mode loop and can't get out, the watchdog will sound the alarm, or even panic directly and leave a log.
- Magic SysRq:
  - For when the system looks dead, but you can still type on the keyboard.
  - Press `Alt + SysRq + [command]`.
  - Common commands:
    - `t`: display the state of all current tasks.
    - `p`: display the current CPU's registers and stack.
    - `c`: force a crash (to work with kdump).
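To show how small a kprobe really is, here is a minimal module sketch that logs every entry into a probed function. The probed symbol `kernel_clone` is an illustrative choice that varies between kernel versions; `register_kprobe()`, `unregister_kprobe()`, and the `pre_handler` signature are the standard kprobes API.

```c
/* Minimal kprobe sketch: log every entry into the probed function. */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>

static int probe_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("kprobe hit: %s entered (pid %d)\n",
		p->symbol_name, current->pid);
	return 0;		/* let execution continue normally */
}

static struct kprobe kp = {
	.symbol_name = "kernel_clone",	/* illustrative target symbol */
	.pre_handler = probe_pre,
};

static int __init probe_init(void)
{
	return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");
```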
2. Debug Hooks and Logs
- The hooks here mainly refer to backdoors left via `debugfs` or `ioctl`.
- Logs: the `systemd` journal and `dmesg`. Never underestimate logs; the final clues to many baffling issues lie in the last few lines of `dmesg`.
Phase 4: Post-Mortem Analysis — The Autopsy Report
The system has crashed and rebooted. Now all you have is a vmcore file (or a photo of the screen).
What you need to do now is perform an autopsy, like a forensic pathologist.
1. Kernel Oops Analysis
- What is an Oops: When the kernel encounters an error it can't recover from (like a null pointer dereference), it prints a diagnostic message.
- Contents: Register values, stack trace, and the location of the faulty code.
- How to do it: Take the Oops text and cross-reference it with the source code. Although primitive, most simple bugs can be pinpointed to an exact line number via the IP (Instruction Pointer) in the stack trace.
2. Kdump and Crash
- Kdump: This is a mechanism. When the main kernel crashes, it boots a spare, minimal kernel (the Capture Kernel). The spare kernel dumps the main kernel's memory to disk.
- Crash tool: this is an analysis tool. You use the `crash` command to open the `vmcore` file.
- What you can do:
  - `bt` (backtrace): view the process stack at the time of the crash.
  - `ps`: see which processes were in the system at that time.
  - `kmem`: inspect kernel memory structures, like whether a certain slab is full.
  - This is critical for analyzing "why it deadlocked" or "who corrupted the memory."
The Trade-off Between Tools and Hardware/Software
When choosing tools, we must look not only at functionality but also at our resources.
- Hardware Constraints:
  - Kdump: requires reserving a chunk of memory for the capture kernel. On an embedded router with only 64MB of RAM, you might not be able to afford that luxury.
  - KASAN: doubles memory usage. If your board has very tight memory, running it might directly trigger an OOM.
  - KGDB: requires a serial or network port. Some closed devices don't even expose these ports.
- Software Constraints:
  - Your kernel configuration (`.config`) might have disabled many debugging options for the sake of performance and size.
  - For example: if `CONFIG_KPROBES` isn't enabled, you can't use kprobes.
  - Static analysis doesn't need hardware resources; it only costs compilation time.
Tool Selection Matrix (Quick Reference)
Finally, to make it easy for you to look up later, we've condensed everything we just discussed into a few tables. Pin them next to your workstation—or at least keep a mental impression of them.
Table 2.1: Development/Coding Stage — You Have the Power
| Scenario | Recommended Tool/Technique | Notes |
|---|---|---|
| Print debugging | printk, pr_debug, dynamic debug | Fastest to pick up, highest efficiency |
| Stack info | dump_stack() | View the call path |
| Breakpoint debugging | KGDB | Requires two machines, pauses the kernel |
| Assertions | BUG_ON(), WARN_ON() | BUG_ON will directly panic, use with caution |
| Manual triggering | debugfs, ioctl hooks | Convenient for dynamic testing |
Table 2.2: Testing/QA Stage — The Bug Hunt Sweep
| Scenario | Recommended Tool/Technique | Notes |
|---|---|---|
| Memory corruption | KASAN | Must-have for dev/QA, a performance killer |
| Locks/Deadlocks | Lockdep | Incredibly powerful, a shame not to run it |
| Undefined behavior | UBSAN | Catches integer overflows, bitwise operation errors |
| Code standards | Sparse, Coccinelle | Static checking, finds rookie mistakes |
| Coverage | Gcov, Lcov | Ensures your tests aren't running in vain |
Table 2.3: Production/Runtime — Observing from the Shadows
| Scenario | Recommended Tool/Technique | Notes |
|---|---|---|
| Performance analysis | Perf, Ftrace | Top choice for hotspot analysis |
| Dynamic tracing | eBPF, SystemTap | Powerful and safe, no reboot required |
| Probes | Kprobes, Kretprobes | Dynamically insert monitoring points |
| Hang detection | Watchdog, Softlockup detector | Automatic alerts |
| Emergency rescue | Magic SysRq | Last resort when the system is unresponsive |
Table 2.4: Post-Mortem Analysis — The Autopsy
| Scenario | Recommended Tool/Technique | Notes |
|---|---|---|
| Crash scene | Oops analysis | Relies on screen photos or serial logs |
| Memory dump | Kdump (generate) + Crash (analyze) | Standard in server environments, limited in embedded |
| Log review | dmesg, /var/log/messages | The most fundamental information source |
Table 2.5: Targeted Remedies for Different Bug Types (Abridged)
| Bug Type | Primary Tool | Secondary Tool |
|---|---|---|
| Memory leak | Kmemleak | KASAN |
| Memory corruption | KASAN | SLUB debug |
| Deadlock | Lockdep | Crash stack analysis |
| Logic error | KGDB / printk | Dynamic debug |
| Performance issue | Perf / Ftrace | LTTng |
| Concurrency/Race condition | Lockdep / KCSAN (Kernel Concurrency Sanitizer) | Kprobes |
Chapter Echoes
Returning to the question we asked at the beginning of this chapter: why is kernel debugging so hard?
Because you're dealing with a system that runs concurrently across every CPU, with no central point of control and no safety net.
We spent a lot of space in this chapter on "classification"—classifying bugs, classifying debugging stages. This might seem dry, even like rote memorization.
But this is exactly what separates professional debuggers from amateurs.
Amateurs see the system hang and just reboot, or blindly sprinkle a bunch of printk everywhere.
Professionals see the system hang and quickly run through this panorama in their heads:
- Is it a deadlock? (Is `Lockdep` enabled?)
- Is it memory corruption? (Can `KASAN` reproduce it?)
- Or is it simply a performance issue? (Fire up `Perf` to check hotspots.)
The tools themselves are just moves; judging the scenario is the inner strength.
In the next chapter, we'll truly get our hands dirty.
We'll start with the oldest, most unadorned, yet never-outdated method—code instrumentation. We'll dive deep into the mechanisms of printk to see exactly how it gets characters from kernel space onto your screen. Don't assume it's simple; sometimes, the simplest tools, when mastered, become legendary skills.
Exercises
Exercise 1: Understanding
Question: In C language development, if (ptr = NULL) is a classic syntactic flaw. What operation does this statement actually perform? How does it differ from the developer's original intent, if (ptr == NULL), in terms of compiler behavior (especially regarding compiler optimizations and static analysis)?
Answer and Analysis
Answer: This statement is actually an assignment operation—it assigns NULL to ptr, and then evaluates the value of ptr (which is false/0). This is completely different from the intended equality check. Modern compilers (with warnings like -Wall enabled) will usually detect this obvious assignment behavior and issue a warning; static analysis tools will classify it as a potential logic flaw.
Analysis: This question tests your understanding of "syntactic flaws." In C, = is the assignment operator, while == is the relational operator. if (ptr = NULL) will first assign NULL to ptr, causing ptr to become a null pointer (and creating a risk of leaking the memory ptr originally pointed to), and then the result of the expression is 0 (false), causing the if branch not to execute. This is a typical logic error, and it also falls under syntactic misuse.
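A minimal userspace illustration of the pitfall (the file and function names are arbitrary); compiling with `gcc -Wall -c` produces the telltale "suggest parentheses around assignment used as truth value" warning on the buggy version:

```c
/* assign_vs_compare.c -- compile with: gcc -Wall -c assign_vs_compare.c */
#include <stddef.h>

void release(void *p);

void buggy(void *ptr)
{
	/* Assignment, not comparison: ptr is overwritten with NULL, the
	 * expression evaluates to 0, and the branch never runs. */
	if (ptr = NULL)
		release(ptr);
}

void fixed(void *ptr)
{
	/* The intended equality test. */
	if (ptr == NULL)
		return;
	release(ptr);
}
```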
Exercise 2: Application
Question: In Linux kernel development, code coverage and dynamic analysis tools (like KASAN) are common debugging methods. Suppose you enable KASAN during the development stage, run your test suite, and achieve 100% code coverage. Does this mean your code is completely free of memory corruption risks (like Use-After-Free or out-of-bounds access)? Please explain why based on the characteristics of "dynamic analysis."
Answer and Analysis
Answer: No. 100% code coverage only means that every line of code has been executed at least once. It does not guarantee that the code is correct under all possible execution flows, all concurrent scenarios, or all input boundaries. KASAN can only detect illegal memory accesses triggered during "this specific run"; it cannot detect potential vulnerabilities not triggered by specific test cases.
Analysis: This question tests your "application" skills. KASAN (Kernel Address Sanitizer) is a dynamic analysis tool whose core principle is to detect memory accesses at runtime. If a certain code path that leads to an out-of-bounds access is not triggered during testing (e.g., a specific asynchronous race condition or a rare boundary input), KASAN cannot report an error. Therefore, high coverage is a necessary condition for high-quality testing, but not a sufficient condition to guarantee a complete absence of bugs.
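As a small, hypothetical illustration of why coverage is not correctness: any test that calls the function below gives it 100% line coverage, yet the out-of-bounds write only happens when `len` reaches `NAME_LEN`, and KASAN stays silent unless a test supplies exactly that input.

```c
#include <linux/string.h>
#include <linux/types.h>

#define NAME_LEN 16

struct record {
	char name[NAME_LEN];
};

/* Every line executes for any input, so line coverage reads 100%,
 * but the terminating write goes out of bounds only when len reaches
 * NAME_LEN -- an input the test suite may never generate, in which
 * case KASAN has nothing to report. */
void store_name(struct record *r, const char *src, size_t len)
{
	memcpy(r->name, src, len);
	r->name[len] = '\0';
}
```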
Exercise 3: Thinking
Question: When the Linux kernel encounters an unrecoverable error (like a fatal exception), it usually triggers a 'Kernel Panic'. In contrast, a 'Kernel Oops' typically refers to a non-fatal but serious error (like a page fault). From the perspective of "post-mortem analysis," why is it that in a production kernel, even if an Oops occurs, the system state is no longer trustworthy, and a prompt reboot is usually recommended? How does this relate to the 'kdump' debugging technique?
Answer and Analysis
Answer: A Kernel Oops means the kernel has already violated its original design constraints (like accessing an illegal pointer). Although the kernel tries to continue running (for example, by killing the offending process), the memory state may have been corrupted (data structure corruption, abnormal lock holders, etc.). Continuing to run in this "compromised" state could lead to further data corruption or security vulnerabilities, so the state is untrustworthy.
The connection to kdump lies in the fact that kdump's design purpose is precisely to handle this untrustworthy state. When the system crashes (Panic or severe Oops), it uses a clean, reserved spare kernel to capture the main kernel's memory dump. Without using kdump or a similar dumping mechanism, once the system completely deadlocks from corruption (Panic), developers will lose the critical clues needed to analyze the root cause.
Analysis: This is a deep-thinking question involving kernel state consistency and debugging strategies. 1. State trustworthiness: An Oops indicates that "unpredictable behavior" has already occurred, and continuing to run is a gamble based on luck. 2. Debugging strategy: Post-mortem analysis relies on scene data. kdump provides a mechanism where, when the main kernel is "terminally ill," a spare kernel performs the "autopsy." This corresponds to the knowledge point about "Post-mortem analysis," which involves using crash dumps to analyze fatal errors that cannot be debugged on the live scene.
Key Takeaways
Kernel debugging differs fundamentally from userspace debugging because the kernel environment lacks isolation and protection. A minor pointer error can paralyze the entire system and lose context. Therefore, the design of the debugging toolchain is essentially about extracting information through specific constraints in an extremely restricted environment.
Effective debugging begins with the precise classification of bugs, just as a doctor must diagnose before treating. Debuggers must distinguish between logic errors, memory corruption (like UAF, out-of-bounds), race conditions, or resource leaks, because the nature of the bug dictates the choice of subsequent methods. Misjudging the bug type is the most common cause of debugging failure.
There is no universal, one-size-fits-all debugging tool. We must choose targeted "weapons" based on the scenario where the bug occurs and the development stage. During the development stage, intrusive tools like KGDB or printk can be used, but in production environments, we must rely on non-intrusive tracing methods like eBPF and Kprobes to avoid interrupting business workflows.
Tool selection is constrained by hardware resources and software configurations. Embedded devices might lack the memory to run Kdump, and kernels without specific compilation options enabled (like CONFIG_KGDB) cannot use the corresponding advanced debugging features. This requires developers to precisely build and configure the system during the environment preparation stage.
Post-mortem analysis is a crucial part of the debugging system. When the system has already crashed and rebooted, debuggers must use the vmcore file generated by Kdump, combined with the Crash tool, to perform an "autopsy." By analyzing register states, stack backtraces, and memory structures at the time of the crash, we can pinpoint the root cause of the panic or deadlock.