Chapter 4: Seeing the Elephant Through a Needle's Eye — Kernel Probes and Dynamic Tracing

We are staring at a high-speed tangle of chaos.

When you try to understand a complex system like the Linux kernel, the most intuitive tool is "printing." You stuff your code with printf, recompile, run, and observe the output. This is standard practice in user-space development. But in the kernel, it's not just cumbersome—it's a disaster.

Why? Because the kernel is the runtime environment: there is nothing underneath it to keep the system going while it stops to talk. The moment kernel code stops to print a log, the entire system, including the very subsystem responsible for receiving that log, might pause. Worse still, the bug you want to observe might occur in a delicate instant of concurrent racing, or deep within interrupt handling, where even the small timing change caused by an added log line can make it vanish. And if you have to recompile and reboot the kernel just to insert that single line, the bug might slip away like a startled fish.

We need a capability: to insert "hooks" at the entry or exit of any kernel function without interrupting system operation or recompiling the kernel. When the system flows through that point, the hook triggers, records the register states, parameter values, and return values we want to know, and then lets the system continue running as if nothing happened.

This is the core mission of this chapter: building this "God's-eye view" observability capability.

But honestly, this path is not smooth. Old-school kernel debugging requires you to write C kernel modules and manually register probe interfaces named kprobe—this is known as "static probes." It works, but writing them is extremely tedious, and even changing a print format requires recompiling and reloading. Modern Linux kernels offer a more elegant approach: leveraging the ftrace and tracepoints frameworks to dynamically insert probes into the kernel directly via user-space command-line tools or scripts. This is called "dynamic probing," and even revolutionary technologies like eBPF are built on top of these underlying mechanisms.

In this chapter, we will start with the most primitive "static probes," writing code by hand, filling out structures, and handling registers. Then we will gradually free our hands and experience the convenience brought by dynamic tracing and eBPF. Only after experiencing the pain firsthand will you truly understand why modern tools are designed the way they are.

Now, let's start with the most fundamental and versatile mechanism: kprobes.


4.1 The Fundamental Principle of Kprobes: Operating on Functions

Imagine you are a privileged observer inside the system. You want to monitor every "file open" operation that occurs in kernel space, regardless of which process initiated it, and regardless of whether it calls open() or openat().

With traditional debugging methods, you would find the source code of the kernel function do_sys_open, add a printk to the first line, and recompile the kernel. But this is far too clunky. Kprobes provides a lighter-weight way to intervene: it allows you to dynamically "inject" a breakpoint at a specific address within a kernel function.

You can think of it as the Swiss Army knife of kernel debugging.

But this knife has a special blade—it can not only cut into a function but also intercept data when the function returns. To use it well, we first need to understand exactly what it looks like.

Three States of a Probe

Kprobes is not just a simple "set a breakpoint" mechanism. To make debugging more flexible, it provides three insertion points in the execution flow.

Suppose we want to probe the famous kernel function do_sys_open() (the kernel function that user-space ultimately lands in when calling the open(2) system call; for details, see the section "Where System Calls Land in the Kernel"). Through the kprobes infrastructure, you can attach three different types of handlers:

  1. Pre-handler: Triggered before the first instruction of the do_sys_open() function executes. This is the most commonly used hook, typically employed to print function parameters (via registers), inspect the call stack, or decide at this point whether to skip the function's execution entirely.

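  2. Post-handler: Triggered right after the probed instruction has executed (kprobes single-steps the original instruction and then calls this handler). For a probe placed at the entry of do_sys_open(), that means just after the function's first instruction, not when the function is about to return. It is suitable for checking the state immediately after the probed instruction, such as whether a register or memory location has been modified.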

  3. Fault-handler: This is a safety net. If a CPU exception occurs (such as a Page Fault) during the execution of the pre-handler or post-handler, or if kprobes itself encounters an issue while single-stepping an instruction, this handler will be called. Often, it's simply because your handler code accessed an illegal memory address. Without a fault-handler, the kernel might panic directly; with it, you at least have a chance to exit gracefully with an error.

All three handlers are optional. You can set only a pre-handler, or set all three—it entirely depends on your needs.
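The order in which the handlers run around a probed instruction can be modeled in plain user-space C. This is only a toy sketch: struct toy_probe, toy_log() and toy_probe_hit() are hypothetical names invented for illustration, not kernel APIs, and the real handlers also receive a struct pt_regs pointer. It shows that a missing handler is skipped, and that a pre-handler returning non-zero tells the framework it has taken over control flow, so the original instruction and the post-handler are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's struct kprobe. */
struct toy_probe {
    int  (*pre_handler)(struct toy_probe *p);
    void (*post_handler)(struct toy_probe *p);
    char log[64];               /* records the order of events */
};

/* Append an event name to the probe's log, never overflowing it. */
static void toy_log(struct toy_probe *p, const char *event)
{
    strncat(p->log, event, sizeof(p->log) - strlen(p->log) - 1);
}

static int  toy_pre(struct toy_probe *p)  { toy_log(p, "pre,");  return 0; }
static void toy_post(struct toy_probe *p) { toy_log(p, "post,"); }

/* Models one probe hit: pre-handler, probed instruction, post-handler. */
static void toy_probe_hit(struct toy_probe *p)
{
    assert(p != NULL);
    if (p->pre_handler && p->pre_handler(p) != 0)
        return;                 /* handler redirected execution: skip the rest */
    toy_log(p, "insn,");        /* stands for the single-stepped instruction */
    if (p->post_handler)
        p->post_handler(p);
}
```

Calling toy_probe_hit() on a probe with both handlers set leaves "pre,insn,post," in the log; with only a post-handler it leaves "insn,post,".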

Kprobe or Kretprobe?

In addition to the "regular probe" described above, the Linux kernel provides a special probe designed to solve a specific pain point: I want to know what this function's return value is.

This is the Kretprobe (Return Probe).

Why is it needed? A regular kprobe fires at the instruction you probed, not at the moment the function returns. To catch the return value with plain kprobes, you would have to place a probe on every return path of the function and then dig the value out of the right register or stack slot yourself, which is very cumbersome (and strongly tied to the CPU architecture).

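Kretprobe does something clever: at the function entry, it saves the real return address and replaces it with the address of a small trampoline. When the function returns, control lands in the trampoline instead of the caller; the trampoline calls your handler while the return value is still sitting in the return register, and then resumes execution at the saved address. This way, it can easily hand the return value over to you.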
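The return-address trick can be sketched as a small user-space model. Everything here is a hypothetical stand-in: TOY_TRAMPOLINE is a made-up address, struct toy_ret_instance only loosely mirrors the per-call bookkeeping a kretprobe keeps, and a plain variable plays the role of the stack slot that holds the return address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TRAMPOLINE 0xdeadbeefUL   /* made-up trampoline address */

/* Hypothetical per-call record for one intercepted function call. */
struct toy_ret_instance {
    uintptr_t real_ret_addr;   /* where the function would have returned */
    long      retval;          /* captured when the trampoline runs */
};

/* At function entry: remember the real return address, redirect the return. */
static void toy_on_entry(struct toy_ret_instance *ri, uintptr_t *ret_slot)
{
    assert(ri != NULL && ret_slot != NULL);
    ri->real_ret_addr = *ret_slot;
    *ret_slot = TOY_TRAMPOLINE;
}

/* In the trampoline: capture the return value, report where to resume. */
static uintptr_t toy_on_return(struct toy_ret_instance *ri, long return_register)
{
    assert(ri != NULL);
    ri->retval = return_register;   /* e.g. the value in rax on x86-64 */
    return ri->real_ret_addr;
}
```

The caller never notices the detour: after the trampoline finishes, execution continues at the address that was originally in the return slot.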

The registration APIs for these two types of probes are also separate:

  • Regular Kprobe: uses register_kprobe[s]() / unregister_kprobe[s]()
  • Kretprobe: uses register_kretprobe[s]() / unregister_kretprobe[s]()

We will start with the most basic kprobe and leave kretprobe for the slightly more advanced section later.

The Scalpel's Manual: struct kprobe

To actually use this in code, you need to prepare a core data structure: struct kprobe. Think of it as a "surgical checklist"—the kernel relies on this list to know where to insert the blade and what to do once it's in.

The API for registering this structure looks like this:

#include <linux/kprobes.h>
int register_kprobe(struct kprobe *p);

To keep you from getting lost in a sea of structures, we will only focus on the most critical fields (you can leave the rest to their defaults):

  • const char *symbol_name: This is the name of the kernel function you want to "operate on," such as "do_sys_open". Under the hood, the kprobes framework resolves this symbol string into a Kernel Virtual Address (KVA) through the kallsyms machinery (kallsyms_lookup_name()) and populates the internal addr member of the structure. Note: Not all functions can be probed. Some functions are placed on a blacklist (such as kprobes' own internal functions) because probing them could crash or deadlock the kernel; register_kprobe() refuses such targets and returns an error. We will discuss this in detail in the "Limitations of Kprobes" section later.

  • kprobe_pre_handler_t pre_handler: This is a function pointer pointing to your pre-handler code. It will be called before the target instruction executes.

  • kprobe_post_handler_t post_handler: Similarly, this is the function pointer for the post-handler.

  • kprobe_fault_handler_t fault_handler: If your pre- or post-handler triggers an exception, the kernel will jump into this function. Key point: The return value of this function is very specific. Returning 0 means "I can't handle this, hand it over to the kernel's default exception handling mechanism" (this is the usual case); returning a non-zero value such as 1 means "It's fixed, the error has been resolved, continue execution" (this is rare and requires you to truly know what you are doing, such as manually fixing a page table). Note that this member exists only in older kernels: it was removed in Linux 5.14, so code that sets it will not build against newer kernel headers.

Here's an advanced trick

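A kprobe can be attached not only at the beginning of a function but also at an offset within the function. You simply need to set the offset member of struct kprobe, and the probe lands at the symbol's address plus that offset. This is extremely useful when debugging complex assembly code blocks or when you only care about the logic in the latter half of a function. However, the offset must point at the first byte of an instruction. On architectures with variable-length instructions (like x86), an arbitrary offset can easily land in the middle of an instruction. The x86 implementation tries to detect this and reject the registration, but do not count on every architecture catching it for you: a breakpoint written into the middle of an instruction corrupts the code. It's like sticking a needle into a heart right at the moment it beats, so proceed at your own risk.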
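How symbol_name, offset and the blacklist combine into a final probe address can be illustrated with a toy lookup. This is a user-space sketch under stated assumptions: the symbol table, its addresses and the toy_kprobe_internal entry are invented, and toy_resolve() is a hypothetical helper rather than a kernel function. The error codes follow what register_kprobe() is understood to return for an unknown symbol (-ENOENT) and a blacklisted target (-EINVAL); check them against your kernel version.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table entry; the real kernel consults kallsyms. */
struct toy_symbol {
    const char   *name;
    unsigned long addr;          /* made-up address */
    int           blacklisted;   /* 1 if probing must be refused */
};

static const struct toy_symbol toy_symtab[] = {
    { "do_sys_open",         0x1000UL, 0 },
    { "toy_kprobe_internal", 0x2000UL, 1 },
};

/* Resolve symbol_name + offset to a probe address, or return -errno. */
static int toy_resolve(const char *symbol_name, unsigned int offset,
                       unsigned long *addr_out)
{
    size_t i;

    assert(symbol_name != NULL && addr_out != NULL);
    for (i = 0; i < sizeof(toy_symtab) / sizeof(toy_symtab[0]); i++) {
        if (strcmp(toy_symtab[i].name, symbol_name) != 0)
            continue;
        if (toy_symtab[i].blacklisted)
            return -EINVAL;      /* target may not be probed */
        *addr_out = toy_symtab[i].addr + offset;
        return 0;
    }
    return -ENOENT;              /* no such symbol */
}
```

With this table, toy_resolve("do_sys_open", 4, &addr) succeeds and stores 0x1004, while the blacklisted entry is rejected before any address is produced.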

When you're done, don't forget one thing: cleanup.

When your module is unloaded, you must call the unregister function to remove the probe:

void unregister_kprobe(struct kprobe *p);

If you forget this step, the consequences are severe. Your handler code lives in the module's memory, which is freed when the module is unloaded, but the probe itself stays armed. The next time any code flows through that address, the kernel will try to call a probe callback that no longer exists. The result? A kernel oops, or even an outright crash. This is a classic resource leak scenario, except what's leaking isn't memory, but "control flow hijack points."

How Does It Work? (A Glimpse Beyond the Black Box)

You might be wondering: how is injecting a breakpoint into a running kernel actually implemented? Does it stop all CPUs? Or is it some kind of black magic?

To tell you the truth, the implementation details behind this—such as how to temporarily replace instructions, how to handle single-stepping, and how to ensure correctness in a multi-core environment—are extremely complex, involving architecture-specific assembly tricks and the kernel's low-level exception handling mechanisms.

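If you are genuinely interested in those gory details, we highly recommend reading the official kernel documentation on kprobes. It explains how kprobes saves a copy of the probed instruction, overwrites its first byte(s) with a breakpoint instruction (like INT3 on x86), and then single-steps the saved copy once the breakpoint fires. Note that kprobes does not rely on the CPU's hardware debug registers (DR0-DR7 on x86); those belong to the separate hardware breakpoint facility.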
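As a rough intuition for the "temporarily replace instructions" part, arming and disarming a breakpoint can be modeled on an ordinary byte buffer. This is a deliberately simplified user-space sketch: struct toy_bp, toy_arm() and toy_disarm() are hypothetical names, and the real kernel additionally has to patch live code safely across all CPUs, which this model ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INT3 0xCC            /* x86 one-byte breakpoint opcode */

/* Hypothetical record of one armed breakpoint. */
struct toy_bp {
    uint8_t *addr;               /* location being probed */
    uint8_t  saved_byte;         /* original first byte of the instruction */
    int      armed;
};

/* Save the original byte, then overwrite it with the breakpoint opcode. */
static void toy_arm(struct toy_bp *bp, uint8_t *addr)
{
    assert(bp != NULL && addr != NULL);
    bp->addr = addr;
    bp->saved_byte = *addr;
    *addr = TOY_INT3;
    bp->armed = 1;
}

/* Put the original byte back so the code runs unmodified again. */
static void toy_disarm(struct toy_bp *bp)
{
    assert(bp != NULL);
    if (!bp->armed)
        return;
    *bp->addr = bp->saved_byte;
    bp->armed = 0;
}
```

Disarming is the model's counterpart of unregistering a probe: until the saved byte is restored, every pass through that address still hits the breakpoint.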

In this book, we will keep these black boxes closed for now and focus on how to use this powerful blade.

Why Go Through All This Trouble? (From Static to Dynamic)

The approach we are introducing in this section—writing a kernel module, filling out struct kprobe, and registering handlers—is known as Static Kprobes.

It's called "static" because every time you want to probe a new function or change the log output format, you have to modify the C code, recompile the module, unload the old module, and load the new one.

This sounds a bit outdated, right? In this day and age, still recompiling just to take a peek at a log? Modern Linux kernels do have more advanced tricks up their sleeves.

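This is Dynamic Probing, or Kprobe-based Event Tracing. It is built on top of ftrace's event tracing infrastructure, so a dynamically created kprobe event can be enabled, filtered, and read much like a regular tracepoint. You don't need to write a single line of C code; you can dynamically instrument the kernel just by writing a one-line probe definition to the kprobe_events file under tracefs (typically mounted at /sys/kernel/tracing, and also reachable through debugfs at /sys/kernel/debug/tracing), or by using the perf tool or eBPF scripts. No compilation, no reboot: silky smooth.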
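To give a feel for that "one-line string", the sketch below formats a minimal probe definition of the form p:EVENT SYMBOL, the basic syntax accepted by the kprobe_events file. The helper name toy_format_kprobe_event() and the event name used in the usage note are made up for illustration, and the code only builds the string; actually writing it to kprobe_events requires root privileges and a kernel built with kprobe event support.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a kprobe event definition: "p:<event> <symbol>".
 * Returns 0 on success, -1 on an empty name or a too-small buffer.
 */
static int toy_format_kprobe_event(char *buf, size_t len,
                                   const char *event, const char *symbol)
{
    int n;

    assert(buf != NULL && event != NULL && symbol != NULL);
    if (strlen(event) == 0 || strlen(symbol) == 0)
        return -1;
    n = snprintf(buf, len, "p:%s %s", event, symbol);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, formatting the hypothetical event name "myopen" with the symbol "do_sys_open" yields the string "p:myopen do_sys_open", which is the line you would append to kprobe_events to create the probe.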

But before we fly off to that modern world, let's lay a solid foundation. In the following demos, we will start with the most primitive hand-written static kprobe and step by step close in on the truth of that file open:

  • Demo 1: The simplest "hardcoded" probe—hardcoding the probe on the open system call.
  • Demo 2: Slightly smarter—specifying the function name to probe via module parameters, eliminating the need to modify code and recompile.
  • Demo 3: The real meat—not just "seeing" that a function was called, but actually extracting the filename.

Ready? Let's start writing some code.