
Chapter 4: Inside the Black Box: Kprobes and Kernel Instrumentation

4.2 The Classic Approach: Hardcore Static Kprobes

We start with the most primitive method—static Kprobes.

This "traditional" probing approach means that as a developer, you must sit at your keyboard, write a kernel module in C, hardcode the name of the kernel function you want to probe into the source code, compile it, and then load it with insmod. Any modification, even just changing the probe point, means another round of make and reloading the module.

This sounds tedious, maybe even outdated. In an era where even coffee machines connect to WiFi, why should we still learn this?

Because it is the foundation. The flashy operations of dynamic probing and eBPF still rely on this underlying mechanism. Once you understand how static Kprobes hammer "hooks" into the kernel flow, you won't see magic when looking at automated tools later—you'll see logic.


Demo 1: Hardcoded Interception

Our first target is do_sys_open(). This is the core routine in the kernel that handles file opening. Whenever user space calls open(), it ultimately lands here.

Our current task is simple: post a sentinel at the entry of this function, with one handler that runs just before the probed instruction and another that runs just after it.

Registering the Probe

All the action happens in the kernel module's init function. We need to initialize a struct kprobe structure and register it with the kernel.

The code is located at ch4/kprobes/1_kprobe/1_kprobe.c:

#include <linux/kprobes.h>
#include "<...>/convenient.h"

static struct kprobe kpb;

/* Inside the module's init function: set up and register the kprobe handlers */
kpb.pre_handler = handler_pre;
kpb.post_handler = handler_post;
kpb.fault_handler = handler_fault;
kpb.symbol_name = "do_sys_open";

if (register_kprobe(&kpb)) {
	pr_alert("register_kprobe on do_sys_open() failed!\n");
	return -EINVAL;
}
pr_info("registering kernel probe @ 'do_sys_open()'\n");

There is a common misconception here: many people think kprobes can only intercept system calls. In fact, they can attach to almost any function in the kernel or in a loaded module, exported or not, as long as the kernel can resolve the symbol. do_sys_open is just an example; you can try replacing symbol_name with do_fork (named _do_fork or kernel_clone on newer kernels) or another kernel function. As long as it's not on the blacklist, this mechanism works.

Measuring Execution Time

Besides simply printing logs, kprobe has another highly practical use: performance measurement.

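You can time how long it takes to get from the pre-handler to the post-handler. One caveat: a kprobe's post_handler runs right after the probed instruction has executed, not when the function returns. With the probe on the function's entry, the delta therefore covers that first instruction plus the probe overhead. Timing a complete function call takes a kretprobe, whose handler fires on return. The bookkeeping is the same either way:

  1. On entry: Record a timestamp in pre_handler, saved as tm_start. ktime_get_real_ns() works, although it reads the wall clock, which can be stepped; the monotonic ktime_get_ns() is the safer choice for intervals.
  2. After the probed instruction: Record another timestamp right at the beginning of post_handler, saved as tm_end.
  3. Calculate the difference: (tm_end - tm_start) is the time that elapsed between the two handlers.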
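
The arithmetic in these steps is simple enough to try outside the kernel. The following user-space sketch is only an illustration: delta_ns() and format_delta() are hypothetical helper names (they are not the SHOW_DELTA macro from convenient.h), and the timestamps are plain integers standing in for what ktime_get_real_ns() would return. The guard in delta_ns() is an assumption about how one might cope with a wall clock that was stepped backwards between the two readings.

```c
#include <stdint.h>
#include <stdio.h>

/* Elapsed nanoseconds between two timestamps; 0 if the clock went backwards. */
static uint64_t delta_ns(uint64_t tm_start, uint64_t tm_end)
{
	return (tm_end >= tm_start) ? (tm_end - tm_start) : 0;
}

/* Hypothetical helper: render a nanosecond delta with a readable unit. */
static int format_delta(char *buf, size_t len, uint64_t ns)
{
	unsigned long long v = (unsigned long long)ns;

	if (v < 1000ULL)
		return snprintf(buf, len, "%llu ns", v);
	if (v < 1000000ULL)
		return snprintf(buf, len, "%llu.%03llu us", v / 1000ULL, v % 1000ULL);
	return snprintf(buf, len, "%llu.%03llu ms",
			v / 1000000ULL, (v % 1000000ULL) / 1000ULL);
}
```

For example, delta_ns(1000, 4501) yields 3501, which format_delta() renders as 3.501 us.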

Let's look at the specific handler implementation:

/* Pre-handler: runs just before the probed instruction executes */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	PRINT_CTX(); // prints the context via pr_debug()
	spin_lock(&lock);
	tm_start = ktime_get_real_ns();
	spin_unlock(&lock);
	return 0;
}

/* Post-handler: runs after the probed instruction has executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
	spin_lock(&lock);
	tm_end = ktime_get_real_ns();
	PRINT_CTX(); // prints the context via pr_debug()
	SHOW_DELTA(tm_end, tm_start); // computes and prints the time delta
	spin_unlock(&lock);
}

There are a few technical details worth noting here:

  • The SHOW_DELTA and PRINT_CTX macros are defined in our convenient.h header file. They aren't standard kernel APIs; they are utility macros we wrote for convenient debugging.
  • Inside, the PRINT_CTX macro actually calls pr_debug(). This means that if you haven't enabled the DEBUG macro or configured the kernel's dynamic debugging feature, you won't see anything in dmesg. It isn't that nothing happened—the logs were just swallowed by the system.
  • What is that spin_lock for? It's for concurrency control. pre_handler and post_handler can run simultaneously on multiple cores (do_sys_open is called frequently), and tm_start and tm_end are global variables. Without the lock, two cores can update them at the same moment and the printed deltas become meaningless. Even with the lock, one core can overwrite tm_start before another core reaches its post_handler, so treat the numbers as approximate.
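
To make the shared-timestamp hazard concrete, here is a small user-space model, offered as a sketch rather than as kernel code. All names in it (pre_global, post_global, pre_per_task, post_per_task, MAX_TASKS) are hypothetical, and the timestamps are hand-picked integers. It replays an interleaving in which a second caller enters before the first one finishes, once with a single global start slot and once with one slot per caller.

```c
#include <stdint.h>

#define MAX_TASKS 4	/* hypothetical table size for this model */

static uint64_t global_start;			/* one shared slot for everyone */
static uint64_t per_task_start[MAX_TASKS];	/* one slot per caller */

/* Shared-slot variant: every caller overwrites the same start time. */
static void pre_global(uint64_t now)
{
	global_start = now;
}

static uint64_t post_global(uint64_t now)
{
	return now - global_start;
}

/* Per-caller variant: each caller keeps its own start time. */
static void pre_per_task(int task, uint64_t now)
{
	per_task_start[task] = now;
}

static uint64_t post_per_task(int task, uint64_t now)
{
	return now - per_task_start[task];
}
```

If caller 0 enters at t=100 and caller 1 at t=150, the shared slot holds 150 by the time caller 0 finishes at t=200, so post_global(200) reports 50 instead of the true 100. The per-caller variant reports 100. A lock serializes the updates but does not repair this pairing problem; storage per CPU or per probe hit would be one way to address it in a real module.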

Fault Handling: Defensive Programming

Since we're poking around inside the kernel, we must be prepared for things to go wrong.

The kprobe mechanism provides a fault_handler (on kernels before 5.14, which removed this member from struct kprobe). If a page fault or other exception is triggered while one of our handlers is executing, or while the probed instruction is being single-stepped, this function gets called. We usually can't do any complex recovery here, since that's the job of the core kernel code; our main responsibility is to record the scene and then throw this "hot potato" back to the kernel.

static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
	pr_info("fault_handler: p->addr = 0x%p, trap #%d\n",
		p->addr, trapnr);
	/* Return 0: we don't handle this fault, so the kernel's default mechanism takes over */
	return 0;
}
NOKPROBE_SYMBOL(handler_fault);

There is an extremely important macro here: NOKPROBE_SYMBOL().

Please remember it. It tells the kernel: "This function must absolutely never be probed by kprobe."

Why? Imagine that handler_fault itself hit a kprobe, and that kprobe's handler faulted, triggering handler_fault again. The probe machinery would keep re-entering itself; at best a probe hit is missed, at worst the system hangs or crashes. Therefore, functions that run as part of kprobe handling (especially the handlers themselves) should be protected with this macro.

Testing on the Board

Just reading the code doesn't give you a real feel, so let's run it.

The run script provided for this section has already packaged all the steps: clearing the logs, compiling, inserting the module, waiting for 5 seconds (during which the system will frantically call do_sys_open), and then unloading the module and printing the logs.

$ cd <lkd-src-tree>/ch4/kprobes/1_kprobe
$ ./run
[... build output ...]
[... module inserted ...]
(waits 5 seconds)
[... module removed ...]

Let's look at the dmesg output. You should see information similar to this:

[ 123.456789] 1_kprobe:handler_pre: [...] <...> do_sys_open-1234
[ 123.456790] 1_kprobe:handler_post: [...] <...> do_sys_open-1234
[ 123.456790] 1_kprobe:handler_post: delta: 3501 ns

These three log lines carry a lot of information.

Decoding the PRINT_CTX() Output

PRINT_CTX() is a macro I wrote, imitating ftrace's latency trace format. It tells you which process is currently running, what its PID is, which CPU it's running on, and whether it's in interrupt context or process context.

Study this output format carefully—it's incredibly useful during deep debugging sessions:

[<timestamp>] <module name>:<function name>: <...> <process name>-<PID> [<CPU>] <interrupt flags> <preempt flags>

  • Process name (comm): Which process triggered this call?
  • PID: The process ID.
  • CPU: Which core it ran on.
  • Delta: How many nanoseconds elapsed between handler_pre and handler_post for this hit on do_sys_open. You'll notice it's extremely short, usually just a few microseconds.

⚠️ Pitfall Warning: The Interrupt Context Trap

Here's a question: if a probed function runs in hard IRQ context (do_sys_open itself never does, because it can sleep, but plenty of other functions you might probe do), what will PRINT_CTX() display?

The answer might surprise you: it will display the name and PID of the process that was interrupted, not anything about the interrupt itself. An interrupt has no process context of its own; while it runs, current still points to whichever task happened to be on that CPU. If you see a process name like sshd in the logs, but logically sshd couldn't possibly be performing this operation, it's most likely because sshd happened to be unlucky enough to be running on the CPU when the interrupt struck.

This detail is often the key to cracking bizarre bugs.
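
How can code tell which context it is in? The kernel tracks this in the per-task preempt count. The user-space model below is a sketch under stated assumptions: the bit positions (softirq offset at bit 8, hard IRQ count in bits 16 to 19) follow include/linux/preempt.h as I understand it and may differ on your kernel version, the MODEL_* names and context_char() are hypothetical, and this is not a description of how PRINT_CTX() is actually implemented. The returned characters mimic the ftrace latency format.

```c
#include <stdint.h>

/* Assumed layout, modeled on include/linux/preempt.h */
#define MODEL_SOFTIRQ_OFFSET	0x00000100u	/* set while a softirq is served */
#define MODEL_HARDIRQ_MASK	0x000f0000u	/* hard IRQ nesting count */

/* Return an ftrace-style context character for a preempt count value. */
static char context_char(uint32_t preempt_count)
{
	int hard = (preempt_count & MODEL_HARDIRQ_MASK) != 0;
	int soft = (preempt_count & MODEL_SOFTIRQ_OFFSET) != 0;

	if (hard && soft)
		return 'H';	/* hard IRQ that arrived while serving a softirq */
	if (hard)
		return 'h';	/* hard IRQ context */
	if (soft)
		return 's';	/* softirq context */
	return '.';		/* ordinary process context */
}
```

Whatever character comes back, the process name and PID printed next to it still describe the interrupted task, which is exactly the trap described above.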


The Kprobe Blacklist—Places You Can't Touch

Some functions in the kernel are "restricted areas." You cannot insert probes there. The main reason is simple: kprobe's own implementation needs to call these functions. If you place probes in these functions, it's very easy to cause a recursive crash and bring down the entire kernel.

You can check the blacklist directly:

sudo cat /sys/kernel/debug/kprobes/blacklist

If register_kprobe() fails, besides checking if you misspelled the function name, the first thing you should do is check this blacklist. The kernel documentation provides a detailed explanation of kprobe restrictions; we highly recommend reading it before you start.
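
If you want to automate that check, a few lines of user-space C are enough. This is a minimal sketch, assuming each line of the blacklist file has the form "<start address>-<end address>", a tab, and then the symbol name; verify that layout on your own kernel before relying on it. The function name blacklist_line_names() is hypothetical.

```c
#include <string.h>

/*
 * Return 1 if `line` (one line read from the blacklist file) names `symbol`,
 * 0 otherwise. Assumed layout: "<start>-<end>\t<symbol>".
 */
static int blacklist_line_names(const char *line, const char *symbol)
{
	const char *name = strchr(line, '\t');
	size_t n;

	if (!name)
		return 0;		/* not in the assumed layout */
	name++;				/* skip the tab */
	n = strcspn(name, " \t\n");	/* symbol ends at whitespace or end of line */
	return n == strlen(symbol) && strncmp(name, symbol, n) == 0;
}
```

Feeding every line of the file through this function (for instance with fgets() in a loop) tells you whether the symbol you tried to probe is off limits. The comparison is exact, so a query for do_int does not match a line that lists do_int3.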


Demo 2: A Bit More Flexible: Specifying Functions via Module Parameters

Demo 1 has an obvious weakness: every time you want to probe a different function, you have to change the symbol_name in the code and recompile. That's too much work.

Let's improve on this by introducing module parameters.

This way, we can pass the name of the function we want to probe as an argument when we insmod.

Parameterizing the Code

The code is located at ch4/kprobes/2_kprobe/2_kprobe.c:

#define MAX_FUNCNAME_LEN 64
static char kprobe_func[MAX_FUNCNAME_LEN];
module_param_string(kprobe_func, kprobe_func, sizeof(kprobe_func), 0);
MODULE_PARM_DESC(kprobe_func, "Function name to attach a kprobe to");

static int verbose;
module_param(verbose, int, 0644);
MODULE_PARM_DESC(verbose, "Set to 1 to get verbose printks (defaults to 0).");

When registering, we no longer use a hardcoded string, but rather a variable:

kpb.symbol_name = kprobe_func;

Usage

Now when inserting the module, we can freely specify the target:

# Probe do_sys_open
sudo insmod 2_kprobe.ko kprobe_func=do_sys_open verbose=1

# Or probe do_fork instead
sudo rmmod 2_kprobe
sudo insmod 2_kprobe.ko kprobe_func=do_fork verbose=1

This is much more comfortable. It frees us from the "modify source -> compile" death spiral.

Suppressing Log Floods

But this introduces a new problem: if you're probing a high-frequency function like do_sys_open, your dmesg will be instantly flooded, and any useful information will be drowned out.

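To solve this, Demo 2 introduces a compile-time switch, SKIP_IF_NOT_VI.

Its logic is simple: when the macro is defined, the handler only logs when the current process name starts with "vi". The strncmp compares just the first two characters, so vim and view pass the filter too.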

#ifdef SKIP_IF_NOT_VI
	/* For demonstration purposes, we only log when the process context is 'vi' */
	if (strncmp(current->comm, "vi", 2))
		return 0;
#endif

You can try changing the hardcoded name in this comparison to a process you care about (like firefox), or refactor it into a module parameter. This allows you to precisely filter out the noise and observe only the system behavior you need.
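
The difference between a prefix test and an exact match is easy to verify in user space. In this sketch, skip_by_prefix() mirrors the comparison used in the demo, while skip_unless_exact() is a hypothetical stricter variant; both return nonzero when the handler should bail out without logging. Plain strings stand in for current->comm, which the kernel limits to 15 characters plus the terminating NUL.

```c
#include <string.h>

/* Prefix test, as in the demo: nonzero means "skip this process". */
static int skip_by_prefix(const char *comm)
{
	return strncmp(comm, "vi", 2) != 0;
}

/* Hypothetical stricter variant: skip unless comm equals `wanted` exactly. */
static int skip_unless_exact(const char *comm, const char *wanted)
{
	return strcmp(comm, wanted) != 0;
}
```

With the prefix test, "vi", "vim" and "view" are all logged; with the exact match only "vi" is. Which behavior you want depends on what you are hunting, so pick deliberately.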


Understanding the Basics: What is an ABI?

Now we can "see" when a function is called. But that's not enough.

If you're a true debugging expert, you'll want to know: what are the parameters passed in when this function is called?

For example, when do_sys_open is called, where exactly is that filename? Is it in a register? Or on the stack?

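To answer this question, we must venture into a realm known as the ABI (Application Binary Interface). This is the contract that separately compiled pieces of machine code follow on a given platform: how arguments are passed, where the return value goes, and which registers a callee must preserve.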

How Does the Compiler Work?

When you write a line of C code like open(filename, O_RDWR), the compiler, when turning it into assembly instructions, must solve several problems:

  1. Which register should hold the address of the filename string to pass to the function?
  2. Where should the O_RDWR integer go?
  3. Which register should I check for the return value?

These rules are not defined by the C language, but by the platform's ABI, which is specified per processor architecture (and, in user space, per operating system). Each architecture (x86, ARM64, MIPS) has its own set of "house rules."

Key Differences: x86 vs ARM

Without understanding the ABI, you won't know how to extract parameters from struct pt_regs. Here is a cheat sheet worth saving:

| Architecture | Parameter Passing Rules | Return Value Location |
| --- | --- | --- |
| x86-32 (ia32) | Almost entirely passed via the stack. Parameters are pushed in right-to-left order. | EAX |
| x86-64 | The first 6 parameters are placed in the RDI, RSI, RDX, RCX, R8, R9 registers respectively. The 7th parameter and beyond go on the stack. | RAX |
| ARM-32 | The first 4 parameters are placed in R0, R1, R2, R3. The rest go on the stack. | R0 |
| ARM-64 | The first 8 parameters are placed in X0 ~ X7. The rest go on the stack. | X0 |

This table explains why kernel code often contains a bunch of ifdef—to get function parameters on different architectures, you have to look up the corresponding registers.

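⚠️ Note: This table applies to integer and pointer types. Floating-point parameters usually have a separate set of rules (for example, x86-64 uses xmm registers to pass floats). On 32-bit x86, the kernel itself is built with -mregparm=3, so inside the kernel the first three arguments arrive in EAX, EDX, and ECX rather than on the stack. Furthermore, compiler optimizations might alter these details, but within the core flow of the kernel, these rules are generally stable.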
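
As a rough illustration of the x86-64 row, the following user-space sketch maps an argument index to its register. struct mock_regs and mock_arg() are hypothetical stand-ins: the real struct pt_regs has many more members, though on x86-64 it does name these six di, si, dx, cx, r8 and r9. Stack-passed arguments (the 7th and beyond) are deliberately not modeled.

```c
#include <stdint.h>

/* Mock of the six x86-64 argument registers; field names follow pt_regs. */
struct mock_regs {
	uint64_t di, si, dx, cx, r8, r9;
};

/*
 * Return the n-th (0-based) integer or pointer argument.
 * Returns 0 for n >= 6, because stack arguments are not modeled here.
 */
static uint64_t mock_arg(const struct mock_regs *regs, unsigned int n)
{
	switch (n) {
	case 0: return regs->di;
	case 1: return regs->si;
	case 2: return regs->dx;
	case 3: return regs->cx;
	case 4: return regs->r8;
	case 5: return regs->r9;
	default: return 0;
	}
}
```

For do_sys_open(), whose second parameter is the filename pointer, the value of interest would be index 1, which is the si register on x86-64. On ARM-64 the same index would map to X1 instead, which is why portable probe code ends up with per-architecture branches.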

Armed with this knowledge, we can move on to the next stage: not just "seeing" function calls, but snooping on their parameters. We will use this knowledge in the next demo to extract the full path of the file that do_sys_open attempts to open.