4.4 Getting Started with kretprobes
In the previous section, we saw the power of kprobes—like forcibly inserting a breakpoint inside a kernel function. But that only lets you see what a function looks like when it goes "in." What if we want to see what result it brings back when it comes "out"?
That's where kretprobes (return probes) come in.
You can think of it as kprobe's twin sibling. kprobe watches the "entry," and kretprobe watches the "exit."
Motivation: Why Do We Care About Return Values?
In debugging scenarios, being able to dynamically capture a function's return value is often the key to cracking the case.
Imagine you suspect an allocation function along a certain kernel path is failing, but the logs are completely empty. If you could intercept that function the exact moment it returns and take a peek at what it's holding (a valid pointer, or a glaring negative errno), you'd save yourself a whole night of blind guessing.
Pro Dev Tip
Don't take it for granted: if a function has a return value, you must check for failure.
Even malloc() in user space or kmalloc() in the kernel will fail some day. If you don't check for that failure return value, you'll just be yelling at the void when things go wrong. Remember, robust code doesn't assume things will go smoothly; it assumes things will definitely go wrong.
Core API: Registration and Unregistration
The kretprobe API is very intuitive, cut from the same cloth as kprobe:
#include <linux/kprobes.h>
int register_kretprobe(struct kretprobe *rp);
void unregister_kretprobe(struct kretprobe *rp);
Following the kernel's usual convention, register_kretprobe() returns 0 on success and a negative errno value on failure; unregister_kretprobe() returns nothing.
Tech Tip: Where does errno come from?
You're surely familiar with errno. In user space, it was historically an integer in each process's uninitialized data segment (modern implementations keep it in Thread Local Storage for thread safety, via the compiler-level __thread keyword).
When a system call fails, the kernel returns a negative errno value. glibc's glue code multiplies it by -1 to make it positive, stores it in errno, and returns -1 to the caller.
How do you look up errors? Don't memorize them. Go to the header files:
- /usr/include/asm-generic/errno-base.h (1 to 34, common errors)
- /usr/include/asm-generic/errno.h (35 to 133, extended errors)
For example, if you see a function return -101 in the logs, a quick lookup tells you it's ENETUNREACH (Network is unreachable). That's much better than playing guessing games.
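If you'd rather not grep the headers by hand, the same lookup can be done from a small user-space program. The following is only a sketch: the helper names errno_from_kernel_ret() and describe_kernel_ret() are made up for illustration, and it relies on the C library's strerror() for the description.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn a kernel-style return value (0 or -errno)
 * into a positive errno number. Returns 0 if ret is not negative. */
static int errno_from_kernel_ret(long ret)
{
    return ret < 0 ? (int)-ret : 0;
}

/* Hypothetical helper: write "<ret> (<description>)" into buf and
 * return buf, using strerror() to describe the error number. */
static const char *describe_kernel_ret(long ret, char *buf, size_t len)
{
    int err = errno_from_kernel_ret(ret);

    if (err == 0)
        snprintf(buf, len, "%ld (not an error)", ret);
    else
        snprintf(buf, len, "%ld (%s)", ret, strerror(err));
    return buf;
}
```

Calling describe_kernel_ret(-101, buf, sizeof buf) on a typical Linux/glibc system should give you the ENETUNREACH description next to the raw number, which is usually what you want in a log line.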
Data Structures: Diving into kretprobe
struct kretprobe actually contains a struct kprobe internally. This makes perfect sense—to intercept a return, you first need to know where to intercept it, right?
The key members of the structure are as follows:
- Setting the target: done through the internal kp member.
  - rp->kp.addr: specify the address directly.
  - rp->kp.symbol_name: specify the function name (this is the most common approach).
- Setting the handler:
  - rp->handler: this is your "return handler." When the probed function finishes execution and is about to return, this function gets called.
Its signature looks like this:
int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs);
The parameter struct pt_regs *regs here is an old friend—it holds the CPU register state. The parameter struct kretprobe_instance *ri contains the context for this specific probe instance:
- ri->ret_addr: the return address.
- ri->task: pointer to the current process's task_struct.
- ri->data: used to store private, per-instance data.
The Crucial Step: Getting the Return Value
Alright, our goal is the "return value."
Remember the calling convention we discussed when talking about the ABI? A function's return value is typically placed in a specific register.
- On x86, it's ax.
- On ARM (32-bit), it's r0.
- On ARM64, it's x0 (stored as regs[0] in pt_regs).
While you could manually look up the register table and extract it yourself, that's just asking for pain. The kernel provides a hardware-agnostic helper:
regs_return_value(regs);
This helper fishes out the correct register value from pt_regs for the current architecture.
How is it implemented? Take a peek at the source code—it's brutally simple and elegant:
- ARM (AArch32): return regs->ARM_r0;
- ARM64 (AArch64): return regs->regs[0];
- x86: return regs->ax;
Just call it, and leave the rest to the architecture layer.
Hands-on: Dissecting the Kernel Example Code
The kernel source code comes with a kretprobe example (samples/kprobes/kretprobe_example.c). Let's break down its key parts—it's more intuitive than writing one from scratch.
1. Module Parameters: Who Do We Want to Probe?
For flexibility, the example code allows you to pass in a function name:
static char func_name[NAME_MAX] = "kernel_clone";
module_param_string(func, func_name, NAME_MAX, S_IRUGO);
MODULE_PARM_DESC(func, "Function to kretprobe; this module will report the function's execution time");
By default, it probes kernel_clone (the core path for creating processes/threads), but when you load the module, you can change it to any function you want to monitor.
2. Defining the kretprobe Structure
This is the core configuration:
static struct kretprobe my_kretprobe = {
	.handler       = ret_handler,            /* runs when the function returns */
	.entry_handler = entry_handler,          /* runs when the function is entered */
	.data_size     = sizeof(struct my_data), /* private space per instance */
	/* Probe up to 20 instances concurrently. */
	.maxactive     = 20,
};
There are a few fields here worth chewing on:
.entry_handler (Entry Handler)
This is somewhat like kprobe's pre-handler. It triggers when the target function has just been called but hasn't started executing yet.
- If it returns 0, it means "continue probing," and when the function finishes executing, .handler will be called.
- If it returns non-0, it means "ignore this one," and kretprobe will skip this instance entirely.
.maxactive (Max Active Instances)
This is a parameter that's easy to trip over.
What it means is: how many concurrent instances of this function are you allowing to be probed?
- If you leave it at 0 or below, the kernel picks a default: NR_CPUS on a non-preemptible kernel, or max(10, 2*NR_CPUS) when kernel preemption is enabled.
- If your target function executes slowly (for example, it sleeps) or is called recursively, there might be many instances running at the same time.
- If maxactive is set too low, the excess instances will be missed and counted in the nmissed field. If you notice nmissed keeps climbing, bump this value up.
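To make the relationship between maxactive and nmissed concrete, here is a toy user-space model. It is not the kernel's implementation: struct toy_kretprobe, toy_init(), toy_enter(), and toy_exit() are hypothetical names, and the only thing the sketch assumes is the behavior described above, namely a fixed number of instance slots, with entries that find no free slot being counted as missed.

```c
#include <stddef.h>

#define POOL_CAPACITY 8  /* upper bound for this toy model only */

/* Toy stand-in for a kretprobe: a fixed pool of instance slots. */
struct toy_kretprobe {
    int maxactive;              /* slots available after "registration" */
    int nmissed;                /* entries dropped because no slot was free */
    int in_use[POOL_CAPACITY];  /* 1 = slot is tracking a live call */
};

static void toy_init(struct toy_kretprobe *rp, int maxactive)
{
    int i;

    if (maxactive < 1)
        maxactive = 1;
    if (maxactive > POOL_CAPACITY)
        maxactive = POOL_CAPACITY;
    rp->maxactive = maxactive;
    rp->nmissed = 0;
    for (i = 0; i < POOL_CAPACITY; i++)
        rp->in_use[i] = 0;
}

/* Function entry: claim the lowest free slot, or count a miss.
 * Returns the slot index, or -1 if this call will not be tracked. */
static int toy_enter(struct toy_kretprobe *rp)
{
    int i;

    for (i = 0; i < rp->maxactive; i++) {
        if (!rp->in_use[i]) {
            rp->in_use[i] = 1;
            return i;
        }
    }
    rp->nmissed++;
    return -1;
}

/* Function return: release the slot so a later call can reuse it. */
static void toy_exit(struct toy_kretprobe *rp, int slot)
{
    if (slot >= 0 && slot < rp->maxactive)
        rp->in_use[slot] = 0;
}
```

With maxactive set to 2, a third overlapping toy_enter() returns -1 and bumps nmissed to 1. That is the situation a slow or recursive target function puts you in.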
3. Planting the Probe
During module initialization, fill in the name and register:
my_kretprobe.kp.symbol_name = func_name;
ret = register_kretprobe(&my_kretprobe);
if (ret < 0) {
	/* A failed registration is an error, so log it at error level. */
	pr_err("register_kretprobe failed, returned %d\n", ret);
	return ret;
}
4. The Real Payoff: Return Value and Elapsed Time
This is the implementation of .handler, where we actually capture the data:
static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	unsigned long retval = regs_return_value(regs);
	struct my_data *data = (struct my_data *)ri->data;
	s64 delta;
	ktime_t now;

	/* Assumes entry_handler saved a ktime_get() stamp in data->entry_stamp. */
	now = ktime_get();
	delta = ktime_to_ns(ktime_sub(now, data->entry_stamp));
	pr_info("%s returned %lu and took %lld ns to execute\n",
		func_name, retval, (long long)delta);
	return 0;
}
Notice the first line: regs_return_value(regs).
This is what we went to all this trouble to find—that return value hidden away in a register.
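One practical wrinkle: the handler prints the value with %lu. If the probed function returns a negative errno such as -12, the log on a 64-bit kernel will show a huge number like 18446744073709551604 rather than -12. Below is a user-space sketch of how you might decode such a raw value when reading the logs. The helper names looks_like_errno() and as_signed() are hypothetical. The 4095 bound mirrors the MAX_ERRNO limit that the kernel's IS_ERR_VALUE() check uses.

```c
#define MAX_ERRNO 4095UL  /* same bound the kernel's IS_ERR_VALUE() uses */

/* Returns 1 if the raw register value falls in the top MAX_ERRNO values
 * of the unsigned range, i.e. it is really a small negative number
 * such as -ENOMEM; returns 0 otherwise. */
static int looks_like_errno(unsigned long raw)
{
    return raw >= (unsigned long)-MAX_ERRNO;
}

/* Reinterpret the raw value as signed. The conversion is
 * implementation-defined in ISO C, but wraps as expected on the
 * two's-complement targets that gcc and clang support. */
static long as_signed(unsigned long raw)
{
    return (long)raw;
}
```

Whether the return value should be read as a pointer, an unsigned count, or a signed error code depends on the function being probed, so treat looks_like_errno() as a hint, not proof.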
5. Cleanup
Don't forget to unregister. While we're at it, let's check if any instances were missed:
unregister_kretprobe(&my_kretprobe);
pr_info("kretprobe at %p unregistered\n", my_kretprobe.kp.addr);
/* nmissed > 0 suggests that maxactive was set too low. */
pr_info("Missed probing %d instances of %s\n",
	my_kretprobe.nmissed, my_kretprobe.kp.symbol_name);
If your nmissed is not 0, go back and increase .maxactive, then try again.
Miscellaneous: Multiple Probes and Toggles
Batch Registration
If you have a bunch of functions to probe, calling register one by one is too tedious. The kernel provides a batch API:
int register_kretprobes(struct kretprobe **rps, int num);
void unregister_kretprobes(struct kretprobe **rps, int num);
This is really just a wrapper around a loop (if one registration fails, the probes registered so far are unregistered again), but it keeps the code much cleaner.
Temporary Toggles
Sometimes you only want to enable probing under high load and disable it under low load to reduce performance overhead. Instead of unloading the module, use these two APIs:
int disable_kretprobe(struct kretprobe *rp);
int enable_kretprobe(struct kretprobe *rp);
This is much more elegant than the brute-force rmmod approach.
A Glimpse Under the Hood
If you're interested in how kretprobe hooks into a function's return path (it involves replacing the return address on the function's stack, or using an architecture-specific mechanism), you can dive into the kernel documentation: Documentation/kprobes.txt (Documentation/trace/kprobes.rst in newer kernel trees).
Simply put, it plants a kprobe at the function's entry point. When that probe fires, it saves the original return address and replaces it with the address of a trampoline. When the function returns, the CPU jumps to this trampoline, which runs your handler and then jumps back to the real return address.
This is also why kretprobe has slightly more overhead than kprobe—it has to touch the stack.
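The save-and-swap idea is easier to see in a toy model. This is a user-space illustration only, not how the kernel does it internally: every name here (toy_frame, toy_instance, TOY_TRAMPOLINE, toy_on_entry(), toy_on_return()) is made up, and addresses are plain integers instead of real code locations.

```c
#include <stdint.h>

#define TOY_TRAMPOLINE ((uintptr_t)0xdeadbeef)  /* made-up marker address */

/* Stands in for the stack slot that holds the function's return address. */
struct toy_frame {
    uintptr_t ret_addr;
};

/* Stands in for the per-call bookkeeping (like kretprobe_instance). */
struct toy_instance {
    uintptr_t saved_ret_addr;  /* the real return address */
    int handler_calls;         /* how often the "return handler" ran */
};

/* Entry probe: remember the real return address, then hijack the slot. */
static void toy_on_entry(struct toy_frame *f, struct toy_instance *ri)
{
    ri->saved_ret_addr = f->ret_addr;
    f->ret_addr = TOY_TRAMPOLINE;
}

/* Function return: if the slot points at the trampoline, run the handler
 * and resume at the saved address; otherwise return as usual.
 * Returns the address where execution continues. */
static uintptr_t toy_on_return(const struct toy_frame *f,
                               struct toy_instance *ri)
{
    if (f->ret_addr != TOY_TRAMPOLINE)
        return f->ret_addr;  /* this call was not probed */
    ri->handler_calls++;     /* this is where your .handler would run */
    return ri->saved_ret_addr;
}
```

The extra bookkeeping on both entry and return is the cost the model makes visible: one save and swap going in, one lookup and restore coming out.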
Costs and Limitations: No Silver Bullet
Frederick Brooks said, "There is no silver bullet." As powerful as kretprobe is, it isn't omnipotent.
1. Which Functions Can't Be Probed?
To protect themselves (and to prevent you from crashing the system), kernel developers forbid probing certain critical functions:
- Functions marked with __kprobes or nokprobe_inline.
- Functions that explicitly use the NOKPROBE_SYMBOL() macro.
- Functions on the blacklist.
Where is the blacklist?
In /sys/kernel/debug/kprobes/blacklist. The functions listed here are usually tightly coupled with the kprobe implementation itself; probing them would lead to recursive deadlocks.
By the way: that kp_load.sh helper script we wrote in the previous section is actually pretty smart. Before loading, it checks this blacklist, and if it finds the function you want to probe is on it, it refuses to execute.
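The same pre-flight check can be sketched in C. This is not the kp_load.sh script itself, just an illustration of the idea. The function name symbol_is_blacklisted() is hypothetical, and the parser assumes each blacklist line has an address range in the first column and the symbol name in the second; check the output on your own kernel before relying on that layout.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if sym appears as the symbol column of a blacklist-style
 * stream, 0 otherwise.
 * Assumed line layout: "<start>-<end><whitespace><symbol>". */
static int symbol_is_blacklisted(FILE *fp, const char *sym)
{
    char line[512];
    char name[256];

    while (fgets(line, sizeof line, fp)) {
        /* Skip the address-range column, then read the symbol column. */
        if (sscanf(line, "%*s %255s", name) == 1 && strcmp(name, sym) == 0)
            return 1;
    }
    return 0;
}
```

In real use you would fopen() /sys/kernel/debug/kprobes/blacklist (this typically needs root and a mounted debugfs) and pass the resulting stream in; taking a FILE * keeps the function easy to try out on a scratch file first.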
2. Stability in Production
Using k[ret]probes in a production environment carries risks.
- Performance overhead: Every entry and exit requires a trap. Although a single trap is fast, on high-frequency functions (like the network receive path or the scheduler path), this overhead grows in proportion to the call rate and can quickly become significant.
- Fragility of the ABI: Internal kernel APIs are unstable. Today you might probe the x() function, taking its 3rd parameter and return value; tomorrow, after a kernel upgrade, x() might have been renamed, its parameters changed, or the meaning of its return value altered. This means your kernel module must be maintained alongside kernel versions, which is a form of long-term debt.
So, treat it like a scalpel—use it to save the day in critical moments, don't use it like a cleaver for everyday chopping.
Alright, since hand-written probe modules come with so many hassles (writing code, compiling, loading), is there a "lazy" approach where we can just look at whatever we want without writing C code?
Of course there is. In the next section, we'll look at the real magic of dynamic tracing.