4.5 Kprobe-Based Event Tracing — The Internals
Remember that cliffhanger at the end of the last section? Is there a way to "bug" any function in the kernel without writing a single line of C code or compiling a kernel module?
The answer is yes, and this mechanism is much closer than you think. It's hiding right there in the /sys/kernel/debug/tracing directory that you might walk past every day.
This mechanism is called kprobe-based event tracing.
What This Section Is Really About
On the surface, this section teaches you how to use the tracefs filesystem. In reality, it reveals the "secret weapon" behind advanced tools like perf and eBPF. When you run perf probe, for example, much of what it does is read and write files in the filesystem we'll discuss here.
Let's pull back the curtain.
The Truth Behind the Event Tracing Framework
When you run perf probe or those automated scripts, you might think they're performing some dark magic. Actually, they're just doing something very mechanical: using the kernel's ftrace infrastructure to register probes.
You might have noticed in the last section that when we looked at /sys/kernel/debug/kprobes/list, there was a [FTRACE] column on the right. The mystery is now solved: when a kprobe lands on a function's ftrace entry site, the kernel implements it through ftrace itself instead of a breakpoint, and [FTRACE] marks exactly those probes.
Within this framework, there's a specific piece called kprobe events, which is a subset of the larger ftrace system. Its core idea is to abstract "probes" as "events." You define an event, the kernel attaches the probe for you, and then collects the data into a unified trace buffer.
Tracing Built-in Functions via the Event Tracing Framework
The prerequisite for all of this is that the kernel has CONFIG_KPROBE_EVENTS=y enabled. The good news is that most distribution kernels, even those in production environments, have this enabled by default.
Let's dive into that directory:
$ ls /sys/kernel/tracing/events
There are a bunch of folders here, representing "event categories" already exposed by various subsystems within the kernel.
A Minor Path Issue
You might see two paths:
/sys/kernel/debug/tracing and /sys/kernel/tracing. They usually point to the exact same thing. Sometimes, for security reasons, production environments will unmount debugfs to prevent snooping, but tracefs (/sys/kernel/tracing) is often left available for performance tools. For the sake of brevity, we'll uniformly refer to it as <tracing>.
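If you'd rather not guess which of the two paths your machine uses, a small shell helper can pick one for you. This is only a sketch: the function name find_tracing_dir is made up for illustration, and it assumes that a mounted tracefs exposes an events/ subdirectory, which is the case on kernels built with event tracing.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate directory that looks like
# a mounted tracefs. With no arguments, the two conventional paths are tried.
find_tracing_dir() {
    if [ "$#" -eq 0 ]; then
        set -- /sys/kernel/tracing /sys/kernel/debug/tracing
    fi
    for dir in "$@"; do
        # Assumption: a usable tracefs mount contains an events/ directory.
        if [ -d "$dir/events" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "tracefs not found; try: mount -t tracefs nodev /sys/kernel/tracing" >&2
    return 1
}

# Example (as root): TRACING=$(find_tracing_dir) && ls "$TRACING/events"
```

The candidate paths are passed as arguments so the function can be pointed at a non-standard mount point without editing it.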
Look at the figure below to get a feel for how lush this tree is:
(📷 Figure 4.10 – Screenshot showing the kernel's event tracing - pseudo files and folders)
Everything in here is ready to use out of the box.
Let's Try One: kmalloc
Let's take the most commonly used memory allocation function, kmalloc, for a spin. At the bottom of Figure 4.10, you can see the events/kmem/kmalloc directory. This is a monitoring point pre-configured for you by the kernel.
No need to write code; just do this:
# Enable the event
echo 1 > <tracing>/events/kmem/kmalloc/enable
# Watch the output
cat <tracing>/trace_pipe
trace_pipe is a special file. Just like tail -f, it streams data out in real time. A regular trace file, on the other hand, is just a snapshot.
At this point, any kernel code, built-in or module, that calls kmalloc will be recorded. You'll see output similar to this scrolling across your screen:
<...>-1234 [001] .... 12345.678901: kmalloc: call_site=ptr_spin_lock+0x4/0x20 ptr=ffff88000001 size=64 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL
The information here is incredibly rich:
- Who called it (call_site)
- How much was requested (bytes_req)
- How much was actually allocated (bytes_alloc)
- What the allocation flags are (gfp_flags)
The format definitions for these fields are actually written in the format pseudo-file in that directory. Tools like perf rely on parsing this file to know how to print the data.
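To see what "parsing this file" amounts to, here is a rough sketch that pulls just the field names out of a format file. It relies on each field being described by a line of the shape field:TYPE NAME; offset:N; size:N; signed:N;. The function name list_event_fields is hypothetical, and real tools parse considerably more than this (offsets, sizes, and the print format string).

```shell
#!/bin/sh
# Hypothetical helper: print the field names declared in an event's
# format file, one per line.
# usage: list_event_fields FORMAT_FILE
list_event_fields() {
    awk -F';' '/^[ \t]*field:/ {
        sub(/^[ \t]*field:/, "", $1)    # drop the "field:" prefix
        n = split($1, words, " ")       # split "TYPE ... NAME" on whitespace
        name = words[n]                 # the last word is the field name
        sub(/\[.*$/, "", name)          # strip an array suffix such as [16]
        print name
    }' "$1"
}

# Example (as root): list_event_fields /sys/kernel/tracing/events/kmem/kmalloc/format
```

Expect a few common_ fields ahead of the event-specific ones; they are shared by every event.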
Cleaning Up
Once you're done looking, remember to turn it off; otherwise every kmalloc call keeps paying the tracing overhead, and new records keep pushing older ones out of the trace buffer:
echo 0 > <tracing>/events/kmem/kmalloc/enable
echo > <tracing>/trace
(📷 Figure 4.11 – Truncated screenshot showing an example of easily tracing the kmalloc routine)
It's that simple. No module loading, no printk, no risk of kernel crashes.
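The enable, watch, disable cycle above is easy to forget halfway through, so it can help to wrap it in one function. The following is a minimal sketch, not a polished tool: capture_event is a made-up name, the tracing directory is passed in explicitly, and it assumes the timeout command from GNU coreutils is available.

```shell
#!/bin/sh
# Hypothetical helper: enable one event, stream trace_pipe for a fixed
# number of seconds, then disable the event again.
# usage: capture_event TRACING_DIR GROUP/EVENT SECONDS
capture_event() {
    tracing=$1
    event=$2
    seconds=$3
    enable="$tracing/events/$event/enable"

    if [ ! -w "$enable" ]; then
        echo "cannot write $enable (are you root?)" >&2
        return 1
    fi

    echo 1 > "$enable"
    # Reading trace_pipe never ends on its own, so timeout(1) cuts it off.
    # timeout exits non-zero when the limit is hit; that is expected here.
    timeout "$seconds" cat "$tracing/trace_pipe" || true
    echo 0 > "$enable"
}

# Example (as root): capture_event /sys/kernel/tracing kmem/kmalloc 5
```

Interrupting the function with Ctrl-C leaves the event enabled, so check the enable file afterwards if you cut a run short.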
Alright, Here Comes the Catch
So far in this section, we've been playing with "pre-made" items—specifically, Tracepoints that kernel developers thoughtfully pre-embedded in the code for us.
The directories listed in Figure 4.10 are all static tracepoints. It's like buying a fully furnished house where the furniture is picked out by the developer.
But reality is often harsh.
What if the function you want to monitor—say, a function in your own kernel module, or some obscure kernel function—doesn't appear in the /sys/kernel/tracing/events directory at all?
If it's not in the directory, you can't use the enable method described above.
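You don't have to browse the directory tree by hand to find out. The tracing directory also contains an available_events file that lists every static event as one group:event line, so a quick grep answers the question. A small sketch, with find_static_event as a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: print every static event whose "group:event" name
# contains PATTERN; the exit status is non-zero when nothing matches.
# usage: find_static_event TRACING_DIR PATTERN
find_static_event() {
    grep -F -- "$2" "$1/available_events"
}

# Example (as root): find_static_event /sys/kernel/tracing kmalloc
```

An empty result means there is no pre-made tracepoint for that name, which is exactly the situation described here.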
This leads to the ultimate problem that the next section will solve: When you're faced with an "unfurnished house" with no pre-embedded tracepoints, how do you dynamically drill holes in the walls?
In the next section, we'll enter the true realm of dynamic kprobes—there are no ready-made directories, and everything must be created by you.