4.6 Setting Up Dynamic Kprobes (via kprobe events) — Placing a Watchpoint on Any Function

In the previous section, we mentioned that reality is often harsher than a demo. What if the function you need to monitor—perhaps an unassuming internal function in your own kernel module, or some obscure system call—doesn't even show up under /sys/kernel/tracing/events?

This is where we enter true "raw renovation" mode.

Since there's no ready-made directory, we have to create one ourselves. In this section, we'll use the kernel's dynamic event tracing framework, also known as kprobe-based event tracing, to forcefully place a watchpoint on any function.

Why Can We Place a Watchpoint?

Before diving in, let's confirm one thing: whether you can place a watchpoint on a function depends on whether it's registered on the "roster."

There are two places to check this roster:

  1. The kernel's global symbol table: That is, /proc/kallsyms. Any symbol exported by the kernel is listed here.
  2. ftrace's available functions list: That is, /sys/kernel/debug/tracing/available_filter_functions.

What if the function is in a kernel module? No problem. As long as the module is loaded into memory, its symbols are automatically merged into the kernel symbol table, and you can see it in /proc/kallsyms (assuming you have root privileges, of course). The following example will demonstrate this.
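To make the roster check concrete, here's a small sketch. The `check_roster` helper is our own invention (not a kernel interface), and you'll want root so /proc/kallsyms shows real addresses instead of zeros:

```shell
#!/bin/sh
# check_roster: report whether SYMBOL appears in FILE (hypothetical helper).
check_roster() {
    sym=$1; file=$2
    if [ -r "$file" ] && grep -qw "$sym" "$file"; then
        echo "$sym: found in $file"
    else
        echo "$sym: missing from $file"
    fi
}

# The two places to look before attempting a probe:
check_roster do_sys_open /proc/kallsyms
check_roster do_sys_open /sys/kernel/tracing/available_filter_functions
```

If the symbol shows up in both, you're clear to proceed; if it's in kallsyms but not in available_filter_functions, ftrace can't hook it (it may be notrace-marked or blacklisted).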

Step 1 — Create the Watchpoint

First, head to the tracing control center:

# cd /sys/kernel/debug/tracing

If, for some reason (like a production kernel configured with CONFIG_DEBUG_FS_DISALLOW_MOUNT=y), the path above doesn't exist, take the standard route:

# cd /sys/kernel/tracing

Now, let's create a dynamic kprobe on the do_sys_open() function. The syntax is a bit particular, so be careful not to make a typo:

echo "p:<kprobe-name> <function-to-kprobe> [...]" >> kprobe_events
  • p: indicates you are setting up a kprobe (for a return probe, it would be r:).
  • <kprobe-name> is the alias you give to this probe. You can name it anything; if omitted, the kernel auto-generates a name based on the function (something like p_do_sys_open_0).
  • <function-to-kprobe> is the function you want to trace.
  • [...] is an optional parameter that we'll cover later. Its killer feature is the ability to capture function arguments.
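For reference, the full grammar (lightly abridged from the kernel's Documentation/trace/kprobetrace.rst) looks roughly like this:

```
p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]   : set an entry probe
r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]   : set a return probe
-:[GRP/]EVENT                                          : clear a probe
```

FETCHARGS can be registers (%REG), memory dereferences (+|-offs(FETCHARG)), $retval (return probes only), and an optional :TYPE suffix such as :string.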

Let's do this in practice. We'll give do_sys_open() the alias my_sys_open:

echo "p:my_sys_open do_sys_open" >> kprobe_events

That's all it takes. When we execute this command, the kernel doesn't modify any files on disk; instead, it dynamically registers a hook in memory.

Now, if we look at the /sys/kernel/[debug]/tracing/events directory again, there's a new kprobes folder at the bottom—created by the command we just ran.

# ls -lR events/kprobes/
events/kprobes/:
total 0
drwxr-xr-x 2 root root 0 Oct 9 18:58 my_sys_open/
-rw-r--r-- 1 root root 0 Oct 9 18:58 enable
-rw-r--r-- 1 root root 0 Oct 9 18:58 filter
events/kprobes/my_sys_open:
total 0
-rw-r--r-- 1 root root 0 Oct 9 18:59 enable
-rw-r--r-- 1 root root 0 Oct 9 18:58 filter
-r--r--r-- 1 root root 0 Oct 9 18:58 format
[...]

See? The structure is exactly the same as the static tracepoints from the previous section. This is the elegance of the ftrace framework—whether it's a statically planted mine or a dynamically placed watchpoint, the interface is unified.

Step 2 — Fire It Up

Although the probe is created, it's off by default (the enable file contains 0). We need to enable it:

echo 1 > events/kprobes/my_sys_open/enable
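Incidentally, the group-level enable file you saw in the listing above flips every probe in the kprobes group at once. A sketch, assuming root and tracefs at the modern path:

```shell
# Enable (or, with 0, disable) ALL dynamic kprobes in one shot via the
# group-level control file; it only exists once at least one probe is defined.
GRP=/sys/kernel/tracing/events/kprobes/enable
[ -w "$GRP" ] && echo 1 > "$GRP" || echo "kprobes group not present, or not root" >&2
```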

Now, whenever a process calls do_sys_open, the kernel will dump the information into the trace buffer. You can view it directly:

cat trace
[...]
cat-192796 [001] .... 392192.698410: my_sys_open: (do_sys_open+0x0/0x80)
cat-192796 [001] .... 392192.698650: my_sys_open: (do_sys_open+0x0/0x80)
gnome-shell-7441 [005] .... 392192.777608: my_sys_open: (do_sys_open+0x0/0x80)
[...]

If you want that "real-time monitoring" feel, don't use cat; use cat trace_pipe instead. The data stream will continuously scroll across your screen, which is extremely useful when interactively working with dynamic kprobes.
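If you don't want an endless stream, you can bound the watch with timeout(1). A small sketch (the five-second window is an arbitrary choice):

```shell
# Stream live events for five seconds, then stop automatically.
TRACEFS=/sys/kernel/tracing
[ -r "$TRACEFS/trace_pipe" ] && timeout 5 cat "$TRACEFS/trace_pipe" || true
```

Note that reading trace_pipe is destructive: events are consumed as they're read, unlike the trace file, which can be re-read.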

Alternatively, you can save the results to a file for later analysis:

cp /sys/kernel/tracing/trace /tmp/mytrc.txt

Step 3 — Defuse the Mine

When you're done playing around, remember to clean up the battlefield. It takes two steps: flip the switch off, then defuse the bomb.

echo 0 > events/kprobes/my_sys_open/enable
echo "-:my_sys_open" >> kprobe_events

Note a detail here: deletion uses -:name. The minus sign tells the kernel to remove the probe registered under that name.

If you want to wipe out all dynamic probes at once, you can simply empty the file:

echo > /sys/kernel/tracing/kprobe_events

Once all probes are deleted, the events/kprobes directory disappears on its own. Additionally, if you want to clear the data in the buffer as well:

echo > trace

This combo—create, enable, view, disable, delete—is the basic workflow for dynamic kprobes. If you want to dig deeper (like how to format arguments), the kernel documentation has a detailed article on Kprobe-based Event Tracing. Alternatively, check out the source code of the kprobe-perf script—it's a living textbook.
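Stitched together, the whole workflow fits in a dozen lines. A sketch to adapt (the kprobe_cycle wrapper is our own naming; run it as root on a test machine, and note it assumes tracefs at /sys/kernel/tracing):

```shell
#!/bin/sh
# Full dynamic-kprobe lifecycle: create -> enable -> view -> disable -> delete.
kprobe_cycle() (
    set -e
    cd /sys/kernel/tracing
    echo 'p:my_sys_open do_sys_open' >> kprobe_events    # create
    echo 1 > events/kprobes/my_sys_open/enable           # enable
    sleep 2                                              # let events accumulate
    cp trace /tmp/mytrc.txt                              # snapshot the buffer
    echo 0 > events/kprobes/my_sys_open/enable           # disable
    echo '-:my_sys_open' >> kprobe_events                # delete
    echo > trace                                         # clear the buffer
)
# kprobe_cycle    # uncomment to run (root required)
```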


⚠️ Don't Blow Up Your System

We need to be serious here. Just as we warned when manually using kprobes, the author of the kprobe-perf script also puts the warning right in your face:

WARNING: This uses dynamic tracing of kernel functions, and could cause kernel panics or freezes, depending on the function traced. Test in a lab environment, and know what you are doing, before use.

How do we "mitigate" this risk?

  1. Only trace what you need: Don't be greedy. Only hook that specific function, and avoid setting overly broad wildcards.
  2. Shorten the time window: Take a quick glance and turn it off immediately.
  3. Leverage the buffer: The kernel relies on per-CPU buffers to store data, with a fixed size set in /sys/kernel/[debug]/tracing/buffer_size_kb. If you notice data overflow, try increasing this value.
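For point 3, the knob looks like this. Values are per-CPU kilobytes; the 8192 here is just an illustrative choice, not a recommendation:

```shell
# Inspect, then enlarge, the per-CPU trace buffer (root required).
TRACEFS=/sys/kernel/tracing
if [ -w "$TRACEFS/buffer_size_kb" ]; then
    cat "$TRACEFS/buffer_size_kb"           # current per-CPU size, in KB
    echo 8192 > "$TRACEFS/buffer_size_kb"   # bump it to 8 MiB per CPU
else
    echo "need root (and a mounted tracefs) to resize the buffer" >&2
fi
```

Keep in mind the cost is multiplied by the number of CPUs, so don't set it extravagantly high on small boards.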

The commands above run on x86 without fuss. But if you switch to an ARM board and want to do something more advanced—like printing the filename argument of the open system call—things get a bit more interesting.

We know that the second argument of do_sys_open is the filename path (on x86_64, arguments are sequentially placed in the RDI, RSI, RDX... registers). So on x86, you would write:

echo "p:my_sys_open do_sys_open file=+0(%si):string" > /sys/kernel/debug/tracing/kprobe_events

But what happens when you run this on ARM?

bash: echo: write error: Invalid argument

An error. Why?

Because ARM simply doesn't have a %si register.

This brings us back to the ABI (Application Binary Interface) knowledge we repeatedly emphasized in earlier chapters. Parameter passing is architecture-dependent. On ARM-32, the first four arguments are passed through r0, r1, r2, and r3 (recall Table 4.1). Therefore, the command needs to be changed to this:

echo "p:my_sys_open do_sys_open file=+0(%r1):string" > /sys/kernel/debug/tracing/kprobe_events

Now it's correct. We can capture all arguments at once:

echo 'p:my_sys_open do_sys_open dfd=%r0 file=+0(%r1):string flags=%r2 mode=%r3' > /sys/kernel/debug/tracing/kprobe_events

Don't forget to enable it with echo 1.
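Entry probes tell only half the story; a return probe (the r: form) captures what the function hands back. A sketch (the my_sys_open_ret name is our own choice; note the single quotes, so that $retval reaches the kernel verbatim instead of being expanded by the shell):

```shell
# Probe do_sys_open's *return* value: the new fd on success, or a negative errno.
PROBE='r:my_sys_open_ret do_sys_open ret=$retval'
EV=/sys/kernel/tracing/kprobe_events
if [ -w "$EV" ]; then
    echo "$PROBE" >> "$EV"
    echo 1 > /sys/kernel/tracing/events/kprobes/my_sys_open_ret/enable
else
    echo "need root to register the return probe" >&2
fi
```

Unlike register names, $retval is architecture-neutral, so this same line works on x86 and ARM alike.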

If you have perf-tools (or perf-tools-unstable) installed, things can be even simpler—just use the pre-packaged scripts:

rpi # kprobe-perf 'p:my_sys_open do_sys_open dfd=%r0 file=+0(%r1):string flags=%r2 mode=%r3'
Tracing kprobe my_sys_open. Ctrl-C to end.
cat-1866 [000] d... 8803.206194: my_sys_open: (do_sys_open+0x0/0xd8) dfd=0xffffff9c file="/etc/ld.so.preload" flags=0xa0000 mode=0x0
cat-1866 [000] d... 8803.206548: my_sys_open: (do_sys_open+0x0/0xd8) dfd=0xffffff9c file="/usr/lib/arm-linux-gnueabihf/libarmmem-v6l.so" flags=0xa0000 mode=0x0
cat-1866 [000] d... 8803.207085: my_sys_open: (do_sys_open+0x0/0xd8) dfd=0xffffff9c file="/etc/ld.so.cache" flags=0xa0000 mode=0x0
cat-1866 [000] d... 8803.207235: my_sys_open: (do_sys_open+0x0/0xd8) dfd=0xffffff9c file="/lib/arm-linux-gnueabihf/libc.so.6" flags=0xa0000 mode=0x0
cat-1866 [000] d... 8803.209703: my_sys_open: (do_sys_open+0x0/0xd8) dfd=0xffffff9c file="/usr/lib/locale/locale-archive" flags=0xa0000 mode=0x0
cat-1866 [000] d... 8803.210395: my_sys_open: (do_sys_open+0x0/0xd8) dfd=0xffffff9c file="trace_pipe" flags=0x20000 mode=0x0
^C
Ending tracing...

Looking at these logs, we can clearly see what files each process opened and what flags they passed. This capability is like having X-ray vision for debugging low-level issues.


Exercise: See Who Calls the Interrupt Bottom Half

Problem: Set up a kprobe that triggers whenever an interrupt handler's tasklet (bottom half) is scheduled for execution, and display the kernel stack at that time.

One approach to the solution:

In the traditional interrupt handling model (top/bottom halves), driver developers typically call the tasklet_schedule() kernel API within the hardware interrupt handler (top half) to request that the bottom half be scheduled.

What we're looking for is this underlying scheduling function.

Let's check the symbol table first:

# grep tasklet_schedule /sys/kernel/debug/tracing/available_filter_functions
__tasklet_schedule_common
__tasklet_schedule

Good, the target is locked on __tasklet_schedule. We don't just want to place a watchpoint on it; we also want to add the -s parameter so it conveniently prints the kernel stack, showing us who is calling it:

# kprobe-perf -s 'p:mytasklets __tasklet_schedule'
Tracing kprobe mytasklets. Ctrl-C to end.
kworker/0:0-1855 [000] d.h. 9909.886809: mytasklets: (__tasklet_schedule+0x0/0x28)
kworker/0:0-1855 [000] d.h. 9909.886829: <stack trace>
=> __tasklet_schedule
=> bcm2835_mmc_irq
=> __handle_irq_event_percpu
=> handle_irq_event_percpu
=> handle_irq_event
=> handle_level_irq
=> generic_handle_irq
=> __handle_domain_irq
=> bcm2835_handle_irq
=> __irq_svc
=> bcm2835_mmc_request
=> __mmc_start_request
=> mmc_start_request
=> mmc_wait_for_req
=> mmc_wait_for_cmd
=> mmc_io_rw_direct_host
=> mmc_io_rw_direct
=> process_sdio_pending_irqs
=> sdio_irq_work
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork
[...]

This pile of output is packed with information. Pay attention to the d.h. marker after the kworker line. Recalling the explanation in Figure 4.4:

  • d: Interrupts are disabled.
  • h: This is a hard interrupt context.

We need to read the stack trace from bottom to top. The bottom-most call chain indicates that an I/O operation on an SD/MMC card triggered an interrupt (bcm2835_mmc_irq), and during the interrupt handling process, a tasklet was scheduled.

This is the power of dynamic kprobes—you don't need to read the source code and guess; you just let the kernel tell you exactly how it's running.
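If you don't have perf-tools handy, plain ftrace can reproduce the -s behavior: register the probe, then attach the event's stacktrace trigger. A sketch (the tasklet_watch wrapper is our own naming; run it as root on a test machine, Ctrl-C to stop):

```shell
#!/bin/sh
# Raw-ftrace equivalent of `kprobe-perf -s 'p:mytasklets __tasklet_schedule'`.
tasklet_watch() (
    set -e
    cd /sys/kernel/tracing
    echo 'p:mytasklets __tasklet_schedule' >> kprobe_events
    echo stacktrace > events/kprobes/mytasklets/trigger   # dump a kernel stack per hit
    echo 1 > events/kprobes/mytasklets/enable
    cat trace_pipe
)
# tasklet_watch    # uncomment to run (root required)
```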