Chapter 9: The Kernel Under a Microscope: Tracing, Profiling, and the End of the Black Box
In this chapter, we tackle an awkward problem.
When a kernel behaves erratically, what do you usually do? If it still reports an error, even just an Oops or a kernel panic, count your blessings: that is the kernel crying for help, and at least it tells you where it died. The most terrifying situation isn't a crash but a "pathological" state: the system still runs, but it is absurdly slow; or a feature fails intermittently, appearing and disappearing like a ghost; or you simply want to know "Who exactly triggered this interrupt?" or "How many times was this function actually called?"
With traditional debugging techniques, we're searching with a flashlight: printk is the flashlight, kgdb is the magnifying glass. Both are great, but they share a flaw: you must already know where to look. You have to set a breakpoint or insert a print statement in advance, and if you guess wrong, you see nothing. It's like fumbling around in a dark room, where finding something is pure luck and finding nothing is the norm.
We need a more advanced approach: not shining a flashlight, but installing surveillance cameras. We need to record the system's execution trajectory silently, without setting breakpoints in advance and with as little disturbance to the running system as possible, so that we can later analyze what happened at each moment, just like playing back a recording.
This is the world of kernel tracing.
Here, we encounter two easily confused near-synonyms: Tracing and Profiling.
Understanding their difference is crucial, because many people mix up these two terms, leading to the wrong tool choice.
Tracing is continuous and detailed. It focuses on "flow." It's like the black box on an airplane, recording every function call, every parameter passed, every context switch. Its purpose is to answer "what exactly did the system do at a specific moment." Its data volume is typically large because it's exhaustive.
Profiling is statistical and sampled. It focuses on "hotspots." It's like taking timed snapshots, checking what the CPU is doing every 10 milliseconds, for example. If half of the samples land in function A, then roughly half of the CPU time went to A, and A is the first place to look for a bottleneck. Its purpose is to answer "where did the time go?" Its data volume is relatively small because it records how often each function was seen, not every single event.
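To make the contrast concrete, here is a toy sketch in shell. It is not a real tracer or profiler: the timeline, the function names (func_a, func_b, func_c), and the helper names are all invented for illustration. The "trace" keeps every event, while the "profile" only looks at the timeline at a fixed period.

```shell
#!/usr/bin/env bash
# Toy model only: a made-up timeline of which function is on the CPU.
# Each line is "<start_ms> <duration_ms> <function>"; all names are hypothetical.
timeline() {
cat <<'EOF'
0 30 func_a
30 5 func_b
35 40 func_a
75 5 func_c
80 20 func_b
EOF
}

# "Tracing": every event is kept, so per-function call counts are exact.
trace_counts() {
    timeline | awk '{ calls[$3]++ } END { for (f in calls) print f, calls[f] }' | sort
}

# "Profiling": look at the timeline every <period> ms and tally what was running.
# Assumes the timeline is ordered, so the last line's end is the total duration.
sample_counts() {
    local period="${1:-10}"
    timeline | awk -v p="$period" '
        { start[NR] = $1; stop[NR] = $1 + $2; fn[NR] = $3; total = $1 + $2 }
        END {
            for (t = 0; t < total; t += p)
                for (i = 1; i <= NR; i++)
                    if (t >= start[i] && t < stop[i]) hits[fn[i]]++
            for (f in hits) print f, hits[f]
        }' | sort
}

echo "exact call counts (trace):"
trace_counts
echo "samples per function (profile, 10 ms period):"
sample_counts 10
```

With a 10 ms period, most samples land in func_a, which is exactly the "where did the time go?" answer a profiler gives. The short-lived func_c falls between samples and never shows up, whereas the trace records its single call. Shortening the period (for example, sample_counts 5) catches it, at the cost of more samples.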
Together, these two capabilities, tracing to capture the flow and profiling to capture the hotspots, give us visibility into the kernel from both angles. The main character of this chapter is Ftrace, the most powerful tracing framework built into the Linux kernel, along with the modern tool ecosystem built on top of it.
But this isn't a light topic. Ftrace itself is complex and low-level; using its interface directly (the pseudo-files under /sys/kernel/debug/tracing, also exposed as /sys/kernel/tracing on recent kernels) will drive you crazy. So we'll peel this onion layer by layer. First, we cover the underlying Ftrace mechanism and how it uses compiler instrumentation to record every function call, almost "magically." Then we cover command-line frontends like trace-cmd, so you don't have to type echo commands by hand every time. Finally, we cover GUI tools like KernelShark, which show you a true "timeline."
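As a preview of what "typing echo commands by hand" looks like, here is a minimal sketch of a raw Ftrace session. It assumes tracefs is mounted at /sys/kernel/tracing (override TRACEFS if your system uses /sys/kernel/debug/tracing), that you run it as root, and that your kernel was built with the function tracers. The helper names ftrace_start and ftrace_stop are made up for this sketch; the pseudo-files they write to (tracing_on, current_tracer, trace) are the real Ftrace control files.

```shell
#!/usr/bin/env bash
# Sketch of driving Ftrace by hand through its pseudo-files.
# Assumption: tracefs lives here; older setups use /sys/kernel/debug/tracing.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Start tracing with the given tracer. $1 = tracer name, e.g. function or
# function_graph (the file available_tracers lists what your kernel supports).
ftrace_start() {
    echo 0 > "$TRACEFS/tracing_on"         # pause recording while we reconfigure
    echo "$1" > "$TRACEFS/current_tracer"  # select the tracer
    echo > "$TRACEFS/trace"                # clear the old contents of the buffer
    echo 1 > "$TRACEFS/tracing_on"         # start recording
}

# Stop recording but keep the buffer, so the trace file can still be read.
ftrace_stop() {
    echo 0 > "$TRACEFS/tracing_on"
}

# Example session (as root):
#   ftrace_start function_graph
#   sleep 1
#   ftrace_stop
#   head -n 20 "$TRACEFS/trace"
```

Even this tiny session takes four writes to three different files, which is the tedium that frontends such as trace-cmd remove.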
Ready? Let's turn on the microscope and look at the most authentic operating picture inside this precision machine.
9.1 Technical Preparation
Before we start tinkering, let's confirm our gear. The good news is that the hardware environment for this chapter hasn't changed—you still only need that standard Ubuntu 20.04 LTS development machine (or VM). All code examples still live in this book's GitHub repository: https://github.com/PacktPublishing/Linux-Kernel-Debugging.
But in this chapter, we'll introduce two new friends: LTTng and Trace Compass.
LTTng (Linux Trace Toolkit - next generation) is an industrial-grade tracing system—heavier and more powerful than Ftrace, suited for complex user-space + kernel-space joint tracing. Trace Compass is a GUI tool for visualizing LTTng data.
Installation Steps:
Open your terminal. Both tools are available in the default Ubuntu 20.04 repositories, so just run apt:
sudo apt update
sudo apt install lttng-tools lttng-modules-dkms babeltrace tracecompass
(Note: lttng-modules-dkms compiles and installs LTTng's kernel modules for your running kernel. This requires the kernel headers and build tools to be installed already, which is the standard environment we've been using throughout the previous chapters. If apt cannot find a tracecompass package on your release, install Trace Compass from the Eclipse Trace Compass project's download page instead.)
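If the DKMS build fails, missing headers or build tools are the usual suspects. The following is a small preflight sketch for checking both. The helper names have_kernel_headers and missing_tools are invented for this example, and the check assumes the common layout in which /lib/modules/<release>/build points at the installed headers.

```shell
#!/usr/bin/env bash
# Preflight sketch for a DKMS module build. Helper names are hypothetical.

# Succeeds if kernel headers for the given release appear to be installed.
# $1 = kernel release (default: the running kernel)
# $2 = modules root (default: /lib/modules)
have_kernel_headers() {
    local release="${1:-$(uname -r)}"
    local modroot="${2:-/lib/modules}"
    [ -e "$modroot/$release/build/Makefile" ]
}

# Prints each named tool that is not on PATH; prints nothing if all are found.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example:
#   have_kernel_headers || echo "headers missing: install linux-headers-$(uname -r)"
#   missing_tools gcc make dkms
```

On Ubuntu, the linux-headers-$(uname -r) and build-essential packages normally supply what these checks look for.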
After installation, we can quickly verify:
$ lttng --version
lttng (LTTng Control) 2.10.7 - lttng-tools (2.10.7)
- Website: https://lttng.org
- Documentation: https://lttng.org/docs
- Git tree: https://git.lttng.org
...
If a version string appears, the installation worked. The exact version you see depends on your release and may differ from the one shown here.
Workspace Recommendation:
While not mandatory, I recommend creating a new folder in your working directory specifically for this chapter's experiment data. Tracing tools generate quite a few output files: trace-cmd writes .dat files (trace.dat by default), and LTTng writes traces in CTF (Common Trace Format). Don't mix them with your source code, or cleaning up later will be painful.
mkdir -p ~/kd_ch9_tracing
cd ~/kd_ch9_tracing
Beyond that, there are no other fancy requirements. The focus of this chapter isn't on setting up environments, but on how to use tools that already exist to "see" things that were previously invisible.