9.11 LTTng and Trace Compass: The God Mode of High-Level Perspectives
Entering the Scene: A Different Perspective on the Kernel
Up to this point, we've mostly been working with Ftrace and its derivatives. Ftrace is great—it's built into the kernel, has almost zero dependencies, and not only lets us observe but also modify behavior. But sometimes, what you need isn't just a "flashlight," but a "full panoramic monitoring system."
This is where LTTng (Linux Trace Toolkit - next generation) comes in.
LTTng is quite different from the tools we've discussed so far. If Ftrace is a compact Swiss Army knife, LTTng is a heavy-duty industrial CT scanner. It doesn't just watch the kernel: it can simultaneously monitor user-space applications and libraries, and it was designed from the start for multi-core parallel and real-time systems. If you're debugging a bizarre issue where "a stutter happens once every ten minutes," or you need to analyze complex interactions across multiple cores, LTTng's capabilities are absolutely formidable.
LTTng has a fairly long history. Its ancestor, the original Linux Trace Toolkit (LTT), dates back to 1999; LTTng, the actively maintained next-generation rewrite, first appeared in 2005. It's open source, freely licensed (available under LGPL/GPL/MIT), and highly regarded in the high-performance computing domain.
However, its documentation is also notoriously comprehensive. Due to space constraints, I can't reproduce all of its manuals here. If you're interested in "what is Tracing" and its fundamental differences from "Profiling," I highly recommend reading through the official LTTng documentation—it's truly textbook-grade material.
Getting Started: LTTng Installation and Quick Start
Since we want to use it in the kernel, the first step is naturally installation.
If you're using Ubuntu 20.04 and don't want to bother with source packages, you can get it done directly with apt:
sudo apt install lttng-tools lttng-modules-dkms -y
⚠️ Repository Version Warning Keep in mind, however, that the repository version usually isn't the latest. At the time of writing, installing this way gets you lttng-tools 2.11 and lttng-modules 2.12, while the latest official stable release is already 2.13. In most cases this is fine, but if you find that certain new features are missing, it might just be a generational gap in versions.
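Either way, it's worth a quick sanity check on what you actually got:

lttng --version
# The lttng-modules kernel modules normally load on demand once the
# root session daemon starts, so an empty result here is not alarming:
lsmod | grep lttng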
After installation, I recommend going through the Quick start guide on the LTTng website. LTTng can trace user space too, but since this is a kernel debugging book, we only care about kernel tracing here.
Core Workflow: LTTng's Five Steps
Recording a kernel tracing session with LTTng follows very clear logic. Although it has many parameters, the core workflow consists of only five steps (root privileges are required throughout):
Step 1: Create a Session
We need to tell LTTng: "I'm starting work, store the data here."
lttng create <session-name> --output=~/my_lttng_traces/
If you don't add the --output parameter, it defaults to writing data under the ~/lttng-traces/ directory.
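A concrete run (the session name demo1 here is arbitrary) looks like this, and lttng list confirms the session exists:

sudo lttng create demo1 --output=~/my_lttng_traces/
sudo lttng list    # the new session shows up, marked as inactive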
Step 2: Enable Events
The core philosophy of LTTng is "events." You need to explicitly tell it what you want to capture. For the sake of a simple demo (and to avoid blowing up your hard drive), we'll enable all kernel events:
lttng enable-event --kernel --all
⚠️ The Cost of Brute-Forcing
This is just a demo. In a production environment, --all is a dangerous move. All syscalls, all scheduler events, all network packet ingress and egress... everything gets recorded. On a high-load server, the log files will grow at a rate that will make you question reality. In actual use, please make sure to enable only the specific events or channels you need, as in the sketch below.
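Targeted selection looks like this; the event names below are just examples (run lttng list --kernel as root to see everything available on your system):

# Only the scheduler events we care about:
lttng enable-event --kernel sched_switch,sched_wakeup
# Only a couple of specific system calls:
lttng enable-event --kernel --syscall openat,execve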
Step 3: Start Recording
Ready? Flip the switch.
lttng start
Now, go do what you need to do. Reproduce that bug, or run that stress test.
Step 4: Stop Recording
lttng stop
This step isn't strictly necessary—because the next step, destroy, will automatically stop it. But if you want to record in batches within the same session, stop lets you pause.
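A batched recording within one session might look like this minimal sketch:

lttng start    # record phase 1
# ... reproduce the bug ...
lttng stop
lttng start    # resume recording into the same session
# ... run the stress test ...
lttng stop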
Step 5: Destroy the Session
lttng destroy
"Destroy" sounds a bit scary here, but don't worry—it only destroys the session configuration in the LTTng daemon. It does not delete the raw trace data you've already saved.
Finally, if you want non-root users to be able to view this data, remember to change the permissions:
sudo chown -R $(whoami):$(whoami) ~/my_lttng_traces
Hands-on: Tracing That Ping Packet Again
To give you an intuitive feel, I wrote a simple Bash script that wraps the steps above. You can find it here: ch9/lttng/lttng_trc.sh. This script is just to make your life a little easier (I only did light testing, so I can't guarantee it works perfectly in all environments).
Its usage is simple:
$ cd <lkd_src>/ch9/lttng ; sudo ./lttng_trc.sh
Usage: lttng_trc.sh session-name program-to-trace-with-LTTng|0
1st parameter: name of the session
2nd parameter, ...:
If '0' is passed, we just do a trace of the entire system (all kevents),
else we do a trace of the particular process (all kevents).
Eg. sudo ./lttng_trc.sh ps1 ps -LA
[NOTE: other stuff running _also_ gets traced (this is non-exclusive)].
Let's continue using ping as our guinea pig. It's practically an old friend by now:
sudo ./lttng_trc.sh ping1 ping -c1 packtpub.com
The script will run its course, and you'll notice that besides the screen output, it also generates a compressed archive in the /tmp directory (for example, lttng_ping1_08Mar22_1104.tar.gz). This is your "black box" recording. You can copy it to another machine for analysis at your own pace, without needing to keep a heavy GUI hanging on your production environment.
Command-Line Analysis: The Babeltrace Flood
Now that we have the data, how do we view it?
LTTng comes with a command-line tool called Babeltrace 2. As the name implies, it's a translator for a Tower of Babel of trace formats: it converts LTTng's efficient binary CTF (Common Trace Format) output into human-readable text.
You can refer to the official documentation to learn its detailed usage. But I must give you a heads-up: the output volume is staggering.
That single ping operation from earlier, when exported to text via Babeltrace 2, generated over 123,000 lines of information.
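If you'd like to witness (or just count) the flood yourself, here's a minimal sketch, assuming the output directory from Step 1; babeltrace2 searches the given directory recursively for CTF traces, so pointing it at the session's output directory is enough:

babeltrace2 ~/my_lttng_traces/ | wc -l    # how bad is it?
babeltrace2 ~/my_lttng_traces/ | less     # browse it, if you dare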
It's like taking the plumbing blueprints of an entire skyscraper and spreading them across a football field. All the information is there, but trying to read this text stream with your naked eye will make you go blind. This is exactly why we need a graphical interface.
Graphical Visualization: Trace Compass's God's-Eye View
If using the command line is "looking for pipes on the ground," then Trace Compass lets you "see the big picture from a satellite."
Trace Compass is an Eclipse-based project (Eclipse RCP) and the officially recommended graphical viewer for LTTng. It's very simple to install—just download and extract it from the official website.
After installation, launch Trace Compass, select File | Open Trace... from the menu bar, and then select the directory where you saved your trace data. It will automatically parse it and present you with an extremely professional console.
The screen will be divided into several sections. At the top is a massive timeline, in the middle is the event list, and at the bottom is the detailed information. It might be a bit overwhelming at first glance because it really does provide information across too many dimensions.
Precision Targeting: The Art of Filtering
Facing an ocean of data, we need to learn how to fish. In that ping example from earlier, to find that split-second network packet transmission, I did two things:
- Lock onto a CPU: I knew ping happened to run on CPU 6 (probably because the scheduler just happened to put it there), so I typed 6 in the CPU column search box.
- Lock onto an event: In the Contents column, I typed icmp.
Trace Compass will immediately filter out the noise, leaving only the needle you care about.
In the screenshot above, you can see a highlighted row: net_dev_queue. This represents the network device preparing the transmit queue. Right-click this row, select Copy to Clipboard, and you'll see a very detailed block of text:
Timestamp:   18:39:11.274 970
Channel:     channel0_6
CPU:         6
Event type:  net_dev_queue
TID:         3932722
Prio:        20
PID:         3932722
Contents:    skbaddr=0xffff8a4c30fb5c00, len=98, name=eno2,
             network_header_type=_ipv4, network_header=ipv4=[version=4,
             ihl=5, tos=0, tot_len=84, id=0x2950, frag_off=16384,
             ttl=64, protocol=_icmp, checksum=0xe6db, saddr_padding=[],
             saddr=[192, 168, 1, 16], daddr_padding=[],
             daddr=[104, 22, 0, 175], transport_header_type=_icmp,
             transport_header=icmp=[type=8, code=0, checksum=393,
             gateway=720897]]
This level of detail is a goldmine for a network driver developer. You can see the skb address, packet length, network interface name, and even every bit in the IPv4 header and the ICMP payload. This is the value of runtime parameters.
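In fact, the numbers cross-check nicely: ihl=5 means a 20-byte IPv4 header, and an ICMP echo message is 8 bytes of header plus ping's default 56-byte payload, so tot_len = 20 + 8 + 56 = 84; add the 14-byte Ethernet header and you get the skb's len=98. And type=8, code=0 is an ICMP echo request, exactly what ping -c1 sends.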
The Power of Color: Understanding System State at a Glance
Trace Compass's greatest strength lies in its color coding.
It paints different colors on the timeline to represent process states. You don't need to read logs; just look at the graph:
- Blue: System call.
- Green: User-space code is executing.
- Yellow: Blocked, such as waiting for I/O.
Looking back at the earlier screenshot, the starting position of the ping process on the far left is brown. Hovering the mouse over it reveals a sched_switch: the process is waiting its turn for the CPU (a non-blocking wait). And that long strip of lemon yellow on the far right is the process blocked waiting on I/O (specifically, a power_cpu_idle event, sleeping for 14.6 microseconds).
This kind of visualization lets you instantly understand the "breathing rhythm" of the system.
No Silver Bullet: Tool Selection and Compromises
At this point, our journey through kernel tracing in this chapter is coming to an end.
We've seen Ftrace, the lightweight ninja; played with the golden duo of trace-cmd and KernelShark; experienced the convenience of perf-tools scripts; and finally taken a tour through the grand panorama of LTTng and Trace Compass.
You'll notice one fact: there is no perfect tool.
- Trace Compass's visualization is great—you can tell what the system is doing at a glance—but when it comes to displaying context details (like specific pending interrupt flags or preemption depth), it's not as granular as KernelShark (and the underlying Ftrace latency format).
- Ftrace has everything, and can even dump its in-memory buffer to the console when the kernel crashes (ftrace_dump_on_oops; see the one-liner after this list), but its usability when handling massive amounts of multi-core data falls short of LTTng.
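For reference, turning on that crash-dump behavior is a one-liner (the sysctl exists on any kernel built with Ftrace):

sudo sysctl kernel.ftrace_dump_on_oops=1
# or bake it in at boot by adding ftrace_dump_on_oops to the kernel command line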
As far back as 1986, Fred Brooks told us in his essay "No Silver Bullet" (later collected in the anniversary edition of The Mythical Man-Month): there is no silver bullet.
This is why you need to put all these tools into your toolbox. In different scenarios and for different problems, choose the sharpest knife.
In the next chapter, we'll face a heavier topic: Kernel Panic. Don't panic: with the sharp debugging tools we now have, even when the kernel crashes, we can still read something from the corpse.
Chapter Echoes
Now, let's return to the foreshadowing we planted at the beginning of this chapter.
Remember? We started out wrestling with the question: why do kernel developers need this "microscope"-level capability?
Now the answer is clear: because modern systems are too complex.
When your code runs amidst hundreds of thousands of lines of kernel code, dozens of concurrent processes, and countless intertwined interrupt contexts, "intuition" is the most unreliable thing. Only Tracing can freeze this chaotic dynamic process into visualizable, analyzable data.
Whether you're watching real-time data stream across the screen via trace_pipe, or dragging the timeline in Trace Compass, you're essentially doing the same thing—eliminating uncertainty. This is the essence of system-level debugging.
In the next chapter, when the screen turns red and an Oops with a stack dump comes pouring out of /dev/kmsg, don't panic. That's the system crying for help, and the tools we learned today are your codebook for deciphering those distress signals.