9.9 Ftrace in Action: From Stack Overflow Monitoring to Android Debugging

Let's set the Instances topic aside for now.

With all these tools at our disposal, how do we actually use them? Ftrace is like a Swiss Army knife: so feature-rich that you might find yourself holding the blade, unsure where to cut. In this section, we'll walk through a few real-world bug hunts.

We'll start with a subtle but deadly kernel issue: stack overflows.


Monitoring Kernel Stack Usage — Ftrace's "Lie Detector"

You probably know that every live thread has two stacks: a user-space stack and a kernel-space stack.

The user-space stack is large and "soft"—it grows dynamically. On a typical Linux distribution, its upper limit (RLIMIT_STACK) is usually 8 MB. If you run out of space, the kernel automatically expands it until it hits this limit.

But the kernel-space stack is an entirely different beast.

It is fixed, rigid, and very small.

  • On 32-bit systems, it's typically only 8 KB.
  • On 64-bit systems, it's typically 16 KB.

This leads to a serious consequence: overflow. If your kernel code recurses too deeply or allocates overly large local variables, the kernel stack will overflow. This isn't as simple as "throwing an exception"—it usually means the system locks up instantly or triggers a Kernel Panic. It's the kind of bug that has you staring at the console at 3 AM, questioning your life choices.

⚠️ Configuration Tip

There are two kernel options that can slightly ease this pain:

  • CONFIG_VMAP_STACK: When enabled, the kernel stack is allocated from the vmalloc region. This allows the kernel to set up a "guard page." If a stack overflow touches this guard page, the kernel faults immediately and reports the overflow (typically as an Oops that kills the offending task) instead of silently corrupting whatever memory sits next to the stack.
  • CONFIG_THREAD_INFO_IN_TASK: This moves the thread_info structure off the kernel stack and into task_struct, so an overflow can no longer overwrite it. That limits the cascading effects of a stack overflow.

If you're compiling a kernel, make sure to enable these.
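One way to check whether a given kernel was built with these options is to search its config file. The sketch below is only an illustration: the function name kconfig_state is made up, and the default path /boot/config-$(uname -r) is an assumption that holds on many distributions but not all (some kernels expose /proc/config.gz instead).

```shell
#!/bin/sh
# kconfig_state OPTION [CONFIG_FILE]
# Prints "OPTION=<value>" if the option is set, "OPTION is not set" otherwise.
# The default config path is an assumption; pass the file explicitly if your
# distribution stores it elsewhere.
kconfig_state() {
    opt=$1
    cfg=${2:-/boot/config-$(uname -r)}
    if [ ! -r "$cfg" ]; then
        echo "$opt: cannot read $cfg"
        return 1
    fi
    # Match only a real assignment, not the "# CONFIG_X is not set" comment.
    grep "^${opt}=" "$cfg" || echo "$opt is not set"
}
```

Calling kconfig_state CONFIG_VMAP_STACK, and then the same for CONFIG_THREAD_INFO_IN_TASK, shows how the kernel in question was configured.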

Because the kernel stack is so fragile, monitoring how much of it is used at runtime becomes an incredibly valuable debugging technique. Ftrace comes with a dedicated tool for exactly this—the Stack Tracer.

To use it, ensure your kernel configuration includes CONFIG_STACK_TRACER=y (it's usually enabled in distribution kernels). Unlike the other tracers, it isn't switched on through tracefs; it is toggled at runtime via the /proc pseudo-file /proc/sys/kernel/stack_tracer_enabled. Even when the option is compiled in, the tracer itself starts out disabled.

Let's do a hands-on exercise. We'll enable the ftrace stack tracer, run a sampling pass, and see which kernel functions are the "stack-eating monsters" (note that this requires root privileges):

Step 1 — Enable the Stack Tracer

echo 1 > /proc/sys/kernel/stack_tracer_enabled
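Before a measurement run it can help to clear any previously recorded maximum, so an old peak doesn't mask the new one. The sketch below wraps that in two small helpers. The function names and the two environment variables are invented for illustration; the default paths are the ones named in the text, and writing to them requires root.

```shell
#!/bin/sh
# Paths can be overridden through the environment (hypothetical variables,
# handy for dry runs). Defaults are the real locations; writing needs root.
STACK_SYSCTL=${STACK_SYSCTL:-/proc/sys/kernel/stack_tracer_enabled}
TRACEFS=${TRACEFS:-/sys/kernel/tracing}

stack_tracer_start() {
    # Writing 0 to stack_max_size clears the recorded maximum.
    echo 0 > "$TRACEFS/stack_max_size" || return 1
    echo 1 > "$STACK_SYSCTL"
}

stack_tracer_stop() {
    echo 0 > "$STACK_SYSCTL"
}
```

Call stack_tracer_start before the workload and stack_tracer_stop afterwards; stack_max_size and stack_trace keep their last recorded values after the tracer is switched off.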

Step 2 — Run a Sampling Pass

We need a script to help with this. I wrote a simple script, ch9/ftrace/ftrc_1s.sh, that enables ftrace and records kernel activity over a 1-second window. During this window, we'll try our best to trigger deep stack usage.

cd /sys/kernel/tracing
<...>/ch9/ftrace/ftrc_1s.sh
[...]
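The contents of ftrc_1s.sh aren't reproduced in this section. As a rough idea of what a fixed-length capture involves, here is a hypothetical stand-in (not the book's script). It assumes the function_graph tracer is available in your kernel; the function name trace_window and the default output path are made up for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a short capture window; not the book's script.
# trace_window SECONDS [TRACEFS_DIR] [OUTPUT_FILE]
trace_window() {
    secs=$1
    tfs=${2:-/sys/kernel/tracing}
    out=${3:-/tmp/trace_window.txt}
    echo 0 > "$tfs/tracing_on" || return 1            # start from a stopped state
    echo function_graph > "$tfs/current_tracer" || return 1
    echo 1 > "$tfs/tracing_on"                        # open the window
    sleep "$secs"
    echo 0 > "$tfs/tracing_on"                        # close the window
    cat "$tfs/trace" > "$out"                         # save the report
    echo nop > "$tfs/current_tracer"                  # back to the no-op tracer
}
```

Run as root, trace_window 1 would record roughly one second of kernel activity and save the report to the output file.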

Step 3 — Check the Maximum Stack Depth and Details

Now the data is recorded. We mainly care about two files:

  • stack_max_size: The maximum kernel stack usage, in bytes, observed since the stack tracer was enabled (or since the value was last reset).
  • stack_trace: The detailed call stack from the moment that maximum depth was recorded.

cat stack_max_size
cat stack_trace
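To spot the "stack-eating monsters" quickly, you can sort the frames in stack_trace by size. The helper below is a sketch that assumes the column layout described in the kernel's ftrace documentation (an index such as "0)", then Depth, Size, and Location); the function name top_frames is made up for illustration.

```shell
#!/bin/sh
# top_frames [N] [FILE]: list the N largest stack frames, biggest first.
# Assumes rows of the form "  0)   2928   224   func+0xbc/0x4ac".
top_frames() {
    n=${1:-5}
    file=${2:-/sys/kernel/tracing/stack_trace}
    # Data rows start with an index such as "0)"; the header rows do not.
    awk '$1 ~ /^[0-9]+[)]$/ { print $3, $4 }' "$file" | sort -rn | head -n "$n"
}
```

Each output line is a frame size in bytes followed by the function that owns it, so the functions with the largest local-variable footprint rise to the top.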

The screenshot below shows the result of one run. In this example, the maximum kernel stack usage was 4160 bytes (just over 4 KB, meaning more than half of the 8 KB stack on a 32-bit system is already gone).

(Figure 9.19: stack_max_size reports 4160 bytes, followed by the detailed function call stack.)

If you see this number approaching 8 KB (32-bit) or 16 KB (64-bit), your heart rate should increase. That means you're only a few steps away from a crash.
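To put a stack_max_size reading in perspective, it helps to express it as a share of the total stack. A minimal sketch: the function name stack_pct is made up, and the 8192 and 16384 figures are just the typical sizes quoted above (the real stack size depends on your architecture and kernel configuration).

```shell
#!/bin/sh
# stack_pct USED_BYTES STACK_BYTES: integer percentage of the stack consumed
# (rounded down).
stack_pct() {
    echo $(( $1 * 100 / $2 ))
}

stack_pct 4160 8192     # prints 50 (8 KB stack)
stack_pct 4160 16384    # prints 25 (16 KB stack)
```

The same 4160-byte peak that eats half of an 8 KB stack is only a quarter of a 16 KB one, which is why the threshold that should worry you depends on the platform.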

For more details on the Stack Tracer, check out the official kernel documentation: Documentation/trace/ftrace.rst in the kernel source tree (rendered as trace/ftrace.html in the online kernel docs).


How Android Uses Ftrace to Debug System Issues

Whether a tool is useful depends on the battlefield.

The Android Open Source Project (AOSP) doesn't just use Ftrace; it practically treats it as a nuclear weapon. Android wraps Ftrace in its own tooling layer, but developers can also use it directly.

There's a particularly hardcore quote in the AOSP documentation that I recommend memorizing:

"However, every single difficult performance bug in 2015 and 2016 was ultimately root-caused using dynamic ftrace."

That's a bold claim, but if you've seen the complexity of Android devices, you'll understand it's not an exaggeration.

The documentation specifically highlights a few killer use cases for Ftrace:

  1. Debugging Uninterruptible Sleep: Why is this hard to debug? Because when a process is asleep, its logs stop too. But with Ftrace's function or function_graph tracer combined with filters, you can capture a kernel stack trace every single time the kernel hits a function that puts a task into uninterruptible sleep. It's like taking an EKG of a comatose patient every minute.

  2. Confirming When a Driver "Hogs" the CPU: Drivers might disable interrupts or preemption for extended periods to keep a critical section "atomic." That looks harmless when you only consider the driver's own code, but for as long as it lasts, nothing else can run on that CPU.

    Remember the "latency tracers" we mentioned in the previous section?

    • irqsoff
    • preemptoff
    • preemptirqsoff

    The AOSP documentation calls them out directly: these tools are primarily used to confirm whether a driver is keeping interrupts or preemption disabled for too long.
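These latency tracers only exist if the kernel was built with them, so it's worth checking before you try to select one. The helper below is a small sketch; the function name has_tracer is made up, while available_tracers is the standard tracefs file listing the tracers compiled into the running kernel.

```shell
#!/bin/sh
# has_tracer NAME [TRACEFS_DIR]: succeed if NAME is listed in available_tracers.
has_tracer() {
    tfs=${2:-/sys/kernel/tracing}
    # available_tracers is one space-separated line; split it into one name
    # per line and look for an exact match.
    tr ' ' '\n' < "$tfs/available_tracers" | grep -qx "$1"
}
```

If has_tracer irqsoff succeeds, you can write irqsoff into current_tracer, run the suspect workload, and then read tracing_max_latency (reported in microseconds by default) along with the trace file to see which code path kept interrupts off the longest.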

Real-World Case: Post-Camera Jank

The documentation provides a real example involving a Pixel XL phone.

Symptom: After taking an HDR photo, immediately rotating the viewfinder caused noticeable UI jank.

Method: They tracked it down using Ftrace. Through the trace, they discovered that a critical path was blocked by a long-running operation, causing response latency. (See: https://source.android.com/devices/tech/debug/ftrace)

Beyond this example, the AOSP documentation summarizes several common patterns:

  • Driver Misbehavior: A driver keeps hardware interrupts (IRQs) or preemption disabled for too long, slowing down system responsiveness.
  • Excessive Softirq Duration: Softirqs disable kernel preemption. If softirq handling takes too long, other tasks can't be scheduled in, and the system feels like it's frozen.

This is fascinating stuff. If you're writing embedded Linux or Android drivers and don't understand Ftrace, you're essentially driving blind.


Netflix's Cloud Practice: A Preview

This "relentless pursuit" mentality isn't limited to the mobile world.

In the cloud, Ftrace is also a stabilizing force. Here is a preview of an even more exciting real-world case: using the perf-tools scripts (the Ftrace-based frontend toolkit written by Brendan Gregg that we mentioned earlier) to debug a database disk I/O issue on Netflix Linux instances.

We'll dedicate a future section ("Investigating a database disk I/O issue on Netflix cloud instances with perf-tools") to cover this.

There, the battlefield shifts from phones to cloud servers, and the weapon of choice shifts from raw Ftrace to more user-friendly scripts—but the core logic remains the same: making the invisible visible.