Skip to main content

6.4 Decoding SLUB Debug Error Reports

Alright, now that we've added slub_debug=FZPU to the kernel boot parameters and successfully triggered some bugs, let's look at what happened. As we saw at the end of the previous section, the SLUB debug mechanism did catch the culprit and spat out a rather intimidating pile of error logs.

But what exactly is that log dump trying to say?

Don't worry—that's exactly what we'll do in this section: dissect the SLUB debug output line by line, just like a forensic report. You'll find that this error report is densely packed with clues—everything from which cache had the incident, down to the exact line of code that wrote the illegal byte. It's all in there.

Since we already have the run_tests script handy, let's pull up a few typical test cases and examine them alongside the logs.

Decoding the "Out-of-Bounds Write" Report

Let's start with test case #5.2. This is a typical Right OOB Write—meaning we wrote data past the end of a legitimate slab object. It's the one marked as [V4] in Table 6.2.

As soon as you execute this operation, the SLUB debug framework immediately jumps out and loudly complains. Its output looks something like this (don't be intimidated by the length; we'll break it down piece by piece):

(Figure 6.3 insertion point: Screenshot of SLUB catching a right out-of-bounds write, part 1)

Line 1: Who Raised the Alarm

BUG kmalloc-32 (Tainted: G B OE ): Right Redzone overwritten

Immediately following BUG is the name of the slab cache where the incident occurred—in this case, kmalloc-32. This makes perfect sense, since our test code dynamically allocated exactly 32 bytes (kmalloc(32, ...)). The subsequent Right Redzone overwritten is the diagnostic conclusion: Right redzone overwritten. This is very straightforward—what we did in the dynamic_mem_oob_right() function was exactly this: we wrote data at the 32nd byte position (the legitimate range was 0–31).

By the way, the Tainted: ... that appears here is the kernel's taint flag. Because we loaded an unofficial test module, the kernel considers itself "tainted." This flag tells you whether the kernel is still in a "trusted" state.

Line 2: Crime Scene Coordinates

INFO: 0x00000000d969b0bf-0x00000000d969b0bf

This INFO line gives the start and end addresses of the corrupted memory region. You might notice that these addresses look a bit odd—they don't look like the kernel virtual addresses you normally see. That's correct: these addresses are hashed. This is a security measure to prevent attackers from inferring the memory layout directly from logs and leaking information. Don't forget, we're running these tests on a production-grade kernel.

Line 3: Who is the Culprit

INFO: dynamic_mem_oob_right+0x39/0x9c [test_kmembugs]

This is one of the most valuable pieces of information in the entire report. It tells us exactly where in the code the erroneous access occurred. The format is the standard <函数>+偏移/长度 [模块名]. In plain English: the error occurred in the dynamic_mem_oob_right() function within the test_kmembugs module, at an offset of 0x39 bytes from the start of the function (the kernel estimates the total length of this function to be 0x9c). (We'll cover how to use this offset to pinpoint the exact source code line in the next chapter. Just keep this skill in mind for now.)

The right side of this line usually also displays the process context—the PID and which CPU core it was running on.

Line 4: The Backtrace

Next comes a long kernel stack backtrace:

kmem_cache_alloc_trace+0x40b/0x450
dynamic_mem_oob_right+0x39/0x9c [test_kmembugs]
dbgfs_run_testcase+0x4d9/0x59a [test_kmembugs]
vfs_write+0xca/0x2c0
[...]

There's a rule for reading stack frames: read from bottom to top, and ignore lines starting with ?. You can clearly see the call chain: user space enters the kernel via vfs_write, and ultimately crashes at dynamic_mem_oob_right().

Line 5: Who Poisoned It

INFO:Freed in kvfree+0x28/0x30 [...]

SLUB also thoughtfully (if it could put up a wanted poster) tells you who freed this memory. This information usually helps you pinpoint the culprit—because in most cases, the task that freed the memory is the same one that originally allocated it.

Below this INFO: line, there is usually the call stack from when the free occurred. However, note that this "who freed it" information isn't always accurate; sometimes it might not be available.


The next part is even more interesting: a bunch of statistics and a memory snapshot.

(Figure 6.4 insertion point: Screenshot of SLUB error, part 2, showing the corrupted redzone, object memory, and padding contents)

This shows the slab's statistics, the state of the unfortunate object, and—most critically—the actual memory contents. You can see the values of the left and right redzones, the padding bytes, and the contents of the actual memory area.

Let's look back at that test code:

// ch5/kmembugs_test.c
int dynamic_mem_oob_right(int mode)
{
volatile char *kptr, ch = 0;
char *volatile ptr;
size_t sz = 32;
kptr = (char *)kmalloc(sz, GFP_KERNEL);
// [...]
ptr = (char *)kptr + sz + 3; // right OOB

// [...]
} else if (mode == WRITE) {
/* 这里的 OOB 访问不会被 UBSAN 抓到,但会被 KASAN/SLUB 抓到! */
*(volatile char *)ptr = 'x'; // invalid, OOB right write

/* 而下面这些 OOB 访问会被 KASAN/UBSAN 抓到。
结论:只有基于索引的访问才会被 UBSAN 抓。 */
kptr[sz] = 'x'; // invalid, OOB right write
}
}

In the two highlighted sections of our code, we deliberately wrote an illegal x (ASCII 0x78). Now, going back to Figure 6.4, if you look at the erroneous write I circled in that figure, you can clearly see that 0x78—that's exactly what our test code wrote in there. That's the smoking gun.

About Poisoning

You'll also see various magic numbers in the logs. This is the SLUB Poisoning mechanism at work (because we enabled the P flag):

  • 0x6b: This is the value used to initialize legitimate slab memory areas, representing "uninitialized."
  • 0xa5: This is the poison value for the end marker.
  • 0x5a: This is the poison value representing "use of uninitialized memory."

These numbers aren't random—they are the fingerprint database in the hands of memory detectives.

The Final Part

(Figure 6.5 insertion point: Screenshot of SLUB error, part 3, showing CPU registers, fix-up information, etc.)

At the end of a standard kernel bug report, you'll usually see the CPU register values.

But the bottom two lines are SLUB's masterstroke:

FIX kmalloc-32: Restoring 0x00000000d969b0bf-0x00000000d969b0bf

Because we enabled the F flag (sanity checks), the kernel doesn't just report the error—it attempts to fix the scene. It tries to restore the corrupted slab object to a "correct" state (for example, by filling the overwritten redzone back in). Of course, this doesn't always succeed, and such a fix-up merely prevents the kernel from crashing immediately; it doesn't mean the problem has gone away. The log also mentions ... not freed, which is completely normal—because this report was generated while the object hadn't been freed yet (our code will call kfree a bit later).

Decoding the "Use-After-Free" (UAF) Report

Next, let's look at Use-After-Free (UAF), the case marked as [V5] in Table 6.2.

The UAF error looks like this:

BUG kmalloc-32 (Tainted: G B OE ): Poison overwritten
--------------------------------------------
INFO: 0x00000000d969b0bf-0x00000000d969b0bf @offset=872. First byte 0x79 instead of 0x6b
INFO: Allocated in uaf+0x20/0x47 [test_kmembugs] age=5 cpu=5 pid=2306

The template is the same as before, but the diagnostic conclusion has changed: Poison overwritten. Why? Because in the uaf() test case, we first kfreeed this memory, and then wrote a byte into it!

The INFO line here gives the hashed address of the corrupted region, along with a very crucial piece of information: offset=872 and First byte 0x79 instead of 0x6b. This means: at offset 872, where the poison value 0x6b should have been, it is now 0x79 (which is exactly the ASCII value of the character y we wrote in).

The subsequent INFO line shows where the allocation occurred (uaf+0x20/0x47), along with the process PID. Below that, as usual, there is a stack backtrace, as well as the free information:

INFO: Freed in uaf+0x34/0x47 [test_kmembugs] age=5 cpu=5 pid=2306

Here you'll notice an interesting fact: both the allocation and the free happened in the same function, uaf, and even the PID didn't change. This is a classic case of self-sabotage—it allocated it, it freed it, and it corrupted it.

Decoding the "Double-Free" Report

Finally, let's take a quick look at Double-Free, the case marked as [V6] in Table 6.2. The kernel reports it like this:

BUG kmalloc-32 (Tainted: G B OE ): Object already free
--------------------------------------------
INFO: Allocated in double_free+0x20/0x4b [test_kmembugs] age=1 cpu=5 pid=2330

The conclusion is very straightforward: Object already free. The template format that follows is exactly the same as what you just learned. You can try interpreting it yourself: who did this? In which function?

Regarding Uninitialized Memory Reads (UMR)

We've actually already seen how SLUB handles Uninitialized Memory Reads (UMR—the tests marked as [V3] and [V7] in Table 6.2). When you enable slub_debug, although it doesn't always trigger a prominent BUG report, if you dump that memory region, you'll see it's entirely filled with the poison value 0x6b. This is SLUB quietly telling you: "Buddy, this memory hasn't been initialized yet; whatever you're reading out of it is garbage."

At this point, you should be able to summarize a pattern: While the SLUB debug framework can catch the vast majority of corruption issues on slab memory, it seems to have no way of dealing with out-of-bounds reads. Can't remember? No problem—look back at Table 6.2. Those blank [ ] entries are SLUB's blind spots. If you want to catch out-of-bounds reads, you'll need to bring in the heavy artillery from the next chapter: KASAN.