1.7 Two Kernels: The Frugal Producer and the Detective with a Magnifying Glass

We've stocked our toolbox, but before we start working, I want to show you a kernel developer's desk.

On it, there usually aren't just one screwdriver, but two.

In the previous section, we scoured the world for software packages, stuffing our toolbox. But there's a fundamental difference between kernel development and ordinary software development. With ordinary software, you write a version, fix some bugs, and ship it. If kernel software (including kernel modules and drivers) goes wrong, it can take down the entire system: rebooting is the least of your worries; data corruption is the real nightmare.

Remember the tragic case we saw in the "Software bugs" section? Priority Inversion almost killed the Mars Pathfinder, a lesson that cost tens of millions of dollars. If you don't want this playing out in your own product, then thorough testing/QA (Quality Assurance) before releasing code isn't just "nice to have"; it's a matter of survival.

This raises the question: If the kernel itself isn't equipped with probing instruments, how do you detect deep-seated problems?

The Dual-Kernel Strategy: Production Kernel vs. Debug Kernel

This brings us to the "dual-kernel" spectacle. In real-world engineering, we usually prepare—and even strongly recommend that you force yourself to prepare—two sets (sometimes three) of kernel configurations:

  1. Production Kernel: Meticulously tuned for efficiency, security, and high performance. This is the one you ship to your customers.
  2. Debug Kernel: Has all recommended or even aggressive debug options enabled. Performance? What's that? Our goal is to catch bugs, even if it makes the system run like a turtle—as long as it can trigger that damn bug.
  3. Hybrid Mode Kernel (Optional): Uses the production kernel configuration but flips on one or two specific debug switches to catch particular ghosts in a near-production environment.

Let's use an analogy to lock in this concept.

Analogy Mapping: Imagine you're repairing cars. The production kernel is like an F1 race car on the track—all parts are stripped of excess covers for pure speed, pedal to the metal, built to be fast. The debug kernel is like a test vehicle loaded with oscilloscopes, sensors, and slow-motion cameras—it runs slow, but every engine vibration is recorded.

But there's a subtle trap here (a SICP-style pause).

If it's just about speed, why not just use the production kernel for testing? Or rather, why can't we have a "fast and bug-catching" perfect kernel?

Because the laws of physics don't allow it.

Those checking mechanisms that help you catch out-of-bounds memory access, lock contention, and null pointer dereferences essentially insert a "hey, let me check if that last step was correct" at every single operation in the system. This check has a cost—sometimes in CPU cycles, sometimes in extra memory footprint.

Analogy Distance Revealed: Back to that F1 car. You can't have it running at 300 km/h while also stopping to let you carefully inspect tire wear. That's why we need two cars. The debug kernel sacrifices performance in exchange for visibility. It's not just slow; it might generate massive amounts of log output, or even panic outright because the checks are so strict—but that's exactly what we want. Let it die on our behalf, not the user's system.

Therefore, the wise strategy is: during development and unit testing, run the debug kernel, turn on every switch you can (CONFIG_DEBUG_*), and ruthlessly torture your code. When it's time to release, switch to the production kernel for functional verification and performance testing only.
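To make "turn on every switch you can" concrete, here is a representative (not exhaustive) debug-kernel `.config` fragment. Exact option names and availability vary by kernel version and architecture, so treat this as a sketch to verify against your own tree's `make menuconfig` (under the Kernel hacking menu):

```
# Debug-kernel .config fragment (illustrative; confirm against your tree)
CONFIG_DEBUG_KERNEL=y        # master switch for kernel debugging support
CONFIG_DEBUG_INFO=y          # DWARF debug symbols (for kgdb, crash, objdump)
CONFIG_KASAN=y               # Kernel Address Sanitizer: out-of-bounds, use-after-free
CONFIG_UBSAN=y               # Undefined Behaviour Sanitizer
CONFIG_PROVE_LOCKING=y       # lockdep: deadlock and lock-ordering validation
CONFIG_DEBUG_ATOMIC_SLEEP=y  # catch code that sleeps in atomic context
CONFIG_SLUB_DEBUG=y          # slab allocator checks (poisoning, red zones)
CONFIG_DEBUG_KMEMLEAK=y      # kernel memory leak detector
```

Options like KASAN illustrate the cost discussed above: every memory access gets instrumented, which is exactly why this configuration belongs on the test bench, not in the product.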

There's a third scenario: you encounter an extremely bizarre, intermittent bug in production. At this point, you might need to compile a hybrid mode kernel—mostly identical to the production config, but with debug switches enabled for a specific subsystem. This lets you keep the system running as normally as possible while still capturing critical diagnostic information.

Here's a key point worth emphasizing: in the vast majority of cases, the Linux mainline kernel itself is fine. Bugs are usually hidden in our code.

We typically use the LKM (Loadable Kernel Module) framework to write drivers or custom features. These .ko files will ultimately be installed under /lib/modules/$(uname -r)/ in the rootfs. Although they are "modules," they run in kernel space and enjoy Ring 0 privileges. The debug kernel helps you watch their every move—as soon as they cross the line, it raises the alarm immediately.
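As a quick sanity check, the install location mentioned above can be resolved from a shell. The module name `mydrv` below is purely hypothetical, used only to show where an out-of-tree `.ko` typically lands:

```shell
# Sketch: resolve the module directory for the *running* kernel.
kver=$(uname -r)
moddir="/lib/modules/${kver}"

echo "kernel release : ${kver}"
echo "module tree    : ${moddir}"
# Out-of-tree modules installed via 'make modules_install' typically
# land under the extra/ subdirectory ("mydrv" is hypothetical):
echo "example .ko    : ${moddir}/extra/mydrv.ko"
```

Note that the path is keyed to `uname -r`: build a module against one kernel version and it installs (and loads) only for that version's module tree.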

Analogy Recovery Verification: Back to the car repair scenario. Your custom driver is that turbocharger you modified yourself. If you're running the debug kernel, the moment the boost pressure gets too high, the test car will immediately flash a red light and alarm, and you can safely pull over to fix it. If you're running the production kernel—that F1 car with all the sensors stripped out—the same fault could cause the engine to literally blow up mid-race. Blowing up in the garage is always better than blowing up on the track.

Which Version: The Security of LTS

Having decided on the "dual-kernel" approach, the next question is: Which version should we base this on?

The Linux kernel's development pace is astonishing, with a major release roughly every 6 to 8 weeks. But not every one of these releases is suitable as an engineering foundation.

For serious projects or products, you need a Long-Term Support (LTS) version.

LTS kernels are maintained by the community for years, with security patches and critical bug fixes continuously backported. Choosing LTS means you're choosing a solid foundation that won't be abandoned for at least a few years.

At the time of writing this book (and for content stability), we've chosen the Linux 5.10 LTS series. This is a very classic release that will be supported until December 2026.

⚠️ A Real Security Warning

We need to pause here and share a real-world case that happened while I was writing this book—it perfectly illustrates why we need to care about kernel versions and security updates.

While writing Chapter 10 in March 2022, the Dirty Pipe vulnerability (CVE-2022-0847) broke. This was a severe flaw that allowed a local attacker to gain root privileges, affecting almost all mainstream distributions, including 5.10.60, which was our baseline at the time.

This is incredibly ironic, yet utterly typical. A tiny oversight can lead to the compromise of systems worldwide.

So, while the technical details in this book are written based on 5.10.60 (to ensure consistency in code logic), I strongly recommend that you choose 5.10.102 or later (or fixed versions like 5.15.25 / 5.16.11+) when actually compiling and using it.
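If you want to check which side of the fix a given 5.10.y machine is on, a small version comparison does the job. This sketch relies on GNU `sort -V`; note that comparing across stable series (say, 5.15.y against 5.10.102) says nothing about the fix, since patches are backported per series:

```shell
# version_ge A B: succeeds when version A >= version B (GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Strip any distro suffix (e.g. "-generic") before comparing.
running=$(uname -r | cut -d- -f1)
if version_ge "${running}" "5.10.102"; then
    echo "${running}: at or beyond the 5.10.102 baseline"
else
    echo "${running}: older than 5.10.102, consider upgrading"
fi
```

The same helper is handy whenever a book or changelog pins a minimum stable release.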

This isn't just a number—it's your security defense line.

Alright, with all this groundwork laid—why we need two kernels, why we choose LTS, why we avoid Dirty Pipe—now we can finally get our hands dirty.

Next, we'll start building these two kernels: one a lightweight production kernel, and one a fully armored debug kernel.

Let's start with the production kernel and see how we get this elephant into the fridge.