
7.6 Capturing Crash Logs in Interrupt Context Using a Console Device

In the previous section, we added a few more sharp tools to our toolbox—ranging from stack space checks to source code locating scripts. You might think that with these, even if the kernel crashes, we can pin it down effortlessly.

Don't be too confident.

Our previous test environment had a bit of a "greenhouse" feel: whether triggered directly or via a workqueue, our Oops always occurred in process context. What does that mean? It means the kernel still had a chance to "gracefully" report the error, print the stack trace to the screen, and might even let you type a few more commands.

But in the real world, many tricky crashes happen in a much more dangerous place—interrupt context.

There's no process to blame, no way out, and—if you're unlucky—even the screen output gets swallowed.

In this section, we're going to push the scenario to the extreme: we'll manually trigger a crash inside a hard interrupt handler, and then see how we can retrieve that life-saving diagnostic log when the system freezes, the keyboard becomes unresponsive, and even SysRq commands can't be typed.


A Runaway Interrupt Context

We've prepared a new test module, oops_inirqv3.ko. Its logic is simple and brutal: it uses the kernel's irq_work mechanism to register a function for execution in interrupt context.

Let's briefly review this (we used a similar technique in the memory leak test in the previous chapter): irq_work is a special asynchronous execution facility that guarantees the callback will run in interrupt context (a hard interrupt; the kernel raises a self-interrupt to invoke it).

Let's look at the code:

// ch7/oops_inirqv3/oops_inirqv3.c
void irq_work(struct irq_work *irqwk)
{
    int want_oops = 1;

    PRINT_CTX(); /* print the current context info, so we can be sure we're in IRQ context */
    if (!!want_oops) /* right then, let's stage an accident inside the interrupt */
        /* this single line can prove fatal! */
        *(int *)0x100 = 'x';
}

Notice that comment. We deliberately write data to address 0x100. This is obviously an illegal access (the address lies within the NULL trap page, the first, unmapped page of virtual memory) and will trigger a page fault.
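For completeness, here's a minimal sketch of how such a callback gets wired up. The init/exit function names here are illustrative, not necessarily the module's actual ones, but the irq_work API calls are the kernel's standard ones:

#include <linux/module.h>
#include <linux/irq_work.h>

static struct irq_work irqwork;

static int __init oops_inirq_init(void)
{
    init_irq_work(&irqwork, irq_work); /* bind the callback shown above */
    irq_work_queue(&irqwork);          /* raise it; the kernel invokes it in hardirq context */
    return 0;
}

static void __exit oops_inirq_exit(void)
{
    irq_work_sync(&irqwork); /* wait for any pending invocation (moot after a crash!) */
}

module_init(oops_inirq_init);
module_exit(oops_inirq_exit);
MODULE_LICENSE("GPL");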

Now, load this module in your virtual machine:

cd <...>/ch7/oops_inirqv3
make
sudo insmod ./oops_inirqv3.ko

In my test environment (VirtualBox 6.1 + Ubuntu 20.04 Guest + 5.10.60 kernel), after pressing Enter, something dispiriting happened:

Nothing happened.

It wasn't a "printed a bunch of errors and then froze" kind of nothing. The screen simply went black, or the cursor completely stopped moving. No Oops information, no SysRq prompt, not even the Caps Lock light blinked.

The system just died right in front of you.

What now? SysRq? Can't type it. Reboot? Logs are all gone.

At this point, as an engineer, your intuition will tell you: The kernel definitely wanted to speak, but its words couldn't reach the display. We need another channel—a channel that doesn't rely on the GPU driver—to smuggle out the kernel's dying words.


Taking a Detour: The Virtual Serial Console

Welcome to the real world.

The approach to solving this problem is a classic one. Although our operations here target a virtualized environment like VirtualBox, the underlying concept applies to almost all embedded development boards and servers:

Since the graphical interface (display) route is broken, we'll use the serial port route to "smuggle" the logs out.

What we need to do is establish a "virtual cable" between the guest and the host—a virtual serial port. The kernel will write all printk information to this serial port simultaneously, and the other end of the serial port will be connected to a plain text file on your host machine.

Step 1: Configure the Virtual Serial Port on the Host

If the virtual machine is still running, shut it down first.

Open the Oracle VirtualBox manager interface, select your virtual machine, and click the "Settings" gear icon.

  1. Find "Serial Ports" in the left menu.
  2. Check "Enable Port 1".
  3. Port Mode: Select "Raw File". This means data will be written directly to a file, rather than to another network stream or physical device.
  4. Path/Address: Enter a file path on your host machine. For example, on my Linux host, I entered ~/vb_serial.log.

The configuration should look something like this (see Figure 7.21 in the book). Click OK to save.
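If you prefer the command line over the GUI, the same settings can be applied with VBoxManage while the VM is powered off ("MyVM" and the file path are placeholders for your own VM name and host path):

$ VBoxManage modifyvm "MyVM" --uart1 0x3F8 4 --uartmode1 file /home/<you>/vb_serial.log

Here 0x3F8 and IRQ 4 are the traditional resources of the first PC serial port, which is exactly what the guest kernel will see as ttyS0.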

Step 2: Tell the Kernel to Use This Console

Now start the virtual machine. But before the kernel actually boots, we need to hijack the boot parameters.

  1. Mash the Esc key during boot to interrupt the GRUB boot process and enter the menu.
  2. Find "Advanced options for Ubuntu" and select the custom kernel we want to test (5.10.60-prod01) inside it.
  3. Press the e key to edit the boot entry.
  4. Find the line starting with linux.

This is where we make our move. Move the cursor to the end of this line and append this string of parameters:

console=ttyS0 console=tty0 ignore_loglevel

Here, ttyS0 is the first serial port (the classic COM1). console=ttyS0 tells the kernel to "send console output to the serial port as well," while console=tty0 retains the screen output. ignore_loglevel is the old saying: "spit everything out, don't filter by log level, I want it all."

After making the changes, press Ctrl+x to boot the system.
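Once the system is up, it's worth a quick sanity check that the parameters actually took effect; /proc/cmdline shows the command line the kernel really booted with:

$ cat /proc/cmdline

You should see console=ttyS0 console=tty0 ignore_loglevel at the end of the line.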


Verifying the Channel and Capturing the Crash

After entering the system, don't rush to load the module. We need to confirm that this "smuggling channel" is open first.

Verifying Serial Output

Inside the virtual machine, execute this command as root:

# echo "testing serial console 1 2 3" > /dev/kmsg

At this moment, something magical happens.

Switch to your host machine and open the log file you just specified (e.g., ~/vb_serial.log). You should see the line you just typed already sitting in there:

$ tail -n1 ~/vb_serial.log
[ 646.403129] testing serial console 1 2 3

The channel is open. Now, even if there's an abyss ahead, we have a safety rope.
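For a belt-and-braces check, /proc/consoles lists every console device the kernel is currently driving; with our boot parameters, it should show two entries, tty0 (the screen) and ttyS0 (our serial rope):

$ cat /proc/consoles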

Triggering the Crash Again

Go back to the virtual machine and load that ruthless module again:

cd <...>/ch7/oops_inirqv3
sudo insmod ./oops_inirqv3.ko

As expected, the virtual machine instantly freezes, and the interface locks up.

Ignore it. Go back to the host machine and open that log file.

Victory Moment

Although the virtual machine's screen is dead and pitch-black, inside the text file on the host, the kernel has clearly stated everything it wanted to say (see Figure 7.23 in the book).

You will see the complete Oops stack trace, clear enough to bring a tear to your eye.


In-Depth Diagnostics: The Stack Fingerprint of Interrupt Context

Getting the log is only the first step; now comes the detective work.

Look closely at this Oops output. It has a few very critical differences compared to the crashes we triggered in process context before.

1. Where the RIP Pointer Lands

RIP: 0010:irq_work+0x18/0x20 [oops_inirqv3]

The RIP points directly at our irq_work() function. This confirms that the crash did indeed occur inside our designated interrupt callback.
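Incidentally, this symbol+offset notation feeds straight into the source-locating tools from the previous section. For example, the kernel's scripts/faddr2line utility can resolve it back to a file and line number (a sketch: it assumes the module was built with debug info, and <...> stands in for your own paths, as elsewhere):

$ cd <kernel-src-tree>
$ ./scripts/faddr2line <...>/ch7/oops_inirqv3/oops_inirqv3.ko irq_work+0x18/0x20

It should print the exact source line containing our poisonous write.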

2. The Dual Stack

This is the most fascinating part. You'll notice that this Call Trace is split into two sections, with a clear marker in between:

<IRQ>
irq_work+0x18/0x20 [oops_inirqv3]
...
</IRQ>
...

On modern architectures like x86_64, interrupt handling uses a dedicated interrupt stack.

  • The section between <IRQ> and </IRQ>: the frames executed on the dedicated interrupt stack, with our irq_work() callback (the crash site) at the top and the interrupt entry code below it.
  • The section below </IRQ>: the original stack of the process that was interrupted.

Reading this stack bottom-up, you'll find that the kernel was originally executing some process's task (perhaps bash, or maybe the idle thread), when it was suddenly interrupted, switched to the interrupt stack, executed our code, and then—bam, it crashed.

3. Who is the Real "Culprit"?

Pay attention to this line in the Oops output:

CPU: 1 PID: 1699 Comm: insmod Tainted: G OE 5.10.60-prod01 #6

Here, the PID is 1699, and the process name is insmod.

If your instinct is "Ah, the insmod process broke the kernel," then in this specific case, you are wrong.

Remember what we said in the previous chapter? In interrupt context, the current macro points to the unlucky victim: the process that happened to be running at that exact moment. It's just an innocent bystander that happened to be on the CPU when our hard interrupt struck. Although insmod did set events in motion (loading the module, which in turn queued the irq_work), at the exact moment the crashing instruction executed, the CPU was working for the interrupt handler, not for insmod.

This also explains why sometimes the process you see in the Oops log is swapper/0 (PID 0, the idle thread). The CPU was just idling when an interrupt suddenly yanked it up to do some work, but the work went poorly and crashed the CPU. swapper/0 is purely taking the blame.
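This is also why kernel code that needs to know what context it is running in never trusts current alone; it asks explicitly. Here's a minimal sketch using the standard in_irq()/in_softirq()/in_task() macros (the book's PRINT_CTX() macro performs a similar check internally, though its exact implementation differs):

#include <linux/kernel.h>
#include <linux/preempt.h> /* in_irq(), in_softirq(), in_task() */
#include <linux/sched.h>   /* current */

static void show_ctx(void)
{
    if (in_irq())          /* hardirq: current is merely the interrupted victim */
        pr_info("hardirq ctx; current=%s is an innocent bystander\n", current->comm);
    else if (in_softirq()) /* softirq: still not current's own work */
        pr_info("softirq ctx; current=%s is an innocent bystander\n", current->comm);
    else if (in_task())    /* process context: current genuinely owns the CPU */
        pr_info("process ctx; current=%s really is running this code\n", current->comm);
}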

4. Why Did the System Freeze?

At the very end of the log, there's usually a prompt like this:

Kernel panic - not syncing: Fatal exception in interrupt

This explains why your keyboard became completely unresponsive. The kernel realized—"Oops, I crashed while handling an interrupt. This is an unrecoverable error. To ensure data safety, I'd better end it all." So it triggered a panic, and the system stopped all scheduling.
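This escalation is baked into the Oops machinery itself. The x86 code that finishes handling an Oops makes the decision explicitly; here is a lightly abridged sketch of the tail of oops_end() in arch/x86/kernel/dumpstack.c (5.10-era kernels):

    if (in_interrupt())
        panic("Fatal exception in interrupt"); /* our case: nothing to kill but the whole system */
    if (panic_on_oops)
        panic("Fatal exception");
    do_exit(signr); /* process context: kill just the offending task and carry on */

In process context the kernel can sacrifice a single task and keep running; in interrupt context there is no such scapegoat, so it pulls the plug.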


Wrapping Up and a Preview

This method of "exporting" logs using a serial console is a required course for embedded developers and kernel engineers.

On physical machines, this concept evolves into a more flexible network tool: netconsole. It doesn't require a physical serial cable; instead, it sends kernel logs as UDP packets to another machine on the local network.
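As a taste of what's coming, configuring it is essentially a one-liner; the module parameter encodes src-port@src-ip/device,dst-port@dst-ip/dst-mac (every address below is a placeholder for your own setup):

$ sudo modprobe netconsole netconsole=6665@192.168.1.50/eth0,6666@192.168.1.100/00:11:22:33:44:55

On the receiving machine, something as simple as nc -u -l 6666 (or nc -l -u -p 6666, depending on your netcat flavor) is enough to catch the logs.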

In the next section, we'll migrate the battlefield from an x86 virtual machine to a real ARM device (a Raspberry Pi), to see what Oops looks like under a heterogeneous architecture and how to handle it using "wireless debugging" techniques like netconsole.