7.7 The Heterogeneous Battlefield: Oops and netconsole in Action on ARM Linux
In the previous section's x86 virtual machine environment, we used a virtual serial cable to "fish out" the kernel logs. But in a real embedded battlefield—like a Raspberry Pi—things are rarely that elegant.
You might not have a screen, or the device might have frozen while the network interface is still alive. In such cases, a physical serial port is certainly an option, but we're going to explore a more flexible path: netconsole.
This isn't just a tool; it's a shift in mindset—taking the kernel's distress signal and tossing it across the network to another machine via UDP packets.
🌐 netconsole: Broadcasting Kernel Logs Over the Network
Netconsole is a built-in kernel "reflex" mechanism. Its principle is brutally simple: as soon as the kernel printk prints anything—whether it's a routine log message or a crash stack trace—netconsole immediately wraps that text into a UDP packet and blasts it out through the network interface.
This means that as long as your Ethernet cable is plugged in and the switch is powered on, even if the system is so dead that the keyboard is unresponsive, the logs can still fly over to your laptop.
1. Preparation
First, confirm that your target kernel (the ARM board in this case) has CONFIG_NETCONSOLE enabled.
We typically compile it as a module (m), which makes it easy to dynamically configure parameters without rebooting every time we make a change.
2. Parsing the Configuration Parameters
When loading the netconsole module, we need to pass it a netconsole parameter that tells it: "Who am I, and where am I sending this?"
The parameter format looks like this—don't be intimidated, it makes sense once we break it down:
netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
- src-ip / src-port: The sender's (the ARM board that's about to crash) IP and port.
- dev: The network interface used by the sender (e.g., eth0 or wlan0).
- tgt-ip / tgt-port: The receiver's (your PC or VM) IP and port.
- tgt-macaddr: Crucial point. The receiver's MAC address. Because the kernel might have lost its routing (or the routing table is corrupted) at this point, the ARP protocol can't be relied upon. Therefore, we must hardcode the destination MAC address to send packets directly at Layer 2.
If you don't want to memorize these, the official documentation is right there: Documentation/networking/netconsole.txt.
3. Practical Configuration
Let's assume our battlefield looks like this:
- Sender (Raspberry Pi): IP 192.168.1.24, interface wlan0.
- Receiver (Host): IP 192.168.1.101, listening on the default port 6666.
On the ARM board (sender):
Type in the following command (note that this is a single line):
sudo modprobe netconsole netconsole=@192.168.1.24/wlan0,@192.168.1.101/
⚠️ Warning: We didn't explicitly specify the source and destination ports here, so the kernel will use the defaults (6665 locally, 6666 on the target). We also didn't specify the destination MAC address—if your LAN environment is simple and ARP can resolve the instant the module loads, this usually works fine. But in a strict production environment, you must fill in tgt-macaddr.
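For reference, a fully specified invocation might look like the sketch below. The ports 6665 and 6666 are netconsole's documented default local and remote ports; the MAC address here is purely hypothetical:

```shell
# All fields filled in: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-macaddr
TGT_MAC=d4:3d:7e:aa:bb:cc               # hypothetical receiver MAC
PARAM="netconsole=6665@192.168.1.24/wlan0,6666@192.168.1.101/${TGT_MAC}"
echo "$PARAM"                           # inspect before loading
# sudo modprobe netconsole "$PARAM"
```

Assembling the string in a variable first lets you eyeball it (or script-check it) before handing it to modprobe.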
On the host (receiver):
You don't need to load the netconsole module on the host. You just need a tool that can listen for UDP. The old-school netcat (called nc on some distributions) is perfect for this:
netcat -d -u -l 6666 | tee -a dmesg_arm.txt
Parameter explanation:
- -d: Don't read from standard input (run detached).
- -u: Use the UDP protocol.
- -l 6666: Listen on port 6666 (netconsole's default destination port).
- tee -a dmesg_arm.txt: Display the received content on the screen while appending it to a file for leisurely analysis later.
⚔️ First Blood on ARM: Triggering an Oops
Now everything is in place. Our ARM board has the netconsole module loaded, and the host is standing by with netcat ready to catch the output.
But before we crash the system, there's a classic ARM cross-compilation pitfall we need to avoid.
⚠️ Pitfall Warning: The Division Problem in ARM Cross-Compilation
When compiling modules for ARM using the cross-compiler from x86_64 (arm-linux-gnueabihf-gcc), you'll often run into this baffling error:
ERROR: modpost: "__aeabi_ldivmod" [<...>/ch7/oops_tryv2/oops_tryv2.ko] undefined!
This happens because the ARM EABI has no native 64-bit division instruction: GCC lowers a 64-bit / or % into a call to the libgcc helper routine __aeabi_ldivmod, which the kernel image does not provide. The brutal but effective fix is: don't do 64-bit division in module code (the kernel's div64_* helpers in <linux/math64.h> exist for exactly this reason).
In our test code, the SHOW_DELTA() macro in convenient.h might trigger this issue. Simply comment out that macro call and recompile—it usually passes. It's not elegant, but in the debugging phase, getting it to run is what matters.
🔥 The Battle Begins
Figure 7.24 captures this moment. The light-colored window at the top is the ARM sender, and the dark-colored window at the bottom is the receiver.
When we insmod that buggy module on the ARM board, you'll see a massive flood of text instantly pour into the receiver's window.
It's like catching the scene red-handed. Even if your ARM board is completely dead at this point (or has triggered a panic reboot), the crucial "dying words" are already safely resting in your dmesg_arm.txt file.
🧐 Analyzing the ARM Oops: Architectural Differences
Once we have the logs, don't rush to apply your x86 experience. Although the kernel tries to unify the format, underlying architectural differences still leak through.
Look at this line:
Internal error: Oops: 817 [#1] ARM
On x86, we're used to seeing error codes like 0002, but on ARM, this magic number 817 (in hex) means something completely different.
Architectural Differences: FSR and Encoding
To decode this number, you have to consult the architecture's "bible"—the Technical Reference Manual (TRM).
- For the Raspberry Pi Zero W (BCM2835, ARM1176JZF-S core): You need to look at its Fault Status Register (FSR) encoding rules.
- For the BeagleBone Black (TI Sitara AM335x, Cortex-A8 core): You need to look up the Memory Protection Fault Status Register (MPFSR).
There's no way around it—each architecture's hardware implementation is different, and the kernel just faithfully copies the register values. The good news, however, is that aside from this hardware error code, the rest of the information—the PC pointer and the call trace—follows the same logic.
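Here is a hedged sketch of what that decoding involves. It assumes the ARMv6/v7 short-descriptor DFSR layout (fault status in bits [3:0] plus bit 10, domain in bits [7:4], write-not-read in bit 11); verify these bit positions against your core's TRM before trusting them:

```shell
# Pick apart the Oops code 0x817, assuming the short-descriptor DFSR layout.
code=0x817
printf 'WnR (write, bit 11): %d\n' $(( (code >> 11) & 1 ))
printf 'domain (bits 7:4):   %d\n' $(( (code >> 4) & 0xf ))
printf 'fault status:        0x%x\n' $(( ((code >> 10) & 1) << 4 | (code & 0xf) ))
# On that layout this reads as: a write access, fault status 0x7
# (a page translation fault), which is what a NULL-pointer write produces.
```

If the numbers don't line up with your TRM, trust the TRM: the kernel faithfully copies whatever the hardware put in the register.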
🔍 Pinpointing the Source Code: The Three-Pronged Approach
Let's go back to the ARM Oops log we just captured. The core information is here:
Workqueue: events do_the_work [oops_tryv2]
PC is at do_the_work+0x68/0x94 [oops_tryv2]
LR is at irq_work_queue+0x6c/0x90
- PC (Program Counter): Equivalent to RIP on x86. It tells us the crash occurred inside the do_the_work function, at offset 0x68 (decimal 104).
- LR (Link Register): The ARM register that holds the return address. It tells us that do_the_work was called from irq_work_queue.
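A quick note on the +0x68/0x94 notation: the number after the slash is the function's total size, so the kernel is telling you where inside a 0x94-byte function the PC landed. The arithmetic, made concrete:

```shell
# "+0x68/0x94" = offset into the function / total size of the function.
printf 'crashed %d bytes into a %d-byte function\n' $((0x68)) $((0x94))
# -> crashed 104 bytes into a 148-byte function
```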
With the offset 0x68 in hand, we have three tools to pinpoint the exact line of code.
Method 1: addr2line (Most Direct)
On the ARM board (or in your cross-compilation environment), directly run addr2line against that .ko file:
rpi oops_tryv2 $ addr2line -e ./oops_tryv2.ko 0x68
</path/to/>Linux-Kernel-Debugging/ch7/oops_tryv2/oops_tryv2.c:62
It spits out the filename and line number directly: line 62.
Looking back at the source code:
61 pr_info("Generating Oops by attempting to write to an invalid kernel memory pointer\n");
62 oopsie->data = 'x'; // <--- the culprit
63 }
Case closed, right?
Method 2: GDB (Most Intuitive)
If you have GDB (either the cross-version or one running on the board), you can use its list command:
$ arm-linux-gnueabihf-gdb ./oops_tryv2.ko
(gdb) list *do_the_work+0x68
GDB will list the surrounding source code and point to line 62 with a => marker. GDB's TUI mode makes this even easier to read.
Method 3: objdump (Lowest Level)
If you have no symbols at all and are left with just a binary file, objdump is your last line of defense.
rpi oops_tryv2 $ objdump -dS ./oops_tryv2.ko | less
Then find the do_the_work function in the output and count down to around offset 0x68:
5c: ebfffffe bl 0 <printk>
oopsie->data = 'x';
60: e3a03000 mov r3, #0 ; load the pointer (i.e., oopsie) into r3 -- it's 0!
64: e3a02078 mov r2, #120 ; ASCII code for 'x'
68: e5c3201c strb r2, [r3, #28] ; write to address [0 + 28] -- boom
}
See the instruction at offset 68? The strb (Store Byte) instruction attempts to write data to the address pointed to by register r3. And a couple of instructions earlier (at offset 60), r3 was set to 0. This is a classic NULL pointer dereference, crystal clear at the assembly level.
🌍 Real-World ARM Oops: BeagleBone Black
To show some "diversity," Figure 7.26 displays a crash log from a TI BeagleBone Black board.
You can try to find the key points we just discussed:
- Oops bitmask: Internal error: Oops: 805 [#2]
- PC pointer location
- Call trace
For this one, the TRM you need to look at is TI's AM335x documentation to check the MPFSR register definition. Although the details differ, the debugging mindset is exactly the same: grab the PC, match the symbols, read the assembly, and locate the source line.
🏁 Chapter Summary
We've gone deep and far in this chapter.
We started with the simplest "intentionally writing to a NULL pointer" approach and manually triggered a kernel Oops. But that was just a warm-up. We then dove deep into the kernel's Virtual Address Space (VAS) to touch upon those seemingly present but actually fatal "sparse regions" and "NULL trap pages."
More importantly, we learned how to read the dying words the kernel leaves behind—those few lines of seemingly cryptic Oops logs. Whether it's RIP and RSP on x86_64, or PC and LR on ARM, once you master the three-pronged approach of addr2line, objdump, and GDB, those cold register values instantly translate into specific line numbers in your code.
Finally, we crossed architectural boundaries and used netconsole—a "wireless wiretap" technique—to remotely capture crash scenes on ARM devices. This isn't just an upgrade in technical skills; it's an upgrade in debugging mindset—don't be limited by the physical environment; use every means possible to extract the information.
Remember the interrupt story from the beginning of the chapter that drove the CPU into a panic? Now you should fully understand the mechanisms behind that scenario: how the kernel detects illegal access, how it generates diagnostic information, and how it decisively halts execution to protect system safety before we even have time to react.
Preview of the Next Chapter:
Oops and Panic are often accompanied by memory corruption. When multiple CPU cores contend for the same resources, or when interrupt handlers modify shared data at the wrong moment, the system exhibits even more bizarre symptoms—inexplicable deadlocks, outright data corruption, or, worst of all, corruption that stays silent until much later.
That is the topic of the next chapter: Locking and Concurrency. If Oops is an "explicit crash," then concurrency bugs are "implicit ghosts." To catch ghosts, we need a more advanced net.
Exercises
Exercise 1: Understanding
Question: Read the following description and determine whether it is true or false: 'The NULL trap page is a block of physical memory allocated by the operating system, specifically used to store NULL values (0x0), to prevent processes from misusing them.' If false, briefly state the correct definition.
Answer and Analysis
Answer: False. The NULL trap page is the first page at the bottom of the user virtual address space (addresses 0 through 4095, assuming the usual 4 KB page size). All of its permissions are denied (---), and it has no physical memory mapped to it. It is used to catch illegal accesses to NULL pointers.
Analysis: This is a basic concept question. The NULL trap page is not a physical memory storage unit, but rather a protection mechanism. By setting the permissions for virtual addresses 0-4095 to none in the page table, any attempt to read, write, or execute in this region will trigger an MMU page fault exception. The kernel will then send a SIGSEGV signal to the process.
Exercise 2: Understanding
Question: Suppose you are a kernel developer and you see an error code of '0002' (x86 architecture) in an Oops log. Based on this chapter, what specific information does this error code (bitmask) contain? List the two key pieces of information represented by this value.
Answer and Analysis
Answer: 1. The operation that caused the page fault was a "write operation" (Write, bit 1 is set). 2. The error occurred in "kernel mode" (bit 2 is clear, indicating supervisor mode). Note: bit 0 (P) is also clear, which means the fault was triggered by a not-present page, not by a protection violation.
Analysis: This tests the ability to interpret the Oops bitmask. The bit definitions for the x86 page fault error code are as follows:
- Bit 0 (P): 0 = page not present, 1 = protection violation
- Bit 1 (W/R): 0 = read, 1 = write
- Bit 2 (U/S): 0 = kernel mode, 1 = user mode
- Bit 3 (I/D): 0 = instruction fetch, 1 = data access Error code 0002 (binary 0010) means Bit 1 is 1 (write operation) and Bit 2 is 0 (kernel mode).
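The bit definitions above can be checked mechanically with shell arithmetic:

```shell
# Walk the bits of the x86 page-fault error code 0x0002.
err=0x0002
printf 'P   (bit 0) = %d -> page not present\n' $(( err & 1 ))
printf 'W/R (bit 1) = %d -> write access\n'     $(( (err >> 1) & 1 ))
printf 'U/S (bit 2) = %d -> kernel mode\n'      $(( (err >> 2) & 1 ))
```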
Exercise 3: Application
Question: You are debugging a kernel crash issue on an ARM device. The device has no monitor connected, and when the Oops occurs, the local terminal displays garbled text. You need to capture kernel logs in real-time on a remote machine for analysis. Which tool would you choose, and briefly describe two key network parameters that must be specified when configuring it?
Answer and Analysis
Answer: Choose to use netconsole. The two key network parameters that must be specified are:
- The remote host's IP address (receiver).
- The remote host's MAC address (depending on the exact configuration syntax you may also supply the local interface name and ports, but the essentials are the destination IP, port, and MAC address that the UDP packets are aimed at).
Analysis: This is an application question testing the ability to solve practical debugging scenarios. In embedded development or remote debugging, when the serial port is unavailable or the graphical interface has crashed, netconsole is an effective means of broadcasting kernel printk messages to a remote server via UDP. Configuring netconsole typically requires specifying the source interface, destination IP, and destination MAC address (because it sends at Layer 2).
Exercise 4: Application
Question: You have just triggered a kernel Oops, and the log shows the RIP (instruction pointer) pointing to offset 0x1a8. You have an uncompressed kernel image with debug symbols on hand. Please write down which command-line tool you would use (excluding helper scripts like faddr2line) to convert this address into a specific source code filename and line number.
Answer and Analysis
Answer: Use the addr2line tool.
Command example: addr2line -e vmlinux 0x1a8
Analysis: This tests the application of debugging tools. addr2line is the standard tool for converting addresses to line numbers. The key is specifying the -e parameter for the executable file (the kernel image vmlinux here) and the specific address offset. Note: If KASLR is enabled, the static address might need to have the random offset subtracted, but in a basic scenario, addr2line is the direct answer.
Exercise 5: Thinking
Question: In modern Linux kernel development, the CONFIG_VMAP_STACK configuration option was introduced to enhance security. Combining your knowledge of kernel stacks and Oops from this chapter, analyze and explain: compared to traditional contiguous physical memory stacks, how does VMAP_STACK help developers more easily locate and catch severe bugs like "kernel stack overflow"?
Answer and Analysis
Answer: Traditional kernel stacks are contiguous physical memory. If a stack overflow occurs, data spills over the boundary and overwrites adjacent kernel data structures, leading to hard-to-reproduce and hard-to-track memory corruption (silent hangs). With CONFIG_VMAP_STACK enabled, the kernel stack is allocated via the vmalloc mechanism, which means the stack's virtual memory pages are not necessarily contiguous, and the end of the stack can be an unmapped "sparse region" or a protected page. When a stack overflow occurs, the spilled data touches these unmapped regions, immediately triggering a page fault exception and an Oops. This transforms potentially hidden data corruption into an immediately catchable and recordable kernel crash.
Analysis: This is a deep-thinking question that requires integrating knowledge of kernel memory management, stack mechanisms, and exception handling. The key to solving it is understanding the difference between "silent corruption" and "explicit crashes." CONFIG_VMAP_STACK leverages the characteristics of virtual memory—namely, the ability to reserve unmapped memory gaps. By setting "traps" at the end of the stack, a stack overflow is no longer a silent data overwrite but actively triggers an exception, greatly improving system robustness and debuggability.
Key Takeaways
Handling kernel crashes (Kernel Oops) is not only a crucial skill for troubleshooting system freezes, but also a necessary path to deeply understanding Linux internals. This chapter first demonstrated how to trigger a crash in a controlled environment by building a "NULL pointer dereference" kernel module, and used tools like procmap to visually display the layout of the kernel's Virtual Address Space (VAS). This revealed why accessing a tiny address like 0x30 is bound to trigger an exception—this usually means the code is trying to access a NULL pointer structure member, where the offset is directly reflected in the error address.
When a crash occurs, the cryptic log output by dmesg is actually a precise "autopsy report" left by the kernel. By breaking down the Oops information line by line, we can extract the smoking gun (such as kernel NULL pointer dereference), the error type (#PF page fault), the instruction pointer (RIP), and the exact linear address that triggered the exception (the CR2 register). Mastering the meaning of these fields—especially understanding the Oops mask bits, the Tainted flag, and the reading rules for the Call Trace—allows engineers to instantly pinpoint the problematic function and error context from the "garbled text," whether it occurred in process context or in a kernel workqueue.
Knowing just the function name isn't enough; we must use tools to map the RIP register or offset back to a specific line of source code. This chapter introduced three core methods: using a vmlinux and modules compiled with debug symbols to generate a disassembly listing via objdump -dS and search for the address; using GDB to directly locate the symbol address; or using the lightweight addr2line tool for quick address-to-filename-and-line-number conversion. This requires developers to enable CONFIG_DEBUG_INFO and use the -g option during the compilation phase to ensure the binary files contain the "map" needed to locate the truth.
To improve debugging efficiency, the Linux kernel source tree comes with a built-in set of script toolchains that can automate tedious analysis work. For example, scripts/decode_stacktrace.sh can batch-convert addresses in a call trace to source code lines, scripts/decodecode can disassemble the machine code in an Oops and highlight the specific assembly instruction that caused the crash, and scripts/checkstack.pl is used for statically analyzing code stack usage to prevent the hidden and fatal hazard of kernel stack overflow.
For modern systems with Kernel Address Space Layout Randomization (KASLR) enabled, the randomness of runtime addresses poses a challenge to traditional static address matching. Although you can disable it via the nokaslr kernel parameter to aid debugging, in production environments, script tools like scripts/faddr2line can handle address queries in the "symbol+offset" format without rebooting, thereby bypassing KASLR's interference and effectively reconstructing the crash scene.