Chapter 7: Inside Memory Management

Kernel internals, and memory management in particular, form a vast and complex field. To be honest, I don't plan to spill all the gory low-level details in this book; that would probably take two more books to cover.

But as someone who wants to tinker in this field, you must master enough background knowledge. This isn't just about passing interviews or writing a few drivers; it's about knowing where to look when you actually face those bizarre kernel panics or memory leaks.

In this chapter, we'll tear apart Linux's memory management mechanisms. We'll dive deep into the Virtual Memory (VM) split mechanism to thoroughly understand how user space and kernel space are actually divided; we'll grab a magnifying glass to examine a process's Virtual Address Space (VAS), seeing exactly how the text segment, data segment, heap, and stack are laid out; and of course, we'll pry open the kernel's VAS to see what treasures it hides.

In addition, we'll touch on the cornerstone of physical memory management. This might sound dry, but trust me, once you understand memory mapping—both virtual and physical—you'll find that seemingly random panics or OOM (Out Of Memory) events actually leave clear clues.

The knowledge in this chapter lays the foundation for the next two. There, we'll actually write code to allocate and free dynamic memory in the kernel. If you don't build a solid foundation now, when the time comes to face the differences between kmalloc and vmalloc, you might end up staring at the screen blankly, just like I did back then.

7.1 Understanding the VM Split

To understand Linux's memory management, we first have to accept a premise: virtual memory. In modern operating systems (Linux, Unix, Windows), almost all addresses used in your programs are virtual. It's like giving every process a pair of "VR goggles"—put them on, and every process thinks it has the entire RAM stick all to itself.

But here's a key question: exactly how big is this "illusion"?

That depends on your processor architecture.

  • 32-bit systems: The address space spans $2^{32}$ bytes = 4 GB (so the highest address is 0xffffffff).
  • 64-bit systems: The address space spans $2^{64}$ bytes = 16 EB (exabytes). This number is absurdly large: 1 EB = 1,024 PB (petabytes), and 1 PB = 1,024 TB. In other words, it's a hole you'll probably never fill in your lifetime.

To keep things clear, let's first focus on 32-bit systems. Under this setup, a process's Virtual Address Space (VAS) ranges from $0$ to $4 \text{ GB}$. Within these 4 GB, there's both actual "substance"—text segments, data segments, heaps, and stacks—and vast expanses of unused "empty land," which we call sparse regions.

Before diving into the details, let's run a quick experiment and see what really happens, down at the lowest levels of Linux, when we run that most classic of C programs: Hello, world.

The Foundation of Hello, World

Alright, I assume everyone can write a K&R-style Hello, world with their eyes closed:

printf("Hello, world.\n");

But you might not have thought deeply about what happens behind this line of code. The printf function isn't something you wrote yourself; it lives in the C standard library (usually glibc).

This raises a question: as we mentioned in Chapter 6, a process's VAS is a completely isolated "sandbox"—you can't see anything outside it. Since the code for printf is in glibc, it must be mapped into the current process's VAS, otherwise we couldn't call it at all.

And that's exactly what happens. When your program starts, a hidden little character—the dynamic linker (usually ld.so or ld-linux.so)—takes control first. It looks for the shared library file where printf resides (libc.so), and then uses the mmap system call to "paste" the library's text and data segments into your process's VAS.

We can verify this with the ldd command:

$ gcc helloworld.c -o helloworld
$ ldd ./helloworld
linux-vdso.so.1 (0x00007fffcfce3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007feb7b85b000)
/lib64/ld-linux-x86-64.so.2 (0x00007feb7be4e000)

Notice the addresses in parentheses (like 0x00007feb7b85b000): these are the User Virtual Addresses (UVAs) at which each library is mapped into your process's VAS. Moreover, these addresses will likely differ on every run, thanks to ASLR (Address Space Layout Randomization), which we'll cover later.
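By the way, you don't have to take ldd's word for it; you can ask the dynamic linker from inside the process itself. Here's a minimal sketch of my own (not from the book's official code) using the glibc dladdr() API to report which file printf was resolved from and where it landed in our VAS; on older glibc you may need to link with -ldl:

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
	Dl_info info;

	/* Ask the runtime linker: which mapped object does printf's address fall in? */
	if (dladdr((void *)printf, &info) != 0) {
		printf("printf comes from   : %s\n", info.dli_fname);
		printf("library base (UVA)  : %p\n", info.dli_fbase);
		printf("printf itself (UVA) : %p\n", info.dli_saddr);
	}
	return 0;
}

Run it a few times: with ASLR on, the reported UVAs should shift from run to run.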

Now, let's take it a step further.

Crossing the Boundary: From printf to write

We all know that printf is essentially a wrapper around the write system call. write writes the string to standard output (usually the terminal).
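You can watch this wrapper relationship directly with strace. A session might look roughly like this (output trimmed; strace interleaves the program's own output with the trace, and exact formatting varies by version):

$ strace -e trace=write ./helloworld
write(1, "Hello, world.\n", 14)         = 14
Hello, world.
+++ exited with 0 +++

One call to printf in user space, one write system call underneath.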

But a "boundary crossing" happens in between.

write is a system call, which means the CPU has to switch from user mode to kernel mode to execute the write code inside the kernel. But kernel code lives in the kernel's VAS, and we just said that a process's VAS is a "sandbox."

This brings us to one of the most core concepts of this chapter: the VM Split (Virtual Address Space Split).

If the kernel VAS were truly outside the "sandbox," then every system call would require not just a privilege-level switch but a full address space switch. That is unacceptable for performance (every such switch would invalidate translation caches like the TLB).

So, engineers came up with a brilliant solution: stuff both the kernel space and user space into the same 4 GB address space.

This is the origin of the VM Split.

The 3:1 Split: The Classic 32-bit Scheme

On most 32-bit ARM (AArch32) and x86 systems, the default is the 3:1 split: 3 GB for user space, 1 GB for kernel space.

  • User space: 0 to 3 GB (0x00000000 through 0xbfffffff)
  • Kernel space: 3 GB to 4 GB (0xc0000000 through 0xffffffff)

(Note: The starting address of kernel space is defined by a macro called PAGE_OFFSET, which equals exactly 0xc0000000 under the 3:1 split.)

This splitting method is illustrated intuitively in Figure 7.1.

[Figure 7.1: 3:1 GB VM Split diagram on an AArch32 system]

What does this mean? When your process calls write, the CPU is still working within the same process's VAS; the pointer simply jumps from the lower 3 GB to the upper 1 GB.

Here is a crucial point to understand: Although each process has its own unique 3 GB user space, all processes share the same 1 GB kernel space.

This split ratio is configurable. When compiling the kernel (for example, configuring a Raspberry Pi kernel), you can choose 2:2 or even 1:3. You can verify this by checking the kernel configuration:

$ zcat /proc/config.gz | grep -C3 VMSPLIT
#
# Kernel Features
#
CONFIG_VMSPLIT_3G=y
# CONFIG_VMSPLIT_3G_OPT is not set
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_1G is not set
CONFIG_PAGE_OFFSET=0xC0000000

See CONFIG_PAGE_OFFSET=0xC0000000? This confirms that kernel space starts at the 3 GB mark.
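Want to feel the split with your own hands? On such a 32-bit system, try to read kernel space from user mode. A tiny sketch of my own (it assumes the 3:1 split, and it will die with SIGSEGV, which is exactly the point):

#include <stdio.h>

int main(void)
{
	/* 0xc0000000 is PAGE_OFFSET under the 3:1 split: the very first
	 * byte of kernel space. User mode has no right to touch it. */
	unsigned long *kaddr = (unsigned long *)0xc0000000UL;

	printf("about to read %p from user mode...\n", (void *)kaddr);
	printf("value: %lu\n", *(volatile unsigned long *)kaddr); /* SIGSEGV here */
	return 0;
}

The MMU, driven by the permission bits in the page tables, enforces the user/kernel boundary on every single access.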

Once you understand the 32-bit case, the logic in the 64-bit world is exactly the same—just with absurdly larger numbers.

7.2 VM Split on 64-bit Systems: The Massive "Hole"

Since we're all using 64-bit machines now, why not just use all 64 bits for addressing?

Because there's no need. $2^{64}$ bytes equals 16 EB; for today's computers, this number is like giving you a hard drive bigger than the Earth—you simply can't use it all.

Currently, mainstream Linux x86_64 configurations (4-level paging with a 4 KB page size) use only the lower 48 bits for addressing.

So how are these 48 bits divided? There's a very interesting design choice here.

  • User space: occupies the lower half of the 48-bit range, from 0x0000000000000000 through 0x00007fffffffffff. This is 128 TB.
  • Kernel space: occupies the upper half, from 0xffff800000000000 through 0xffffffffffffffff. This is also 128 TB.

This layout is called "canonical addressing." Simply put, bits 48 through 63 of a valid 64-bit address must all be copies of bit 47: all 0s for a user-space address, all 1s for a kernel-space address.

What does this mean? There is a massive hole in the middle.

From 0x0000800000000000 through 0xffff7fffffffffff lies the non-canonical address region. How big is it? Roughly 99.998% of the entire 64-bit address space. This region is completely inaccessible: the CPU faults on any attempt to touch it.

So, although the theoretical address space of a 64-bit system is 16 EB, only the bottom 128 TB (user) and the top 128 TB (kernel) are actually usable.

This is why we usually don't need to worry about "high memory" issues on 64-bit systems: 128 TB of kernel space is more than enough to directly map all the physical RAM of any machine you'll meet today, with plenty to spare.

[Figure 7.5: 16 EB VAS layout diagram on x86_64]

How to Tell Kernel and User Addresses Apart at a Glance

Now that we know the rules, we can instantly identify an address's properties while debugging:

  • KVA (Kernel Virtual Address): Always starts with 0xffff.
  • UVA (User Virtual Address): Always starts with 0x0000.

This isn't just for show—it's incredibly useful when analyzing crash stacks. If you see an address starting with 0xffff... appearing in a user-mode program's stack, something has gone terribly wrong.
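Since the rule is mechanical, we can encode it. Below is a small user-space sketch (constants and names are my own) that classifies an x86_64 virtual address under 48-bit, 4-level paging:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define USER_VA_MAX   UINT64_C(0x00007fffffffffff) /* top of the lower half  */
#define KERNEL_VA_MIN UINT64_C(0xffff800000000000) /* base of the upper half */

static const char *classify(uint64_t va)
{
	if (va <= USER_VA_MAX)
		return "UVA (canonical lower half)";
	if (va >= KERNEL_VA_MIN)
		return "KVA (canonical upper half)";
	return "non-canonical hole: any access faults";
}

int main(void)
{
	const uint64_t samples[] = {
		UINT64_C(0x0000558822d66000), /* looks like a typical UVA  */
		UINT64_C(0xffff888012345678), /* looks like a typical KVA  */
		UINT64_C(0x0000800000000000), /* first address in the hole */
	};

	for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
		printf("%#018" PRIx64 " -> %s\n", samples[i], classify(samples[i]));
	return 0;
}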

What Exactly Is a Virtual Address?

Before moving on, we must correct a very common intuitive mistake.

You write this line of code:

int i = 5;
printf("address of i is %p\n", (void *)&i);

The address you print out is not an absolute offset measured from byte 0 of your RAM. It is really a collection of bit fields: indices that the hardware decodes.

When the CPU's MMU (Memory Management Unit) processes this address, it slices the 32-bit or 64-bit value into pieces. On x86_64 (with 48-bit addressing), it gets chopped into 5 fields: PGD, PUD, PMD, PTE, and Offset. It's like a progressive indexing system that ultimately points to a specific byte in physical memory.

[Figure 7.2: Breakdown of a 64-bit virtual address on x86_64]

Each level of indexing points to a table (a page table), and through these tables, the CPU ultimately calculates the physical address. This process is called "page table walking."
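To make the slicing concrete, here's a little sketch of my own (the shift and width values are the standard x86_64 ones for 4 KB pages) that decomposes a 48-bit virtual address into its five fields:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
	uint64_t va = UINT64_C(0x00007f1234567abc); /* an arbitrary example UVA */

	/* With 4 KB pages, each paging level indexes 512 (2^9) entries */
	unsigned pgd = (va >> 39) & 0x1ff;     /* bits 47..39: PGD index  */
	unsigned pud = (va >> 30) & 0x1ff;     /* bits 38..30: PUD index  */
	unsigned pmd = (va >> 21) & 0x1ff;     /* bits 29..21: PMD index  */
	unsigned pte = (va >> 12) & 0x1ff;     /* bits 20..12: PTE index  */
	unsigned off = (unsigned)(va & 0xfff); /* bits 11..0 : byte offset */

	printf("VA %#" PRIx64 " -> PGD=%u PUD=%u PMD=%u PTE=%u offset=%#x\n",
	       va, pgd, pud, pmd, pte, off);
	return 0;
}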

⚠️ Note: Although this is how we understand it logically, in actual hardware execution, for the sake of speed, the CPU caches previously looked-up results in the TLB (Translation Lookaside Buffer). Only on a TLB miss does it go through the slow 4-level page table process.

7.3 The Complete Process VAS View

Let's zoom out and see what a complete process address space looks like.

Whether 32-bit or 64-bit, the structure is the same: each process has its own unique user space (the lower portion), but all processes share the same kernel space (the upper portion).

[Figure 7.7: A process has a unique user VAS but shares the kernel VAS]

Anatomy of User Space: Segments and VMAs

We saw /proc/PID/maps in Chapter 6. This file acts like a map, listing all the "road segments" in user space.

Each line represents a contiguous range of virtual addresses, described in the kernel by a data structure called struct vm_area_struct (VMA).

Let's pick a random line to dissect:

558822d66000-558822d6a000 r-xp 00002000 08:01 7340181 /usr/bin/cat

  • Address range: 558822d66000-558822d6a000. This is the start and end UVA of this mapping.
  • Permissions: r-xp. r = read, x = execute, - = not writable, p = private mapping (a typical characteristic of the text segment).
  • Offset: 00002000. The starting offset of this content within the file /usr/bin/cat.
  • Device number: 08:01. The device number where the file resides.
  • Inode: 7340181. The inode number of the file.
  • Path: /usr/bin/cat. The source file of the mapping.

Here is a very important detail: All of these addresses are entirely virtual. They exist only in the current process's page tables. Even if another process is also running /usr/bin/cat, the addresses at which its segments and shared libraries land will most likely differ from yours.
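You can confirm all this yourself. The sketch below (my own demo program) prints one address from each classic segment; run it, then compare the values against the live map in /proc/PID/maps:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int g_initialized = 42; /* lives in the data segment */

int main(void)
{
	int on_stack = 0;                        /* lives on the stack */
	int *on_heap = malloc(sizeof(*on_heap)); /* lives on the heap  */

	printf("text  (main)         : %p\n", (void *)main);
	printf("data  (g_initialized): %p\n", (void *)&g_initialized);
	printf("heap  (malloc'd)     : %p\n", (void *)on_heap);
	printf("stack (on_stack)     : %p\n", (void *)&on_stack);

	printf("\nIn another terminal: cat /proc/%d/maps  (press Enter here to exit)\n",
	       getpid());
	getchar(); /* keep the process alive so its maps can be inspected */

	free(on_heap);
	return 0;
}

Each printed UVA should fall inside the matching r-xp, rw-p, [heap], or [stack] line of the maps output.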

Those Weird Mappings

When you look at the maps file, besides the obvious [heap] and [stack], you'll also see a few strange names:

  • vdso / vvar: the vDSO (Virtual Dynamic Shared Object) and its data page. This is an ultimate optimization made by Linux. For super-frequently used system calls like gettimeofday, entering kernel mode every time is too slow, so the kernel maps the code implementing this functionality directly into user space, allowing you to call it in user mode without a mode switch (see the demo right after this list).
  • vsyscall: An older predecessor of this, now kept mainly for backward compatibility.
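Here's the promised vDSO demo, a minimal sketch of my own. Trace it with strace -e trace=gettimeofday ./a.out: on a system with a working vDSO you'll see no trace entries for the loop, because the calls never enter the kernel:

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval tv;

	/* With a working vDSO, every one of these calls completes entirely
	 * in user mode: no mode switch, no syscall, nothing for strace. */
	for (int i = 0; i < 1000; i++)
		gettimeofday(&tv, NULL);

	printf("last reading: %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	return 0;
}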

7.4 The Kernel VAS Map

Now, let's enter that shared territory—the kernel VAS.

Although the details differ across architectures, they all share some common regions. Let's stick with the classic 32-bit 3:1 split as an example. The kernel space (starting from 0xc0000000) mainly contains the following parts:

[Figure 7.12: User and kernel VAS layout (focusing on the lowmem region)]

  1. Lowmem Region (Low Memory / Linear Mapping Region): This is the most important piece. The kernel directly maps physical RAM into this region.

    • Physical address 0 $\rightarrow$ virtual address PAGE_OFFSET.
    • There is a fixed offset (PAGE_OFFSET) between the virtual addresses and physical addresses here.
    • Addresses in this region are called kernel logical addresses.
    • ⚠️ Pitfall: This linear mapping is very convenient; you can compute the physical address directly through simple subtraction. However, this works only in this region! If you're writing a driver, you're handed an address from the vmalloc region, and you dare apply that subtraction to it, you've earned yourself a kernel panic. (See the sketch right after this list.)
  2. vmalloc Region: Used to allocate memory regions that are virtually contiguous but physically discontiguous. When you need a large chunk of memory but don't require the physical pages to be contiguous, you use this area.

  3. Modules Region: This is where the code and data of your LKMs (Loadable Kernel Modules) are loaded when you write them.

  4. Fixmap Region: A small reserved region used to permanently map specific physical pages (like the page tables themselves) early during boot.

  5. High Memory: This is a pain point unique to 32-bit systems. If your 32-bit machine has 4 GB of RAM, but the kernel space is only 1 GB, then a large chunk of physical memory cannot be directly mapped into the Lowmem region. This "leftover" memory is called high memory. The kernel cannot access it directly; it must use temporary dynamic mappings (kmap/kunmap) to use it.

    • Good news: On 64-bit systems, because the kernel space is a massive 128 TB, we don't need the Highmem concept at all. All physical memory can be directly mapped in.
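Here's the sketch promised in the lowmem pitfall above: a kernel-side illustration (my own demo function, meant to be called from a module's init) of the right and wrong ways to translate these two kinds of addresses:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/io.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void linear_map_demo(void)
{
	void *kptr = kmalloc(PAGE_SIZE, GFP_KERNEL);
	void *vptr = vmalloc(PAGE_SIZE);
	phys_addr_t kpa, vpa;

	if (!kptr || !vptr)
		goto out;

	/* Fine: kmalloc memory lives in the lowmem linear map, so the
	 * fixed-offset translation is valid here. */
	kpa = virt_to_phys(kptr);
	pr_info("kmalloc: va=%px -> pa=%pa\n", kptr, &kpa);

	/* virt_to_phys(vptr) would be a bug: vmalloc addresses are NOT in
	 * the linear map. Translate page by page instead: */
	vpa = page_to_phys(vmalloc_to_page(vptr));
	pr_info("vmalloc: va=%px -> pa of first page=%pa\n", vptr, &vpa);
out:
	kfree(kptr); /* both kfree() and vfree() tolerate NULL */
	vfree(vptr);
}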

7.5 Hands-on: Exploring the VAS with a Kernel Module

Talk is cheap. Let's write a kernel module to print out these macros and addresses.

The following module, show_kernel_vas.ko, queries and prints the kernel VAS layout for the current architecture.

// ch7/show_kernel_vas/kernel_vas.c
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/version.h>

/*
 * SHOW_DELTA_M(low, hi) is a helper macro defined elsewhere in this module;
 * it expands to three printk arguments: the start address, the end address,
 * and the delta between them in MB, matching the "%px - %px | [%zu MB]"
 * format strings below.
 */
static void show_kernelvas_info(void)
{
	unsigned long ram_size;

	/* totalram_pages: an exported variable up to 4.x, a function from 5.0 */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 0, 0)
	ram_size = totalram_pages() * PAGE_SIZE;
#else
	ram_size = totalram_pages * PAGE_SIZE;
#endif

	pr_info("PAGE_SIZE = %lu, total RAM ~= %lu MB\n",
		PAGE_SIZE, ram_size / (1024 * 1024));

	/* print the vmalloc region */
	pr_info("|vmalloc region: "
#if (BITS_PER_LONG == 64)
		" %px - %px | [%9zu MB]\n",
#else
		" %px - %px | [%5zu MB]\n",
#endif
		SHOW_DELTA_M((void *)VMALLOC_START, (void *)VMALLOC_END));

	/* print the lowmem region (the direct-mapping region) */
	pr_info("|lowmem region: "
#if (BITS_PER_LONG == 32)
		" %px - %px | [%5zu MB]\n"
		"| ^^^^^^^^ |\n"
		"| PAGE_OFFSET |\n",
#else
		" %px - %px | [%9zu MB]\n"
		"| ^^^^^^^^^^^^^^^^ |\n"
		"| PAGE_OFFSET |\n",
#endif
		SHOW_DELTA_M((void *)PAGE_OFFSET, (void *)(PAGE_OFFSET) + ram_size));

	/* ... (print the modules region, the KASAN region, etc.) */
}

Running this module on a Raspberry Pi Zero W (32-bit ARM) yields the following output:

[Figure 7.13: Output of show_kernel_vas.ko on a Raspberry Pi Zero W]

We can see that:

  • PAGE_OFFSET is 0xc0000000 (3 GB).
  • The Lowmem region size is about 508 MB (essentially the Pi Zero's 512 MB of RAM, minus what the firmware reserves).
  • The Kernel Modules region is located at 0xbf000000, right below Lowmem.

This pieces together a complete memory map.

[Figure 7.14: Complete process VAS layout on AArch32 (with kernel details)]

7.6 Randomized Layouts: ASLR and KASLR

If memory addresses are deterministic, they make fixed targets for attackers: once someone knows the address of a particular kernel function or buffer, building an exploit gets a lot easier.

To prevent this, Linux introduced Address Space Layout Randomization (ASLR).

  • User-space ASLR: Every time you run a program, the locations of the stack, heap, and libc load addresses will randomly change.
    • You can control this via /proc/sys/kernel/randomize_va_space (0 = off, 1 = conservative, 2 = full randomization, the usual default); see the quick demo after this list.
  • Kernel-space KASLR: Every time the system boots, the base address of the kernel's text and data segments within the kernel VAS will also randomly change.
    • This works very well on 64-bit systems because the space is large, providing strong randomness. Its effectiveness is limited on 32-bit systems.
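Here's the quick user-space demo mentioned above (my own sketch). Build it as a default PIE executable and run it several times; with full randomization, all three addresses should change on every run:

#include <stdio.h>

int main(void)
{
	int on_stack = 0;

	printf("stack variable : %p\n", (void *)&on_stack);
	printf("printf (libc)  : %p\n", (void *)printf);
	printf("main (text)    : %p\n", (void *)main);
	return 0;
}

Now write 0 to /proc/sys/kernel/randomize_va_space (as root) and run it again a few times: the addresses freeze. Don't forget to restore the default of 2 afterward.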

We can write a script to check the current system's ASLR status:

[Figure 7.20/7.21: ASLR status check script execution results]

If you're debugging a kernel crash and find that the addresses don't match up on every boot, remember to check if KASLR is enabled first.

7.7 Physical Memory Organization: Nodes, Zones, Pages

Finally, let's return from the virtual world to the physical world.

The Linux kernel doesn't view physical memory as one giant blob of "RAM." It sees it as a hierarchical structure:

  1. Node: This is a concept from NUMA (Non-Uniform Memory Access) architectures. On multiprocessor servers, CPUs might be attached to different memory controllers.

    • Accessing "local" memory is fast; accessing "remote" memory is slightly slower.
    • Even on your PC (UMA architecture), for code portability, Linux pretends it has one Node (Node 0).
  2. Zone: Each Node is divided into several Zones. This is primarily to cope with hardware limitations.

    • DMA Zone: Ancient ISA devices can only access the low 16 MB of memory.
    • DMA32 Zone: Some devices can only access the low 4 GB.
    • Normal Zone: This is "regular" memory.
    • HighMem Zone: The 32-bit high memory we just mentioned.
  3. Page Frame: The smallest unit of physical memory management, typically 4 KB. Each page frame is tracked by a struct page structure.

You can use /proc/buddyinfo to see the distribution of page blocks across the system's Zones.

$ cat /proc/buddyinfo
Node 0, zone    DMA      3      2      4      3      3      1      0      0      1      1      3
Node 0, zone  DMA32  31306  10918   1373    942    505    196     48     16      4      0      0
Node 0, zone Normal  49135   7455   1917    535    237     89     19      3      0      0      0

This tells us: the system only has Node 0 (UMA), with three Zones underneath.
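One more decoding hint: each row's eleven columns are free-block counts for buddy orders 0 through 10, where an order-n block is 2^n physically contiguous page frames (4 KB each here). A tiny sketch of my own that totals the free memory in the Normal zone row above:

#include <stdio.h>

int main(void)
{
	/* Free-block counts for "Node 0, zone Normal" from the sample
	 * output above, orders 0 through 10 */
	const long counts[] = { 49135, 7455, 1917, 535, 237, 89, 19, 3, 0, 0, 0 };
	long pages = 0;

	for (int order = 0; order < 11; order++)
		pages += counts[order] << order; /* 2^order pages per block */

	printf("free pages in zone Normal: %ld (~%ld MB)\n",
	       pages, pages * 4 / 1024);
	return 0;
}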


Chapter Echoes

We have now built a panoramic view of the memory world: from a process's perspective, it's a massive Virtual Address Space (VAS), split by a 3:1 (or 128 TB : 128 TB) boundary; from the kernel's perspective, physical memory is organized into Nodes, Zones, and Pages, mapped into virtual space through page tables.

Think back to a classic driver-debugging puzzle: why might a device fail to respond even though its driver registered successfully? Now you can reason about it more deeply. Perhaps the driver is running in the correct address space, but the physical memory mapping it tries to access was never established, or it wrongly applied the lowmem-only subtraction trick to a vmalloc address, leading to an invalid memory access.

In the next chapter, we will use all the intuition built in this chapter to do something more practical: claiming a piece of land that truly belongs to us on this complex memory map. We'll dive deep into kmalloc, vmalloc, and the Slab Allocator behind them. That is the true brick-laying moment for kernel developers.


Exercises

Exercise 1: Understanding

Question: On a default-configured x86_64 Linux system, the kernel prints a memory address: 0xffff888012345678. Which of the following most likely describes the nature of this address?

Answer and Explanation

Answer: Kernel Virtual Address (KVA)

Explanation: In x86_64 Linux systems, the high bits of a virtual address are used to distinguish between user space and kernel space. The upper 16 bits (MSB) of a User Virtual Address (UVA) are all 0s, typically in the format 0x0000...; while the upper 16 bits of a Kernel Virtual Address (KVA) are all 1s, typically in the format 0xffff.... The address in the question starts with 0xffff, matching the characteristics of a KVA and belonging to the Canonical Upper Half.

Exercise 2: Application

Question: Suppose you are developing a driver module running on a 32-bit ARM (AArch32) Linux system. You need to allocate a block of memory where there is a fixed linear offset relationship between the virtual address and the physical address (i.e., the physical address can be obtained by simply subtracting PAGE_OFFSET). Which memory allocation function (or region) should you use?

Answer and Explanation

Answer: Lowmem region (allocated via kmalloc or similar functions)

Explanation: The question requires a fixed linear offset between the virtual address and the physical address (i.e., direct mapping). By definition, the Lowmem Region is exactly where physical RAM is directly mapped in the kernel VAS, where virtual address = physical address + PAGE_OFFSET. Helpers like virt_to_phys() rely on precisely this relationship, so they are valid only for addresses lying within the Lowmem region. In the vmalloc region or Highmem, this simple linear relationship does not exist, and Highmem additionally requires temporary mappings before it can even be accessed.

Exercise 3: Thinking

Question: When writing a kernel driver, why can't we directly perform DMA (Direct Memory Access) operations on memory allocated from the vmalloc region, and why do we typically need to use kmalloc (allocated from the Lowmem/Normal Zone)? Briefly analyze the reasons based on the virtual address mapping mechanism and hardware limitations.

Answer and Explanation

Answer: Because memory in the vmalloc region is virtually contiguous but physically discontiguous, and many DMA controllers can only handle contiguous physical memory blocks.

Explanation: The core of this question lies in understanding the difference between virtual contiguity and physical contiguity.

  1. kmalloc/Lowmem: The returned memory resides in the Lowmem region; it is not only virtually contiguous but its corresponding physical memory is also contiguous. Most simple DMA controller hardware only accepts contiguous physical address ranges.
  2. vmalloc: The returned memory is contiguous in the virtual address space, but in physical memory, it is stitched together from multiple non-contiguous page frames.
  3. Conclusion: If you pass an address returned by vmalloc directly to a DMA device that only supports physically contiguous addressing, the device might only read valid data from the first physical page, leading to data transfer errors or out-of-bounds memory access. Modern kernels do provide DMA-mapping APIs (dma_map_single for a single physically contiguous buffer, dma_map_sg for scatter/gather lists), but using them correctly adds complexity.

Key Takeaways

Virtual memory is the foundational mechanism for process isolation in modern operating systems; every process believes it has the entire address space to itself—for example, 4 GB on a 32-bit system. To balance isolation with system call performance, Linux employs a Virtual Memory (VM) split strategy. A typical 32-bit configuration uses a 3:1 ratio (3 GB user space : 1 GB kernel space), while 64-bit systems utilize Canonical Addressing to split the massive 64-bit space into a bottom 128 TB user region and a top 128 TB kernel region. The key to this design is that all processes share the same kernel space, allowing the CPU to handle system calls without switching page tables, thus balancing security and efficiency.

Translating a virtual address to a physical address is not a simple mathematical operation, but a progressive indexing process completed by MMU hardware through multi-level page tables (PGD -> PUD -> PMD -> PTE). Although we typically access memory through pointers, these are merely virtual addresses; the CPU must traverse the page tables to map them to actual physical page frames. To accelerate this frequent operation, modern CPUs use the TLB (Translation Lookaside Buffer) to cache translation results, only performing the expensive memory access on a cache miss.

Although a process's user-space layout consists of the familiar text segment, data segment, heap, and stack, their exact locations are finely managed by the kernel data structure vm_area_struct (VMA) and can be viewed via /proc/PID/maps. Worth mentioning is the vdso mechanism, which maps the code for high-frequency system calls like gettimeofday directly into user space, allowing applications to execute that functionality without trapping into kernel mode. This is an important compromise the operating system makes for performance optimization.

The layout of the Kernel Virtual Address Space (KVAS) directly dictates how we develop drivers, and the most critical part is the linear mapping region of "low memory." Within this region, there is a fixed PAGE_OFFSET offset between kernel logical addresses and physical addresses, allowing direct conversion through simple subtraction. However, in the vmalloc region or the Highmem region, memory might be virtually contiguous but physically scattered, or entirely inaccessible directly. Understanding the differences between these regions is a required lesson for avoiding the misuse of kmalloc and vmalloc, and for preventing invalid memory accesses that lead to kernel panics.

At the physical level, the Linux kernel doesn't treat memory as a single monolithic block; instead, based on hardware architecture (like NUMA), it organizes memory into a three-layer model: Node, Zone, and Page. The division of Zones is primarily to cope with the DMA limitations of legacy hardware (such as only being able to access the low 16 MB) or the addressing bottlenecks of high memory on 32-bit architectures. By using /proc/buddyinfo, we can observe this fragmented management state. Mastering this hierarchical structure helps us understand how the kernel efficiently allocates physical memory while satisfying specific device constraints (like DMA requirements).