Chapter 8: Kernel Memory Allocators: The Buddy System and Slab
8.0 Introduction: When the Kernel Starts Managing Memory
In previous chapters, we dissected the kernel's internal architecture and process management like taking apart a clock. Now, we're going to touch the most sensitive and core nervous system of this operating system—Memory Management.
For driver developers, this is often where the "nightmare" begins. You've written the logic, registered the interrupts, and everything looks perfect—but the system crashes intermittently, or mysteriously slows down after running for a while. More often than not, the problem lies in the details of memory allocation: you're unknowingly wasting precious physical memory, or sleeping in the wrong context.
Memory allocation in the Linux kernel isn't as simple as "giving you a piece of land." It's divided into two main layers: the underlying Page Allocator, which is the Buddy System we'll discuss; and the Slab Allocator built on top of it. Understanding how both work and knowing when to use which will save you a few hairs during code reviews.
In this chapter, we won't just cover how to use the APIs—we'll also dig into those counterintuitive memory waste issues hidden behind the scenes, using code and kernel logs.
8.1 Kernel Memory Allocators: Who's in Charge of RAM?
The core question we need to solve is simple: How is memory allocated and freed in the kernel?
But behind this lies a deep cognitive trap. If you bring your user-space malloc() intuitions into the kernel, you're going to fall hard. Kernel memory is physically contiguous, precious, and never swapped out to disk.
8.1.1 Two Engines: The Page Allocator and the Slab Allocator
You can think of kernel memory management as a large building materials supermarket.
- The Page Allocator is the warehouse manager driving the forklift. He has a checklist tracking all bundles of lumber (physical page blocks in powers of two). If you need 128 pieces of wood, he'll give you a bundle of 128; if he doesn't have that, he'll cut a bundle of 256 in half. He only handles these bulk transactions. His algorithm is called the Buddy System.
- The Slab Allocator is the retail counter at the warehouse entrance. He knows people often only need a few nails (small objects), so there's no need to go to the warehouse for a whole bundle every time. So he pre-cuts some wood and places it in boxes of various sizes. When you need 20 nails, he gives you a box of 32. It's a little more than you asked for, but it's much better than giving you an entire bundle.
This brings up a key point: The Slab Allocator lives on top of the Page Allocator. The Slab's "boxes" ultimately still need to ask the Page Allocator for "whole bundles of wood." But for us developers, knowing which counter to use saves a lot of trouble.
8.1.2 The Iron Law of Kernel Memory: Non-swappable
Here is a rule you must etch into your brain: Kernel memory is non-swappable.
User-space programs can use up physical memory and then get thrown into a swap partition (i.e., disk) by the kernel to free up RAM. But the data structures used by the kernel code itself—such as the data structures used to manage memory—must never be thrown out. Otherwise, when the kernel wants to read data back from disk, it needs memory to operate, but the memory management code is on disk... This is like needing to wear your glasses in order to find your glasses—it's a deadlock.
So, kernel memory is resident in RAM. This also means that wasting kernel memory is far more costly than wasting user memory.
8.2 The Page Allocator: Secrets of the Buddy System
Let's step into that large warehouse. The core data structure of the Page Allocator is struct zone, and inside it, there is an array free_area[MAX_ORDER]. This is the "Buddy System free list" we're going to talk about.
8.2.1 Understanding the Structure of the Free List
MAX_ORDER is an architecture-dependent constant, usually 11 on x86 and ARM. This means we have 11 lists (numbered 0 to 10), each holding "memory blocks" of different sizes.
- Order 0: Holds 1-page (4 KB) blocks.
- Order 1: Holds 2-page (8 KB) blocks.
- Order 2: Holds 4-page (16 KB) blocks.
- ...
- Order 10: Holds 1024-page (4 MB) blocks.
All blocks are physically contiguous. This is why the allocator is called the "Buddy System": if you have an Order N block and cut it in half, these two Order N-1 blocks are "buddies." If they both become free, they will merge back into an Order N block. This is the magic of "anti-fragmentation."
You can see the current inventory through /proc/buddyinfo, the "warehouse ledger":
$ cat /proc/buddyinfo
Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 1
Node 0, zone DMA32 3551 1542 476 255 158 77 41 21 11 3 0
Each column of numbers corresponds to the number of free blocks on Order 0 through Order 10. If you see a 0 in the Order 10 column, it means the system can't find even a single 4 MB chunk of physically contiguous memory.
8.2.2 How the Algorithm Actually Works: The Life of an Allocation Request
Suppose your driver requests 128 KB of memory. Let's calculate: 128 KB / 4 KB = 32 pages. 32 is 2 to the 5th power. So the allocator will look at the Order 5 list.
- If in stock: Take it directly. Done.
- If out of stock: Look at Order 6. If Order 6 has a 256 KB block, the allocator will cut it in half (into two 128 KB buddies).
- Handling the excess: One half goes to you (fulfilling the request), and the other half is hung back on the Order 5 list for next time.
Sounds perfect, right? But there's a huge pitfall here called internal fragmentation.
8.2.3 Pitfall Alert: The Trap of Internal Fragmentation
Suppose you need 132 KB of memory. The allocator takes a look: 132 KB is 33 pages, which is not a power of two. The next "box" that can hold it is an Order 6 block—64 pages, or 256 KB. Result: You requested 132 KB, but the kernel gave you 256 KB. The remaining 124 KB is just wasted.
This is the cost of "cutting wood." If you only need a tiny bit more than 128 KB, the warehouse manager is still forced to cut a 256 KB bundle for you.
To alleviate this problem, the kernel provides a pair of APIs: alloc_pages_exact() and free_pages_exact(). They allocate the next power-of-two block and then immediately hand the unused tail pages back to the Buddy System. While this can't completely eliminate waste, it stops the bleeding.
8.3 Using the Page Allocator API: "Trading" with the Kernel
Now let's write some code. The Page Allocator provides a set of APIs, and their names usually contain page or free_page.
8.3.1 Core Cheat Sheet
These APIs are your tools for making "bulk transactions" in the kernel:
| API / Macro | Function | Return Value |
|---|---|---|
| __get_free_page(gfp) | Allocate 1 page | Kernel logical address |
| __get_free_pages(gfp, order) | Allocate $2^{order}$ pages | Kernel logical address |
| get_zeroed_page(gfp) | Allocate 1 page and zero it | Kernel logical address |
| alloc_page(gfp) | Allocate 1 page | struct page * pointer |
| alloc_pages(gfp, order) | Allocate $2^{order}$ pages | struct page * pointer |
Note the difference: __get_free_page returns an address (which you can use directly), while alloc_page returns a page descriptor (struct page). You need to call page_address() to convert it to an address before you can use it. It's like one directly handing you the key, while the other gives you a registration card to exchange for a key at the front desk.
8.3.2 Freeing Memory: Don't Mess It Up
The corresponding free APIs are simple:
void free_page(unsigned long addr);
void free_pages(unsigned long addr, unsigned int order);
void __free_pages(struct page *page, unsigned int order);
Here's a classic pitfall: if you allocate with alloc_page and get a page pointer, never treat it as a long integer address and pass it to free_pages, or mix up the first argument when passing it to __free_pages. Always use them in pairs.
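The correct pairings can be shown in a minimal kernel-style sketch. This is not a complete module (error handling is trimmed to the essentials), just an illustration of the two allocation styles and their matching free calls:

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Sketch: the two allocation styles and their matching free calls.
 * Passing a struct page * where an address is expected (or vice
 * versa) is the classic bug to avoid. */
static int demo(void)
{
    unsigned long addr;
    struct page *pg;
    void *va;

    /* Style 1: address in, address out. */
    addr = __get_free_pages(GFP_KERNEL, 2);   /* 2^2 = 4 pages */
    if (!addr)
        return -ENOMEM;

    /* Style 2: you get a page descriptor; convert it before use. */
    pg = alloc_pages(GFP_KERNEL, 2);
    if (!pg) {
        free_pages(addr, 2);
        return -ENOMEM;
    }
    va = page_address(pg);                    /* the "front desk" step */
    memset(va, 0, 4 * PAGE_SIZE);

    __free_pages(pg, 2);                      /* pairs with alloc_pages */
    free_pages(addr, 2);                      /* pairs with __get_free_pages */
    return 0;
}
```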
8.4 GFP Flags: Telling the Kernel Your Bottom Line
All allocation functions have a parameter called gfp_mask (e.g., GFP_KERNEL, GFP_ATOMIC). This isn't just for show—it's a "life-or-death contract" between you and the kernel.
8.4.1 The Golden Rule: Process Context vs. Atomic Context
You must memorize this rule; violating it will crash or hang the system:
- GFP_KERNEL: You are in process context (e.g., in your module's init function, or in a system call implementation), and sleeping is allowed. This is the most commonly used flag. It tells the kernel: "If there's not enough memory, you can reclaim memory, or even read/write to disk. Take your time, I can wait."
- GFP_ATOMIC: You are in atomic context (e.g., in an Interrupt Handler / ISR, or while holding a spinlock). Sleeping is absolutely forbidden here. You must tell the kernel: "Give it to me now. If you can't, just fail and return immediately. Whatever you do, don't try to schedule or do I/O."
If you use GFP_KERNEL while holding a spinlock, the allocator may put you to sleep while it reclaims memory—and sleeping with a spinlock held leaves every other CPU spinning forever on that lock. This is what's known as "sleeping in atomic context," and the consequence is usually a system deadlock or panic.
8.4.2 Other Common Flags
__GFP_ZERO: This is a modifier, usually used in combination with the two above (GFP_KERNEL | __GFP_ZERO). It tells the kernel to zero out the allocated memory. This is a good programming practice that prevents information leaks and uninitialized memory vulnerabilities.
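Putting the flags in context, here is a hedged kernel-style sketch. The types my_dev and my_buf and the function names are hypothetical, invented purely for illustration:

```c
#include <linux/slab.h>
#include <linux/interrupt.h>

struct my_dev { int id; };        /* hypothetical device state   */
struct my_buf { char data[256]; };/* hypothetical packet buffer  */

/* Process context (e.g., probe/init): sleeping is fine, so use
 * GFP_KERNEL, here combined with __GFP_ZERO for zeroed memory. */
static struct my_dev *my_probe_alloc(void)
{
    return kmalloc(sizeof(struct my_dev), GFP_KERNEL | __GFP_ZERO);
}

/* Interrupt handler: atomic context, so GFP_ATOMIC—and we must be
 * prepared for failure instead of waiting. */
static irqreturn_t my_isr(int irq, void *dev_id)
{
    struct my_buf *buf = kmalloc(sizeof(*buf), GFP_ATOMIC);

    if (!buf)
        return IRQ_HANDLED;  /* drop the work; never sleep here */
    /* ... fill buf and queue it for the bottom half ... */
    return IRQ_HANDLED;
}
```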
8.5 The Slab Allocator: Making Small Object Allocation More Efficient
Let's return to our "building materials supermarket" analogy. If you only need a few bytes each time (like a network packet's struct sk_buff), the Page Allocator's big saw is way too clumsy. This is where we need the Slab Allocator.
8.5.1 Object Caches and General Caches
The Slab Allocator was designed with two goals in mind:
- Object Caches: The kernel has many frequently used data structures (like task_struct, inode, dentry). They are allocated and freed constantly. The Slab Allocator pre-creates "cache pools" for these objects during system initialization. When you need a task_struct, you simply grab a ready-made one from the pool and throw it back when you're done. This is much faster than initializing a new structure every time.
- General Caches: For regular memory allocations with no special requirements, the kernel maintains a set of kmalloc-N caches (e.g., kmalloc-8, kmalloc-16, ... kmalloc-8192). When you call kmalloc(20, ...), the kernel sees that you need more than 16 bytes, so it hands you a 32-byte object from the kmalloc-32 cache.
8.5.2 Core Cheat Sheet: kmalloc and kzalloc
These are the APIs you'll use most often when writing drivers day-to-day:
void *kmalloc(size_t size, gfp_t flags);
void *kzalloc(size_t size, gfp_t flags);
void kfree(const void *);
- kmalloc: Returns uninitialized memory (the contents are random, garbage data).
- kzalloc: Returns zeroed memory. It's recommended to prioritize this one—it saves worry and is safer.
- kfree: Frees memory. Note that passing NULL to kfree is safe; it simply does nothing.
Important characteristic: Memory allocated by the Slab Allocator is also physically contiguous and guarantees CPU cache line alignment (Cache Line Aligned), which is crucial in high-performance network and storage drivers.
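A short kernel-style sketch of the kzalloc/kfree pattern; struct my_ctx is a hypothetical example type, not from any real driver:

```c
#include <linux/slab.h>
#include <linux/types.h>

struct my_ctx {               /* hypothetical driver context */
    int id;
    u8  regs[16];
};

static struct my_ctx *ctx_create(void)
{
    /* kzalloc = kmalloc + zeroing: every field starts at 0, so no
     * stale kernel data can leak through uninitialized members. */
    struct my_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

    if (!ctx)
        return NULL;
    ctx->id = 1;
    return ctx;
}

static void ctx_destroy(struct my_ctx *ctx)
{
    kfree(ctx);               /* kfree(NULL) is a no-op: no NULL check needed */
}
```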
8.6 The Ultimate Limit: kmalloc's Ceiling
Since Slab is so handy, can I just use kmalloc to request 100 MB of memory?
No.
Let's go back to the building materials supermarket analogy. Although the Slab counter is flexible, its inventory (Slab objects) ultimately comes from the warehouse (Page Allocator). And the largest single order in the warehouse (MAX_ORDER) is typically limited to 4 MB (1024 pages).
Therefore, kmalloc's single allocation upper limit is 4 MB.
If you try to allocate memory larger than this (e.g., 5 MB), kmalloc will directly return NULL, and the kernel might even print a warning message, because such an allocation request usually indicates a design flaw. If you really need that much physically contiguous memory, you should consider other DMA allocation mechanisms, or redesign your data structures to handle non-contiguous memory pages.
8.7 Resource Management and Tips
Finally, as kernel developers, we have a few tricks to make our code more robust.
- devm_kzalloc: If you're writing a device driver, try to use this API. It's "resource-managed": when your driver is unloaded or the device is disconnected, the kernel automatically frees this memory for you. You don't need to worry about forgetting kfree and causing a memory leak, which is incredibly useful in error handling paths.
- kcalloc and kmalloc_array: If you need to allocate an array, don't manually calculate count * size. Use these two APIs directly; they check for integer overflow internally. This prevents an oversized count * size from silently wrapping around, which would allocate a tiny amount of memory and later trigger a crash.
- Mind your alignment: The Slab Allocator already handles cache line alignment for you, so try to put frequently accessed data structure members at the beginning of the struct so they fall on the same cache line, improving the hit rate.
Chapter Echoes
Think back to the "nightmare" scenario mentioned at the beginning of this chapter. Now you should understand why simply "allocating memory" is so complex in the kernel.
The kernel's memory allocation system—from the underlying Page Allocator to the upper Slab—is essentially balancing three contradictory goals: physical contiguity, speed, and fragmentation.
- If you need physically contiguous large blocks of memory (e.g., for DMA), you must face the internal fragmentation problem of the Buddy System.
- If you need speed and small blocks of memory, you rely on Slab, but you're still limited by the underlying Buddy System's ceiling (4 MB).
- Regardless of which choice you make, you must always be acutely aware of which context you're in (GFP flags), otherwise the price could be the crash of the entire system.
In the next chapter, we'll continue diving deeper into this topic. You'll discover that sometimes even "physical contiguity" can't be guaranteed, and we'll need to handle more complex scenarios like vmalloc and the OOM Killer that strikes fear into the hearts of all server administrators.
Exercises
Exercise 1: application
Question: Suppose you are writing a Network Device Driver, and its Interrupt Handler (ISR) needs to allocate a physically contiguous block of memory for a received data packet. Given that this function runs in atomic context, which memory allocation flag (GFP flag) should you use? What are the potential consequences of choosing the wrong one?
Answer and Analysis
Answer: You should use GFP_ATOMIC. Using GFP_KERNEL could lead to a system deadlock or crash.
Analysis: 1. Concept Application: An Interrupt Handler (ISR) runs in Atomic Context, where process sleeping is not allowed.
2. GFP Differences: GFP_KERNEL is the standard allocation flag that allows the kernel to wait by reclaiming memory or performing I/O operations when memory is low, which puts the calling process to sleep. GFP_ATOMIC explicitly forbids the allocator from sleeping; if memory is insufficient, it fails immediately rather than waiting.
3. Consequence Analysis: If you call GFP_KERNEL while holding a spinlock or in interrupt context, the kernel might attempt to schedule another process, leading to a hang or system crash. Therefore, you must use GFP_ATOMIC in atomic context.
Exercise 2: understanding
Question: If you request the allocation of 132 KB of physically contiguous memory, how much memory will the kernel actually allocate under the standard Buddy System algorithm (assuming a page size of 4 KB)? What is this phenomenon called? If there is 50 MB of available memory but it is severely fragmented, could this cause the allocation to fail?
Answer and Analysis
Answer: It will actually allocate 256 KB. This phenomenon is called internal fragmentation. Yes, it could fail.
Analysis: 1. Calculation Logic: The Buddy System can only allocate memory blocks of $2^{order}$ page frames. 132 KB is approximately 33 page frames (132/4). Rounding up to the nearest power of two gives $2^6 = 64$ page frames, which is 256 KB. 2. Concept Definition: The requested 132 KB is less than the actually allocated 256 KB, and the remaining 124 KB cannot be effectively utilized by the system. This waste is called internal fragmentation. 3. Thought/Analysis: Although the system might have 50 MB of total free memory, if this 50 MB is scattered as fragments in low-order lists (like order 0, 1, 2) and cannot be pieced together into a large contiguous block (like order 5 or higher), the 256 KB allocation request will fail. This is the limitation of the Buddy System.
Exercise 3: thinking
Question: The Linux kernel has a pre-allocated Slab cache named kmalloc-192 (on systems supporting CONFIG_SLAB or CONFIG_SLUB). Analyze: Why doesn't the kernel just use the Buddy System to fulfill small memory allocation requests of less than a page, instead of maintaining this Slab layer? What potential impact does this have on CPU cache performance?
Answer and Analysis
Answer: Using the Slab layer reduces internal fragmentation, improves allocation speed, and optimizes object initialization costs. The impact on CPU cache is: The Slab Allocator typically guarantees memory alignment to cache lines, thereby improving cache hit rates.
Analysis: 1. Architectural Design: If you directly use the Buddy System to allocate 192 bytes, the Buddy System allocates a minimum of one page (4 KB), resulting in massive internal fragmentation (wasting ~95%). The Slab Allocator "slices" pages into fixed-size slots (like 192 bytes), greatly improving memory utilization.
2. Performance Considerations: Buddy System allocation/freeing involves complex linked list operations and metadata updates, which is relatively slow. The Slab Allocator, based on object caches, usually performs allocation and freeing via simple pointer operations, making it extremely fast.
3. Cache Coherence: The kernel frequently needs to allocate/free specific data structures (like task_struct, dentry). Slab allows caching these objects and even supports object constructor functions, avoiding repeated initialization overhead.
4. CPU Cache Alignment: The Slab Allocator (especially the general kmalloc caches) typically aligns allocated memory to CPU cache lines (like 64 bytes). This avoids "False Sharing"—the performance jitter caused by different CPU cores modifying different variables on the same cache line—thereby improving performance in multi-core systems.
Key Takeaways
Linux kernel memory management is divided into two core layers: the underlying Buddy System manages physically contiguous page frames and is suitable for large block memory allocations; the upper Slab Allocator is built on top of the Buddy System, efficiently handling frequent allocation and freeing of small memory blocks through cached object pools.
The Buddy System divides memory into blocks sized as powers of two, managed via the MAX_ORDER array. While its "split and merge" mechanism effectively reduces external fragmentation, it is highly prone to internal fragmentation (e.g., requesting 132 KB but having to occupy 256 KB), so extra caution against waste is needed when allocating memory in non-power-of-two sizes.
Memory allocation in the kernel must strictly follow the rules for using GFP flags: use GFP_KERNEL in process context when sleeping is allowed, and you must use GFP_ATOMIC in atomic context such as when holding a spinlock or in an Interrupt Handler, otherwise it will lead to a system deadlock or crash.
Developers should choose the appropriate API based on the size of the data structure: for large blocks of memory (typically multiple pages), prioritize Page Allocator interfaces like get_free_pages; for small object allocations (less than a page), use kmalloc or kzalloc (the latter is recommended as it auto-zeros), leveraging the Slab cache for efficiency.
Although the Slab Allocator is flexible, its maximum single allocation limit is constrained by the underlying Buddy System's MAX_ORDER (typically 4 MB). For driver development, make good use of managed APIs like devm_kzalloc for automatic freeing, and prioritize kmalloc_array to prevent integer overflow risks when allocating arrays.