
Chapter 9: Advanced Kernel Memory Allocation: Choices, Reclaim, and Survival

In the last chapter, we explored the underlying logic of kernel memory allocation—the perpetually running engine known as the Buddy System—and the Slab Allocator built on top of it. If you thought that was the whole picture, you might be underestimating the complexity of this system.

In this chapter, we'll zoom out and look at the real "choice paralysis" you'll face as a module author or driver developer. With so many tools at your disposal—kmalloc, vmalloc, custom Slab caches, and even kvmalloc—which one should you actually use?

Make the wrong choice, and at best, your performance will crawl like a snail; at worst, you'll trigger that terrifying entity—the OOM Killer.

This chapter follows a clear main thread: When memory runs low, what exactly does the kernel do? We'll start by creating our own dedicated caches, move on to the virtual continuity of vmalloc, and finally confront the scenario that makes all backend engineers sweat: a system running out of memory.

Ready? This chapter is packed with practical knowledge, and some of the exercises might actually crash your system—we highly recommend experimenting inside a virtual machine.


9.1 When the Standard Library Isn't Enough: Creating Custom Slab Caches

In the last chapter, we spent a lot of time discussing the benefits of the Slab Allocator: speed, object caching, and reduced fragmentation. However, most of that discussion was based on the kernel's existing general-purpose caches (like kmalloc-192).

Consider this scenario: you're writing a driver, and your code frequently allocates and frees a specific structure (struct my_device_context). If you stick with kmalloc() and kfree(), it will work, but it's not the optimal solution.

Here's a counterintuitive fact: the generic approach is often inefficient.

The kernel anticipated this long ago. It allows you—as a module author—to create your own "private vault," known as a custom Slab cache.

9.1.1 Building a Dedicated Cache from Scratch

What we're about to do is quite intuitive. It's like opening a factory that only produces for you, and it happens in three steps:

  1. Build the factory (Create): Tell the kernel what "specifications" you need (object size) and give the factory a name.
  2. Produce and use: Take a product out of the factory to use, and put it back when you're done.
  3. Shut down (Destroy): When you're done, tear down the factory and return the "land" to the kernel.

Let's take it one step at a time.

Step 1: Build the Factory — kmem_cache_create()

This is the starting point for everything. You can think of it as submitting a "factory construction application" to the kernel's memory management authority.

#include <linux/slab.h>

struct kmem_cache *kmem_cache_create(const char *name,
                                     unsigned int size,
                                     unsigned int align,
                                     slab_flags_t flags,
                                     void (*ctor)(void *));

We need to look closely at these parameters, because every pitfall hides in the details.

  • name: The name of the cache. This isn't just for humans—tools like the /proc filesystem and slabtop will display this name. Pick a memorable name, like "my_dev_ctx", to make debugging easier later.
  • size: This is the most critical one—the size of each object in bytes.
    • ⚠️ Pitfall Warning: What you enter here is the "theoretical size," but the kernel might actually give you something larger.
    • Why? As we mentioned in the last chapter, for alignment, metadata, or simply because there's no slot of the exact size, the kernel will give you a "close enough" container. For example, if you ask for 328 bytes, the kernel might give you a 448-byte slot (don't be surprised, this is very common).
  • align: Alignment requirements. If you don't care, set it to 0. But on certain architectures (especially ARM) or if you're doing DMA, alignment is crucial. Usually, filling in sizeof(long) is a safe bet, ensuring word-length alignment.
  • flags: Flags. There are several highly practical "debug switches" here:
    • SLAB_POISON: Poison mode. The kernel fills the object's memory with recognizable poison patterns (for example, freed objects are filled with 0x6b bytes). If you spot a wild pointer pointing into such a pattern, you've touched freed or uninitialized memory. A debugging lifesaver.
    • SLAB_RED_ZONE: Red zone. Inserts guard patterns (red-zone markers) immediately before and after your object, specifically designed to catch overflow errors. If you write out of bounds and clobber a red zone, the kernel alerts you.
    • SLAB_HWCACHE_ALIGN: Hardware cache line alignment. For performance, this is generally recommended to be enabled. This is why memory coming from a standard kmalloc is always aligned to the cache line.
  • ctor: Constructor function pointer. This is a very interesting design. Although the kernel is written in C, this has a distinct object-oriented flavor.
    • Whenever the kernel carves out a new object from this cache for you, this function is automatically called to perform initialization.
    • ⚠️ Note: The constructor runs when the kernel populates the cache with fresh objects (that is, when a new slab is carved up), not on every kmem_cache_alloc(). This means the kernel may pre-create a batch of objects, so the constructor can run well ahead of, and more often than, your allocations.

If creation is successful, you'll get a struct kmem_cache * pointer. Never lose it; it's your only credential for retrieving goods later. We usually store it as a global variable.
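Before creating your own cache, it's worth peeking at the ones that already exist. As noted above, the cache name shows up in procfs and in slabtop; for example:

```shell
# Each line of /proc/slabinfo describes one cache: name, active/total
# objects, object size (bytes), objects per slab, pages per slab.
# Reading it generally requires root; 'sudo slabtop -o' prints a sorted
# one-shot view of the same data.
head -n 10 /proc/slabinfo 2>/dev/null || echo "need root to read /proc/slabinfo"
```

Once your module is loaded, your cache's name (e.g. "my_dev_ctx") will appear in this same listing.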

Step 2: Produce and Use — Allocation and Freeing

The factory is built. Now let's get to work.

Allocation: kmem_cache_alloc()

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags);
  • s: That cache pointer you just saved.
  • gfpflags: The usual drill: GFP_KERNEL (can sleep) or GFP_ATOMIC (cannot sleep).

It's like getting food at a cafeteria: you hand over your tray (the cache pointer), and the cafeteria worker serves you a scoop of food (the object memory).

Freeing: kmem_cache_free()

void kmem_cache_free(struct kmem_cache *s, void *x);
  • s: Still that cache pointer.
  • x: The block of memory you're returning.

There is an ironclad rule here: you must return the pointer to its original cache. You cannot take goods from Factory A and return them to Factory B. What happens if you do? Usually, a kernel panic.

Step 3: Shut Down — kmem_cache_destroy()

When you unload your module or no longer need this structure, you must clean up the site.

void kmem_cache_destroy(struct kmem_cache *s);

This destroy operation will only succeed if all objects borrowed from your cache have been returned. If there are still "stragglers," the kernel will refuse to destroy the cache and complain in the logs.


9.1.2 Writing the Code: Custom Slab Demo

Talk is cheap. Let's write a module to run through the entire workflow we just discussed.

Suppose we have a frequently used structure, myctx:

// ch9/slab_custom/slab_custom.c
#define OURCACHENAME "our_ctx"

/* Our demo structure.
 * Assume it's allocated and freed very frequently, so we decide to
 * give it a dedicated cache.
 * Size: 328 bytes.
 */
struct myctx {
	u32 iarr[10];                          // 40 bytes; total=40
	u64 uarr[10];                          // 80 bytes; total=120
	s8 uname[128], passwd[16], config[64]; // 208 bytes; total=328
};

static struct kmem_cache *gctx_cachep;

Code for creating the cache:

static int create_our_cache(void)
{
	// ...
	gctx_cachep = kmem_cache_create(OURCACHENAME,
			sizeof(struct myctx),
			sizeof(long), /* alignment */
			SLAB_POISON | SLAB_RED_ZONE | SLAB_HWCACHE_ALIGN,
			our_ctor);    /* constructor; may be NULL */

	if (!gctx_cachep)
		return -ENOMEM;
	return 0;
}

A side note about the constructor:

static void our_ctor(void *new)
{
	struct myctx *ctx = new;

	/* This is very much like a C++ constructor */
	memset(ctx, 0, sizeof(struct myctx));

	/* For the demo, fill in some info about the current process */
	snprintf_lkp(ctx->config, sizeof(ctx->config), "%d.%d,%ld.%ld",
		     current->tgid, current->pid, current->nvcsw, current->nivcsw);
}

There's something counterintuitive about this: you only called kmem_cache_alloc() once, but in the logs, you might see the constructor being called 18 times. Don't panic. As we mentioned earlier, for efficiency, the kernel pre-fills this cache (batching). It created 18 objects at once, just waiting for you to grab them. So, the constructor ran 18 times.

Allocation and usage:

struct myctx *obj;

obj = kmem_cache_alloc(gctx_cachep, GFP_KERNEL);
if (!obj)
	return -ENOMEM;

/* Print the actual object size */
pr_info("Our cache object size is %u bytes; ksize=%zu\n",
	kmem_cache_size(gctx_cachep), ksize(obj));

print_hex_dump_bytes("obj: ", DUMP_PREFIX_OFFSET, obj, sizeof(struct myctx));

/* Remember to return it when done */
kmem_cache_free(gctx_cachep, obj);

⚠️ Here is a real "pitfall": pay attention to the kmem_cache_size() and ksize() values in the logs. Your structure is defined as 328 bytes, but the actual size the kernel hands out is likely 448 bytes. This kind of "internal fragmentation" is an unavoidable cost of using Slab. If you're counting memory down to the byte on an embedded system, you absolutely must factor this in.
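To tie the three steps together, here is a minimal sketch of how the create/destroy calls wire into a module's init and exit paths. This is our own wiring (function names like slab_custom_init are ours, not necessarily those of the ch9/slab_custom module), and it assumes the struct myctx and our_ctor definitions from above:

```c
#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *gctx_cachep;	/* the "factory credential" */

static int __init slab_custom_init(void)
{
	/* Step 1: build the factory */
	gctx_cachep = kmem_cache_create("our_ctx", sizeof(struct myctx),
					sizeof(long),
					SLAB_POISON | SLAB_RED_ZONE | SLAB_HWCACHE_ALIGN,
					our_ctor);
	if (!gctx_cachep)
		return -ENOMEM;
	return 0;	/* Step 2 (alloc/free) happens while loaded */
}

static void __exit slab_custom_exit(void)
{
	/* Step 3: all objects must have been freed by now,
	 * or destroy will refuse and complain in the logs. */
	kmem_cache_destroy(gctx_cachep);
}

module_init(slab_custom_init);
module_exit(slab_custom_exit);
MODULE_LICENSE("GPL");
```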


9.2 Virtual Continuity: vmalloc and Its Partners

When we need a massive block of memory—so large that the Slab Allocator (or rather, the Buddy System behind it) can't provide it (e.g., over 4MB)—we need to change our approach.

This brings us to vmalloc.

9.2.1 What is vmalloc()?

You can think of kmalloc like buying a plot of land; this land must be physically contiguous, suitable for building construction. vmalloc, on the other hand, is like setting up a virtual office—it assigns you a consecutive series of street numbers (virtually contiguous addresses), but the actual offices behind them (physical memory) might be scattered all over the city.

void *vmalloc(unsigned long size);
  • Virtually contiguous: The returned pointer is contiguous.
  • Physically discrete: The underlying physical pages might be all over the place.
  • Overhead: Because it needs to establish a bunch of mappings in the page table, both allocation and access overhead are higher than with kmalloc.

When should you use it?

  1. When you need a huge buffer (several MB or even hundreds of MB).
  2. When you only need software access and don't need to do DMA with hardware (hardware usually only recognizes physical addresses, not addresses from vmalloc, unless there's an IOMMU).
  3. When you're not in interrupt context (because it might sleep).

⚠️ Reminder again: Never call vmalloc while holding a spinlock. It may sleep, and sleeping while holding a spinlock is a recipe for deadlock.

9.2.2 vmalloc's Little Brothers

The kernel provides a series of variants. Memorizing them will make your code more robust:

  • vzalloc(size): Does the same thing as vmalloc, but zeros out the memory. The z stands for Zero. Use this if you want to avoid leaking uninitialized kernel memory.
  • kvmalloc(size, flags): This is a "lazy" but smart API.
    • Its logic: first, try to get it done with kmalloc (because it's fast and physically contiguous).
    • If kmalloc fails (too large), it automatically falls back to vmalloc.
    • For programmers who don't want to agonize over which one to use, this is a godsend.
    • To free it, use kvfree().
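A minimal sketch of the kvmalloc()/kvfree() pairing in kernel-module code; the helper names and the 8MB size are ours for illustration:

```c
#include <linux/slab.h>	/* kvmalloc(), kvfree() */

static void *big_buf;

static int grab_big_buffer(void)
{
	/* kvmalloc tries kmalloc first (fast, physically contiguous) and
	 * transparently falls back to vmalloc if that fails or the
	 * request is too large. */
	big_buf = kvmalloc(8 * 1024 * 1024, GFP_KERNEL);	/* 8 MB */
	if (!big_buf)
		return -ENOMEM;
	return 0;
}

static void drop_big_buffer(void)
{
	kvfree(big_buf);	/* correct whichever path kvmalloc took */
	big_buf = NULL;
}
```

Note that you must free with kvfree(): a plain kfree() would be wrong if the allocation actually came from the vmalloc path.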

9.2.3 Code Demo: Seeing vmalloc's True Colors

Let's write a module to try it out:

// ch9/vmalloc_demo/vmalloc_demo.c
static int vmalloc_try(void)
{
	void *vptr_rndm, *vptr_init;

	/* 1. Plain vmalloc */
	vptr_rndm = vmalloc(10000);
	// Note: 3 pages (12288 bytes) are actually allocated, due to page granularity
	if (!vptr_rndm)
		return -ENOMEM;

	print_hex_dump_bytes(" content: ", DUMP_PREFIX_NONE, vptr_rndm, 16);
	// Output may be garbage: vmalloc does not zero the memory

	/* 2. vzalloc: the zeroing version */
	vptr_init = vzalloc(10000);
	if (!vptr_init) {
		vfree(vptr_rndm);
		return -ENOMEM;
	}
	print_hex_dump_bytes(" content: ", DUMP_PREFIX_NONE, vptr_init, 16);
	// Output is all 00

	vfree(vptr_rndm);
	vfree(vptr_init);
	return 0;
}

9.3 Which One Should You Use? — Decision Time

Now you have at least four allocators in your head: kmalloc, vmalloc, kmem_cache, and __get_free_pages. Are you paralyzed when writing code?

Don't worry, here's a simple decision logic (a decision tree) you can stick next to your monitor:

Decision Logic: Ask from top to bottom

  1. Who is it for?

    • If it's for DMA hardware?
      • Don't use any of these; go use the DMA-specific APIs (dma_alloc_coherent).
    • If it's for normal kernel code or driver software logic? -> Continue.
  2. Size is key

    • Very small (less than a page, a few KB)?
      • First choice: kmalloc() / kzalloc(). This is the fastest and most effortless. Performance first.
    • Medium (less than a few MB, e.g., 1MB - 4MB)?
      • If you are absolutely certain you need physical continuity: you can only bite the bullet and use kmalloc (but watch out for failures), or use the low-level page allocator __get_free_pages().
      • If you don't care about physical continuity: use kvmalloc(). It will choose intelligently.
    • Huge (over 4MB)?
      • You're basically limited to vmalloc().
  3. Is the same object frequently allocated/freed?

    • Yes -> Consider creating a custom Slab cache (kmem_cache_create). This can massively improve performance and reduce fragmentation.

9.3.1 Performance Trap: Don't Abuse vmalloc

There is a common misconception here:

"Since vmalloc can allocate large memory anyway, why don't I just switch all my small memory allocations to vmalloc?"

Absolutely do not do this.

  • kmalloc grabs directly from the memory pool, which is very fast.
  • vmalloc needs to modify page tables and handle TLB invalidations, making it much slower.
  • Furthermore, memory from vmalloc is not physically contiguous, so it can't simply be handed to a DMA engine; you'd need scatter-gather mappings or an IOMMU, adding insult to injury.

In a nutshell: Default to kmalloc first. Only when it truly can't meet your needs (too large) should you take a step back and seek help from vmalloc or kvmalloc.


9.4 What to Do When Memory Runs Out? — Reclaim and OOM

Now, suppose your driver is running perfectly, and memory allocation is fine. But as the system runs for a long time, memory keeps shrinking... At this point, the kernel starts doing "housekeeping." This is called Memory Reclaim.

9.4.1 Watermarks and kswapd

The kernel defines the amount of free memory in each memory Zone using three watermarks:

  • min: The minimum warning line.
  • low: Getting a bit tight.
  • high: Comfortable, very abundant.

There is a kernel thread called kswapd, acting like a diligent cleaner, constantly watching these watermarks.

  • When free memory drops below the low watermark: kswapd wakes up and starts cleaning, mainly dropping reclaimable page cache and shrinking Slab caches.
  • If that's still not enough: it gets more aggressive, writing dirty pages back to disk, and keeps working until free memory climbs back above high.
  • If free memory falls below min: processes requesting memory are blocked and pushed into direct reclaim, synchronously freeing pages themselves until enough is available.
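You can watch these watermarks live, per zone and in units of pages, via /proc/zoneinfo:

```shell
# Per-zone watermarks, in pages. kswapd wakes below 'low' and keeps
# reclaiming until free pages climb back above 'high'.
grep -E '^Node|^ +(min|low|high) ' /proc/zoneinfo | head -n 20
```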

9.4.2 The Last Line of Defense: OOM Killer

If kswapd works its heart out and there's still not enough memory, and even the min watermark can't be held, the kernel will call upon that "cold-blooded killer"—the OOM Killer.

Its logic is simple and brutal: To keep the entire system from crashing, it must kill some processes to free up memory.

It uses a scoring mechanism (oom_score) to select the process that "most deserves to die" (usually the one consuming the most memory) and sends it a SIGKILL signal.

What does this mean for you? If a service you're running suddenly gets killed, and the only line in the logs is Killed, that's the OOM Killer's doing.

Hands-on: Manually Triggering OOM

Want to experience what it feels like to be OOM'd? (Strongly recommended inside a virtual machine.) You can trigger it via the SysRq key:

echo f > /proc/sysrq-trigger

At this point, the system will immediately evaluate who deserves to die the most and take action.

9.4.3 Protecting Important Processes: oom_score_adj

Since the OOM killer is so ruthless, can we protect critical processes (like sshd)? Yes. By adjusting /proc/<pid>/oom_score_adj.

  • Range: -1000 to 1000.
  • -1000: Absolutely never kill.
  • 1000: First in line to be killed (useful for marking a process as expendable).
# Protect the SSH daemon
echo -1000 > /proc/$(pidof sshd)/oom_score_adj
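You can also inspect the killer's bookkeeping for any process; here, the current shell:

```shell
# Computed badness score for this process (higher = more likely to be killed):
cat /proc/self/oom_score

# The user-tunable knob, range -1000 (never kill) to 1000 (kill first):
cat /proc/self/oom_score_adj
```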

9.5 Chapter Echoes

In this chapter, we journeyed from the microscopic world of custom Slab caches to the macroscopic scale of vmalloc, and finally faced system-level memory crises head-on.

We discovered that memory management is far more than just "allocate" and "free." It's a game of trade-offs:

  • Speed vs. Size: Slab is fast but limited; vmalloc is large but slow.
  • Contiguous vs. Discrete: Physical continuity is expensive; virtual continuity is cheap.
  • Individual vs. Whole: Your driver might want to hoard memory, but the OOM Killer, for the survival of the entire system, can kick you out at any time.

Remember the "choice paralysis" mentioned at the beginning of this chapter? You should now have a clear mental model: The kernel provides a complete set of tools that let you make trade-offs across different dimensions. There is no best API, only the one most suitable for the current scenario.

In the next chapter, we'll leave the "land" of memory and explore another core resource—CPU time. We'll see how the kernel, like a dispatcher, decides who gets to run on the CPU and who must wait in line.

That is a story about time.


Exercises

Exercise 1: Understanding

Question: Suppose you are writing a kernel module that frequently allocates and frees a data structure named struct packet_obj (size: 100 bytes). You decide to create a custom Slab cache named packet_cache. To detect buffer overflow errors as early as possible and leverage hardware cache line alignment for better performance, which flags should you use when calling kmem_cache_create()?

Answer and Explanation

Answer: SLAB_RED_ZONE | SLAB_HWCACHE_ALIGN

Explanation: Tests your understanding of Slab flags. Based on the concepts covered:

  1. SLAB_RED_ZONE: Inserts red zones around the allocated buffer to detect buffer overflow or underflow errors, meeting the requirement to "detect buffer overflow errors as early as possible."
  2. SLAB_HWCACHE_ALIGN: Ensures cache objects are aligned to hardware cache line boundaries to improve performance, meeting the requirement to "leverage hardware cache line alignment."
  3. SLAB_POISON is also used for debugging, but it primarily detects uninitialized memory references (by filling with a specific pattern). While it could be enabled, the question specifically asks for "detecting overflows" and "performance alignment."

Exercise 2: Application

Question: In kernel module development, you need to allocate a huge array (approximately 32 MB in size) for temporary data storage. Considering the limitations of physical memory continuity and the risk of allocation failure, which of the following APIs is the most appropriate choice? A. kmalloc() B. kmem_cache_alloc() C. __get_free_pages() D. kvmalloc()

Answer and Explanation

Answer: D. kvmalloc()

Explanation: Tests your ability to choose the right API for a practical scenario.

  • A. kmalloc(): Its upper bound is set by the Buddy System's maximum allocation order (typically 4MB with 4KB pages, and architecture-dependent), and a physically contiguous request of that size fails easily under fragmentation; it cannot reliably allocate 32MB.
  • B. kmem_cache_alloc(): Used for allocating objects of a specific size, not suitable for large memory blocks.
  • C. __get_free_pages(): Directly allocates physically contiguous pages. 32MB requires 8192 contiguous 4KB pages (order = 13), which is highly likely to fail in a fragmented memory system.
  • D. kvmalloc(): This is the most appropriate choice. Its original design purpose is to handle larger memory allocations: it tries to use kmalloc() to obtain physically contiguous memory; if that fails (or if the request is too large), it falls back to using vmalloc(), which only guarantees virtual continuity without requiring physical continuity, thereby greatly increasing the chances of successfully allocating large memory blocks.

Exercise 3: Thinking

Question: When designing a kernel module for a high-reliability system, you need to decide how to handle memory allocation failures and system memory pressure. Please compare and analyze the use cases and fundamental differences of the following two mechanisms:

  1. The fallback mechanism implemented inside kvmalloc().
  2. The OOM Killer mechanism triggered by the kernel when memory is severely insufficient.

Answer and Explanation

Answer: See the detailed explanation.

Explanation: Tests your deep understanding of memory allocation strategies and system-level protection mechanisms.

1. kvmalloc()'s Fallback Mechanism (Tactical Adjustment):

  • Purpose: To ensure the success of a single allocation request.
  • Scenario: When a developer requests a large memory block (like tens of MB), physically contiguous memory might be insufficient. kvmalloc() intelligently downgrades from "high performance but demanding" (kmalloc/physically contiguous) to "lower performance but easier to succeed" (vmalloc/virtually contiguous). This is a mechanism that serves the caller, aiming to keep the program running.

2. OOM Killer Mechanism (Strategic Damage Control):

  • Purpose: To save the entire system from crashing.
  • Scenario: Triggered when system memory is almost exhausted, and even page reclaim, Slab shrinking, and other methods can't free up enough memory. It is an extreme measure that sacrifices processes (usually the ones consuming the most memory) to free up resources and prevent the system from freezing.

Fundamental Differences:

  • Fallback is an optimization/adaptive behavior designed to transparently handle memory fragmentation; it is friendly to user processes.
  • OOM Killer is a disaster recovery behavior, indicating the system is already on the edge of failure. It is destructive (it kills processes) and serves as the system's last line of defense.

When designing high-reliability modules, you should prioritize leveraging mechanisms like kvmalloc() to avoid allocation failures, while reasonably setting oom_score_adj to protect critical processes from being mistakenly killed by the OOM Killer.


Key Takeaways

When dealing with specific structures that are frequently allocated and freed, the generic kmalloc often presents performance bottlenecks and fragmentation issues. By creating a dedicated Slab cache via kmem_cache_create, you can manage objects efficiently, much like building a "private vault." Note that during creation, the kernel might allocate more space than your specified size due to alignment and metadata—this is the cost of "internal fragmentation." By utilizing flags like SLAB_POISON (poison mode) and SLAB_RED_ZONE (red zone), you can effectively catch memory out-of-bounds and uninitialized usage errors during the development phase.

When you need to allocate large blocks of memory (exceeding a few MB), kmalloc, which relies on physically contiguous memory, will often fail. In such cases, you should opt for vmalloc. The core logic of vmalloc is that it only guarantees virtual address continuity, while the underlying physical memory can be discrete, achieved by establishing page table mappings. However, it also has significant side effects: high allocation and access overhead, inability to be used directly for DMA, and it must never be called while holding a spinlock or in interrupt context because it might sleep.

For most allocation scenarios where you're unsure, or when you need to allocate "medium-sized" memory (between a few KB and a few MB), kvmalloc is the best "lazy" option. It intelligently tries the fast, physically contiguous kmalloc first, and automatically falls back to vmalloc if that fails. This strategy avoids the complexity of manual decision-making for developers while balancing the performance of small memory and the availability of large memory. To free it, simply use the matching kvfree.

Memory allocation isn't just code logic; it directly impacts system stability. When physical memory runs low, the kernel wakes the kswapd daemon thread for background reclaim. If even direct reclaim cannot free enough memory, the kernel triggers the OOM Killer (Out-Of-Memory Killer). This is a brutal mechanism designed to save the system by sacrificing some processes: it kills the process with the highest oom_score (usually the biggest memory consumer). In production environments, you can adjust the /proc/<pid>/oom_score_adj parameter (e.g., set it to -1000) to protect critical services (like sshd) from being mistakenly killed.

Overall, there is no master key for kernel memory allocation, only scenario-based trade-offs. By default, you should prefer the high-performance kmalloc; only consider the low-level page allocator when you need large, physically contiguous memory; only fall back to vmalloc when dealing with huge buffers that don't interact with hardware; and for high-frequency specific objects, custom Slab caches are the premier choice for performance optimization. Understanding the physical continuity, overhead costs, and reclaim mechanisms behind these APIs is key to writing robust kernel modules.