7.2 Memory Barriers in Practice: A Careful Conversation with Hardware

In the previous section we discussed how to manage reference counts safely. But kernel synchronization is about more than the lifecycle of data structures: there is another area that gives hardware driver developers headaches, namely communicating with the outside world.

By "outside world," we mean DMA controllers and network chips.

When you offload data movement to hardware, you are essentially collaborating with an entity that has a completely different "mindset." CPUs execute instructions out of order for performance, and compilers reorder code for efficiency, but the DMA controller on the other side of the bus couldn't care less about any of that. It follows its own strict rule: it reads descriptors as they sit in memory, on its own schedule, with no idea what order your instructions actually ran in.

If you assume that writing code sequentially means the hardware will receive it in that same order, you are gravely mistaken.

Here is a counterintuitive fact: the order you see in C code is not necessarily the order that occurs in memory.
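
Before we get to real code, here is a minimal sketch of the failure shape, with invented names (fake_desc and fill_desc are not kernel code). Nothing obliges the compiler or the CPU to commit these two plain stores in source order, so an observer on the bus, such as a DMA engine, may see them land flipped:

struct fake_desc {
        unsigned long addr;  /* where the data lives */
        unsigned int valid;  /* the "go" flag the device polls */
};

void fill_desc(struct fake_desc *d, unsigned long buf)
{
        d->addr = buf;   /* store 1: may reach memory second... */
        d->valid = 1;    /* store 2: ...or first; nothing here forbids it */
}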

To understand why we need to use a seemingly "low-level" mechanism to solve this problem, let's look at a real driver example.


Starting with a Network Driver

Take the Realtek RTL-8139C+ Fast Ethernet driver as an example (code located at drivers/net/ethernet/realtek/8139cp.c). To transmit a network packet, the driver must first set up a DMA descriptor, telling the hardware: "fetch data from here, here is the length, and here are the flags."

For this specific network chip, the DMA descriptor looks like this:

// drivers/net/ethernet/realtek/8139cp.c
struct cp_desc {
        __le32 opts1;
        __le32 opts2;
        __le64 addr;
};

These three fields are: option 1, option 2, and the data address. Before handing the data over to DMA, all three fields must be initialized.
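
One hedged aside before we continue: the __le32/__le64 types mark fields that the chip reads in little-endian byte order no matter what the CPU uses, so every store to them goes through the kernel's cpu_to_le32()/cpu_to_le64() helpers, as we will see in the code below. For example, assuming txd points at one of these descriptors:

txd->addr = cpu_to_le64(mapping);  /* byte-swapped on a big-endian CPU,
                                      a no-op on a little-endian one */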

Now, the question arises: why does the initialization order of these three fields matter so much?


Why Does Order Matter?

You can think of this DMA descriptor as a "shipping manifest."

  1. addr (Address): Tells the warehouse (DMA) which shelf to fetch the goods from.
  2. opts1 / opts2 (Options): Tells the warehouse how to handle this batch of goods, such as "expedite," "valid," or "this is the final order."

If you are the warehouse clerk (CPU) responsible for filling out the form, you might casually write the "handling method" first, then write the "shelf number."

But in the real world, the DMA controller "mover" is extremely rigid. It might glance at the manifest in memory from time to time. The moment it sees the "manifest valid" flag set to 1, it will immediately grab its shovel and get to work—completely regardless of whether you've finished writing down the shelf number or not.

If, from the DMA controller's perspective, it first sees the "valid" flag in opts1, and immediately goes to read addr, but due to CPU out-of-order execution or cache incoherence, it reads a stale address or 0...

The result is: the hardware starts moving the wrong memory, or directly triggers a catastrophic bus error.

So here is an ironclad rule: you must write all the preparatory fields first, and only "give the final stamp of approval" (set the valid flag) at the very end.
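
Concretely, here is the shape of the bug, as a hypothetical sketch (txd is assumed to point at a cp_desc in the transmit ring; the correct sequence, from the real driver, appears in the next section):

/* WRONG ordering (hypothetical): the "valid" flag may land first */
txd->opts1 = cpu_to_le32(opts1);    /* the device may pounce right here... */
txd->opts2 = opts2;
txd->addr = cpu_to_le64(mapping);   /* ...before it can see the real address */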


wmb(): The Impenetrable Wall

To prevent this kind of tragedy, the kernel's DMA mapping guidelines explicitly require: when writing DMA descriptors, you must guarantee the memory write order.

This is where wmb() (write memory barrier) comes in. Its job is to drive a stake into the code, telling both the compiler and the CPU: "every write before this point must be committed to memory (and thus visible to other observers, including devices) before any write after this point becomes visible."

Returning to our Realtek driver, let's see how it transmits a packet (the cp_start_xmit function).

First, prepare the data:

len = skb->len;
mapping = dma_map_single(&cp->pdev->dev, skb->data, len,
                         PCI_DMA_TODEVICE);

mapping is the DMA (bus) address the device will use: the "shelf number" we need to tell the hardware about.
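
A hedged aside: dma_map_single() can fail, and the DMA API's dma_mapping_error() is how a driver checks. A sketch of the error path, with the recovery action assumed rather than quoted from the driver:

if (dma_mapping_error(&cp->pdev->dev, mapping)) {
        kfree_skb(skb);        /* drop the packet rather than DMA garbage */
        return NETDEV_TX_OK;   /* assumed: tell the stack it was handled */
}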

Next comes the critical part, setting up the descriptor:

struct cp_desc *txd;

/* [... some code omitted ...] */

// Step 1: set opts2 and the address
txd->opts2 = opts2;
txd->addr = cpu_to_le64(mapping);

// [Key point 1] Barrier!
// Make sure opts2 and addr have safely reached memory; they must never
// be reordered past this point
wmb();

// Step 2: assemble the final control word (length, first/last-fragment
// flags, and so on)
opts1 |= eor | len | FirstFrag | LastFrag;

// Step 3: write opts1 (it contains the flag that tells the hardware to
// start working)
txd->opts1 = cpu_to_le32(opts1);

// [Key point 2] One more barrier!
// Make sure the write to opts1 (especially that "valid" bit) becomes
// visible to the hardware right away
wmb();

Do you see what's happening here?

  1. The first wmb(): Protects the data dependency. It ensures that the two "preparatory" fields, opts2 and addr, land in memory first. Without this barrier, the CPU or compiler might decide "it's faster to compute opts1 first," causing the flag to be written before the address.
  2. The second wmb(): Ensures the final command is ordered before whatever comes next, such as the register write that later kicks the hardware. The device can only ever see this descriptor become valid after it is fully populated, and then the DMA transfer starts.

It's just like filling out a manifest: fill in all the details first, double-check them, and only then check the "ship?" box. The wmb() in the middle is the pen in your hand, forcing you to follow the order and not allowing you to check the box early just because you have fast hands.


The Myth of volatile (FAQ)

At this point, people often ask: "Since the volatile keyword can also prevent compiler optimizations, why not use it to guarantee ordering?"

This is a very classic misconception.

volatile does indeed tell the compiler: "don't mess with reads and writes to this variable; always faithfully fetch the value from memory." This is useful when operating MMIO (Memory-Mapped I/O).

But in the concurrent world, volatile has two fatal flaws:

  1. It does not guarantee atomicity: if two threads run i++ on the same volatile int i at the same time, updates will still be lost, because i++ is a read-modify-write sequence of three steps, and volatile has no jurisdiction at that level.
  2. It does not act as a memory barrier: the C standard only forbids the compiler from reordering volatile accesses relative to each other; it says nothing about ordinary variables, and nothing about what the CPU does. In the txd->addr = ...; txd->opts1 = ...; example, even if opts1 were volatile, the assignment to addr could still be moved after it, because the two simply aren't in the same camp (see the sketch below).
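
A hedged sketch of flaw 2, with invented names (this shows the failure shape, not the driver's actual code):

struct volatile_desc {
        unsigned long addr;           /* plain field: no ordering promise */
        volatile unsigned int opts1;  /* ordered only against other volatiles */
};

void fill_no_barrier(struct volatile_desc *d, unsigned long a)
{
        d->addr = a;    /* the compiler may legally sink this store... */
        d->opts1 = 1;   /* ...below this one; and even if it doesn't,
                           the CPU is still free to swap the two stores */
}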

So, don't count on volatile to solve synchronization problems. In the kernel we use locks, atomic operations, or, as shown above, dutifully placed memory barriers.
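
For flaw 1, for instance, the kernel's answer is an atomic operation rather than volatile. A minimal sketch using the real atomic_t API:

#include <linux/atomic.h>

static atomic_t counter = ATOMIC_INIT(0);

static void bump(void)
{
        atomic_inc(&counter);  /* one indivisible read-modify-write,
                                  unlike a volatile i++ */
}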


Beyond Locks

Honestly, as a driver developer, you don't need to append wmb() after every line of code. Most of the heavy lifting is already done for you behind the scenes by the kernel's lock APIs and primitives (like RCU).

However, when your code starts crossing that boundary—from pure software memory operations to telling hardware to do work (such as setting up DMA descriptors or triggering register commands)—you must be on high alert.

At this boundary, the CPU's usual excuse ("I can reorder freely because the program can't tell the difference") stops holding: the device can tell. Hardware is honest and rigid; it interprets your memory layout literally. At this point, explicitly adding a memory barrier is a sign of respect for the hardware, and a commitment to system stability.

The picture is clear at this point: atomic_t protects the value of data, while memory barriers protect the ordering of operations on it. Both are indispensable.


⚠️ Pitfall Warning

Never take shortcuts in the middle of DMA descriptor initialization.

You might think: "Oh, the docs say opts2 isn't important, so I'll just skip it," or "Let me just remove the wmb(); I've run it a thousand times without a crash."

Trust me, you might get lucky on x86 without issues (x86 has a strong memory model, and the hardware itself guarantees quite a bit of ordering). But once you port the code to ARM or PowerPC, or switch to a more particular network card, you will be rewarded with extremely hard-to-reproduce bugs—maybe packets going to the wrong places, maybe a kernel panic, and these bugs tend to strike at 3 AM when the load is at its highest.