13.3 Memory Region (MR)
In the previous section, we got the Address Handle (AH) sorted out—like putting up road signs for data packets in the RDMA network. But that's not enough. Road signs only tell you which way to go; the vehicle (data) still needs somewhere to be loaded.
In a high-performance network like RDMA, we can't just toss data into any random memory. If your memory hasn't been officially certified, the NIC won't touch it.
This certification mechanism is the Memory Region (MR).
Why Must We Register Memory?
You might ask: I have a pointer, so why can't I just read and write directly? Why bother with this whole "registration" process?
This is the biggest difference between RDMA and standard socket programming.
With sockets, you copy data into a kernel buffer, and the kernel handles the rest. But with RDMA, we bypass the kernel and let the NIC (HCA) read and write your memory directly. This introduces a major problem: how do you know where your virtual address actually lives in physical memory?
To make matters worse, operating systems have paging mechanisms. Your memory could be swapped out to disk at any time. If the NIC is reading away and the system swaps out that memory, the NIC reads garbage or triggers an exception outright.
To solve this, RDMA introduced Memory Registration.
You can think of an MR as a "visa" for your memory.
But this visa is a bit special: it's not made of paper. It's more like putting a double-headed lock on your memory.
- One end locks down the virtual-to-physical address mapping (preventing the memory from being swapped out, i.e., Pinning).
- The other end generates two keys (lkey and rkey)—only those holding the keys (the local CPU or a remote NIC) can unlock and access the memory.
However, the "visa" analogy falls short in one respect: a visa usually only covers a single entry or exit, whereas MR registration is a persistent state. Once registered, the physical characteristics of that memory are "frozen" until you deregister it. Moreover, this isn't a free service—registering an MR is an expensive operation that requires involvement from the kernel and even the hardware.
What happens during registration? The kernel does four things for you:
- Split: Breaks the contiguous virtual address you provided into individual memory pages.
- Translate: Resolves the virtual-to-physical address mapping and hands this mapping table over to the NIC.
- Validate: Checks the permissions you requested (read-only? read-write?) to see if they are actually allowed for this memory.
- Pin: Pins these memory pages in place, forbidding them from being swapped out to the swap space. This guarantees the virtual-to-physical mapping will never change.
Only after all this does the memory truly become an MR.
Two Keys: lkey and rkey
Once registration succeeds, you get an MR structure, and the most important things inside it are two keys: Local Key (lkey) and Remote Key (rkey).
Returning to the "double-headed lock" analogy:
- lkey (Local Key): This is kept for your own use. When you (the CPU) fill in an address in a Work Request, you must present this key to tell the local NIC, "I've registered this memory, it's safe to read."
- rkey (Remote Key): This is meant for the remote machine. If you want a remote NIC to directly read or write your memory, you must give it the rkey. When the remote side sends an RDMA Read/Write request, it must carry this rkey, otherwise your NIC will reject the request outright.
Don't mix them up: local access uses lkey, remote access uses rkey. If you use rkey in a local operation, or vice versa, the NIC will mercilessly throw an error.
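To make the division concrete, here is a minimal sketch of how the two keys show up in a work request, using the modern ib_rdma_wr form (older kernels put these fields in a wr.rdma union inside ib_send_wr). The names dma_addr, len, remote_addr, and remote_rkey are placeholders for values you obtained elsewhere:

```c
/* Sketch: lkey authorizes the local buffer, rkey authorizes the
 * remote one, in an RDMA Write work request. */
struct ib_sge sge = {
        .addr   = dma_addr,          /* DMA address of the local buffer */
        .length = len,
        .lkey   = mr->lkey,          /* lkey: proves local registration */
};

struct ib_rdma_wr wr = {
        .wr = {
                .opcode     = IB_WR_RDMA_WRITE,
                .sg_list    = &sge,
                .num_sge    = 1,
                .send_flags = IB_SEND_SIGNALED,
        },
        .remote_addr = remote_addr,  /* peer's registered address */
        .rkey        = remote_rkey,  /* rkey: granted by the peer */
};
```

Note that remote_rkey is something the peer must have handed to you out of band (for example, inside an earlier Send); rkeys are never discovered automatically.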
⚠️ Note The same memory buffer can be registered multiple times, even with different permissions each time. But that doesn't mean you should do it carelessly. Registration has overhead, and registering the same memory multiple times is usually done when different logical modules (like different connections) need to isolate permissions—not just for convenience.
Core API Deep Dive
Let's break down the core functions for manipulating MRs in the kernel. This is where you're most likely to step on landmines in practice.
1. ib_get_dma_mr() — Get a General-Purpose Ticket
This is the simplest way to register.
struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int access_flags);
It returns an MR for system memory DMA. You need to pass in a Protection Domain (PD) and the access permissions you want.
This function is typically used for DMA regions that exist long-term, spanning the entire lifetime of the driver. It's straightforward and brute-force, getting a chunk of memory sorted out in one go.
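A minimal sketch of how it is typically called during driver initialization; pd is assumed to already exist:

```c
/* Sketch: a long-lived DMA MR, created once at driver init. */
struct ib_mr *mr;

mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                       IB_ACCESS_REMOTE_READ |
                       IB_ACCESS_REMOTE_WRITE);
if (IS_ERR(mr))
        return PTR_ERR(mr);

/* mr->lkey / mr->rkey can now be used in work requests
 * for the lifetime of the driver. */
```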
2. ib_dma_map_single() — Precise Mapping
If you just want to temporarily hand over a kernel virtual address allocated via kmalloc() to the NIC, this function is more appropriate.
dma_addr_t ib_dma_map_single(struct ib_device *dev, void *cpu_addr,
size_t size, enum dma_data_direction direction);
It maps the kernel virtual address into a DMA address. This DMA address is what the NIC can actually understand.
⚠️ Pitfall Warning After mapping, never forget to check for errors! Mapping can fail (though the probability is low), but if it does and you use the address directly afterward, you'll get a kernel panic.
if (ib_dma_mapping_error(dev, addr)) {
        /* Handle the error: return, log it, and do not proceed. */
}
When you're done with this memory, remember to always unmap it, otherwise the DMA mapping table will leak:
void ib_dma_unmap_single(struct ib_device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction direction);
Variants: The kernel also provides a set of similar functions for handling more complex scenarios:
- ib_dma_map_page(): Map a single page.
- ib_dma_map_single_attrs(): Map with attributes.
- ib_dma_map_sg(): Handle scatter/gather lists.
- ib_dma_map_sg_attrs(): Handle scatter/gather lists with attributes.
They all have corresponding unmap functions. Don't be lazy—pick the right one.
3. Synchronizing CPU and Device Views
Before accessing DMA-mapped memory, you must do one more thing: synchronization.
Why? Because the caches seen by the CPU and the NIC might be inconsistent.
- If the CPU writes to this memory and then wants the NIC to read it, you must flush the CPU cache to memory first.
- If the NIC has finished writing and the CPU wants to read it, you must invalidate the CPU cache to force it to re-read from memory.
The corresponding functions are:
- ib_dma_sync_single_for_cpu(): NIC -> CPU (the CPU is about to read what the NIC wrote).
- ib_dma_sync_single_for_device(): CPU -> NIC (the NIC is about to read what the CPU wrote).
⚠️ Note This step is very easy to forget. If you forget to synchronize, you might encounter bizarre bugs—the data you read is stale, or the data you wrote is completely invisible to the NIC. These bugs are often intermittent and extremely painful to debug.
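The two rules above can be sketched as follows; dma is the address returned by the earlier mapping call, and payload/process() are placeholders:

```c
/* Sketch: who syncs when. The direction must match the mapping. */

/* CPU wrote the buffer, NIC is about to read it (before a send): */
memcpy(buf, payload, len);
ib_dma_sync_single_for_device(dev, dma, len, DMA_TO_DEVICE);
/* ... post the send work request ... */

/* NIC wrote the buffer, CPU is about to read it (after a recv WC): */
ib_dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
process(buf, len);
```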
4. ib_dma_alloc_coherent() — The Hassle-Free Choice
If you don't want to deal with mapping and synchronization, the kernel provides an all-in-one solution:
void *ib_dma_alloc_coherent(struct ib_device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flag);
It allocates a block of memory that the CPU can access and that can also be directly used for DMA by the NIC.
- The returned pointer is for the CPU to use.
- The dma_handle pointer will be filled with the DMA address for the NIC to use.
This memory is "coherent," meaning it exists simultaneously in the view of both the CPU and the NIC, without the need for frequent synchronization.
To free it, use ib_dma_free_coherent().
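A short sketch of the coherent allocate/free pair; note that no sync calls appear anywhere in between:

```c
/* Sketch: coherent allocation — one buffer, two views, no syncing. */
dma_addr_t dma;
void *cpu_buf;

cpu_buf = ib_dma_alloc_coherent(dev, 4096, &dma, GFP_KERNEL);
if (!cpu_buf)
        return -ENOMEM;

/* cpu_buf is for CPU access; dma goes into the NIC's ib_sge.addr. */

ib_dma_free_coherent(dev, 4096, cpu_buf, dma);
```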
Advanced: Physical Memory Registration and Querying
Sometimes, what you have in hand are physical pages (for example, if you want to register a group of physical pages via ib_reg_phys_mr()):
struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
struct ib_phys_buf *phys_buf_list,
int num_phys_buf,
int access_flags,
u64 *iova_start);
This is typically used in scenarios with extreme memory management requirements. If you want to change the properties of an MR after registration (like its size or physical address), don't foolishly deregister and then re-register—use ib_rereg_phys_mr(), which can modify it in place and save overhead.
If you want to check the current state of an MR (like its size or permissions), use:
int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr);
⚠️ Note
Although the interface exists, many low-level drivers do not actually implement this function. Before calling it, confirm your hardware's support status, otherwise you'll get an ENOSYS error.
Finally, when everything is done, call ib_dereg_mr() to deregister the MR and unfreeze the memory.
Fast Memory Region (FMR) Pool — Built for Speed
As mentioned earlier, registering an MR is a "heavy" operation. It can be slow, and might even put the current process to sleep while waiting for resources.
Imagine this: you're in an interrupt handler, or in an atomic context where sleeping is not allowed, and you suddenly need to register a chunk of memory. Call a normal MR registration? That's a direct deadlock or panic.
This is where FMR (Fast Memory Region) comes in.
FMR allows you to set up a Pool. During idle time (like the initialization phase), you pre-register a batch of MRs in the pool.
When you need one, you simply grab one from the pool to use (lightweight registration). When you're done, you throw it back into the pool (deregistration).
This "grab" and "throw" process is very fast and will not sleep. This makes an FMR pool the go-to solution for dynamic, high-frequency memory registration scenarios (such as certain storage protocols).
The related APIs are defined in include/rdma/ib_fmr_pool.h.
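A sketch of the pool's lifecycle follows. The field and function names are from the legacy include/rdma/ib_fmr_pool.h interface (the FMR API has been removed from recent kernels in favor of FRWR); page_list, npages, and io_addr are placeholders:

```c
/* Sketch: pre-register at init, map/unmap on the fast path. */
struct ib_fmr_pool_param param = {
        .max_pages_per_fmr = 64,
        .page_shift        = PAGE_SHIFT,
        .access            = IB_ACCESS_LOCAL_WRITE |
                             IB_ACCESS_REMOTE_WRITE,
        .pool_size         = 32,   /* FMRs pre-registered up front */
        .dirty_watermark   = 8,    /* batch the deferred unmaps    */
        .cache             = 1,
};
struct ib_fmr_pool *pool = ib_create_fmr_pool(pd, &param);

/* Fast path — safe in contexts that must not sleep: */
struct ib_pool_fmr *fmr =
        ib_fmr_pool_map_phys(pool, page_list, npages, io_addr);

/* ... use fmr->fmr->lkey / fmr->fmr->rkey in work requests ... */

ib_fmr_pool_unmap(fmr);        /* back to the pool, also fast */

ib_destroy_fmr_pool(pool);     /* teardown at module exit */
```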
Memory Window (MW) — Flexible Access Control
Finally, let's look at a slightly more convoluted concept: Memory Window (MW).
Two Approaches to Access Control
If you want a remote machine to access your memory, there are usually two ways:
- Directly register an MR: Enable remote permissions at registration time (e.g., IB_ACCESS_REMOTE_WRITE).
- MR + MW: First register a normal MR, then bind an MW on top of it.
What problem does the second approach solve?
Suppose you have a block of memory, and you want Node A to access it for 5 seconds, then deny access, and later let Node B access it.
With approach 1, you'd have to constantly dereg_mr and reg_mr. Remember, registration is a heavy operation.
With approach 2 (MW), it's simple. The MR stays put (remaining registered), and you only need to manipulate the MW:
- Bind: Bind the MW to the MR, generating a new rkey. The remote side gets this rkey and can access the memory.
- Unbind: After unbinding, this rkey immediately becomes invalid. If the remote side tries to use this key to access the memory again, the NIC will reject it.
Binding and unbinding are lightweight operations (although they are essentially sending a special Work Request to the QP).
The MW Operation Trilogy
The kernel provides three functions to work with MWs:
- Allocate:

struct ib_mw *ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type);

You need a PD, and you must specify the MW type.
- Bind:

int ib_bind_mw(struct ib_qp *qp, struct ib_mw *mw, struct ib_mw_bind *mw_bind);

This is the crucial step. It throws a special WR into the QP's Send Queue (SQ):
- Specify the MR to bind to.
- Specify the address, size, and remote permissions.
- Once this operation completes, you get a Work Completion (WC) telling you whether the binding succeeded.
⚠️ Note If this MW was previously bound to an MR (whether the same one or a different one), this binding will automatically invalidate the previous binding. This is convenient, but also error-prone—if you assume the old binding is still active, it's actually already gone.
- Deallocate:

int ib_dealloc_mw(struct ib_mw *mw);

Free it when you're done—don't hog resources.
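The trilogy can be sketched end to end as follows. The struct layout matches the legacy ib_mw_bind interface (field names vary across kernel versions, and newer kernels bind type-2 MWs via a work request instead); MY_BIND_WR_ID is a hypothetical identifier you would choose yourself:

```c
/* Sketch: alloc -> bind over part of an existing MR -> hand out rkey. */
struct ib_mw *mw = ib_alloc_mw(pd, IB_MW_TYPE_1);

if (IS_ERR(mw))
        return PTR_ERR(mw);

struct ib_mw_bind bind = {
        .wr_id      = MY_BIND_WR_ID,       /* echoed in the WC      */
        .send_flags = IB_SEND_SIGNALED,
        .bind_info  = {
                .mr              = mr,     /* the underlying MR     */
                .addr            = dma_addr,   /* window start      */
                .length          = 4096,       /* window size       */
                .mw_access_flags = IB_ACCESS_REMOTE_WRITE,
        },
};

if (!ib_bind_mw(qp, mw, &bind))
        pr_info("window rkey for the peer: 0x%x\n", mw->rkey);

/* Later: rebind to revoke this rkey, or ib_dealloc_mw(mw) when done. */
```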
With this, we've finally cracked the hard nut of memory.
We have the AH (for directions) and the MR (for cargo). But this is still static.
The soul of RDMA lies in "motion." How does data fly out of the queue? How do you tell the NIC "send it now"? How does the NIC notify you when the work is done?
In the next section, we will meet the heart of RDMA: Queue Pair (QP). That is where all the action happens.