13.8 Cheat Sheet

By now, we've dissected most of the bones in the RDMA stack. You should have a lot of loose parts on hand: ib_client, PD, QP, CQ, MR...

Before closing this book (or before you start writing your own driver), you need a blueprint. That's the whole point of this section—it's not here to teach you anything new, but to gather the APIs scattered across the previous hundreds of pages and organize them by function.

You'll find that when you're staring at a blank .c file, what you need isn't a lengthy philosophical discourse, but that damn function signature and what parameters it actually takes.

Below is a cheat sheet for the core APIs of the kernel RDMA subsystem. Every function is worth an extra two seconds of your attention in the grep results.


Client and Device Management

Everything starts with registration.

int ib_register_client(struct ib_client *client);

Registers a client with the kernel RDMA stack. This is how you tell the kernel "I'm here." On successful registration, your add callback is invoked once for every RDMA device already present in the system, and again whenever a new device appears later.

void ib_unregister_client(struct ib_client *client);

Unregisters. Tells the kernel "I'm out" and that you no longer care about device events.

void ib_set_client_data(struct ib_device *device, struct ib_client *client, void *data);
void *ib_get_client_data(struct ib_device *device, struct ib_client *client);

Each device can bind a private data pointer (context) for each client. Typically, you set this in the add callback and get it in subsequent operations.
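Putting the three together, a minimal client skeleton might look like this. This is only a sketch: my_client, my_dev_ctx, and the callback names are hypothetical, and the callback signatures match the older kernels this chapter is based on.

static struct ib_client my_client;   /* forward declaration */

struct my_dev_ctx {
        struct ib_device *device;
        /* ... per-device state ... */
};

static void my_add_one(struct ib_device *device)
{
        struct my_dev_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

        if (!ctx)
                return;
        ctx->device = device;
        /* Stash our context; fetched back in every later callback. */
        ib_set_client_data(device, &my_client, ctx);
}

static void my_remove_one(struct ib_device *device)
{
        kfree(ib_get_client_data(device, &my_client));
}

static struct ib_client my_client = {
        .name   = "my_client",
        .add    = my_add_one,
        .remove = my_remove_one,
};

/* In module init: */
ret = ib_register_client(&my_client);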


Event Handling

RDMA devices are asynchronous; you need to listen for their "screams."

int ib_register_event_handler(struct ib_event_handler *event_handler);
int ib_unregister_event_handler(struct ib_event_handler *event_handler);

Registers/unregisters an event handler. When an asynchronous event occurs on a device (such as a port state going down or a device hot-unplug), your registered callback is triggered.
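A registration sketch, using the INIT_IB_EVENT_HANDLER helper macro from ib_verbs.h; my_handler and my_event_cb are hypothetical names:

static struct ib_event_handler my_handler;

static void my_event_cb(struct ib_event_handler *handler,
                        struct ib_event *event)
{
        /* event->element identifies the port/CQ/QP the event refers to. */
        pr_info("async event %d on %s\n", event->event, event->device->name);
}

/* Typically done in the client's add callback: */
INIT_IB_EVENT_HANDLER(&my_handler, device, my_event_cb);
ib_register_event_handler(&my_handler);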


Device and Port Query

Before operating, figure out who you're dealing with.

int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr);

Queries device attributes. This tells you what the device supports (maximum MR size, maximum QP count, whether it supports atomic operations, etc.). Don't guess capabilities; query them.

int ib_query_port(struct ib_device *device, u8 port_num, struct ib_port_attr *port_attr);

Queries the state of a specified port (rate, link layer state, physical state).

enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_num);

What is running underneath the port? InfiniBand, Ethernet, or something else? This function gives you the answer. This is crucial for deciding whether to run RoCE or native IB at the upper layer.
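A typical probe sequence, sketched under the assumption that we only care about port 1 (port numbers start at 1, not 0):

struct ib_device_attr dev_attr;
struct ib_port_attr port_attr;
int ret;

ret = ib_query_device(device, &dev_attr);
if (ret)
        return ret;
pr_info("max_qp=%d, max_mr_size=%llu\n", dev_attr.max_qp,
        (unsigned long long)dev_attr.max_mr_size);

ret = ib_query_port(device, 1, &port_attr);
if (ret)
        return ret;
if (port_attr.state != IB_PORT_ACTIVE)
        pr_warn("port 1 is not active yet\n");

if (rdma_port_get_link_layer(device, 1) == IB_LINK_LAYER_ETHERNET)
        pr_info("this is a RoCE port\n");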


Address Query

The RDMA address hierarchy is complex: GID, P_Key, LID.

int ib_query_gid(struct ib_device *device, u8 port_num, int index, union ib_gid *gid);

Queries the GID at a specified index in the port's GID table.

int ib_query_pkey(struct ib_device *device, u8 port_num, u16 index, u16 *pkey);

Queries the partition key at a specified index in the port's P_Key table.

int ib_find_gid(struct ib_device *device, union ib_gid *gid, u8 *port_num, u16 *index);
int ib_find_pkey(struct ib_device *device, u8 port_num, u16 pkey, u16 *index);

Reverse query: given a GID or P_Key, find the port number and table index where it resides.
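For example, fetching the defaults at index 0 (a sketch, again assuming port 1):

union ib_gid gid;
u16 pkey;

/* Index 0 holds the port's default GID, derived from the port GUID. */
if (ib_query_gid(device, 1, 0, &gid))
        return -EIO;

/* Index 0 of the P_Key table usually holds the default key 0xffff. */
if (ib_query_pkey(device, 1, 0, &pkey))
        return -EIO;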


Protection Domain (PD)

The container for all resources. Almost all of your operations start with allocating a PD.

struct ib_pd *ib_alloc_pd(struct ib_device *device);

Allocates a PD. This is a prerequisite for allocating QPs and MRs.

int ib_dealloc_pd(struct ib_pd *pd);

Destroys a PD. Note: Before destruction, you must ensure that all QPs and MRs dependent on it have already been destroyed; otherwise, it will return -EBUSY.
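Note that ib_alloc_pd reports failure through ERR_PTR, not NULL, so check it accordingly:

struct ib_pd *pd;

pd = ib_alloc_pd(device);
if (IS_ERR(pd))
        return PTR_ERR(pd);

/* ... allocate MRs, CQs, and QPs under this PD ... */

ib_dealloc_pd(pd);   /* only after every dependent resource is gone */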


Address Handle (AH)

Used for UD (Unreliable Datagram) QPs. Since these are datagrams, you need to tell the hardware "where to send this packet."

struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr);

Creates an AH. The ah_attr is filled with the destination LID, GID, service level (SL), and other path information.

int ib_init_ah_from_wc(struct ib_device *device, u8 port_num, struct ib_wc *wc, struct ib_grh *grh, struct ib_ah_attr *ah_attr);
struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, struct ib_grh *grh, u8 port_num);

"Reverse-engineering the address from a received packet." If you receive a UD message and want to reply to it immediately, these two functions help you extract the correct AH attributes from the Work Completion (WC) and GRH, saving you from looking up the routing table yourself.

int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
int ib_destroy_ah(struct ib_ah *ah);

Standard modify, query, and destroy operations.


Memory Region (MR) and DMA

This is the most complex part. You need to map kernel virtual addresses into DMA addresses that the hardware understands.

DMA Mapping Operations

Low-level DMA interfaces for handling memory coherency.

static inline int ib_dma_mapping_error(struct ib_device *dev, u64 dma_addr);

Never skip this step. After every ib_dma_map_xxx, you must call this function to check if the returned address is valid. Hardware mapping can fail.

static inline u64 ib_dma_map_single(struct ib_device *dev, void *cpu_addr, size_t size, enum dma_data_direction direction);
static inline void ib_dma_unmap_single(struct ib_device *dev, u64 addr, size_t size, enum dma_data_direction direction);

The simplest mapping: maps a kernel virtual address (typically one that came from kmalloc) to a DMA address.
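The canonical map-check-use-unmap cycle, as a sketch:

void *buf = kmalloc(4096, GFP_KERNEL);
u64 dma_addr;

if (!buf)
        return -ENOMEM;

dma_addr = ib_dma_map_single(device, buf, 4096, DMA_TO_DEVICE);
if (ib_dma_mapping_error(device, dma_addr)) {
        kfree(buf);
        return -ENOMEM;
}

/* ... dma_addr becomes the sge.addr of a send work request ... */

ib_dma_unmap_single(device, dma_addr, 4096, DMA_TO_DEVICE);
kfree(buf);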

static inline u64 ib_dma_map_page(struct ib_device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction direction);
static inline void ib_dma_unmap_page(struct ib_device *dev, u64 addr, size_t size, enum dma_data_direction direction);

Mapping based on struct page. Use this if you are dealing with page-level data.

static inline int ib_dma_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction);
static inline void ib_dma_unmap_sg(struct ib_device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction);

Scatter/gather mapping. Use this when handling non-contiguous physical memory blocks.

DMA Synchronization

If your CPU and NIC are accessing this memory simultaneously, you need to coordinate ownership.

static inline void ib_dma_sync_single_for_cpu(struct ib_device *dev, u64 addr, size_t size, enum dma_data_direction dir);
static inline void ib_dma_sync_single_for_device(struct ib_device *dev, u64 addr, size_t size, enum dma_data_direction dir);

Call for_device before handing the buffer to the hardware, and for_cpu before the CPU reads it. Skipping this step can lead to data corruption on certain architectures.
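For a receive buffer that bounces between the NIC and the CPU, the handoff looks like this (process_data is a hypothetical consumer):

/* The NIC has finished writing (receive completed): take ownership. */
ib_dma_sync_single_for_cpu(device, dma_addr, len, DMA_FROM_DEVICE);

process_data(buf, len);   /* the CPU may now read the buffer safely */

/* Hand the buffer back to the NIC for the next receive. */
ib_dma_sync_single_for_device(device, dma_addr, len, DMA_FROM_DEVICE);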

Coherent Memory

static inline void *ib_dma_alloc_coherent(struct ib_device *dev, size_t size, u64 *dma_handle, gfp_t flag);
static inline void ib_dma_free_coherent(struct ib_device *dev, size_t size, void *cpu_addr, u64 dma_handle);

Allocates a block of memory accessed by both the CPU and the device. This memory is coherently mapped and doesn't require frequent syncs, but allocation is more expensive. Suitable for control structures (like shared Ring Buffers).
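A sketch for a one-page control ring:

u64 dma_handle;
void *ring;

ring = ib_dma_alloc_coherent(device, PAGE_SIZE, &dma_handle, GFP_KERNEL);
if (!ring)
        return -ENOMEM;

/* CPU stores and device reads stay coherent: no sync calls needed. */

ib_dma_free_coherent(device, PAGE_SIZE, ring, dma_handle);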

Memory Registration (MR)

Handing memory over to the RDMA device for management.

struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);

This is a "clever shortcut" function. It directly returns an MR that covers the entire system address space (based on DMA addresses). Simple and brute-force, but in certain scenarios with low security requirements, it saves you from tedious registration steps.

struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, struct ib_phys_buf *phys_buf_array, int num_phys_buf, int mr_access_flags, u64 *iova_start);

The standard approach. Registers an MR based on an array of physical pages. You need to prepare the physical page list.

int ib_rereg_phys_mr(struct ib_mr *mr, int mr_rereg_mask, ...);

This is a very useful performance optimization point. When you need to change the size or permissions of an MR, the traditional approach is to destroy the old one and register a new one (which is very slow). rereg allows you to hot-modify its attributes without destroying the MR, avoiding the need to re-establish mappings.

int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr);
int ib_dereg_mr(struct ib_mr *mr);

Query attributes and deregister.


Memory Window (MW)

A dynamic, temporary remote access authorization mechanism.

struct ib_mw *ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type);

Allocates an MW.

static inline int ib_bind_mw(struct ib_qp *qp, struct ib_mw *mw, struct ib_mw_bind *mw_bind);

Binds an MW to an MR and specifies the remote access permissions for this binding (such as read-only, read-write). Once bound, remote nodes holding this window's key can access the corresponding memory segment. Upon unbinding, permissions immediately become invalid.

int ib_dealloc_mw(struct ib_mw *mw);

Deallocates an MW.


Completion Queue (CQ)

The end of the line for producers, the starting point for consumers.

struct ib_cq *ib_create_cq(struct ib_device *device,
ib_comp_handler comp_handler,
void (*event_handler)(struct ib_event *, void *),
void *cq_context,
int cqe,
int comp_vector);

Creates a CQ.

  • comp_handler: A callback invoked by the driver (typically in interrupt context) when a new WC lands on an armed CQ; you arm the CQ with ib_req_notify_cq.
  • event_handler: Handles CQ-related asynchronous events (such as CQ overflow).
  • cqe: Queue depth.
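Putting the parameters together, a creation sketch; my_comp_handler, my_wq, my_poll_work, and my_ctx are hypothetical, and the depth of 256 and completion vector 0 are arbitrary choices:

static void my_comp_handler(struct ib_cq *cq, void *cq_context)
{
        /* Interrupt context: defer the actual polling to a workqueue. */
        queue_work(my_wq, &my_poll_work);
}

cq = ib_create_cq(device, my_comp_handler, NULL, my_ctx, 256, 0);
if (IS_ERR(cq))
        return PTR_ERR(cq);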

int ib_resize_cq(struct ib_cq *cq, int cqe);

Dynamically resizes the CQ. Useful if your application's workload suddenly changes.

int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);

Key to performance tuning. Sets the CQ's interrupt moderation parameters. If you generate an interrupt for every single packet, your CPU will die of exhaustion. Setting reasonable cq_count and cq_period values allows the device to notify the kernel only after accumulating a certain number of WCs.

int ib_peek_cq(struct ib_cq *cq, int wc_cnt);

Take a sneak peek. Non-blockingly checks if the CQ has at least wc_cnt WCs.

static inline int ib_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);

Arms the CQ: tells the device "wake me next time." Usually called before leaving a poll loop. If set to IB_CQ_NEXT_COMP, the next arriving WC will trigger the completion handler.

static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);

The real workhorse function. Pulls WCs out of the CQ.
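The drain-and-re-arm pattern below avoids the classic race where a WC slips in between your last poll and the re-arm. A sketch:

struct ib_wc wc;

do {
        while (ib_poll_cq(cq, 1, &wc) > 0) {
                if (wc.status != IB_WC_SUCCESS)
                        pr_err("WC failed: status %d\n", wc.status);
                /* ... dispatch on wc.opcode and wc.wr_id ... */
        }
        /* Re-arm. IB_CQ_REPORT_MISSED_EVENTS makes the call return > 0
         * if a WC arrived after our last poll, so we loop and drain again. */
} while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
                              IB_CQ_REPORT_MISSED_EVENTS) > 0);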


Shared Receive Queue (SRQ)

struct ib_srq *ib_create_srq(struct ib_pd *pd, struct ib_srq_init_attr *srq_init_attr);
int ib_modify_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr, enum ib_srq_attr_mask srq_attr_mask);
int ib_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr);
int ib_destroy_srq(struct ib_srq *srq);

Creates, modifies, queries, and destroys an SRQ. The logic is similar to a regular QP, but it is passive and only handles receiving.
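A creation sketch (the depth of 1024 is an arbitrary example):

struct ib_srq_init_attr srq_attr = {
        .attr = {
                .max_wr    = 1024,   /* shared receive depth */
                .max_sge   = 1,
                .srq_limit = 0,      /* 0 = no low-watermark event */
        },
};
struct ib_srq *srq;

srq = ib_create_srq(pd, &srq_attr);
if (IS_ERR(srq))
        return PTR_ERR(srq);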


Queue Pair (QP)

This is the heart of the entire system.

struct ib_qp *ib_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr);

Creates a QP. You need to specify here whether it's RC, UC, or UD, the SQ/RQ depths, and which CQ and SRQ to use.

int ib_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask);

This is the function that will make you tear your hair out. It's responsible for moving the QP from the RESET state all the way to the RTS state. You need to carefully fill in qp_attr according to the state machine. Miss a single mask bit, or request an illegal transition (like jumping to RTS before you've reached RTR), and the hardware will flat-out reject your request.
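For example, the very first transition, RESET to INIT for an RC QP, requires exactly these four mask bits (a sketch; the later INIT-to-RTR and RTR-to-RTS transitions each demand their own attribute set):

struct ib_qp_attr attr = {
        .qp_state        = IB_QPS_INIT,
        .pkey_index      = 0,
        .port_num        = 1,
        .qp_access_flags = IB_ACCESS_REMOTE_WRITE,
};
int ret;

ret = ib_modify_qp(qp, &attr,
                   IB_QP_STATE | IB_QP_PKEY_INDEX |
                   IB_QP_PORT  | IB_QP_ACCESS_FLAGS);
if (ret)
        pr_err("RESET->INIT failed: %d\n", ret);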

int ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);

Queries the current QP attributes. Extremely useful for debugging to see what mess you've actually turned it into.

int ib_destroy_qp(struct ib_qp *qp);

Destroys a QP.


Data Posting

Finally, send the requests out.

static inline int ib_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr);
static inline int ib_post_recv(struct ib_qp *qp, struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr);

Pay attention to that bad_xxx_wr pointer. If ib_post_send returns an error, the return value alone won't tell you which WR failed; the kernel stores a pointer to the offending WR in bad_send_wr. You must check this pointer to know which link in the chain broke. In a batch post, the WRs ahead of it may already have been queued, while it and everything after it were rejected.
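A minimal signaled send with the bad_wr check, as a sketch; dma_addr, len, mr, and qp are assumed to be set up already, and the wr_id value is an arbitrary cookie:

struct ib_sge sge = {
        .addr   = dma_addr,
        .length = len,
        .lkey   = mr->lkey,
};
struct ib_send_wr wr = {
        .wr_id      = 0x1234,            /* echoed back in the matching WC */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IB_WR_SEND,
        .send_flags = IB_SEND_SIGNALED,  /* ask for a WC on completion */
};
struct ib_send_wr *bad_wr;
int ret;

ret = ib_post_send(qp, &wr, &bad_wr);
if (ret)
        pr_err("post_send failed at wr_id %llu\n",
               (unsigned long long)bad_wr->wr_id);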

static inline int ib_post_srq_recv(struct ib_srq *srq, struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr);

Posts receive requests to an SRQ.


Multicast

int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);

Joins or leaves a multicast group with a UD QP. Note that the QP only starts receiving multicast traffic once it has been moved out of RESET into a receive-ready state (RTR or beyond).


Chapter Echoes

At this point, we have completed a full deconstruction of the Linux kernel's InfiniBand/RDMA subsystem.

Think back to the starting point of this chapter: we were faced with a mysterious black box that promised "zero-copy" and "kernel bypass." To open this box, we had to introduce a whole slew of new vocabulary—Verbs, QP, CQ, MR, PD...

The cheat sheet you now hold in your hands is essentially the key list for opening this black box.

Behind every API lies a set of register operations on the hardware or a piece of firmware logic. When you call ib_post_send, you are actually writing a descriptor into the NIC's send queue ring buffer; when you call ib_req_notify_cq, you are telling the interrupt controller "you can start bothering me now."

The RDMA learning curve is indeed steep, not just because of the inherent complexity of the technology, but because it challenges our intuition about traditional network programming—we're used to the kernel buffering, retransmitting, and queuing for us, while RDMA requires you to take over all of this yourself.

But as we hinted at the beginning of this chapter, the complexity exacts a price and pays a dividend. When you truly master this mechanism, you can move data at microsecond-level latencies, something the traditional TCP/IP stack can never achieve.

In the next chapter, we will leave this high-speed physical wire and turn our gaze toward more macroscopic system orchestration—network namespaces and the Bluetooth subsystem. A completely different philosophy of connection awaits us there.


Exercises

Exercise 1: Understanding

Question: In the InfiniBand network architecture, GUID, GID, and LID represent different address identifiers. Briefly describe the main differences between the three: How is the LID generated? In what scenarios is the GID primarily used for packet routing?

Answer and Analysis

Answer:

  1. LID (Local IDentifier): a 16-bit address assigned by the Subnet Manager (SM), used for routing and forwarding within a subnet.
  2. GID (Global IDentifier): a 128-bit identifier generated from the port GUID and the Subnet ID, used primarily for cross-subnet routing and for multicast packets (carried in the GRH header).
  3. Difference: the LID is a locally assigned short address used for efficient forwarding within a subnet; the GID is a globally unique long address, similar in format to IPv6.

Analysis: Tests the understanding of the InfiniBand address hierarchy. As described in the text, the LID is assigned by the SM and used for switch forwarding table lookups within a subnet; the GID is generated based on the GUID, and when crossing subnets (using routers) or in multicast communication, the packet must include a GRH (Global Routing Header), which uses the GID for addressing.

Exercise 2: Understanding

Question: When using RDMA technology, why must memory buffers be "registered"? What two critical keys are generated after registration, and what are their respective purposes?

Answer and Analysis

Answer: Registration is required to ensure that physical address mappings are fixed (preventing them from being swapped out) and to set hardware access permissions. The two generated keys are:

  1. lkey (Local Key): Used by local Work Requests to access local memory.
  2. rkey (Remote Key): Provided to remote machines, used by remote RDMA operations (Read/Write) to access this memory.

Analysis: Tests the understanding of the Memory Region (MR) concept. The registration process maps virtual memory to physical memory and pins it. The lkey and rkey are credential tokens for hardware DMA access to memory, ensuring that only requests holding the correct key can access the memory region.

Exercise 3: Application

Question: Suppose you are developing a network module for a high-performance distributed database. You've chosen an InfiniBand network and want to avoid CPU handling of network packets in the main flow to reduce latency. Combining the advantages of RDMA, explain which RDMA operation you would choose to directly write local modification logs into the standby node's memory without requiring the standby node's CPU involvement? Why?

Answer and Analysis

Answer: Choose the RDMA Write operation. Reason: RDMA Write allows the local node to directly write data into the remote node's memory without any intervention from the remote CPU (no need for the remote node to execute a receive call). This fully aligns with the Kernel Bypass and CPU Offload characteristics, drastically reducing the latency of the copy operation and imposing near-zero load on the standby node's CPU.

Analysis: This is an application question testing the practical use of RDMA operation types. A Send operation typically requires the remote node to post a receive WR first, which consumes CPU. RDMA Read requires the remote node to expose memory but is initiated as a pull from the local side; while it also doesn't consume remote CPU, it's usually used for pulling data. For a "write" scenario (like log synchronization), RDMA Write is the most direct and efficient choice, achieving true zero-copy and zero-intervention.

Exercise 4: Application

Question: When designing a highly concurrent RDMA server application, you find that creating an independent QP and its corresponding RQ (Receive Queue) for each connection consumes a massive amount of memory, because you need to reserve a large number of WQEs (Work Queue Entries) for each RQ to avoid RNR (Receiver Not Ready) errors. Which kernel mechanism would you introduce to optimize the receiver's memory consumption and scalability? Briefly describe how it works.

Answer and Analysis

Answer: Introduce the SRQ (Shared Receive Queue). Principle: An SRQ allows multiple QPs to share a single receive queue. The application layer only needs to fill the SRQ with enough Receive WQEs, and all associated QPs can consume receive requests from it. This avoids reserving a large buffer for each QP individually, thereby drastically reducing memory usage and improving scalability.

Analysis: Tests the analysis of SRQ use cases. RNR errors occur because the receive queue has no WQEs ready. When the number of connections is huge, reserving a queue of sufficient depth for each connection's RQ leads to a memory explosion. The SRQ pools receive resources and is the standard solution for solving RDMA receiver-side scalability issues.

Exercise 5: Thinking

Question: RDMA provides the Kernel Bypass feature, allowing user-space applications to send and receive data directly through the HCA (NIC), bypassing the kernel network stack to reduce latency. However, the Linux kernel still maintains the InfiniBand subsystem (such as drivers/infiniband/core). Think about this: if user space directly operates the hardware, why is the kernel's RDMA subsystem still needed? What indispensable roles does the kernel subsystem play in the RDMA ecosystem? (List 2-3 points)

Answer and Analysis

Answer: Although the data plane is bypassed, the kernel RDMA subsystem is still necessary. Its main responsibilities include:

  1. Resource Allocation and Security Management: Creating resource objects like PDs, MRs, CQs, and QPs, and ensuring memory and access isolation between different processes or tenants through the lkey/rkey and P_Key mechanisms (preventing unauthorized DMA access).
  2. Hardware Initialization and Configuration: Responsible for loading drivers, initializing HCA hardware, and interacting with the Subnet Manager (SM) (such as obtaining the LID, configuring routing tables).
  3. Control Plane and Multiplexing: Although a single application can bypass the kernel, the operating system needs to coordinate multiple applications' usage of the same NIC and handle asynchronous events (like NIC hot-plugs, link state changes).
  4. Providing Non-Verbs Protocol Support: The implementation of upper-layer protocols (ULPs) such as IPoIB and iSER still requires kernel stack support.

Analysis: This is a deep-thinking question testing the understanding of the RDMA software-hardware interaction boundary. Kernel Bypass is primarily for zero-copy and low latency on the data plane, but "control" still requires operating system intervention. Bare-metal applications directly writing to hardware registers is unrealistic and insecure; the OS must abstract the hardware, manage global resources (like address mappings), and guarantee system security through the kernel subsystem.


Key Takeaways

The core value of RDMA (Remote Direct Memory Access) lies in bypassing the kernel and CPU to enable direct data interaction between the NIC and user-space memory. This mechanism eliminates the heavy context switches, kernel copies, and interrupt handling overhead of the traditional TCP/IP stack, achieving ultra-low latency in the microsecond range and extremely high bandwidth utilization. To achieve this, RDMA relies on intelligent hardware like HCAs (Host Channel Adapters) to offload transport protocol processing and memory management tasks from the CPU to the NIC, completely freeing up compute resources.

The RDMA software stack is built upon the unified Verbs API, abstracting away the differences of underlying physical media such as InfiniBand, RoCE, or iWARP. In the Linux kernel, this subsystem resides in the drivers/infiniband directory, where the core layer handles the logical implementation, the hardware layer handles driver adaptation, and upper-layer protocols (like IPoIB, iSER) provide specific business support. Developers can operate the hardware simply by calling the ib_* family of functions, without needing to care whether the underlying link layer is fiber or Ethernet.

Memory registration is the cornerstone of RDMA security and performance; its role is to "pin" virtual memory to physical memory and establish mappings. Through the MR (Memory Region) mechanism, the system generates a local key and a remote key for the memory region. The NIC can directly access authorized memory using only these keys, without CPU intervention. Additionally, the PD (Protection Domain) provides a resource isolation sandbox, ensuring that resources (like QPs, MRs) within different security domains cannot be mixed, preventing unauthorized data access.

The core carrier for data transmission is the QP (Queue Pair), which consists of two independent unidirectional channels: the Send Queue (SQ) and the Receive Queue (RQ). QPs support multiple transport service types: RC provides reliable connections and retransmission, UC pursues speed but allows packet loss, and UD is similar to connectionless UDP. To handle high-concurrency scenarios, RDMA also introduces the SRQ (Shared Receive Queue), allowing multiple QPs to share a single receive buffer pool, which drastically saves memory resources and improves scalability.

Since all operations are executed asynchronously, RDMA introduces the CQ (Completion Queue) as a feedback mechanism for processing results. After an application submits a Work Request to a QP, the NIC processes it asynchronously in the background and writes status information to the CQ upon completion. Developers must poll or subscribe to CQ events to obtain "completion receipts," and release memory or replenish receive resources based on these receipts. Although complex, this "post and forget, confirm later" model is key to RDMA achieving zero-copy and kernel bypass.