13.7 Supported RDMA Operations
In the previous section, we spent a lot of effort figuring out how the "avatar" known as a QP is created and how fragile its lifecycle can be.
Now, when a QP finally reaches the RTS (Ready to Send) state, it means the transport channel is open. It's time to put it to real work.
In the world of RDMA, not all communication methods are created equal. You have a whole range of operations at your disposal, from the simplest "send a message" to the magical "remote memory read/write." In this section, we will break down the mechanisms behind these operations, see how data actually gets moved out, and understand how the hardware recovers on its own when the network misbehaves.
The Operation Menu: What Can You Do?
InfiniBand offers a much richer set of operation types than standard network interfaces. This isn't just for show; it's designed to cover different performance semantics.
We can break them down into several categories.
1. Message Passing
This is the part that most resembles traditional Sockets, but the resemblance is only superficial.
- Send: Pushes a message onto the wire. However, there is a prerequisite: the remote end must have posted a Receive Request in advance, and the message is written into the buffer that request describes. If the remote end isn't ready, the packet is dropped or triggers an error flow (depending on the transport type).
- Send with Immediate: Sends a message while attaching a 32-bit out-of-band data payload. This 32-bit value does not go into the data buffer; instead, it appears directly in the receiver's Work Completion (WC). This is a very clever mechanism. You can use it to send short instructions or metadata, and the receiver can grab it without having to parse a massive data packet.
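As a sketch (the variable names qp, dma_addr, len, and mr are assumptions, mirroring the hands-on examples later in this section), a Send with Immediate differs from a plain Send only in the opcode and the ex.imm_data field:

```c
/* Sketch: Send with Immediate. Assumes qp, dma_addr, len, mr are set up
 * as in the examples later in this section. */
struct ib_sge sg = {
	.addr   = dma_addr,
	.length = len,
	.lkey   = mr->lkey,
};
struct ib_send_wr wr = {
	.wr_id      = (uintptr_t)dma_addr,
	.sg_list    = &sg,
	.num_sge    = 1,
	.opcode     = IB_WR_SEND_WITH_IMM,  /* Send + 32-bit immediate */
	.send_flags = IB_SEND_SIGNALED,
};
struct ib_send_wr *bad_wr;
int ret;

wr.ex.imm_data = cpu_to_be32(0x1234);   /* lands in the receiver's WC, not its buffer */
ret = ib_post_send(qp, &wr, &bad_wr);
```

On the receiving side, the value shows up in wc.ex.imm_data when wc.wc_flags has IB_WC_WITH_IMM set.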
2. RDMA Operations
This is the soul of RDMA and what sets it apart from ordinary NICs.
- RDMA Write: Writes data directly into a remote memory address. The remote CPU does not need to participate at all—no interrupts, no kernel context switches. As long as the remote end has granted permission, the data just zips over.
- RDMA Write with Immediate: A combination of RDMA Write and Send with Immediate. Data is written to the specified remote memory (like RDMA Write), while the 32-bit immediate value is delivered to the remote CQ (like Send with Immediate). Note: this operation requires the remote end to have a Receive Request queued up. Why? Because the immediate value needs somewhere to land (a WC), and a WC is only generated when there is a corresponding receive operation. You can think of it as a "zero-byte Send + a normal RDMA Write."
- RDMA Read: The initiator specifies a remote address and pulls the data from there back into a local buffer. This is a "pull" model, where the initiative lies with the reader.
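As a hedged sketch: on recent kernels (roughly 4.4 and later), RDMA opcodes use the struct ib_rdma_wr wrapper around ib_send_wr (older kernels put remote_addr/rkey in a union inside ib_send_wr itself). The names peer_addr and peer_rkey are assumptions; they must be obtained from the peer out of band:

```c
/* Sketch: posting an RDMA Write. peer_addr/peer_rkey (assumed names)
 * identify the remote buffer and the key granting write permission. */
struct ib_sge sg = {
	.addr   = dma_addr,
	.length = len,
	.lkey   = mr->lkey,
};
struct ib_rdma_wr rdma_wr = {
	.wr = {
		.wr_id      = (uintptr_t)dma_addr,
		.sg_list    = &sg,
		.num_sge    = 1,
		.opcode     = IB_WR_RDMA_WRITE,
		.send_flags = IB_SEND_SIGNALED,
	},
	.remote_addr = peer_addr,  /* address exposed by the peer */
	.rkey        = peer_rkey,  /* remote key for that region */
};
struct ib_send_wr *bad_wr;
int ret;

ret = ib_post_send(qp, &rdma_wr.wr, &bad_wr);
```

For an RDMA Read, only the opcode changes (IB_WR_RDMA_READ); data then flows from remote_addr into the local buffer described by the SGE.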
3. Atomic Operations
In distributed systems, locks are performance killers. RDMA provides hardware-level atomic operations, allowing you to manipulate remote memory directly, bypassing locks.
- Compare and Swap (CAS): Compares the value at a remote address with valueX. If they are equal, it replaces the value with valueY. The entire process is atomic. After the operation completes, the original value from the remote end is sent back and stored locally.
- Fetch and Add: Atomically increments the value at a remote address by a specified amount. The original value is similarly sent back to the local side.
- Masked Compare and Swap: A CAS with a mask. It only compares the bits specified by the mask, and if they match, it only replaces the corresponding masked bits.
- Masked Fetch and Add: An addition with a mask, only altering the bits specified by the mask.
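A hedged sketch of a Fetch and Add on a recent kernel, where atomics use the struct ib_atomic_wr wrapper (peer_addr and peer_rkey are assumed names for the remote 8-byte counter and its key):

```c
/* Sketch: atomic Fetch and Add. The old value at the remote address is
 * written back into the local 8-byte buffer described by the SGE. */
struct ib_sge sg = {
	.addr   = dma_addr,        /* 8-byte local buffer for the old value */
	.length = sizeof(u64),
	.lkey   = mr->lkey,
};
struct ib_atomic_wr atomic_wr = {
	.wr = {
		.wr_id      = (uintptr_t)dma_addr,
		.sg_list    = &sg,
		.num_sge    = 1,
		.opcode     = IB_WR_ATOMIC_FETCH_AND_ADD,
		.send_flags = IB_SEND_SIGNALED,
	},
	.remote_addr = peer_addr,
	.rkey        = peer_rkey,
	.compare_add = 1,          /* amount to add */
};
struct ib_send_wr *bad_wr;
int ret;

ret = ib_post_send(qp, &atomic_wr.wr, &bad_wr);
```

For CAS, the opcode becomes IB_WR_ATOMIC_CMP_AND_SWP, compare_add holds the value to compare against, and the swap field holds the replacement value.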
4. Memory Management Extensions
- Bind Memory Window: Binds a memory window to a specific memory region.
- Fast Registration: Quickly registers an FMR via a WR.
- Local Invalidate: Invalidates an FMR via a WR. If anyone tries to use the old key afterward, an error will be reported. This operation can be combined with Send/Read, with the execution order being read/write first, then invalidate.
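As a minimal sketch, a Local Invalidate is posted through the Send queue like any other WR; the key to invalidate goes in the ex.invalidate_rkey field (mr here is assumed to be a previously fast-registered region):

```c
/* Sketch: invalidating a fast-registered key via the Send queue.
 * mr (assumed) is a previously fast-registered memory region. */
struct ib_send_wr inv_wr = {
	.opcode     = IB_WR_LOCAL_INV,
	.send_flags = IB_SEND_SIGNALED,
	.ex.invalidate_rkey = mr->rkey,  /* this key becomes unusable */
};
struct ib_send_wr *bad_wr;
int ret;

ret = ib_post_send(qp, &inv_wr, &bad_wr);
```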
Receive Requests: Who Picks Up the Tab?
For all operations that "consume" a Receive Request (like Send), the remote end must have set the table in advance.
A Receive Request specifies where the data lands. The total size of the buffers listed in your Scatter List must be greater than or equal to the size of the incoming message. Otherwise? An Overflow error awaits you.
There is a specific pitfall to watch out for with UD QPs:
UD is an unreliable datagram. Messages might come from within the subnet or from outside; they might be unicast or multicast. This means a message might carry a 40-byte GRH (Global Routing Header).
If you are using a UD QP, you must reserve an extra 40 bytes in the Receive Request's buffer.
- If the received message includes a GRH, the first 40 bytes will be filled with the GRH content (telling you how to reply), and the actual data starts at byte 40.
- If there is no GRH, those 40 bytes are undefined (or ignored by the hardware), and the data is placed right from the beginning (the exact behavior depends on the hardware implementation; usually, you also need to check the flags in the WC to determine this).
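The offset bookkeeping is easy to get wrong, so here is a tiny self-contained helper (ud_payload is a hypothetical name, not a kernel API) that captures the rule: check the GRH flag from the WC, then skip 40 bytes or not.

```c
#include <stdint.h>

#define GRH_BYTES 40  /* size of the Global Routing Header */

/* Given the start of a UD receive buffer and whether the Work Completion
 * reported a GRH (IB_WC_GRH in wc_flags), return where the payload begins. */
static inline const uint8_t *ud_payload(const uint8_t *buf, int has_grh)
{
	return has_grh ? buf + GRH_BYTES : buf;
}
```

Either way, the Receive Request's buffer must be at least 40 bytes larger than the biggest message you expect.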
Hands-on: Posting a Receive Request
Let's look at how to use the kernel API to push a Receive Request, i.e., an ib_recv_wr, into the RQ.
Here we assume qp has already been created, dma_addr is an address already mapped with ib_dma_map_single, and mr is a registered memory region.
struct ib_recv_wr wr, *bad_wr;
struct ib_sge sg;
int ret;
// 1. Fill in the scatter-gather element (SGE)
// This is where the incoming data will land
memset(&sg, 0, sizeof(sg));
sg.addr = dma_addr; // DMA (bus) address
sg.length = len; // buffer length (for UD, remember the extra 40 bytes!)
sg.lkey = mr->lkey; // local key
// 2. Fill in the Work Request
memset(&wr, 0, sizeof(wr));
wr.next = NULL; // linked-list pointer; NULL for a single request
wr.wr_id = (uintptr_t)dma_addr; // ID identifying this request; returned as-is when polling the CQ
wr.sg_list = &sg; // points to the SGE array
wr.num_sge = 1; // number of SGEs
// 3. Post to the kernel
ret = ib_post_recv(qp, &wr, &bad_wr);
if (ret) {
printk(KERN_ERR "Failed to post Receive Request to a QP\n");
return ret;
}
⚠️ Don't forget to poll the CQ: Submitting a Request merely posts the task. When the data arrives, you need to fetch the Work Completion from the corresponding CQ to know whether the receive was successful.
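A minimal polling sketch (cq is assumed to be the CQ the QP's receive queue was created with):

```c
/* Sketch: draining completions from a CQ one at a time. */
struct ib_wc wc;
int n;

while ((n = ib_poll_cq(cq, 1, &wc)) > 0) {
	if (wc.status != IB_WC_SUCCESS) {
		printk(KERN_ERR "WC error %d for wr_id 0x%llx\n",
		       wc.status, (unsigned long long)wc.wr_id);
		continue;
	}
	/* wc.wr_id is the value set when posting the WR;
	 * wc.byte_len is the number of bytes received. */
}
```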
Send Requests: Pushing Data Out
The logic on the sending side is very similar to the receiving side, except the Work Request structure is richer because you have to specify the opcode (operation code).
Below is a standard Send operation example.
struct ib_sge sg;
struct ib_send_wr wr, *bad_wr;
int ret;
// 1. SGE setup
memset(&sg, 0, sizeof(sg));
sg.addr = dma_addr;
sg.length = len;
sg.lkey = mr->lkey;
// 2. Send WR setup
memset(&wr, 0, sizeof(wr));
wr.next = NULL;
wr.wr_id = (uintptr_t)dma_addr;
wr.sg_list = &sg;
wr.num_sge = 1;
// The key part: specify the operation and behavior
wr.opcode = IB_WR_SEND; // a plain Send
wr.send_flags = IB_SEND_SIGNALED; // request that this WR generate a WC. Without this flag,
// only the last WR in the chain generates a WC,
// which is commonly used as a performance optimization (batching).
// 3. Post the request
ret = ib_post_send(qp, &wr, &bad_wr);
if (ret) {
printk(KERN_ERR "Failed to post Send Request to a QP\n");
return ret;
}
When the Network Fails: Retry Flows
Ideally, the flow is: WR submitted -> hardware sends -> hardware receives -> WC generated.
But reality is harsh. A WC might come back with an error. Once an error occurs, the contents of the memory buffer are undefined—they are dirty.
Some errors are fatal (like a permission violation), and the hardware won't retry; it will simply report the error. But under the Reliable (RC) transport type, the hardware has two very powerful automatic retry mechanisms. As a developer, you usually won't even feel these retries happening, aside from a slight network hiccup.
1. General Retry Flow
If the sender transmits a packet but doesn't receive an ACK or NACK within the timeout period, the hardware will automatically retransmit.
This usually happens because:
- The remote QP is in the wrong state (hasn't reached RTR, or has entered the Error state).
- The routing is misconfigured.
- The packet was dropped in transit (CRC error).
- The returning ACK was lost.
As long as the ACK is eventually received, everything is fine, and the upper-layer application is completely unaware. If the retry count is exhausted and the ACK still hasn't arrived, the sender will receive a WC with Retry Error.
2. RNR (Receiver Not Ready) Flow
This is a protection mechanism specifically designed for "clueless receivers."
Scenario: The sender transmits data, the receiver gets it, but finds no empty Receive Request in the RQ (i.e., no plate was set).
In traditional networking, this packet would simply be dropped. In RDMA RC mode, the receiver will reply with an RNR NACK.
Upon receiving the NACK, the sender will pause for a short while (using the time specified in the RNR NACK) and then retransmit.
If the receiver posts a Receive Request in time, the data can be received normally, the sender gets an ACK, and everyone is happy.
If the receiver never posts one, the sender exhausts its retry count and will receive a WC with RNR Retry Error.
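The knobs governing both retry flows are set when modifying the QP's state, using the same ib_modify_qp call covered earlier in the chapter. A hedged sketch of the sender side (the values are illustrative, not recommendations):

```c
/* Sketch: sender-side retry attributes, set during the RTR -> RTS
 * transition of an RC QP. */
struct ib_qp_attr attr;
int ret;

memset(&attr, 0, sizeof(attr));
attr.qp_state      = IB_QPS_RTS;
attr.timeout       = 14; /* ACK timeout: 4.096 us * 2^14, roughly 67 ms */
attr.retry_cnt     = 7;  /* general retries before a Retry Error WC */
attr.rnr_retry     = 7;  /* RNR retries; the value 7 means "retry forever" */
attr.sq_psn        = 0;
attr.max_rd_atomic = 1;
ret = ib_modify_qp(qp, &attr,
		   IB_QP_STATE | IB_QP_TIMEOUT | IB_QP_RETRY_CNT |
		   IB_QP_RNR_RETRY | IB_QP_SQ_PSN | IB_QP_MAX_QP_RD_ATOMIC);
```

On the receiving side, attr.min_rnr_timer (set with IB_QP_MIN_RNR_TIMER during the transition to RTR) is the pause time advertised in the RNR NACK.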
Multicast: One-to-Many
Multicast allows a UD QP to send messages to multiple UD QPs.
The mechanism is simple:
- A UD QP that wants to receive messages must call ib_attach_mcast() to attach itself to a specific multicast group.
- When the NIC receives a multicast packet, it replicates the packet to all QPs attached to that group.
Don't want to receive anymore? Just call ib_detach_mcast().
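A minimal sketch of the attach/detach pair (mgid and mlid are assumed names; the group GID and LID would come from the subnet manager after joining the group):

```c
/* Sketch: attaching a UD QP to a multicast group and detaching later. */
union ib_gid mgid; /* multicast group GID, obtained from the SM */
u16 mlid;          /* multicast LID of the group */
int ret;

ret = ib_attach_mcast(qp, &mgid, mlid);
if (ret)
	printk(KERN_ERR "Failed to attach QP to multicast group\n");

/* ... receive multicast traffic via normal Receive Requests ... */

ret = ib_detach_mcast(qp, &mgid, mlid);
```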
Userspace vs. Kernel API
The beauty of RDMA is that the APIs used in userspace and kernel space are almost identical.
- Prefixes: The kernel uses ib_, while userspace uses ibv_.
- Control path: When userspace calls control functions, it traps into the kernel because it needs to manipulate privileged resources (like allocating a QP number).
- Differences:
- Some QP types are only visible in the kernel (SMI, GSI).
- Some privileged operations can only be done in the kernel (physical memory registration, FMR).
- Notification mechanism: The kernel API is asynchronous (callback functions); the userspace API is synchronous, requiring you to actively poll the CQ or events.
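To see how close the two APIs are, here is a hedged userspace mirror of the kernel receive example from earlier in this section, using libibverbs (qp, buf, len, and mr are assumed names):

```c
/* Sketch: the userspace (libibverbs) equivalent of ib_post_recv.
 * Note there is no DMA mapping step: a plain virtual address is used. */
struct ibv_sge sg = {
	.addr   = (uintptr_t)buf,  /* virtual address of a registered buffer */
	.length = len,
	.lkey   = mr->lkey,
};
struct ibv_recv_wr wr = {
	.wr_id   = (uintptr_t)buf,
	.sg_list = &sg,
	.num_sge = 1,
};
struct ibv_recv_wr *bad_wr;

int ret = ibv_post_recv(qp, &wr, &bad_wr);
```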
Chapter Summary
We've come a long way in this chapter. From InfiniBand's zero-copy and kernel bypass advantages, to the complex software and hardware architecture, to the creation and management of core objects like QPs, MRs, and CQs.
In this final section, we tied these objects together, demonstrating how to make data flow through Send and RDMA operations.
The RDMA learning curve is indeed steep—you have to understand address translation, memory registration, state machines, retry flows... But once you tame this system, what you get is a powerful engine that can bypass the operating system kernel and directly touch the pulse of the network.
In the next chapter, we will turn our gaze to a broader system perspective—network namespaces and the Bluetooth subsystem.