
13.6 Queue Pair

We've already seen one approach to receive queue reuse with SRQ. But that's only half the story—or rather, it's a specialized trick reserved for extreme performance scenarios.

In day-to-day RDMA communication, the real star of the show—the object we interact with constantly and must tune by hand—is the Queue Pair (QP).

If SRQ solves the challenge of "one side feeding many rooms," then the QP is the most fundamental, indispensable "communication pipe" itself.


The Essence of a QP: Two Unidirectional Lanes

Intuitively, we tend to think of a network connection as a single pipe—data flows in one end and out the other. But in the RDMA world, in the pursuit of ultimate performance, this model is split apart.

A Queue Pair (QP) is the actual object used for sending and receiving data in InfiniBand. The "Pair" in the name is precise: it consists of two completely independent work queues:

  1. Send Queue (SQ): You post requests here, telling the NIC to send data out.
  2. Receive Queue (RQ): You post requests here, telling the NIC "I have free space, put received data here."

Here lies a common cognitive pitfall: we instinctively assume the two directions are coupled, but sending and receiving are completely decoupled.

SQ and RQ each have their own attributes:

  • How many Work Requests (WRs) can they hold?
  • How many Scatter/Gather (SGE) elements does each WR support?
  • Which CQ should the completion status be written to?

You can make the SQ very large and the RQ very small, or vice versa. As long as it stays within hardware limits, the kernel doesn't care.

Ordering Guarantees and Decoupling

Within a single queue, ordering is strictly guaranteed. If you post WR1 and WR2 to the SQ in order, the NIC will always process WR1 before WR2. The same applies to the RQ.

However, there is no relationship between the SQ and the RQ.

You can post a send request to the SQ first, and then post a receive request to the RQ; or vice versa. They don't affect each other, much like two parallel one-way streets. This is crucial for understanding concurrent behavior.

Figure 13-5 shows a standard QP structure.

Figure 13-5. QP (Queue Pair) (Diagram: a QP contains two queues, one arrow pointing out (Send), one arrow pointing in)

When you create a QP on a device, it receives a qp_num that is currently unique on that RDMA device. When others need to communicate with this pipe, they rely on this number.


QP Transport Types: Not All Roads Are Paved the Same

InfiniBand is complex because it wasn't designed to do just one thing. It provides several QP transport types, each corresponding to different scenarios and trade-offs.

We need to choose the QP type as carefully as we would choose a NIC.

1. Reliable Connected (RC)

This is the most commonly used and fully featured type.

  • Connection mode: One-to-one. An RC QP must connect to a specific RC QP on the remote end.
  • Reliability: Absolute guarantee. Lost packets are automatically retransmitted, out-of-order packets are automatically reordered, and corrupted packets are detected and retransmitted.
  • Transport mechanism: Messages are fragmented at the sender according to the path MTU and reassembled at the receiver.
  • Supported operations: The full course — Send, RDMA Write, RDMA Read, and Atomic operations.

If you are building storage or database clusters and need consistency and strong semantics, RC is the right choice.
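The fragmentation rule above is easy to quantify. As a back-of-the-envelope sketch (the `rc_packet_count` helper is our own illustration, not a kernel API), the number of packets an RC message occupies on the wire is simply the message length divided by the path MTU, rounded up:

```c
#include <stdint.h>

/* Illustrative only: how many packets an RC QP needs to carry one
 * message, given the path MTU in bytes (e.g. 1024, 2048, 4096).
 * A zero-length Send is assumed to still consume one packet. */
static inline uint32_t rc_packet_count(uint64_t msg_len, uint32_t path_mtu)
{
    if (msg_len == 0)
        return 1;
    return (uint32_t)((msg_len + path_mtu - 1) / path_mtu);
}
```

The hardware performs this split and reassembly transparently; the point of the sketch is just that a large RC message costs multiple wire packets, each of which is individually acknowledged and retransmittable.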

2. Unreliable Connected (UC)

Looks very similar to RC—it's also a one-to-one, point-to-point connection. But it "cuts" the retransmission mechanism.

  • Reliability: Not guaranteed. If any single packet within a message is lost, the entire message is lost.
  • Supported operations: Send and RDMA Write.
  • Why does it exist? Some application layers have their own retransmission logic, or only care about throughput and can tolerate occasional packet loss. Without the hardware retransmission overhead, it can be slightly faster.

3. Unreliable Datagram (UD)

This is the RDMA version of UDP.

  • Connection mode: One-to-many. A single UD QP can send messages to any UD QP within the subnet, and it even supports multicast.
  • Reliability: Completely unguaranteed.
  • Limitations: Message size is limited to the path MTU; no fragmentation is allowed. Only Send operations are supported; RDMA Read/Write are not.
  • Use case: This is the "control plane" of RDMA. For example, when we need to exchange information before establishing a connection, UD is the most appropriate choice.
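Two sizing rules follow from the limitations above, sketched here as toy helpers (the names `ud_send_fits` and `ud_recv_buf_len` are our own, not verbs APIs): a UD payload must fit in a single path-MTU packet because there is no fragmentation, and, assuming the usual verbs convention, a UD receive buffer must reserve 40 extra bytes at the front for the Global Routing Header (GRH) that the HCA may deposit ahead of the data.

```c
#include <stdbool.h>
#include <stdint.h>

#define UD_GRH_BYTES 40 /* assumed: receive buffers reserve room for the GRH */

/* A UD payload must fit in one packet; UD does no fragmentation. */
static inline bool ud_send_fits(uint64_t payload_len, uint32_t path_mtu)
{
    return payload_len <= path_mtu;
}

/* Minimum receive buffer length for a UD message of the given payload:
 * the first 40 bytes are consumed by the GRH (or padding). */
static inline uint64_t ud_recv_buf_len(uint64_t payload_len)
{
    return payload_len + UD_GRH_BYTES;
}
```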

4. eXtended Reliable Connected (XRC)

Remember the SRQ we discussed in the previous section? XRC is SRQ's best partner.

  • Scenario: Multiple QPs on the same node (or even across multiple processes) can simultaneously send messages to a specific SRQ on the remote end.
  • Purpose: Reduce the number of QPs. Previously it was one-to-one, and QP counts would explode as core counts grew; now it can be many-to-one.
  • Limitations: This is a privilege reserved for userspace applications; kernel drivers typically don't touch this.

5. Raw Packet / Raw Ethertype

This is for "hackers."

  • Functionality: Allows the client to construct a complete Layer 2 (L2) header and send raw data directly. The receiving RDMA device does not strip any headers.
  • Use case: If you want to run custom protocols on an RDMA NIC, or do some weird network experiments, this comes in handy.
  • Current status: Most RDMA devices do not currently support Raw IPv6/Raw Ethertype.

Special Types: Management QPs

In addition to data-transfer QPs, there are two QPs specifically for "management," built into each port:

  • SMI / QP0: Dedicated to Subnet Management, handling management packets.
  • GSI / QP1: For General Services, such as querying NIC attributes via MAD.

Creating a QP: Building the Pipe

Enough theory—let's get our hands dirty and create an RC QP.

The core function for creating a QP is ib_create_qp(). You need to prepare a PD (Protection Domain) and a structure describing the QP attributes, struct ib_qp_init_attr.

Here is a practical example: we want to create an RC QP with the send queue and receive queue bound to different CQs, both with very small capacities (only 2 WRs) for easy demonstration.

struct ib_qp_init_attr init_attr;
struct ib_qp *qp;

memset(&init_attr, 0, sizeof(init_attr));

/* Event callback: invoked on asynchronous QP state changes (e.g. errors) */
init_attr.event_handler = my_qp_event;

/* Send queue capacity: at most 2 WRs */
init_attr.cap.max_send_wr = 2;

/* Receive queue capacity: at most 2 WRs */
init_attr.cap.max_recv_wr = 2;

/* Scatter/Gather element limit per WR */
init_attr.cap.max_recv_sge = 1;
init_attr.cap.max_send_sge = 1;

/* Send completion policy: every WR generates a notification in the CQ */
init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;

/* QP type: RC */
init_attr.qp_type = IB_QPT_RC;

/* Bind the CQs */
init_attr.send_cq = send_cq;
init_attr.recv_cq = recv_cq;

qp = ib_create_qp(pd, &init_attr);
if (IS_ERR(qp)) {
    printk(KERN_ERR "Failed to create a QP\n");
    return PTR_ERR(qp);
}

Note: Here, sq_sig_type is set to IB_SIGNAL_ALL_WR. This means we want a notification for every send WR that completes. This is very convenient when writing test programs, but in high-performance production environments, you would choose IB_SIGNAL_REQ_WR (manually controlling which WRs generate notifications) to reduce CQ interrupts.


QP State Machine: The Famous "Tetris"

A newly created QP is just an empty shell. It can't send or receive data immediately; it must go through a series of strict state transitions.

This is the most hair-pulling and error-prone part of RDMA development. You can think of a QP as a stage in the middle of a scene change—the actors absolutely must not go on stage until the lighting is properly set.

Figure 13-6 depicts this complex state machine.

Figure 13-6. QP state machine (State machine diagram: Reset -> Init -> RTR -> RTS -> SQD / Error)

Let's walk through it in order:

1. Reset

  • Initial state: All newly created QPs start here.
  • Capabilities: Can't do anything. Cannot post Send or Recv requests.
  • Behavior: All incoming messages are silently dropped.
  • Use case: This is a "safe mode" used to clear previous configurations.

2. Init

  • Capabilities: Still can't post Send requests, but you can now post Receive requests.
  • Behavior: Although you can post Recv requests, they won't be processed (because no connection has been established yet). All incoming messages are still dropped.
  • Best practice: Pre-post a few Receive Requests to the RQ during this stage. This prevents a classic race condition: the moment you transition the QP to the RTR state, remote data might arrive instantly. If you're too slow to post Recv requests, an RNR (Receiver Not Ready) error will immediately knock on your door.
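The RNR race described in the best practice above can be captured in a toy model (the `toy_rq` structure and helpers are purely illustrative; real code posts WRs with ib_post_recv()): each arriving Send consumes one posted Receive WR, and an arrival that finds the receive queue empty triggers an RNR condition.

```c
/* Toy model of the RNR race: each arriving message consumes one posted
 * Receive WR; an arrival that finds the RQ empty is an RNR event. */
struct toy_rq {
    int posted;     /* Receive WRs currently posted */
    int rnr_events; /* arrivals that found the RQ empty */
};

static void toy_post_recv(struct toy_rq *rq)
{
    rq->posted++;
}

/* Returns 1 if the message landed in a posted buffer, 0 on RNR. */
static int toy_arrive(struct toy_rq *rq)
{
    if (rq->posted == 0) {
        rq->rnr_events++;
        return 0;
    }
    rq->posted--;
    return 1;
}
```

Pre-posting during Init simply means calling the equivalent of `toy_post_recv()` a few times before the transition to RTR, so that the first arrivals never see an empty queue.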

3. Ready To Receive (RTR)

  • Capabilities: Can now process Receive requests. Still cannot post Send requests.
  • Behavior: Incoming messages are formally processed.
  • Event: When the first message is received in this state, a "communication established" asynchronous event is triggered.
  • Use case: If you only want to receive and not send, you can stop here.

4. Ready To Send (RTS)

  • Capabilities: Full speed ahead. Both Send and Recv requests can be posted and processed.
  • Use case: This is the QP's "combat state." The vast majority of normally functioning QPs reside here.

5. Send Queue Drained (SQD)

  • Capabilities: This is a transitional state. The QP will finish sending all Send Requests that have already started processing, but will reject new send requests.
  • Internal details: Divided into two phases: Draining (still sending) and Drained (finished sending).
  • Use case: When you need to modify certain QP attributes without destroying the QP, you can have it pause first.

6. Error

  • Trigger: For unreliable transports (UC/UD), if the send queue encounters an error, it enters the SQE state (at which point the receive queue can still be used). For reliable transports (RC) or any receive queue error, the QP drops directly into the Error state.
  • Behavior: All outstanding WRs are flushed and generate error WCs. All received messages are dropped directly.
  • Recovery: Once in the Error state, the QP is basically dead. You must manually transition it back to the Reset state and reconfigure the resources.
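The happy path of Figure 13-6 can be encoded as a small legality table. This is a simplified sketch (the `toy_*` names are ours; the real transitions are driven by ib_modify_qp() and include a few extra edges), but it captures the rules the text just walked through: you climb Reset -> Init -> RTR -> RTS one step at a time, any state can fall into Error, and the only way out of Error is back to Reset.

```c
#include <stdbool.h>

enum toy_qp_state { TOY_RESET, TOY_INIT, TOY_RTR, TOY_RTS, TOY_SQD, TOY_ERR };

/* Simplified transition rules from Figure 13-6. */
static bool toy_transition_ok(enum toy_qp_state from, enum toy_qp_state to)
{
    if (to == TOY_ERR)
        return true;  /* any state can drop into Error */
    if (to == TOY_RESET)
        return true;  /* resetting is always allowed (clears the QP) */
    switch (from) {
    case TOY_RESET: return to == TOY_INIT;
    case TOY_INIT:  return to == TOY_RTR;
    case TOY_RTR:   return to == TOY_RTS;
    case TOY_RTS:   return to == TOY_SQD;
    case TOY_SQD:   return to == TOY_RTS; /* resume after draining */
    default:        return false;         /* Error: only Reset gets you out */
    }
}
```

Skipping a step, for example going straight from Reset to RTS, is exactly the kind of transition the hardware rejects, which is why ib_modify_qp() must be called once per hop.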

Manipulating the State Machine: ib_modify_qp()

State transitions don't happen automatically; you need to call ib_modify_qp() to push it along. This function doesn't just change the state—it also configures the parameters required for that state along the way.

This is the most complex part in practice. Let's drag a newly created QP all the way to the RTS state.

Step 1: Reset -> Init

Before entering Init, we need to tell it: which port to use? What is the P_Key?

struct ib_qp_attr attr = {
    .qp_state        = IB_QPS_INIT,
    .pkey_index      = 0,
    .port_num        = port, /* Physical port to use */
    .qp_access_flags = 0     /* Whether the remote side may RDMA Read/Atomic our memory */
};

ret = ib_modify_qp(qp, &attr,
                   IB_QP_STATE |
                   IB_QP_PKEY_INDEX |
                   IB_QP_PORT |
                   IB_QP_ACCESS_FLAGS);

if (ret) {
    printk(KERN_ERR "Failed to modify QP to INIT state\n");
    return ret;
}

Step 2: Init -> RTR

This is the most critical step. We need to tell the QP: who are you going to talk to?

For an RC QP, we must configure the remote end's information: what is the remote LID? What is the remote QP number? What is the starting PSN (Packet Sequence Number)?

attr.qp_state           = IB_QPS_RTR;
attr.path_mtu           = mtu;
attr.dest_qp_num        = remote->qpn; /* The remote QP number */
attr.rq_psn             = remote->psn; /* Starting PSN the remote side expects */
attr.max_dest_rd_atomic = 1;  /* How many RDMA Reads/Atomics the remote may have in flight */
attr.min_rnr_timer      = 12; /* How long to wait before retrying when the remote isn't ready */

/* Address Handle (AH) attributes: describe the routing information */
attr.ah_attr.is_global     = 0;
attr.ah_attr.dlid          = remote->lid; /* Destination LID */
attr.ah_attr.sl            = sl;          /* Service Level */
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num      = port;

ret = ib_modify_qp(qp, &attr,
                   IB_QP_STATE |
                   IB_QP_AV | /* Address Vector */
                   IB_QP_PATH_MTU |
                   IB_QP_DEST_QPN |
                   IB_QP_RQ_PSN |
                   IB_QP_MAX_DEST_RD_ATOMIC |
                   IB_QP_MIN_RNR_TIMER);

if (ret) {
    printk(KERN_ERR "Failed to modify QP to RTR state\n");
    return ret;
}

Step 3: RTR -> RTS

Finally, configure the sending parameters on our side, and we can open the floodgates.

attr.qp_state      = IB_QPS_RTS;
attr.timeout       = 14;     /* Transport timeout (4.096us * 2^14) */
attr.retry_cnt     = 7;      /* Maximum retry count */
attr.rnr_retry     = 6;      /* Retry policy when the remote reports RNR */
attr.sq_psn        = my_psn; /* Starting PSN for our own sends */
attr.max_rd_atomic = 1;      /* How many RDMA Reads/Atomics we may have in flight */

ret = ib_modify_qp(qp, &attr,
                   IB_QP_STATE |
                   IB_QP_TIMEOUT |
                   IB_QP_RETRY_CNT |
                   IB_QP_RNR_RETRY |
                   IB_QP_SQ_PSN |
                   IB_QP_MAX_QP_RD_ATOMIC);

if (ret) {
    printk(KERN_ERR "Failed to modify QP to RTS state\n");
    return ret;
}

Only after completing these three steps is the QP truly alive.


Work Request (WR) Processing Mechanism: The Flowing Lifeline

The QP is built, the states are transitioned, and now it's time to fill it with data.

Figure 13-7 shows the lifecycle of a Work Request.

Figure 13-7. Work Request processing flow (Flowchart: Post WR -> Driver/HW -> Processing -> Poll WC)

As soon as a WR is posted to a queue, it enters the Outstanding state. It isn't considered finished until you poll the corresponding Work Completion (WC) from the associated CQ.

The "Signaling" Art of the Send Queue

In the SQ, there is a very subtle design: not all Send Requests generate a WC.

To reduce interrupts and PCIe bus pressure, you can choose to mark only specific WRs (Signaled). Only marked WRs (or WCs forcibly generated due to errors) will appear in the CQ.

This is why we see sq_sig_type in init_attr. If you choose IB_SIGNAL_ALL_WR, that's the "convenience" mode; if you choose IB_SIGNAL_REQ_WR, you must plan carefully in your code and explicitly set the IB_SEND_SIGNALED flag on selected WRs, typically every Nth one.

But there's a catch: if an unmarked WR encounters an error, even if you don't want a notification, the hardware will forcibly generate a WC with an error status. This is a safety net.
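Putting those two rules together, the number of WCs you actually poll out of the CQ is easy to model. This is a toy counting sketch (the `toy_wc_count` helper is ours, not a verbs API), assuming IB_SIGNAL_REQ_WR mode where every Nth WR is posted with the signaled flag and any failed WR forcibly produces a WC:

```c
#include <stdbool.h>

/* Toy model of selective signaling: only WRs posted as signaled, plus
 * any WR that fails, produce a Work Completion in the CQ.
 * failed_idx < 0 means "no WR failed". */
static int toy_wc_count(int total_wrs, int signal_every, int failed_idx)
{
    int wcs = 0;
    for (int i = 0; i < total_wrs; i++) {
        bool signaled = ((i + 1) % signal_every) == 0;
        bool failed   = (i == failed_idx);
        if (signaled || failed)
            wcs++;
    }
    return wcs;
}
```

One practical caveat worth remembering: with selective signaling you must still signal periodically, because the send queue slots of unsignaled WRs are typically only reclaimed when a later signaled WR completes.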

Resource Locking: Absolutely Untouchable Memory

While a WR is in the Outstanding state, you absolutely must not touch any resources it uses.

  • UD QP send: If the WR carries an Address Handle (AH), you cannot free this AH before the WC is returned.
  • Receive requests: If you post a Receive WR pointing to a buffer, you cannot read from this buffer until the WC returns telling you "receiving is complete." Why? Because DMA might still be writing to it, or might not have even started. Data read at this point is undefined and could even cause cache coherency issues.

Fencing

This is an advanced feature. Imagine this scenario:

  1. You initiate an RDMA Read to pull remote data back.
  2. You immediately issue a Send to transmit the data you just read.

Because the QP operates as a pipeline, if the hardware is too aggressive, the Send might fetch data from memory and transmit it before the RDMA Read has finished reading and writing the data into memory.

The result: you send out garbage data.

Fence exists to solve this problem. If you set the Fence flag (IB_SEND_FENCE) on a Send WR, the NIC guarantees: it waits until all prior RDMA Read and Atomic operations on this SQ are completely finished before starting to process this Send request.

This sacrifices a bit of performance, but it buys you correctness.
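The hazard and the fix can be sketched as a toy pipeline (the `toy_sq` structure and helpers are purely illustrative; real code sets IB_SEND_FENCE in the send WR's flags): an RDMA Read deposits data into a local buffer some time after it is issued, and an unfenced Send may sample that buffer before the data lands.

```c
/* Toy pipeline illustrating the fence hazard. */
struct toy_sq {
    int buf;          /* local buffer the RDMA Read targets */
    int read_pending; /* 1 while a prior Read is still in flight */
    int read_value;   /* value the Read will eventually deposit */
};

static void toy_post_read(struct toy_sq *sq, int remote_value)
{
    sq->read_pending = 1;
    sq->read_value = remote_value; /* lands in buf only on completion */
}

static void toy_complete_read(struct toy_sq *sq)
{
    if (sq->read_pending) {
        sq->buf = sq->read_value;
        sq->read_pending = 0;
    }
}

/* Value an immediately-following Send transmits from buf: a fenced
 * Send drains prior Reads first; an unfenced Send samples right away. */
static int toy_send_sample(struct toy_sq *sq, int fenced)
{
    if (fenced)
        toy_complete_read(sq);
    return sq->buf;
}
```

In the unfenced case the Send transmits whatever stale bytes happen to be in the buffer; with the fence, it always transmits the value the Read fetched.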

Error Handling

Finally, we have to face reality: networks will always have errors.

  • SQ errors (unreliable transports): For UC/UD, a send error only puts the SQ into the SQE state; the RQ can still receive normally. You can attempt recovery.
  • RQ errors: Once the receive queue has a problem (such as an out-of-bounds memory access), the entire QP drops directly into the Error state. This is unrecoverable; you must Reset.

Summary

The Queue Pair is the "avatar" of the RDMA world. All of our operational intentions—read, write, send messages—ultimately turn into WRs stuffed into its belly.

Understanding it requires three levels of depth:

  1. Physical structure: Two queues (SQ/RQ), one PD, two CQs.
  2. Logical attributes: Do you choose RC (reliable) or UD (multicast)? This determines what operations you can perform.
  3. Lifecycle: The most complex part. You must strictly dance to the rhythm of Reset -> Init -> RTR -> RTS—skip a step and it fails, rush a step and it fails.

This is indeed tedious, but it is the necessary price to pay in exchange for that ultimate performance.