
13.4 Completion Queue (CQ) — The Mailbox for Task Completion

In the previous section, we sorted out the "cargo" (Memory Region) and the "route" (Address Handle), but those are just static preparations. The essence of RDMA is motion — data must actually fly out and fly back.

This brings up a core question: How do you know the NIC has finished its work?

You toss a Work Request (WR) into a queue and move on to other tasks. In the background, the NIC silently performs DMA reads, assembles packets, and sends them out. All of this is asynchronous. You can't just sit there waiting, or you'd waste CPU cycles.

You need a place where the NIC can notify you that a job is done. That place is the Completion Queue (CQ).


Why Do We Need a CQ?

Imagine you post a Send request to a QP's Send Queue (SQ), pointing to a memory buffer.

This creates a subtle moment: Until the NIC confirms completion, that memory is in a "Schrödinger" state.

  • For a Send operation: The NIC might be reading that memory. If you free it or write new data into it right now, the NIC will send a corrupted packet. Even worse, you have no idea whether it has already been sent.
  • For a Receive operation: The NIC could write incoming data into that buffer at any time. If you read it now, you might get garbage data or incomplete data.

The rule is simple: as long as a WR is not complete, the memory it points to is strictly off-limits.

So, what does "complete" mean?

This brings us to the concept of a Work Completion (WC). A WC is a "receipt."

  • For a Reliable Connection (RC): Receiving a WC means the remote end has acknowledged receipt. The data is safely in the bag.
  • For unreliable transports (UC — Unreliable Connection, UD — Unreliable Datagram): Receiving a WC means the NIC has done its best to send it (it's on the wire). Whether the other side actually received it is anyone's guess.

The CQ is the queue specifically designed to hold these "receipts."


How the CQ Works: FIFO and Notifications

You can think of a CQ as a mailbox. The NIC is the mail carrier, and you are the one picking up the mail.

  1. FIFO ordering: The letters (WCs) in the mailbox are stuffed in chronological order. When you pick them up, you must do so in order — no skipping. This guarantees causality: the request you posted first will always yield its result first.
  2. Capacity limit: The mailbox has a finite size. If you don't pick up the mail and it fills up, the mail carrier can't stuff any more in. At this point, the RDMA stack will report an error, and all associated QPs will enter an Error State. This is serious — it's equivalent to the entire communication link going down.
  3. Two ways to check the mail:
    • Polling: You peek at the mailbox at short intervals. "Any mail? No? I'll check back later." This is the preferred approach for high-performance scenarios, as it has zero interrupt overhead.
    • Event notification: You tell the mail carrier, "Knock when there's mail." You can configure it to "knock after 10 letters" or "knock every 100ms." This is known as interrupt coalescing, which reduces how often the CPU is interrupted.
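
The mechanics above can be modeled in a few lines of plain C. This is a toy userspace model, not the kernel API — `toy_cq`, `toy_push`, and `toy_poll` are invented names — but it captures the three behaviors just listed: FIFO order, overflow as a hard failure, and non-blocking polling.

```c
#include <assert.h>

/* Toy fixed-size FIFO "mailbox" modeling CQ behavior.
 * The real CQ lives in the NIC/driver; this is an illustration only. */
#define TOY_CQ_DEPTH 4

struct toy_cq {
    int entries[TOY_CQ_DEPTH];
    int head, tail, count;
};

/* The "mail carrier" (NIC) stuffs a completion in; overflow is fatal. */
static int toy_push(struct toy_cq *cq, int wc)
{
    if (cq->count == TOY_CQ_DEPTH)
        return -1;              /* CQ overrun: real QPs would go to Error */
    cq->entries[cq->tail] = wc;
    cq->tail = (cq->tail + 1) % TOY_CQ_DEPTH;
    cq->count++;
    return 0;
}

/* The consumer polls: non-blocking, returns 0 when empty. */
static int toy_poll(struct toy_cq *cq, int *wc)
{
    if (cq->count == 0)
        return 0;
    *wc = cq->entries[cq->head];
    cq->head = (cq->head + 1) % TOY_CQ_DEPTH;
    cq->count--;
    return 1;
}
```

Pushing 10 then 20 and polling twice yields 10 then 20, in that order; a fifth push into the four-slot queue fails — the toy equivalent of a CQ overrun taking the link down.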

In Practice: Creating and Operating a CQ

Let's look at how we work with this mechanism in the kernel.

1. Creating a CQ (ib_create_cq)

First, you need a mailbox.

struct ib_cq *ib_create_cq(struct ib_device *device,
                           ib_comp_handler comp_handler,
                           void (*event_handler)(struct ib_event *, void *),
                           void *context,
                           struct ib_cq_init_attr *cq_attr);

There are a few key parameters to fill in:

  • device: Points to your RDMA device (HCA).
  • comp_handler: This is your "mail arrival callback." When using notification mode and a WC arrives, the kernel calls this function. If you are using pure polling mode, this can be NULL.
  • cq_attr: This defines the size and attributes of the mailbox.

⚠️ Pitfall Warning cq_attr->cqe (CQ Entries) is the minimum capacity you are requesting. The underlying driver might round it up to a larger value (for performance alignment). Never assume it's exactly the number you requested; read the actual capacity back from the cqe field of the returned struct ib_cq.

Furthermore, this capacity must be >= the sum of the Send Queue and Receive Queue capacities of all associated QPs. Why? In the worst case, all requests complete simultaneously, and the CQ must be able to hold all the WCs. If it overflows, the link goes down hard.
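
That sizing rule is just a sum, but it's worth making explicit. The helper below is purely illustrative — the `qp_size` struct and `min_cq_depth` function are made up for this example, not kernel API:

```c
#include <stddef.h>

/* Hypothetical bookkeeping: depth of each QP's send and receive queues. */
struct qp_size {
    int send_wr;    /* max outstanding WRs on the Send Queue */
    int recv_wr;    /* max outstanding WRs on the Receive Queue */
};

/* Worst case: every outstanding WR on every associated QP completes
 * before we poll even once, so the CQ must be able to hold all of them. */
static int min_cq_depth(const struct qp_size *qps, size_t n)
{
    int depth = 0;
    for (size_t i = 0; i < n; i++)
        depth += qps[i].send_wr + qps[i].recv_wr;
    return depth;
}
```

Two QPs sized 128+128 and 64+256 would need a CQ of at least 576 entries if they share one CQ.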

2. Resizing (ib_resize_cq)

Partway through, you realize the mailbox is too small? Or too big and wasteful? You can change it.

int ib_resize_cq(struct ib_cq *cq, int cqe);

But there's a hard constraint: The new capacity cannot be smaller than the number of WCs already in the CQ. You can't shrink the mailbox while it still has mail in it — that would crush the letters.

3. Modifying Behavior (ib_modify_cq)

This function is used to fine-tune the notification policy.

int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);

This is for implementing interrupt coalescing:

  • cq_count: How many WCs to accumulate before firing an interrupt.
  • cq_period: The maximum time to wait (in microseconds) before an interrupt must be fired.

This helps you balance latency and CPU usage. If you don't tune this, the default might fire an interrupt for every single WC, which will drive the CPU crazy under high throughput.
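
To get a feel for the trade-off, here's a back-of-envelope model — not driver code; `irq_rate` is an invented name, and it assumes completions keep arriving so the timer path always has something to report:

```c
/* Back-of-envelope model of interrupt coalescing: an interrupt fires when
 * either cq_count completions have accumulated OR cq_period microseconds
 * have elapsed, whichever comes first. Given a sustained completion rate
 * (WCs/sec), estimate the resulting interrupt rate. */
static double irq_rate(double wc_per_sec, unsigned count, unsigned period_us)
{
    double by_count  = wc_per_sec / count;  /* count threshold hit first */
    double by_period = 1e6 / period_us;     /* timer expires first */
    /* whichever condition triggers first dominates -> the higher rate */
    return by_count > by_period ? by_count : by_period;
}
```

At 1M WCs/sec, count=1 means one million interrupts per second; count=64 with a 100 µs cap drops that to about 15,625 — the kind of reduction coalescing buys you.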

4. Peeking (ib_peek_cq)

Don't want to pull anything out — just want to see how much mail is in there?

int ib_peek_cq(struct ib_cq *cq, int count);

This is like looking through the transparent glass of the mailbox and counting the letters inside. It does not remove the WCs.

5. Requesting Notification (ib_req_notify_cq)

If you are using "knock mode" (event notification), you need to tell the kernel, "I'm ready to hear the knock."

int ib_req_notify_cq(struct ib_cq *cq,
                     enum ib_cq_notify_flags flags);

flags has two variations:

  • IB_CQ_SOLICITED: Only knock when a WC marked as "Solicited" arrives.
  • IB_CQ_NEXT_COMP: Knock when the next WC arrives, regardless of marking.

Here's a subtle pitfall: Calling this function does not itself generate a notification. It merely "subscribes" to the next one. If there are already WCs in the CQ before you call this, the function will not trigger a notification — you have to go poll those WCs out yourself.
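
This is why event-driven consumers use the classic "drain, rearm, drain again" pattern. The sketch below is kernel-style and deliberately incomplete — `process_wc` is a hypothetical handler, error handling is elided, and exact signatures vary slightly across kernel versions:

```c
/* Race-free rearm pattern (sketch, not complete driver code):
 * 1) Drain everything already sitting in the CQ.
 * 2) Rearm the notification.
 * 3) Poll once more: a WC may have slipped in between steps 1 and 2,
 *    and that WC will NOT trigger the notification we just requested. */
static void drain_and_rearm(struct ib_cq *cq)
{
    struct ib_wc wc;

    while (ib_poll_cq(cq, 1, &wc) > 0)
        process_wc(&wc);                    /* hypothetical handler */

    ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);  /* subscribe to the next WC */

    while (ib_poll_cq(cq, 1, &wc) > 0)      /* close the race window */
        process_wc(&wc);
}
```

Skipping the second drain is one of the most common ways an RDMA consumer ends up waiting forever for an event that will never come.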

6. Polling (ib_poll_cq) — The Core Operation

This is the most commonly used action. Regardless of whether you use notifications, you ultimately rely on this function to pull the data out.

int ib_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);

  • num_entries: How many you want to retrieve this time.
  • wc: The array to hold the results.
  • Return value: How many were actually retrieved.

Note that this is a "non-blocking" function. If there's no mail, it returns 0 — it never sleeps.


A Real-World Scenario: Draining the CQ and Checking for Errors

Let's write a typical "processing loop."

This code usually lives in your comp_handler callback or your main loop.

struct ib_wc wc;
int num_comp = 0;

// Keep pulling mail until the mailbox is empty
while (ib_poll_cq(cq, 1, &wc) > 0) {

    // Step 1: check the status code
    if (wc.status != IB_WC_SUCCESS) {
        // 💥 Something blew up
        printk(KERN_ERR "The Work Completion[%d] has a bad status %d\n",
               num_comp, wc.status);
        // Heavy cleanup is usually needed here, e.g. resetting the QP
        return -EINVAL;
    }

    // Step 2: check what kind of work it was (the opcode)
    switch (wc.opcode) {
    case IB_WC_SEND:
        // A Send completed
        printk("Send operation completed.\n");
        break;
    case IB_WC_RECV:
        // A Receive completed
        // wc.byte_len tells you how many bytes arrived
        printk("Received %d bytes\n", wc.byte_len);
        break;
    case IB_WC_RDMA_WRITE:
        // An RDMA Write completed
        break;
    default:
        // ... other operations
        break;
    }

    num_comp++;
}

⚠️ One Thing You Must Never Forget After you process an IB_WC_RECV, you must immediately post a new Receive WR to the RQ (Receive Queue). Why? Because the Receive Queue is a consumable resource. If you don't replenish it, the NIC will have nowhere to place incoming packets, triggering an RNR (Receiver Not Ready) error and breaking the flow. Many beginners writing RDMA for the first time find that the first packet goes through but the second one hangs — precisely because they forgot to post a new Receive.
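
A replenishing step might look like the following kernel-style sketch. It assumes the QP and a registered buffer already exist; `repost_recv` is an invented helper, error handling is elided, and the const-ness of the `bad_wr` parameter varies across kernel versions:

```c
/* Repost a receive buffer after consuming an IB_WC_RECV (sketch). */
static int repost_recv(struct ib_qp *qp, u64 addr, u32 length, u32 lkey)
{
    struct ib_sge sge = {
        .addr   = addr,     /* registered buffer the NIC may write into */
        .length = length,
        .lkey   = lkey,     /* local key from the MR registration */
    };
    struct ib_recv_wr wr = {
        .wr_id   = addr,    /* echoed back in wc.wr_id on completion */
        .sg_list = &sge,
        .num_sge = 1,
    };
    const struct ib_recv_wr *bad_wr;

    return ib_post_recv(qp, &wr, &bad_wr);
}
```

Calling something like this at the bottom of the IB_WC_RECV case keeps the RQ topped up and the RNR errors away.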


A Special Entity: XRC Domain

To wrap up the CQ discussion, there is one more entity worth mentioning: the XRC Domain.

This is a fairly advanced concept used for a transport mode called eXtended Reliable Connected (XRC).

Remember how we said earlier that a QP is a pair (SQ and RQ tightly bound)? XRC decouples them, enabling scenarios like "one super-server connecting to ten thousand clients."

An XRC Domain is an isolation domain. It defines "which XRC SRQs can communicate with each other."

  • If two QPs are in the same XRC Domain, they can share an SRQ.
  • If they aren't, even if they're physically connected, they are logically isolated from each other.

It's like being in an office building (RDMA Device): even though everyone is on the same physical network, to get through the door of the "Finance Department" (XRC Domain), you need a specific access card. This is primarily for resource and security isolation in large-scale clusters.


Chapter Echoes

At this point, the puzzle of RDMA core objects is nearly complete. From the lowest-level device, to the guiding AH, to the cargo-carrying MR, and now to today's CQ, which is responsible for "clocking out."

You'll notice that RDMA's design philosophy is remarkably consistent: all "management" is handled in the kernel (creation, modification, destruction), while all "data paths" are handed off to user space to issue commands directly.

The CQ is the bridge between these two worlds. It is a pure status queue — it doesn't care what the data looks like; it only cares about "success" or "failure." This minimalism is the fundamental reason it can run so fast.

In the next section, we will assemble these components and look at the ultimate executor: the Queue Pair (QP). That is where all instructions truly take off.