13.5 Shared Receive Queue (SRQ)
We've already covered QPs, CQs, and the various fancy fields. You might feel that the RDMA object model resembles a set of Russian nesting dolls, layer after layer.
But if your server needs to handle tens of thousands of concurrent connections, you'll find that the old rule of "each QP gets its own dedicated receive queue" starts to become unreasonable.
Imagine this scenario: you have 10,000 clients connected. Any client might send you a message at any given moment, but the vast majority of the time, they are silent. Under the old rules, you have to prepare receive buffers for all 10,000 QPs—just in case one suddenly sends data.
It's like inviting 10,000 people to dinner. To prevent anyone from suddenly getting hungry, you lay out a full banquet at every single table. The result? Only 100 people are eating, while the food on the remaining 9,900 tables slowly goes cold, or even spoils.
Memory is a scarce resource, and servers cannot afford this kind of luxury.
This brings us to the star of this section: the Shared Receive Queue (SRQ).
Why Do We Need SRQ?
The core idea behind SRQ is very simple: change receive resources from "private" to "public".
Instead of each QP guarding its own pile of Receive Requests, all QPs connect to a single large pool. Whoever receives data grabs a buffer from the pool to hold it.
If you have N QPs, and each QP might receive up to M burst messages at any given moment:
- Without SRQ: you must dutifully post N * M receive requests. Even if 99% of the connections are idle, that memory stays pinned.
- With SRQ: you only need to post K * M (where K << N). As long as you ensure there are always buffers available in the pool, you don't need to worry about a specific QP suddenly flooding you with data.
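To put rough numbers on it (purely illustrative figures, not measurements): with N = 10,000 QPs, M = 16 buffers each, and 4KB per buffer, the non-shared scheme pins 10,000 × 16 × 4KB ≈ 640MB of receive buffers. If at most a few hundred QPs are ever active at once, an SRQ sized for K = 500 needs only 500 × 16 × 4KB ≈ 32MB, a 20× saving, before we even talk about larger buffers.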
This sounds perfect, but it's actually a classic "resource pooling" trade-off in engineering: you exchange management complexity for resource utilization.
The Cost of Pooling: Loss of Control and Watermarks
SRQ is not without its costs.
Once you share the receive queue, you lose an important level of control: you no longer know exactly which QP will take away a given buffer.
In non-shared mode, you know that QP A handles small packets and QP B handles large packets, so you can post small buffers to QP A and large buffers to QP B.
But in SRQ mode, when you throw a buffer into the pool, you have no idea which QP will pick it up. It could be a small QP processing logs, or a large QP transferring huge files.
This leads to a hard constraint: all Receive Requests posted to the SRQ must have buffer sizes large enough to accommodate the largest message among all associated QPs.
If you have two QPs—one sending 64B heartbeat packets and another sending 4MB data blocks—unfortunately, to accommodate the big guy, all the buffers in your pool must be at the 4MB level. Sounds wasteful? Yes, this is indeed a pain point of SRQ. The usual solution is tiering—creating two SRQs, one dedicated to small-packet QPs and another to large-packet QPs, to reduce memory waste.
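Here is a minimal sketch of that tiering idea, using the ib_create_srq() call we'll meet in the next subsection (the queue depths are illustrative assumptions, and IS_ERR() checks are omitted for brevity):

struct ib_srq_init_attr small_attr, large_attr;
struct ib_srq *srq_small, *srq_large;

memset(&small_attr, 0, sizeof(small_attr));
memset(&large_attr, 0, sizeof(large_attr));

/* Pool for the heartbeat-style QPs: deep queue, small buffers posted to it */
small_attr.attr.max_wr  = 4096;
small_attr.attr.max_sge = 1;
srq_small = ib_create_srq(pd, &small_attr);

/* Pool for the bulk-data QPs: shallower queue, 4MB buffers posted to it */
large_attr.attr.max_wr  = 256;
large_attr.attr.max_sge = 1;
srq_large = ib_create_srq(pd, &large_attr);

Note that the buffer size is not an SRQ attribute: it is determined by whatever you post to each pool, so the tiering discipline lives in your posting code.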
Beyond the risk of uncontrolled buffer sizes, there is an even trickier problem: what happens when the pool runs dry?
In normal QP mode, if a specific QP's receive queue empties, it only affects itself. But in SRQ mode, once the pool is exhausted, all QPs attached to it will "starve," leading to dropped packets.
This introduces an exclusive feature of SRQ: the watermark.
SRQ allows you to set a srq_limit threshold. When the number of available receive requests in the pool drops below this value, the hardware triggers an asynchronous event (IB_EVENT_SRQ_LIMIT_REACHED) to notify you: "Hey, it's running dry, replenish it now!"
This gives you a chance to catch your breath. You can post new receive requests into the pool via ib_post_srq_recv() before it is completely exhausted.
SRQ Lifecycle Management
Let's look at how we manage this in the kernel.
Creating an SRQ
Like the MRs and QPs we've seen before, an SRQ cannot exist in a vacuum; it must belong to a Protection Domain (PD). This ensures that only QPs within the same security domain can share this queue.
We call ib_create_srq().
struct ib_srq *ib_create_srq(struct ib_pd *pd,
struct ib_srq_init_attr *srq_init_attr);
You need to pass in the PD and an initialization attribute structure, srq_init_attr. Inside its nested attr field you specify the maximum number of outstanding receive requests (max_wr), the maximum number of SGEs per request (max_sge), and so on; the structure also lets you register an asynchronous event handler (event_handler) plus a context pointer (srq_context), which is exactly where the watermark notification discussed above will arrive.
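A minimal creation sketch (my_srq_event_handler, defined a bit further below, and my_ctx are our own illustrative names, as are the queue depths):

static void my_srq_event_handler(struct ib_event *event, void *ctx);

struct ib_srq_init_attr init_attr;
struct ib_srq *srq;

memset(&init_attr, 0, sizeof(init_attr));
init_attr.event_handler = my_srq_event_handler; /* async events land here */
init_attr.srq_context   = my_ctx;               /* handed back to the handler */
init_attr.attr.max_wr   = 1024;                 /* pool holds up to 1024 RRs */
init_attr.attr.max_sge  = 1;                    /* one SGE per receive request */

srq = ib_create_srq(pd, &init_attr);
if (IS_ERR(srq))
    return PTR_ERR(srq);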
Modifying SRQ Attributes
Creation isn't the end of it. As mentioned earlier, the watermark may need dynamic adjustment, and some hardware even lets you resize the SRQ on the fly (via the IB_SRQ_MAX_WR mask).
This is where ib_modify_srq() comes in.
Here is a practical scenario: I want to receive an alert when the SRQ's remaining request count drops below 5.
struct ib_srq_attr srq_attr;
int ret;

memset(&srq_attr, 0, sizeof(srq_attr));
srq_attr.srq_limit = 5; /* set the watermark to 5 */

ret = ib_modify_srq(srq, &srq_attr, IB_SRQ_LIMIT);
if (ret) {
    printk(KERN_ERR "Failed to set the SRQ's limit value\n");
    return ret;
}
Here we use the IB_SRQ_LIMIT mask, telling the kernel that we want to modify the watermark attribute. Once set successfully, when the count of available receive requests inside the SRQ drops below this number, your registered event handler will be woken up. One subtlety: the limit event is one-shot. After it fires, the watermark is disarmed (reset to 0), so you must call ib_modify_srq() again to re-arm it.
⚠️ Pitfall Warning: don't set the watermark too low. If a burst of traffic consumes the remaining 5 requests before your handler gets a chance to run, you will still drop packets. Leave yourself some margin, such as 5% or 10% of the queue depth. And note that a srq_limit of 0 doesn't mean "alert when empty": it disables the watermark event entirely.
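Putting the pieces together, here is a sketch of what the handler we registered at creation time might do. struct my_server and its refill_work member are illustrative assumptions; the work item would loop over ib_post_srq_recv() to top up the pool and then re-arm the watermark via ib_modify_srq():

static void my_srq_event_handler(struct ib_event *event, void *ctx)
{
    struct my_server *srv = ctx; /* the srq_context registered at creation */

    switch (event->event) {
    case IB_EVENT_SRQ_LIMIT_REACHED:
        /*
         * The pool dropped below the watermark. This handler may run
         * in atomic context, so just kick a work item that replenishes
         * the pool and re-arms the watermark from process context.
         */
        schedule_work(&srv->refill_work);
        break;
    case IB_EVENT_SRQ_ERR:
        printk(KERN_ERR "SRQ fell into an error state\n");
        break;
    default:
        break;
    }
}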
Querying an SRQ
If you forget what you set, or want to check the current state, you can use ib_query_srq().
int ib_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr);
This is typically used for debugging or monitoring. If you find that the queried srq_limit is 0, it means you haven't set a watermark before (or you turned it off).
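For example, a quick monitoring probe might look like this (a sketch; the debug print is ours):

struct ib_srq_attr attr;

if (!ib_query_srq(srq, &attr))
    printk(KERN_DEBUG "SRQ state: max_wr=%u max_sge=%u srq_limit=%u\n",
           attr.max_wr, attr.max_sge, attr.srq_limit);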
Destroying an SRQ
When it's all over and sharing is no longer needed, call ib_destroy_srq(). But before destroying it, make sure all QPs associated with it have already been destroyed: the kernel tracks the attachment count and will refuse the call with -EBUSY while any QP is still using the SRQ.
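In sketch form, the teardown order looks like this:

/* First tear down every QP attached to this SRQ... */
ib_destroy_qp(qp); /* repeat for each attached QP */

/* ...then release the shared pool itself */
ret = ib_destroy_srq(srq);
if (ret)
    printk(KERN_ERR "SRQ is still in use (ret=%d)\n", ret);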
The Real Work: Posting Receive Requests
Creating an SRQ just sets the stage; what actually gets data flowing is posting Work Requests (WRs) into it.
In a normal QP, we use ib_post_recv(). For SRQ, the API becomes ib_post_srq_recv().
The logic is the same: hang one or more ib_recv_wr (Receive Request) structures on a linked list and hand the list to the hardware. But note: because the SRQ is shared, your wr_id must be designed carefully enough that when you retrieve a Work Completion (WC), you can trace it back to the right buffer and context.
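One common pattern (a sketch; struct rx_desc and its fields are our own illustration, not a kernel type) is to wrap each buffer in a small descriptor and stash the descriptor's pointer in wr_id, so the completion path recovers everything with a single cast:

struct rx_desc {
    void         *buf;      /* kernel virtual address of the buffer */
    u64           dma_addr; /* DMA address posted to the hardware */
    struct ib_mr *mr;       /* the MR covering this buffer */
};

/* When posting:   wr.wr_id = (uintptr_t)desc;          */
/* On completion:  desc = (struct rx_desc *)wc->wr_id;  */

The simpler variant, stuffing the DMA address itself into wr_id, is what the example below uses.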
Below is a standard SRQ receive posting flow. We post a receive request, telling the hardware: "If data arrives, put it at this DMA address."
struct ib_recv_wr wr, *bad_wr;
struct ib_sge sg;
int ret;

/* 1. Prepare the SGE (scatter/gather entry): where the data should land */
memset(&sg, 0, sizeof(sg));
sg.addr   = dma_addr;  /* DMA address */
sg.length = len;       /* buffer length */
sg.lkey   = mr->lkey;  /* local key */

/* 2. Prepare the Work Request */
memset(&wr, 0, sizeof(wr));
wr.next    = NULL;                /* a single request, no next */
wr.wr_id   = (uintptr_t)dma_addr; /* stash the address in wr_id for trace-back */
wr.sg_list = &sg;                 /* point at the SGE */
wr.num_sge = 1;                   /* just one SGE */

/* 3. Post it! */
ret = ib_post_srq_recv(srq, &wr, &bad_wr);
if (ret) {
    printk(KERN_ERR "Failed to post Receive Request to an SRQ\n");
    return ret;
}
Here is a detail worth noting: the bad_wr parameter.
If ib_post_srq_recv fails, it puts the pointer to the first failed WR into bad_wr. This is extremely useful during bulk posting (linked-list posting)—it can tell you "which link in this chain broke," rather than simply telling you "it failed." But in a single-post example like this, it primarily serves as an error-checking flag.
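To make that concrete, here is a sketch of linked-list posting (BATCH, dma_base, and buf_size are illustrative assumptions; the buffers are consecutive slices of one registered region):

#define BATCH 16

struct ib_recv_wr wrs[BATCH], *bad_wr = NULL;
struct ib_sge sges[BATCH];
int i, ret;

/* Chain BATCH receive requests into a single linked list */
for (i = 0; i < BATCH; i++) {
    sges[i].addr   = dma_base + i * buf_size; /* the i-th slice of the region */
    sges[i].length = buf_size;
    sges[i].lkey   = mr->lkey;

    memset(&wrs[i], 0, sizeof(wrs[i]));
    wrs[i].wr_id   = (uintptr_t)(dma_base + i * buf_size);
    wrs[i].sg_list = &sges[i];
    wrs[i].num_sge = 1;
    wrs[i].next    = (i == BATCH - 1) ? NULL : &wrs[i + 1];
}

/* One call posts the whole chain */
ret = ib_post_srq_recv(srq, &wrs[0], &bad_wr);
if (ret) {
    /*
     * bad_wr points at the first WR that failed: everything before
     * it was accepted, everything from it onward was not.
     */
    printk(KERN_ERR "SRQ bulk post failed at WR %ld\n",
           (long)(bad_wr - wrs));
    return ret;
}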
Revisiting That Diagram
Going back to our earlier diagram (Figure 13-4). What you see now is no longer three isolated QPs hanging there alone, but the "receive ends" of all three QPs converging into the SRQ below.
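In code, that convergence boils down to one field at QP-creation time (a sketch; send_cq and recv_cq are assumed to exist already, and the send-queue depths are illustrative):

struct ib_qp_init_attr qp_attr;
struct ib_qp *qp;

memset(&qp_attr, 0, sizeof(qp_attr));
qp_attr.send_cq = send_cq;
qp_attr.recv_cq = recv_cq;
qp_attr.qp_type = IB_QPT_RC;
qp_attr.cap.max_send_wr  = 64;
qp_attr.cap.max_send_sge = 1;
/* No recv-queue capacity here: the receive side draws from the shared pool */
qp_attr.srq = srq;

qp = ib_create_qp(pd, &qp_attr);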
This is RDMA's ultimate answer to the "receive-side scalability" problem.
It's not perfect (you have to swallow the waste of oversized buffers, and you need to handle the watermark logic), but in high-concurrency scenarios it is the pragmatic way to keep both memory usage and latency under control.