
13.2 RDMA Device — Who Takes Control of This Machine?

In the previous section, we spent considerable effort clearing the periphery: the SM directs traffic, the SMA takes orders, and the SA answers queries. The entire InfiniBand subnet operates like a precision machine, where every component knows its place.

But what does this have to do with your code?

When you write your first line of RDMA code, the first question you face isn't "how do I send data," but rather "how do I know if this machine has an RDMA NIC? And if it does, how do I take control of it?"

It's like driving a car—the first step isn't hitting the gas, but getting the keys first and confirming the car is actually yours to drive. In the kernel's RDMA stack, that key is the RDMA Device object, and the act of grabbing the key is registering a client.

13.2.1 Becoming an RDMA Client

The kernel doesn't just hand you a device out of nowhere. You must first identify yourself to the RDMA stack: "I'm a client, notify me when devices are added or removed."

This process is accomplished through ib_register_client().

This isn't just filling out a form. Once you register successfully, two things happen immediately:

  1. Retrospective notification: The RDMA stack immediately walks every device that already exists in the system and invokes your callback for each one. This means that whether your module loads before or after the hardware driver, you won't miss a single NIC.
  2. Hot-plug monitoring: If a new NIC is plugged in later (e.g., via hot-plug), your callback function will also be triggered.

Conversely, when your module unloads, you must call ib_unregister_client() to gracefully let go. If you forget this step, the kernel might try to call your callback after you're gone—the outcome is usually the kernel panic you dread the most.

The following code is the standard template for a kernel module plugging into the RDMA stack. Memorize it, or at least keep it handy, because it's the opening move for all RDMA kernel code:

#include <linux/module.h>
#include <rdma/ib_verbs.h>

/* Called by the kernel when a new device is added (or already exists) */
static void my_add_one(struct ib_device *device)
{
    /* Here you can query the device's capabilities, allocate resources,
     * and stash the device pointer for later use */
    printk(KERN_INFO "Device %s found\n", device->name);
}

/* Called by the kernel when a device is removed */
static void my_remove_one(struct ib_device *device)
{
    /* Important: release every resource tied to this device here!
     * If you don't clean up, unloading the module will hang or crash */
    printk(KERN_INFO "Device %s removed\n", device->name);
}

/* Define the client structure */
static struct ib_client my_client = {
    .name   = "my RDMA module",  /* pick a name for yourself */
    .add    = my_add_one,        /* callback on device addition */
    .remove = my_remove_one      /* callback on device removal */
};

/* Module initialization */
static int __init my_init_module(void)
{
    int ret;

    /* Register the client with the RDMA stack */
    ret = ib_register_client(&my_client);
    if (ret) {
        printk(KERN_ERR "Failed to register IB client\n");
        return ret;
    }

    return 0;
}

/* Module exit */
static void __exit my_cleanup_module(void)
{
    /* Unregister the client */
    ib_unregister_client(&my_client);
}

module_init(my_init_module);
module_exit(my_cleanup_module);
MODULE_LICENSE("GPL");

At this point, you might have a question: If I want to store some private data on this ib_device (like my own custom driver context), where should it go?

Maintaining a list in global variables certainly works, but it's messy and cumbersome when dealing with multi-device concurrency. The RDMA stack provides a more elegant mechanism: ib_set_client_data() and ib_get_client_data(). You can attach your private data to ib_device, much like sewing a pocket onto a garment. This way, when my_remove_one is called, you can accurately empty only the pocket belonging to that specific device without accidentally deleting data for other devices.
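
A minimal sketch of the pocket-sewing pattern, rewriting the two callbacks from the template above (my_dev_ctx is a hypothetical context structure; kzalloc/kfree come from <linux/slab.h>):

/* Hypothetical per-device context: the "pocket" */
struct my_dev_ctx {
    struct ib_device *device;
    /* ... your own fields ... */
};

static struct ib_client my_client;  /* fully defined as in the template */

static void my_add_one(struct ib_device *device)
{
    struct my_dev_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

    if (!ctx)
        return;
    ctx->device = device;
    /* Sew the pocket: attach ctx to this device, keyed by our client */
    ib_set_client_data(device, &my_client, ctx);
}

static void my_remove_one(struct ib_device *device)
{
    /* Empty exactly this device's pocket, nobody else's */
    struct my_dev_ctx *ctx = ib_get_client_data(device, &my_client);

    kfree(ctx);
}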

Beyond devices coming and going, all sorts of odd things can happen while a device is running—like a cable being unplugged, a port state changing, or a transmission error occurring. That's where ib_register_event_handler() comes in. It registers an asynchronous event handler; whenever the device so much as sneezes, your registered callback receives a notification. Remember to use the INIT_IB_EVENT_HANDLER macro to initialize this structure—don't fill it in manually.
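
A rough sketch of the registration (my_evt_handler, my_event_callback, and my_watch_events are hypothetical names; the return type of ib_register_event_handler() has varied across kernel versions, so this follows the classic form):

static struct ib_event_handler my_evt_handler;

static void my_event_callback(struct ib_event_handler *handler,
                              struct ib_event *event)
{
    /* event->event says what happened; event->element says where */
    if (event->event == IB_EVENT_PORT_ACTIVE ||
        event->event == IB_EVENT_PORT_ERR)
        printk(KERN_INFO "%s: port %u changed state\n",
               event->device->name, event->element.port_num);
}

/* Call this from my_add_one() once you hold the device pointer */
static void my_watch_events(struct ib_device *device)
{
    INIT_IB_EVENT_HANDLER(&my_evt_handler, device, my_event_callback);
    ib_register_event_handler(&my_evt_handler);

    /* Mirror this with ib_unregister_event_handler() in my_remove_one() */
}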

13.2.2 Querying the Device Inside Out

Now you have the ib_device pointer, like holding a remote control for a black box. But you don't yet know this box's capabilities or which features it supports. You need to "run a background check."

RDMA provides a set of ib_query_* functions that let you peek under the hood without altering the device state.

Querying global device attributes

ib_query_device() is the most macroscopic query. It returns the NIC's innate "factory settings"—such as which transport service types it supports, its resource ceilings (how many QPs it can hold, how large an MR can be), or whether it supports atomic operations. These attributes are static; as long as the firmware hasn't changed, they won't change.
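
A minimal sketch of this background check, following the older ib_query_device() interface the text describes (newer kernels cache these attributes in ib_device instead; the field names are from struct ib_device_attr):

static int my_check_device(struct ib_device *device)
{
    struct ib_device_attr dev_attr;
    int ret;

    ret = ib_query_device(device, &dev_attr);
    if (ret)
        return ret;

    /* Static "factory settings": stable until the firmware changes */
    printk(KERN_INFO "%s: max_qp=%d max_mr_size=%llu atomic_cap=%d\n",
           device->name, dev_attr.max_qp,
           (unsigned long long)dev_attr.max_mr_size, dev_attr.atomic_cap);
    return 0;
}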

Querying port state

Devices are typically multi-port. You need to care about whether a specific port is actually up. ib_query_port() is used to query a port's current state. Note that the attributes retrieved here are dynamic. For example, the port's state (DOWN or ACTIVE), its LID (assigned logical address), or the current link rate—these all change with network conditions.

Querying the link layer

RDMA networks aren't just InfiniBand; there's also RoCE running over Ethernet. How do you tell them apart? Call rdma_port_get_link_layer(). It tells you whether the physical substrate is IB or Ethernet. This is crucial for upper-layer protocols (like IPoIB) because they need to decide how to encapsulate packets.
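
A sketch covering both checks, the port query and the link-layer probe (my_check_port is a hypothetical helper):

static void my_check_port(struct ib_device *device, u8 port)
{
    struct ib_port_attr port_attr;

    if (ib_query_port(device, port, &port_attr))
        return;

    /* Dynamic state: may differ the next time you look */
    printk(KERN_INFO "%s port %u: state=%d lid=0x%x\n",
           device->name, port, port_attr.state, port_attr.lid);

    /* IB or Ethernet? Upper layers encapsulate differently */
    if (rdma_port_get_link_layer(device, port) == IB_LINK_LAYER_ETHERNET)
        printk(KERN_INFO "%s port %u is RoCE (Ethernet)\n",
               device->name, port);
}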

Querying address tables

In the previous section, we mentioned GIDs and P_Keys. They aren't globally unique; they exist as tables. You need to look up the tables to confirm (see the sketch after this list):

  • ib_query_gid(): Check what the GID at position N is in the port's GID table.
  • ib_find_gid(): Reverse lookup—if I know the GID, tell me its index in the table.
  • ib_query_pkey() / ib_find_pkey(): Similarly, for querying partition keys.
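
A sketch of both lookup directions, using the classic four-argument signatures (my_dump_tables is a hypothetical helper; slot 0 and the default P_Key 0xffff are just illustrative choices):

static void my_dump_tables(struct ib_device *device, u8 port)
{
    union ib_gid gid;
    u16 pkey, index;

    /* Forward lookups: what sits at slot 0 of this port's tables? */
    if (!ib_query_gid(device, port, 0, &gid))
        printk(KERN_INFO "GID[0] subnet prefix: 0x%llx\n",
               (unsigned long long)
               be64_to_cpu(gid.global.subnet_prefix));
    if (!ib_query_pkey(device, port, 0, &pkey))
        printk(KERN_INFO "P_Key[0]: 0x%04x\n", pkey);

    /* Reverse lookup: which slot holds the default P_Key 0xffff? */
    if (!ib_find_pkey(device, port, 0xffff, &index))
        printk(KERN_INFO "default P_Key at index %u\n", index);
}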

Once you've queried all of these, your understanding of the hardware is sufficient to support subsequent operations. But before manipulating resources, there's one more concept we must clarify—Protection Domain (PD).

13.2.3 Protection Domain (PD) — The Resource Isolation Sandbox

Imagine working in a massive open-plan office. Without walls separating you, anyone could walk up to your desk, take your files, or spill coffee on your keyboard. In a domain like RDMA that directly manipulates memory, this is an absolutely unacceptable disaster.

The PD exists as a firewall to prevent this kind of "crosstalk."

A PD is a collection of resources (such as QPs, MRs, and AHs). There's only one rule:

Resources belonging to PDx must absolutely never work together with resources belonging to PDy.

If you try to use a QP from PDx to access an MR registered in PDy, the hardware will flat-out reject the operation and return an error. This seems rigid, but it provides a mandatory security isolation mechanism.

Typically, if you're just writing a simple driver, creating a single global PD is enough—throw all your resources into it. But if you're writing a highly sensitive multi-tenant system (like virtualized RDMA in a cloud environment), you might need to allocate an independent PD for each tenant or each remote connection, ensuring data doesn't interfere with each other.

Creating and destroying a PD is very simple, but you must follow this order:

  1. Allocate: Call ib_alloc_pd(device). You need to pass in the ib_device pointer you obtained earlier.
  2. Use: Attach the returned struct ib_pd * to the QPs, MRs, and other resources you're about to create.
  3. Destroy: When you unload your driver, or no longer need all resources under this PD, call ib_dealloc_pd().

⚠️ Warning Never release a PD while resources still reference it. It's like pulling a chair out from under someone who's still sitting in it—the consequences are unpredictable. Make sure to release your QPs and MRs first, and only release the PD last.
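
In code, the whole lifecycle is a few lines (a sketch inside a function that already holds the device pointer, using the older single-argument ib_alloc_pd(); newer kernels take an extra flags argument):

struct ib_pd *pd;

pd = ib_alloc_pd(device);
if (IS_ERR(pd))
    return PTR_ERR(pd);   /* allocation failed */

/* ... create QPs, MRs, and AHs inside this PD and use them ... */

/* Teardown strictly in reverse: destroy QPs/MRs/AHs first, PD last */
ib_dealloc_pd(pd);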

13.2.4 Address Handle (AH) — The Directional Signpost

With a device and an isolation sandbox in place, we're only one step away from sending data: Who do we send it to?

In Unreliable Datagram (UD) mode, this question is more complex than you'd think. You can't just fill in an IP address and call it done, because the RDMA network path might include multiple routes and switches, and there are also QoS (Quality of Service) requirements.

An Address Handle (AH) is a detailed "signpost" or "navigation map." When you send a UD message, you don't need to tell the NIC "go to the third switch then turn left"—you just hand this AH to the NIC. The AH contains all the path information from the local port to the destination port (such as the destination LID, GID, QoS parameters, etc.).

There are two ways to create an AH, depending on whether you're "replying":

1. Initiating a connection (you speak first). You know the peer's address (through querying the SA or a configuration file), so you directly call ib_create_ah(pd, attr). Here, attr is the struct ib_ah_attr you fill out yourself, packed with all sorts of addressing information about the destination.

2. Replying passively (you speak second). You just received a UD packet and now want to send one back. This is easy: the Work Completion (ib_wc) from the packet you just received already contains the sender's path information. You don't need to piece together the address yourself—just call ib_init_ah_from_wc(), toss in the WC, and the kernel will automatically extract the path information and initialize the AH. If you want to do it in one step, you can directly use ib_create_ah_from_wc().
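
A sketch of both paths (my_make_ah and my_reply_ah are hypothetical helpers; the fields follow the classic struct ib_ah_attr):

/* Initiating: you already know the peer's LID, e.g. from an SA query */
static struct ib_ah *my_make_ah(struct ib_pd *pd, u16 remote_lid, u8 port)
{
    struct ib_ah_attr attr = {
        .dlid     = remote_lid,   /* destination LID */
        .sl       = 0,            /* service level (QoS) */
        .port_num = port,         /* local port to send from */
    };

    return ib_create_ah(pd, &attr);   /* ERR_PTR on failure */
}

/* Replying: let the kernel lift the path out of the received WC */
static struct ib_ah *my_reply_ah(struct ib_pd *pd, struct ib_wc *wc,
                                 struct ib_grh *grh, u8 port)
{
    return ib_create_ah_from_wc(pd, wc, grh, port);
}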

Reusing AHs

An AH is read-only. Once created, the destination address it points to is locked in. This also means that if you're sending many messages to the same destination, you can create a single AH and reuse it. Even if you have 100 QPs, as long as they all target the same node, those 100 QPs can share this one AH. This saves considerable memory and CPU overhead.

When you've completely parted ways with that node (connection disconnected), remember to call ib_destroy_ah() to tear down the signpost. If you destroy an AH while it's still being referenced by a QP, send operations might fail because they can't find the way.


At this point, we've prepared the venue (Device), the walls (PD), and the signposts (AH).

But these are just static infrastructure. The core of RDMA lies in "movement"—how does data flow through queues? How is memory accessed by remote machines?

In the next section, we'll dive into the most core objects of RDMA: Queue Pair (QP) and Memory Region (MR). That's where data truly flows.