
Chapter 13: The Cost of Bypassing the Kernel

There is a class of problems that appear to be about "network performance," but are actually about "who is paying for it."

When you send data using traditional TCP/IP sockets, you might think you're simply moving data from point A to point B. In reality, you're paying for the three-way handshake, paying for context switches between kernel and user space, paying for data being copied back and forth between kernel buffers—not to mention the NIC driver that interrupts the CPU every time it receives a single packet.

For most applications, this tax is acceptable. But if you're building high-frequency trading systems, large-scale distributed storage, or million-node compute clusters, this "tax" isn't just expensive—it's unacceptable.

This is the moment for our main character to take the stage: RDMA (Remote Direct Memory Access).

Its promises sound like cheating: directly accessing a remote machine's memory without CPU involvement, without kernel intervention, and without even copying data. This isn't just a protocol upgrade; it's a "jailbreak" from the traditional network stack.

But freedom comes at a price. RDMA discards the "nanny-style services" of the traditional kernel protocol stack and dumps enormous complexity—connection management, memory registration, error handling—onto the developer. If you think you can get a performance boost just by swapping out an API, you'll most likely end up with a pile of crashes and bizarre memory errors.

Our mission in this chapter is to walk you through this "jailbreak" path. We'll first figure out what RDMA actually is (and why it requires so much hardware support), then dive into the Linux kernel's drivers/infiniband directory to see how this system is built within the kernel. By the end of this chapter, you won't just be amazed by "zero-copy"; you'll start questioning why it took us so many years to take this step.


13.1 RDMA and InfiniBand — An Overview

13.1.1 What is RDMA

Before rushing into the code, let's establish the most important mental model: what exactly does RDMA do?

RDMA stands for Remote Direct Memory Access. As the name implies, it allows one machine to directly read from and write to another machine's memory. Note the word "directly" here:

  • No remote CPU involvement: The remote machine's CPU doesn't even know its memory was read or written.
  • No kernel intervention: Data bypasses the kernel protocol stack and is transferred directly between the NIC and user-space memory.

To achieve this, we need a completely new set of hardware and protocol stacks. Currently, there are three mainstream network protocols that support RDMA:

  1. InfiniBand (IB): This is a brand-new network architecture designed specifically for high performance. It starts from scratch, discarding the baggage of traditional Ethernet. The specification is maintained by the InfiniBand Trade Association (IBTA), and you can look up the InfiniBand Architecture Specification, which is as thick as a brick.
  2. RoCE (RDMA over Converged Ethernet): pronounced "Rocky". Since InfiniBand switches are too expensive, why not run RDMA over existing Ethernet? That's RoCE. It keeps the InfiniBand transport layer but carries it over an Ethernet link layer (and, in RoCE v2, over UDP/IP), making it a "hybrid" solution. Its specification is published as annexes to the InfiniBand specification.
  3. iWARP (Internet Wide Area RDMA Protocol): This is another path, attempting to implement RDMA on top of the standard TCP/IP protocol stack. Maintained by the RDMA Consortium, this is your choice if you don't want to modify the underlying network infrastructure and just want to play with RDMA over wide-area networks.

Despite the vast differences in the underlying physical and link layers (fiber vs. twisted pair, lossless networks vs. Ethernet), the APIs they expose are completely unified.

This unified API is called Verbs. You can think of it as the "system call table" of the RDMA world. Whether it's InfiniBand, RoCE, or iWARP, your client code only needs to operate the hardware through the Verbs API (i.e., the ib_* family of functions), and the underlying differences are abstracted away by kernel modules and drivers.

This RDMA subsystem was introduced in Linux kernel version 2.6.11. Initially, it only supported InfiniBand, and it wasn't until subsequent years that iWARP and RoCE were added. So when you see include/rdma/ib_verbs.h in the kernel source, don't be fooled by the ib (InfiniBand) in the name—it's generic.

❌ Pitfall Warning There is a legacy issue here: although the API is collectively referred to as RDMA, a massive number of kernel functions, structures, and filenames still start with ib_ (referring to InfiniBand). For example, you'll see ib_register_client in the code, but it can actually register RoCE devices too. Don't get hung up on the names; just treat them as generic RDMA interfaces.
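
To make the "generic despite the ib_ prefix" point concrete, here is a minimal sketch of a kernel module registering itself as an RDMA client. It is illustrative only: the ib_client callback signatures have shifted across kernel versions (for example, .add returning an int and .remove taking a client_data pointer in newer trees), so check include/rdma/ib_verbs.h in your kernel before reusing it.

```c
#include <linux/module.h>
#include <rdma/ib_verbs.h>

/* Called once for every RDMA device the core knows about: IB, RoCE, or iWARP. */
static int my_add_one(struct ib_device *device)
{
	pr_info("my_rdma_consumer: found device %s\n", device->name);
	return 0;
}

static void my_remove_one(struct ib_device *device, void *client_data)
{
	pr_info("my_rdma_consumer: device %s going away\n", device->name);
}

static struct ib_client my_client = {
	.name   = "my_rdma_consumer",
	.add    = my_add_one,
	.remove = my_remove_one,
};

static int __init my_init(void)
{
	/* Despite the ib_ prefix, this hooks us into every RDMA transport. */
	return ib_register_client(&my_client);
}

static void __exit my_exit(void)
{
	ib_unregister_client(&my_client);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```

The .add callback is where a real consumer would allocate its per-device resources (protection domains, completion queues, queue pairs), which is exactly the pattern the upper-layer protocols in drivers/infiniband/ulp follow.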

Before diving in, here are a few trivia points about this API:

  • Mixed function styles: Some functions are inline and some are not. This isn't an arbitrary quirk of the designers, but a product of performance optimization. Future kernel versions might change this, so be careful when writing drivers.
  • Who uses it: The ib_verbs.h header file serves three types of audiences:
    1. RDMA stack core code: The ones responsible for maintaining order.
    2. Low-level hardware drivers: The drivers/infiniband/hw from various vendors (Mellanox, Intel, etc.).
    3. Upper-layer consumers: Kernel modules that use RDMA (such as NFS/RDMA, ISER, or your own custom driver).

For the rest of this book, we'll primarily play the third role: consumers. What we care about is how to put the hardware to work through the Verbs API, not how to write hardware drivers.


13.1.2 What Does the RDMA Stack Look Like in the Kernel

If you open the kernel source, the vast majority of RDMA-related code is hidden in the drivers/infiniband directory. This name is also a bit misleading, because it contains both InfiniBand-specific things and logic for handling RoCE and iWARP.

Let's dismantle this like a server rack and see what modules are inside:

  • core/cm.c (Communication Manager): The communication manager. Just like how dating might need a matchmaker, establishing an RDMA connection between two nodes isn't as simple as saying hello—it requires negotiating parameters and exchanging keys. CM is responsible for these chores.
  • core/verbs.c (Kernel Verbs): This is the core API implementation layer. Most of the ib_* verbs you call (ib_post_send and its siblings) can be traced back to the logic here.
  • core/uverbs_*.c (User Verbs): User-space Verbs. Although we mainly focus on kernel space, a large number of RDMA applications actually run in user space. This mechanism lets user programs set up RDMA resources through character-device calls (ioctl and friends); once setup is done, the data path talks to the hardware directly, bypassing the kernel entirely.
  • core/mad.c (Management Datagram): Management datagrams. There are special "management packets" in RDMA networks, such as configuring switches or querying port states. These don't go through the data path; instead, they are handled by the MAD module.
  • ulp/ (Upper Layer Protocols): The upper-layer protocols. RDMA is just a transport channel; what business runs on top of it is determined here:
    • ipoib: IP over InfiniBand. Since we have an RDMA network, can we run regular TCP/IP applications directly on it? IPoIB is this adaptation layer; it makes an IB NIC look like a regular NIC.
    • iser: iSCSI Extensions for RDMA. Moving the iSCSI storage protocol onto RDMA for explosive performance.
    • srp: SCSI RDMA Protocol. Another storage protocol.

You can imagine this stack as a skyscraper. core/ is the foundation and plumbing, hw/ is the various utility providers (hardware drivers) plugging into it, and ulp/ are the tenants renting space inside to open shops.
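
Mapped onto the actual source tree, the layout of a recent kernel looks roughly like this (contents vary by version; only a few representative entries are shown):

```
drivers/infiniband/
├── core/    # verbs.c, cm.c, mad.c, uverbs_*.c (the common stack)
├── hw/      # vendor hardware drivers (mlx4, mlx5, ...)
├── sw/      # software RDMA implementations (rxe, siw)
└── ulp/     # upper-layer protocols (ipoib, iser, srp, ...)
```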


13.1.3 Why Bother? — The Technical Advantages of RDMA

Before diving into protocol details, we must first answer a question: why would we abandon such a mature, stable, and easy-to-use TCP/IP in favor of this complex RDMA?

The answer lies in these four killer features:

1. Zero-Copy In traditional network transmission, data runs a long gauntlet: user buffer -> kernel socket buffer -> protocol stack processing -> NIC driver ring buffer -> DMA to the NIC. RDMA says: stop the hassle. It allows the NIC to perform DMA operations directly on user-space memory. Data is moved only once, between "local user memory" and "remote user memory." There are no intermediate copies.

2. Kernel Bypass This is an incredibly tempting feature. When an application sends data, it doesn't need to trap into kernel mode, doesn't need to make system calls (like sendmsg), and doesn't need to wake up kernel softirqs to process packets. The application issues commands directly to the NIC's registers (mapped into user space), and the NIC fetches the data directly. Context switches? Non-existent.

3. CPU Offload Not only is the kernel bypassed, but even the CPU's work is stolen away. RDMA NICs typically contain powerful dedicated processors. They handle the transport protocol, calculate checksums, retransmit lost packets, and even process RDMA atomic operations. For the receiver, this is a bizarre experience: memory suddenly gets overwritten, but CPU utilization doesn't budge, because it has no idea what just happened.

4. Low Latency and High Bandwidth While these are hardware metrics, thanks to the architectural advantages mentioned above, RDMA's latency is absurdly low.

  • Latency: In small-message scenarios, latency can drop to a few hundred nanoseconds. Yes, nanoseconds. On Ethernet, just the protocol stack processing alone could take several microseconds.
  • Bandwidth: InfiniBand's bandwidth scales extremely well. The same protocol technology can easily scale from 2.5 Gbps to 120 Gbps or even higher. In contrast, Ethernet standards (like 10G, 25G, 40G, 100G) often require changing physical layer technologies with each generational upgrade.

13.1.4 Hardware Components: It's Not Just a NIC

Understanding RDMA requires first understanding its underlying hardware topology, because many concepts in the software API (like LID, GID) exist to serve this hardware.

Let's look at the roles in an InfiniBand architecture:

1. HCA (Host Channel Adapter) You can think of this as a "super NIC." This isn't some ten-dollar Realtek network card. An HCA is an intelligent device sitting on the PCIe bus, with its own DMA engine and memory management unit. It acts as both the initiator and receiver of packets. It is responsible for executing those Verbs commands (such as send, receive, RDMA read/write).

2. Switches But the switches here are quite different from Ethernet switches. InfiniBand switches are very "dumb" (in a good way). They don't learn MAC addresses, don't run the Spanning Tree Protocol, and don't analyze packets much. Their forwarding table is remotely configured by a god's-eye-view entity called the SM (Subnet Manager). It only does one thing: receive a packet, look up the table, and throw it out of a specific port. This simplicity results in extremely low forwarding latency. (Note: InfiniBand doesn't support broadcast, only multicast, so switches need to know how to replicate multicast packets.)

3. Routers If switches are responsible for connecting the same subnet, then routers are responsible for connecting different InfiniBand subnets. This is only used in large-scale clusters.

4. Subnet A subnet is a group of connected HCA, switch, and router ports. It is the basic unit of management and addressing.


13.1.5 Addressing Mechanism: Where Is It?

In Ethernet we use MAC addresses, and at the IP layer we use IP addresses. In InfiniBand, things become more complex—and more rigorous. We have three main types of IDs:

1. GUID (Globally Unique Identifier) This is the "ID number" assigned when the hardware leaves the factory. 64-bit, globally unique.

  • Node GUID: Every node (device) has one.
  • Port GUID: Every port (like a specific network interface on an HCA) has one.
  • System GUID: If a device consists of multiple chips (like a core switch), they share the same System GUID, indicating they belong to the same whole.

Analogy time: A GUID is like a combination of a human's fingerprint and DNA—theoretically globally unique and unchangeable.

2. GID (Global IDentifier) GUIDs are for hardware; GIDs are for routing. A GID is 128 bits, formed from the 64-bit subnet prefix plus the port GUID. Its format is exactly the same as an IPv6 address (which is why RoCE can conveniently merge into IPv6 networks). Each port has at least one GID, stored at index 0 of the GID table (a kernel-side query sketch follows the LID ranges below).

3. LID (Local IDentifier) This is the address actually used inside the packet header to run within a subnet. It's a short, 16-bit address assigned by the SM (Subnet Manager). Why do we need LIDs when we have GUIDs? Imagine if every packet had to carry a 64-bit or 128-bit address—the lookup pressure on the switches would be enormous, and routing would be too complex. The SM assigns a short LID to each port (ranging from 0x0001 to 0xBFFF), and switches only need to quickly forward based on this 16-bit LID.

  • Unicast LID: 0x0001 - 0xBFFF
  • Multicast LID: 0xC000 - 0xFFFE
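
As a rough illustration of how a kernel consumer reads these identifiers back, here is a sketch that prints a port's LID and its index-0 GID. Assume dev comes from an ib_client .add callback; the GID query helper has been renamed over the years (ib_query_gid in older trees, rdma_query_gid in newer ones), so treat the exact call as version-dependent.

```c
#include <rdma/ib_verbs.h>

static void dump_port_addressing(struct ib_device *dev, u32 port_num)
{
	struct ib_port_attr attr;
	union ib_gid gid;

	if (ib_query_port(dev, port_num, &attr))
		return;

	/* The 16-bit LID handed out by the Subnet Manager, plus the SM's own LID. */
	pr_info("port %u: lid 0x%04x, sm_lid 0x%04x\n",
		port_num, attr.lid, attr.sm_lid);

	/* Default GID at table index 0: 64-bit subnet prefix + 64-bit port GUID. */
	if (!rdma_query_gid(dev, port_num, 0, &gid))
		pr_info("port %u: gid %pI6\n", port_num, gid.raw);
}
```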

13.1.6 Key Features: Beyond Speed, There's Security and Isolation

If you thought RDMA was just about speed, you're underestimating it. It also solves many enterprise-level problems:

1. P_Key (Partition Key) — Virtual Isolation You have one physical switch, but you want to completely isolate the "production environment" from the "test environment," just like two VLANs. In InfiniBand, this is called partitioning. Each port maintains a P_Key table. Each Queue Pair (QP, the core object for sending and receiving data) is associated with a P_Key. The rule is simple: communication can only occur when the P_Keys on both ends match, and at least one end has "full member" privileges. It's like fitting different doors with different keys.

2. Q_Key (Queue Key) — The Safety Lock for UDP Mode UD (Unreliable Datagram) mode is like UDP—anyone can send packets. To prevent random spamming, InfiniBand introduces the Q_Key. A UD QP will only accept a packet if the Q_Key carried in the received packet matches its own configured Q_Key. This greatly reduces the risk of malicious interference.

3. VL (Virtual Lanes) — Traffic Lanes Imagine a physical fiber as a highway. To prevent "VIP" packets (like storage data) from being blocked by "civilian" packets (like heartbeat packets), InfiniBand introduces Virtual Lanes (VL). Each physical link can be divided into multiple virtual links, and each VL has its own independent buffer. Used in conjunction with Service Level (SL), you can tag different types of traffic and map them to different VLs, thereby achieving hardware-level QoS (Quality of Service).

4. Failover — Automatic Fault Switchover An RDMA QP can be configured with two paths: a primary path and an alternate path (InfiniBand calls this Automatic Path Migration, APM). If the primary path goes down (a fiber gets unplugged, a switch crashes), the hardware automatically switches to the alternate path. Upper-layer applications that do even a little error handling may never notice the disaster that happened underneath.
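
To see where these keys surface in the Verbs API, here is a hedged sketch of moving a freshly created UD QP into the INIT state. The pkey_index selects an entry from the port's P_Key table, and the qkey is the value incoming datagrams must present; the specific numbers here (port 1, index 0, 0x11111111) are placeholders, not recommendations.

```c
#include <rdma/ib_verbs.h>

static int init_ud_qp(struct ib_qp *qp)
{
	struct ib_qp_attr attr = {
		.qp_state   = IB_QPS_INIT,
		.pkey_index = 0,          /* which partition this QP belongs to */
		.port_num   = 1,          /* physical port on the HCA */
		.qkey       = 0x11111111, /* incoming packets must carry this Q_Key */
	};

	return ib_modify_qp(qp, &attr,
			    IB_QP_STATE | IB_QP_PKEY_INDEX |
			    IB_QP_PORT | IB_QP_QKEY);
}
```

The alternate path used for failover is configured through the same interface, via the alt_ah_attr and alt_port_num fields of ib_qp_attr together with the IB_QP_ALT_PATH mask.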


13.1.7 What Does a Packet Look Like

As a low-level engineer, you'll eventually need to capture and analyze packets for troubleshooting. Even though Wireshark has plugins, knowing a bit about the headers is always good.

A standard InfiniBand packet is like a Russian nesting doll:

  • LRH (Local Routing Header, 8 bytes): Mandatory.
    • Contains the source LID and destination LID.
    • Contains QoS attributes (Service Level).
    • This is used for local routing and becomes useless once it leaves the subnet.
  • GRH (Global Routing Header, 40 bytes): Optional.
    • Required if it's cross-subnet transmission or multicast.
    • It looks exactly like an IPv6 header.
  • BTH (Base Transport Header, 12 bytes): Mandatory.
    • This is the transport layer header.
    • Contains the destination QP number (the source QP number, when needed, travels in an extended header such as the datagram DETH, not in the BTH).
    • Contains the opcode (is this an RDMA Write or a Send?).
    • Contains the sequence number (used for packet ordering).
  • Payload: The actual data. Optional.
  • ICRC / VCRC: Checksums. Mandatory. The invariant CRC (ICRC) covers the fields that don't change end to end, while the variant CRC (VCRC) is recomputed link by link. Together they guarantee data integrity.

We can think of the LRH as the recipient address on an envelope (used by the local post office), the GRH as the international postal barcode (used by cross-border post offices), and the BTH as the "Dear so-and-so" at the top of the letter (indicating exactly which person should read it).
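
If you want a feel for the 12 bytes of the BTH, the illustrative layout below follows the field order in the InfiniBand Architecture Specification. This is a reading aid, not a structure taken from the kernel, and the bit-packed fields are shown merged into whole bytes.

```c
#include <linux/types.h>

/* Illustrative BTH layout (12 bytes on the wire); not a kernel definition. */
struct example_bth {
	u8	opcode;        /* operation: SEND, RDMA WRITE, RDMA READ, ACK, ... */
	u8	se_m_pad_tver; /* solicited event, migration state, pad count, header version */
	__be16	pkey;          /* partition key (P_Key) */
	__be32	dest_qp;       /* 8 reserved bits + 24-bit destination QP number */
	__be32	ack_psn;       /* ack-request bit + reserved + 24-bit packet sequence number */
};
```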


13.1.8 Management Entities: Who Manages This Network?

A network this complex must have someone managing it, right? InfiniBand doesn't adopt the chaotic "everyone manages themselves" Ethernet model; instead, it introduces centralized managers.

1. SM (Subnet Manager) — The Singular God The SM is the brain of the entire subnet. It's usually a software process running on a host, or even inside a "managed switch." Its workload is heavy:

  • Topology discovery: Who is connected to whom? Were cables plugged or unplugged?
  • Switch configuration: Pushing forwarding tables, telling switches where to send packets.
  • Resource allocation: Assigning a LID to each port.
  • Fault detection: If the network goes down, it's responsible for recalculating paths.

To ensure high availability, you can deploy multiple SMs, but at any given moment, only one is the Master, and the others are in Standby state. If the Master goes down, they will automatically elect a new leader.

2. SMA (Subnet Management Agent) — The Resident Ambassador No matter how capable the SM is, it can't physically crawl into every port. Each port has an SMA. The SMA is an agent program specifically responsible for receiving the SM's instructions (MAD packets), executing configurations, and reporting the results back to the SM.

3. SA (Subnet Administrator) — The Information Query Center The SM is in charge of control; the SA is in charge of providing information. When you need to know "what is the optimal path from port A to port B" or "how do I join this multicast group," you go ask the SA.

4. CM (Communication Manager) The SM manages the network; the CM manages connections. If two nodes want to establish a reliable RDMA connection (like RC mode), they need to handshake with each other, exchanging sensitive information like QP numbers and memory keys. The CM service exists to facilitate establishing this connection state.
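
To give a feel for what "facilitating a connection" looks like from a kernel consumer, here is a sketch of the client side using the RDMA CM wrapper in include/rdma/rdma_cm.h, which drives the CM machinery underneath. Real code must also create the PD, CQs, and QP inside the event handler and fill in an rdma_conn_param before calling rdma_connect; all of that is elided, so treat this as a shape rather than a working initiator.

```c
#include <linux/err.h>
#include <net/net_namespace.h>
#include <rdma/rdma_cm.h>

static int my_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_ADDR_RESOLVED:
		/* The address now maps to an RDMA device; next, find a path. */
		return rdma_resolve_route(id, 2000);
	case RDMA_CM_EVENT_ROUTE_RESOLVED:
		/* Normally: allocate PD/CQ/QP here, then call rdma_connect(). */
		return 0;
	case RDMA_CM_EVENT_ESTABLISHED:
		pr_info("RDMA connection established\n");
		return 0;
	default:
		return 0;
	}
}

static int start_connect(struct sockaddr *dst)
{
	struct rdma_cm_id *id;

	id = rdma_create_id(&init_net, my_cm_handler, NULL,
			    RDMA_PS_TCP, IB_QPT_RC);
	if (IS_ERR(id))
		return PTR_ERR(id);

	/* Kicks off asynchronous address resolution; progress arrives via my_cm_handler. */
	return rdma_resolve_addr(id, NULL, dst, 2000);
}
```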


At this point, we've cleared all the peripheral obstacles: from API definitions and kernel code structure to hardware components and addressing mechanisms.

But I know what you care about the most right now: "how do I actually send data out in code?"

This brings us to the soul of RDMA — the Queue Pair (QP) and its surrounding memory management mechanisms. In the next section, we'll dive down to the data structure level and see how to create, configure, and use these resources within the kernel.