14.15 Macros and Utility Functions
Although this section is titled "Macros," there is really only one core question: how do packets flow through the kernel?
As a network driver developer, when you write a PCI driver registration macro, you might wonder: how does this line of code actually hook into the kernel's massive device model? More importantly, when a packet arrives at the kernel from the hardware like a ghost, what does it become?
Furthermore, as the grand finale of this final chapter, the upcoming content will take you deep into the cornerstones of the Linux kernel network stack: sk_buff and net_device.
Before we proceed, I must be honest with you: the following content is extremely dense. If the previous chapters felt like "telling a story," then Appendix A is "tearing down an engine." We will dissect the two most core data structures in the kernel—Socket Buffer (SKB) and network devices—examining every field, every pointer, and every bit.
This isn't for you to memorize, but rather to ensure that when debugging crash logs, you can read the kernel's pathways as clearly as a map.
15.1 Registration: Just the Beginning
When writing a PCI network driver, the first thing you do is usually not initializing the hardware, but reporting to the kernel.
You can think of this macro as an "envelope" with "I handle devices with this ID" written on it.
But there is one flaw in this "envelope" analogy: a real envelope is done once mailed, whereas the registration macro is just the beginning of the story. When the kernel finds a matching device on the PCI bus, it wakes up your driver, but packet flow doesn't depend on the driver itself—it depends on the net_device structure the kernel carefully prepared for this device, and the sk_buff that carries its soul.
Returning to the "envelope": you should now see that the registration macro simply drops the envelope in the mailbox. The truly long journey begins only after the kernel takes it. Next, we will dive into the kernel to see exactly how it handles this letter.
Appendix A: Deep Dive into Kernel Data Structures
The following content is the "hardcore appendix" of the entire book. If you truly want to understand how the Linux network stack works, rather than just staying at the level of tweaking parameters, this is the necessary path.
It covers the two most core data structures in the Linux kernel network stack: sk_buff (Socket Buffer) and net_device.
To keep you from getting lost in an ocean of 27 fields, we won't read through them like a manual from start to finish. Instead, we will dissect sk_buff along the flow path of a packet through the kernel—from allocation, to pointer movement, to final freeing. We will see which fields control the packet's life, which fields optimize performance, and which fields are designed specifically to "bypass copies."
Finally, we will briefly look at net_device and RDMA, which are the bridges connecting the physical world to the kernel's virtual interfaces.
A.1 sk_buff: The Soul of a Packet
The sk_buff structure represents a network packet. SKB stands for Socket Buffer.
Imagine a packet that might be generated by a local user-space socket (via an HTTP request), or received from a physical NIC. It might be destined for the outside, or for another socket on the local machine. For the kernel, regardless of its source and destination, it will ultimately be wrapped in a sk_buff structure and passed between layers of the network stack.
You can think of sk_buff as an "envelope".
- The back of the envelope says who sent it (`sk`), who it's for (`dev`), and which route to take (`dst`).
- Inside the envelope is the letter's content (the data pointed to by the `data` pointer).
But there is one flaw in this "envelope" analogy: a real envelope, once sealed, cannot be changed, whereas the data area of sk_buff is dynamically resizable. Each layer of the protocol stack (TCP, IP, Ethernet) stuffs its own "padding" (protocol headers) into the envelope, or tears off the padding from the layer above. The most ingenious design of sk_buff is using pointer movement to achieve this "stuffing" and "tearing" process, rather than actually moving data around.
Returning to the "envelope": you should now see that the `data` pointer is the top line of the currently visible page inside the envelope, while `head` and `end` are the physical boundaries of that page. If `data` is pushed all the way back to `head`, there is no space left to write a new address (the headroom is exhausted), and the kernel may have to find a larger piece of paper (reallocate and copy the data), which is a performance killer.
1. Pointer Management: Tetris
This is the most ingenious and error-prone part of the SKB. The four pointers head, data, tail, and end define the layout of the packet in memory.
```c
sk_buff_data_t transport_header; /* L4 header position  */
sk_buff_data_t network_header;   /* L3 header position  */
sk_buff_data_t mac_header;       /* L2 header position  */
sk_buff_data_t tail;             /* end of the data     */
sk_buff_data_t end;              /* end of the buffer   */
unsigned char  *head;            /* start of the buffer */
unsigned char  *data;            /* start of the data   */
```
Core Concepts: Linear Data Area vs. Non-linear Data Area
- Linear Data Area: the area from `head` to `end` is the entire linear buffer allocated for the SKB.
- Valid Data: the area from `data` to `tail` is the current valid payload.
- Protocol Headers: `mac_header`, `network_header`, and `transport_header` point to specific positions within the linear buffer, marking the start of the L2, L3, and L4 headers respectively.
Operation functions (modifying the `data` and `tail` pointers):

- `skb_put(skb, len)`: Add data at the tail. The `tail` pointer moves down and `len` increases. Used for receiving data or filling in payload.
- `skb_push(skb, len)`: Add data at the head. The `data` pointer moves up and `len` increases. Used for adding protocol headers (e.g., the IP layer calls this before sending to prepend an IP header).
- `skb_pull(skb, len)`: Remove data from the head. The `data` pointer moves down and `len` decreases. Used for stripping protocol headers (e.g., after the IP layer finishes processing, it moves `data` past the IP header and hands the SKB to the TCP layer).
- `skb_reserve(skb, len)`: Reserve head space. Both the `data` and `tail` pointers move down together. Usually called immediately after allocating an SKB to leave enough space in front for lower layers to add headers.
Headroom and Tailroom
- Headroom: `data` minus `head`. This is the reserved head space.
- Tailroom: `end` minus `tail`. This is the reserved tail space.
This design allows the kernel to efficiently pass packets between protocol layers without copying data, simply by moving pointers.
2. Mechanism Breakdown: A Packet's Journey
Looking at pointer definitions alone can be dizzying. Let's walk through what actually happens inside an SKB as a packet goes from the NIC driver (L2) to TCP (L4).
Scenario: A host receives a TCP packet
1. L2 Entry (Driver Receive)
   - The driver allocates an SKB, typically with `netdev_alloc_skb()`; this step usually also calls `skb_reserve()` to reserve head space.
   - The hardware DMAs the frame to the `data` position.
   - At this point, `data` points to the Ethernet header.
2. L2 -> L3 (Network Layer)
   - The kernel calls `eth_type_trans()`. This function does three things:
     - Sets `skb->protocol` to the frame's EtherType (e.g., `ETH_P_IP` for IPv4).
     - Sets `skb->mac_header` to point to the L2 header.
     - Calls `skb_pull(skb, ETH_HLEN)` to strip the Ethernet header.
   - Key point: the `data` pointer now skips past the Ethernet header and points directly at the IP header.
   - This is why the TCP layer never has to deal with the Ethernet header—before delivery, the kernel already threw away that "wrapping" by moving pointers.
3. L3 -> L4 (Transport Layer)
   - `ip_rcv()` and the rest of the IP receive path process the packet (checksum verification, routing, etc.).
   - It then calls `skb_pull()` to strip the IP header.
   - At this point, the `data` pointer points to the TCP header.
   - Finally, the SKB is placed on the socket's receive queue.
Why go through all this trouble?
For efficiency. The kernel doesn't need to copy data; it only needs to move the data pointer to change what the current layer "sees." This is also why the SKB structure is so complex—it must precisely record the position of each layer's protocol header so it can instantly find the IP header when needed (e.g., when calculating the TCP checksum).
3. Routing and Devices: The Address on the Envelope
When the kernel gets this SKB, it first needs to know: where did it come from? Where is it going?
```c
struct net_device *dev;          /* associated network device */
struct dst_entry  *_skb_refdst;  /* cached routing decision   */
```
- `dev`: This "NIC" is the data's entry and exit point. For a received packet, it is the incoming interface; for a transmitted packet, the outgoing interface.
- `_skb_refdst`: This is the kernel's routing decision. After one routing-table lookup, the kernel caches the result (next hop, output interface) here, so the next time this packet is processed (e.g., forwarded) no second lookup is needed. The lower bits of this field are borrowed as reference-count markers—a classic kernel trick to save memory.
4. Protocol Private Control Area (cb[])
```c
char cb[48];
```
This is a "Control Buffer." It is a notepad left by the kernel for each protocol layer to store private information.
- Why it exists: For efficiency. The kernel doesn't want to define separate SKB variants for each protocol, so it reserves this space.
- Who uses it:
  - TCP: the macro `TCP_SKB_CB(__skb)` casts it to TCP's control structure `tcp_skb_cb`, used to store TCP sequence numbers, acknowledgment numbers, and so on.
  - Bluetooth: the macro `bt_cb(skb)` casts it to Bluetooth's control structure.
- Note: This area is opaque; once written by one layer, the next layer may overwrite or reinterpret it.
5. Data Length Management
The length fields in an SKB can be confusing and need to be clearly distinguished:
```c
unsigned int len;      /* total bytes in the packet        */
unsigned int data_len; /* bytes of non-linear (paged) data */
__u16  mac_len;        /* length of the MAC (L2) header    */
__wsum csum;           /* checksum                         */
```
- `len`: the length of the entire packet (linear data plus non-linear paged data).
- `data_len`: if the packet uses scatter/gather, part of the data lives in separate memory pages, and `data_len` records the size of that portion. If `data_len` is 0, all data is stored linearly and contiguously.
- `mac_len`: the length of the link-layer header (an Ethernet header is typically 14 bytes).
6. Checksums and Hardware Offload
Modern NICs are smart; they can calculate checksums for you, and even fragment packets for you.
```c
__u8 ip_summed:2;
```
- `CHECKSUM_NONE`: the hardware did not checksum the packet; software must do it.
- `CHECKSUM_UNNECESSARY`: no verification needed (e.g., loopback devices).
- `CHECKSUM_COMPLETE`: the hardware has already computed the checksum (receive path).
- `CHECKSUM_PARTIAL`: the hardware will finish the checksum on transmit (software fills in only the pseudo-header portion).
7. Fragmentation and Cloning (cloned, users)
```c
atomic_t users;
__u8 cloned:1;
```
- `users`: the SKB's reference count, initialized to 1.
- `cloned`: marks whether this SKB is a clone.
- Mechanism: when the kernel only needs to modify the SKB's metadata (like the `dev` pointer) without touching the data content, it avoids an expensive memory copy by cloning the `sk_buff` structure itself while sharing the underlying data block. Both the original SKB and the clone then have their `cloned` flag set to 1, and the shared data's own reference count (`dataref` in `skb_shared_info`) is incremented.
8. Non-linear Data: skb_shared_info
When the data volume is large (exceeding one page, typically 4KB), the kernel stores the excess data in scattered memory pages instead of squeezing it into the linear data area. Information about these scattered pages is stored in the skb_shared_info structure.
It is located at the very end of the linear data area (accessed via skb_end_pointer()).
```c
struct skb_shared_info {
    unsigned char nr_frags;              /* number of scatter pages */
    /* ... other fields ... */
    skb_frag_t frags[MAX_SKB_FRAGS];     /* array of scatter pages  */
};
```
- `nr_frags`: how many scatter pages are currently in use.
- `frags[]`: stores the page, offset, and length of each fragment.
- `frag_list`: another form of non-linear data—a list of SKBs chained directly off this one (used for IP fragment reassembly).
A.2 net_device: The General Garrisoning the Fortress
If sk_buff is the flowing "soldiers," then net_device is the "general" garrisoning the fortress.
The net_device structure represents a network interface device. It can be a physical device (like eth0) or a virtual device (like bridge0 or vlan100).
1. Operation Set: The Soul of the Device
```c
const struct net_device_ops *netdev_ops;
```
This is the soul of net_device. It is a set of function pointers that define how the kernel commands this device. When you execute ip link set eth0 up in user space, the kernel ultimately calls ndo_open() here.
- `ndo_open()`: start the device.
- `ndo_stop()`: stop the device.
- `ndo_start_xmit()`: the most important function. The kernel network stack calls it to send packets; the driver must hand the packet to the hardware's transmit queue.
- `ndo_set_mac_address()`: change the MAC address.
- `ndo_tx_timeout()`: the watchdog callback invoked when transmission stalls.
2. Hardware Features and Offloads
By telling the kernel which hardware acceleration features it supports, the driver allows the kernel to decide whether to offload heavy tasks to the hardware.
```c
netdev_features_t features;    /* currently enabled features   */
netdev_features_t hw_features; /* features the hardware offers */
```
- `NETIF_F_IP_CSUM`: the hardware supports IPv4 checksum calculation.
- `NETIF_F_TSO`: TCP Segmentation Offload. The hardware splits large TCP segments into packets that fit the MTU.
- `NETIF_F_GRO`: Generic Receive Offload. On receive, the kernel (with driver cooperation) merges multiple small packets into one large one, reducing per-packet processing cost.
3. State and Queues
```c
unsigned long state;
struct netdev_queue *_tx;
```
- `state`: records the link state (e.g., `__LINK_STATE_START` for up, `__LINK_STATE_NOCARRIER` for cable unplugged).
- `_tx`: the transmit queue array. Modern high-performance NICs typically have multiple queues to leverage the parallel processing capabilities of multi-core CPUs.
A.3 RDMA (Remote DMA)
RDMA is a high-performance networking technology that allows one computer to directly access another computer's memory without the involvement of the remote CPU. This drastically reduces latency and increases throughput.
You can think of RDMA as an "express delivery service." Normal network communication is like sending a letter, requiring a post office (the remote CPU) to sort it; RDMA is like driving your truck straight into the other party's warehouse to move goods.
But this "express delivery" has a prerequisite: security. You can't just let anyone into your warehouse.
1. Protection Domain (PD) - The Security Perimeter
A PD is like a private club. All resources (memory regions, queue pairs) must belong to the same club (PD) to access each other. If you don't belong to this club, you won't be let in even if you have the address.
2. Memory Region (MR) - Warehouse Registration
In RDMA, you can't just take a memory address and write to a remote machine. You must first register that memory.
- Registration process:
- Pinning: Prevents the memory from being swapped out to disk.
- Translation: Translates virtual addresses to physical addresses and informs the NIC's DMA engine.
- Keys: After registration, you get two keys:
- LKey (Local Key): Used for local access.
- RKey (Remote Key): The credential that the remote machine must provide to access this memory.
Returning to the "delivery" analogy: an MR is like the cargo manifest you registered with the warehouse (memory). Only registered cargo, and couriers (NICs) holding the corresponding RKey, can move that cargo. If the other party presents the wrong key or attempts to move unregistered cargo, the RDMA hardware will directly reject the access, protecting your memory safety.
3. Queue Pair (QP) - The Communication Channel
This is the core channel of RDMA communication. A QP contains two queues:
- Send Queue (SQ): Submits send work requests.
- Receive Queue (RQ): Submits receive work requests.
QP Types:
- RC (Reliable Connected): Reliable connection, similar to TCP.
- UD (Unreliable Datagram): Unreliable datagram, similar to UDP.
4. Completion Queue (CQ) - The Receipt
RDMA operations are asynchronous. When you initiate a send request, the function returns immediately. How do you know the send is complete? Through the CQ. The hardware writes completion status to the CQ, which you can poll or wait for notifications on.
Chapter Echoes
What this chapter is really doing is establishing the foundational understanding of the packet perspective.
On the surface, we are looking at individual structure fields, but in reality, we are understanding how the kernel balances the "cost of copying" against the "complexity of pointers." The reason the data pointer in sk_buff jumps around like this is fundamentally to avoid data copies—because copying memory is the number one killer of network performance.
Remember the question from the beginning—how do you understand a packet flowing through the network stack?
You should now be able to answer: it is not a static block of data, but a "perspective" that constantly changes between different protocol layers. L2 sees a frame, L3 sees a packet, L4 sees a stream, and the kernel, by cleverly moving pointers within sk_buff, allows all layers to efficiently process data in the same physical memory. That registration macro is merely the admission ticket to all of this; the real performance plays out inside these structures.
In the next chapter (if there is one), we will bring this knowledge back to reality to solve those maddening bugs that occur in the real world. Good luck.
Exercises
By this point, the mechanisms should be clear—or so you think. The following questions increase in difficulty. I recommend thinking independently before looking at the hints; only check them if you get stuck. If you can solve the third question, it means you truly understand.
Exercise 15.1 Simulating Protocol Encapsulation (Pointer Drill) ⭐ (Understanding)
Suppose you have just allocated an empty SKB in the kernel, where the data and tail pointers coincide, and you have already reserved 64 bytes of Headroom.
Now you need to build a TCP packet in order (without filling in data for now, only considering header reservation):
1. Call `skb_push(skb, 20)` to reserve the IP header.
2. Call `skb_push(skb, 20)` to reserve the TCP header.
Question: After these two operations, how many bytes has the data pointer moved relative to its initial position? Which layer's protocol header does the data pointer currently point to?
Answer and Analysis
Answer: It moved 40 bytes (20+20). The data pointer currently points to the start of the TCP header (L4).
Analysis:
The purpose of skb_push is to move the data pointer toward lower addresses (expanding forward), freeing up space at the head.
First, push 20 bytes (IP header), and the data pointer moves up by 20 bytes.
Then, push 20 bytes (TCP header), and the data pointer continues moving up by 20 bytes.
Total movement: 40 bytes.
When building a transmit packet, each `skb_push` lands at a lower address than the previous one, so the header pushed *last* sits outermost. In this drill that is the TCP header, so `data` points to its start. (Note that real transmit paths push the TCP header before the IP header; the order here is purely a pointer exercise.) When actually transmitting, the driver will push the Ethernet header as well.
Exercise 15.2 Headroom Exhaustion ⭐⭐ (Application)
When building a packet, if there are too many protocol headers or insufficient reserved space, the data pointer might collide with the head pointer (i.e., Headroom is exhausted).
Question: When the kernel detects insufficient Headroom (e.g., when skb_push finds that data is about to be less than head), what happens? Is this a simple or expensive operation? Why?
Answer and Analysis
Answer: The kernel will call the pskb_expand_head() function.
This is an extremely expensive operation.
Because the kernel must allocate a larger new memory block, copy all existing data (including linear data and fragmented data) over, and then free the old SKB. This involves memory allocation and massive memory copy operations, which severely degrades network performance.
Exercise 15.3 SKB and Device Lifecycle ⭐⭐⭐ (Reflection)
Consider Netfilter's NF_QUEUE mechanism: when the kernel sends a packet to a user-space program (like Suricata) via NFQUEUE rules for processing, the kernel must hold onto this SKB until user-space processing is complete and the packet is reinjected.
Questions:
1. Why does the kernel prefer to clone the SKB (`skb_clone()`) in this case, rather than directly referencing or fully copying it?
2. How do the `users` and `cloned` fields change between the cloned SKB and the original?
3. If the user-space program modifies the packet's content (e.g., rewrites the IP address), how does the kernel handle this "copy-on-write" requirement?
Answer and Analysis
Answer:
1. Reason: direct referencing would invite race conditions (if the driver is still using the SKB while user space reads it simultaneously, state would be corrupted); a full copy (`skb_copy()`) is too expensive (it duplicates all data and pages). Cloning is the compromise: only the `sk_buff` structure itself (the metadata) is copied, while the same data buffer is shared. User space and the kernel each get independent metadata but share the data, greatly improving performance.
2. Field changes:
   - The `cloned` flag of both the new and the old SKB is set to 1.
   - The shared data's reference count (`dataref` in `skb_shared_info`) is incremented, because two structures now point to the same data.
3. Copy-on-write: if the user-space program attempts to modify the data content (e.g., via `skb_store_bits()` or by mangling the packet body through the queue), the kernel detects that the data is shared and first unshares it, copying the affected data to a new memory location. This breaks the sharing relationship and ensures the modification doesn't affect other holders.
Analysis: This question tests a deep understanding of the core of SKB design. The cloning mechanism is one of the keys to the high performance of the Linux network stack, embodying the typical Linux design philosophy of "reference counting + shared data."
Key Takeaways
sk_buff is the core data structure for handling packets in the Linux network stack. The essence of its design lies in using pointer operations (skb_push, skb_pull, skb_reserve) rather than data copies to achieve passing between protocol layers. The four pointers head, data, tail, and end define a dynamic linear data area, while the protocol field acts as the "baton" passed between layers, telling the next layer which type of protocol header to treat the data pointer as.
Headroom (reserved head space) is key to SKB performance optimization. Once the reserved space is insufficient, the kernel must perform an expensive data copy (pskb_expand_head), involving new memory allocation and full data migration. This is a trap that must be avoided at all costs in high-performance network processing. The correct reservation strategy (usually calling skb_reserve at allocation time) is a fundamental skill for driver developers.
The net_device structure, as the abstraction of a network interface, has its core in the net_device_ops function set, which defines the specific behaviors for the kernel to control the hardware (such as start, stop, send). Through hardware feature offloading (like NETIF_F_IP_CSUM, TSO, GRO), the kernel can offload heavy computational tasks (checksums, fragmentation, reassembly) to the NIC hardware, thereby freeing up CPU compute power.
The cloning mechanism is the solution for multiple paths sharing the same packet. When only metadata needs to be modified (like Netfilter changing the route), the kernel uses skb_clone() to copy only the sk_buff structure, shares the underlying data buffer, and sets the cloned flag while incrementing the users reference count. Only when a write operation occurs does the kernel trigger an actual data copy, i.e., "copy-on-write."
RDMA technology achieves zero-copy network transmission by bypassing the remote CPU. It relies on Protection Domains (PDs) to delineate security boundaries, Memory Regions (MRs) to complete memory registration (pinning virtual addresses and translating them to physical addresses for NIC access), and uses Local/Remote Keys (LKey/RKey) as access credentials, ensuring that only authorized operations can read and write memory. This is a paradigm of balancing high performance with security.