
1.2 The Network Device

Let's shift our focus down to Layer 2 (L2), the data link layer: the lowest layer the kernel's network stack deals with.

This is where the Network Device Driver lives. Although the main character of this book is the kernel Network Stack rather than teaching you how to write drivers (which would require a whole separate thick book), to truly understand how packets enter and leave, we must first meet the "handshaker" between the driver and the stack: the net_device structure.

It's like driving on a highway — you need to understand how toll booths work, even if you're not the toll collector.

net_device: The NIC's "ID Card"

In the kernel's eyes, a NIC is not a physical entity at all; it's simply a massive structure instance: net_device.

This structure holds everything about the NIC's "life and soul":

  • Hardware IRQ number: The interrupt request line the CPU relies on to know the NIC has work to do.
  • MTU (Maximum Transmission Unit): The Ethernet default is 1500 bytes. Packets exceeding this must be fragmented.
  • MAC address: The NIC's physical ID, 48 bits long.
  • Device name: Names like eth0 and wlan0 are defined here.
  • Flags: Is the NIC UP or DOWN? Is it RUNNING?
  • Multicast address list.
  • Promiscuous mode counter: We'll cover this shortly; it's key to packet capture tools.
  • Hardware features: Such as whether it supports offloading features like GSO and GRO.
  • net_device_ops callback set: The NIC's operation manual, containing function pointers for opening the device, stopping the device, transmitting data, changing the MTU, and so on.
  • ethtool callbacks: This is why typing ethtool eth0 on the command line shows you a bunch of register information.
  • Multi-queue support: Modern 10-Gigabit NICs have multiple Tx/Rx queues.
  • Timestamps: Recording the time of the last packet transmission or reception.

Let's take a quick look at a snippet of the kernel definition to get a feel for its style:

struct net_device {
    unsigned int                irq;          /* interrupt number */
    ...
    const struct net_device_ops *netdev_ops;  /* operation callback set */
    ...
    unsigned int                mtu;          /* maximum transmission unit */
    ...
    unsigned int                promiscuity;  /* promiscuous mode counter */
    ...
    unsigned char               *dev_addr;    /* hardware (MAC) address */
    ...
};

Note: I dissect this behemoth in detail in Appendix A. When you get stuck reading the source code, it's worth flipping to it.

Here's a particularly interesting and easily overlooked detail: the promiscuous mode counter.

Why is it an int counter rather than a bool flag?

If it were a boolean: a packet capture tool (like tcpdump) starts and sets the flag to 1. A second capture tool (like Wireshark) then starts and, when it later exits, sets the flag back to 0. The result? The first tool is still running, but the NIC has already left promiscuous mode. Total chaos.

Using a counter perfectly solves this problem:

  1. tcpdump starts, counter +1 (becomes 1), NIC enters promiscuous mode and starts receiving all packets passing on the wire (regardless of whether they are destined for the local machine).
  2. wireshark starts, counter +1 (becomes 2).
  3. tcpdump exits, counter -1 (becomes 1). The NIC remains in promiscuous mode because wireshark is still using it.
  4. wireshark exits, counter -1 (becomes 0). At this point, the kernel confirms no one is eavesdropping, and the NIC exits promiscuous mode and resumes normal operation.

Although simple, this design is a classic textbook case for handling "shared resource state among multiple users."


A Performance Savior: NAPI (New API)

While browsing the kernel's core networking code, you'll frequently run into an acronym — NAPI.

It's an indispensable part of modern network drivers. To understand it, we need to look at what the old world looked like.

The Old Approach: Interrupt-Driven

Early NIC drivers had a very simple working model: one packet arrives, one interrupt is fired. Packet arrives → interrupt triggers → CPU stops what it's doing, saves context → jumps to the interrupt handler → grabs the packet → restores context. This is fine for everyday web browsing, but if you're hit with a DDoS attack or processing massive amounts of small-packet traffic, the CPU will collapse. What do hundreds of thousands of interrupts per second mean? The CPU exhausts its computing power just "entering" and "exiting" (saving and restoring registers and context), leaving no time to do actual work. In operating systems, this is called interrupt livelock.

The New Approach: NAPI (Hybrid Mode)

To solve this problem, kernel developers introduced NAPI (New API). Its core idea is to dynamically switch strategies based on the load.

  • Under low load: It still uses interrupts. No packets means no bothering the CPU, saving power and ensuring fast response times.
  • Under high load: It switches to polling. After the initial interrupt triggers, the driver temporarily disables that interrupt and tells the kernel, "I have a bunch of packets now; keep polling me to drain them instead of firing an interrupt for each one."

This changes "processing N packets requires N interrupt context switches" into "processing N packets requires only 1 interrupt + polling." The performance improvement is immediate.

However, you can't have the best of both worlds. Although polling offers high throughput, it increases latency — because the CPU doesn't respond immediately, but waits until the polling cycle arrives to fetch the packets.

For applications that are extremely latency-sensitive and willing to splurge CPU resources to achieve it (like high-frequency trading), Linux introduced Busy Polling on Sockets starting from kernel 3.11. This is a more niche optimization, which we'll save for the "Busy Poll Sockets" section in Chapter 14.

Alright, now we have a sufficient understanding of the NIC as an "exit." Next, let's look at the real main event: how packets travel through the kernel.


The Packet's Journey: Receive and Transmit

The entire life of a network device driver is basically spent on two things:

  1. Receiving: Taking packets captured from the wire and passing them up layer by layer to the network layer (L3), and ultimately to the transport layer (L4).
  2. Transmitting: Taking packets generated locally or needing to be forwarded, and pushing them into the NIC to be sent out on the wire.

Sounds simple? There are plenty of detours along the way.

For every packet passing through the kernel, whether inbound or outbound, the routing subsystem performs a table lookup. This lookup determines where the packet should actually go: is it destined for the local machine, or should it be forwarded, and if so, out of which NIC? This part is extremely important, and we'll dive deep into the IPv4 and IPv6 routing subsystems in Chapters 5 and 6.

But the routing decision isn't the only hurdle. As packets travel through the stack, they also encounter "roadblocks":

1. Netfilter Hooks

This is the mechanism behind firewalls and NAT. The kernel sets up "checkpoints" at five critical nodes in the network stack.

  • Before a received packet even hits the routing table, it passes through NF_INET_PRE_ROUTING.
  • Before a transmitted packet leaves the NIC, it passes through NF_INET_POST_ROUTING.

These checkpoints are triggered by the NF_HOOK() macro. If your iptables rules say this packet is illegitimate, the callback function returns NF_DROP, and the packet is dropped on the spot — the kernel doesn't even bother to write an obituary for it. If it's NF_ACCEPT, it's let through to the next stop. We'll break down the underlying mechanisms of Netfilter, connection tracking, and iptables in detail in Chapter 9.

2. IPsec

If a packet matches an IPsec policy, it gets sent off for encryption or decryption. IPsec provides network-layer security (using the ESP or AH protocols). Although the IPv6 specification mandates IPsec support while it's optional for IPv4, Linux has full support for both. IPsec has two modes:

  • Transport mode: Only the payload is encrypted.
  • Tunnel mode: The entire original packet is encrypted and wrapped in a new IP packet (the common method for VPNs).

IPsec and NAT often clash (you can't rewrite addresses that are protected by encryption or authentication), which led to the "NAT Traversal" solution. We'll leave all of this for Chapter 10.

3. TTL (Time To Live)

If a packet is being forwarded, its IPv4 header contains a field called TTL (Time To Live). Every time it passes through a router, this value is decremented by 1. When it reaches 0, the router drops the packet immediately and sends back an ICMP "Time Exceeded" message. This prevents packets from circulating forever in the network due to routing loops. Note that every time the TTL is modified, the IPv4 header checksum must be recalculated, which adds up to quite a bit of work. In IPv6, this field was renamed Hop Limit: same meaning, just a more elegant name.

In addition, the packet's journey is full of variables:

  • Fragmentation and reassembly: Large packets (exceeding the MTU) get chopped up, and the receiving end has to piece them back together. We'll cover this in Chapter 4.
  • Multicast: The packet isn't sent to one person, but to a group. This involves the IGMP protocol and multicast routing daemons (like pimd), which is typically much more complex than unicast. We'll touch on this in Chapter 6.

To maintain order amidst all these complex detours, the kernel must have a unified way to describe and manage these packets. This core data structure is the legendary SKB.


The Core: Socket Buffer (sk_buff)

sk_buff (which we usually just call SKB) is the most core, most complex, and most headache-inducing data structure in the Linux kernel Network Stack.

It is the physical "body" of the packet inside the kernel. Whether the packet was just fished out of the wire by the NIC driver, or is about to be sent out from the TCP layer, it is an SKB.

Let's take a look at its definition (only listing key members; see Appendix A for the full version):

struct sk_buff {
    ...
    struct sock         *sk;               /* socket that owns this packet */
    struct net_device   *dev;              /* associated network device */
    ...
    __u8                pkt_type:3;        /* packet type (unicast/multicast/broadcast) */
    ...
    __be16              protocol;          /* protocol type */
    ...
    sk_buff_data_t      tail;              /* tail pointer */
    sk_buff_data_t      end;               /* end pointer */
    unsigned char       *head, *data;      /* head and data pointers */
    sk_buff_data_t      transport_header;  /* L4 header position */
    sk_buff_data_t      network_header;    /* L3 header position */
    sk_buff_data_t      mac_header;        /* L2 header position */
    ...
};

The Golden Rule of SKB Manipulation

Never try to manually skb->data++ or directly manipulate pointers! The SKB's internal management relies on a strict set of APIs:

  • Want to move the data pointer forward (strip the header)? Use skb_pull() or skb_pull_inline().
  • Want to move it backward (reserve header space)? Use skb_push().
  • Want to get the transport layer (L4) header? Use skb_transport_header(skb).
  • Want to get the network layer (L3) header? Use skb_network_header(skb).
  • Want to get the MAC header? Use skb_mac_header(skb).

Following this API is crucial: internally, an SKB manages both a linear data area and paged (non-linear) fragments, and the helpers keep all of that bookkeeping consistent. Don't try to be too clever.

The Birth of an SKB (Receive Path)

When a packet comes in from the Ethernet cable, the driver allocates an SKB via netdev_alloc_skb() (older code might use dev_alloc_skb()).

At the data link layer (L2), the driver does two important things:

  1. Determine the type (eth_type_trans): This function sets the SKB's pkt_type based on the destination MAC address in the Ethernet frame header:

    • Destined for the local machine? → PACKET_HOST.
    • Multicast? → PACKET_MULTICAST.
    • Broadcast? → PACKET_BROADCAST. At the same time, it reads the Type field from the Ethernet header (e.g., 0x0800 for IPv4, 0x86DD for IPv6) and fills it into the SKB's protocol field.
  2. Pointer jumping (skb_pull_inline): This is the part where beginners get dizzy most easily. When an Ethernet frame enters memory, skb->data points to the start of the L2 header. However! The moment the driver hands this packet over to the network layer (L3), the kernel expects skb->data to point to the L3 header (the IP header). So eth_type_trans() calls skb_pull_inline(skb, 14) to move the pointer forward by 14 bytes (exactly the length of the Ethernet header ETH_HLEN), skipping the L2 header.

    Imagine this: You're peeling an onion. Peel off one layer (L2), and what's left in your hand should be exactly the next layer (L3). skb_pull is exactly this peeling action.

(Insert original book Figure 1-3 diagram here: a standard UDPv4 packet, from the 14 bytes of L2, to the 20 bytes of L3, and then the 8 bytes of L4)

The Wandering SKB

Every SKB has a dev pointer that records "whose side it's on" (incoming packets record the input NIC, outgoing packets record the output NIC). This is important because the kernel needs to decide whether to fragment the packet based on this NIC's MTU.

Every transmitted SKB also has a sk pointer pointing to the Socket that produced this packet. Note: For forwarded packets, sk is NULL. Because it wasn't generated locally; it's just a "passerby."

The Destination of the SKB

Received packets are ultimately dispatched to their corresponding protocol handler functions:

  • IPv4 packets are handed off to ip_rcv().
  • IPv6 packets are handed off to ipv6_rcv().

How are these functions registered? Using the dev_add_pack() method. We'll take a closer look at this in the dedicated chapters for IPv4 and IPv6 (Chapters 4 and 8).

Taking ip_rcv() as an example, it first performs a bunch of sanity checks, and then — if it isn't intercepted by Netfilter's PRE_ROUTING hook — it enters ip_rcv_finish(). Here, the kernel queries the routing subsystem and builds a dst_entry (destination cache entry), which determines where this packet goes next. For a hardcore analysis of the routing subsystem, check out Chapters 5 and 6.


Finally, we need to mention those trivial but critical matters in the L2 layer.

MAC Addresses and ARP

Every NIC comes from the factory with a 48-bit MAC address, although you can change it using the ifconfig or ip commands. When your Socket wants to send a packet to 192.168.1.5, it only knows the IP address. However, Ethernet frames only recognize MAC addresses. This calls for an interpreter: the Neighbor Subsystem.

  • IPv4 uses the ARP protocol. It shouts out in the local network (broadcast): "Who has 192.168.1.5? Tell me your MAC!"
  • IPv6 uses the NDISC protocol (based on ICMPv6). The principle is similar, but it's more complex and uses multicast instead of broadcast.

This mechanism is explained in detail in Chapter 7. If your network is down, try pinging first, and then check arp -n — the problem is often right here.

Kernel-to-Userspace Communication: Netlink

How does the kernel tell userspace that a route has changed? And how do userspace tools tell the kernel, "I want to add a route"? Through Netlink Sockets. It acts like a special phone line, with one end connected to the kernel and the other to tools like iproute2. We'll cover this in detail in Chapter 2.


Special Subsystems: Wireless, RDMA, and Virtualization

In the grand edifice of the Linux Network Stack, there are also a few special "VIP suites."

  1. Wireless Subsystem: This is driven by the mac80211 framework. It handles things that ordinary wired NICs have never seen: power-saving modes, Mesh networking (HWMP routing protocol), and Ad-hoc networks. In particular, the 802.11n Block Ack mechanism is key to improving throughput. We'll save this content for Chapter 12.

  2. RDMA / InfiniBand: In high-performance computing and data centers, traditional CPU data transfer is too slow. RDMA allows NICs to directly read and write remote host memory, bypassing the CPU. This technology was introduced starting from kernel 2.6.11 and is now ubiquitous in large clusters. We'll cover it in Chapter 13.

  3. Virtualization: Process-level virtualization based on Namespaces. This isn't full virtualization like KVM, but a lighter-weight "isolation." Linux currently has six types of Namespaces. Network Namespaces allow you to run multiple completely independent network stacks on a single machine (multiple lo interfaces, multiple routing tables). This is the cornerstone of container technologies like Docker. We'll dive deep into this in Chapter 14, and we'll also touch on Bluetooth, PCIe, and 6LoWPAN in the IoT domain (stuffing IPv6 into low-power frames like IEEE 802.15.4 — just thinking about it feels cramped).


That concludes our introduction to network devices. Starting from the next section, we will truly dive into the code and see how Linux kernel network development is done — that's another world full of Git trees and mailing list politics.