Linux Kernel Networking

📄️Chapter 1: Linux Network Stack Overview

This chapter contains 3 sections. Click the links below to read:

📄️Chapter 1: Deep into the Kernel: Dissecting the Network Stack Black Box

Chapter Introduction

1.2 The Network Device

Let's shift our focus down to the bottom layer — Layer 2 (L2), the data link layer.

1.3 Linux Kernel Network Development Model

The networking subsystem is incredibly complex, and it evolves at breakneck speed—fast enough that if you blink, you might miss an API change.

📄️Chapter 2: Netlink Sockets

This chapter contains 7 sections. Click the links below to read:

📄️Chapter 2: When Userspace Meets the Kernel

Imagine you are writing an application that needs to monitor network traffic changes. Whenever the kernel's routing table changes or a new network interface comes online, your program needs to know immediately.

2.2 Kernel Netlink Sockets

In the previous section, we discussed the userspace tools—iproute2, net-tools, and the handy libnl and libmnl libraries. But on the other side of the fence, how does the kernel view all of this?

2.3 Netlink Message Header

In the previous section, we saw how the kernel places messages into the send queue — tracing the call chain from rtnetlink all the way to the generic netlink module.
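As a rough illustration of what that header looks like on the wire, the sketch below packs and unpacks the 16-byte `nlmsghdr` layout (length, type, flags, sequence number, port ID) in userspace. The helper names `pack_nlmsg` and `unpack_nlmsghdr` are made up for this example; the constant values are the ones from the kernel's UAPI headers.

```python
import struct

# struct nlmsghdr: __u32 len, __u16 type, __u16 flags, __u32 seq, __u32 pid.
# "=" selects native byte order with no alignment padding.
NLMSG_HDR_FMT = "=IHHII"
NLMSG_HDRLEN = struct.calcsize(NLMSG_HDR_FMT)

RTM_GETLINK = 18       # rtnetlink: request link information
NLM_F_REQUEST = 0x001  # this message is a request
NLM_F_DUMP = 0x300     # NLM_F_ROOT | NLM_F_MATCH


def pack_nlmsg(msg_type, flags, seq, pid, payload=b""):
    """Hypothetical helper: prepend a netlink header to a payload."""
    length = NLMSG_HDRLEN + len(payload)
    header = struct.pack(NLMSG_HDR_FMT, length, msg_type, flags, seq, pid)
    return header + payload


def unpack_nlmsghdr(data):
    """Hypothetical helper: decode the header at the start of a buffer."""
    length, msg_type, flags, seq, pid = struct.unpack_from(NLMSG_HDR_FMT, data)
    return {"len": length, "type": msg_type, "flags": flags,
            "seq": seq, "pid": pid}


msg = pack_nlmsg(RTM_GETLINK, NLM_F_REQUEST | NLM_F_DUMP, seq=1, pid=0)
header = unpack_nlmsghdr(msg)
```

The length field counts the header itself, so a message with no payload reports a length of 16.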

2.4 NETLINK_ROUTE Messages: More Than Just Routing

In the previous section, we spent our time deep in the Generic Netlink mechanism, tearing apart message headers, TLVs, and validation policies. Now that the foundation is laid, it's time to shift our focus back to the networking subsystem itself and see how the "veteran" Netlink protocol families actually work.

2.5 Adding and Removing Routing Table Entries: Dancing in the FIB

In the previous section, we broke down the Netlink message header and those dazzling flag bits. Now, it's time to throw that theory into a real-world scenario.

2.6 Generic Netlink Protocol

Remember the question we left you with at the end of the last section?

📄️ch02_7

2.7 Quick Reference

📄️Chapter 3: ICMP Protocol

This chapter contains 3 sections. Click the links below to read:

3.0 The Network's Nervous System: Why We Need ICMP

Imagine managing a massive, globally distributed package delivery system.

3.2 The "Swiss Army Knife" of IPv6: ICMPv6

In the previous section, we navigated the world of IPv4. While ICMPv4 is certainly important, within the IPv4 architecture, it plays somewhat of a "patch" role—ARP handles address resolution, IGMP manages multicast, and ICMPv4 is mainly left with error reporting and diagnostics.

3.3 Quick Reference and Practical Supplements

In this chapter, we broke down the internal workings of ICMPv4 and ICMPv6, from initialization flows to packet transmission and reception, covering a significant amount of kernel code. Now it's time to consolidate these scattered details into a quick reference, and along the way, cover two "corners" that are easily overlooked in engineering practice.

📄️Chapter 4: IPv4 Protocol Implementation

This chapter contains 9 sections. Click the links below to read:

4.1 IPv4 Header and Protocol Registration

There is a category of problems that appear to be "misconfigurations" on the surface, but are actually "structural misunderstandings."

4.2 Receiving IPv4 Packets

In the previous section, we covered the "registration" process of IPv4: the kernel hangs `ip_packet_type` on the global protocol list and sets `ip_rcv()` as the callback function. When the NIC delivers a packet and the Ethernet type is `0x0800`, the kernel heads straight for this function.
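The registration-then-dispatch idea can be modeled in a few lines of userspace code. This is only a sketch of the concept, not kernel code: `packet_handlers`, `register_handler`, `ip_rcv_model`, and `deliver` are hypothetical names standing in for the protocol list, the registration call, `ip_rcv()`, and the receive path.

```python
ETH_P_IP = 0x0800  # EtherType for IPv4

# Hypothetical registry standing in for the kernel's protocol list.
packet_handlers = {}


def register_handler(ethertype, callback):
    """Model of hanging a handler on the protocol list."""
    packet_handlers[ethertype] = callback


def ip_rcv_model(payload):
    """Stand-in for ip_rcv(): read the IP version from the first nibble."""
    version = payload[0] >> 4
    return f"ipv4 handler got version {version}"


def deliver(frame):
    """Read the EtherType at bytes 12-13 of an Ethernet frame and dispatch."""
    ethertype = int.from_bytes(frame[12:14], "big")
    handler = packet_handlers.get(ethertype)
    if handler is None:
        return "dropped: no handler"
    return handler(frame[14:])  # strip the 14-byte Ethernet header


register_handler(ETH_P_IP, ip_rcv_model)

# 12 bytes of MAC addresses, EtherType 0x0800, then a minimal 20-byte IPv4 header.
frame = bytes(12) + ETH_P_IP.to_bytes(2, "big") + bytes([0x45]) + bytes(19)
result = deliver(frame)
```

A frame carrying any EtherType that was never registered falls through to the "dropped" branch, which mirrors why nothing reaches the IPv4 code unless the type is `0x0800`.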

4.3 Receiving IPv4 Multicast Packets

At the end of the previous section, we mentioned that once ip_rcv_finish() completes route lookup and option processing, a packet's fate generally splits into three paths: local delivery, forwarding, or multicast.

4.4 When IP Options Wake Up in a Packet

In the previous section, we followed a multicast packet through its entire lifecycle in the kernel, entering from ip_rcv, either being consumed locally or forwarded onward. That path was clean, like a highway with no obstacles.

4.5 Sending IPv4 Packets

Now, let's switch roles.

4.6 Fragmentation

In the previous section, we mentioned that the packet ultimately leaves the local machine via ip_local_out(), passing the point of no return.

4.7 Reassembly: Putting the Shattered Mirror Back Together

In the previous section, we sliced large packets into fragments like a chef chopping vegetables. That was satisfying, but it left behind a massive mess.

4.8 Packet Forwarding

In the previous section, we reassembled the shattered mirror, watching those fragments come back together in ip_defrag() before finally being handed off to the transport layer.

4.9 Quick Reference

Now that we've completed this journey, let's pause and pack our backpack.

📄️Chapter 5: The IPv4 Routing Subsystem

This chapter contains 8 sections. Click the links below to read:

5.0 No Map, No Journey

One of the core tasks of the Linux network stack is to act as a transit hub—accurately forwarding packets that don't belong to the local machine to the next hop. This sounds simple, but when you're dealing with a core router on the internet backbone, it's an entirely different scale of problem. In this world, network topologies change in the blink of an eye, and massive amounts of routing information are updated every single second.

5.2 Performing Lookups in the Routing Subsystem

In the previous section, we figured out what the FIB is—it's not a simple table, but rather the "treasure map" the kernel uses to determine a packet's fate. We have the map, but nobody has read it yet.

5.3 FIB Info: The "ID Card" of a Routing Entry

In the previous section, we mentioned that once fib_lookup() fills in fib_result, it has essentially completed its historical mission. The most important pointer in the result structure, the one pointing to fib_info, is the real boss.

📄️ch05_4

5.4 The Last Mile: Nexthop (fib_nh)

5.5 Policy Routing: Choices Beyond the Map

In the previous section, we discussed how fib_nh acts like a dutiful guide, holding a sticky note with the outgoing interface and gateway address, directing packets on how to leave the kernel. We built a very intuitive mental model: the destination determines the route. As long as you know where you want to go (the destination IP), the routing table can tell you how to get there.
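The "destination determines the route" model boils down to a longest-prefix match. The sketch below shows that rule with a plain linear scan over a made-up table; the prefixes, gateways, and device names are illustrative only, and the kernel relies on a far more efficient lookup structure than a list.

```python
import ipaddress

# Hypothetical routing table: (prefix, gateway, outgoing device).
ROUTES = [
    ("0.0.0.0/0", "192.0.2.1", "eth0"),     # default route
    ("10.0.0.0/8", "10.255.0.1", "eth1"),
    ("10.1.0.0/16", "10.1.255.1", "eth2"),
]


def lookup(dst):
    """Return (gateway, device) for the most specific prefix containing dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, gateway, device in ROUTES:
        net = ipaddress.ip_network(prefix)
        # Keep the match with the longest prefix length seen so far.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, device)
    return None if best is None else (best[1], best[2])
```

Note that the only input is the destination address. Policy routing is about breaking exactly that assumption by letting other attributes, such as the source address, take part in the decision.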

5.6 FIB Alias: When a Destination Has Multiple Identities

In the previous section, we discussed the general structure of FIB Tables and fib_info. You might think everything looks perfect: one route, one fib_info, recording the gateway, device, and metric, clear and straightforward.

5.7 The Router's "Whisper": ICMPv4 Redirect

In the previous section, we explored the precise static structure of the FIB. You can think of it as a meticulously drawn map hanging on the wall.

📄️Chapter 5: The IPv4 Routing Subsystem

5.8 Quick Reference Panel

📄️Chapter 6: Multicast Routing

This chapter contains 9 sections. Click the links below to read:

6.0 Introduction: How Do You Send One Letter to a Whole Crowd?

Here's a counterintuitive problem in the networking world: if you want to send the exact same email to a hundred people simultaneously, what do you do?

6.2 Multicast Forwarding Cache (MFC)

In the previous section, we compared mr_table to a dispatch center—holding an interface table and an unresolved queue. But you might have noticed a detail: the actual "forwarding decision" logic hasn't appeared yet.

6.3 Multicast Router

In the previous section, we dissected mr_table and mfc_cache in the kernel, the skeleton of multicast forwarding.

📄️ch06_4

6.4 IPv4 Multicast Rx Path

📄️ch06_5

6.5 The ip_mr_forward() Method

6.6 The ipmr_queue_xmit() Method

In the previous section, we saw that ip_mr_forward() acts like a diligent dispatcher, deciding which Virtual Interface (VIF) a packet should be sent to. But it doesn't actually "ship" the packet. The real shipping work (route lookup, tunnel encapsulation, and handing the packet off to the NIC driver) is taken over by ipmr_queue_xmit().

📄️ch06_7

6.7 Policy Routing

6.8 Multipath Routing

Policy routing gives us the freedom to choose different routes, but it solves the problem of "selecting a route based on criteria other than the destination (such as the source address)."

6.9 Final Cheat Sheet

We have finally reached the end of this chapter.

📄️Chapter 7: The Neighbour Subsystem and ARP

This chapter contains 5 sections. Click the links below to read:

📄️Chapter 7: The Linux Neighbouring Subsystem

This chapter explores how the Linux kernel maintains link-layer address mappings, along with the implementation details of the ARP/NDISC protocols within the kernel.

7.2 Interacting with the Neighbor Subsystem from Userspace

In the previous section, we discussed the internal "housekeeping" of the neighbor subsystem: how memory is allocated, how reference counts are managed, and when entries are destroyed. If you are a kernel developer, these are your bricks and mortar.

📄️ch07_3

7.3 The ARP Protocol (IPv4)

7.4 NDISC Protocol (IPv6)

In the previous section, we finished discussing IPv4's ARP. To be fair, ARP does exactly what it's supposed to do: it's simple, direct, and effective. But it also has inherent flaws—there are no security mechanisms, and anyone can impersonate the gateway (what we commonly call ARP spoofing).

7.5 Quick Reference

At this point, we have dissected the skeleton (core structures), blood (protocol interactions), and muscles (state machine) of the neighbor subsystem.

📄️Chapter 8: IPv6 Protocol

This chapter contains 9 sections. Click the links below to read:

📄️Chapter 8: When Addresses Are No Longer a Scarce Resource

Chapter Prelude: Historical Baggage and a New Starting Point

8.2 IPv6 Address Types and Special Addresses

In the previous section, we broke down the IPv6 header structure. That massive 128-bit address space brings unprecedented address freedom, but it also introduces classification complexity. In the IPv4 era, we were used to distinguishing between unicast, broadcast, and multicast, but in IPv6, the rules of the game have changed.

📄️ch08_3

8.3 IPv6 Header

8.4 Extension Headers — Infinite Extension via Chaining

In the previous section, we spent a good while looking at IPv6's fixed 40-byte header. It's clean and pure, shedding all unnecessary baggage.
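To make the fixed 40-byte layout concrete, here is a small userspace sketch that unpacks those bytes: a 32-bit word holding version, traffic class, and flow label, followed by payload length, next header, hop limit, and the two 128-bit addresses. The function name `parse_ipv6_header` and the sample values are illustrative choices, not anything defined by the kernel.

```python
import ipaddress
import struct

IPV6_HEADER_FMT = "!IHBB16s16s"  # network byte order
IPV6_HEADER_LEN = struct.calcsize(IPV6_HEADER_FMT)


def parse_ipv6_header(data):
    """Hypothetical helper: decode the fixed IPv6 header at the start of data."""
    if len(data) < IPV6_HEADER_LEN:
        raise ValueError("need at least 40 bytes for the fixed IPv6 header")
    first_word, payload_len, next_header, hop_limit, src, dst = struct.unpack(
        IPV6_HEADER_FMT, data[:IPV6_HEADER_LEN])
    return {
        "version": first_word >> 28,                  # top 4 bits
        "traffic_class": (first_word >> 20) & 0xFF,   # next 8 bits
        "flow_label": first_word & 0xFFFFF,           # low 20 bits
        "payload_len": payload_len,
        "next_header": next_header,  # an upper-layer protocol or an extension header
        "hop_limit": hop_limit,
        "src": str(ipaddress.IPv6Address(src)),
        "dst": str(ipaddress.IPv6Address(dst)),
    }


# Sample header: version 6, 8-byte payload, next header 17 (UDP), hop limit 64.
sample = struct.pack(
    IPV6_HEADER_FMT, 6 << 28, 8, 17, 64,
    ipaddress.IPv6Address("2001:db8::1").packed,
    ipaddress.IPv6Address("2001:db8::2").packed)
header = parse_ipv6_header(sample)
```

The `next_header` field is the hook for chaining: when it names an extension header rather than an upper-layer protocol, that header carries its own next-header value pointing further down the chain.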

8.5 Autoconfiguration: Stateless Magic

When we wrapped up ipv6_rcv() in the previous section, I actually had a lingering concern.

8.6 Receiving IPv6 Packets

In the previous section, we discussed address configuration—how to give a machine an identity on the network. Now that it has an identity, the real action begins.

8.7 Receiving IPv6 Multicast Packets

In the previous section, we traced the journey of unicast packets through the kernel. Whether locally delivered or forwarded, their final destination was clear-cut. But the networking world also features a "one-to-many" communication model—multicast. For routers, multicast packet handling is inherently more complex: they must decide not only whether to listen themselves, but also whether to forward the traffic on behalf of their neighbors.

8.8 Multicast Listener Discovery (MLD)

In the previous section, we navigated the "maze" of packet delivery, ultimately routing packets to the correct Socket. That process resembled a series of security checks: first verifying the destination, then checking membership, and finally confirming the exact sender.

8.9 Quick Reference & Kernel Fragments

Throughout this book, we often say: "Don't memorize—understand the mechanisms." But before you dive deep into the kernel code, having an accurate "map" in hand is essential. This section is that map.

📄️Chapter 9: Netfilter and Firewalls

This chapter contains 10 sections. Click the links below to read:

📄️Chapter 9: Netfilter Frameworks

Chapter Intro: Setting Up Checkpoints on the Highway

📄️ch09_10

9.10 Quick Reference and Toolchain

9.2 Netfilter Hooks

Let's turn our focus back inside the kernel Network Stack.

📄️ch09_3

9.3 Connection Tracking Initialization: Injecting "State" into the Network Stack

9.4 Connection Tracking Entries

In the previous section, we discussed nf_conntrack_tuple, that "one-way ticket." Now the question arises: with this ticket in hand, where exactly does the kernel look for the corresponding "passenger record"? What does the structure that holds the connection state, known as struct nf_conn, actually look like?

9.5 Connection Tracking Helpers and Expected Connections

So far, the connection tracking we've discussed has been "single-threaded" — one connection comes in, one record goes out. In reality, though, network protocols are often much more complex than this.

9.6 IPTables: The Frontend Implementation of Rules

In the previous section, we discussed Xtables' extension mechanism and saw how Targets and Matches are registered in the kernel. Think of it as preparing a set of standard molds in a factory.

9.7 Network Address Translation (NAT)

Now we arrive at the most famous feature in the Netfilter world—NAT (Network Address Translation).

9.8 The Dance Between NAT Hook Callbacks and Conntrack Hook Callbacks

In the previous section, we registered the `nf_nat_ipv4_ops` hook array into the kernel, much like setting up checkpoints on a highway that every packet must pass through. But if you look closely at these checkpoints, you'll notice something interesting: at some of them, conntrack is checking IDs while NAT is modifying addresses. They crowd the same hook point, and the order in which they execute is a detail that can literally make or break the connection.

9.9 NAT Hook Callbacks and Connection Tracking Extensions

In the previous section, we saw how `manip_pkt`, our "surgeon," wields its scalpel to modify packet IPs and ports. But a question arises: who calls this function? When does it step in? More importantly, how does it know what to change the packet into—for instance, should it modify the source address (SNAT) or the destination address (DNAT)?

📄️Chapter 10: IPsec and Cryptography

This chapter contains 8 sections. Click the links below to read:

📄️Chapter 10: A Tunnel to Trust

There is a class of problems that appear to be network configuration issues on the surface, but are actually problems of trust.

10.2 IKE (Internet Key Exchange)

In the previous section, we mentioned that IPsec is not just a kernel module—it's a joint operation spanning both user space and kernel space. The kernel is only responsible for "execution"—that is, taking the keys and encrypting or decrypting packets. But before execution, someone needs to negotiate the keys and establish the rules. This "diplomatic negotiation" task is exactly what IKE (Internet Key Exchange) is all about.

10.3 The XFRM Framework

In the previous section, we discussed the cryptographic algorithms used by IPsec and saw how the kernel squeezes every ounce of CPU performance out of the Crypto API and pcrypt. Algorithms are the muscle, but muscle alone can't get the job done—you also need a skeleton and nervous system to organize these algorithms, telling the kernel when to encrypt, which key to use, and where to send the packet.

10.4 ESP Implementation (IPv4)

We have seen how the XFRM framework stores policies (SPD) and states (SAD)—like building a house and preparing its ledgers. But how traffic flows in and out of that house, and how the rules in those ledgers are enforced, depends on the specific protocol.

10.5 Receiving an IPsec Packet (Transport Mode)

In the previous section, we paused at the end of initialization at xfrm4_rcv()—the receive entry point that the ESP protocol registered with the kernel. The pipeline is laid out, so what kind of journey does a packet actually take when it flows in?

📄️ch10_6

10.6 XFRM Lookup

10.7 NAT Traversal in IPsec

In the previous section, we were enjoying the sense of certainty brought by xfrm_lookup() — the policy matched, the state was found, the encryption completed, and the packet was sent out. Everything looked great.

10.8 Quick Reference

The content in this chapter is admittedly brain-bending—the XFRM framework's state machine, the entanglement of policies and states, and NAT-T's "compromises born of necessity."

📄️Chapter 11: Transport Layer Protocols

This chapter contains 7 sections. Click the links below to read:

11.1 Sockets

A philosophical tenet has run throughout Unix history: "everything is a file."

📄️ch11_2

11.2 Creating Sockets

11.3 UDP (User Datagram Protocol)

Remember those fields we saw in the `msghdr` structure in the previous section? `msg_iov` stores data, and `msg_control` stores auxiliary information. At the time, you might have thought they were just boring data structure definitions.

11.4 TCP (Transmission Control Protocol)

If the UDP we discussed in the previous section is a carefree "fire-and-forget" optimist, then the TCP we face in this section is the most severe control freak in the network protocol world.

11.5 SCTP: The Hybrid Born of Engineering Trade-offs

In the previous section, we left off looking at TCP's complex and meticulous world, which spares no expense for reliability and ordered delivery. But in an engineer's reality, not all scenarios can tolerate TCP's rigidity, nor can they fully accept UDP's indifference. What we need is a hybrid—combining TCP's reliability and congestion control with UDP's message boundaries and multi-homing capabilities.

11.6 DCCP: Datagram Congestion Control Protocol

We have finally reached the last stop on the IPv4 transport layer tour.

11.7 Quick Reference Manual

We've read through the code, studied the protocols — now let's assemble all the scattered pieces back into a single blueprint.

📄️Chapter 12: Wireless Networking

This chapter contains 9 sections. Click the links below to read:

12.1 Mac80211 Subsystem

Before diving into the Linux kernel's wireless implementation, we need to face a reality: while wireless and wired networks look similar at the /etc/network/interfaces level, to the kernel, they are entirely different beasts.

12.2 802.11 MAC Header

12.3 Network Topologies

With the frame header byte ordering sorted out, we need to take a step back and look at the bigger picture.

12.4 Power Save Mode

Beyond forwarding packets, an AP has another critical function: acting as a caretaker for "sleeping" clients by caching their data.

📄️ch12_5

12.5 MAC Layer Management Entity (MLME)

📄️mac80211 Implementation Details

Now, let's shift our focus from the over-the-air protocol interactions back into the kernel.

12.7 High Throughput (802.11n) — The Ticket to the Highway

When we finished discussing the skeleton and muscles of mac80211 in the previous section, I mentioned that the wireless world needs not just connectivity, but speed.

📄️ch12_8

12.8 Mesh Networking (802.11s)

📄️ch12_9

12.9 Quick Reference

📄️Chapter 13: RDMA and High-Performance Networking

This chapter contains 8 sections. Click the links below to read:

📄️Chapter 13: The Cost of Bypassing the Kernel

There is a class of problems that appear to be about "network performance," but are actually about "who is paying for it."

📄️ch13_2

13.2 RDMA Device: Who Takes Control of This Machine?

13.3 Memory Region (MR)

In the previous section, we got the Address Handle (AH) sorted out—like putting up road signs for data packets in the RDMA network. But that's not enough. Road signs only tell you which way to go; the vehicle (data) still needs somewhere to be loaded.

📄️ch13_4

13.4 Completion Queue (CQ) — The Mailbox for Task Completion

13.5 Shared Receive Queue (SRQ)

We've already covered QPs, CQs, and the various fancy fields. You might feel that the RDMA object model resembles a set of Russian nesting dolls, layer after layer.

📄️ch13_6

13.6 Queue Pair

13.7 Supported RDMA Operations

In the previous section, we spent a lot of effort figuring out how the "avatar" known as a QP is created and how fragile its lifecycle can be.

13.8 Cheat Sheet

By now, we've dissected most of the bones in the RDMA stack. You should have a lot of loose parts on hand: ib_client, PD, QP, CQ, MR...

📄️Chapter 14: Network Namespaces and Advanced Features

This chapter contains 15 sections. Click the links below to read:

14. Namespaces Implementation

Chapter Intro: Invisible Walls

14.10 Notification Chains

In the previous section, we discussed NFC as a "handshake protocol" where the kernel plays the role of an ingenious translator. But the kernel doesn't just translate between hardware components—it must also constantly monitor state changes across the entire system.

14.11 The PCI Subsystem

In the previous section, we covered the kernel's notification chains — an art of decoupling at the software level.

14.12 PPPoE Header — Pinning the Protocol onto Ethernet

In the previous section, we walked through the two phases of PPPoE — Discovery and Session — like watching two people shake hands and greet each other before starting a conversation.

14.13 Android

In the previous section, we were still immersed in the world of PPPoE, figuring out how to wrap Ethernet frames into point-to-point channels. In this section, we'll zoom out and look at one of the largest "tenants" of Linux kernel networking in the mobile world—Android.