Chapter 1: Linux Network Stack Overview
This chapter contains 3 sections. Click the links below to read:
Chapter 1: Deep into the Kernel: Dissecting the Network Stack Black Box
Chapter Introduction
1.2 The Network Device
Let's shift our focus down to the bottom layer — Layer 2 (L2), the data link layer.
1.3 Linux Kernel Network Development Model
The networking subsystem is incredibly complex, and it evolves at breakneck speed—fast enough that if you blink, you might miss an API change.
Chapter 2: Netlink Sockets
This chapter contains 7 sections. Click the links below to read:
Chapter 2: When Userspace Meets the Kernel
Imagine you are writing an application that needs to monitor network traffic changes. Whenever the kernel's routing table changes or a new network interface comes online, your program needs to know immediately.
2.2 Kernel Netlink Sockets
In the previous section, we discussed the userspace tools—iproute2, net-tools, and the handy libnl and libmnl libraries. But on the other side of the fence, how does the kernel view all of this?
2.3 Netlink Message Header
In the previous section, we saw how the kernel places messages into the send queue — tracing the call chain from rtnetlink all the way to the generic netlink module.
2.4 NETLINK_ROUTE Messages: More Than Just Routing
In the previous section, we spent our time deep in the Generic Netlink mechanism, tearing apart message headers, TLVs, and validation policies. Now that the foundation is laid, it's time to shift our focus back to the networking subsystem itself and see how the "veteran" Netlink protocol families actually work.
2.5 Adding and Removing Routing Table Entries: Dancing in the FIB
In the previous section, we broke down the Netlink message header and those dazzling flag bits. Now, it's time to throw that theory into a real-world scenario.
2.6 Generic Netlink Protocol
Remember the question we left you with at the end of the last section?
2.7 Quick Reference
Chapter 3: ICMP Protocol
This chapter contains 3 sections. Click the links below to read:
3.0 The Network's Nervous System: Why We Need ICMP
Imagine managing a massive, globally distributed package delivery system.
3.2 The "Swiss Army Knife" of IPv6: ICMPv6
In the previous section, we navigated the world of IPv4. While ICMPv4 is certainly important, within the IPv4 architecture, it plays somewhat of a "patch" role—ARP handles address resolution, IGMP manages multicast, and ICMPv4 is mainly left with error reporting and diagnostics.
3.3 Quick Reference and Practical Supplements
In this chapter, we broke down the internal workings of ICMPv4 and ICMPv6, from initialization flows to packet transmission and reception, covering a significant amount of kernel code. Now it's time to consolidate these scattered details into a quick reference, and along the way, cover two "corners" that are easily overlooked in engineering practice.
Chapter 4: IPv4 Protocol Implementation
This chapter contains 9 sections. Click the links below to read:
4.1 IPv4 Header and Protocol Registration
There is a category of problems that appear to be "misconfigurations" on the surface, but are actually "structural misunderstandings."
4.2 Receiving IPv4 Packets
In the previous section, we covered the "registration" process of IPv4: the kernel hangs `ip_packet_type` on the global protocol list and sets `ip_rcv()` as the callback function. When the NIC delivers a packet and the Ethernet type is `0x0800`, the kernel heads straight for this function.
4.3 Receiving IPv4 Multicast Packets
At the end of the previous section, we mentioned that once ip_rcv_finish() completes route lookup and option processing, a packet's fate generally splits into three paths: local delivery, forwarding, or multicast.
4.4 When IP Options Wake Up in a Packet
In the previous section, we followed a multicast packet through its entire lifecycle in the kernel, entering from ip_rcv, either being consumed locally or forwarded onward. That path was clean, like a highway with no obstacles.
4.5 Sending IPv4 Packets
Now, let's switch roles.
4.6 Fragmentation
In the previous section, we mentioned that the packet ultimately leaves the local machine, passing the point of no return via ip_local_out().
4.7 Reassembly: Putting the Shattered Mirror Back Together
In the previous section, we sliced large packets into fragments like a chef chopping vegetables. That was satisfying, but it left behind a massive mess.
4.8 Packet Forwarding
In the previous section, we reassembled the shattered mirror, watching those fragments come back together in ip_defrag() before finally being handed off to the transport layer.
4.9 Quick Reference
Now that we've completed this journey, let's pause and pack our backpack.
Chapter 5: The IPv4 Routing Subsystem
This chapter contains 8 sections. Click the links below to read:
5.0 No Map, No Journey
One of the core tasks of the Linux network stack is to act as a transit hub—accurately forwarding packets that don't belong to the local machine to the next hop. This sounds simple, but when you're dealing with a core router on the internet backbone, it's an entirely different scale of problem. In this world, network topologies change in the blink of an eye, and massive amounts of routing information are updated every single second.
5.2 Performing Lookups in the Routing Subsystem
In the previous section, we figured out what the FIB is—it's not a simple table, but rather the "treasure map" the kernel uses to determine a packet's fate. We have the map, but nobody has read it yet.
5.3 FIB Info: The "ID Card" of a Routing Entry
In the previous section, we mentioned that once fib_lookup() fills in fib_result, it has essentially completed its historical mission. The most important pointer in the result structure, the one pointing to fib_info, is the real boss.
5.4 The Last Mile: Nexthop (fib_nh)
5.5 Policy Routing: Choices Beyond the Map
In the previous section, we discussed how fib_nh acts like a dutiful guide, holding a sticky note with the outgoing interface and gateway address, directing packets on how to leave the kernel. We built a very intuitive mental model: the destination determines the route. As long as you know where you want to go (the destination IP), the routing table can tell you how to get there.
5.6 FIB Alias: When a Destination Has Multiple Identities
In the previous section, we discussed the general structure of FIB tables and fib_info. You might think everything looks perfect: one route, one fib_info, recording the gateway, device, and metric, clear and straightforward.
5.7 The Router's "Whisper": ICMPv4 Redirect
In the previous section, we explored the precise static structure of the FIB. You can think of it as a meticulously drawn map hanging on the wall.
5.8 Quick Reference Panel
Chapter 6: Multicast Routing
This chapter contains 9 sections. Click the links below to read:
6.0 Introduction: How Do You Send One Letter to a Whole Crowd?
Here's a counterintuitive problem in the networking world: if you want to send the exact same email to a hundred people simultaneously, what do you do?
6.2 Multicast Forwarding Cache (MFC)
In the previous section, we compared mr_table to a dispatch center—holding an interface table and an unresolved queue. But you might have noticed a detail: the actual "forwarding decision" logic hasn't appeared yet.
6.3 Multicast Router
In the previous section, we dissected mr_table and mfc_cache in the kernel; together they form the skeleton of multicast forwarding.
6.4 IPv4 Multicast Rx Path
6.5 The ip_mr_forward() Method
6.6 The ipmr_queue_xmit() Method
In the previous section, we saw that ip_mr_forward() acts like a diligent dispatcher, deciding which Virtual Interface (VIF) a packet should be sent to. But it doesn't actually "ship" the packet. The real shipping work, including route lookup, tunnel encapsulation, and handing the packet off to the NIC driver, is taken over by ipmr_queue_xmit().
6.7 Policy Routing
6.8 Multipath Routing
Policy routing gives us the freedom to choose different routes, but it solves the problem of "selecting a route based on criteria other than the destination (such as the source address)."
6.9 Final Cheat Sheet
We have finally reached the end of this chapter.
Chapter 7: The Neighbour Subsystem and ARP
This chapter contains 5 sections. Click the links below to read:
Chapter 7: The Linux Neighbouring Subsystem
This chapter explores how the Linux kernel maintains link-layer address mappings, along with the implementation details of the ARP/NDISC protocols within the kernel.
7.2 Interacting with the Neighbor Subsystem from Userspace
In the previous section, we discussed the internal "housekeeping" of the neighbor subsystem: how memory is allocated, how reference counts are managed, and when entries are destroyed. If you are a kernel developer, these are your bricks and mortar.
7.3 The ARP Protocol (IPv4)
7.4 NDISC Protocol (IPv6)
In the previous section, we finished discussing IPv4's ARP. To be fair, ARP does exactly what it's supposed to do: it's simple, direct, and effective. But it also has inherent flaws—there are no security mechanisms, and anyone can impersonate the gateway (what we commonly call ARP spoofing).
7.5 Quick Reference
At this point, we have dissected the skeleton (core structures), blood (protocol interactions), and muscles (state machine) of the neighbor subsystem.
Chapter 8: IPv6 Protocol
This chapter contains 9 sections. Click the links below to read:
Chapter 8: When Addresses Are No Longer a Scarce Resource
Chapter Prelude: Historical Baggage and a New Starting Point
8.2 IPv6 Address Types and Special Addresses
In the previous section, we broke down the IPv6 header structure. That massive 128-bit address space brings unprecedented address freedom, but it also introduces classification complexity. In the IPv4 era, we were used to distinguishing between unicast, broadcast, and multicast, but in IPv6, the rules of the game have changed.
8.3 IPv6 Header
8.4 Extension Headers: Infinite Extension via Chaining
In the previous section, we spent a good while looking at IPv6's fixed 40-byte header. It's clean and pure, shedding all unnecessary baggage.
8.5 Autoconfiguration: Stateless Magic
When we wrapped up ipv6_rcv() in the previous section, I actually had a lingering concern.
8.6 Receiving IPv6 Packets
In the previous section, we discussed address configuration—how to give a machine an identity on the network. Now that it has an identity, the real action begins.
8.7 Receiving IPv6 Multicast Packets
In the previous section, we traced the journey of unicast packets through the kernel. Whether locally delivered or forwarded, their final destination was clear-cut. But the networking world also features a "one-to-many" communication model—multicast. For routers, multicast packet handling is inherently more complex: they must decide not only whether to listen themselves, but also whether to forward the traffic on behalf of their neighbors.
8.8 Multicast Listener Discovery (MLD)
In the previous section, we navigated the "maze" of packet delivery, ultimately routing packets to the correct Socket. That process resembled a series of security checks: first verifying the destination, then checking membership, and finally confirming the exact sender.
8.9 Quick Reference & Kernel Fragments
Throughout this book, we often say: "Don't memorize—understand the mechanisms." But before you dive deep into the kernel code, having an accurate "map" in hand is essential. This section is that map.
Chapter 9: Netfilter and Firewalls
This chapter contains 10 sections. Click the links below to read:
Chapter 9: Netfilter Frameworks
Chapter Intro: Setting Up Checkpoints on the Highway
9.10 Quick Reference and Toolchain
9.2 Netfilter Hooks
Let's turn our focus back inside the kernel Network Stack.
9.3 Connection Tracking Initialization: Injecting "State" into the Network Stack
9.4 Connection Tracking Entries
In the previous section, we discussed nf_conntrack_tuple, that "one-way ticket." Now the question arises: with this ticket in hand, where exactly does the kernel look for the corresponding "passenger record"? What does the structure that holds the connection state, known as struct nf_conn, actually look like?
9.5 Connection Tracking Helpers and Expected Connections
So far, the connection tracking we've discussed has been "single-threaded" — one connection comes in, one record goes out. In reality, though, network protocols are often much more complex than this.
9.6 IPTables: The Frontend Implementation of Rules
In the previous section, we discussed Xtables' extension mechanism and saw how Targets and Matches are registered in the kernel. Think of it as preparing a set of standard molds in a factory.
9.7 Network Address Translation (NAT)
Now we arrive at the most famous feature in the Netfilter world—NAT (Network Address Translation).
9.8 The Dance Between NAT Hook Callbacks and Conntrack Hook Callbacks
In the previous section, we registered the `nf_nat_ipv4_ops` hook array into the kernel, much like setting up checkpoints on a highway that every packet must pass through. But if you look closely at these checkpoints, you'll notice something interesting: at some of them, conntrack is checking IDs while NAT is modifying addresses. They crowd the same hook point, and the order in which they execute is a detail that can literally make or break the connection.
9.9 NAT Hook Callbacks and Connection Tracking Extensions
In the previous section, we saw how `manip_pkt`, our "surgeon," wields its scalpel to modify packet IPs and ports. But a question arises: who calls this function? When does it step in? More importantly, how does it know what to change the packet into—for instance, should it modify the source address (SNAT) or the destination address (DNAT)?
Chapter 10: IPsec and Cryptography
This chapter contains 8 sections. Click the links below to read:
Chapter 10: A Tunnel to Trust
There is a class of problems that appear to be network configuration issues on the surface, but are actually problems of trust.
10.2 IKE (Internet Key Exchange)
In the previous section, we mentioned that IPsec is not just a kernel module—it's a joint operation spanning both user space and kernel space. The kernel is only responsible for "execution"—that is, taking the keys and encrypting or decrypting packets. But before execution, someone needs to negotiate the keys and establish the rules. This "diplomatic negotiation" task is exactly what IKE (Internet Key Exchange) is all about.
10.3 The XFRM Framework
In the previous section, we discussed the cryptographic algorithms used by IPsec and saw how the kernel squeezes every ounce of CPU performance out of the Crypto API and pcrypt. Algorithms are the muscle, but muscle alone can't get the job done—you also need a skeleton and nervous system to organize these algorithms, telling the kernel when to encrypt, which key to use, and where to send the packet.
10.4 ESP Implementation (IPv4)
We have seen how the XFRM framework stores policies (SPD) and states (SAD)—like building a house and preparing its ledgers. But how traffic flows in and out of that house, and how the rules in those ledgers are enforced, depends on the specific protocol.
10.5 Receiving an IPsec Packet (Transport Mode)
In the previous section, we paused at the end of initialization at xfrm4_rcv()—the receive entry point that the ESP protocol registered with the kernel. The pipeline is laid out, so what kind of journey does a packet actually take when it flows in?
10.6 XFRM Lookup
10.7 NAT Traversal in IPsec
In the previous section, we were enjoying the sense of certainty brought by xfrm_lookup() — the policy matched, the state was found, the encryption completed, and the packet was sent out. Everything looked great.
10.8 Quick Reference
The content in this chapter is admittedly brain-bending—the XFRM framework's state machine, the entanglement of policies and states, and NAT-T's "compromises born of necessity."
Chapter 11: Transport Layer Protocols
This chapter contains 7 sections. Click the links below to read:
11.1 Sockets
A philosophical idea has run throughout Unix history: "everything is a file."
11.2 Creating Sockets
11.3 UDP (User Datagram Protocol)
Remember those fields we saw in the `msghdr` structure in the previous section? `msg_iov` stores data, and `msg_control` stores auxiliary information. At the time, you might have thought they were just boring data structure definitions.
11.4 TCP (Transmission Control Protocol)
If the UDP we discussed in the previous section is a carefree "fire-and-forget" optimist, then the TCP we face in this section is the most severe control freak in the network protocol world.
11.5 SCTP: The Hybrid Born of Engineering Trade-offs
In the previous section, we left off looking at TCP's complex and meticulous world, which spares no expense for reliability and ordered delivery. But in an engineer's reality, not all scenarios can tolerate TCP's rigidity, nor can they fully accept UDP's indifference. What we need is a hybrid: one that combines TCP's reliability and congestion control with UDP's message boundaries, and adds multi-homing on top.
11.6 DCCP: Datagram Congestion Control Protocol
We have finally reached the last stop on the IPv4 transport layer tour.
11.7 Quick Reference Manual
We've read through the code, studied the protocols — now let's assemble all the scattered pieces back into a single blueprint.
Chapter 12: Wireless Networking
This chapter contains 9 sections. Click the links below to read:
12.1 Mac80211 Subsystem
Before diving into the Linux kernel's wireless implementation, we need to face a reality: while wireless and wired networks look similar at the /etc/network/interfaces level, to the kernel, they are entirely different beasts.
12.2 802.11 MAC Header
12.3 Network Topologies
With the frame header byte ordering sorted out, we need to take a step back and look at the bigger picture.
12.4 Power Save Mode
Beyond forwarding packets, an AP has another critical function: acting as a caretaker for "sleeping" clients by caching their data.
12.5 MAC Layer Management Entity (MLME)
12.6 mac80211 Implementation Details
Now, let's shift our focus from the over-the-air protocol interactions back into the kernel.
12.7 High Throughput (802.11n): The Ticket to the Highway
When we finished discussing the skeleton and muscles of mac80211 in the previous section, I mentioned that the wireless world needs not just connectivity, but speed.
12.8 Mesh Networking (802.11s)
12.9 Quick Reference
Chapter 13: RDMA and High-Performance Networking
This chapter contains 8 sections. Click the links below to read:
Chapter 13: The Cost of Bypassing the Kernel
There is a class of problems that appear to be about "network performance," but are actually about "who is paying for it."
13.2 RDMA Device: Who Takes Control of This Machine?
13.3 Memory Region (MR)
In the previous section, we got the Address Handle (AH) sorted out—like putting up road signs for data packets in the RDMA network. But that's not enough. Road signs only tell you which way to go; the vehicle (data) still needs somewhere to be loaded.
13.4 Completion Queue (CQ) — The Mailbox for Task Completion
13.5 Shared Receive Queue (SRQ)
We've already covered QPs, CQs, and the various fancy fields. You might feel that the RDMA object model resembles a set of Russian nesting dolls, layer after layer.
13.6 Queue Pair
13.7 Supported RDMA Operations
In the previous section, we spent a lot of effort figuring out how the "avatar" known as a QP is created and how fragile its lifecycle can be.
13.8 Cheat Sheet
By now, we've dissected most of the bones in the RDMA stack. You should have a lot of loose parts on hand: ib_client, PD, QP, CQ, MR...
Chapter 14: Network Namespaces and Advanced Features
This chapter contains 15 sections. Click the links below to read:
14. Namespaces Implementation
Chapter Intro: Invisible Walls
14.10 Notification Chains
In the previous section, we discussed NFC as a "handshake protocol" where the kernel plays the role of an ingenious translator. But the kernel doesn't just translate between hardware components—it must also constantly monitor state changes across the entire system.
14.11 The PCI Subsystem
In the previous section, we covered the kernel's notification chains — an art of decoupling at the software level.
14.12 PPPoE Header: Pinning the Protocol onto Ethernet
In the previous section, we walked through the two phases of PPPoE — Discovery and Session — like watching two people shake hands and greet each other before starting a conversation.
14.13 Android
In the previous section, we were still immersed in the world of PPPoE, figuring out how to wrap Ethernet frames into point-to-point channels. In this section, we'll zoom out and look at one of the largest "tenants" of Linux kernel networking in the mobile world—Android.
14.14 Quick Reference (The Wrenches and Screwdrivers in Your Toolbox)
And so, our journey truly comes to an end.
14.15 Macros and Utility Functions
14.2 UTS Namespace Implementation
In the previous section, we mentioned that the kernel doesn't care about a namespace's "name" — it distinguishes different instances solely by inode number. This sounds minimalist, but minimalism is the highest form of design.
14.3 Network Namespaces Implementation
In the previous section, we covered the UTS namespace—that easy target that only manages the hostname. It served as a warm-up, showing us how the kernel transforms the bad habit of "global variables" into "namespace-private data."
14.4 Managing Network Namespaces: From a God's-Eye View to Hands-On Operations
In the previous section, we traced the past and present of struct net from the kernel's perspective. Now, let's return to user space.
14.5 Cgroups: When Isolation Meets Resource Contention
Namespaces solve one problem: out of sight, out of mind.
14.6 Busy Poll Sockets
In the previous section, we covered Cgroups and Namespaces, the cornerstones of containers. Now, let's shift our focus away from "isolation" and zero in on a topic of extreme performance.
14.7 The Linux Bluetooth Subsystem
In the previous section, we were stressing over network latency, discussing how to squeeze every last drop of performance out of the system using "hardcore" techniques like Busy Poll. This is an extreme strategy that trades CPU idle time for speed.
14.8 IEEE 802.15.4 and 6LoWPAN
In the previous section, we enjoyed the conveniences of the Bluetooth "Personal Area Network." In this section, we are heading into a much stingier, more demanding world.
14.9 Near Field Communication (NFC)
If you think Bluetooth's range of a few meters still isn't close enough, let's get even closer.