4.1 IPv4 Header and Protocol Registration
There is a category of problems that appear to be "misconfigurations" on the surface, but are actually "structural misunderstandings."
The problem we tackle in this chapter is exactly that kind. You might think you know the IP header — source address, destination address, checksum, nothing more — but when you stand face-to-face with the kernel code, trying to send a crafted packet through a Raw Socket or analyzing a strange fragmentation fault in a packet capture, you'll find that the textbook's 20-byte definition hides many overlooked details.
The mission of this chapter is to lay those details bare. We will start from every single bit of the IPv4 header, dissecting this protocol header nerve by nerve like a frog in biology class. This is not only the foundation for understanding tcpdump output, but also a necessary prerequisite for understanding subsequent Netfilter filtering, route lookup, and fragment reassembly.
Ready? Let's start with that most familiar "face."
IPv4 Header
If the network is a postal system, then the IPv4 header is the envelope. But this is a very special envelope: its layout is fixed, yet its length is variable. It doesn't just say where the letter is from and where it's going; it also tells the intermediate sorters (routers) how urgent it is (TOS), whether it may be torn into pieces for delivery (DF), and, if it has been torn apart, which piece of the original letter this fragment is.
In the eyes of the Linux kernel, this envelope is a C struct, struct iphdr, defined in include/uapi/linux/ip.h.
Don't rush to look at the fields just yet — first, look at this diagram (just get a mental impression):
 0                              16                            31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|     Fragment Offset     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Time to Live  |    Protocol   |        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Source Address                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Options (optional)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
This is the "face" the kernel sees when processing every IP packet. The code-level definition is as follows:
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u8    ihl:4,
            version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
    __u8    version:4,
            ihl:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
    __u8    tos;
    __be16  tot_len;
    __be16  id;
    __be16  frag_off;
    __u8    ttl;
    __u8    protocol;
    __sum16 check;
    __be32  saddr;
    __be32  daddr;
    /*The options start here. */
};
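You'll notice it feels a bit awkward at first glance: version and ihl are crammed into a single byte, and their declaration order flips depending on the platform. On the wire, version always occupies the high 4 bits of the first byte; the #if exists only because C compilers lay out bitfields in opposite orders on little-endian and big-endian targets. Packing two fields into one byte is there to squeeze every last bit of space. Let's break it down one by one.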
1. Version and Header Length (version / ihl)
version is simple: it must be 4. If it isn't 4, the kernel will immediately drop the packet because it simply doesn't recognize it.
ihl (Internet Header Length) is the key. It tells us exactly how long this IPv4 header is.
Here is a counterintuitive point: the IPv4 header length is not fixed.
Unlike IPv6, whose base header is fixed at 40 bytes, the IPv4 header is a minimum of 20 bytes (when there are no options) and a maximum of 60 bytes. The ihl field counts in 4-byte (32-bit) units, so it stores the "number of units," not the number of bytes; because the field is only 4 bits wide, the largest value is 15, which is where the 60-byte ceiling comes from.
- Minimum 20 bytes → ihl = 5
- Maximum 60 bytes → ihl = 15
Why design it this way?
For extensibility. The fixed 20 bytes of the IPv4 header may be followed by an options field. Although options are rarely used today, the protocol's designers thought that one day we might need to stuff more routing information or security parameters into the header. This design "parameterized" the length: as long as we have ihl, the kernel can calculate where the true header ends and where the data payload begins.
Returning to the envelope analogy: ihl tells the sorter how many layers of "envelope paper" this letter has been folded into. If you tear it open without checking this, you might rip the actual letter inside.
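To make the ihl arithmetic concrete, here is a minimal user-space sketch that reads version and ihl straight from the first byte of a raw packet and locates the payload. The function names (ipv4_version, ipv4_header_len, ipv4_payload) are hypothetical helpers invented for this illustration, not kernel APIs; the kernel performs its own equivalent checks on the receive path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* version occupies the high nibble of the first header byte. */
static unsigned ipv4_version(const uint8_t *pkt)
{
    assert(pkt != NULL);
    return pkt[0] >> 4;
}

/* ihl occupies the low nibble and counts 32-bit words, so multiply by 4. */
static size_t ipv4_header_len(const uint8_t *pkt)
{
    assert(pkt != NULL);
    return (size_t)(pkt[0] & 0x0F) * 4;
}

/*
 * Return a pointer to the payload, or NULL if the header is malformed:
 * buffer shorter than 20 bytes, version not 4, ihl below 5, or a header
 * that claims to be longer than the buffer.
 */
static const uint8_t *ipv4_payload(const uint8_t *pkt, size_t len)
{
    if (len < 20 || ipv4_version(pkt) != 4)
        return NULL;

    size_t hlen = ipv4_header_len(pkt);
    if (hlen < 20 || hlen > len)
        return NULL;

    return pkt + hlen;
}
```

A first byte of 0x45 is what you will see in almost every capture: version 4, ihl 5, so the payload starts at byte 20. A first byte of 0x4F means the header carries the full 40 bytes of options.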
2. Type of Service and Congestion Notification (tos / DSCP / ECN)
The 8 bits of tos (Type of Service) have undergone multiple "repurposings" throughout history.
- Originally (RFC 791): It was used for QoS. You could mark a packet as "minimize delay," "maximize throughput," etc., hoping routers would prioritize it. It's like stamping "URGENT" on an envelope.
- Later (RFC 2474): People found this too coarse, so the byte was redefined as the DS Field (Differentiated Services Field), whose first 6 bits carry the DSCP (Differentiated Services Code Point). Modern network devices primarily look at these 6 bits for traffic control (such as QoS marking).
- Finally (RFC 3168): The remaining 2 bits weren't left idle either — they were co-opted for ECN (Explicit Congestion Notification).
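These 2 bits (bits 6 and 7 in RFC numbering, that is, the two least significant bits of the byte) are quite interesting. They let a router avoid the blunt response to congestion, which is dropping packets. If the sender has marked the packet as ECN-capable, the router instead sets both bits to 1 (the "Congestion Experienced" codepoint). The receiver sees the mark and echoes it back to the sender through the transport protocol, in effect saying: "Hey, it was a bit congested just now, slow down your sending." This is an important mechanism in modern networks for combating congestion.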
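The split of the tos byte can be sketched in a few lines of C. This is an illustration only: the enum and function names below are hypothetical (the kernel has its own INET_ECN helpers), and the codepoint values follow RFC 3168.

```c
#include <assert.h>
#include <stdint.h>

/* The four ECN codepoints carried in the low 2 bits of the tos byte. */
enum ecn_codepoint {
    ECN_NOT_ECT = 0, /* 00: sender does not support ECN       */
    ECN_ECT_1   = 1, /* 01: ECN-capable transport, ECT(1)     */
    ECN_ECT_0   = 2, /* 10: ECN-capable transport, ECT(0)     */
    ECN_CE      = 3  /* 11: congestion experienced            */
};

/* Upper 6 bits: the DSCP value. */
static uint8_t tos_dscp(uint8_t tos) { return tos >> 2; }

/* Lower 2 bits: the ECN codepoint. */
static uint8_t tos_ecn(uint8_t tos)  { return tos & 0x03; }

/* Build a tos byte from its two parts. */
static uint8_t tos_make(uint8_t dscp, uint8_t ecn)
{
    assert(dscp < 64 && ecn < 4);
    return (uint8_t)((dscp << 2) | ecn);
}

/*
 * What a congested, ECN-aware router does instead of dropping: set CE,
 * but only when the sender declared itself ECN-capable. A Not-ECT packet
 * is returned unchanged here (a real router would fall back to dropping).
 */
static uint8_t tos_mark_ce(uint8_t tos)
{
    if (tos_ecn(tos) == ECN_NOT_ECT)
        return tos;
    return tos | ECN_CE;
}
```

For example, a tos byte of 0xB8 decodes to DSCP 46 (the value commonly used for Expedited Forwarding) with ECN 00.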
3. Total Length (tot_len)
tot_len is the length of the entire IP packet, including both header and data. It is 16 bits, so the maximum is 65,535 bytes.
There's a pitfall here: the Ethernet MTU is typically 1500 bytes. If you send a 3000-byte IP packet, the IP layer must slice it into fragments that fit. Each fragment is a complete IP packet with its own header, and its tot_len records the length of that fragment alone, not the length of the original packet. The receiver never sees the original total directly: it derives it from the last fragment (the one with the More Fragments flag cleared) as offset plus payload length, and reassembly is complete once every byte up to that point has arrived.
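The 3000-byte example can be worked through in code. The sketch below plans the fragments for a datagram with a 20-byte, option-free header; struct frag and plan_fragments are hypothetical names for this illustration and do not mirror the kernel's fragmentation code. Each planned fragment carries its own tot_len, and all fragments except the last carry a payload that is a multiple of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct frag {
    uint16_t tot_len; /* header + payload of this fragment, in bytes */
    uint16_t offset8; /* payload offset in 8-byte units              */
    int      more;    /* 1 if more fragments follow                  */
};

/*
 * Plan the fragments for `payload` data bytes behind a 20-byte header on
 * a link with the given MTU. Writes at most `max` entries to `out` and
 * returns how many were written.
 */
static size_t plan_fragments(size_t payload, size_t mtu,
                             struct frag *out, size_t max)
{
    const size_t hlen = 20;
    assert(out != NULL && mtu >= hlen + 8 && hlen + payload <= 65535);

    /* Every fragment but the last must carry a multiple of 8 bytes. */
    size_t chunk = ((mtu - hlen) / 8) * 8;
    size_t n = 0, off = 0;

    while (off < payload && n < max) {
        size_t take = (payload - off < chunk) ? payload - off : chunk;

        out[n].tot_len = (uint16_t)(hlen + take);
        out[n].offset8 = (uint16_t)(off / 8);
        out[n].more    = (off + take < payload);
        off += take;
        n++;
    }
    return n;
}
```

With a 3000-byte packet (2980 bytes of payload) and an MTU of 1500, this yields three fragments with tot_len 1500, 1500, and 40. The last one has offset 370 (2960 bytes), so the receiver can recover the original payload size as 2960 + 20 = 2980.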
4. Identification and Fragmentation (id / frag_off)
This is one of the most troublesome parts of IPv4.
id (Identification) is a 16-bit ID. When an IP packet is sliced into multiple fragments, all fragments must share the same id. When the receiver reassembles them, it's like a jigsaw puzzle — first find all fragments with the same id, then piece them together in order.
frag_off (Fragment Offset) is even more extreme — it has to store both the offset and the flag bits, all crammed into 16 bits:
- Low 13 bits: The offset. Note that the unit is 8 bytes. So the offset of the first fragment is 0. If the second fragment starts at byte 1400, the value here is 1400 / 8 = 175.
- High 3 bits: Flag bits.
These flag bits are defined as macros in the kernel (include/net/ip.h), as masks over the 16-bit field in host byte order:
- IP_MF (More Fragments, 0x2000): Flag value 001. Means "there are more siblings." Every fragment except the last must have this bit set to 1.
- IP_DF (Don't Fragment, 0x4000): Flag value 010. Means "don't slice me!" If a packet has the DF flag set but encounters a link whose MTU is too small along the way, the router drops it and sends back an ICMP "Fragmentation Needed" message. This is the cornerstone of PMTUD (Path MTU Discovery).
- IP_CE (0x8000): Flag value 100. The kernel header labels this bit "Congestion," but RFC 791 reserves it and requires it to be zero. ECN does not use it; the ECN bits live in the tos byte.
⚠️ Pitfall Warning: Many beginners see a frag_off value of 16384 (decimal) when capturing packets and assume the offset is huge, but they are actually looking at the DF flag bit (0x4000) with an offset of 0. When examining fragmentation, always apply a mask (IP_OFFSET, 0x1FFF) to strip the high 3 bits and look only at the low 13 bits.
5. Time to Live (ttl)
ttl (Time To Live) is essentially a "countdown timer." It is decremented by 1 every time it passes through a router. When it reaches 0, the packet is destroyed.
This exists to prevent "zombie packets" caused by routing loops from circulating in the network forever. When a router discards a packet for this reason, it sends back an ICMP Time Exceeded error (ICMP_TIME_EXCEEDED in the kernel). traceroute exploits exactly this: it sends probes with TTL 1, 2, 3, and so on, and records which router reports each expiry to map the path.
6. Protocol and Checksum (protocol / check)
protocol tells the kernel what is inside this IP packet.
- IPPROTO_TCP (6): TCP
- IPPROTO_UDP (17): UDP
- IPPROTO_ICMP (1): ICMP
- IPPROTO_ICMPV6 (58): ICMPv6
- ...and dozens of others, defined in include/uapi/linux/in.h.
check is the checksum. Note that it only verifies the header. If even a single bit is flipped during transmission, the checksum won't match, and the kernel will drop the packet directly. There's a detail here: because the TTL changes with every hop, routers must recalculate the checksum when forwarding.
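As an illustration of the header checksum and the per-hop update, here is a small user-space sketch of the RFC 1071 one's-complement sum. All function names are hypothetical. For clarity, the forwarding helper recomputes the checksum from scratch after decrementing the TTL; real forwarding code can update the checksum incrementally instead, since only one field changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over `len` header bytes (len must be even). */
static uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    assert(hdr != NULL && len % 2 == 0);

    uint32_t sum = 0;
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}

/* A header verifies when summing it, checksum field included, yields 0. */
static int ipv4_checksum_ok(const uint8_t *hdr, size_t len)
{
    return ipv4_checksum(hdr, len) == 0;
}

/* Zero the check field (bytes 10..11), compute, and store big-endian. */
static void ipv4_set_checksum(uint8_t *hdr, size_t len)
{
    hdr[10] = 0;
    hdr[11] = 0;
    uint16_t c = ipv4_checksum(hdr, len);
    hdr[10] = (uint8_t)(c >> 8);
    hdr[11] = (uint8_t)(c & 0xFF);
}

/*
 * One forwarding hop: decrement the TTL (byte 8) and refresh the checksum.
 * Returns 0 if the TTL is exhausted and the packet must be dropped.
 */
static int ipv4_forward_hop(uint8_t *hdr, size_t len)
{
    if (hdr[8] <= 1)
        return 0;
    hdr[8]--;
    ipv4_set_checksum(hdr, len);
    return 1;
}
```

Because the checksum covers only the header, a corrupted payload is not detected here; that job falls to the TCP or UDP checksum.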
7. Addresses (saddr / daddr)
32-bit source and destination addresses. This is the most fundamental routing basis at the network layer.
Protocol Registration: How Does the Kernel Claim IPv4 Packets?
Now that we've seen what the header looks like, we need to take a step back and ask a more fundamental question:
When a NIC receives a data frame, pulls it up from the DMA buffer, and hands it to the kernel, how does the kernel know this is an IPv4 packet and not ARP or IPv6?
The answer lies in the Ethernet header's ethertype field. IPv4's type is ETH_P_IP.
The kernel needs a mechanism to bind the number ETH_P_IP to the "IPv4 handler function." This is exactly what struct packet_type does.
Here is the real "registration" code, defined in net/ipv4/af_inet.c:
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP), // 0x0800
    .func = ip_rcv,                // handler function pointer
};
How does this take effect? When the IPv4 stack initializes:
static int __init inet_init(void)
{
    ...
    dev_add_pack(&ip_packet_type);
    ...
}
The inet_init function does something extremely important: it hangs ip_packet_type onto the kernel's global protocol processing list (typically the ptype_base hash table).
From then on, for every packet coming in from the NIC, the kernel will take a look at its Ethernet type.
- If it is ETH_P_IP, the kernel will find ip_packet_type and then call ip_packet_type.func, which is ip_rcv.
This is the starting point of the IPv4 story. ip_rcv is like the "customs" of the IPv4 kingdom — all packets must pass through this checkpoint first. It is responsible for verification (is the Version 4? Is the Checksum correct?), distribution (is it for me? Does it need to be forwarded?), and only if everything goes smoothly will it hand the packet to the next stop.
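The registration and dispatch idea can be mimicked in user space. The sketch below is a toy model, not kernel code: struct proto_entry stands in for struct packet_type, proto_register for dev_add_pack, and toy_ip_rcv for ip_rcv. All of these names are hypothetical, a flat array replaces the kernel's hash table, and the frame is assumed to be untagged Ethernet (no VLAN header).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4 0x0800u
#define ETHERTYPE_ARP  0x0806u
#define MAX_PROTOS     8

typedef int (*rx_handler)(const uint8_t *payload, size_t len);

/* Stand-in for struct packet_type: an EtherType and its handler. */
struct proto_entry {
    uint16_t   type;
    rx_handler func;
};

static struct proto_entry proto_table[MAX_PROTOS];
static size_t proto_count;

/* Analogue of dev_add_pack(): bind one EtherType to a handler. */
static int proto_register(uint16_t type, rx_handler func)
{
    assert(func != NULL);
    if (proto_count == MAX_PROTOS)
        return -1;
    proto_table[proto_count].type = type;
    proto_table[proto_count].func = func;
    proto_count++;
    return 0;
}

/*
 * Read the EtherType from bytes 12..13 of the frame and call the matching
 * handler on the bytes after the 14-byte Ethernet header. Returns the
 * handler's result, or -1 if the frame is too short or nobody claimed it.
 */
static int frame_deliver(const uint8_t *frame, size_t len)
{
    if (len < 14)
        return -1;

    uint16_t type = (uint16_t)((frame[12] << 8) | frame[13]);
    for (size_t i = 0; i < proto_count; i++)
        if (proto_table[i].type == type)
            return proto_table[i].func(frame + 14, len - 14);
    return -1;
}

/* Toy "customs check": accept only packets whose version nibble is 4. */
static int toy_ip_rcv(const uint8_t *payload, size_t len)
{
    return (len >= 20 && (payload[0] >> 4) == 4) ? 1 : 0;
}
```

Calling proto_register(ETHERTYPE_IPV4, toy_ip_rcv) once at startup plays the role of inet_init; after that, every frame passed to frame_deliver with EtherType 0x0800 lands in toy_ip_rcv, while an ARP frame (0x0806) goes unclaimed until someone registers a handler for it.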
In the next section, we will stand at the ip_rcv entry point and see how the IPv4 receive path actually runs.