4.6 Fragmentation
In the previous section, we mentioned that the packet ultimately leaves the local machine, crossing the point of no return via ip_local_out().
But a practical problem stands in our way: the road is too narrow.
The Ethernet MTU (Maximum Transmission Unit) is typically 1500 bytes. While some NICs support Jumbo Frames up to 9K, that's the exception rather than the rule. If we need to send a 4000-byte UDP packet, or a massive TCP packet that wasn't constrained by MSS negotiation, the physical device simply can't swallow it.
At this point, there are only two choices:
- Don't send it so large: send an ICMP message back to the packet's source saying "this packet is too big for the path" (the feedback that drives Path MTU Discovery).
- Chop it up and send: Slice the large packet into smaller fragments that fit the MTU, send them on their way, and reassemble them at the destination.
In this section, we look at how the kernel handles the second scenario—fragmentation.
The task of handling these fragments on the receive path is called reassembly, which we cover in the next section. For now, let's focus on ip_fragment() on the send path.
To fragment or not to fragment, that is the question
ip_fragment() is not a function we can just call on a whim. Before cutting into a packet, it must first answer a serious question: is this packet allowed to be fragmented?
In the IP header, there is a flag called DF (Don't Fragment). If this bit is set to 1, it means the sender is assertive: "I either go through in one piece, or I don't go through at all. Don't chop me up."
The kernel's implementation is very straightforward. At the beginning of ip_fragment(), we see this logic:
int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
{
        struct rtable *rt = skb_rtable(skb);
        struct net_device *dev = rt->dst.dev;
        struct iphdr *iph = ip_hdr(skb);
        ...
        // check the DF flag, and the frag_max_size limit if one was set
        if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->local_df) ||
                     (IPCB(skb)->frag_max_size &&
                      IPCB(skb)->frag_max_size > dst_mtu(&rt->dst)))) {
                IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
                // send ICMP "Destination Unreachable: Fragmentation Needed"
                icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                          htonl(ip_skb_dst_mtu(skb)));
                kfree_skb(skb);
                return -EMSGSIZE;
        }
        ...
}
The logic here is brutal: if the IP_DF flag is set, or if the packet clearly exceeds certain path limits, the kernel won't fragment it for us. Instead, it directly calls icmp_send() to throw back an ICMP_FRAG_NEEDED error, and then drops the packet (kfree_skb).
This is exactly why, when we accidentally block the ICMP protocol while configuring a firewall or VPN, large packets simply refuse to go through—the kernel tries to tell us the path is too narrow, but we've taped its mouth shut.
Only when the packet is allowed to be fragmented does the code proceed to the actual slicing phase.
Two paths of fragmentation: fast and slow
When it comes to actual fragmentation, the kernel faces two distinctly different scenarios.
If the packet already has a "slicing manifest" reserved at the transport layer (like UDP) or during local generation—meaning the SKB's frag_list is not empty—then the kernel's job is simple: take each item from the manifest, attach an IP header, package it, and send it out.
This is the fast path.
Conversely, if we only have a single, massive, contiguous SKB with no pre-sliced fragment list, the kernel has to do the heavy lifting itself: allocate new memory, copy the data piece by piece, and calculate offsets.
This is the slow path.
Let's look at the more relaxed fast path first.
Fast path: hitching the trailers
The core of the fast path lies in skb_has_frag_list(skb). If this function returns true, it means a string of pre-sliced data fragments is hanging off the SKB's skb_shinfo(skb)->frag_list. The kernel simply treats these fragments as independent "trailers" and slaps a new label (IP header) on each one.
The first step is to organize the "lead vehicle."
The original SKB becomes the first fragment. We need to correct its length and header information:
hlen = iph->ihl * 4;    // IP header length
...
if (skb_has_frag_list(skb)) {
        struct sk_buff *frag, *frag2;
        int first_len = skb_pagelen(skb);
        // detach the frag_list: we are about to break it apart
        skb_frag_list_init(skb);
        // fix up the main SKB's length: it now carries only the first fragment's data
        skb->data_len = first_len - skb_headlen(skb);
        skb->len = first_len;
        iph->tot_len = htons(first_len);
        // set the flag: more fragments follow (IP_MF)
        iph->frag_off = htons(IP_MF);
        // the header changed, so the checksum must be recomputed
        ip_send_check(iph);
The IP_MF (More Fragments) flag here is crucial. It tells the receiver: "Don't rush, this isn't over, there's more cargo coming." Only the last fragment has this bit set to 0.
Next, the kernel enters an infinite loop to process each trailer in frag_list:
for (;;) {
        if (frag) {
                frag->ip_summed = CHECKSUM_NONE;
                skb_reset_transport_header(frag);
                // make room for this fragment's IP header:
                // skb->data points at the transport header, push it back hlen bytes
                __skb_push(frag, hlen);
                // reset the network-header pointer
                skb_reset_network_header(frag);
                // copy the original IP header over
                memcpy(skb_network_header(frag), iph, hlen);
                // grab the new fragment's IP header and fix its total length
                iph = ip_hdr(frag);
                iph->tot_len = htons(frag->len);
                // copy metadata (priority, marks, etc.)
                ip_copy_metadata(frag, skb);
There's an interesting detail here: offset calculation.
The IP protocol specifies that the lower 13 bits of the frag_off field represent the offset, and the unit is not bytes, but 8-byte blocks (64 bits). This means the data length of all fragments (except the last one) must be a multiple of 8.
                // options need fixing only once: non-copied options are
                // NOOPed out here, and later fragments copy this header
                if (offset == 0)
                        ip_options_fragment(frag);
                // advance the running offset (in bytes)
                offset += skb->len - hlen;
                // convert to 8-byte units and write it into the header
                iph->frag_off = htons(offset >> 3);
                // as long as another fragment follows (this is not the
                // last one), the MF flag must be set
                if (frag->next != NULL)
                        iph->frag_off |= htons(IP_MF);
                ip_send_check(iph);
        }
Here, ip_options_fragment() does something clever: IP options (like Record Route) only need to appear in the first fragment. If subsequent fragments don't replace these options with IPOPT_NOOP, it both wastes bandwidth and increases security risks.
Finally, these sliced fragments are sent out the door:
        err = output(skb);
        if (!err)
                IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
        if (err || !frag)
                break;
        // move on to the next fragment
        skb = frag;
        frag = skb->next;
        skb->next = NULL;
}
The fast path is fast because it doesn't need to copy data. The data already resides in the memory pages corresponding to frag_list. The SKB simply jumps between pointers, adjusting header metadata.
Slow path: the heavy cleaver
If the SKB lacks a frag_list, the kernel is holding a genuine behemoth. There's no way to cut corners here; we have to bite the bullet and slice it up.
The slow path code logic works like this: first calculate how much data is left to send, then slice it off piece by piece in a while (left > 0) loop.
iph = ip_hdr(skb);
left = skb->len - hlen;         // payload bytes still to send
while (left > 0) {
        len = left;
        // this piece must not exceed the MTU
        if (len > mtu)
                len = mtu;
        // hard rule: every piece except the last must be a multiple of
        // 8 bytes; if this is not the last piece, trim off the remainder
        if (len < left) {
                len &= ~7;
        }
This len &= ~7 (binary ...11111000) is a classic operation in network protocols. It forcefully ensures that fragment boundaries align to an 8-byte granularity.
Next comes the most expensive operation: allocating a new SKB and copying the data.
        // allocate a new SKB with room for the IP header and the data
        if ((skb2 = alloc_skb(len + hlen + ll_rs, GFP_ATOMIC)) == NULL) {
                NETDEBUG(KERN_INFO "IP: frag: no memory for new fragment!\n");
                err = -ENOMEM;
                goto fail;
        }
        // copy metadata
        ip_copy_metadata(skb2, skb);
        skb_reserve(skb2, ll_rs);
        skb_put(skb2, len + hlen);
        skb_reset_network_header(skb2);
        skb2->transport_header = skb2->network_header + hlen;
        // if the original SKB has an owner (say, a socket), charge the
        // new SKB's memory to that owner's account
        if (skb->sk)
                skb_set_owner_w(skb2, skb->sk);
        // copy the IP header
        skb_copy_from_linear_data(skb, skb_network_header(skb2), hlen);
        // copy the data slice (the slowest part)
        if (skb_copy_bits(skb, ptr, skb_transport_header(skb2), len))
                BUG();
There are a few noteworthy engineering details in this code:
- GFP_ATOMIC allocation: we might be holding a lock at this point, so we can't sleep. If the allocation fails, the entire fragmentation operation fails, and the code jumps to the fail label to clean up resources.
- skb_set_owner_w(): a crucial accounting operation. If this fragment is being sent on behalf of a specific socket, the memory usage of the new SKB must be charged to that socket's overhead. Otherwise, a user could exhaust system memory by sending massive amounts of data without being constrained by socket buffer limits.
Once the data copy is complete, the remaining operations are similar to the fast path: setting the offset, handling the MF flag, calculating the checksum, and sending.
        iph = ip_hdr(skb2);
        iph->frag_off = htons((offset >> 3));
        if (offset == 0)
                ip_options_fragment(skb2);      // again, options only in the first fragment
        if (left > 0 || not_last_frag)
                iph->frag_off |= htons(IP_MF);  // anything left to send means MF
        iph->tot_len = htons(len + hlen);
        ip_send_check(iph);
        err = output(skb2);
        if (err)
                goto fail;
        IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
        left -= len;
        ptr += len;
        offset += len;
}       // end while
There's a subtle condition here: if (left > 0 || not_last_frag). The left > 0 part is easy to understand: we haven't finished slicing yet. But why also check not_last_frag? Because the packet being fragmented may itself already be a fragment of a larger datagram; a router, for instance, may need to re-fragment a fragment it is forwarding. not_last_frag records whether the original packet's own MF bit was set. If it was, then even the last piece we produce here is not the final fragment of the overall datagram, and it too must carry MF.
Wrapping up
Whether on the fast path or the slow path, once all fragments have been sent successfully, the kernel updates the success counter and returns:
consume_skb(skb);       // free the original oversized SKB
IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
return err;
If anything goes wrong along the way (such as a memory allocation failure in the slow path, or the send function returning an error), the code jumps to the fail label, cleans up all generated but unsent fragment SKBs, and updates the failure counter IPSTATS_MIB_FRAGFAILS.
Summary and foreshadowing
At this point, we've clearly explained how the IP layer "dismembers" an oversized packet into pieces.
In this section, we saw that the IP layer is very cautious when handling fragmentation: it first checks the DF flag and gives up immediately if fragmentation is forbidden; it takes a shortcut via frag_list (the fast path) to avoid memory copies; but when necessary, it relies on heavy operations like alloc_skb and skb_copy_bits (the slow path) to ensure the data gets sent out.
But this leaves a massive cliffhanger:
Since we chopped the packet up and sent it out, how does the peer know how to piece this messy pile of fragments back together when it receives them? What if a fragment gets lost in the middle? What if they arrive out of order? If someone intentionally sends malicious fragments to attack the system, how does the kernel defend against it?
These are exactly the problems that reassembly, the inverse operation of ip_fragment(), has to solve. In the next section, we will enter the domain of ip_defrag() and see how the kernel turns fragments back into complete packets.