4.7 Reassembly: Putting the Shattered Mirror Back Together
In the previous section, we sliced large packets into fragments like a chef chopping vegetables. That was satisfying, but it left behind a massive mess.
Imagine you are the kernel on the receiving end. What you just saw wasn't a neat sequence of data packets, but a chaotic flurry of fragments arriving out of order: some arrive head-first, some tail-first, some separated by hundreds of milliseconds. Worse still, this pile of fragments might be laced with "malicious fragments" deliberately thrown in by an attacker—like bogus puzzle pieces designed to confuse you.
If you were the kernel, how would you know which fragments belong to the same packet? If a piece goes missing in the middle, how long do you wait before giving up? If someone sends a flood of overlapping fragments to mess with you, how do you defend against it?
This brings us to the topic of this section—reassembly. It is the inverse operation of ip_fragment(), and it is one of the trickiest and most vulnerability-prone parts of the network stack.
Checking for Fragments: ip_is_fragment()
Reassembly isn't a free lunch—it's an expensive CPU and memory operation. So before the kernel gets down to business, it does something lightweight first: checking whether this packet is actually a fragment.
This process happens inside ip_local_deliver()—when the packet has already passed routing lookup and is confirmed to be destined for the local machine.
int ip_local_deliver(struct sk_buff *skb)
{
/*
* Reassemble IP fragments.
*/
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
return 0;
}
return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
ip_local_deliver_finish);
}
The ip_is_fragment() called here contains an extremely streamlined check. It stares at the frag_off field in the IPv4 header. You might think, "How hard can this be? Just check the MF flag."
Your intuition is wrong again.
What about the last fragment? Its MF bit is 0, indicating nothing follows, but it's still a fragment. What about a middle fragment? MF is 1, and the offset isn't 0. Only the original packet (or an unfragmented packet) has MF as 0 and an offset of 0.
So the kernel's check logic looks like this:
static inline bool ip_is_fragment(const struct iphdr *iph)
{
return (iph->frag_off & htons(IP_MF | IP_OFFSET)) != 0;
}
As long as either the MF flag or the offset in the frag_off field is non-zero, it's a fragment.
This means:
- First fragment: offset is 0, but MF is 1 → returns true.
- Middle fragment: offset is non-zero, MF is 1 → returns true.
- Last fragment: offset is non-zero, but MF is 0 → returns true.
Anything satisfying one of these three conditions gets thrown into the meat grinder of ip_defrag().
Finding Where It Belongs: Four-Dimensional Positioning
Once confirmed as a fragment, the kernel needs to find it a home.
We can't just toss all fragments into a single global linked list—that would be way too inefficient. The kernel uses a hash table, and the hash calculation relies on a four-dimensional coordinate. This coordinate must be completely identical across all fragments to prove they belong to the "same family":
- Identification (id): As we mentioned in the previous section on fragmentation, this is the shared ID number of a packet after being sliced up.
- Source address (saddr): Who sent it.
- Destination address (daddr): Who it's meant for.
- Protocol: the upper-layer protocol number (e.g., TCP or UDP).
The kernel uses this 4-tuple to call the hash function ipqhashfn, looking up the corresponding ipq (IP queue) structure in the global hash table. This structure serves as the temporary household registration record for this family in the kernel:
struct ipq {
	struct inet_frag_queue q;
	u32 user;                 // who is using this queue (local delivery? netfilter?)
	__be32 saddr;             // source IP
	__be32 daddr;             // destination IP
	__be16 id;                // the shared identification field (the "ID card")
	u8 protocol;              // protocol number
	u8 ecn;                   // Explicit Congestion Notification state
	int iif;                  // inbound interface index
	unsigned int rid;         // route ID
	struct inet_peer *peer;   // peer information
};
Here's an interesting design detail: IPv4's reassembly logic is actually shared with IPv6. Look at this struct ipq—it internally embeds a struct inet_frag_queue q. This generic structure and its related low-level methods (like inet_frag_find and inet_frag_evictor) aren't IPv4-specific; IPv6 uses them too. This shows that the pain of fragmentation and reassembly is protocol-agnostic, and kernel designers abstracted this pain into a common framework.
The Reassembly Entry Point: ip_defrag()
When we actually enter ip_defrag(), the first thing we do isn't work—it's cleaning house.
int ip_defrag(struct sk_buff *skb, u32 user)
{
struct ipq *qp;
struct net *net;
net = skb->dev ? dev_net(skb->dev) : dev_net(skb_dst(skb)->dev);
IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
/* Start by cleaning up the memory. */
ip_evictor(net);
This line, ip_evictor(net), is crucial. Before starting a new reassembly task, it checks memory pressure. If there are too many fragment queues in the system and they consume more memory than the threshold, it will ruthlessly evict the oldest queues (inet_frag_evictor). This means that if the network is too congested or memory is too tight, your packet might get swept out the door before reassembly even begins.
After cleaning house, we see the classic "find or create" logic:
/* Lookup (or create) queue header */
if ((qp = ip_find(net, ip_hdr(skb), user)) != NULL) {
int ret;
spin_lock(&qp->q.lock);
ret = ip_frag_queue(qp, skb);
spin_unlock(&qp->q.lock);
ipq_put(qp);
return ret;
}
IP_INC_STATS_BH(net, IPSTATS_MIB_REASMFAILS);
kfree_skb(skb);
return -ENOMEM;
ip_find() looks up the hash table based on the four-dimensional coordinate we just discussed. If found, it returns the existing qp; if not found, it creates and initializes a new ipq structure. If even creation fails (out of memory), it simply updates the failure counter IPSTATS_MIB_REASMFAILS and drops the packet.
Inserting Fragments: The Art of Out-of-Order and Overlap
After obtaining the queue qp, the main event arrives—ip_frag_queue(). This function's job is to insert the newly arrived SKB into the correct position in the queue.
Note that the arrival order of fragments is completely uncontrollable. You might receive the third fragment before the first. So the ipq->q.fragments linked list must be strictly sorted by offset.
Before inserting, the kernel also has to solve a very tricky problem: overlap.
This can happen: for example, the first fragment of a packet is sent once, a router thinks it's lost, the source retransmits the first fragment (or an attacker in the middle deliberately sends overlapping packets). The kernel must handle this overlap precisely, discarding redundant data to prevent data corruption or buffer overflow exploits. This logic is extremely lengthy in the source code, full of boundary checks. Here, we'll focus on how it determines the position and insertion logic.
First, the kernel calculates the end position of the current fragment:
/* Determine the position of this fragment. */
end = offset + skb->len - ihl;
err = -EINVAL;
/* Is this the final fragment? */
if ((flags & IP_MF) == 0) {
/* If we already have some bits beyond end
* or have different end, the segment is corrupted.
*/
if (end < qp->q.len ||
((qp->q.last_in & INET_FRAG_LAST_IN) && end != qp->q.len))
goto err;
qp->q.last_in |= INET_FRAG_LAST_IN;
qp->q.len = end;
} else {
...
}
Here is a key variable, qp->q.len, which records the total length of the entire original packet.
- When we receive a fragment with MF of 0 (the last fragment), we know the total length of the packet is end.
- If the total length implied by previously received fragments contradicts the current one, or if a fragment extending beyond end already exists, the packet is corrupted and we drop it directly.
Next comes the linked list traversal to find the insertion position. This is textbook-grade linked list insertion logic:
prev = NULL;
for (next = qp->q.fragments; next != NULL; next = next->next) {
if (FRAG_CB(next)->offset >= offset)
break; /* bingo! */
prev = next;
}
FRAG_CB is a macro used to retrieve the offset information stored inside the SKB from the SKB's control block (cb). It's a very small helper macro, but absolutely critical:
#define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb))
After finding the position (prev and next), we can roll up our sleeves and insert it into the linked list:
FRAG_CB(skb)->offset = offset;
/* Insert this fragment in the chain of fragments. */
skb->next = next;
if (!next)
qp->q.fragments_tail = skb;
if (prev)
prev->next = skb;
else
qp->q.fragments = skb;
...
qp->q.meat += skb->len;
The qp->q.meat here is a vividly named variable: it records the total length of valid payload ("meat") collected so far.
Whenever a new fragment is successfully inserted, meat increases a bit.
The Happy Reunion: ip_frag_reasm()
When do we consider the puzzle complete?
Two conditions must be met simultaneously:
- We have received the last fragment (the INET_FRAG_LAST_IN flag is set, meaning we know the exact total length len).
- The valid data length collected equals that total length (qp->q.meat == qp->q.len).
When both conditions are met, the kernel knows the puzzle is complete. It's time to glue them together into a single, complete SKB.
if (qp->q.last_in == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
qp->q.meat == qp->q.len) {
unsigned long orefdst = skb->_skb_refdst;
skb->_skb_refdst = 0UL;
err = ip_frag_reasm(qp, prev, dev);
skb->_skb_refdst = orefdst;
return err;
}
Upon entering ip_frag_reasm(), the kernel faces its biggest challenge: memory allocation.
It needs a new buffer to hold the complete IP packet. The size of this buffer is ihlen + qp->q.len.
/* Allocate a new buffer for the datagram. */
ihlen = ip_hdrlen(head);
len = ihlen + qp->q.len;
err = -E2BIG;
if (len > 65535)
goto out_oversize;
...
There is a hard check here: len > 65535. The IPv4 total length field is only 16 bits, with a maximum value of 65535. If the reassembled fragments exceed this number, something is definitely wrong (possibly malicious fragments), and it must be dropped.
If the length is legitimate, the kernel uses skb_copy_bits() to copy the payload of every fragment into the new, larger SKB, adjusts the IP header, clears all fragmentation flags, and turns it into a complete packet that looks as if it was never cut up in the first place.
At this point, the shattered mirror has finally been put back together, and it is sent to the transport layer to continue its journey.
The Dark Side of Reassembly: A Ticking Time Bomb
Since we mentioned ip_defrag(), we have to talk about a mechanism that has given countless network engineers and kernel developers headaches: timeouts.
Reassembly isn't indefinite. If someone sends you the first fragment and then goes silent, not sending the rest, should your kernel keep occupying memory to hold onto this incomplete packet? Of course not.
Every ipq queue has a timer. If reassembly isn't completed within this time (30 seconds by default, configurable via /proc/sys/net/ipv4/ipfrag_time), ip_expire() is triggered: it sends an ICMP Time Exceeded message back to the sender (provided the first fragment has arrived, so a valid reply can be built) and frees the entire queue.
The existence of this mechanism also guards against a classic DoS vector, the Teardrop attack: the attacker sends carefully crafted fragments with severely overlapping heads and tails, tricking the overlap calculation into producing a bogus (even negative) length and crashing the kernel. Although modern kernels have patched the known overlap-calculation vulnerabilities, this timeout mechanism remains the last line of defense against resource-exhaustion attacks.