4.4 When IP Options Wake Up in a Packet
In the previous section, we followed a multicast packet through its entire lifecycle in the kernel, entering from ip_rcv, either being consumed locally or forwarded onward. That path was clean, like a highway with no obstacles.
But the real-world network isn't that tidy.
The IPv4 header hides an ancient and often headache-inducing mechanism: IP Options. While packets carrying IP options are exceedingly rare in modern Internet traffic—because they slow down router forwarding—they do pop up in certain scenarios, such as when you want to trace the path a packet takes (ping -R) or perform precise timestamp recording.
What makes things more complicated is that processing IP options requires a completely different code path. Once a packet with options arrives, the kernel can't just fast-forward it as usual; it must stop, parse, process, and even modify the header.
Since we mentioned them at the end of the previous section, let's settle this piece of historical baggage thoroughly in this section.
4.4.1 Common IP Option Types
Before diving into the kernel code, let's survey the "enemy." RFC 791 and a few later RFCs define a number of IP options. While many are now "fossils," the kernel still needs to handle them.
You can think of these options as special delivery notes stuck on a package: most packages don't need notes and are shipped with a standard label; but for special packages, the sender might write "must go through relay station A," "please record arrival time," or "no relays allowed" in the notes section.
The label has limited space (the IPv4 header is at most 60 bytes), so the notes naturally have an upper limit. Here are the most common types:
Record Route
This is the most intuitive one: make each router along the way fill in its own IP address.
This is what ping -R uses. If you type that command in your terminal, the outgoing ICMP request packet will carry an IPOPT_RR option. Each router along the path (if it supports the option and hasn't disabled it for security reasons) will write its egress IP address into this option's buffer.
Sounds great? Reality is harsh.
The IPv4 header has at most 40 bytes available for options (since the maximum header length is 60 bytes, minus the fixed 20 bytes). After subtracting the option's own metadata (type, length, pointer), and considering that each IP address takes 4 bytes, you can record at most 9 IP addresses. If your network has more than 9 hops, subsequent routers won't have room to fill in their addresses—at which point they typically just forward the packet and ignore the option.
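The arithmetic behind the 9-address limit can be written out as a small userspace sketch. The macro and function names here are illustrative, not kernel identifiers.

```c
#include <stddef.h>

/* Illustrative constants, not kernel names. */
#define IPV4_MAX_HEADER_LEN  60 /* IHL is a 4-bit word count: 15 * 4 bytes */
#define IPV4_BASE_HEADER_LEN 20 /* fixed part of the header */
#define RR_FIXED_LEN          3 /* type + length + pointer */
#define IPV4_ADDR_LEN         4

/* Number of IPv4 addresses that fit in a Record Route option that
 * occupies opt_space bytes of the header's option area. */
static size_t rr_max_addresses(size_t opt_space)
{
    if (opt_space < RR_FIXED_LEN)
        return 0;
    return (opt_space - RR_FIXED_LEN) / IPV4_ADDR_LEN;
}
```

Passing the full option area, IPV4_MAX_HEADER_LEN - IPV4_BASE_HEADER_LEN (40 bytes), gives (40 - 3) / 4 = 9 with one byte left over.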
Furthermore, for security reasons (to prevent attackers from probing network topology), many modern routers will silently drop or ignore packets with Record Route by default. So if you eagerly try ping -R and find that nothing was recorded, don't be surprised—even man ping specifically includes this line: "Many hosts ignore or discard this option."
Stream ID (IPOPT_SID)
This was used for the old SATNET network to carry a 16-bit stream identifier. It's basically unseen today—you can treat it as an "antique"—but it still occupies a place in the code.
Strict Source and Record Route (IPOPT_SSRR) and Loose Source and Record Route (IPOPT_LSRR)
These two are "forced navigation" options.
- SSRR (Strict): The packet must travel in the exact order of the addresses in the list, with no intermediate routers outside the list allowed.
- LSRR (Loose): The packet must pass through all routers in the list, but intermediate routers not on the list are allowed.
These sound powerful, but precisely because of security concerns (attackers can use them for IP spoofing or bypassing firewalls), most modern network devices simply disable both options.
Router Alert (IPOPT_RA)
This option is like sticking a little flag on the packet, telling routers along the way: "Hey, don't just blindly forward me—stop and look at my contents carefully!"
This is typically used for RSVP (Resource Reservation Protocol) or multicast protocols. Once a router sees IPOPT_RA, it knows this packet might contain control information it needs to process, so it can't just throw it onto the fast forwarding path.
4.4.2 How Does the Kernel Represent These Options?
At the kernel level, Linux doesn't directly compare raw IPv4 header byte streams. That would be too slow and error-prone. Instead, it parses these scattered bytes into a structure: struct ip_options.
You can think of this structure as translating the "delivery notes" into an internal work order. A courier might not be able to read sloppy handwritten notes, but they can read a standardized internal work order.
This structure is defined in include/net/inet_sock.h:
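struct ip_options {
    __be32 faddr;            /* saved first-hop address */
    __be32 nexthop;          /* next-hop address for LSRR/SSRR */
    unsigned char optlen;    /* total option length, at most 40 bytes */
    unsigned char srr;       /* offset of the source route (SRR) option */
    unsigned char rr;        /* offset of the Record Route (RR) option */
    unsigned char ts;        /* offset of the Timestamp option */
    /* flag bits, packed as bit fields into a single unsigned char */
    unsigned char is_strictroute:1, /* strict source route (SSRR) is in use */
                  srr_is_hit:1,     /* destination address hit this host (used by SRR) */
                  is_changed:1,     /* IP header was modified (checksum must be recomputed) */
                  rr_needaddr:1,    /* an RR address still needs to be recorded */
                  ts_needtime:1,    /* a timestamp still needs to be recorded */
                  ts_needaddr:1;    /* the address paired with the timestamp still needs to be recorded */
    unsigned char router_alert;     /* value of the Router Alert option */
    unsigned char cipso;            /* CIPSO security option */
    unsigned char __pad2;
    unsigned char __data[0];        /* flexible array holding the raw option data from user space */
};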
There are a few fields here that deserve special attention, as they directly affect the subsequent code logic:
- is_strictroute: If it's IPOPT_SSRR, this bit is 1; if it's IPOPT_LSRR, it's 0. This is the key to distinguishing between strict and loose modes.
- rr_needaddr: This is a "to-do" flag. When the kernel parses an IPOPT_RR option and finds there's still room to record an address, it sets this bit to 1. When forwarding or sending, seeing this flag tells the kernel "oh, I need to fill in the current interface's IP address."
- is_changed: This is a very important dirty bit. Once we modify anything in the IP options (such as filling in a new address), the IPv4 header checksum becomes invalid. This bit tells the kernel: "don't forget I tampered with it, so go recalculate the checksum!"
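To see why a dirty bit like is_changed is needed, recall that the IPv4 header checksum covers the option bytes too. The following is a minimal userspace sketch of that checksum (the standard ones'-complement sum), not the kernel's optimized routine; the function name is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement checksum over an IPv4 header of hdr_len bytes
 * (IHL * 4, so always even). The checksum field itself (bytes 10-11)
 * is treated as zero. The result is the value as it reads on the wire,
 * high byte first. */
static uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < hdr_len; i += 2) {
        if (i == 10)
            continue; /* skip the checksum field */
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Any byte written into the header, including an address recorded into an option, changes the sum, so whoever modifies the options has to arrange for the checksum to be recomputed before the packet moves on.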
4.4.3 Parsing Options: ip_options_compile()
Now let's enter the core of the receive path. When a packet carrying IP options arrives, ip_rcv eventually calls ip_rcv_options, and the brain of the latter is ip_options_compile().
As the name suggests, this function acts like a compiler: it reads the raw, byte-level IP option stream (source code) and "compiles" it into a kernel-friendly struct ip_options object (target code).
It's mainly called in two scenarios:
- Receive path: Parsing options from an incoming packet. Here, skb is not NULL, and the option data is taken from skb.
- Send path: Handling options set by the user via setsockopt(). Here, skb is NULL, and the option data is taken from opt->__data.
We'll focus on the receive path.
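For context on the send-path case, here is a hedged userspace sketch of how an application might hand a Record Route request to the kernel through setsockopt() with IP_OPTIONS. The 40-byte layout (one NOP followed by a 39-byte RR option) follows the convention ping -R is commonly described as using; the OPT_* macros and both function names are illustrative, and error handling is left to the caller.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Option type values from RFC 791; the macro names are local to this sketch. */
#define OPT_NOP        1
#define OPT_RR         7
#define RR_REQUEST_LEN 40 /* the whole IPv4 option area */

/* Fill buf with: NOP, then an empty Record Route option of 39 bytes. */
static void build_rr_request(uint8_t buf[RR_REQUEST_LEN])
{
    memset(buf, 0, RR_REQUEST_LEN);
    buf[0] = OPT_NOP;            /* one padding byte so the total is 40 */
    buf[1] = OPT_RR;             /* option type */
    buf[2] = RR_REQUEST_LEN - 1; /* option length: 39 = 3 + 9 * 4 */
    buf[3] = 4;                  /* pointer: first free address slot */
}

/* Attach the options to an existing IPv4 socket.
 * Returns the setsockopt() result (0 on success, -1 on error). */
static int request_record_route(int fd)
{
    uint8_t buf[RR_REQUEST_LEN];

    build_rr_request(buf);
    return setsockopt(fd, IPPROTO_IP, IP_OPTIONS, buf, sizeof(buf));
}
```

On this path the kernel has no skb yet, so the raw bytes end up in opt->__data, which matches the skb == NULL case described above.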
Initialization: Where Do We Start Reading?
The first thing the code does is determine the position of the "cursor."
int ip_options_compile(struct net *net, struct ip_options *opt, struct sk_buff *skb)
{
    ...
    unsigned char *optptr;
    unsigned char *iph;
    if (skb != NULL) {
        /* Receive path: the options immediately follow the fixed 20-byte IP header */
        optptr = (unsigned char *)&(ip_hdr(skb)[1]);
    } else {
        /* Send path: the options were copied from user space into __data */
        optptr = opt->__data;
    }
    /* Work backward to the IP header position (same code for both paths) */
    iph = optptr - sizeof(struct iphdr);
    ...
}
Note a detail here: &(ip_hdr(skb)[1]). This is a small pointer arithmetic trick. ip_hdr(skb) returns a struct iphdr *. Indexing it with [1] and taking the address yields a pointer advanced by sizeof(struct iphdr), which is 20 bytes. This is exactly where the options begin.
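The same idiom can be tried outside the kernel. In this sketch, struct ipv4_fixed_hdr is a hypothetical 20-byte stand-in for struct iphdr, and options_start() is an illustrative helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iphdr: exactly 20 bytes, alignment 1. */
struct ipv4_fixed_hdr {
    uint8_t bytes[20];
};

/* &hdr[1] is the address one whole struct past hdr, that is,
 * hdr + sizeof(struct ipv4_fixed_hdr) bytes: the first option byte. */
static const uint8_t *options_start(const struct ipv4_fixed_hdr *hdr)
{
    return (const uint8_t *)&hdr[1];
}
```

Given a packet buffer pkt, options_start((const struct ipv4_fixed_hdr *)pkt) evaluates to pkt + 20.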
Loop Parsing: Handling Single-Byte Options
Next comes a for loop that goes through the options one by one. The first byte of each option is the type code (Type).
The simplest types are single-byte options: IPOPT_END (End of Options List) and IPOPT_NOOP (No Operation).
- IPOPT_NOOP: This is just padding; when encountered, it's skipped directly, and the pointer moves forward by one.
- IPOPT_END: This marks the end of the option list. Per the RFC, no valid options can follow IPOPT_END. But for safety, the kernel fills all remaining space with IPOPT_END to prevent dirty data from being misinterpreted. If any byte actually had to be overwritten, the header was modified, so opt->is_changed is set to 1.
for (l = opt->optlen; l > 0; ) {
    switch (*optptr) {
    case IPOPT_END:
        /* Overwrite every remaining byte with END and mark the header as changed */
        for (optptr++, l--; l > 0; optptr++, l--) {
            if (*optptr != IPOPT_END) {
                *optptr = IPOPT_END;
                opt->is_changed = 1;
            }
        }
        goto eol; /* leave the loop */
    case IPOPT_NOOP:
        l--;
        optptr++;
        continue; /* move on to the next option */
    }
    ...
}
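The END handling above can be reproduced in a few lines of userspace C. This is a sketch under the assumption that the caller has already located the END byte; pad_after_end() and OPT_END are illustrative names, and the return value plays the role of opt->is_changed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_END 0 /* End of Option List, as in RFC 791 */

/* opts[pos] holds the END option. Overwrite every byte after it with END.
 * Returns true if at least one byte had to be changed. */
static bool pad_after_end(uint8_t *opts, size_t len, size_t pos)
{
    bool changed = false;

    for (size_t i = pos + 1; i < len; i++) {
        if (opts[i] != OPT_END) {
            opts[i] = OPT_END;
            changed = true;
        }
    }
    return changed;
}
```

Running it a second time on the same buffer returns false, which mirrors how the kernel only sets is_changed when stray bytes were actually present.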
Loop Parsing: Handling Multi-Byte Options
Aside from the two above, all other options are multi-byte. They contain at least: type (1 byte) + length (1 byte) + data.
There's a classic gotcha here: if an option claims a length greater than the remaining space, or if the length itself is less than 2, it's a bad packet. The kernel sets pp_ptr (pointing to the error location) and jumps to error handling, sending an ICMP "Parameter Problem" message back.
/* Read the option length (second byte) */
optlen = optptr[1];
/* Length sanity check: must be >= 2 and must not exceed the remaining space */
if (optlen < 2 || optlen > l) {
    pp_ptr = optptr;
    goto error;
}
switch (*optptr) {
...
}
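Putting the single-byte cases and the length check together, a userspace option walker might look like the sketch below. It only validates framing and does not interpret any option; find_malformed_option() and the OPT_* macros are illustrative names, and the returned offset plays the role that pp_ptr plays in the kernel.

```c
#include <stdint.h>

#define OPT_END  0
#define OPT_NOOP 1

/* Walk an option area of len bytes.
 * Returns -1 if every option is well framed, otherwise the offset of the
 * first option whose length byte is missing, below 2, or past the end. */
static int find_malformed_option(const uint8_t *opts, int len)
{
    int off = 0;

    while (off < len) {
        uint8_t type = opts[off];
        int remaining = len - off;
        int optlen;

        if (type == OPT_END)
            break;
        if (type == OPT_NOOP) {
            off++;
            continue;
        }
        if (remaining < 2)
            return off; /* no room for a length byte */
        optlen = opts[off + 1];
        if (optlen < 2 || optlen > remaining)
            return off;
        off += optlen;
    }
    return -1;
}
```

The optlen < 2 test is what keeps the loop from stalling: a multi-byte option with a length of 0 would otherwise never advance the offset.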
Case Study: Handling Record Route (IPOPT_RR)
Now for the main event. Let's see how the kernel handles the Record Route option sent by ping -R.
Structure Review: The RR option's structure is [Type (1) | Len (1) | Ptr (1) | IP1 (4) | IP2 (4) | ...].
- Ptr: This is an offset pointer pointing to the next position where an IP address can be filled in.
The code logic is as follows:
- Length check: The option must be at least 3 bytes (Type, Len, Ptr).
- Pointer check: The Ptr value must be at least 4 (because the first 3 bytes are the header, and the data area starts at byte 4).
- Overflow check: Ptr + 3 must not exceed the total length (otherwise the pointer goes out of bounds).
- Fill in address: If there's still space, the kernel copies the current egress address (obtained via spec_dst_fill) to the position pointed to by Ptr. Note: The destination address calculation in memcpy uses optptr[optptr[2]-1], where the -1 is because the Ptr in the option counts from 1 (pointing to the Type field as 1), while C array indices start at 0.
- Advance pointer: After filling in, Ptr is incremented by 4 to point to the next empty slot.
- Set flag: opt->rr_needaddr = 1, indicating that processing needs to continue next time (if on the forwarding path).
case IPOPT_RR:
    /* Check the minimum length */
    if (optlen < 3) {
        pp_ptr = optptr + 1;
        goto error;
    }
    /* Check that the pointer is valid (Ptr >= 4) */
    if (optptr[2] < 4) {
        pp_ptr = optptr + 2;
        goto error;
    }
    if (optptr[2] <= optlen) {
        /* Check there is room for a 4-byte address (error if Ptr + 3 > Len) */
        if (optptr[2] + 3 > optlen) {
            pp_ptr = optptr + 2;
            goto error;
        }
        if (rt) { /* a routing entry exists */
            spec_dst_fill(&spec_dst, skb); /* obtain the egress address */
            /* Fill in the address: note the index arithmetic */
            memcpy(&optptr[optptr[2]-1], &spec_dst, 4);
            opt->is_changed = 1; /* header modified, checksum is stale */
        }
        /* Advance the pointer by 4 bytes */
        optptr[2] += 4;
        /* Set the flag: the forwarding path still has work to do */
        opt->rr_needaddr = 1;
    }
    /* Record the offset of the RR option within the IP header */
    opt->rr = optptr - iph;
    break;
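The same checks can be exercised in userspace. The sketch below records one address into a Record Route option and reports what happened. It simplifies the kernel logic in one respect: it always writes the address, whereas the kernel only writes when a routing entry is available. rr_record() and enum rr_result are illustrative names.

```c
#include <stdint.h>
#include <string.h>

enum rr_result { RR_RECORDED, RR_FULL, RR_MALFORMED };

/* opt points at the Type byte of a Record Route option whose framing
 * (length within the option area) was already validated by the caller.
 * addr is the 4-byte address to record, in network byte order. */
static enum rr_result rr_record(uint8_t *opt, const uint8_t addr[4])
{
    uint8_t len = opt[1];
    uint8_t ptr;

    if (len < 3 || len > 40)   /* 40 = size of the IPv4 option area */
        return RR_MALFORMED;
    ptr = opt[2];
    if (ptr < 4)               /* the data area starts at byte 4 (1-based) */
        return RR_MALFORMED;
    if (ptr > len)             /* no free slot left: leave the option alone */
        return RR_FULL;
    if (ptr + 3 > len)         /* a partial slot cannot hold an address */
        return RR_MALFORMED;

    memcpy(&opt[ptr - 1], addr, 4); /* Ptr is 1-based, C indices are 0-based */
    opt[2] = (uint8_t)(ptr + 4);    /* point at the next free slot */
    return RR_RECORDED;
}
```

With a 7-byte option (one slot), the first call records the address and moves Ptr from 4 to 8; a second call sees Ptr > Len and returns RR_FULL without touching the buffer.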
Case Study: Handling Timestamp (IPOPT_TIMESTAMP)
The timestamp option is a bit more complex than RR because it has several modes.
Its structure has an extra byte: [Type | Len | Ptr | Flags (Overflow:4 | Flag:4)].
- Overflow: The upper 4 bits, recording the number of times a record couldn't be made due to insufficient space.
- Flag: The lower 4 bits, defining three modes:
- 0 (IPOPT_TS_TSONLY): Record only timestamps, no IP addresses. Each hop takes 4 bytes.
- 1 (IPOPT_TS_TSANDADDR): Record both IP and timestamp. Each hop takes 8 bytes.
- 3 (IPOPT_TS_PRESPEC): Record timestamps only when the IP address matches a preset list.
The logic here also involves various boundary checks, then deciding based on the Flag whether to write only 4 bytes of time, or 8 bytes of "address + time."
case IPOPT_TIMESTAMP:
    /* ... similar length and pointer checks omitted ... */
    /* Extract the Flag (low 4 bits) */
    switch (optptr[3] & 0xF) {
    case IPOPT_TS_TSONLY:
        if (skb)
            timeptr = &optptr[optptr[2]-1];
        opt->ts_needtime = 1;
        optptr[2] += 4; /* advance the pointer by 4 bytes */
        break;
    case IPOPT_TS_TSANDADDR:
        /* Check that there is enough room (8 bytes) */
        if (optptr[2] + 7 > optptr[1]) { ... }
        if (rt) {
            spec_dst_fill(&spec_dst, skb);
            /* Fill in the IP address first */
            memcpy(&optptr[optptr[2]-1], &spec_dst, 4);
            /* timeptr points 4 bytes further on */
            timeptr = &optptr[optptr[2]+3];
        }
        opt->ts_needaddr = 1;
        opt->ts_needtime = 1;
        optptr[2] += 8; /* advance the pointer by 8 bytes */
        break;
    case IPOPT_TS_PRESPEC:
        /* ... prespecified-address logic ... */
        optptr[2] += 8;
        break;
    }
    ...
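The fourth byte of the Timestamp option packs two 4-bit fields, and the mode determines how far Ptr advances per hop. A small userspace sketch of that decoding follows; the struct, the TS_* macros, and the function names are illustrative, with the mode values taken from the list above.

```c
#include <stdint.h>

/* Mode values as listed above; the macro names are local to this sketch. */
#define TS_TSONLY    0
#define TS_TSANDADDR 1
#define TS_PRESPEC   3

struct ts_flags {
    uint8_t overflow; /* upper 4 bits: hops that could not record */
    uint8_t mode;     /* lower 4 bits: recording mode */
};

/* Split the Overflow/Flag byte (byte 4 of the option) into its halves. */
static struct ts_flags ts_decode_flags(uint8_t b)
{
    struct ts_flags f = { .overflow = (uint8_t)(b >> 4),
                          .mode = (uint8_t)(b & 0x0F) };
    return f;
}

/* Bytes consumed per hop for a given mode, or -1 for an unknown mode. */
static int ts_entry_size(uint8_t mode)
{
    switch (mode) {
    case TS_TSONLY:
        return 4; /* timestamp only */
    case TS_TSANDADDR:
    case TS_PRESPEC:
        return 8; /* address + timestamp */
    default:
        return -1;
    }
}
```

For example, a flags byte of 0x21 decodes to overflow = 2 and mode = 1, so each recording hop consumes 8 bytes.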
Security Check: Source Route Disabled
After parsing all options, the kernel checks a global policy: whether source routing is allowed.
Because SSRR and LSRR carry enormous security risks, administrators typically disable them via the sysctl net.ipv4.conf.all.accept_source_route. If an SRR option is found during parsing and the system configuration doesn't allow it, the packet is dropped immediately.
if (unlikely(opt->srr)) {
    struct in_device *in_dev = __in_dev_get_rcu(dev);
    if (in_dev) {
        if (!IN_DEV_SOURCE_ROUTE(in_dev)) {
            /* Forbidden by policy: drop */
            goto drop;
        }
    }
    /* Handle the source routing logic */
    if (ip_options_rcv_srr(skb))
        goto drop;
}
4.4.4 The Conflict Between Forwarding Path and Fragmentation
Think parsing is the end of it? No, there's another tricky part: fragmentation.
Suppose a large packet with a Record Route option (say, 4000 bytes) passes through a router and needs to be fragmented. The IP header (including options) gets copied to every fragment.
But! Some options should not be copied to all fragments. For example, with Record Route, you only want to record the list of routers once. If every fragment records it, not only is it a waste of space, but it also leads to logical confusion (who has the time to stamp every single fragment?).
The RFC specifies that the highest bit of the option type byte (Copied Flag) determines this:
- 1: Must be copied to all fragments.
- 0: Only copied to the first fragment.
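The option type byte is laid out by RFC 791 as a 1-bit copied flag, a 2-bit class, and a 5-bit number, so the flag can be read with a single mask. A minimal sketch with the type values written out as constants (the OPT_TYPE_* names and option_is_copied() are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Option type values; the macro names are local to this sketch. */
#define OPT_TYPE_RR        0x07 /*   7: Record Route */
#define OPT_TYPE_TIMESTAMP 0x44 /*  68: Timestamp */
#define OPT_TYPE_LSRR      0x83 /* 131: Loose Source and Record Route */
#define OPT_TYPE_SSRR      0x89 /* 137: Strict Source and Record Route */
#define OPT_TYPE_RA        0x94 /* 148: Router Alert */

/* True if the option must be carried in every fragment. */
static bool option_is_copied(uint8_t type)
{
    return (type & 0x80) != 0;
}
```

The source route options and Router Alert have the top bit set and therefore travel with every fragment, while 0x07 and 0x44 do not.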
Unfortunately, both IPOPT_RR and IPOPT_TIMESTAMP have a Copied Flag of 0. This means only the first fragment carries the complete options. Subsequent fragments not only lack these options, but the original space must be padded with IPOPT_NOOP to maintain header alignment.
This is exactly what ip_options_fragment() does.
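It is called by ip_fragment once, after the first fragment has been built, and it operates on the header that the remaining fragments inherit. It iterates through the option list and overwrites every option with Copied Flag = 0 with IPOPT_NOOP, so the first fragment keeps the complete options while later ones carry only the copied options.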
void ip_options_fragment(struct sk_buff *skb)
{
    unsigned char *optptr = skb_network_header(skb) + sizeof(struct iphdr);
    struct ip_options *opt = &(IPCB(skb)->opt);
    int l = opt->optlen;
    int optlen;
    while (l > 0) {
        switch (*optptr) {
        case IPOPT_END:
            return;
        case IPOPT_NOOP:
            l--;
            optptr++;
            continue;
        }
        optlen = optptr[1];
        /* Stop on a malformed length so the loop cannot stall or overrun */
        if (optlen < 2 || optlen > l)
            return;
        /* Check this option's Copied Flag (highest bit) */
        if (!IPOPT_COPIED(*optptr)) {
            /* Not to be copied: overwrite it with NOOP */
            memset(optptr, IPOPT_NOOP, optlen);
        }
        l -= optlen;
        optptr += optlen;
    }
    /* The options have been wiped, so the related flags must be cleared too */
    opt->ts = 0;
    opt->rr = 0;
    opt->rr_needaddr = 0;
    /* ... */
}
This is why, when packet capturing, you often see that apart from the first fragment, the headers of subsequent fragments are filled with NOP (0x01).
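The effect is easy to reproduce on a plain byte buffer. The following userspace sketch applies the same rule to an option area; strip_uncopied_options() and the OPT_* macros are illustrative names, and unlike the kernel function it has no ip_options bookkeeping to reset.

```c
#include <stdint.h>
#include <string.h>

#define OPT_END  0
#define OPT_NOOP 1

/* Overwrite every option whose copied flag (top bit of the type byte)
 * is clear with NOOP bytes, leaving copied options untouched.
 * Stops early at END or at a malformed length. */
static void strip_uncopied_options(uint8_t *opts, int len)
{
    int off = 0;

    while (off < len) {
        uint8_t type = opts[off];
        int remaining = len - off;
        int optlen;

        if (type == OPT_END)
            return;
        if (type == OPT_NOOP) {
            off++;
            continue;
        }
        if (remaining < 2)
            return;
        optlen = opts[off + 1];
        if (optlen < 2 || optlen > remaining)
            return;
        if (!(type & 0x80))
            memset(&opts[off], OPT_NOOP, (size_t)optlen);
        off += optlen;
    }
}
```

Given an option area holding a Router Alert (0x94) followed by a 7-byte Record Route, the Router Alert survives and the seven Record Route bytes all become 0x01, which is the pattern visible in captures of non-first fragments.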
4.4.5 Building and Sending: Doing It in Reverse
Everything above covered the receive path (parsing). What about the send path? For example, if you write a Raw Socket yourself and want to manually construct a packet with Record Route to send to the other side, the kernel needs to "compile" your struct ip_options back into a binary stream and stuff it into the IP header.
This is what ip_options_build() does. It's essentially the reverse operation of ip_options_compile:
- memcpy the contents of opt->__data into the IP header option area.
- If it's source routing (SRR), fill in the destination address.
- If it's RR or TS and data needs to be filled in, it's responsible for filling in the current source address or timestamp.
We won't paste the code line by line; the logic is straightforward. Because these writes modify the header, the checksum has to be recalculated before the packet leaves; on the send path that is the job of ip_send_check().
With this, we've clearly mapped out the ins and outs of IP options in the kernel.
The chain is now complete: compile-and-parse on receive (ip_options_compile), source route policy checks before forwarding, partial stripping during fragmentation (ip_options_fragment), and finally reverse building on send (ip_options_build).
Although IP options are an anomaly in modern networking and are even considered a performance killer, as systems engineers, we need to know how the kernel carefully handles these "legacy issues" when they appear. After all, you never know when some ancient protocol or some stubborn network engineer will throw a packet with options at you.
In the next section, we'll leave behind this complex option-handling logic and enter the final leg of the IPv4 journey: the send path. At that point, we'll no longer be passively receiving, but actively constructing packets and pushing them out onto the network.