6.7 The Router's "Whisper": ICMPv4 Redirect
In the previous section, we explored the precise static structure of the FIB. You can think of it as a meticulously drawn map hanging on the wall.
But real-world networks are dynamic. Sometimes, a router realizes its "map" isn't accurate enough, or worse—it notices a neighbor taking a detour.
In such cases, the router doesn't just stand by. It taps the neighbor on the shoulder and sends a message: "Hey, don't send it to me, it's faster to go that way directly."
This is exactly why the ICMPv4 Redirect mechanism exists.
Setting the Scene: An Awkward "Detour"
Before diving into the code, let's build an intuitive scenario. This is much easier to understand than a dry definition.
Imagine a simple local area network (192.168.2.0/24). Three machines are connected to the same switch:
- AMD Server: 192.168.2.200 (sender)
- Windows Server: 192.168.2.10 (default gateway)
- Laptop: 192.168.2.7 (final destination)
Under normal circumstances, when the AMD server pings the laptop (ping 192.168.2.7), it resolves the laptop's MAC address with ARP and sends the packet directly at Layer 2, with no router involved.
But suppose an administrator ran a very strange command on the AMD server:
ip route add 192.168.2.7 via 192.168.2.10
This command tells the kernel: "Packets destined for 192.168.2.7 must first be sent to 192.168.2.10 (the Windows server)."
An awkward situation then unfolds:
- The AMD server sends the packet intended for the laptop to the Windows server.
- The Windows server receives the packet, checks its routing table, and realizes: "Huh? This destination is on my own subnet. I can send it directly myself—why am I forwarding this?"
- The Windows server realizes this is a suboptimal path.
ICMP Redirect is designed to resolve exactly this kind of awkwardness.
The Windows server will forward the packet to the laptop (to keep traffic flowing), but at the same time, it sends an ICMP Redirect message back to the AMD server:
- Type: ICMP Redirect (Type 5)
- Code: ICMP_REDIR_HOST (Redirect Host), meaning "reroute for this specific host"
- New Gateway Address: 192.168.2.7 (telling the AMD server: "Just send it here directly next time")
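To make the three fields above concrete, here is a minimal user-space sketch of how such a message could be laid out on the wire, following the RFC 792 format: one byte of type, one byte of code, a 16-bit checksum, the 4-byte gateway address, then the beginning of the offending datagram. The helper names build_redirect and inet_checksum are hypothetical and exist only for this illustration; this is not how the kernel builds the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ICMP_REDIRECT   5   /* ICMP type: Redirect */
#define ICMP_REDIR_HOST 1   /* ICMP code: redirect for a single host */

/* RFC 1071 Internet checksum over len bytes, read as big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                        /* odd trailing byte, zero-padded */
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Hypothetical helper: write an ICMP Redirect (host) message into buf.
 * gw   - suggested gateway, four bytes in network order
 * orig - start of the offending datagram (IP header + first 8 payload bytes)
 * Returns the message length, or 0 if buf is too small.
 */
static size_t build_redirect(uint8_t *buf, size_t buflen, const uint8_t gw[4],
                             const uint8_t *orig, size_t orig_len)
{
    size_t total = 8 + orig_len;
    uint16_t csum;

    assert(buf && gw && orig);
    if (buflen < total)
        return 0;
    buf[0] = ICMP_REDIRECT;
    buf[1] = ICMP_REDIR_HOST;
    buf[2] = buf[3] = 0;                /* checksum placeholder */
    memcpy(buf + 4, gw, 4);             /* "New Gateway Address" field */
    memcpy(buf + 8, orig, orig_len);
    csum = inet_checksum(buf, total);
    buf[2] = (uint8_t)(csum >> 8);
    buf[3] = (uint8_t)(csum & 0xff);
    return total;
}
```

In the scenario above, gw would hold the bytes 192, 168, 2, 7. A receiver can validate the message by running the same checksum over the whole buffer and expecting zero.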
Initiating a Redirect: The Kernel's Decision Logic
The kernel doesn't send a Redirect every time it forwards a packet—otherwise, the network would be flooded with these messages. It has a strict set of criteria.
This process happens in two steps:
- Flagging phase: Set a flag during the route lookup input path (__mkroute_input).
- Execution phase: Actually send the message during the forwarding path (ip_forward).
Step 1: Setting the Flag (__mkroute_input)
When a packet arrives at the router and needs to be forwarded, the kernel calls fib_lookup to find the route result, then enters __mkroute_input to build the route entry (struct rtable) for it. Here, the kernel asks a critical question: "Is this packet taking a needless detour?"
Let's look at this key if check:
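/* net/ipv4/route.c */
static int __mkroute_input(struct sk_buff *skb,
                           const struct fib_result *res,
                           struct in_device *in_dev,
                           __be32 daddr, __be32 saddr, u32 tos)
{
    struct rtable *rth;
    struct in_device *out_dev;
    unsigned int flags = 0;
    bool do_cache;
    int err;

    /* ... elided: out_dev is looked up, err is set by the source-validation
     *     call (fib_validate_source), do_cache gets its initial value ... */

    /*
     * Core test: RTCF_DOREDIRECT is set only if ALL of the following hold:
     * 1. out_dev == in_dev: the packet leaves through the interface it
     *    arrived on (the most obvious sign of a detour)
     * 2. err: non-zero result of the source validation above, i.e. the
     *    sender is directly attached to this interface
     * 3. IN_DEV_TX_REDIRECTS(out_dev): the sysctl allows sending redirects
     * 4. At least one of:
     *    a. IN_DEV_SHARED_MEDIA: the interface is treated as shared media
     *       (for example Ethernet)
     *    b. inet_addr_onlink(...): the source address and the next-hop
     *       gateway are in the same subnet on this interface
     */
    if (out_dev == in_dev && err && IN_DEV_TX_REDIRECTS(out_dev) &&
        (IN_DEV_SHARED_MEDIA(out_dev) ||
         inet_addr_onlink(out_dev, saddr, FIB_RES_GW(*res)))) {
        flags |= RTCF_DOREDIRECT;   /* mark it: a Redirect should be sent */
        /* The sender's route is about to change, so do not cache this entry */
        do_cache = false;
    }

    /* ... elided: rth is allocated here ... */

    /* Record the flags in the route entry */
    rth->rt_flags = flags;
    /* ... */
}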
Why such strict conditions?
- In-interface = Out-interface: If a packet comes in on eth0 and goes right back out eth0, the sender and the next hop sit on the same link. The sender could have reached the next hop itself, so why route through me? This fits the definition of a "suboptimal route."
- Shared Media: On an interface treated as shared media (the shared_media sysctl, in the sense of RFC 1620), every host on the wire can reach every other one directly, even across IP subnets, so the Redirect is allowed without comparing subnets.
- On-link Check: When the interface is not shared media, the kernel instead requires that the sender and the suggested next hop are in the same subnet on that interface. This keeps the Redirect from pointing the host at a gateway it cannot reach directly.
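The conditions above can be sketched as a small user-space predicate. This is a simplified model under stated assumptions, not kernel code: struct toy_iface, toy_addr_onlink and toy_should_redirect are hypothetical names, each interface carries a single address (the kernel walks every address on the device), and the `err` term from the listing is modeled as a plain boolean src_is_neighbor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical, simplified stand-in for per-interface state. */
struct toy_iface {
    int      ifindex;
    uint32_t addr;          /* interface address, host byte order */
    uint32_t mask;          /* netmask, host byte order */
    bool     tx_redirects;  /* models IN_DEV_TX_REDIRECTS */
    bool     shared_media;  /* models IN_DEV_SHARED_MEDIA */
};

/* True if a is inside the interface subnet and b is too.
 * b == 0 means "no gateway, destination is directly connected". */
static bool toy_addr_onlink(const struct toy_iface *dev, uint32_t a, uint32_t b)
{
    uint32_t net = dev->addr & dev->mask;

    if ((a & dev->mask) != net)
        return false;
    return b == 0 || (b & dev->mask) == net;
}

/* Models the if-condition that sets RTCF_DOREDIRECT. */
static bool toy_should_redirect(const struct toy_iface *in,
                                const struct toy_iface *out,
                                bool src_is_neighbor,
                                uint32_t saddr, uint32_t gw)
{
    assert(in && out);
    return out->ifindex == in->ifindex &&
           src_is_neighbor &&
           out->tx_redirects &&
           (out->shared_media || toy_addr_onlink(out, saddr, gw));
}
```

For the running example, the router's interface is 192.168.2.10/24, the sender is 192.168.2.200, and the destination is directly connected (gw of 0), so the predicate is true. Changing the outgoing interface or switching off tx_redirects makes it false.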
Step 2: Sending the Message (ip_forward)
Once the flag is set, the packet enters the forwarding flow in ip_forward(). Here, the kernel performs a final confirmation before issuing the "rectification notice."
/* net/ipv4/ip_forward.c */
int ip_forward(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    struct rtable *rt = skb_rtable(skb);
    struct ip_options *opt = &(IPCB(skb)->opt);

    /*
     * Check whether a Redirect may be sent:
     * 1. rt->rt_flags & RTCF_DOREDIRECT: the flag set earlier in __mkroute_input
     * 2. !opt->srr: the packet carries no source-route IP option (loose or
     *    strict); if the sender dictated the path, do not interfere
     * 3. !skb_sec_path(skb): this is not an IPsec packet (after decapsulation
     *    the interfaces may look identical, but that is an illusion)
     */
    if (rt->rt_flags & RTCF_DOREDIRECT && !opt->srr && !skb_sec_path(skb))
        ip_rt_send_redirect(skb);

    /* ... continue forwarding the packet ... */
}
Finally, inside ip_rt_send_redirect(), the kernel constructs the ICMP packet and transmits it:
/* net/ipv4/route.c */
void ip_rt_send_redirect(struct sk_buff *skb)
{
    /* ... */
    /*
     * Send the ICMP Redirect message
     * ICMP_REDIR_HOST: the code, meaning "redirect for this host"
     * rt_nexthop(...): the "better next hop" taken from the route entry
     * In our example this value is 192.168.2.7
     */
    icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST,
              rt_nexthop(rt, ip_hdr(skb)->daddr));
}
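Why does the advertised gateway come out as 192.168.2.7 in our example? A small sketch helps, assuming rt_nexthop() behaves as in kernels of this era: it returns the route's gateway if there is one, and otherwise the destination address itself. The names toy_route and toy_nexthop are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the one rtable field that matters here. */
struct toy_route {
    uint32_t gateway;   /* 0 means the destination is directly connected */
};

/* Models rt_nexthop(): the gateway if set, else the destination itself. */
static uint32_t toy_nexthop(const struct toy_route *rt, uint32_t daddr)
{
    assert(rt);
    return rt->gateway ? rt->gateway : daddr;
}
```

On the forwarding machine the route to the laptop is a directly connected one, so the gateway field is empty and the "better next hop" placed in the Redirect is simply the laptop's own address.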
Receiving a Redirect: Obedience and Deception Prevention
Now let's flip the scenario. The AMD server receives the ICMP Redirect from the Windows server. What does it do?
It doesn't immediately modify the main routing table (FIB)—that would be way too dangerous. If anyone could change my routing table just by sending a packet, the network would be in chaos.
Instead, it does two things:
- Strict security check: Is this message legitimate?
- Creating an "exception": Establish a temporary, per-destination Nexthop Exception on top of the FIB.
The processing logic is centralized in the __ip_do_redirect() function.
1. Security Check Logic
/* net/ipv4/route.c */
static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb,
                             struct flowi4 *fl4, bool kill_route)
{
    __be32 new_gw = icmp_hdr(skb)->un.gateway; /* new gateway suggested by the ICMP message */
    __be32 old_gw = ip_hdr(skb)->saddr;        /* old gateway (the one that sent us this message) */
    struct net_device *dev = skb->dev;
    struct in_device *in_dev;
    /* ... */

    /* Check 1: the sender of the Redirect must be the gateway this route currently uses */
    if (rt->rt_gateway != old_gw)
        return;

    in_dev = __in_dev_get_rcu(dev);
    if (!in_dev)
        return;

    /* Check 2: a battery of sanity filters */
    if (new_gw == old_gw ||               /* same as the old one: nothing to change */
        !IN_DEV_RX_REDIRECTS(in_dev) ||   /* does the sysctl allow accepting redirects? */
        ipv4_is_multicast(new_gw) ||      /* new gateway must not be multicast */
        ipv4_is_lbcast(new_gw) ||         /* nor the limited broadcast address */
        ipv4_is_zeronet(new_gw))          /* nor anything in 0.0.0.0/8 */
        goto reject_redirect;

    /* Check 3: on non-shared media, strictly verify the new gateway is on-link */
    if (!IN_DEV_SHARED_MEDIA(in_dev)) {
        if (!inet_addr_onlink(in_dev, new_gw, old_gw))
            goto reject_redirect;
        /* secure_redirects: the new gateway must itself be a known default-route gateway */
        if (IN_DEV_SEC_REDIRECTS(in_dev) && ip_fib_check_default(new_gw, dev))
            goto reject_redirect;
    } else {
        /* Extra check on shared media: the new gateway must be a unicast address */
        if (inet_addr_type(net, new_gw) != RTN_UNICAST)
            goto reject_redirect;
    }

    /* ... checks passed, now do the real work ... */
This pile of goto reject_redirect checks exists to prevent attackers from forging ICMP Redirect packets to hijack traffic. Historically, many network attacks succeeded by spamming ICMP Redirect messages against legacy systems.
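The address-level part of these checks is easy to reproduce in user space. The sketch below models only the first two checks (gateway match and address sanity) on host-byte-order addresses; the toy_ names are hypothetical, and the on-link, secure_redirects and address-type checks are deliberately left out. It also collapses the kernel's silent return and its logged reject path into one boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Host-byte-order equivalents of the kernel's address class tests. */
static bool toy_is_multicast(uint32_t a) { return (a & 0xf0000000u) == 0xe0000000u; } /* 224.0.0.0/4 */
static bool toy_is_lbcast(uint32_t a)    { return a == 0xffffffffu; }                 /* 255.255.255.255 */
static bool toy_is_zeronet(uint32_t a)   { return (a & 0xff000000u) == 0; }           /* 0.0.0.0/8 */

/*
 * route_gw     - gateway currently used by the route the Redirect refers to
 * old_gw       - source address of the Redirect message
 * new_gw       - gateway suggested inside the Redirect
 * rx_redirects - models IN_DEV_RX_REDIRECTS (accepting redirects is enabled)
 */
static bool toy_redirect_plausible(uint32_t route_gw, uint32_t old_gw,
                                   uint32_t new_gw, bool rx_redirects)
{
    if (route_gw != old_gw)             /* only our current gateway may redirect us */
        return false;
    if (new_gw == old_gw || !rx_redirects ||
        toy_is_multicast(new_gw) || toy_is_lbcast(new_gw) || toy_is_zeronet(new_gw))
        return false;
    return true;
}
```

In the running example, route_gw and old_gw are both 192.168.2.10 and new_gw is 192.168.2.7, so the message is plausible. A Redirect whose source is some other host on the LAN fails the very first comparison.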
2. Creating a FIB Nexthop Exception
Once it passes the security check, the kernel executes the core operation—updating or creating a FIB nexthop exception.
Remember the FIB architecture we mentioned at the beginning of this chapter? fib_info stores nexthop information. Modifying fib_info directly would have a global impact, so the kernel uses the Exception mechanism here.
    /* ... continuing __ip_do_redirect ... */

    /* Ask the neighbor subsystem for the entry of the new gateway */
    n = ipv4_neigh_lookup(&rt->dst, NULL, &new_gw);
    if (n) {
        /* If the neighbor entry is not yet NUD_VALID, start resolution (ARP) first */
        if (!(n->nud_state & NUD_VALID)) {
            neigh_event_send(n, NULL);
        } else {
            /* If it is valid, look up the FIB again (why? to get the fib_nh pointer) */
            if (fib_lookup(net, fl4, &res) == 0) {
                struct fib_nh *nh = &FIB_RES_NH(res);
                /*
                 * Key operation: update or create the FNHE (FIB Nexthop Exception)
                 * fl4->daddr: destination address (192.168.2.7)
                 * new_gw: new gateway (192.168.2.7)
                 * The two trailing zeros are the PMTU and expiry arguments,
                 * which a redirect does not set
                 */
                update_or_create_fnhe(nh, fl4->daddr, new_gw, 0, 0);
            }
            /* If requested, mark the old cached route as obsolete */
            if (kill_route)
                rt->dst.obsolete = DST_OBSOLETE_KILL;
            /* Send the neighbor-update notification */
            call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, n);
        }
        neigh_release(n);
    }
    return;

reject_redirect:
    /* ... report or log the rejected redirect ... */
What happens here?
The kernel doesn't modify the global FIB table. Instead, for the specific destination 192.168.2.7, it attaches an Exception to fib_nh: "From now on, when going to 192.168.2.7, skip the gateway and send directly to 192.168.2.7."
This highlights the elegance of Linux routing design:
- Globally, traffic still follows the FIB rules (via 192.168.2.10).
- Locally (for flows "educated" by the Redirect), traffic takes the Exception and sends packets directly.
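The global-versus-local split can be modeled with a toy nexthop that carries a small list of per-destination exceptions. This is only a sketch of the idea: the kernel keeps its exceptions in a hash table with expiry handling, whereas the hypothetical toy_nh below uses a fixed array and a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_EXC 8

struct toy_exception {
    uint32_t daddr;     /* destination the exception applies to */
    uint32_t gw;        /* gateway to use for that destination */
};

/* Hypothetical stand-in for fib_nh plus its exception list. */
struct toy_nh {
    uint32_t gw;                            /* gateway from the FIB entry */
    struct toy_exception exc[TOY_MAX_EXC];
    size_t n_exc;
};

/* Update the exception for daddr, or create one. Returns 0, or -1 if full. */
static int toy_update_or_create_exc(struct toy_nh *nh, uint32_t daddr, uint32_t gw)
{
    size_t i;

    assert(nh);
    for (i = 0; i < nh->n_exc; i++) {
        if (nh->exc[i].daddr == daddr) {
            nh->exc[i].gw = gw;
            return 0;
        }
    }
    if (nh->n_exc == TOY_MAX_EXC)
        return -1;
    nh->exc[nh->n_exc].daddr = daddr;
    nh->exc[nh->n_exc].gw = gw;
    nh->n_exc++;
    return 0;
}

/* Exceptions win over the FIB gateway; everything else is untouched. */
static uint32_t toy_select_gw(const struct toy_nh *nh, uint32_t daddr)
{
    size_t i;

    assert(nh);
    for (i = 0; i < nh->n_exc; i++)
        if (nh->exc[i].daddr == daddr)
            return nh->exc[i].gw;
    return nh->gw;
}
```

Before the Redirect, a lookup for 192.168.2.7 yields the FIB gateway 192.168.2.10. After the exception is recorded, the same lookup yields 192.168.2.7, while any destination without an exception still gets the original gateway.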
Echoes of History: The Demise of the IPv4 Routing Cache
Since we've mentioned Redirect and routing caches, we need to spend some time discussing a mechanism that has already been removed—the IPv4 Routing Cache.
If you look through older documentation (pre-kernel 3.6), you'll find that the entire routing subsystem was designed around the "routing cache."
That era looked like this:
- Routing Cache: A massive hash table, and it was the first stop for any lookup. The key was derived from per-packet fields such as the source and destination addresses and the TOS.
- FIB Tables: Only consulted on a cache miss, and the result was then stuffed back into the Cache.
Why was it removed? It sounds great—grab it fast, use it fast. But it had a fatal flaw: DoS attacks.
Because every unique key would generate a cache entry, an attacker only needed to send packets to random IP addresses (e.g., 1.2.3.4, 1.2.3.5...), and the kernel would frantically create cache entries until memory was exhausted.
Although it had a garbage collection (GC) mechanism, under a high-intensity attack, the GC itself would consume all the CPU.
The Turning Point: FIB TRIE (LC-trie)
To eliminate the cache, the kernel had to make FIB table lookups fast enough on their own. The solution was FIB TRIE (a tree structure built for longest prefix match).
The lookup complexity of FIB TRIE is O(length of address), which is more stable than a hash table at scale and avoids the headaches of hash collisions. Once FIB TRIE became fast enough, that bloated and insecure cache layer became dead weight.
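To see where the "O(length of address)" bound comes from, here is a minimal sketch of longest-prefix matching on a plain one-bit-per-level binary trie. It is deliberately not the kernel's LC-trie, which compresses paths and levels; the toy_ names are hypothetical, and freeing the nodes is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_node {
    struct toy_node *child[2];  /* next bit is 0 or 1 */
    int has_route;              /* a prefix ends at this node */
    uint32_t gw;                /* gateway for that prefix (0 = direct) */
};

static struct toy_node *toy_node_new(void)
{
    return calloc(1, sizeof(struct toy_node));
}

/* Insert prefix/len -> gw. Returns 0 on success, -1 on allocation failure. */
static int toy_trie_insert(struct toy_node *root, uint32_t prefix, int len, uint32_t gw)
{
    struct toy_node *n = root;
    int i;

    assert(root && len >= 0 && len <= 32);
    for (i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;

        if (!n->child[bit]) {
            n->child[bit] = toy_node_new();
            if (!n->child[bit])
                return -1;
        }
        n = n->child[bit];
    }
    n->has_route = 1;
    n->gw = gw;
    return 0;
}

/* Walk at most 32 levels, remembering the deepest prefix seen.
 * Returns 1 and stores the gateway in *gw on a match, 0 otherwise. */
static int toy_trie_lookup(const struct toy_node *root, uint32_t addr, uint32_t *gw)
{
    const struct toy_node *n = root;
    int found = 0, i;

    for (i = 0; n; i++) {
        if (n->has_route) {
            *gw = n->gw;
            found = 1;
        }
        if (i == 32)
            break;
        n = n->child[(addr >> (31 - i)) & 1];
    }
    return found;
}
```

The loop body runs at most 33 times regardless of how many routes are installed, and no per-destination state is created by a lookup, which is exactly the property the routing cache lacked.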
Thus, in kernel 3.6 (released in 2012), the IPv4 routing cache was permanently removed.
Significance for "Archaeology"
Although modern kernels no longer have it, you may still meet this code in the field, for example on RHEL 6 or older embedded systems. If you see rt_intern_hash in an old kernel's source or stack traces, don't panic: it is a relic of a bygone era. (ip_route_input_slow, by contrast, survived the removal and still exists in current kernels.)
Chapter Reflection
Routers aren't dumb devices that silently forward packets.
Through ICMP Redirect, a router can tell a host "there's a shorter path." But this isn't simple—the kernel must carefully balance trust and security: it needs to obediently optimize the path (by creating an FNHE) while preventing deception by malicious messages (Strict Checks).
In this chapter, we traced the logic from fib_lookup, through the TRIE structure of FIB tables, to the dynamic corrections of Redirect. We saw how the kernel transforms a static routing table into a dynamic, efficient, and secure forwarding system.
In the next chapter, we'll turn our gaze further afield—to multipath routing. When one path isn't enough, or for load balancing purposes, how does the kernel dance across multiple paths simultaneously?