5.4 The Last Mile: Nexthop (fib_nh)
In the previous section, we dissected the "central nervous system" of a fib_info routing entry.
But did you feel like something was missing?
We have a destination (fib_info contains almost all the metadata), and we have a map (the routing table itself), but we're missing the "signposts" that point the way.
When the kernel actually decides to send a packet out, it no longer needs to think about TOS, protocol priorities, or routing scopes. In that instant, it only cares about two things: which device the packet leaves from, and who receives it next.
This brings us to the star of this section: fib_nh (Next Hop). It is where a routing decision finally lands, and the last puzzle piece before we reach the actual transmission path.
Structure Breakdown: What's inside fib_nh?
Let's hold off on the code for a moment and build a mental model first.
As mentioned in the previous section, fib_info is the "mother" of the routing entry, and fib_nh is the "child" whose hand she holds. For most simple routes (like ip route add 192.168.1.0/24 dev eth0), a fib_info contains only one fib_nh; but for Multipath Routing, fib_info holds an array of fib_nh, like a mother leading a row of children.
Now, let's see what this "child" looks like:
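struct fib_nh {
    struct net_device *nh_dev;          // output device
    struct hlist_node nh_hash;          // links this nexthop into a per-device hash
    struct fib_info *nh_parent;         // the fib_info that owns this nexthop
    unsigned int nh_flags;              // RTNH_F_* flags, e.g. RTNH_F_DEAD
    unsigned char nh_scope;             // scope of this nexthop
#ifdef CONFIG_IP_ROUTE_MULTIPATH
    int nh_weight;                      // multipath weight
    int nh_power;                       // multipath balancing state
#endif
#ifdef CONFIG_IP_ROUTE_CLASSID
    __u32 nh_tclassid;                  // classifier (realm) ID
#endif
    int nh_oif;                         // interface index of nh_dev
    __be32 nh_gw;                       // gateway address, 0 if directly connected
    __be32 nh_saddr;                    // preferred source address
    int nh_saddr_genid;                 // generation ID for the cached nh_saddr
    struct rtable __rcu * __percpu *nh_pcpu_rth_output; // per-CPU cached output routes
    struct rtable __rcu *nh_rth_input;  // cached input route
    struct fnhe_hash_bucket *nh_exceptions; // nexthop exception hash table
};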
There are a few fields here that we must understand immediately, otherwise we won't get far:
- nh_dev: This is the output device. The kernel relies on this pointer to find the net_device and call the driver to send the packet out.
- nh_oif: This is the interface index of the nh_dev. Sometimes we don't have the device pointer yet, only the ID, so we rely on this field to look it up.
- nh_gw: This is the Nexthop Gateway IP address. If the destination is directly connected, this field is 0; if it needs to hop through a router, this is the router's IP.
- nh_parent: A pointer back to the fib_info. This back-reference lets a nexthop find the fib_info that owns it.
We'll dive into the remaining fields—nh_saddr (preferred source address) and nh_pcpu_rth_output (per-CPU routing cache)—when we walk through the actual transmission path.
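To make the role of nh_gw concrete, here is a minimal userspace sketch of the "send it to whom" decision. The names toy_nh and toy_neigh_addr are hypothetical, and addresses are plain host-order integers rather than __be32, so this is an illustration of the rule described above, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields discussed above. */
struct toy_nh {
    int      oif;  /* models nh_oif: output interface index */
    uint32_t gw;   /* models nh_gw: 0 means "directly connected" */
};

/*
 * Pick the address whose link-layer neighbor must receive the packet:
 * the gateway if one is configured, otherwise the destination itself.
 */
static uint32_t toy_neigh_addr(const struct toy_nh *nh, uint32_t daddr)
{
    return nh->gw ? nh->gw : daddr;
}
```

For a directly connected route the function returns the destination address; for a route via a router it returns the router's address, which is the address the neighbor layer would then resolve to a MAC.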
When a Device "Goes Down": The Lifecycle of fib_nh
Devices aren't always online.
When we run ip link set eth0 down, or the moment we unplug a network cable, the kernel must react. If the device pointed to by a fib_nh is gone, the route is useless.
The kernel handles this through the Notifier Chain mechanism.
Bonus Knowledge: Notifier Chain
There's a "gossip network" inside the kernel. When a subsystem experiences a major event (like device registration or unregistration), it broadcasts the news. Any module that cares about this event just needs to register a callback on this chain in advance to receive the message.
Here, the FIB module is the "eavesdropper": it registered fib_netdev_notifier.
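The register-then-broadcast pattern can be sketched in a few lines of userspace C. Everything here (toy_notifier, toy_register, toy_call_chain, the event codes) is hypothetical and simplified: the real kernel chains also handle priorities, locking, and early termination, none of which is modeled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NETDEV_UP   1  /* hypothetical event codes, not the kernel's values */
#define TOY_NETDEV_DOWN 2

struct toy_notifier {
    int (*call)(struct toy_notifier *self, int event, void *data);
    struct toy_notifier *next;
};

static struct toy_notifier *toy_chain;  /* head of the singly linked chain */

/* Subscribe: push the listener onto the front of the chain. */
static void toy_register(struct toy_notifier *nb)
{
    nb->next = toy_chain;
    toy_chain = nb;
}

/* Broadcast: invoke every registered callback; return how many ran. */
static int toy_call_chain(int event, void *data)
{
    int n = 0;

    for (struct toy_notifier *nb = toy_chain; nb; nb = nb->next) {
        nb->call(nb, event, data);
        n++;
    }
    return n;
}

static int toy_down_seen;  /* incremented by the FIB-like listener below */

/* A listener that only reacts to "device down" events. */
static int toy_fib_event(struct toy_notifier *self, int event, void *data)
{
    (void)self;
    (void)data;
    if (event == TOY_NETDEV_DOWN)
        toy_down_seen++;
    return 0;
}
```

A caller would fill in a toy_notifier with toy_fib_event, pass it to toy_register once, and from then on every toy_call_chain(TOY_NETDEV_DOWN, ...) reaches the listener without the broadcaster knowing who is subscribed.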
The specific call chain looks like this:
- Trigger: The user shuts down the network interface, or the interface physically disconnects.
- Notify: The kernel networking device core emits a NETDEV_DOWN event.
- Callback: The FIB callback function fib_netdev_event() is triggered (defined in net/ipv4/fib_frontend.c).
- Handle: fib_netdev_event() calls fib_disable_ip().
Once we reach fib_disable_ip(), things get serious. There are three "lethal" steps here:
Step 1: Mark as Dead (fib_sync_down_dev)
First, the kernel calls fib_sync_down_dev(). This function does something ruthless: it iterates over all fib_nh using this device and sets their RTNH_F_DEAD flag.
At the same time, once every fib_nh belonging to a fib_info is dead, it sets RTNH_F_DEAD on that parent fib_info as well.
It's like a doctor issuing a terminal diagnosis. The routing entry itself is still in memory (fib_info hasn't been freed), but it has been marked as "dead." If a later lookup lands on this route and sees RTNH_F_DEAD, it treats the path as unusable, so the packet is either dropped or the lookup moves on to an alternative.
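The marking pass can be modeled in userspace as follows. This is a sketch under stated assumptions: toy_info, toy_nh, TOY_F_DEAD, and toy_sync_down_dev are hypothetical names, routes live in a flat array (the kernel walks a hash keyed by device instead), and a route is marked dead only when all of its nexthops are dead.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_F_DEAD 1u  /* stands in for RTNH_F_DEAD; the value is illustrative */

struct toy_nh   { int oif; unsigned int flags; };
struct toy_info { unsigned int flags; int nhs; struct toy_nh nh[2]; };

/*
 * Mark every nexthop that uses `ifindex` as dead. A route whose nexthops
 * are now all dead is marked dead too. Returns the number of routes that
 * became entirely dead, i.e. the candidates for a later flush.
 */
static int toy_sync_down_dev(struct toy_info *tbl, size_t n, int ifindex)
{
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_info *fi = &tbl[i];
        int dead = 0;

        for (int j = 0; j < fi->nhs; j++) {
            if (fi->nh[j].oif == ifindex)
                fi->nh[j].flags |= TOY_F_DEAD;
            if (fi->nh[j].flags & TOY_F_DEAD)
                dead++;
        }
        if (dead == fi->nhs && !(fi->flags & TOY_F_DEAD)) {
            fi->flags |= TOY_F_DEAD;
            ret++;
        }
    }
    return ret;
}
```

Note how a multipath route with one surviving nexthop keeps its own flag clear: only the affected child is marked, and the route remains usable through the other path.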
Step 2: Clean Up the Battlefield (fib_flush)
Once marked, it's time to delete.
The fib_flush() function is then called to actually clean up those "dead and unused" routing entries. Removing an entry drops a reference on its fib_info, and the fib_info structure is only freed once that count reaches zero.
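The "free only at zero" rule is ordinary reference counting. The sketch below uses hypothetical names (toy_info_alloc, toy_info_hold, toy_info_put) and a plain int counter; the kernel uses atomic counters and RCU-deferred freeing, which this illustration leaves out.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_info {
    int refcnt;  /* number of holders, e.g. table entries or cached routes */
};

static int toy_freed;  /* counts frees so the behaviour can be observed */

static struct toy_info *toy_info_alloc(void)
{
    struct toy_info *fi = malloc(sizeof(*fi));

    if (fi)
        fi->refcnt = 1;  /* the creator holds the first reference */
    return fi;
}

static void toy_info_hold(struct toy_info *fi)
{
    fi->refcnt++;
}

/* Drop one reference; free only when the last holder lets go. */
static void toy_info_put(struct toy_info *fi)
{
    if (--fi->refcnt == 0) {
        free(fi);
        toy_freed++;
    }
}
```

If a cached route still holds a reference when the table entry is flushed, the first toy_info_put leaves the object alive; it disappears only when the cache drops its reference too.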
Step 3: Flush Caches (rt_cache_flush)
Although the route in the routing table (FIB) has been marked as dead, previous lookup results might still be cached elsewhere (like the rtable cache).
To prevent the kernel from continuing to use stale, invalid cached paths, rt_cache_flush() is called to forcefully flush these caches. At the same time, arp_ifdown() is also called to clean up ARP neighbor entries associated with this device—after all, if the device is gone, remembering its MAC address is useless.
Exceptions Always Exist: FIB Nexthop Exceptions
If you thought a route was an ironclad rule that never changes once configured, you've underestimated the complexity of networks.
Imagine this scenario:
You configure a default route via 192.168.1.1. Suddenly, a nearby router (192.168.1.2) sends an ICMP Redirect message saying: "Hey, don't bother with 1.1, I'm the shortcut, use me instead."
Or, the MTU along the path changes (like moving from Ethernet into a PPPoE tunnel), and packets need fragmentation.
In these cases, modifying the global routing table (FIB) is too costly, especially since this is only a temporary fix for one specific destination address.
Kernel 3.6 introduced an elegant solution: FIB Nexthop Exceptions.
You can think of it as a "sticky note pad" attached to a fib_nh.
The global routing table says "packets to 10.0.0.0/24 go via Gateway A," but the sticky note says "if it's going to 10.0.0.5, change the MTU to 1400, or swap the gateway to B."
This sticky note is a hash table (nh_exceptions), stored within each fib_nh structure.
- Hash Table Key: The destination IP address (fnhe_daddr).
- Size: 2048 hash buckets.
- Reclamation Mechanism: If the linked list depth of a hash bucket exceeds 5, it starts evicting old entries.
Let's look at the data structure for this "sticky note" (fib_nh_exception):
struct fib_nh_exception {
    struct fib_nh_exception __rcu *fnhe_next;
    __be32 fnhe_daddr;              // destination address (the key)
    u32 fnhe_pmtu;                  // corrected PMTU
    __be32 fnhe_gw;                 // corrected gateway
    unsigned long fnhe_expires;     // expiration time
    struct rtable __rcu *fnhe_rth;  // cached route built from this exception
    unsigned long fnhe_stamp;       // timestamp
};
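The sticky note pad behaves like a chained hash table that recycles its oldest note once a bucket gets too deep. The following userspace sketch assumes hypothetical names (toy_fnhe, toy_update_or_create, toy_find) and a deliberately trivial modulo hash; the kernel's hash function, locking, and RCU handling are not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_HASH_SIZE     2048  /* bucket count quoted in the text */
#define TOY_RECLAIM_DEPTH 5     /* chain depth beyond which entries are reused */

struct toy_fnhe {
    struct toy_fnhe *next;
    uint32_t daddr;       /* key */
    uint32_t gw;          /* 0 = no gateway override */
    uint32_t pmtu;        /* 0 = no PMTU override */
    unsigned long stamp;  /* last update, used to find the oldest note */
};

static struct toy_fnhe *toy_table[TOY_HASH_SIZE];

/* Illustrative hash only; not the kernel's mixing function. */
static unsigned int toy_hash(uint32_t daddr)
{
    return daddr % TOY_HASH_SIZE;
}

static struct toy_fnhe *toy_find(uint32_t daddr)
{
    struct toy_fnhe *e = toy_table[toy_hash(daddr)];

    while (e && e->daddr != daddr)
        e = e->next;
    return e;
}

static struct toy_fnhe *toy_update_or_create(uint32_t daddr, uint32_t gw,
                                             uint32_t pmtu, unsigned long now)
{
    struct toy_fnhe **head = &toy_table[toy_hash(daddr)];
    struct toy_fnhe *e, *oldest = NULL;
    int depth = 0;

    /* Walk the bucket: stop on a key match, track the oldest entry. */
    for (e = *head; e; e = e->next) {
        if (e->daddr == daddr)
            break;
        if (!oldest || e->stamp < oldest->stamp)
            oldest = e;
        depth++;
    }
    if (!e) {
        if (depth > TOY_RECLAIM_DEPTH) {
            e = oldest;  /* recycle the stalest note in place */
            e->gw = 0;
            e->pmtu = 0;
        } else {
            e = calloc(1, sizeof(*e));
            if (!e)
                return NULL;
            e->next = *head;
            *head = e;
        }
        e->daddr = daddr;
    }
    if (gw)
        e->gw = gw;      /* only overwrite what the caller supplied */
    if (pmtu)
        e->pmtu = pmtu;
    e->stamp = now;
    return e;
}
```

Recycling in place keeps the bucket bounded without a separate garbage collector: a flood of exceptions for colliding destinations can only push out the oldest notes, never grow the chain without limit.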
Who Writes on This Sticky Note?
There are mainly two scenarios.
Scenario 1: Receiving an ICMP Redirect
When the kernel receives an ICMPv4 Redirect message (code ICMP_REDIR_HOST), the __ip_do_redirect() function is called.
This function extracts the new gateway address (new_gw) from the ICMP packet, then calls update_or_create_fnhe() to create or update an exception entry in the fib_nh hash table.
static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4,
                             bool kill_route)
{
    ...
    __be32 new_gw = icmp_hdr(skb)->un.gateway;
    ...
    // Core step: record the new gateway in nh_exceptions
    update_or_create_fnhe(nh, fl4->daddr, new_gw, 0, 0);
    ...
}
From then on, when sending packets to this daddr, the kernel checks this sticky note in addition to the global FIB during route lookup. If it finds an entry here, it prioritizes the fnhe_gw over the globally configured gateway.
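The precedence rule (sticky note first, configured gateway second) can be sketched like this. The names toy_exc and toy_pick_gw are hypothetical, and the notes are kept in a small array that is scanned linearly purely for readability; addresses are host-order integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_exc {
    uint32_t daddr;  /* destination the note applies to */
    uint32_t gw;     /* replacement gateway, 0 if the note carries no gateway */
};

/*
 * Return the gateway to use for `daddr`: the exception's gateway when a
 * matching note with a gateway exists, otherwise the route's own nh_gw.
 */
static uint32_t toy_pick_gw(uint32_t nh_gw, const struct toy_exc *notes,
                            size_t n, uint32_t daddr)
{
    for (size_t i = 0; i < n; i++) {
        if (notes[i].daddr == daddr && notes[i].gw != 0)
            return notes[i].gw;
    }
    return nh_gw;
}
```

A note that only records a PMTU (gateway left at 0) does not disturb the gateway choice, which mirrors the idea that one sticky note can override either field independently.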
Scenario 2: PMTU Update (Path MTU Discovery)
When the kernel discovers that the MTU to a destination has decreased (like receiving an ICMP "Fragmentation needed" error), __ip_rt_update_pmtu() is triggered.
Similarly, it calls update_or_create_fnhe() to record the new MTU value in the fnhe_pmtu field.
static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
{
    . . .
    if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
        struct fib_nh *nh = &FIB_RES_NH(res);
        // Core step: update the PMTU and set its expiration time
        update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
                              jiffies + ip_rt_mtu_expires);
    }
    . . .
}
There's an important detail here: expiration.
PMTU information is time-sensitive. By default, if a PMTU entry isn't updated for 10 minutes (600 seconds), it expires. This duration is controlled by /proc/sys/net/ipv4/route/mtu_expires.
Whenever dst_mtu() is called (typically on the transmission path), the kernel checks the timestamp via ipv4_mtu() to see if the sticky note has expired. If it has, it discards the entry and falls back to the global path.
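The expiry check boils down to a wraparound-safe time comparison followed by a fallback. Below is a minimal sketch, assuming a 32-bit tick counter and hypothetical helper names (toy_time_after_eq, toy_effective_mtu); it is written in the spirit of the kernel's time_after_eq() macro but is not its implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * "a is at or after b" for a free-running tick counter. Comparing the
 * signed difference keeps the answer correct across counter wraparound.
 */
static int toy_time_after_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/*
 * Return the MTU to use: the learned PMTU while its note is still valid,
 * otherwise the MTU configured on the route itself.
 */
static uint32_t toy_effective_mtu(uint32_t learned_pmtu, uint32_t expires,
                                  uint32_t now, uint32_t route_mtu)
{
    if (learned_pmtu == 0 || toy_time_after_eq(now, expires))
        return route_mtu;
    return learned_pmtu;
}
```

Comparing `now >= expires` directly would misfire when the counter wraps between recording the note and checking it; the signed-difference form avoids that as long as the two times are less than half the counter range apart.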
Section Summary
With this, our dissection of fib_nh is complete.
Let's recap:
- fib_nh is the endpoint of action: It contains the output device nh_dev and the gateway address nh_gw, serving as the last navigation information a packet depends on before leaving the kernel.
- Device state linkage: Through the notifier chain mechanism, fib_nh can quickly sense device NETDEV_DOWN events. By marking RTNH_F_DEAD and calling fib_flush, it implements a "clean up after departure" logic.
- FIB Exceptions mechanism: The kernel doesn't need to modify the massive global routing table for edge cases. By hanging a hash table on each fib_nh, the kernel implements fine-tuned adjustments for specific destinations. Whether changing the gateway or the MTU, a small sticky note is all it takes.
But this raises a new question: if we have 255 different maps (255 routing tables), whose instructions should the kernel follow?
That's what we'll cover in the next chapter: Policy Routing. But before we get there, let's make sure the foundation we just laid is solid.