6.8 Multipath Routing
Policy routing gives us the freedom to choose different routes, but the problem it solves is "selecting a route based on criteria other than the destination (such as the source address)."
Here is a more subtle requirement: suppose you are a network administrator with two broadband connections to the internet—China Telecom and China Unicom—with similar bandwidth. You don't want either link to sit idle, and you don't want to manually configure half of your computers to use Telecom and the other half to use Unicom. What you want is simple: spread the traffic out.
Or, suppose you are a server administrator with two NICs connected to the same switch, and you simply want to max out the 1Gbps bandwidth limit by "using both pipes."
This is the problem that multipath routing solves.
What is Multipath Routing
From the kernel's perspective, this sounds a bit counterintuitive. In previous chapters, we consistently emphasized that route lookups are "exact matches"—given a destination, return a single, definitive next hop. If there are two next hops, which one wins?
All of them.
Multipath routing allows you to configure multiple next hops for a single route entry. You can express it like this:
"For packets destined for 192.168.1.10, you can go via 192.168.2.1, or you can go via 192.168.2.10. Figure it out, just make sure to distribute them."
On the command line, it looks like this:
# Simple multipath: the two paths share the traffic equally
ip route add default scope global nexthop dev eth0 nexthop dev eth1
# Weighted multipath: the second path carries more traffic
ip route add 192.168.1.10 nexthop via 192.168.2.1 weight 3 \
nexthop via 192.168.2.10 weight 5
You can think of each nexthop as a lane. The second command above means: for every 8 units of traffic, 3 units go down the first lane and 5 units go down the second.
But the kernel doesn't just "figure it out"—it needs a precise algorithm to decide which specific path each individual packet should take. This mechanism is hidden inside fib_info.
Kernel Representation: fib_info and fib_nh
In the IPv4 FIB (Forwarding Information Base) subsystem, the fib_info structure is the core carrier of routing information. As we discussed earlier, a normal route entry has a single fib_nh (FIB Nexthop), but in a multipath scenario, this field becomes an array.
There is a subtle distinction here:
- Single path: fib_info points to a single fib_nh structure.
- Multipath: fib_info contains a fib_nh array, whose length is specified by the fib_nhs member.
When you execute the ip route add command above with two nexthop options, the kernel creates a fib_info object, sets its fib_nhs to 2, and packs two fib_nh objects into the fib_nh array.
At this point, each fib_nh object also has a key field: nh_weight.
This field is the "weight" we just mentioned. If you explicitly specify weight 3 in the ip route command, the kernel fills it in; if you don't, the fib_create_info() function defaults it to 1.
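To make the layout concrete, here is an abridged sketch of the two structures as they appear in older IPv4 FIB code. Only the fields discussed above are shown; everything else is omitted, so treat this as an illustration rather than the real, compilable definitions.

/* Abridged sketch of the IPv4 FIB structures involved in multipath.
 * Field names follow the (older) kernel sources; most members omitted. */
struct fib_nh {
    struct net_device *nh_dev;    /* egress device for this next hop      */
    __be32             nh_gw;     /* gateway address ("via" in ip route)  */
    int                nh_weight; /* "weight N" from ip route, default 1  */
};

struct fib_info {
    int           fib_nhs;        /* number of next hops: 1 for a normal
                                   * route, >1 for a multipath route      */
    struct fib_nh fib_nh[0];      /* flexible array with fib_nhs entries  */
};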
Making the Decision: fib_select_multipath()
The routing table determines "which paths are available," but when a packet arrives, we must pick exactly one path to take. The black box that makes this decision is called fib_select_multipath().
This function is called in two key places:
- Tx path: Inside __ip_route_output_key(). This is the path taken when locally generated packets are ready to go out.
- Rx path: Inside ip_mkroute_input(). This is the path taken when forwarding packets.
But there is a conditional check here. In the Tx path code, you will see this logic:
struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) {
...
#ifdef CONFIG_IP_ROUTE_MULTIPATH
// Only select a path when multiple paths exist and no output device is forced
if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
fib_select_multipath(&res);
else
#endif
...
}
Why is there a check for flowi4_oif == 0?
Because when an application sends data, it can explicitly specify "must go out via eth0", for example through the SO_BINDTODEVICE socket option or the ancillary messages of sendmsg(). Once the user makes this hard requirement, multipath load balancing becomes meaningless: even if the other path is completely idle, the packet must take this one. In this case, the kernel skips fib_select_multipath() and directly uses the user-specified path.
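As an aside, here is a minimal user-space sketch (purely illustrative, not taken from any real program) of one common way to pin a socket's traffic to a device: the SO_BINDTODEVICE socket option. Once it is set, the lookup key carries a non-zero output interface, so the check above bypasses fib_select_multipath() for this socket's packets.

/* Minimal sketch: pin a UDP socket to eth0 with SO_BINDTODEVICE.
 * After this, packets sent on the socket have a fixed output device,
 * so multipath selection no longer applies to them.
 * (Setting this option typically requires root / CAP_NET_RAW.) */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    const char ifname[] = "eth0";

    if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                   ifname, strlen(ifname) + 1) < 0)
        perror("SO_BINDTODEVICE");
    return 0;
}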
In the Rx path, this restriction typically doesn't exist, because forwarded packets don't carry a preset instruction like "please go out via eth1."
Selection Algorithm
The internal implementation of fib_select_multipath() is actually quite interesting. It doesn't simply do round-robin (1, 2, 1, 2...) as our intuition might suggest. It introduces randomness.
This randomness is not for unpredictability, but for weighted fairness.
It uses the system time (jiffies) as a seed for hash computation. For each incoming packet, the kernel calculates a hash value (based on source IP, destination IP, source port, destination port, etc.), combines it with the nh_weight of each path, and determines which fib_nh this packet should go to.
The design goals of this algorithm are:
- Packets belonging to the same flow (identical 5-tuple) should take the same path. Otherwise, TCP out-of-order delivery would cause a sharp performance drop.
- Different flows should be distributed across different paths according to their weight ratios.
The finally selected path index is stored in the nh_sel (Nexthop Selector) field of the fib_result structure, and subsequent forwarding logic uses this index to look up the fib_nh array.
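To make the idea concrete, here is a small user-space sketch of weight-proportional, flow-sticky selection. It is not the kernel's code (the real logic lives in fib_select_multipath() and its helpers); the toy flow_hash() and the hash-range scaling below are illustrative assumptions that merely mirror the two design goals above.

/* Minimal sketch (not kernel code): pick a next-hop index from a flow
 * hash and per-path weights. Packets of the same flow hash to the same
 * value, so they keep taking the same path; different flows spread out
 * in proportion to the weights. */
#include <stdint.h>
#include <stdio.h>

struct nh { uint32_t weight; };

/* Assumed toy flow hash over the 5-tuple; the kernel uses its own. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h *= 0x45d9f3b;
    h ^= h >> 16;
    return h;
}

/* Hash-threshold selection: each path owns a slice of the 32-bit hash
 * space proportional to its weight (e.g. weights 3 and 5 give slices of
 * 3/8 and 5/8). The returned index plays the role of nh_sel. */
static int select_nexthop(const struct nh *nhs, int n, uint32_t hash)
{
    uint64_t total = 0, bound = 0;
    for (int i = 0; i < n; i++)
        total += nhs[i].weight;

    for (int i = 0; i < n; i++) {
        bound += nhs[i].weight;
        if ((uint64_t)hash < bound * 0x100000000ULL / total)
            return i;
    }
    return n - 1;
}

int main(void)
{
    struct nh paths[2] = { { .weight = 3 }, { .weight = 5 } };
    /* One flow, one hash, one stable path choice. */
    uint32_t h = flow_hash(0xc0a80201, 0xc0a8010a, 40000, 443);
    printf("flow takes path %d\n", select_nexthop(paths, 2, h));
    return 0;
}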
Where the Code Lives
If you go digging through the source code looking for a file called multipath.c, you will be disappointed.
For multicast routing, the kernel has a dedicated net/ipv4/ipmr.c module, nice and clean. But for multipath routing, the code is like a "ghost"—scattered across various corners of the generic routing code, wrapped in numerous #ifdef CONFIG_IP_ROUTE_MULTIPATH conditionals.
This shows that in the eyes of the kernel designers, multipath routing is not an independent subsystem, but rather an enhancement feature of the route lookup logic.
There is also a historical footnote to this approach. If you look at older kernel code (before 2007), you will find that IPv4 once had a dedicated "multipath route cache." This cache was removed in the 2.6.23 kernel.
Be careful not to confuse this: removal of the multipath route cache ≠ removal of the route cache.
- The multipath cache was removed in 2007 because, as an experimental feature, it never worked very well.
- The real route cache was removed in 2012 in kernel 3.6, in order to solve the cache synchronization overhead problem on multi-core CPUs.
The current multipath implementation is completed directly during the FIB lookup phase, with no additional cache layer, which actually makes the logic clearer.
Configuration Switch: CONFIG_IP_ROUTE_MULTIPATH
Finally, remember to check your kernel configuration.
To make all of this work, your kernel configuration must include CONFIG_IP_ROUTE_MULTIPATH=y. It is a boolean option, so it cannot be built as a module, and some stripped-down kernel configurations disable it to streamline the kernel. If you find that ip route add with nexthop is rejected or doesn't take effect, check your kernel configuration (make menuconfig, or the config file of the running kernel) first.