6.7 Policy Routing
In the previous section, we discussed multicast—the art of "one-to-many." In this section, we pull our perspective back and re-examine that seemingly simple action: table lookup.
In traditional routing, a packet is like a passenger who only cares about the destination: regardless of who you are, where you came from, or what marks you carry, as long as you're heading to the same place, you get put on the same bus.
But the real world doesn't work that way. Sometimes we need traffic coming in over a dedicated line to take an expensive, high-quality link, while traffic from the public internet takes a cheap, congested link. Sometimes we need to split traffic based on the TOS field in the IP header. Sometimes we even need to route based on firewall marks.
This is Policy Routing. It breaks the "destination is everything" rule, allowing system administrators to define logic like "if condition A is met, look up table X; if condition B is met, look up table Y."
In this section, we focus on the IPv4 policy routing implementation (we'll tackle IPv6 in Chapter 8). To avoid confusion, we'll refer to entries in policy routing as rules, and entries in a routing table as routes. Don't mix them up: rules determine which table to look up, and the table determines where to forward.
Managing Policy Routing
We're used to managing routes with the route command, but when it comes to policy routing, route falls short. We need the ip rule command from the iproute2 package. This is a completely different management interface.
Let's see how to work with these rules.
Adding a Rule
Suppose we want all packets with a TOS field of 0x04 to look up table 252. We can issue this command:
ip rule add tos 0x04 table 252
What happens behind the scenes when we hit Enter?
The ip utility sends a netlink message to the kernel, which handles it in the fib_nl_newrule() method in net/core/fib_rules.c. That method inserts a new rule into our rule chain: from now on, whenever a packet arrives with a TOS matching 0x04, the kernel skips the default main table and looks up table 252 instead.
Rules alone aren't enough; table 252 needs actual routes. We can add routes to table 252 like this:
ip route add default via 192.168.2.10 table 252
The tos here is just one of many selectors. We can match on source address, destination address, input interface, firewall mark (fwmark), and more. For a full parameter list, check man 8 ip-rule, or flip to Table 6-1 in the "Quick Reference" at the end of this chapter.
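To make the rule chain more tangible, here's a minimal userspace sketch; the struct rule type and the rule_add() helper are invented for illustration and are not kernel symbols. It models the one invariant fib_nl_newrule() maintains: rules live in a single list kept sorted by preference, so lower-numbered rules are always consulted first.

#include <stdio.h>

/* A drastically simplified stand-in for struct fib_rule:
 * just a preference, a TOS selector, and a table ID. */
struct rule {
    unsigned int pref;     /* lower value = consulted first */
    unsigned char tos;     /* 0 means "match any TOS" */
    unsigned int table;    /* routing table to consult on a match */
    struct rule *next;
};

/* Insert while keeping the list ordered by preference, mirroring
 * how fib_nl_newrule() links a new rule into the rule list. */
static void rule_add(struct rule **head, struct rule *r)
{
    while (*head && (*head)->pref <= r->pref)
        head = &(*head)->next;
    r->next = *head;
    *head = r;
}

int main(void)
{
    struct rule *rules = NULL;
    struct rule r1 = { .pref = 32765, .tos = 0x04, .table = 252, .next = NULL };
    struct rule r2 = { .pref = 100,   .tos = 0x10, .table = 100, .next = NULL };

    rule_add(&rules, &r1);
    rule_add(&rules, &r2);    /* lower preference: ends up ahead of r1 */

    for (struct rule *r = rules; r; r = r->next)
        printf("%u:\ttos 0x%02x lookup %u\n", r->pref, r->tos, r->table);
    return 0;
}

Compile this with any C99 compiler and the rule with preference 100 prints before the one with preference 32765, which is why ip rule list output always appears in priority order.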
Deleting a Rule
Done playing? Just delete it.
ip rule del tos 0x04 table 252
Inside the kernel, fib_nl_delrule() takes over and removes the rule we previously added from the rule chain.
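Deletion is the mirror image. In the same toy model (rule_del() is again an illustrative name, not the kernel's), it amounts to walking the list, finding the first rule whose selectors match the request, and splicing it out; that's the essence of what fib_nl_delrule() does after parsing the netlink message.

#include <stdio.h>

/* Same toy rule model as in the previous sketch. */
struct rule {
    unsigned int pref;
    unsigned char tos;
    unsigned int table;
    struct rule *next;
};

/* Unlink the first rule whose selectors match the request. */
static int rule_del(struct rule **head, unsigned char tos, unsigned int table)
{
    for (; *head; head = &(*head)->next) {
        if ((*head)->tos == tos && (*head)->table == table) {
            *head = (*head)->next;      /* splice the rule out */
            return 0;
        }
    }
    return -1;                          /* no matching rule found */
}

int main(void)
{
    struct rule r = { .pref = 32765, .tos = 0x04, .table = 252, .next = NULL };
    struct rule *rules = &r;

    if (rule_del(&rules, 0x04, 252) == 0)
        printf("rule removed; list is now %s\n", rules ? "non-empty" : "empty");
    return 0;
}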
Viewing Rules
Want to see what custom rules are currently in the system?
ip rule list
# or
ip rule show
These two commands are essentially the same thing. They both ultimately call the fib_nl_dumprule() method in the kernel, dumping out all our custom rules for us to inspect.
Now that we know how to manage rules, let's peel back the covers and see how this mechanism is actually built inside the kernel.
Kernel Implementation Details
There's one point here that can easily cause confusion: there are several files named fib_rules.c in the source tree.
- net/core/fib_rules.c: This is the true core infrastructure. It's not specific to IPv4 or IPv6; it's a generic framework.
- net/ipv4/fib_rules.c: This is the IPv4 implementation of that framework.
- net/ipv6/fib6_rules.c: This is the IPv6 implementation.
We'll focus on the IPv4 implementation.
Core Data Structures
In the IPv4 world, every rule is ultimately mapped to a fib4_rule structure, defined in net/ipv4/fib_rules.c:
struct fib4_rule {
    struct fib_rule common;    /* the generic, family-independent part */
    u8 dst_len;                /* destination prefix length */
    u8 src_len;                /* source prefix length */
    u8 tos;                    /* TOS selector; 0 matches any */
    __be32 src;                /* source address selector */
    __be32 srcmask;            /* mask derived from src_len */
    __be32 dst;                /* destination address selector */
    __be32 dstmask;            /* mask derived from dst_len */
#ifdef CONFIG_IP_ROUTE_CLASSID
    u32 tclassid;              /* realm/classid for the traffic classifier */
#endif
};
We can see that it contains not only the source address (src) and destination address (dst), but also the TOS field. These act as "filters" for matching packets. If a packet's header information matches these fields, the rule is hit.
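To see how these filters are applied, here is a small standalone sketch of the XOR-and-mask test that fib4_rule_match() performs. The function name rule_matches() is made up; the addresses are written in host byte order for readability (the kernel operates on big-endian __be32 values), and a rule TOS of 0 is treated as "match any", as in the kernel.

#include <stdint.h>
#include <stdio.h>

/* Simplified model of the selector test in fib4_rule_match(). */
static int rule_matches(uint32_t saddr, uint32_t daddr, uint8_t tos,
                        uint32_t r_src, uint32_t r_srcmask,
                        uint32_t r_dst, uint32_t r_dstmask, uint8_t r_tos)
{
    /* XOR leaves a 1 wherever packet and rule differ; the mask
     * keeps only the bits inside the prefix we care about. */
    if (((saddr ^ r_src) & r_srcmask) || ((daddr ^ r_dst) & r_dstmask))
        return 0;
    /* A rule TOS of 0 means "match any TOS". */
    if (r_tos && r_tos != tos)
        return 0;
    return 1;
}

int main(void)
{
    /* Rule: from 192.168.1.0/24, any destination, tos 0x04. */
    uint32_t r_src = 0xC0A80100, r_srcmask = 0xFFFFFF00;
    uint32_t r_dst = 0x00000000, r_dstmask = 0x00000000;

    /* Packet: 192.168.1.7 -> 8.8.8.8 with tos 0x04: matches. */
    printf("match: %d\n",
           rule_matches(0xC0A80107u, 0x08080808u, 0x04,
                        r_src, r_srcmask, r_dst, r_dstmask, 0x04));
    return 0;
}

This branch-free mask trick is cheaper than comparing prefix bits one by one, which matters in a test that runs for every rule on every lookup.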
Default Rules at Boot
Even if we haven't done anything, our system already has three ironclad rules. During kernel boot, the fib_default_rules_init() method silently creates three default policies:
- Local table (RT_TABLE_LOCAL): Specifically handles local addresses (like 127.0.0.1) and broadcast addresses.
- Main table (RT_TABLE_MAIN): This is the table we normally see with ip route. All unmarked, regular traffic goes here.
- Default table (RT_TABLE_DEFAULT): This is the last resort. If nothing matches above, this table is used (it's usually empty, or simply forwards to a gateway).
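These three rules carry preferences 0, 0x7FFE (32766), and 0x7FFF (32767) respectively, which is where the priority column of ip rule show comes from. The sketch below replays that boot-time setup in plain C; the defaults[] array and printout format are illustrative, but the preference values and table IDs (253-255) match the kernel's definitions in include/uapi/linux/rtnetlink.h.

#include <stdio.h>

/* Table IDs as defined in include/uapi/linux/rtnetlink.h. */
enum { RT_TABLE_DEFAULT = 253, RT_TABLE_MAIN = 254, RT_TABLE_LOCAL = 255 };

/* Illustrative stand-in for the rules fib_default_rules_init()
 * creates at boot, with the preferences the kernel assigns. */
struct rule { unsigned int pref; unsigned int table; };

int main(void)
{
    struct rule defaults[] = {
        { 0x0000, RT_TABLE_LOCAL },    /* priority 0     */
        { 0x7FFE, RT_TABLE_MAIN },     /* priority 32766 */
        { 0x7FFF, RT_TABLE_DEFAULT },  /* priority 32767 */
    };

    for (unsigned int i = 0; i < sizeof(defaults) / sizeof(defaults[0]); i++)
        printf("%u:\tfrom all lookup %u\n",
               defaults[i].pref, defaults[i].table);
    return 0;
}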
Lookup Flow: Fast Path and Slow Path
This is the most elegant part of the entire mechanism. When the kernel needs to find a route for a packet, it calls the fib_lookup() method.
Inside include/net/ip_fib.h, we'll find two fib_lookup() functions that look alike but are actually different, selected by the CONFIG_IP_MULTIPLE_TABLES kernel option. If that option isn't enabled (meaning policy routing is compiled out entirely), fib_lookup() reduces to plain table lookups with no rule processing at all.
But once policy routing is enabled, things get interesting. The kernel faces two scenarios:
- Fast path: The variable net->ipv4.fib_has_custom_rules is false. This means we've never touched the default rules; the system is still in its factory-default state. Since the rules haven't changed, there's no need to traverse the rule chain: the kernel looks up the tables directly, in the fixed order Local → Main → Default. This saves the overhead of walking the rules.
- Slow path: Once we execute ip rule add or ip rule del, fib4_rule_configure() or fib4_rule_delete() sets net->ipv4.fib_has_custom_rules to true. At this point, the kernel knows things are no longer simple. It calls __fib_lookup(), which leads into fib_rules_lookup().
What fib_rules_lookup() does is tedious but necessary: it takes the packet, walks through the rule linked list from head to tail, and calls fib_rule_match() on each rule. If a rule matches, it looks up the routing table specified by that rule; if it doesn't match, it moves on to the next one.
This logic may seem clunky, but it's the price of flexibility. We want freedom, so we pay with CPU cycles.
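Putting the two paths side by side, the sketch below models the dispatch in plain C. Everything here is deliberately simplified: table_lookup() is a stub, the real fib_lookup() fills in a fib_result and runs under RCU protection, and real rules support more actions than "look up a table". The shape of the logic is the point: no custom rules means a fixed table order, custom rules mean a priority-ordered walk with a match test per rule.

#include <stdbool.h>
#include <stdio.h>

struct rule {
    unsigned int pref;      /* priority: lower is consulted first */
    unsigned char tos;      /* 0 = match any TOS */
    unsigned int table;
    struct rule *next;
};

/* Stand-in for fib_table_lookup(): pretend only table 252
 * contains a route that matches our imaginary packet. */
static int table_lookup(unsigned int table)
{
    return table == 252 ? 0 : -1;
}

/* The gist of fib_rules_lookup(): walk the rules in priority
 * order; on a selector match, consult that rule's table. */
static int rules_lookup(const struct rule *head, unsigned char tos)
{
    for (const struct rule *r = head; r; r = r->next) {
        if (r->tos && r->tos != tos)
            continue;                   /* selector mismatch: next rule */
        if (table_lookup(r->table) == 0)
            return (int)r->table;       /* route found in this table */
    }
    return -1;                          /* no rule produced a route */
}

/* Model of fib_lookup() when CONFIG_IP_MULTIPLE_TABLES is enabled. */
static int lookup(bool has_custom_rules, const struct rule *rules,
                  unsigned char tos)
{
    if (!has_custom_rules) {
        /* Fast path: fixed order Local (255) -> Main (254) -> Default (253). */
        unsigned int order[] = { 255, 254, 253 };
        for (int i = 0; i < 3; i++)
            if (table_lookup(order[i]) == 0)
                return (int)order[i];
        return -1;
    }
    /* Slow path: __fib_lookup() leads into fib_rules_lookup(). */
    return rules_lookup(rules, tos);
}

int main(void)
{
    struct rule r = { .pref = 32765, .tos = 0x04, .table = 252, .next = NULL };

    printf("table used: %d\n", lookup(true, &r, 0x04));    /* prints 252 */
    return 0;
}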
That wraps up the framework of policy routing. But with policy routing in place, we can do even fancier things.
In the next section, we'll discuss a more specific "advanced trick"—Multipath Routing. That is, when a routing table tells us "there are two next hops, and both are valid," how does the kernel choose? Round-robin? Weighted? This is yet another topic that weighs performance against balanced load distribution.