5.5 Policy Routing: Choices Beyond the Map
In the previous section, we discussed how fib_nh acts like a dutiful guide, holding a sticky note with the outgoing interface and gateway address, directing packets on how to leave the kernel. We built a very intuitive mental model: the destination determines the route. As long as you know where you want to go (the destination IP), the routing table can tell you how to get there.
This mental model works 99% of the time. But as an engineer digging into low-level mechanisms, you might encounter that remaining 1%.
Imagine this scenario: you have a machine with two network cables plugged in.
- One is eth0, connected to the internal network, used for management, and free of charge.
- The other is eth1, connected to the external network, billed by traffic, and very expensive.
Now you're running a program that needs to download data from 10.0.0.1. Following the usual routine, the kernel checks the routing table, finds that 10.0.0.1 is reachable via eth0, and sends all the traffic out that way.
But you don't want that. You have your own logic: even though the destination address is the same, if this traffic is generated by a system backup, I want it to take the expensive eth1 (because it's fast); if it's just regular browsing, then I'll use eth0.
If we only look at the destination, traditional routing tables are helpless. They only see the target and don't ask about the origin.
This is exactly the problem that Policy Routing solves. It makes routing decisions no longer solely based on "where am I going," but also on "who am I," "what am I doing," or even "what protocol am I using." Before introducing this mechanism, let's first look at how things work in its absence—a world without Policy Routing.
Life Without Policy Routing: Two Tables
When the kernel configuration option CONFIG_IP_MULTIPLE_TABLES is not enabled, the kernel's routing world is very simple: there are only two tables.
- Local Table (RT_TABLE_LOCAL, ID 255): This is the kernel's "private territory." It contains only routes for local IP addresses (such as 127.0.0.1 or the IP you assigned to eth0). The kernel maintains this table itself: entries appear and disappear automatically as addresses are assigned to or removed from interfaces, and administrators normally have no reason to edit it by hand with ip route. This table determines "which addresses belong to me."
- Main Table (RT_TABLE_MAIN, ID 254): This is our "world map." The vast majority of routes you configure via the ip route add command reside in this table. It determines "if an address isn't mine, where should I throw it."
This initialization process happens in the fib4_rules_init() method of net/ipv4/fib_frontend.c.
A Historical Footnote: In kernels prior to 2.6.25, these two tables were still global variables: ip_fib_local_table and ip_fib_main_table. Back then, the code was full of logic directly accessing these two variables. Later, kernel developers realized this was too inflexible: if you wanted to add a table, you had to modify the code and recompile. So they refactored it, consolidating all table operations into the fib_get_table() method. Regardless of whether you have Policy Routing enabled, or how many tables you have, everyone uses fib_get_table(net, table_id) to get the table pointer.
This "unified access" approach is like turning "dedicated drawers" into a "numbered locker system"—no matter how many lockers there are, the action of using a key to unlock them is exactly the same.
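The locker analogy can be made concrete with a toy model. The sketch below is not kernel code: the Net class and its tables dictionary are hypothetical stand-ins, and only the table IDs and the fib_get_table(net, table_id) call shape come from the text.

```python
# Toy model of "unified access" to routing tables (not kernel code).
RT_TABLE_DEFAULT = 253
RT_TABLE_MAIN = 254
RT_TABLE_LOCAL = 255


class Net:
    """Hypothetical stand-in for a network namespace: it just owns tables."""

    def __init__(self):
        # Without policy routing only these two tables exist.
        self.tables = {RT_TABLE_LOCAL: [], RT_TABLE_MAIN: []}


def fib_get_table(net, table_id):
    """Return the table with this ID, or None if it does not exist."""
    return net.tables.get(table_id)


net = Net()
fib_get_table(net, RT_TABLE_MAIN).append("192.168.1.0/24 dev eth0")

print(fib_get_table(net, RT_TABLE_MAIN))     # the route we just stored
print(fib_get_table(net, RT_TABLE_DEFAULT))  # None: table was never created
```

Callers never name a table variable directly; they only pass an ID. Supporting more tables then means adding entries to the dictionary, not changing any caller.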
When Policy Routing is Enabled: 255 Maps
When you enable CONFIG_IP_MULTIPLE_TABLES, the world changes.
The kernel is no longer limited to the Local and Main tables; it supports up to 255 routing tables. At boot time, three tables are initialized by default:
- Local (255)
- Main (254)
- Default (253)
(Note: Regarding the specific use of the Default table and its detailed interaction with the Policy Routing rule set fib_rules, we will dive deep in Chapter 6. For now, let's focus on the management mechanism of the "tables" themselves.)
The question now is: with so many tables, how do we put things into them?
The Administrator's Control Interface: Netlink and IOCTL
As a kernel engineer, you're certainly familiar with the ip route command. But do you know what it looks like in the kernel's eyes? It's a Netlink message.
1. Adding and Deleting Routes: ip route add/del
When you type:
ip route add 192.168.1.0/24 dev eth0
Your userspace tool (iproute2) actually sends an RTM_NEWROUTE message to the kernel via a Netlink socket.
The kernel side catches this with the inet_rtm_newroute() method (located in net/ipv4/fib_frontend.c).
- It parses the parameters you brought (destination subnet, outgoing interface, priority, etc.).
- It creates the corresponding fib_info and fib_alias.
- It hangs them in the hash or TRIE structure of the corresponding FIB table (the Main table by default).
When you type ip route del ..., the flow is similar, except the message type becomes RTM_DELROUTE, and the kernel hands it over to inet_rtm_delroute(), which is responsible for removing the corresponding entry from the FIB table.
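To see what such a message looks like on the wire, here is a minimal sketch that builds (but does not send) the bytes of an RTM_NEWROUTE request for 192.168.1.0/24. The numeric constants are taken from the Linux UAPI headers. The interface index 2, the sequence number, and the exact flag combination are assumptions made for illustration; a real tool resolves the index from the interface name and may choose different flags.

```python
import socket
import struct

# Constants copied from the Linux UAPI headers (linux/netlink.h, linux/rtnetlink.h).
RTM_NEWROUTE, RTM_DELROUTE = 24, 25
NLM_F_REQUEST, NLM_F_ACK = 0x001, 0x004
NLM_F_EXCL, NLM_F_CREATE = 0x200, 0x400
RT_TABLE_MAIN = 254
RTPROT_BOOT = 3
RT_SCOPE_LINK = 253
RTN_UNICAST = 1
RTA_DST, RTA_OIF = 1, 4


def rtattr(attr_type, payload):
    """Pack one route attribute: 4-byte header, payload, padding to 4 bytes."""
    length = 4 + len(payload)
    padding = b"\0" * (-length % 4)
    return struct.pack("=HH", length, attr_type) + payload + padding


def build_route_msg(msg_type, flags, dst, prefix_len, ifindex, seq=1):
    """Build nlmsghdr + rtmsg + attributes for an IPv4 route on one interface."""
    rtmsg = struct.pack(
        "=BBBBBBBBI",
        socket.AF_INET,   # rtm_family
        prefix_len,       # rtm_dst_len
        0,                # rtm_src_len
        0,                # rtm_tos
        RT_TABLE_MAIN,    # rtm_table
        RTPROT_BOOT,      # rtm_protocol
        RT_SCOPE_LINK,    # rtm_scope: directly attached, no gateway
        RTN_UNICAST,      # rtm_type
        0,                # rtm_flags
    )
    attrs = rtattr(RTA_DST, socket.inet_aton(dst))
    attrs += rtattr(RTA_OIF, struct.pack("=I", ifindex))
    body = rtmsg + attrs
    # nlmsghdr: total length, type, flags, sequence number, sender port id
    header = struct.pack("=IHHII", 16 + len(body), msg_type, flags, seq, 0)
    return header + body


# ifindex=2 is a made-up value standing in for eth0.
add_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL
msg = build_route_msg(RTM_NEWROUTE, add_flags, "192.168.1.0", 24, ifindex=2)
print(len(msg), msg.hex())
```

Passing RTM_DELROUTE as msg_type (with just NLM_F_REQUEST | NLM_F_ACK) produces the matching delete request: the body layout is the same and only the message type and flags differ.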
Here is a counter-intuitive detail worth pausing to think about: A route doesn't always mean "allow passage."
You can configure it like this:
ip route add prohibit 192.168.1.17 from 192.168.2.103
This command adds a "prohibition order" to the routing table. When a route lookup matches this entry, the kernel does not forward the packet. It drops it and replies with an ICMP Destination Unreachable message carrying the "Packet Filtered" code (ICMP_PKT_FILTERED). This is extremely useful in firewall or policy control scenarios, because the routing table itself acts as a rule set.
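The idea that a route's type, not just its existence, decides the packet's fate can be sketched with a toy lookup. This is a simplified model, not the kernel's algorithm: the routes list and lookup() function are hypothetical, the from qualifier of the command above is ignored, and only the numeric constants come from the Linux headers.

```python
import ipaddress

# Route types, numbered as in linux/rtnetlink.h.
RTN_UNICAST = 1
RTN_PROHIBIT = 8
ICMP_DEST_UNREACH = 3   # ICMP type
ICMP_PKT_FILTERED = 13  # ICMP code: "packet filtered"

# A toy table: (prefix, route type, outgoing device or None).
routes = [
    (ipaddress.ip_network("192.168.1.0/24"), RTN_UNICAST, "eth0"),
    (ipaddress.ip_network("192.168.1.17/32"), RTN_PROHIBIT, None),
]


def lookup(dst):
    """Longest-prefix match, then act on the type of the matched route."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in routes if addr in r[0]]
    if not matches:
        return ("drop", "no route")
    prefix, rtype, dev = max(matches, key=lambda r: r[0].prefixlen)
    if rtype == RTN_PROHIBIT:
        return ("drop", f"ICMP type {ICMP_DEST_UNREACH} code {ICMP_PKT_FILTERED}")
    return ("forward", dev)


print(lookup("192.168.1.17"))  # matches the /32 prohibit entry
print(lookup("192.168.1.20"))  # matches only the /24 unicast entry
```

Because the /32 prohibit entry is more specific than the /24 unicast entry, it wins the match for that one host while its neighbors are still forwarded normally.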
Viewing Routes: ip route show
This corresponds to the RTM_GETROUTE message, handled by inet_dump_fib().
- By default, ip route show only looks at the Main table.
- If you want to see the Local table, you must explicitly specify: ip route show table local.
2. The Old-School Approach: route add/del
Although the ip command is the current standard, the route command still exists. Its kernel interface is a completely different path—IOCTL.
- route add sends the SIOCADDRT IOCTL.
- route del sends the SIOCDELRT IOCTL.
Both IOCTLs are handled by the ip_rt_ioctl() method (also in net/ipv4/fib_frontend.c).
This is an interface left over for compatibility with ancient network tools. Although it is functionally similar to Netlink, the IOCTL path is more rigid: the request arrives as a fixed-layout struct rtentry rather than as an extensible list of Netlink attributes.
3. Dynamic Routing Protocols: BGP, OSPF, etc.
Besides administrators typing commands by hand, the other main source of routing table data is routing daemons. These are heavy-duty software programs (like Quagga, Bird, Zebra) running on backbone routers. They implement complex protocols like BGP and OSPF.
These processes run in the background, chatting with neighbor routers via protocols. As soon as they detect a change in the network topology (like a fiber optic cable getting cut), they immediately call the Netlink API, flooding the kernel with RTM_NEWROUTE or RTM_DELROUTE messages to instantly update the FIB tables.
The kernel doesn't care whether these routes were typed in manually by an administrator or calculated by the OSPF protocol. They all ultimately end up as fib_info structures, hanging in the exact same tables.
Exceptions and Fine-Tuning: Returning to FIB Exceptions
Although this section is about "table"-level management, we must not forget the FIB nexthop exceptions from the previous section. Two kinds of events create them:
- The next hop changes because of an ICMP Redirect.
- The MTU changes because of Path MTU Discovery.
These changes do not touch the massive, shared FIB routing table. They only modify the small hash table (the exception table) attached to the specific fib_nh.
This is an excellent isolation design: don't let special cases pollute global rules.
If an MTU learned through Path MTU Discovery were written directly into the global Main table, all traffic heading to that subnet might incorrectly apply an unverified MTU value, which would be a disaster. Through the exception mechanism, the kernel only makes fine-tuned adjustments on "the specific flow that actually needs it."
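A toy model may help show why this isolation works. The NextHop class and its method names below are hypothetical. Keying the exception table by destination address is an assumption of this sketch, and details such as expiry are omitted.

```python
class NextHop:
    """Toy stand-in for fib_nh: shared data plus a per-destination exception table."""

    def __init__(self, dev, gateway, mtu):
        self.dev = dev
        self.gateway = gateway
        self.mtu = mtu
        self.exceptions = {}  # destination address -> dict of overrides

    def record_pmtu(self, daddr, pmtu):
        """Path MTU Discovery learned a smaller MTU toward one destination."""
        self.exceptions.setdefault(daddr, {})["mtu"] = pmtu

    def record_redirect(self, daddr, new_gateway):
        """An ICMP Redirect suggested a better gateway for one destination."""
        self.exceptions.setdefault(daddr, {})["gateway"] = new_gateway

    def resolve(self, daddr):
        """Shared values, overridden by any exception for this destination."""
        override = self.exceptions.get(daddr, {})
        return {
            "dev": self.dev,
            "gateway": override.get("gateway", self.gateway),
            "mtu": override.get("mtu", self.mtu),
        }


nh = NextHop("eth0", "192.168.1.1", 1500)
nh.record_pmtu("10.0.0.7", 1400)

print(nh.resolve("10.0.0.7"))  # uses the learned MTU for this destination
print(nh.resolve("10.0.0.8"))  # still sees the shared values
```

The shared values on the next hop are never rewritten. Only the destination that triggered the exception sees the override, and every other destination behind the same route is unaffected.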
Summary
In this section, we pulled our perspective back from the microscopic fib_nh to the macroscopic FIB Tables architecture.
We learned:
- Dual-Table Mode: Without Policy Routing, the kernel only recognizes the Local and Main tables.
- Unified Access: Through fib_get_table(), the kernel abstracted table operations, laying the foundation for multi-table support.
- User Interface: The Netlink messages (like RTM_NEWROUTE) behind the ip route command are how administrators and routing daemons manipulate the FIB.
- Routes as Policy: Route entries are not just for navigation; they can also be prohibitions like prohibit.
The FIB architecture is now fairly clear: we have tables, entries, next hops, and a fine-tuning mechanism for next hops. But as we mentioned in the opening "dual-NIC" scenario, having tables alone isn't enough—we also need a set of rules to decide "which table to check and when."
That is the main character of the next chapter—FIB Rules. That is where the true soul of Policy Routing lies.