Chapter 5: The IPv4 Routing Subsystem
5.8 Quick Reference Panel
IDE auto-completion is great when writing code, but when you're jumping around large stretches of kernel code or digging through grep results, you need something more direct—a cheat sheet pinned to your wall.
This section is that sheet.
There's no narrative here, no buildup—just the core data structures, API functions, and macro definitions we dissected in this chapter. They're scattered across a few key files in net/ipv4, and now we've gathered them together so you can quickly look them up during your future debugging (or fire-fighting) endeavors.
Module and File Inventory
First, let's confirm the "crime scene" locations one more time. The IPv4 routing subsystem implementation is primarily concentrated in the following modules:
- fib_frontend.c: The FIB "front desk," handling requests from user space (such as ip route add).
- fib_trie.c: The core lookup logic, where the highly efficient LC-trie lives.
- fib_semantics.c: Handles the semantics of FIB entries, such as the management of fib_info.
- route.c: The original home of the routing cache (although the caching mechanism has changed drastically in modern kernels, the core still resides here).
- fib_rules.c: The implementer of policy routing. Note that this module is only compiled in when you enable CONFIG_IP_MULTIPLE_TABLES. If you need to make routing decisions based on source address or other conditions, this is the switch you need.
Outside of the source code, these header files are worth bookmarking:
- include/net/ip_fib.h: Core FIB definitions.
- fib_lookup.h: Header file for the lookup logic.
- include/net/route.h: Interface between the routing layer and upper layers.
Don't forget the generic implementation of the destination cache (dst_entry), which lives in net/core/dst.c and include/net/dst.h.
Core API Methods
Below are the key functions mentioned in this chapter. If you're reading source code or writing your own kernel module and need to interact with the FIB, these are most likely the interfaces you'll call.
Routing Table Operations
- int fib_table_insert(struct fib_table *tb, struct fib_config *cfg);
  - What it does: Inserts a route into the specified FIB table (tb).
  - Parameters: cfg contains the routing configuration passed down from user space (destination address, gateway, netmask, etc.).
  - Note: This is the final hop into the kernel after a Netlink socket triggers ip route add.
- int fib_table_delete(struct fib_table *tb, struct fib_config *cfg);
  - What it does: Deletes a route from the specified FIB table.
  - Corresponding command: ip route del.
- struct fib_table *fib_trie_table(u32 id);
  - What it does: Allocates and initializes a TRIE-based FIB routing table. If you want to create a custom routing table (not just local or main), this is the constructor you're looking for.
FIB Entry Management
- struct fib_info *fib_create_info(struct fib_config *cfg);
  - What it does: Constructs a fib_info object based on the configuration cfg.
  - Mechanism: This is the "meat" of the routing entry. It checks whether an identical fib_info already exists (to save memory) and only creates a new one if it doesn't.
- void free_fib_info(struct fib_info *fi);
  - What it does: Frees a fib_info object.
  - Condition: It is only actually freed when the reference count drops to zero (the fib_dead flag is set), and it decrements the global counter fib_info_cnt.
- void fib_alias_accessed(struct fib_alias *fa);
  - What it does: Marks a fib_alias entry as "visited."
  - Details: It simply sets fa->fa_state to FA_S_ACCESSED. This might be used during garbage collection or statistics gathering to distinguish "hot" routes from cold ones.
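The share-if-identical behavior described for fib_create_info, together with the counted release described for free_fib_info, can be pictured with a small user-space model. This is a minimal sketch under stated assumptions, not kernel code: struct toy_fib_info, toy_create_info, toy_release_info, and toy_info_cnt are hypothetical names, the "identity" of an entry is reduced to three fields, and the kernel's separate dead-marking and freeing steps are collapsed into one function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for struct fib_info. */
struct toy_fib_info {
    struct toy_fib_info *next;
    uint32_t gw;        /* next-hop gateway (host byte order) */
    int oif;            /* output interface index */
    uint32_t priority;
    int refcnt;         /* how many routes share this object */
};

static struct toy_fib_info *toy_info_list;  /* all live objects */
static int toy_info_cnt;                    /* models a global object counter */

/* Return an existing identical object if there is one, else allocate. */
static struct toy_fib_info *toy_create_info(uint32_t gw, int oif, uint32_t priority)
{
    struct toy_fib_info *fi;

    for (fi = toy_info_list; fi; fi = fi->next) {
        if (fi->gw == gw && fi->oif == oif && fi->priority == priority) {
            fi->refcnt++;           /* share instead of duplicating */
            return fi;
        }
    }
    fi = calloc(1, sizeof(*fi));
    if (!fi)
        return NULL;
    fi->gw = gw;
    fi->oif = oif;
    fi->priority = priority;
    fi->refcnt = 1;
    fi->next = toy_info_list;
    toy_info_list = fi;
    toy_info_cnt++;
    return fi;
}

/* Drop one reference; unlink and free only when the last user is gone. */
static void toy_release_info(struct toy_fib_info *fi)
{
    struct toy_fib_info **pp = &toy_info_list;

    assert(fi && fi->refcnt > 0);
    if (--fi->refcnt > 0)
        return;
    while (*pp && *pp != fi)
        pp = &(*pp)->next;
    if (*pp)
        *pp = fi->next;
    toy_info_cnt--;
    free(fi);
}
```

In this model, two routes with the same gateway, interface, and priority receive the same pointer, and the counter only drops when the second of them is released.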
Lookup and TRIE Traversal
- struct leaf *fib_find_node(struct trie *t, u32 key);
  - What it does: Looks up a node matching key in the trie t.
  - Returns: A leaf node on success, or NULL on failure. This trie descent is the data-structure-level foundation on which the Longest Prefix Match (LPM) algorithm is built.
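The idea of descending a trie by key bits, and of keeping the longest matching prefix along the way, can be shown with a much simpler structure than the kernel uses. The following is a minimal sketch, assuming a plain one-bit-per-level binary trie rather than the level-compressed LC-trie; struct toy_node, toy_insert, and toy_lpm are hypothetical names, not kernel APIs, and addresses are in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical one-bit-per-level trie node (the LC-trie compresses levels). */
struct toy_node {
    struct toy_node *child[2];
    int has_route;   /* 1 if some prefix ends exactly at this node */
    int route_id;    /* illustrative payload for that prefix */
};

static struct toy_node *toy_node_new(void)
{
    return calloc(1, sizeof(struct toy_node));
}

/* Insert prefix/plen; returns 0 on success, -1 if allocation fails. */
static int toy_insert(struct toy_node *root, uint32_t prefix, int plen, int route_id)
{
    struct toy_node *n = root;

    assert(root && plen >= 0 && plen <= 32);
    for (int i = 0; i < plen; i++) {
        int bit = (prefix >> (31 - i)) & 1;   /* walk from the top bit down */
        if (!n->child[bit]) {
            n->child[bit] = toy_node_new();
            if (!n->child[bit])
                return -1;
        }
        n = n->child[bit];
    }
    n->has_route = 1;
    n->route_id = route_id;
    return 0;
}

/* Longest prefix match: remember the deepest route seen while descending.
 * Returns the route_id, or -1 if no prefix covers addr. */
static int toy_lpm(const struct toy_node *root, uint32_t addr)
{
    const struct toy_node *n = root;
    int best = -1;

    assert(root);
    for (int i = 0; n; i++) {
        if (n->has_route)
            best = n->route_id;
        if (i == 32)
            break;                            /* all 32 bits consumed */
        n = n->child[(addr >> (31 - i)) & 1];
    }
    return best;
}
```

With 192.168.0.0/16 and 192.168.3.0/24 both inserted, a lookup for 192.168.3.5 returns the /24 entry, while 192.168.1.5 falls back to the /16.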
Redirect and Exception Handling
- void ip_rt_send_redirect(struct sk_buff *skb);
  - What it does: Sends an ICMPv4 Redirect message.
  - Scenario: When the kernel discovers a host taking a "detour" (forwarding through itself when it's not the optimal next hop), it calls this function to kindly tell the other party, "don't go the long way, go that way."
- void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4, bool kill_route);
  - What it does: Handles received ICMPv4 Redirect messages.
  - Core logic: This is where FIB nexthop exceptions (FNHEs) are created. The exception entry allows traffic for a specific destination to bypass the gateway recorded in the FIB table and head straight to the new gateway. If kill_route is true, the old cached route is additionally marked obsolete so that it stops being used.
- void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw, u32 pmtu, unsigned long expires);
  - What it does: As the name implies, updates or creates an FNHE.
  - Parameters: Specifies the next hop nh, destination address daddr, new gateway gw, PMTU, and expiration time. This is the unified entry point for the Redirect and PMTU discovery mechanisms to modify routing behavior.
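The update-or-create pattern behind FNHEs can be sketched as a tiny per-next-hop hash table keyed by destination address. This is an illustrative user-space model, not the kernel implementation: struct toy_nh, struct toy_fnhe, the bucket count, the hash function, and the rule that a zero gw or pmtu means "leave unchanged" are all assumptions made for the example, and locking and expiry handling are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_FNHE_BUCKETS 16   /* arbitrary size chosen for the sketch */

/* Hypothetical exception entry: one per destination behind a next hop. */
struct toy_fnhe {
    struct toy_fnhe *next;
    uint32_t daddr;
    uint32_t gw;              /* 0 = no redirect learned */
    uint32_t pmtu;            /* 0 = no PMTU learned */
    unsigned long expires;
};

/* Hypothetical next hop carrying its own exception table. */
struct toy_nh {
    struct toy_fnhe *bucket[TOY_FNHE_BUCKETS];
};

static unsigned int toy_fnhe_hash(uint32_t daddr)
{
    return (daddr ^ (daddr >> 16)) % TOY_FNHE_BUCKETS;
}

static struct toy_fnhe *toy_find_fnhe(const struct toy_nh *nh, uint32_t daddr)
{
    struct toy_fnhe *e;

    assert(nh);
    for (e = nh->bucket[toy_fnhe_hash(daddr)]; e; e = e->next)
        if (e->daddr == daddr)
            return e;
    return NULL;
}

/* Update the entry for daddr, creating it first if it does not exist.
 * Returns 0 on success, -1 if allocation fails. */
static int toy_update_or_create_fnhe(struct toy_nh *nh, uint32_t daddr,
                                     uint32_t gw, uint32_t pmtu,
                                     unsigned long expires)
{
    struct toy_fnhe *e = toy_find_fnhe(nh, daddr);

    if (!e) {
        unsigned int h = toy_fnhe_hash(daddr);

        e = calloc(1, sizeof(*e));
        if (!e)
            return -1;
        e->daddr = daddr;
        e->next = nh->bucket[h];
        nh->bucket[h] = e;
    }
    if (gw)
        e->gw = gw;           /* redirect path: remember the new gateway */
    if (pmtu)
        e->pmtu = pmtu;       /* PMTU path: remember the smaller MTU */
    e->expires = expires;     /* simplification: always refresh the timer */
    return 0;
}
```

The point of the design is visible even in the toy: a redirect and a later PMTU update for the same destination land in the same entry, and the routing table itself is never touched.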
Metric Queries
- u32 dst_metric(const struct dst_entry *dst, int metric);
  - What it does: Extracts the specified metric (such as MTU, initial window, etc.) from a dst_entry.
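A metric query of this kind boils down to an array lookup. The sketch below is a hedged user-space model: it assumes the metrics are stored in a flat array with no slot for the "unspecified" symbol, so metric N sits at index N - 1. The names toy_rtax, struct toy_dst, toy_dst_metric, and toy_dst_metric_set are hypothetical, and only the first few symbols of Table 5-1 are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the first few symbols in Table 5-1. */
enum toy_rtax {
    TOY_RTAX_UNSPEC = 0,
    TOY_RTAX_LOCK,
    TOY_RTAX_MTU,
    TOY_RTAX_WINDOW,
    TOY_RTAX_RTT,
    TOY_RTAX_COUNT      /* number of symbols modelled here, not a kernel constant */
};

struct toy_dst {
    /* No slot for UNSPEC, so metric N lives at index N - 1 (assumption). */
    uint32_t metrics[TOY_RTAX_COUNT - 1];
};

/* Read one metric; out-of-range symbols read as 0 in this sketch. */
static uint32_t toy_dst_metric(const struct toy_dst *dst, int metric)
{
    assert(dst);
    if (metric <= TOY_RTAX_UNSPEC || metric >= TOY_RTAX_COUNT)
        return 0;
    return dst->metrics[metric - 1];
}

/* Store one metric; out-of-range symbols are ignored. */
static void toy_dst_metric_set(struct toy_dst *dst, int metric, uint32_t val)
{
    assert(dst);
    if (metric > TOY_RTAX_UNSPEC && metric < TOY_RTAX_COUNT)
        dst->metrics[metric - 1] = val;
}
```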
Core Macro Definitions
Kernel code is full of macros; here are a few key players we encountered while dissecting the FIB logic.
FIB Lookup Result Extraction
These macros typically take a fib_result structure as an argument and extract key fields from the lookup result.
- FIB_RES_GW(res): Returns the gateway address of the next hop (nh_gw).
- FIB_RES_DEV(res): Returns the network device of the next hop (net_device).
- FIB_RES_OIF(res): Returns the output interface index (nh_oif).
- FIB_RES_NH(res): Returns the complete fib_nh structure.
  - Details: If multipath routing is enabled, this uses res->nh_sel as an index to select the correct one from the nexthop array in fib_info.
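The way these accessors layer on top of one another can be sketched with a few macros over simplified structures. This is a toy model under stated assumptions: the TOY_ prefixed macros and the toy_ structures are hypothetical, the structures carry only the fields named above, and the next-hop array is given a fixed size for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for fib_nh, fib_info, fib_result. */
struct toy_fib_nh {
    uint32_t nh_gw;     /* gateway of this next hop */
    int nh_oif;         /* output interface index */
};

struct toy_fib_info {
    int fib_nhs;                    /* number of valid next hops */
    struct toy_fib_nh fib_nh[4];    /* fixed size only for the sketch */
};

struct toy_fib_result {
    unsigned char nh_sel;           /* which next hop the lookup selected */
    struct toy_fib_info *fi;
};

/* Select the chosen next hop, then derive the other accessors from it. */
#define TOY_FIB_RES_NH(res)   ((res).fi->fib_nh[(res).nh_sel])
#define TOY_FIB_RES_GW(res)   (TOY_FIB_RES_NH(res).nh_gw)
#define TOY_FIB_RES_OIF(res)  (TOY_FIB_RES_NH(res).nh_oif)

/* Small helper showing typical use: gateway of the selected next hop. */
static uint32_t toy_result_gateway(const struct toy_fib_result *res)
{
    assert(res && res->fi && res->nh_sel < res->fi->fib_nhs);
    return TOY_FIB_RES_GW(*res);
}
```

Changing nh_sel is all it takes to make every accessor report a different path, which is the property multipath routing relies on.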
Device Behavior Checks
Before deciding "should I forward this?" or "should I obey?", the kernel first has to check how the receiving device is configured.
- IN_DEV_FORWARD(in_dev): Checks whether IP forwarding is enabled on the device. If this is off, it's a quiet endpoint, not a router.
- IN_DEV_RX_REDIRECTS(in_dev): Checks whether the device receives ICMP Redirects. If you're building a router, you usually turn this off to prevent being disrupted by someone else's Redirects.
- IN_DEV_TX_REDIRECTS(in_dev): Checks whether the device sends ICMP Redirects.
TRIE Tree Structure Traversal
- IS_LEAF(node): Checks if this node is a leaf node (an endpoint).
- IS_TNODE(node): Checks if this node is an internal node (a Trie Node, meaning we need to keep looking deeper).
Multipath Routing Iteration
- change_nexthops(fi): A macro defined in fib_semantics.c. It provides a loop mechanism to iterate over all nexthop entries in a fib_info. When you need to inspect every possible path, this is what you'll use.
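A loop macro of this style can be modelled in a few lines. The sketch below is an assumption-laden imitation, not the kernel's definition: toy_change_nexthops and toy_endfor_nexthops are hypothetical names, the macro pair opens and closes its own block so that the loop variables nhsel and nexthop_nh stay local, and the structures are simplified stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified next hop and fib_info. */
struct toy_fib_nh {
    uint32_t nh_gw;
    int nh_oif;
    int dead;           /* illustrative flag set by the helper below */
};

struct toy_fib_info {
    int fib_nhs;                    /* number of valid next hops */
    struct toy_fib_nh fib_nh[4];    /* fixed size only for the sketch */
};

/* Opens a block, declares the cursor variables, and starts the loop. */
#define toy_change_nexthops(fi) {                                   \
    int nhsel; struct toy_fib_nh *nexthop_nh;                       \
    for (nhsel = 0, nexthop_nh = (fi)->fib_nh;                      \
         nhsel < (fi)->fib_nhs;                                     \
         nexthop_nh++, nhsel++)

/* Closes the block opened by toy_change_nexthops. */
#define toy_endfor_nexthops(fi) }

/* Mark every next hop that leaves through oif; returns how many matched. */
static int toy_mark_dead_on_dev(struct toy_fib_info *fi, int oif)
{
    int marked = 0;

    assert(fi && fi->fib_nhs <= 4);
    toy_change_nexthops(fi) {
        if (nexthop_nh->nh_oif == oif) {
            nexthop_nh->dead = 1;
            marked++;
        }
    } toy_endfor_nexthops(fi);
    return marked;
}
```

The helper reads like an ordinary for loop at the call site, which is the convenience such macros are meant to provide when every path of a multipath route has to be visited.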
Key Data Tables
Table 5-1: Routing Metrics
The kernel doesn't just manage where the road goes; it also manages road conditions. There are 15 (RTAX_MAX) metrics here, some of which are specifically for TCP.
| Linux Symbol | TCP Related (Y/N) | Meaning |
|---|---|---|
| RTAX_UNSPEC | N | Unspecified |
| RTAX_LOCK | N | Lock metric (prevent updates) |
| RTAX_MTU | N | Path Maximum Transmission Unit |
| RTAX_WINDOW | Y | TCP initial window size |
| RTAX_RTT | Y | Round-Trip Time |
| RTAX_RTTVAR | Y | Round-Trip Time Variance |
| RTAX_SSTHRESH | Y | Slow Start Threshold |
| RTAX_CWND | Y | Congestion Window |
| RTAX_ADVMSS | Y | Advertised MSS (the MSS announced to the peer) |
| RTAX_REORDERING | Y | Packet reordering threshold |
| RTAX_HOPLIMIT | N | Hop limit |
| RTAX_INITCWND | Y | Initial Congestion Window |
| RTAX_FEATURES | N | Feature flags |
| RTAX_RTO_MIN | Y | Minimum Retransmission Timeout |
| RTAX_INITRWND | Y | Initial Receive Window |
(Source: include/uapi/linux/rtnetlink.h)
Table 5-2: Route Types and Error Codes
When a routing lookup hits, different type values mean different fates. Here we list the mappings in the fib_props array: the error codes and scopes corresponding to specific route types.
| Linux Symbol | Error Code | Scope | Meaning |
|---|---|---|---|
| RTN_UNSPEC | 0 | RT_SCOPE_NOWHERE | Unspecified (usually doesn't appear in final results) |
| RTN_UNICAST | 0 | RT_SCOPE_UNIVERSE | Normal unicast route |
| RTN_LOCAL | 0 | RT_SCOPE_HOST | Local address |
| RTN_BROADCAST | 0 | RT_SCOPE_LINK | Broadcast address |
| RTN_ANYCAST | 0 | RT_SCOPE_LINK | Anycast address |
| RTN_MULTICAST | 0 | RT_SCOPE_UNIVERSE | Multicast route |
| RTN_BLACKHOLE | -EINVAL | RT_SCOPE_UNIVERSE | Blackhole (silently drop, no error) |
| RTN_UNREACHABLE | -EHOSTUNREACH | RT_SCOPE_UNIVERSE | Unreachable (drop and send ICMP Destination Unreachable) |
| RTN_PROHIBIT | -EACCES | RT_SCOPE_UNIVERSE | Prohibit (rejected by admin, send ICMP Administratively Prohibited) |
| RTN_THROW | -EAGAIN | RT_SCOPE_UNIVERSE | "Continue to the next table" (used by policy routing) |
| RTN_NAT | -EINVAL | RT_SCOPE_NOWHERE | NAT (legacy usage) |
| RTN_XRESOLVE | -EINVAL | RT_SCOPE_NOWHERE | Requires external resolution (e.g., via a daemon) |
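Table 5-2 translates almost directly into a lookup array indexed by route type. The sketch below is a user-space rendering of that idea, with the table rows as its only data source. The toy_ and TOY_ names are hypothetical and defined locally so the example does not depend on kernel headers; the numeric values given to the enumerators are assumptions chosen to follow the order of the table.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical local copies of the route types, in Table 5-2 order. */
enum toy_rtn {
    TOY_RTN_UNSPEC, TOY_RTN_UNICAST, TOY_RTN_LOCAL, TOY_RTN_BROADCAST,
    TOY_RTN_ANYCAST, TOY_RTN_MULTICAST, TOY_RTN_BLACKHOLE,
    TOY_RTN_UNREACHABLE, TOY_RTN_PROHIBIT, TOY_RTN_THROW,
    TOY_RTN_NAT, TOY_RTN_XRESOLVE,
    TOY_RTN_MAX = TOY_RTN_XRESOLVE
};

/* Hypothetical scope identifiers; only their distinctness matters here. */
enum toy_scope {
    TOY_SCOPE_UNIVERSE, TOY_SCOPE_LINK, TOY_SCOPE_HOST, TOY_SCOPE_NOWHERE
};

struct toy_fib_prop {
    int error;              /* 0, or a negative errno as in Table 5-2 */
    enum toy_scope scope;
};

/* One row per route type, copied from Table 5-2. */
static const struct toy_fib_prop toy_fib_props[TOY_RTN_MAX + 1] = {
    [TOY_RTN_UNSPEC]      = { 0,             TOY_SCOPE_NOWHERE  },
    [TOY_RTN_UNICAST]     = { 0,             TOY_SCOPE_UNIVERSE },
    [TOY_RTN_LOCAL]       = { 0,             TOY_SCOPE_HOST     },
    [TOY_RTN_BROADCAST]   = { 0,             TOY_SCOPE_LINK     },
    [TOY_RTN_ANYCAST]     = { 0,             TOY_SCOPE_LINK     },
    [TOY_RTN_MULTICAST]   = { 0,             TOY_SCOPE_UNIVERSE },
    [TOY_RTN_BLACKHOLE]   = { -EINVAL,       TOY_SCOPE_UNIVERSE },
    [TOY_RTN_UNREACHABLE] = { -EHOSTUNREACH, TOY_SCOPE_UNIVERSE },
    [TOY_RTN_PROHIBIT]    = { -EACCES,       TOY_SCOPE_UNIVERSE },
    [TOY_RTN_THROW]       = { -EAGAIN,       TOY_SCOPE_UNIVERSE },
    [TOY_RTN_NAT]         = { -EINVAL,       TOY_SCOPE_NOWHERE  },
    [TOY_RTN_XRESOLVE]    = { -EINVAL,       TOY_SCOPE_NOWHERE  },
};

/* Error that a lookup hitting a route of this type would report. */
static int toy_route_error(int type)
{
    assert(type >= 0);
    if (type > TOY_RTN_MAX)
        return -EINVAL;     /* unknown types are rejected in this sketch */
    return toy_fib_props[type].error;
}
```

A caller can then branch on the result: zero means "carry on with delivery or forwarding", anything negative means the route type itself is the verdict.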
Route Flags
When you type route -n or ip route show, that string of abbreviated letters (UG, UH) in the output isn't gibberish. They are the kernel's "annotations" for that route.
Below are the meanings of the common flags, corresponding to the example output in Table 5-3:
- U (Route is up): The route is active.
- H (Target is a host): The target is a specific host (usually with a netmask of 255.255.255.255).
- G (Use gateway): Uses a gateway (the packet isn't sent directly to the destination subnet, but goes through an intermediary).
- R (Reinstate route): Reinstate a dynamic route (restarted by a routing daemon).
- D (Dynamically installed): Dynamically installed (created by a redirect or a daemon).
- M (Modified): Modified (altered by a redirect or a daemon).
- A (Installed by addrconf): Auto-configured by addrconf (usually IPv6-related, but appears here too).
- ! (Reject route): Reject route (corresponds to types like RTN_PROHIBIT).
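Turning a flag bitmask into that letter string is a simple table walk. The sketch below assumes the classic RTF_ bit values from the Linux route header for the U, G, H, R, D, M, and ! flags, redefined locally under TOY_ names so the example compiles anywhere. The helper toy_flags_to_str is hypothetical, the letter order is a choice made for this sketch rather than a guarantee about any particular tool, and the A flag is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the flag bits (assumed to match the Linux RTF_ values). */
#define TOY_RTF_UP        0x0001u
#define TOY_RTF_GATEWAY   0x0002u
#define TOY_RTF_HOST      0x0004u
#define TOY_RTF_REINSTATE 0x0008u
#define TOY_RTF_DYNAMIC   0x0010u
#define TOY_RTF_MODIFIED  0x0020u
#define TOY_RTF_REJECT    0x0200u

static const struct {
    unsigned int bit;
    char letter;
} toy_flag_map[] = {
    { TOY_RTF_UP,        'U' },
    { TOY_RTF_GATEWAY,   'G' },
    { TOY_RTF_HOST,      'H' },
    { TOY_RTF_REINSTATE, 'R' },
    { TOY_RTF_DYNAMIC,   'D' },
    { TOY_RTF_MODIFIED,  'M' },
    { TOY_RTF_REJECT,    '!' },
};

/* Write one letter per set flag into buf (always NUL-terminated). */
static char *toy_flags_to_str(unsigned int flags, char *buf, size_t len)
{
    size_t pos = 0;

    assert(buf && len > 0);
    for (size_t i = 0; i < sizeof(toy_flag_map) / sizeof(toy_flag_map[0]); i++) {
        if ((flags & toy_flag_map[i].bit) && pos + 1 < len)
            buf[pos++] = toy_flag_map[i].letter;
    }
    buf[pos] = '\0';
    return buf;
}
```

Under these assumptions, a flag value of 0x0001 renders as U and 0x0003 renders as UG, matching the two rows of Table 5-3.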
Table 5-3: Routing Table Example
| Destination | Gateway | Genmask | Flags | Metric | Ref | Use | Iface |
|---|---|---|---|---|---|---|---|
| 169.254.0.0 | 0.0.0.0 | 255.255.0.0 | U | 1002 | 0 | 0 | eth0 |
| 192.168.3.0 | 192.168.2.1 | 255.255.255.0 | UG | 0 | 0 | 0 | eth1 |
Breakdown:
- First row: Traffic destined for the link-local address 169.254.0.0/16 goes directly out eth0 (no gateway, so no G).
- Second row: Traffic destined for 192.168.3.0/24 must first be sent to the gateway 192.168.2.1, and then goes out via eth1.
Chapter Echoes
In this chapter, we peeled back the kernel routing subsystem layer by layer, like an onion.
The outermost layer is the ip route command—that's what you see; deeper in is the FIB table, where the kernel stores the rules; at the very core are the LC-trie tree and the fib_lookup algorithm—the engine processing millions of flows every second.
If you take away only one thing from this chapter, let it be this: A routing decision is not a simple match, but a complete closed loop from lookup, to caching, to dynamic correction (Redirect/FNHE). The metrics in Table 5-1, the error codes in Table 5-2, and those structures starting with fib_ all exist to make this closed loop run fast while remaining flexible enough to adjust its posture when the network topology changes.
In the next chapter, we'll shift from "where to go" to "how to send." We'll leave the calm deliberations of the routing layer and enter the enthusiastic handshakes of the Neighbor Subsystem—seeing how ARP and ND turn abstract IP addresses into real Ethernet frames and actually push data onto the wire.
Exercises
Exercise 1: Understanding
Question: In the Linux kernel's routing lookup process, the fib_lookup() function is the core entry point. Briefly describe the basic flow of how the fib_lookup() function performs a route lookup using the input parameter flowi4 and the output parameter fib_result, and explain how the kernel constructs the dst_entry (destination cache) based on fib_result when the lookup succeeds.
Answer and Analysis
Answer: fib_lookup() uses flowi4 (which contains the destination address, source address, TOS, etc.) as a key. It first looks up in the Local table; if that fails, it performs a Longest Prefix Match (LPM) in the Main table. On a successful lookup, it populates the fib_result structure (which includes the prefix length, fib_info pointer, route type, etc.). Subsequently, the kernel creates a dst_entry object (embedded in a rtable) based on the information in fib_result (such as whether type is RTN_LOCAL or RTN_UNICAST), and sets its input or output callback functions (like ip_local_deliver or ip_forward).
Analysis: This question tests your understanding of the core data flow in the IPv4 routing subsystem. flowi4 defines the "key" for the lookup, while fib_result stores the "result." fib_info is the specific parameter carrier for the routing entry. After a successful lookup, the kernel must translate this static FIB information into a dynamically usable routing cache object (dst_entry). The most important part of this is setting the callback functions that handle the packet, which determines whether the packet is received locally, forwarded, or dropped.
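The last step of the answer, binding a handler to the destination entry according to the route type, can be sketched with a function pointer. This is a toy user-space model: struct toy_dst, struct toy_pkt, toy_bind_input, and the three handlers are hypothetical stand-ins for the dst_entry, the packet buffer, and callbacks such as ip_local_deliver and ip_forward, and the handlers only return a marker value so the dispatch can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of route types needed for the sketch. */
enum toy_rtn { TOY_RTN_UNICAST = 1, TOY_RTN_LOCAL = 2, TOY_RTN_PROHIBIT = 8 };

struct toy_pkt { int id; };

/* Hypothetical destination entry: just the input callback. */
struct toy_dst {
    int (*input)(struct toy_pkt *pkt);
};

/* Stand-ins for the real handlers; each returns a distinct marker. */
static int toy_local_deliver(struct toy_pkt *pkt) { (void)pkt; return 1; }
static int toy_forward(struct toy_pkt *pkt)       { (void)pkt; return 2; }
static int toy_error(struct toy_pkt *pkt)         { (void)pkt; return -1; }

/* Decide once, at route-construction time, how packets will be handled. */
static void toy_bind_input(struct toy_dst *dst, int route_type)
{
    assert(dst);
    switch (route_type) {
    case TOY_RTN_LOCAL:
        dst->input = toy_local_deliver;   /* packet is for this host */
        break;
    case TOY_RTN_UNICAST:
        dst->input = toy_forward;         /* packet is passed on */
        break;
    default:
        dst->input = toy_error;           /* e.g. prohibit/unreachable */
        break;
    }
}

/* Per-packet path: no lookup logic left, only an indirect call. */
static int toy_receive(struct toy_dst *dst, struct toy_pkt *pkt)
{
    assert(dst && dst->input);
    return dst->input(pkt);
}
```

The design choice worth noticing is that the branching happens once, when the entry is built; every later packet using the same entry pays only for the indirect call.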
Exercise 2: Application
Question: Suppose you are a network administrator and need to configure a Linux server to block traffic from the 192.168.1.0/24 subnet from accessing 10.0.0.5. Write the ip route command to implement this policy, and explain, based on the principles of fib_props and RTN_PROHIBIT, what error code the kernel returns and what ICMP message it sends when it receives a packet matching this rule.
Answer and Analysis
Answer: Command: ip route add prohibit 10.0.0.5 from 192.168.1.0/24.
Principle: The fib_props array in the kernel defines the behavior for different route types. The RTN_PROHIBIT type corresponds to the error code -EACCES. When a packet matches this rule, the routing lookup returns this error, and the kernel immediately invokes ip_error() to handle it, dropping the packet and replying to the sender with an ICMP "Destination Unreachable" message with the code "Packet Filtered" (ICMP_PKT_FILTERED).
Analysis: This question tests how to apply the concept of Policy Routing for traffic filtering. Understanding that RTN_PROHIBIT is a type of fib_type is key—it doesn't just silently drop packets, but has specific interactive behavior (sending ICMP). By looking up the error field corresponding to RTN_PROHIBIT in the fib_props array, you can determine the specific kernel behavior.
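The error-to-ICMP translation described in this answer can be sketched as a small switch. This is a simplified user-space model, not the kernel's ip_error(): toy_icmp_code_for_error is a hypothetical helper, the TOY_ICMP_ constants are local copies of the standard Destination Unreachable codes (0 for network unreachable, 1 for host unreachable, 13 for communication administratively filtered), rate limiting is ignored, and a return value of -1 is this sketch's way of saying "drop without replying."

```c
#include <assert.h>
#include <errno.h>

/* Local copies of ICMP Destination Unreachable codes. */
#define TOY_ICMP_NET_UNREACH   0
#define TOY_ICMP_HOST_UNREACH  1
#define TOY_ICMP_PKT_FILTERED 13

/* Map a routing error (positive errno) to an ICMP code.
 * Returns -1 when the packet should be dropped silently. */
static int toy_icmp_code_for_error(int err)
{
    assert(err >= 0);
    switch (err) {
    case EHOSTUNREACH:
        return TOY_ICMP_HOST_UNREACH;
    case ENETUNREACH:
        return TOY_ICMP_NET_UNREACH;
    case EACCES:
        return TOY_ICMP_PKT_FILTERED;   /* the RTN_PROHIBIT case */
    default:
        return -1;                      /* e.g. EINVAL: no reply */
    }
}
```

Following the chain for this exercise: the prohibit route yields -EACCES from the fib_props mapping, and the handler turns EACCES into the "packet filtered" code that the sender sees.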
Exercise 3: Thinking
Question: In early kernel versions (before 3.6), Linux used a routing cache to accelerate lookups. The cache was removed in version 3.6, and lookups have relied directly on the FIB TRIE since then. Analyze the main reasons the kernel development team made this change (relying on the FIB TRIE instead of the routing cache) from both "performance" and "security/stability" perspectives.
Answer and Analysis
Answer:
1. Performance: As routing table sizes grew (internet core routing tables have a massive number of entries), the overhead of maintaining a huge hash cache and its consistency (such as updating invalidated routes) became very large. The LC-trie (a tree structure based on Longest Prefix Match) is inherently very fast at lookups (O(key length)) and highly memory-efficient, making an additional caching layer unnecessary.
2. Security and Stability: The routing cache was vulnerable to "Shadow Master" type DoS attacks. An attacker could send a massive volume of packets with random destination IPs, forcing the kernel to constantly perform cache-miss lookups and fill the cache, exhausting system memory and CPU. By removing the cache and looking up directly in the FIB TRIE, this attack surface based on cache overflow was eliminated.
Analysis: This question tests deep thinking about the evolution of the kernel networking subsystem. This is a classic case of transitioning from "trading space for time" to "algorithmic optimization." Understanding this requires recognizing that while caching usually accelerates access, in highly dynamic or specific attack scenarios, the management costs of the cache (lock contention, entry refreshes) and its fragility (susceptibility to attack) can outweigh its benefits. The FIB TRIE provides sufficiently high lookup efficiency to allow the removal of complex caching logic.
Key Takeaways
The Linux kernel uses the FIB (Forwarding Information Base) and routing lookup mechanisms to determine where packets should go. The core function fib_lookup() queries the routing table based on parameters like the destination address, ultimately generating a dst_entry (routing cache entry) that contains the input and output function pointers. This transforms the complex table lookup process into a call to a specific callback function (such as ip_local_deliver or ip_forward), thereby implementing the logical branching between local reception and forwarding.
The fib_info structure is the complete "ID card" of a routing entry in the kernel, encapsulating all metadata except the destination. It not only records the route's origin (such as static configuration or kernel-generated), scope, and priority, but also manages TCP performance parameters like MTU and RTT through the fib_metrics array. This design separates the routing decision attributes from the physical path, allowing the kernel to efficiently manage the lifecycle of routing entries via reference counting and supporting multipath routing configurations.
Routing doesn't always mean permitting traffic. The kernel uses the fib_props mapping table to translate different route types (like RTN_PROHIBIT) into specific operational behaviors. When a routing lookup hits a "prohibit" type, the kernel doesn't silently drop the packet; instead, it triggers an ICMP "Destination Unreachable" or "Filtered" message based on the configuration. This mechanism elevates the routing table itself into an efficient traffic control policy system.
To cope with the dynamic nature of network environments, the kernel introduces an Exception mechanism at the fib_nh (next hop) level. For specific destination addresses, the kernel uses a hash table to record gateway changes brought by ICMP redirects or MTU adjustments brought by PMTU discovery. This "sticky note" style of fine-tuning avoids frequently modifying the massive global routing table, achieving precise correction and isolation for individual paths.
When multiple routes point to the same destination and physical path but have different attributes (such as TOS or priority), the kernel employs a fib_alias mechanism to optimize memory. It allows multiple lightweight fib_alias structures to share a single fib_info that stores the actual path information. This design avoids duplicating large amounts of routing data for minor differences, significantly improving memory utilization in large-scale routing tables (such as BGP scenarios).