
Chapter 5: The IPv4 Routing Subsystem

5.8 Quick Reference Panel

IDE auto-completion is great when writing code, but when you're jumping around large stretches of kernel code or digging through grep results, you need something more direct—a cheat sheet pinned to your wall.

This section is that sheet.

There's no narrative here, no buildup—just the core data structures, API functions, and macro definitions we dissected in this chapter. They're scattered across a few key files in net/ipv4, and now we've gathered them together so you can quickly look them up during your future debugging (or fire-fighting) endeavors.

Module and File Inventory

First, let's confirm the "crime scene" locations one more time. The IPv4 routing subsystem implementation is primarily concentrated in the following modules:

  • fib_frontend.c: The FIB "front desk," handling requests from user space (such as ip route add).
  • fib_trie.c: The core lookup logic, where the highly efficient LC-trie lives.
  • fib_semantics.c: Handles the semantics of FIB entries, such as the management of fib_info.
  • route.c: The original home of the routing cache (although the caching mechanism has changed drastically in modern kernels, the core still resides here).
  • fib_rules.c: The implementer of policy routing. Note that this module is only compiled in when you enable CONFIG_IP_MULTIPLE_TABLES—if you need to make routing decisions based on source address or other conditions, this is the switch you need.

Outside of the source code, these header files are worth bookmarking:

  • include/net/ip_fib.h: Core FIB definitions.
  • net/ipv4/fib_lookup.h: Private header for the lookup logic (it sits next to the .c files rather than under include/).
  • include/net/route.h: Interface between the routing layer and upper layers.

Don't forget the generic implementation of the destination cache (dst_entry), which lives in net/core/dst.c and include/net/dst.h.


Core API Methods

Below are the key functions mentioned in this chapter. If you're reading source code or writing your own kernel module and need to interact with the FIB, these are most likely the interfaces you'll call.

Routing Table Operations

  • int fib_table_insert(struct fib_table *tb, struct fib_config *cfg);

    • What it does: Inserts a route into the specified FIB table (tb).
    • Parameters: cfg contains the routing configuration passed down from user space (destination address, gateway, netmask, etc.).
    • Note: This is where an ip route add request ends up after it has travelled through a Netlink socket into the kernel.
  • int fib_table_delete(struct fib_table *tb, struct fib_config *cfg);

    • What it does: Deletes a route from the specified FIB table.
    • Corresponding command: ip route del.
  • struct fib_table *fib_trie_table(u32 id);

    • What it does: Allocates and initializes a TRIE-based FIB routing table. If you want to create a custom routing table (not just local or main), this is the constructor you're looking for.

FIB Entry Management

  • struct fib_info *fib_create_info(struct fib_config *cfg);

    • What it does: Constructs a fib_info object based on the configuration cfg.
    • Mechanism: This is the "meat" of the routing entry. It checks whether an identical fib_info already exists (to save memory) and only creates a new one if it doesn't.
  • void free_fib_info(struct fib_info *fi);

    • What it does: Frees a fib_info object.
    • Condition: It runs once the last reference to the object has been dropped, and it expects the fib_dead flag to be set already. It also decrements the global counter fib_info_cnt.
  • void fib_alias_accessed(struct fib_alias *fa);

    • What it does: Marks a fib_alias entry as "visited."
    • Details: It simply sets fa->fa_state to FA_S_ACCESSED. This might be used during garbage collection or statistics gathering to distinguish "hot" routes from cold ones.
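
The share-then-free lifecycle described for fib_create_info() and free_fib_info() can be sketched in user space. This is a minimal model, not kernel code: struct route_info, info_get(), info_put(), and info_cnt are hypothetical names invented for the illustration, and a plain linked list stands in for whatever lookup structure the kernel actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for fib_info, reduced to two fields. */
struct route_info {
    uint32_t gw;              /* next-hop gateway */
    int oif;                  /* output interface index */
    int refcnt;               /* number of routes sharing this object */
    struct route_info *next;
};

static struct route_info *info_list;  /* all live objects */
static int info_cnt;                  /* plays the role of fib_info_cnt */

/* Return an existing identical object if there is one, else allocate. */
struct route_info *info_get(uint32_t gw, int oif)
{
    struct route_info *ri;

    for (ri = info_list; ri; ri = ri->next) {
        if (ri->gw == gw && ri->oif == oif) {
            ri->refcnt++;             /* share instead of duplicating */
            return ri;
        }
    }
    ri = calloc(1, sizeof(*ri));
    if (!ri)
        return NULL;
    ri->gw = gw;
    ri->oif = oif;
    ri->refcnt = 1;
    ri->next = info_list;
    info_list = ri;
    info_cnt++;
    return ri;
}

/* Drop one reference; unlink and free only when the last one goes away. */
void info_put(struct route_info *ri)
{
    struct route_info **pp;

    assert(ri && ri->refcnt > 0);
    if (--ri->refcnt > 0)
        return;
    for (pp = &info_list; *pp; pp = &(*pp)->next) {
        if (*pp == ri) {
            *pp = ri->next;
            break;
        }
    }
    info_cnt--;
    free(ri);
}
```

Two calls to info_get() with the same gateway and interface return the same pointer, so the counter only grows when a genuinely new combination appears.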

Lookup and TRIE Traversal

  • struct leaf *fib_find_node(struct trie *t, u32 key);
    • What it does: Looks up a node matching key in the trie t.
    • Returns: Returns a leaf node on success, or NULL on failure. This is the direct manifestation of the Longest Prefix Match (LPM) algorithm at the data structure level.
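
To make the Longest Prefix Match rule concrete, here is a small user-space sketch. It deliberately uses a linear scan over an array rather than an LC-trie, so it shows only the selection rule (the most specific matching prefix wins), not the kernel's data structure. struct prefix_route, plen_to_mask(), and lpm_lookup() are hypothetical names, and addresses are assumed to be in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry: prefix in host byte order plus its length. */
struct prefix_route {
    uint32_t prefix;
    int plen;          /* prefix length, 0..32 */
    int id;            /* caller-defined tag for the route */
};

/* Build a netmask from a prefix length; plen 0 is handled separately
 * because shifting a 32-bit value by 32 is undefined in C. */
static uint32_t plen_to_mask(int plen)
{
    assert(plen >= 0 && plen <= 32);
    return plen == 0 ? 0 : ~(uint32_t)0 << (32 - plen);
}

/* Return the matching entry with the longest prefix, or NULL if none match. */
const struct prefix_route *lpm_lookup(const struct prefix_route *tbl,
                                      size_t n, uint32_t daddr)
{
    const struct prefix_route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t mask = plen_to_mask(tbl[i].plen);

        if ((daddr & mask) != (tbl[i].prefix & mask))
            continue;                       /* prefix does not cover daddr */
        if (!best || tbl[i].plen > best->plen)
            best = &tbl[i];                 /* more specific match wins */
    }
    return best;
}
```

With a default route, a /16, and a /24 in the table, an address inside the /24 selects the /24 even though all three entries match it.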

Redirect and Exception Handling

  • void ip_rt_send_redirect(struct sk_buff *skb);

    • What it does: Sends an ICMPv4 Redirect message.
    • Scenario: When the kernel discovers a host taking a "detour" (forwarding through itself when it's not the optimal next hop), it calls this function to kindly tell the other party, "don't go the long way, go that way."
  • void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4, bool kill_route);

    • What it does: Handles received ICMPv4 Redirect messages.
    • Core logic: This is where FIB nexthop exceptions (FNHEs) are created. If kill_route is true, it completely wipes out the old route; otherwise, it creates an exception entry that allows specific traffic to bypass the FIB table and head straight to the new gateway.
  • void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw, u32 pmtu, unsigned long expires);

    • What it does: As the name implies, updates or creates an FNHE.
    • Parameters: Specifies the next hop nh, destination address daddr, new gateway gw, PMTU, and expiration time. This is the unified entry point for the Redirect and PMTU discovery mechanisms to modify routing behavior.
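
The update-or-create pattern behind nexthop exceptions can be sketched as a small per-nexthop hash table keyed by destination address. This is a simplified user-space model under stated assumptions: struct nh_exception, struct nexthop, exc_find(), and exc_update_or_create() are hypothetical names, the bucket count and hash function are arbitrary, and a zero gw or pmtu is taken to mean "leave the stored value alone".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EXC_HASH_SIZE 16   /* arbitrary size chosen for the sketch */

/* Hypothetical exception entry, loosely modeled on the FNHE idea. */
struct nh_exception {
    uint32_t daddr;            /* destination this exception applies to */
    uint32_t gw;               /* redirected gateway, 0 if none recorded */
    uint32_t pmtu;             /* learned path MTU, 0 if none recorded */
    unsigned long expires;
    struct nh_exception *next;
};

struct nexthop {
    struct nh_exception *buckets[EXC_HASH_SIZE];
};

static unsigned int exc_hash(uint32_t daddr)
{
    return (daddr ^ (daddr >> 16)) % EXC_HASH_SIZE;
}

struct nh_exception *exc_find(struct nexthop *nh, uint32_t daddr)
{
    struct nh_exception *e;

    assert(nh);
    for (e = nh->buckets[exc_hash(daddr)]; e; e = e->next)
        if (e->daddr == daddr)
            return e;
    return NULL;
}

/* Update the entry for daddr, creating it first if needed.
 * Returns 0 on success, -1 if allocation fails. */
int exc_update_or_create(struct nexthop *nh, uint32_t daddr,
                         uint32_t gw, uint32_t pmtu, unsigned long expires)
{
    struct nh_exception *e = exc_find(nh, daddr);

    if (!e) {
        unsigned int h = exc_hash(daddr);

        e = calloc(1, sizeof(*e));
        if (!e)
            return -1;
        e->daddr = daddr;
        e->next = nh->buckets[h];
        nh->buckets[h] = e;
    }
    if (gw)
        e->gw = gw;            /* a redirect supplies a new gateway */
    if (pmtu)
        e->pmtu = pmtu;        /* PMTU discovery supplies a new MTU */
    e->expires = expires;
    return 0;
}
```

A redirect followed by a PMTU update for the same destination ends up in one entry that carries both the new gateway and the new MTU, which is why a single entry point can serve both mechanisms.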

Metric Queries

  • u32 dst_metric(const struct dst_entry *dst, int metric);
    • What it does: Extracts the specified metric (such as MTU, initial window, etc.) from a dst_entry.
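
A metric lookup of this kind boils down to indexing an array by the metric identifier. The sketch below assumes the convention that metric N is stored at index N - 1 (identifier 0 means "unspecified" and needs no slot); treat that offset, along with the names struct dst_model, metric_get(), metric_set(), and the MX_* constants, as assumptions of this illustration rather than a statement about the kernel's layout.

```c
#include <assert.h>
#include <stdint.h>

/* Identifiers mirroring the first few RTAX_* values (0..4). */
enum {
    MX_UNSPEC = 0,
    MX_LOCK   = 1,
    MX_MTU    = 2,
    MX_WINDOW = 3,
    MX_RTT    = 4,
    MX_MAX    = 4    /* highest valid identifier in this sketch */
};

/* Hypothetical destination entry holding one slot per metric. */
struct dst_model {
    uint32_t metrics[MX_MAX];
};

void metric_set(struct dst_model *dst, int metric, uint32_t val)
{
    assert(metric > MX_UNSPEC && metric <= MX_MAX);
    dst->metrics[metric - 1] = val;     /* identifier N lives at index N - 1 */
}

uint32_t metric_get(const struct dst_model *dst, int metric)
{
    assert(metric > MX_UNSPEC && metric <= MX_MAX);
    return dst->metrics[metric - 1];
}
```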

Core Macro Definitions

Kernel code is full of macros; here are a few key players we encountered while dissecting the FIB logic.

FIB Lookup Result Extraction

These macros typically take a fib_result structure as an argument and extract key fields from the lookup result.

  • FIB_RES_GW(res): Returns the gateway address of the next hop (nh_gw).
  • FIB_RES_DEV(res): Returns the network device of the next hop (a net_device pointer).
  • FIB_RES_OIF(res): Returns the output interface index (nh_oif).
  • FIB_RES_NH(res): Returns the complete fib_nh structure.
    • Details: If multipath routing is enabled, this uses res->nh_sel as an index to select the correct one from the nexthop array in fib_info.
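
The way these accessors layer on top of one another can be shown with a reduced user-space model. The struct and macro names below (nh_model, info_model, result_model, RES_NH, RES_GW, RES_OIF, res_nh_checked) are hypothetical; the point is only that the nexthop selector indexes an array inside the shared route info, and the field accessors are thin wrappers around that selection.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced versions of fib_nh, fib_info and fib_result. */
struct nh_model     { uint32_t gw; int oif; };
struct info_model   { int nhs; struct nh_model nh[4]; };
struct result_model { int nh_sel; struct info_model *fi; };

/* Select the nexthop chosen by the lookup, then read a field from it. */
#define RES_NH(res)  ((res).fi->nh[(res).nh_sel])
#define RES_GW(res)  (RES_NH(res).gw)
#define RES_OIF(res) (RES_NH(res).oif)

/* Bounds-checked variant for callers that do not trust nh_sel. */
const struct nh_model *res_nh_checked(const struct result_model *res)
{
    assert(res->fi && res->nh_sel >= 0 && res->nh_sel < res->fi->nhs);
    return &res->fi->nh[res->nh_sel];
}
```

Changing nh_sel on the result is enough to make every accessor report a different path, without touching the shared route info.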

Device Behavior Checks

Before deciding "should I forward this?" or "should I obey?", the kernel has to ask the NIC's attitude first.

  • IN_DEV_FORWARD(in_dev): Checks whether IP forwarding is enabled on the device. If this is off, it's a quiet endpoint, not a router.
  • IN_DEV_RX_REDIRECTS(in_dev): Checks whether the device receives ICMP Redirects. If you're building a router, you usually turn this off to prevent being disrupted by someone else's Redirects.
  • IN_DEV_TX_REDIRECTS(in_dev): Checks whether the device sends ICMP Redirects.

TRIE Tree Structure Traversal

  • IS_LEAF(node): Checks if this node is a leaf node (an endpoint).
  • IS_TNODE(node): Checks if this node is an internal node (a Trie Node, meaning we need to keep looking deeper).

Multipath Routing Iteration

  • change_nexthops(fi): A macro defined in fib_semantics.c. It provides a loop mechanism to iterate over all nexthop entries in a fib_info. When you need to inspect every possible path, this is what you'll use.
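
A loop macro of this style typically opens a block, declares the iteration variables, and leaves the loop body to the caller, with a companion macro closing the block. The sketch below follows that pattern in user space; for_each_nh, end_for_each_nh, nh_model, info_model, and total_weight are hypothetical names, and the fixed array size of 4 is an assumption of the example.

```c
#include <assert.h>
#include <stdint.h>

struct nh_model   { uint32_t gw; int weight; };
struct info_model { int nhs; struct nh_model nh[4]; };

/* Opens a block and declares nhsel / nexthop_nh for use in the loop body. */
#define for_each_nh(fi) {                                   \
    int nhsel; struct nh_model *nexthop_nh;                 \
    for (nhsel = 0, nexthop_nh = (fi)->nh;                  \
         nhsel < (fi)->nhs;                                 \
         nexthop_nh++, nhsel++)

/* Closes the block opened by for_each_nh(). */
#define end_for_each_nh(fi) }

/* Example use: sum the weights of every nexthop in a multipath route. */
int total_weight(struct info_model *fi)
{
    int sum = 0;

    assert(fi->nhs >= 0 && fi->nhs <= 4);
    for_each_nh(fi) {
        sum += nexthop_nh->weight;
    } end_for_each_nh(fi)
    return sum;
}
```

The cost of this style is that the loop variables appear "from nowhere" at the call site, so the macro pair must always be used together.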

Key Data Tables

Table 5-1: Routing Metrics

The kernel doesn't just manage where the road goes; it also manages road conditions. There are 15 (RTAX_MAX) metrics here, some of which are specifically for TCP.

| Linux Symbol | TCP Related (Y/N) | Meaning |
|---|---|---|
| RTAX_UNSPEC | N | Unspecified |
| RTAX_LOCK | N | Lock metric (prevent updates) |
| RTAX_MTU | N | Path Maximum Transmission Unit |
| RTAX_WINDOW | Y | TCP initial window size |
| RTAX_RTT | Y | Round-Trip Time |
| RTAX_RTTVAR | Y | Round-Trip Time Variance |
| RTAX_SSTHRESH | Y | Slow Start Threshold |
| RTAX_CWND | Y | Congestion Window |
| RTAX_ADVMSS | Y | Peer MSS (suggested) |
| RTAX_REORDERING | Y | Packet reordering threshold |
| RTAX_HOPLIMIT | N | Hop limit |
| RTAX_INITCWND | Y | Initial Congestion Window |
| RTAX_FEATURES | N | Feature flags |
| RTAX_RTO_MIN | Y | Minimum Retransmission Timeout |
| RTAX_INITRWND | Y | Initial Receive Window |

(Source: include/uapi/linux/rtnetlink.h)

Table 5-2: Route Types and Error Codes

When a routing lookup hits, different type values mean different fates. Here we list the mappings in the fib_props array: the error codes and scopes corresponding to specific route types.

| Linux Symbol | Error Code | Scope | Meaning |
|---|---|---|---|
| RTN_UNSPEC | 0 | RT_SCOPE_NOWHERE | Unspecified (usually doesn't appear in final results) |
| RTN_UNICAST | 0 | RT_SCOPE_UNIVERSE | Normal unicast route |
| RTN_LOCAL | 0 | RT_SCOPE_HOST | Local address |
| RTN_BROADCAST | 0 | RT_SCOPE_LINK | Broadcast address |
| RTN_ANYCAST | 0 | RT_SCOPE_LINK | Anycast address |
| RTN_MULTICAST | 0 | RT_SCOPE_UNIVERSE | Multicast route |
| RTN_BLACKHOLE | -EINVAL | RT_SCOPE_UNIVERSE | Blackhole (silently drop, no error) |
| RTN_UNREACHABLE | -EHOSTUNREACH | RT_SCOPE_UNIVERSE | Unreachable (drop and send ICMP Destination Unreachable) |
| RTN_PROHIBIT | -EACCES | RT_SCOPE_UNIVERSE | Prohibit (rejected by admin, send ICMP Administratively Prohibited) |
| RTN_THROW | -EAGAIN | RT_SCOPE_UNIVERSE | "Continue to the next table" (used by policy routing) |
| RTN_NAT | -EINVAL | RT_SCOPE_NOWHERE | NAT (legacy usage) |
| RTN_XRESOLVE | -EINVAL | RT_SCOPE_NOWHERE | Requires external resolution (e.g., via a daemon) |
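
A table like this maps naturally onto an array indexed by route type. The following user-space sketch encodes the rows of Table 5-2 that way; the enum and struct names (rtype, rscope, type_prop, props, type_to_error, type_to_scope) are hypothetical, and only the error codes and scopes listed in the table are taken from the text.

```c
#include <assert.h>
#include <errno.h>

enum rtype {
    T_UNSPEC, T_UNICAST, T_LOCAL, T_BROADCAST, T_ANYCAST, T_MULTICAST,
    T_BLACKHOLE, T_UNREACHABLE, T_PROHIBIT, T_THROW, T_NAT, T_XRESOLVE,
    T_MAX
};

enum rscope { S_UNIVERSE, S_LINK, S_HOST, S_NOWHERE };

struct type_prop {
    int error;            /* 0 or a negative errno value */
    enum rscope scope;
};

/* One row per route type, in the same order as Table 5-2. */
static const struct type_prop props[T_MAX] = {
    [T_UNSPEC]      = { 0,             S_NOWHERE  },
    [T_UNICAST]     = { 0,             S_UNIVERSE },
    [T_LOCAL]       = { 0,             S_HOST     },
    [T_BROADCAST]   = { 0,             S_LINK     },
    [T_ANYCAST]     = { 0,             S_LINK     },
    [T_MULTICAST]   = { 0,             S_UNIVERSE },
    [T_BLACKHOLE]   = { -EINVAL,       S_UNIVERSE },
    [T_UNREACHABLE] = { -EHOSTUNREACH, S_UNIVERSE },
    [T_PROHIBIT]    = { -EACCES,       S_UNIVERSE },
    [T_THROW]       = { -EAGAIN,       S_UNIVERSE },
    [T_NAT]         = { -EINVAL,       S_NOWHERE  },
    [T_XRESOLVE]    = { -EINVAL,       S_NOWHERE  },
};

int type_to_error(enum rtype t)
{
    assert((int)t >= 0 && (int)t < T_MAX);
    return props[t].error;
}

enum rscope type_to_scope(enum rtype t)
{
    assert((int)t >= 0 && (int)t < T_MAX);
    return props[t].scope;
}
```

A lookup that lands on a prohibit route can then simply return type_to_error(T_PROHIBIT) to its caller, and the caller decides which ICMP message that error deserves.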

Route Flags

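When you type route -n (or netstat -rn), that string of abbreviated letters (UG, UH) in the Flags column isn't gibberish. They are the kernel's "annotations" for that route. Note that ip route show does not print these letters; it spells the same information out in words.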

Below are the meanings of the common flags, corresponding to the example output in Table 5-3:

  • U (Route is up): The route is active.
  • H (Target is a host): The target is a specific host (usually with a netmask of 255.255.255.255).
  • G (Use gateway): Uses a gateway (the packet isn't sent directly to the destination subnet, but goes through an intermediary).
  • R (Reinstate route): Reinstate a dynamic route (restarted by a routing daemon).
  • D (Dynamically installed): Dynamically installed (created by a redirect or a daemon).
  • M (Modified): Modified (altered by a redirect or a daemon).
  • A (Installed by addrconf): Auto-configured by addrconf (usually IPv6-related, but appears here too).
  • ! (Reject route): Reject route (corresponds to types like RTN_PROHIBIT).
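
Turning a flag bitmask into this letter string is a simple table walk. The sketch below is a user-space illustration: the F_* constants are local definitions whose values are intended to mirror the RTF_* bits in include/uapi/linux/route.h (treat the exact values as an assumption and check the header), flags_to_str() is a hypothetical helper, and the A flag is left out because its bit is defined elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the flag bits; assumed to match RTF_* in linux/route.h. */
#define F_UP        0x0001
#define F_GATEWAY   0x0002
#define F_HOST      0x0004
#define F_REINSTATE 0x0008
#define F_DYNAMIC   0x0010
#define F_MODIFIED  0x0020
#define F_REJECT    0x0200

static const struct {
    unsigned int bit;
    char letter;
} flag_map[] = {
    { F_UP, 'U' }, { F_GATEWAY, 'G' }, { F_HOST, 'H' },
    { F_REINSTATE, 'R' }, { F_DYNAMIC, 'D' }, { F_MODIFIED, 'M' },
    { F_REJECT, '!' },
};

/* Write one letter per set flag into buf, always NUL-terminating it.
 * Letters that do not fit in len bytes are silently dropped. */
void flags_to_str(unsigned int flags, char *buf, size_t len)
{
    size_t i, n = 0;

    assert(buf && len > 0);
    for (i = 0; i < sizeof(flag_map) / sizeof(flag_map[0]); i++)
        if ((flags & flag_map[i].bit) && n + 1 < len)
            buf[n++] = flag_map[i].letter;
    buf[n] = '\0';
}
```

Passing F_UP | F_GATEWAY yields the familiar "UG", and F_UP | F_HOST yields "UH".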

Table 5-3: Routing Table Example

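| Destination | Gateway | Genmask | Flags | Metric | Ref | Use | Iface |
|---|---|---|---|---|---|---|---|
| 169.254.0.0 | 0.0.0.0 | 255.255.0.0 | U | 1002 | 0 | 0 | eth0 |
| 192.168.3.0 | 192.168.2.1 | 255.255.255.0 | UG | 0 | 0 | 0 | eth1 |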

Breakdown:

  • First row: Traffic destined for the link-local address 169.254.0.0/16 goes directly out eth0 (no gateway, so no G).
  • Second row: Traffic destined for 192.168.3.0/24 must first be sent to the gateway 192.168.2.1, and then goes out via eth1.

Chapter Echoes

In this chapter, we peeled back the kernel routing subsystem layer by layer, like an onion.

The outermost layer is the ip route command, which is what you see; deeper in is the FIB table, where the kernel stores the rules; at the very core are the LC-trie and the fib_lookup algorithm, the engine processing millions of flows every second.

If you take away only one thing from this chapter, let it be this: A routing decision is not a simple match, but a complete closed loop from lookup, to caching, to dynamic correction (Redirect/FNHE). The metrics in Table 5-1, the error codes in Table 5-2, and those structures starting with fib_ all exist to make this closed loop run fast while remaining flexible enough to adjust its posture when the network topology changes.

In the next chapter, we'll shift from "where to go" to "how to send." We'll leave the calm deliberations of the routing layer and enter the enthusiastic handshakes of the Neighbor Subsystem—seeing how ARP and ND turn abstract IP addresses into real Ethernet frames and actually push data onto the wire.


Exercises

Exercise 1: Understanding

Question: In the Linux kernel's routing lookup process, the fib_lookup() function is the core entry point. Briefly describe the basic flow of how the fib_lookup() function performs a route lookup using the input parameter flowi4 and the output parameter fib_result, and explain how the kernel constructs the dst_entry (destination cache) based on fib_result when the lookup succeeds.

Answer and Analysis

Answer: fib_lookup() uses flowi4 (which contains the destination address, source address, TOS, etc.) as a key. It first looks up in the Local table; if that fails, it performs a Longest Prefix Match (LPM) in the Main table. On a successful lookup, it populates the fib_result structure (which includes the prefix length, fib_info pointer, route type, etc.). Subsequently, the kernel creates a dst_entry object (embedded in a rtable) based on the information in fib_result (such as whether type is RTN_LOCAL or RTN_UNICAST), and sets its input or output callback functions (like ip_local_deliver or ip_forward).

Analysis: This question tests your understanding of the core data flow in the IPv4 routing subsystem. flowi4 defines the "key" for the lookup, while fib_result stores the "result." fib_info is the specific parameter carrier for the routing entry. After a successful lookup, the kernel must translate this static FIB information into a dynamically usable routing cache object (dst_entry). The most important part of this is setting the callback functions that handle the packet, which determines whether the packet is received locally, forwarded, or dropped.
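
The flow in this answer (try the Local table, fall back to the Main table, then pick a handler from the route type) can be sketched in user space as follows. Everything here is a simplified assumption for illustration: the two "tables" are hard-coded functions, deliver_local() and forward_pkt() merely stand in for ip_local_deliver() and ip_forward(), and route_input(), struct lookup_result, and struct dst_model are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum rt_type { RT_UNICAST, RT_LOCAL };

struct lookup_result { enum rt_type type; };

typedef int (*input_fn)(void);
struct dst_model { input_fn input; };   /* reduced stand-in for dst_entry */

/* Stand-ins for the real handlers; they only return a tag. */
static int deliver_local(void) { return 1; }
static int forward_pkt(void)   { return 2; }

/* "Local table": exact match on this host's own address (192.168.2.1). */
static int local_lookup(uint32_t daddr, struct lookup_result *res)
{
    if (daddr != 0xC0A80201u)
        return -1;
    res->type = RT_LOCAL;
    return 0;
}

/* "Main table": a single 192.168.3.0/24 unicast route. */
static int main_lookup(uint32_t daddr, struct lookup_result *res)
{
    if ((daddr & 0xFFFFFF00u) != 0xC0A80300u)
        return -1;
    res->type = RT_UNICAST;
    return 0;
}

/* Look up daddr and, on success, install the matching input handler.
 * Returns 0 on success, -1 if no table has a route. */
int route_input(uint32_t daddr, struct dst_model *dst)
{
    struct lookup_result res;

    assert(dst);
    if (local_lookup(daddr, &res) != 0 && main_lookup(daddr, &res) != 0)
        return -1;
    dst->input = (res.type == RT_LOCAL) ? deliver_local : forward_pkt;
    return 0;
}
```

After route_input() succeeds, the caller no longer needs to know which table matched; calling dst->input() is enough, which is the "lookup becomes a callback" idea the answer describes.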

Exercise 2: Application

Question: Suppose you are a network administrator and need to configure a Linux server to block traffic from the 192.168.1.0/24 subnet from accessing 10.0.0.5. Write the ip route command to implement this policy, and explain, based on the principles of fib_props and RTN_PROHIBIT, what error code the kernel returns and what ICMP message it sends when it receives a packet matching this rule.

Answer and Analysis

Answer: Command: ip route add prohibit 10.0.0.5 from 192.168.1.0/24.

Principle: The fib_props array in the kernel defines the behavior for different route types. The RTN_PROHIBIT type corresponds to the error code -EACCES. When a packet matches this rule, the routing lookup returns this error, and the kernel immediately invokes ip_error() to handle it, dropping the packet and replying to the sender with an ICMP "Destination Unreachable" message with the code "Packet Filtered" (ICMP_PKT_FILTERED).

Analysis: This question tests how to apply the concept of Policy Routing for traffic filtering. Understanding that RTN_PROHIBIT is a type of fib_type is key—it doesn't just silently drop packets, but has specific interactive behavior (sending ICMP). By looking up the error field corresponding to RTN_PROHIBIT in the fib_props array, you can determine the specific kernel behavior.

Exercise 3: Thinking

Question: In early kernel versions (< 3.6), Linux used a routing cache to accelerate lookups. The cache was removed in version 3.6, and lookups now go directly to the FIB TRIE. Analyze the main reasons the kernel development team made this change (adopting FIB TRIE over Routing Cache) from both "performance" and "security/stability" perspectives.

Answer and Analysis

Answer:

  1. Performance: As routing table sizes grew (internet core routing tables have a massive number of entries), the overhead of maintaining a huge hash cache and keeping it consistent (for example, invalidating stale routes) became very large. The LC-trie (a level- and path-compressed trie used for Longest Prefix Match) is inherently fast at lookups (O(key length)) and memory-efficient, making an additional caching layer unnecessary.

  2. Security and Stability: The routing cache was vulnerable to denial-of-service (DoS) attacks. An attacker could send a massive volume of packets with random destination IPs, forcing the kernel to constantly perform cache-miss lookups and fill the cache, exhausting system memory and CPU. By removing the cache and looking up directly in the FIB TRIE, this attack surface based on cache overflow was eliminated.

Analysis: This question tests deep thinking about the evolution of the kernel networking subsystem. This is a classic case of transitioning from "trading space for time" to "algorithmic optimization." Understanding this requires recognizing that while caching usually accelerates access, in highly dynamic or specific attack scenarios, the management costs of the cache (lock contention, entry refreshes) and its fragility (susceptibility to attack) can outweigh its benefits. The FIB TRIE provides sufficiently high lookup efficiency to allow the removal of complex caching logic.


Key Takeaways

The Linux kernel uses the FIB (Forwarding Information Base) and routing lookup mechanisms to determine where packets should go. The core function fib_lookup() queries the routing table based on parameters like the destination address, ultimately generating a dst_entry (routing cache entry) that contains the input and output function pointers. This transforms the complex table lookup process into a call to a specific callback function (such as ip_local_deliver or ip_forward), thereby implementing the logical branching between local reception and forwarding.

The fib_info structure is the complete "ID card" of a routing entry in the kernel, encapsulating all metadata except the destination. It not only records the route's origin (such as static configuration or kernel-generated), scope, and priority, but also manages TCP performance parameters like MTU and RTT through the fib_metrics array. This design separates the routing decision attributes from the physical path, allowing the kernel to efficiently manage the lifecycle of routing entries via reference counting and supporting multipath routing configurations.

Routing doesn't always mean permitting traffic. The kernel uses the fib_props mapping table to translate different route types (like RTN_PROHIBIT) into specific operational behaviors. When a routing lookup hits a "prohibit" type, the kernel doesn't silently drop the packet; instead, it triggers an ICMP "Destination Unreachable" or "Filtered" message based on the configuration. This mechanism elevates the routing table itself into an efficient traffic control policy system.

To cope with the dynamic nature of network environments, the kernel introduces an Exception mechanism at the fib_nh (next hop) level. For specific destination addresses, the kernel uses a hash table to record gateway changes brought by ICMP redirects or MTU adjustments brought by PMTU discovery. This "sticky note" style of fine-tuning avoids frequently modifying the massive global routing table, achieving precise correction and isolation for individual paths.

When multiple routes point to the same destination and physical path but have different attributes (such as TOS or priority), the kernel employs a fib_alias mechanism to optimize memory. It allows multiple lightweight fib_alias structures to share a single fib_info that stores the actual path information. This design avoids duplicating large amounts of routing data for minor differences, significantly improving memory utilization in large-scale routing tables (such as BGP scenarios).