6.9 Final Cheat Sheet

We have finally reached the end of this chapter.

This section introduces no new narratives or concepts. It is more like the appendix of an explorer's journal—when you actually start writing code or debugging weird network issues, you will flip back to this page.

Listed here are the core methods we wrestled with in this chapter, a few key macros, and the entry points for peeking into kernel state under /proc.


Core Methods Quick Reference

First are the important kernel methods we encountered. I have categorized them by their roles in multicast and policy routing.

Note: The code signatures in this section are critical. If you are writing a kernel module or debugging a driver, getting a single parameter type wrong will cause a compilation failure.

Channels for the Multicast Routing Daemon

This part covers the interfaces through which a userspace daemon such as mrouted or pimd talks to the kernel.

int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsigned int optlen);

This is the "telephone handset" between the kernel and the multicast routing daemon. The daemon calls it via setsockopt() to issue various commands. Supported commands (optname) include:

  • MRT_INIT: Initialize multicast forwarding.
  • MRT_DONE: Stop and clean up.
  • MRT_ADD_VIF / MRT_DEL_VIF: Add or delete a virtual interface.
  • MRT_ADD_MFC / MRT_DEL_MFC: Add or delete a forwarding cache entry.
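  • MRT_ASSERT: Enable or disable PIM assert processing.
  • MRT_PIM: Enable or disable PIM mode; available only if the kernel has PIM support enabled.
  • MRT_TABLE: Select the multicast routing table to operate on; available only if multicast policy routing (multi-table support) is enabled.

int ip_mroute_getsockopt(struct sock *sk, int optname, char __user *optval, int __user *optlen);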

The corresponding query channel in the opposite direction. The daemon uses it to read values back from the kernel, such as MRT_VERSION, MRT_ASSERT, or MRT_PIM.
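To make the two channels concrete, here is a minimal userspace sketch of the daemon side. The helper names (mrt_enable, mrt_disable, mrt_version) are hypothetical; the option names come from <linux/mroute.h>. It assumes fd is a raw IGMP socket, which a real daemon would open with socket(AF_INET, SOCK_RAW, IPPROTO_IGMP) while holding CAP_NET_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>    /* IPPROTO_IP; include before linux/mroute.h */
#include <linux/mroute.h>  /* MRT_INIT, MRT_DONE, MRT_VERSION */

/* Hypothetical helper: register fd as the multicast routing socket.
 * Returns 0 on success or -errno on failure. */
static int mrt_enable(int fd)
{
    int on = 1;

    if (setsockopt(fd, IPPROTO_IP, MRT_INIT, &on, sizeof(on)) < 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: tell the kernel the daemon is leaving. */
static int mrt_disable(int fd)
{
    if (setsockopt(fd, IPPROTO_IP, MRT_DONE, NULL, 0) < 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: read MRT_VERSION back through getsockopt(). */
static int mrt_version(int fd, int *version)
{
    socklen_t len = sizeof(*version);

    assert(version != NULL);
    if (getsockopt(fd, IPPROTO_IP, MRT_VERSION, version, &len) < 0)
        return -errno;
    return 0;
}
```

On an ordinary socket, or without the required privileges, these calls are expected to fail, so a daemon should treat a negative return as "multicast forwarding is not available here" rather than retrying blindly.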

Multicast Routing Table Lifecycle

struct mr_table *ipmr_new_table(struct net *net, u32 id);

Creates a new multicast routing table. If you are doing policy routing, you might use this. The parameter id specifies the table ID.

void ipmr_free_table(struct mr_table *mrt);

Frees the specified multicast routing table and all resources it occupies. Remember, the kernel does not automatically do all the cleanup for you.

Host Side: Joining and Leaving

int ip_mc_join_group(struct sock *sk, struct ip_mreqn *imr);

This is not for routers; it is for hosts. Call this method when you want to join a multicast group. The multicast group address goes into the imr parameter. Returns 0 on success.
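From userspace, a host application typically reaches this path through setsockopt() with IP_ADD_MEMBERSHIP. The sketch below is one possible way to prepare the ip_mreqn argument; fill_mreqn and join_group are hypothetical helper names, and the group address and interface index are whatever the caller supplies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Fill an ip_mreqn for `group` on interface index `ifindex`
 * (0 lets the kernel pick the interface). Returns 0 on success,
 * or -1 if `group` is not a valid IPv4 multicast address. */
static int fill_mreqn(struct ip_mreqn *imr, const char *group, int ifindex)
{
    assert(imr != NULL && group != NULL);
    memset(imr, 0, sizeof(*imr));
    if (inet_pton(AF_INET, group, &imr->imr_multiaddr) != 1)
        return -1;
    if (!IN_MULTICAST(ntohl(imr->imr_multiaddr.s_addr)))
        return -1;
    imr->imr_address.s_addr = htonl(INADDR_ANY);
    imr->imr_ifindex = ifindex;
    return 0;
}

/* Ask the kernel to join the group on an already-open UDP socket.
 * Returns 0 on success, -1 with errno set on failure. */
static int join_group(int fd, const struct ip_mreqn *imr)
{
    return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, imr, sizeof(*imr));
}
```

Validating the address before the system call keeps an obvious mistake (a unicast address passed as a group) out of the kernel path entirely.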

Core Lookup and Forwarding Logic

static struct mfc_cache *ipmr_cache_find(struct mr_table *mrt, __be32 origin, __be32 mcastgrp);

We saw this in the "IPv4 Multicast Rx Path" section. It looks up an entry in the IPv4 multicast forwarding cache (MFC). The keys are the source address (origin) and the multicast group address (mcastgrp). Returns NULL if not found.

bool ipv4_is_multicast(__be32 addr);

A simple check tool: is this IP address a multicast address? (i.e., is it a Class D address?)
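The Class D test amounts to checking that the top four bits of the address are 1110, that is, the range 224.0.0.0 through 239.255.255.255. A userspace equivalent might look like the sketch below; is_ipv4_multicast is a stand-in name, and the address is assumed to be in network byte order, as the __be32 type implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/* True if addr_be (network byte order) lies in 224.0.0.0/4. */
static bool is_ipv4_multicast(uint32_t addr_be)
{
    return (addr_be & htonl(0xf0000000u)) == htonl(0xe0000000u);
}

/* A few boundary cases, written as assertions. */
static void is_ipv4_multicast_examples(void)
{
    assert(is_ipv4_multicast(inet_addr("224.0.0.1")));
    assert(is_ipv4_multicast(inet_addr("239.255.255.255")));
    assert(!is_ipv4_multicast(inet_addr("223.255.255.255")));
    assert(!is_ipv4_multicast(inet_addr("240.0.0.0")));
}
```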

int ip_mr_input(struct sk_buff *skb);

This is the "main function" of the multicast packet receive path (defined in net/ipv4/ipmr.c). All incoming multicast packets that are not destined for the local host pass through it, and it decides how they are forwarded.

Memory Management and Cache Construction

struct mfc_cache *ipmr_cache_alloc(void);

Allocates a standard multicast forwarding cache entry (mfc_cache).

static struct mfc_cache *ipmr_cache_alloc_unres(void);

Used specifically to allocate an "unresolved" entry. Recall: when the kernel receives a multicast packet but there is no corresponding route in the cache yet, it first creates an unresolved entry, sets an expiration time, and then waits for the userspace daemon to fill in the gap.

static int ipmr_mfc_add(struct net *net, struct mr_table *mrt, struct mfcctl *mfc, int mrtsock, int parent);

This is the kernel implementation of the MRT_ADD_MFC command. It converts the mfcctl structure passed from userspace into a kernel cache entry.
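For orientation, here is a hedged sketch of how a daemon might fill the mfcctl structure before issuing MRT_ADD_MFC. The helpers build_mfcctl and mfc_add are hypothetical; the structure and its fields (mfcc_origin, mfcc_mcastgrp, mfcc_parent, mfcc_ttls) come from <linux/mroute.h>. The sketch assumes that a zero entry in mfcc_ttls means "do not forward on this VIF".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/mroute.h>  /* struct mfcctl, vifi_t, MAXVIFS, MRT_ADD_MFC */

/* Build an (S,G) request: packets from `origin` to `group` must arrive on
 * VIF `parent` and are copied to every VIF listed in out_vifs[], each with
 * a TTL threshold of `min_ttl`. Returns 0 on success, -1 on bad input. */
static int build_mfcctl(struct mfcctl *mc, const char *origin,
                        const char *group, vifi_t parent,
                        const vifi_t *out_vifs, size_t n_out,
                        unsigned char min_ttl)
{
    assert(mc != NULL);
    memset(mc, 0, sizeof(*mc));     /* all thresholds start at 0 */
    if (inet_pton(AF_INET, origin, &mc->mfcc_origin) != 1 ||
        inet_pton(AF_INET, group, &mc->mfcc_mcastgrp) != 1)
        return -1;
    if (parent >= MAXVIFS)
        return -1;
    mc->mfcc_parent = parent;
    for (size_t i = 0; i < n_out; i++) {
        if (out_vifs[i] >= MAXVIFS)
            return -1;
        mc->mfcc_ttls[out_vifs[i]] = min_ttl;
    }
    return 0;
}

/* Hand the request to the kernel over the multicast routing socket. */
static int mfc_add(int fd, const struct mfcctl *mc)
{
    return setsockopt(fd, IPPROTO_IP, MRT_ADD_MFC, mc, sizeof(*mc));
}
```

Keeping the structure-building step separate from the system call makes it possible to check the request without a privileged socket.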

static int ipmr_mfc_delete(struct mr_table *mrt, struct mfcctl *mfc, int parent);

Similarly, this is the kernel implementation of MRT_DEL_MFC.

Virtual Interface (VIF) Management

static int vif_add(struct net *net, struct mr_table *mrt, struct vifctl *vifc, int mrtsock);

Registers a physical NIC or tunnel as a virtual multicast interface (VIF). Corresponds to MRT_ADD_VIF.

static int vif_delete(struct mr_table *mrt, int vifi, int notify, struct list_head *head);

Deletes a VIF. Corresponds to MRT_DEL_VIF.

int dev_set_allmulti(struct net_device *dev, int inc);

A critical function. Remember how we needed the NIC to receive all multicast packets? This function implements that by adjusting the NIC's allmulti counter: inc is added to the counter, so a positive value raises it and a negative value lowers it. The NIC stays in all-multicast mode for as long as the counter is nonzero.

Notifications and Maintenance

static int ipmr_cache_report(struct mr_table *mrt, struct sk_buff *pkt, vifi_t vifi, int assert);

When the kernel encounters a packet it cannot handle (e.g., an unresolved NOCACHE, or a WRONGVIF coming in from the wrong interface), it calls this function. It constructs a special IGMP message packet, stuffs it into a queue, and sends it to the userspace daemon via sock_queue_rcv_skb().
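On the receiving end, the daemon has to tell these kernel reports apart from ordinary IGMP packets arriving on the same raw socket. One common approach, sketched below, relies on struct igmpmsg from <linux/mroute.h>: its im_mbz field ("must be zero") overlays the protocol byte of an IP header, so it is zero only for kernel reports. The function name upcall_type is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>
#include <linux/mroute.h>  /* struct igmpmsg, IGMPMSG_NOCACHE, IGMPMSG_WRONGVIF */

/* Classify a buffer read from the multicast routing socket.
 * Returns the IGMPMSG_* type for a kernel report, 0 for what looks like
 * an ordinary IGMP packet, and -1 if the buffer is too short to tell. */
static int upcall_type(const void *buf, size_t len)
{
    struct igmpmsg msg;

    assert(buf != NULL);
    if (len < sizeof(msg))
        return -1;
    memcpy(&msg, buf, sizeof(msg));   /* avoid unaligned access */
    if (msg.im_mbz != 0)
        return 0;                     /* real IP header: protocol != 0 */
    return msg.im_msgtype;
}
```

A daemon would typically switch on the result: install a route on IGMPMSG_NOCACHE, and run its assert logic on IGMPMSG_WRONGVIF.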

static int ipmr_device_event(struct notifier_block *this, unsigned long event, void *ptr);

A hotplug event callback. If you unplug a NIC, the kernel triggers a NETDEV_UNREGISTER event. This callback function receives the notification and then deletes the VIF corresponding to that NIC from vif_table—preventing the kernel from accessing a device that no longer exists.

static void mrtsock_destruct(struct sock *sk);

When the daemon calls setsockopt(MRT_DONE) to shut down and leave, this destructor is called. It does the cleanup: zeros out the mroute_sk pointer, decrements /proc/sys/net/ipv4/conf/all/mc_forwarding by 1, and calls mroute_clean_tables() to wipe the table clean.

static void ipmr_expire_process(unsigned long arg);

A timer callback. If ipmr_cache_report() sent out a distress signal but the daemon never responded (did not add a route), this timer will delete that unresolved entry when it expires—the kernel does not play the good guy forever.

Protocol Dispatch

int igmp_rcv(struct sk_buff *skb);

The IGMP protocol receive handler. Although this does not directly belong to routing, it is the foundation for maintaining multicast group membership.

Policy Routing and Multipath

void fib_select_multipath(struct fib_result *res);

Used in multipath routing (ECMP) scenarios. When a route lookup reveals multiple paths, this function picks a specific next hop based on a weighting algorithm (hash or round-robin).


Key Macros

Here are two macros we saw in the code but did not discuss in detail; they are key to the implementation.

MFC_HASH(a, b)

#define MFC_HASH(a, b) ...

The hash algorithm for the MFC (Multicast Forwarding Cache). The parameter a is the multicast group address, and b is the source address. The kernel does not linearly scan the cache table (that would be too slow); instead, it uses this macro to calculate an index and jumps directly to the corresponding position to look up.
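The idea of "compute an index, then search only that bucket" can be shown with a toy cache. Everything in this sketch is hypothetical: the bucket count, the struct, and in particular toy_hash, which is a placeholder and not the kernel's MFC_HASH.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LINES 64u   /* hypothetical bucket count, a power of two */

struct toy_mfc {
    uint32_t origin;        /* source address S */
    uint32_t mcastgrp;      /* group address G */
    int parent;             /* expected incoming interface */
    struct toy_mfc *next;   /* collision chain within one bucket */
};

/* Placeholder hash, NOT the kernel's formula: mixes G and S into a
 * bucket index in the range [0, TOY_LINES). */
static unsigned toy_hash(uint32_t mcastgrp, uint32_t origin)
{
    return ((mcastgrp * 2654435761u) ^ origin) & (TOY_LINES - 1);
}

static void toy_insert(struct toy_mfc *table[TOY_LINES], struct toy_mfc *e)
{
    unsigned line;

    assert(table != NULL && e != NULL);
    line = toy_hash(e->mcastgrp, e->origin);
    e->next = table[line];
    table[line] = e;
}

/* Walk only the one bucket selected by the hash; NULL if not found. */
static struct toy_mfc *toy_find(struct toy_mfc *table[TOY_LINES],
                                uint32_t origin, uint32_t mcastgrp)
{
    struct toy_mfc *c;

    for (c = table[toy_hash(mcastgrp, origin)]; c != NULL; c = c->next)
        if (c->origin == origin && c->mcastgrp == mcastgrp)
            return c;
    return NULL;
}
```

The lookup cost depends on the length of one collision chain rather than on the total number of entries, which is the point of hashing here.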

VIF_EXISTS(_mrt, _idx)

#define VIF_EXISTS(_mrt, _idx) ...

Existence-check macro. Used to determine whether the virtual interface at index _idx actually exists in the vif_table array of routing table _mrt. It is like going to a hotel to find a room: you have to confirm that the room number actually exists before barging in.


/proc Peepholes

When debugging multicast issues, besides packet captures, these /proc files are your best friends. They directly reflect what is in the kernel's mind at this very moment.

/proc/net/ip_mr_vif

Virtual interface list.

Read this file, and you can see which virtual multicast interfaces are currently registered in the kernel. The underlying implementation function is ipmr_vif_seq_show().

What you will see:

  • The index of each interface.
  • The physical device it is bound to.
  • The statistical counters (bytes and packets received and sent on that interface).

/proc/net/ip_mr_cache

MFC forwarding cache state.

This is the place that best reflects whether the routing logic is correct. The underlying implementation function is ipmr_mfc_seq_show().

What you will see: Detailed fields for every cache entry:

  • mfc_mcastgrp: Multicast group address.
  • mfc_origin: Source address.
  • mfc_parent: Incoming interface index (which interface the packet came in from).
  • pkt / bytes: Forwarding statistics (how many packets and bytes were forwarded).
  • wrong_if: Wrong interface statistics (how many packets went to the wrong door).
  • ttls: List of forwarding interfaces and their TTL thresholds.
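If you want to process this file from a tool rather than by eye, a small parser helps. The sketch below assumes each data line carries the fields in the order listed above: group and origin as hexadecimal words, the incoming interface index, three counters, and then zero or more index:ttl pairs. That layout, the struct, and the function name are assumptions of this sketch, so check them against the output of your own kernel. The addresses are kept as opaque hex words and no byte-order conversion is attempted.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_MAX_OIFS 32   /* hypothetical cap on parsed output interfaces */

struct mfc_line {
    unsigned group, origin;             /* raw hex words as printed */
    int iif;                            /* incoming interface index */
    unsigned long pkts, bytes, wrong_if;
    int n_oifs;                         /* number of index:ttl pairs */
    int oif[TOY_MAX_OIFS], ttl[TOY_MAX_OIFS];
};

/* Parse one line; returns 0 on success, -1 if it is not a data line. */
static int parse_mfc_line(const char *line, struct mfc_line *out)
{
    int used = 0, adv = 0;

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%x %x %d %lu %lu %lu%n", &out->group, &out->origin,
               &out->iif, &out->pkts, &out->bytes, &out->wrong_if,
               &used) != 6)
        return -1;
    line += used;
    out->n_oifs = 0;
    /* Consume trailing "index:ttl" pairs until none are left. */
    while (out->n_oifs < TOY_MAX_OIFS &&
           sscanf(line, " %d:%d%n", &out->oif[out->n_oifs],
                  &out->ttl[out->n_oifs], &adv) == 2) {
        out->n_oifs++;
        line += adv;
    }
    return 0;
}
```

A non-data line such as a column header fails the first conversion and is rejected, so the same function can be applied to every line of the file.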

Policy Routing Selector Table (Table 6-1)

Finally, this table contains the matching fields commonly used when configuring policy routing (ip rule), along with their corresponding symbols in the kernel code.

It is best to print this out and tape it to the edge of your monitor, because when writing complex ip rule add commands, it is easy to forget which FRA constant corresponds to a given field.

Linux Symbol              | Routing Command Keyword       | Description                  | Structure Member
FRA_SRC                   | from                          | Match source address         | fib4_rule->src
FRA_DST                   | to                            | Match destination address    | fib4_rule->dst
FRA_IIFNAME               | iif                           | Match incoming interface name | fib_rule->iifname
FRA_OIFNAME               | oif                           | Match outgoing interface name | fib_rule->oifname
FRA_FWMARK                | fwmark                        | Match firewall mark          | fib_rule->mark
FRA_FWMASK                | fwmask                        | Match firewall mark mask     | fib_rule->mark_mask
FRA_PRIORITY              | priority / preference / order | Rule priority                | fib_rule->pref
(No corresponding symbol) | tos / dsfield                 | Type of Service              | fib4_rule->tos
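To see how these selectors combine, consider a toy matcher in which a rule applies only if every configured selector agrees with the packet. The structures and names below are invented for illustration and only loosely mirror the members in the table; addresses are kept in host byte order for readability, and a mask of zero means "match anything".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy rule: a subset of the selectors from the table. */
struct toy_rule {
    uint32_t src, srcmask;      /* "from" prefix */
    uint32_t dst, dstmask;      /* "to" prefix */
    uint32_t mark, mark_mask;   /* "fwmark" value and mask */
    char iifname[16];           /* "iif"; empty string = any interface */
    uint32_t pref;              /* priority; lower values are tried first */
};

/* Toy description of the packet being routed. */
struct toy_flow {
    uint32_t saddr, daddr, mark;
    const char *iif;
};

/* A rule matches only if every selector it sets agrees with the flow. */
static bool toy_rule_match(const struct toy_rule *r, const struct toy_flow *f)
{
    assert(r != NULL && f != NULL);
    if ((f->saddr ^ r->src) & r->srcmask)
        return false;
    if ((f->daddr ^ r->dst) & r->dstmask)
        return false;
    if ((f->mark ^ r->mark) & r->mark_mask)
        return false;
    if (r->iifname[0] != '\0' &&
        (f->iif == NULL || strcmp(r->iifname, f->iif) != 0))
        return false;
    return true;
}
```

The XOR-and-mask test compares only the bits covered by the mask, which is why a rule written for mark 0x1 with mask 0xff still matches a packet whose full mark is 0x101.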

Chapter Echoes

Here, the journey of this chapter comes to an end.

We started with a simple question—how do we send a packet to a group of people?—and ended up unearthing one of the most complex parts of the Linux kernel: the routing subsystem.

On the surface, this chapter is about "advanced routing," but it is really about control.

  • Multicast routing hands control from simple "point-to-point transmission" over to complex "tree-based distribution," forcing the kernel to distinguish between "who is this for?" and "who should forward it?"
  • Policy routing hands control from "addressing by destination" over to "addressing by administrator intent," allowing the same packet to take drastically different paths based on its source, firewall mark, or even incoming interface.

Remember the dealings we had with IGMP and the multicast routing tables? Those obscure data structures—mfc_cache, vif_table, fib4_rule—are all designed to accommodate this flexibility while maintaining high kernel performance.

The next time you type ip route add or ip maddr add at the command line, you should be able to picture which hash table the kernel is searching, and which counter on the network device it is incrementing, behind that single line.

In the next chapter, we will cross the transport layer and look at deeper things—or perhaps, more fundamental things: the cornerstones of the network's lower layers.

Ready? Take a deep breath.


Exercises

Exercise 1: Understanding

Question: In the Linux kernel's multicast routing implementation, when the multicast routing daemon sends the MRT_INIT command via setsockopt() to initialize multicast routing, which member in the kernel's mr_table structure is initialized to point to that userspace socket? What read-only procfs entry is also automatically set to an enabled state as a result of this operation?

Answer and Explanation

Answer: mroute_sk; /proc/sys/net/ipv4/conf/all/mc_forwarding

Explanation: Based on source code analysis, when the kernel processes the MRT_INIT command (in the ip_mroute_setsockopt method), it retains a reference to the userspace socket and stores it in the mroute_sk member of the mr_table structure. This is the foundation for communication between the kernel and userspace daemons (such as pimd or mrouted). At the same time, the kernel increments the mc_forwarding counter by 1 via IPV4_DEVCONF_ALL(net, MC_FORWARDING)++, thereby reflecting in the /proc filesystem that multicast forwarding is enabled. Note that this procfs entry is read-only and cannot be modified directly by writes from userspace.

Exercise 2: Application

Question: Suppose a multicast data packet arrives at a Linux host configured as a multicast router. If the lookup in the Multicast Forwarding Cache (MFC) fails (a cache miss), the kernel calls the ipmr_cache_unresolved() method. 1) How many of these packets will the kernel queue? 2) What message does the kernel send to the userspace daemon to request route resolution?

Answer and Explanation

Answer: Up to 3 (SKBs); IGMPMSG_NOCACHE

Explanation: In the implementation of ipmr_cache_unresolved(), the kernel checks whether c->mfc_un.unres.unresolved.qlen is greater than 3. If it does not exceed 3, the packet is added to the unresolved queue; otherwise, the packet is discarded (kfree_skb()) and -ENOBUFS is returned. When an unresolved entry is created, the kernel calls the ipmr_cache_report() method, which constructs an IGMPMSG_NOCACHE message and sends it to the userspace multicast routing daemon via sock_queue_rcv_skb(), notifying it that a new multicast source needs a routing entry established.

Exercise 3: Application

Question: In a multipath routing scenario, the Linux kernel uses the fib_select_multipath() function to decide which next hop to send a packet to. Suppose you configure a route with two next hops, where the first next hop has a weight of 2 and the second next hop has a weight of 1. If 300 packets need to be forwarded through this route at this time, roughly how many packets will theoretically be sent to the second next hop? What problem is this mechanism primarily designed to solve in the kernel?

Answer and Explanation

Answer: Approximately 100 (load balancing); network bandwidth utilization and link redundancy

Explanation: Multipath routing balances load in proportion to the configured weights. A weight ratio of 2 to 1 gives a total weight of 3, so in theory 2/3 of the traffic (about 200 packets) takes the first next hop and 1/3 (about 100 packets) takes the second. The kernel achieves this distribution through fib_select_multipath() combined with a hashing algorithm, aiming to make full use of the bandwidth of multiple links and to provide link redundancy that avoids single points of failure.
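The arithmetic can be checked with a deliberately simplified selector. This is not the kernel's algorithm: toy_select_nexthop is a hypothetical function that maps a hash value onto a line of length equal to the total weight and picks the next hop whose segment contains it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick a next hop in proportion to its weight.
 * Returns the index of the chosen hop, or -1 if all weights are zero. */
static int toy_select_nexthop(const unsigned *weights, size_t n, uint32_t hash)
{
    unsigned long total = 0;
    unsigned long slot;
    size_t i;

    assert(weights != NULL);
    for (i = 0; i < n; i++)
        total += weights[i];
    if (total == 0)
        return -1;

    slot = hash % total;              /* position on the weight line */
    for (i = 0; i < n; i++) {
        if (slot < weights[i])
            return (int)i;            /* slot falls inside hop i's share */
        slot -= weights[i];
    }
    return -1;                        /* not reached when total > 0 */
}
```

With weights {2, 1} and hash values 0 through 299, every third value lands on the second hop, which gives the 200/100 split from the answer. Real traffic is hashed per flow, so the observed split only approaches this ratio when there are many distinct flows.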

Exercise 4: Thinking

Question: Why is the "source filtering" feature of IGMPv3 crucial for building efficient, large-scale multicast networks? Combining your analysis with the Linux kernel's multicast routing mechanisms (particularly the MFC cache and the interaction between the kernel and the userspace daemon), analyze what impact a host being able to join a multicast group only via IGMPv2 might have on network bandwidth and router performance.

Answer and Explanation

Answer: IGMPv2's lack of source filtering leads to unnecessary traffic flooding and increased burden on core routers.

Explanation: IGMPv2 only allows a host to declare "I want to join group G" and cannot specify a source S. This means the router must forward all traffic destined for G from any source to that host. At the kernel level, MFC entries are looked up based on (S, G). Without source filtering, the multicast routing daemon might tend to establish (*, G) type entries, causing a large number of data packets from irrelevant sources to be forwarded to the receiver, wasting bandwidth. IGMPv3, on the other hand, allows hosts to specify INCLUDE or EXCLUDE lists, enabling the routing daemon to precisely establish specific source (S, G) MFC entries in the kernel. This not only reduces unnecessary packet replication and forwarding (lightening the load on ip_mr_forward), but also decreases bandwidth consumption on upstream links.


Key Takeaways

The core of the multicast mechanism lies in solving the efficiency problem of "one-to-many" communication. By having routers intelligently replicate and distribute data packets, it avoids the bandwidth disaster that would come from establishing a separate transmission channel for each receiver. This requires the network infrastructure to have the ability to manage dynamic group membership. Hosts negotiate "joining" or "leaving" a group with the router via the IGMP protocol, ensuring that multicast traffic is only routed to the network segments where truly interested receivers reside.

The Linux kernel implements the logical evolution of group membership management through the IGMP protocol stack: from the simple queries and reports of v1, to v2's introduction of explicit leave messages that speed up departure, to v3's support for source filtering for finer-grained traffic control. When the kernel receives an IGMP query, it resets timers and prepares to send reports via functions like igmp_heard_query. This ensures that as long as hosts are online, the router can maintain correct membership state, and all of this relies on strict Time-To-Live (TTL) limits that keep IGMP messages from leaking outside the local network.

The decision-making brain of multicast routing is the mr_table structure in the kernel, which not only maintains the list of physical or virtual interfaces (vif_table) but also controls the core Multicast Forwarding Cache (MFC). The kernel itself does not run complex routing protocols; instead, it interacts bidirectionally with userspace daemons (such as pimd) via the setsockopt system call: the daemon is responsible for calculating routing policies and populating the MFC table, while the kernel is responsible for high-speed forwarding based on these entries. When encountering unknown traffic, the kernel temporarily stores the packets in an unresolved queue and notifies the daemon to handle them.

The packet forwarding decision relies entirely on the mfc_cache entry, which uses (source address S, group address G) as hash keys and records the packet's incoming interface along with all valid outgoing interfaces and their TTL thresholds. During the actual execution of ip_mr_forward, the kernel strictly checks whether the packet's actual incoming interface matches the cache entry (to prevent loops and routing leaks), and iterates through all virtual interfaces, performing clone and transmit operations only on ports where the packet's TTL is higher than the interface threshold, thereby achieving precise traffic distribution.
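The two checks in that paragraph, incoming-interface match and per-interface TTL threshold, fit in a few lines. The following is a toy model with invented names (toy_mfc_entry, toy_entry_init, toy_forward_set); it adopts the convention that a threshold of 255 marks an interface that is never used for output, since no packet TTL can exceed it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAXVIFS 32   /* hypothetical size of the interface table */

struct toy_mfc_entry {
    int parent;                        /* expected incoming VIF */
    unsigned char ttls[TOY_MAXVIFS];   /* per-VIF TTL thresholds */
};

/* Start with every VIF disabled for output (threshold 255). */
static void toy_entry_init(struct toy_mfc_entry *c, int parent)
{
    assert(c != NULL);
    c->parent = parent;
    memset(c->ttls, 255, sizeof(c->ttls));
}

/* Decide where a packet goes. Writes the chosen VIF indexes to out[] and
 * returns how many there are, or -1 if the packet arrived on the wrong
 * interface and must not be forwarded at all. */
static int toy_forward_set(const struct toy_mfc_entry *c, int in_vif,
                           unsigned char pkt_ttl, int out[TOY_MAXVIFS])
{
    int n = 0;

    assert(c != NULL && out != NULL);
    if (in_vif != c->parent)
        return -1;                     /* wrong incoming interface */
    for (int vif = 0; vif < TOY_MAXVIFS; vif++)
        if (pkt_ttl > c->ttls[vif])    /* TTL must exceed the threshold */
            out[n++] = vif;
    return n;
}
```

For an entry with thresholds 1 on VIF 1 and 16 on VIF 2, a packet with TTL 8 is copied only to VIF 1, while a packet with TTL 32 is copied to both.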

When a multicast data packet arrives but results in a cache miss, the kernel does not immediately drop it; instead, it initiates a temporary storage and help-seeking mechanism, namely ipmr_cache_unresolved. It suspends the unresolved packets (typically limited to 3 packets to prevent memory exhaustion) and sends an IGMPMSG_NOCACHE alert to the userspace daemon via a special socket. If the daemon fills in the routing entry via MRT_ADD_MFC within 10 seconds, the kernel releases the detained packets for forwarding; otherwise, these entries are cleaned up by the timer. This design cleverly strikes a balance between the kernel's high-speed forwarding and userspace's complex decision-making.