6.6 The ipmr_queue_xmit() Method
In the previous section, we saw that ip_mr_forward() acts like a diligent dispatcher, deciding which Virtual Interface (VIF) a packet should be sent to. But it doesn't actually "ship" the packet. The real shipping work, including the route lookup, tunnel encapsulation, and handing the packet off to the NIC driver, is done by ipmr_queue_xmit().
In this section, we'll walk through the "final leg" of this packet's journey. There's a counterintuitive design decision waiting for us here, but let's not rush; we'll break it down step by step.
Function Signature and VIF Validation
First, let's look at the function signature:
static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
                            struct sk_buff *skb, struct mfc_cache *c, int vifi)
{
    const struct iphdr *iph = ip_hdr(skb);
    struct vif_device *vif = &mrt->vif_table[vifi];
    struct net_device *dev;
    struct rtable *rt;
    struct flowi4 fl4;
    int encap = 0;
Here, vifi is the outgoing index calculated by ip_mr_forward(). mrt->vif_table[vifi] retrieves the corresponding virtual interface device.
The first logical gate appears:
    if (vif->dev == NULL)
        goto out_free;
If you accidentally omit a VIF while configuring your routing daemon, or if a VIF is unexpectedly deleted, the kernel will silently drop the packet right here. No warnings, no errors, just out_free. This is the cold reality of the kernel—when it discovers the device is gone at the very moment of transmission, what else can it do but drop it?
Special Case: PIM Register VIF
Next is processing logic specific to a particular protocol. If you have enabled CONFIG_IP_PIMSM (PIM Sparse Mode), you will encounter a special interface called VIFF_REGISTER.
#ifdef CONFIG_IP_PIMSM
    if (vif->flags & VIFF_REGISTER) {
        vif->pkt_out++;
        vif->bytes_out += skb->len;
        vif->dev->stats.tx_bytes += skb->len;
        vif->dev->stats.tx_packets++;
        ipmr_cache_report(mrt, skb, vifi, IGMPMSG_WHOLEPKT);
        goto out_free;
    }
#endif
This isn't just forwarding. The purpose of the PIM Register VIF is to encapsulate the multicast packet inside a PIM Register message and send it to the RP (Rendezvous Point). Therefore, ipmr_cache_report() is called here to pass the entire packet (IGMPMSG_WHOLEPKT) up to the user-space routing daemon, which takes care of the subsequent encapsulation and transmission. The kernel itself doesn't handle this complex encapsulation directly; it's just a courier.
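For context on what the daemon receives: the kernel prepends a struct igmpmsg header (declared in include/uapi/linux/mroute.h) to the packet it queues on the daemon's mroute socket. Below is our own userspace mirror of that layout, for illustration only; the struct name is ours, the field names follow the uapi header:

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>

/* Mirror of struct igmpmsg from include/uapi/linux/mroute.h: the header
 * the kernel places in front of a packet it passes up to the multicast
 * routing daemon via ipmr_cache_report(). */
struct igmpmsg_mirror {
    uint32_t unused1, unused2;
    unsigned char im_msgtype;   /* e.g. IGMPMSG_WHOLEPKT for register traffic */
    unsigned char im_mbz;       /* must be zero */
    unsigned char im_vif;       /* the VIF index involved */
    unsigned char unused3;
    struct in_addr im_src;      /* packet's source address */
    struct in_addr im_dst;      /* multicast group address */
};
```

The daemon reads this header, sees im_msgtype == IGMPMSG_WHOLEPKT, and knows the payload that follows is the complete multicast packet it must wrap in a PIM Register message.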
Route Lookup: Tunnel vs. Physical Interface
Next comes the crucial step of deciding "where does this packet go next?" There is a major fork in the road here: is the VIF you're sending to a tunnel, or a regular physical NIC?
If it's a tunnel (VIFF_TUNNEL):
We need to look up a unicast route for the tunnel itself. Note that the destination here is not the multicast group address, but the tunnel's remote address.
    if (vif->flags & VIFF_TUNNEL) {
        rt = ip_route_output_ports(net, &fl4, NULL,
                                   vif->remote, vif->local,
                                   0, 0,
                                   IPPROTO_IPIP,
                                   RT_TOS(iph->tos), vif->link);
        if (IS_ERR(rt))
            goto out_free;
        encap = sizeof(struct iphdr);
See IPPROTO_IPIP? This shows that a tunnel is essentially IP-in-IP encapsulation. The rt found here is the route pointing to the router at the other end of the tunnel. Because a new IP header needs to be added, encap is set to sizeof(struct iphdr) to reserve headroom.
If it's a regular physical interface:
This is more straightforward. We're looking for the route to the multicast group address.
    } else {
        rt = ip_route_output_ports(net, &fl4, NULL, iph->daddr, 0,
                                   0, 0,
                                   IPPROTO_IPIP, /* note: the argument is IPIP here too, but a physical interface takes the regular path */
                                   RT_TOS(iph->tos), vif->link);
        if (IS_ERR(rt))
            goto out_free;
    }

    dev = rt->dst.dev;
Once we have rt, we know the final physical device dev that will transmit the packet.
MTU Check: The Cruel Black Hole
If you're doing unicast forwarding and a packet with the DF bit set is too large for the outgoing link, what happens? The kernel sends an ICMP Fragmentation Needed (Type 3, Code 4) message back to the host, telling it "the packet is too big, fragment it."
But in multicast? Do nothing.
Look at this code:
    if (skb->len+encap > dst_mtu(&rt->dst) && (ntohs(iph->frag_off) & IP_DF)) {
        /* Do not fragment multicasts. Alas, IPv4 does not
         * allow to send ICMP, so that packets will disappear
         * to blackhole.
         */
        IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
        ip_rt_put(rt);
        goto out_free;
    }
This is a highly counterintuitive design, but it's the result of careful deliberation. Imagine a multicast group with thousands of receivers. If a router starts sending ICMP error messages back to the source just because the MTU on one path shrank, it creates two problems:
- ICMP Storm: The source could be overwhelmed by a massive flood of ICMP messages.
- Impossible to satisfy everyone: Some paths have an MTU of 1500, others 1400. If the source fragments to 1400, it wastes bandwidth; if it sends at 1500, the smaller-MTU paths still can't handle it.
RFC 1122 forbids sending ICMP errors in response to datagrams destined to a multicast address, so the router must drop the packet silently: increment the statistics counter, then stay quiet. It's like a strict black hole that swallows oversized packets without making a sound.
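The drop condition itself is simple enough to distill into a tiny predicate. This is a sketch with made-up names and host-order arguments; the kernel, of course, works on the skb and dst directly:

```c
#include <assert.h>
#include <stdbool.h>

#ifndef IP_DF
#define IP_DF 0x4000  /* Don't Fragment flag in the IP header's frag_off field */
#endif

/* Mirrors the black-hole check in ipmr_queue_xmit(): drop when the packet
 * (plus any pending tunnel encapsulation) exceeds the route MTU and the
 * DF bit is set. Multicast is never fragmented here, and no ICMP error
 * goes back to the source. */
static bool multicast_mtu_drop(unsigned int skb_len, unsigned int encap,
                               unsigned int mtu, unsigned int frag_off)
{
    return skb_len + encap > mtu && (frag_off & IP_DF);
}
```

Note how the tunnel case makes the problem worse: a 1500-byte packet that fits the physical MTU no longer fits once encap adds the 20-byte outer header.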
Header Adjustment and Encapsulation
Next are the routine operations: reserving space for potential header expansion (skb_cow), and then binding the looked-up route dst to the skb.
    encap += LL_RESERVED_SPACE(dev) + rt->dst.header_len;

    if (skb_cow(skb, encap)) {
        ip_rt_put(rt);
        goto out_free;
    }

    vif->pkt_out++;
    vif->bytes_out += skb->len;

    skb_dst_drop(skb);
    skb_dst_set(skb, &rt->dst);
Then, TTL is decremented by 1. This is standard procedure for IP forwarding, whether unicast or multicast.
    ip_decrease_ttl(ip_hdr(skb));
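Notably, ip_decrease_ttl() doesn't recompute the header checksum from scratch; it patches it incrementally, the classic trick of adding htons(0x0100) with an end-around carry. Here is a userspace sketch of the same idea (the helper names are ours), with a full recomputation used only to verify the result:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Full one's-complement checksum over a header; returns 0 for a valid header. */
static uint16_t ip_checksum(const void *hdr, size_t len)
{
    const uint8_t *b = hdr;
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint16_t w;
        memcpy(&w, b + i, 2);   /* native-order words; fine for validation */
        sum += w;
    }
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Sketch of ip_decrease_ttl(): TTL shares a 16-bit word with the protocol
 * field, so lowering TTL by one lowers that word by 0x0100 in network order;
 * adding htons(0x0100) to the checksum field compensates exactly. */
static void decrease_ttl(uint8_t *hdr /* 20-byte IPv4 header */)
{
    uint16_t check;
    memcpy(&check, hdr + 10, 2);           /* checksum is bytes 10-11 */
    uint32_t sum = check + (uint32_t)htons(0x0100);
    check = (uint16_t)(sum + (sum >> 16)); /* end-around carry */
    memcpy(hdr + 10, &check, 2);
    hdr[8]--;                              /* TTL is byte 8 of the header */
}
```

The incremental form touches only two bytes of the header on every forwarded packet instead of re-summing all twenty.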
If it's tunnel mode, we also need to perform the actual encapsulation (ip_encap)—wrapping the old packet in a new IP header.
    if (vif->flags & VIFF_TUNNEL) {
        ip_encap(skb, vif->local, vif->remote);
        /* FIXME: extra output firewall step used to be here. --RR */
        vif->dev->stats.tx_packets++;
        vif->dev->stats.tx_bytes += skb->len;
    }
The lingering FIXME comment in the code is a relic of history. In early kernel versions, there were firewall hooks here, but as the Netfilter framework was unified, these scattered hook calls were removed. Now, everything uniformly goes through the Netfilter hook below.
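As a mental model of what ip_encap() produces, here is a hypothetical userspace version (our own helper, not kernel code). It prepends a fresh outer IPv4 header with protocol IPPROTO_IPIP, using vif->local and vif->remote as the outer source and destination; the real function also fills the IP ID and checksum via ip_select_ident() and ip_send_check(), which we skip here:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/* Simplified model of ip_encap(): wrap an inner IP packet in an outer
 * IPv4 header whose protocol field is IPPROTO_IPIP, so the tunnel peer
 * knows to decapsulate and recover the inner multicast packet. */
static size_t ipip_encap(uint8_t *out, const uint8_t *inner, size_t inner_len,
                         uint32_t saddr_be, uint32_t daddr_be,
                         uint8_t tos, uint8_t ttl)
{
    struct iphdr outer;
    memset(&outer, 0, sizeof(outer));
    outer.version  = 4;
    outer.ihl      = 5;                        /* 20-byte header, no options */
    outer.tos      = tos;                      /* copied from the inner header */
    outer.tot_len  = htons(sizeof(outer) + inner_len);
    outer.ttl      = ttl;                      /* copied from the inner header */
    outer.protocol = IPPROTO_IPIP;             /* payload is a full IP packet */
    outer.saddr    = saddr_be;                 /* vif->local  */
    outer.daddr    = daddr_be;                 /* vif->remote */
    /* checksum left 0 here; the kernel fills it via ip_send_check() */
    memcpy(out, &outer, sizeof(outer));
    memcpy(out + sizeof(outer), inner, inner_len);
    return sizeof(outer) + inner_len;
}
```

This also makes the earlier encap = sizeof(struct iphdr) concrete: the 20 bytes reserved as headroom are exactly this outer header.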
Calling Netfilter and Final Transmission
Finally, we tag it with FORWARDED and hand it off to Netfilter's NF_INET_FORWARD hook.
    IPCB(skb)->flags |= IPSKB_FORWARDED;
There is a long comment here citing RFC 1584. It explains a subtle design choice: a multicast router typically needs to both deliver packets to local users (if there are running multicast applications) and forward them out. To avoid forcing user-space applications to join every single interface, the kernel router itself should act as a receiver.
This comment reflects some of the trade-offs in early multicast implementations, but in the modern Linux kernel, this logic is primarily handled through the local delivery flag in ip_mr_input. ipmr_queue_xmit focuses on its core job—getting the packet out.
    NF_HOOK(NFPROTO_IPV4, NF_INET_FORWARD, skb, skb->dev, dev,
            ipmr_forward_finish);
    return;

out_free:
    kfree_skb(skb);
}
If the Netfilter hook allows it (NF_ACCEPT), the flow proceeds to ipmr_forward_finish.
ipmr_forward_finish() and dst_output
This function is shockingly short; it's almost a clone of ip_forward_finish:
static inline int ipmr_forward_finish(struct sk_buff *skb)
{
    struct ip_options *opt = &(IPCB(skb)->opt);

    IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
    IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);

    if (unlikely(opt->optlen))
        ip_forward_options(skb);

    return dst_output(skb);
}
It does three things:
- Updates statistics (how many packets and bytes were sent out).
- Handles legacy IP Options (rarely seen nowadays, but kept for compatibility).
- Calls dst_output().
dst_output() invokes the route's output callback (for IPv4, ip_output, or ip_mc_output for multicast routes), which ultimately reaches the neighbor subsystem's transmit function and pushes the packet into the NIC driver's transmit queue (qdisc). If skb->dev is a physical NIC, the packet flies out as an Ethernet frame; if it's a tunnel, the packet is sent out as a regular IP packet, and the peer decapsulates it upon receipt, restoring the original multicast packet.
The TTL in Multicast Traffic
With this, the kernel's multicast routing transmit path is fully traversed. But before wrapping up this section, we need to step back and talk about a field that has been playing an important role in the background all along—TTL.
In the multicast world, TTL has two layers of meaning.
The first meaning is "hop limit", which is the same as in unicast. Every time it passes through a router, the TTL is decremented by 1, and it's dropped when it reaches 0. This prevents infinite storms caused by routing loops.
The second meaning is "scope threshold", which is unique to multicast. To prevent multicast traffic from flooding aimlessly across the entire internet, early multicast pioneer Steve Deering established a set of TTL-based "administrative boundary" rules. Each router interface has a threshold, and forwarding is only permitted when the packet's TTL is greater than this threshold.
This set of rules turned TTL values into codes representing geographic or administrative scopes:
- 0: Restricted to the local host (can't even leave the interface).
- 1: Restricted to the same subnet (can't pass through a router).
- 32: Restricted to the same site.
- 64: Restricted to the same region.
- 128: Restricted to the same continent.
- 255: Global scope.
If you're writing a multicast application, you can control how far your packets fly by setting the IP_MULTICAST_TTL socket option. Set it to 1, and your packets wander within the LAN; set it to 64, and theoretically they can traverse your ISP, though this depends on whether the ISP's routers are actually configured with thresholds following Deering's recommendations.
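For example, a sender can pin its traffic to the local subnet like this (a minimal sketch; the helper name is ours):

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Create a UDP socket whose multicast scope is "same subnet": packets sent
 * to a multicast group from this socket carry TTL 1, so no multicast router
 * will forward them. */
static int make_link_local_mcast_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    unsigned char ttl = 1;  /* 1 = stay on the local subnet */
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Raising the value simply moves the boundary outward; whether a TTL of 64 really maps to "same region" depends on routers being configured with matching interface thresholds.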
Although this mechanism is ancient (dating back to the 4.3BSD era), similar TTL check logic is still retained in modern PIM (Protocol Independent Multicast) protocols as the first line of defense.
Implementation Files
The Linux kernel's multicast routing implementation is primarily concentrated in the following three files. If you're interested in the source code, they're worth exploring:
- net/ipv4/ipmr.c: Core implementation.
- include/linux/mroute.h: Internal kernel header file.
- include/uapi/linux/mroute.h: User-space API (ioctl interface definitions).
This concludes our look at the kernel mechanisms of multicast routing. We started from the simple question of "who do we send this packet to?" and traced the path through mr_table, the MFC cache, and VIF devices, finally completing encapsulation and transmission in ipmr_queue_xmit.
But our journey isn't over yet. So far, all routing decisions have been fundamentally based on "destination." In the next chapter, we'll introduce a more powerful and complex mechanism—Policy Routing. There, routing decisions will no longer just look at dest, but will also consider source, fwmark, and even the packet's incoming interface. That's where advanced routing truly begins.