6.3 Multicast Router

In the previous section, we dissected mr_table and mfc_cache in the kernel—that's the skeleton of multicast forwarding.

But a skeleton is lifeless. It lies there, waiting for someone to give it a push, to tell it which interfaces should participate in forwarding and where each packet should go.

In this section, we'll discuss that "pusher": the Multicast Router.

We'll see how it establishes a connection between user space and the kernel, along with that crucial handshake protocol—setsockopt.


Configuring the Multicast Router

Turning a Linux machine into a multicast router requires more than just kernel code.

First, you need to enable the CONFIG_IP_MROUTE option when compiling the kernel. It's like installing a four-wheel-drive kit on a car—without it, nothing else matters.

Second, you need to run a capable night watchman in user space.

We mentioned pimd or mrouted earlier. These daemons don't just run aimlessly; the very first thing they do is establish a "hotline" to the kernel.

The way this hotline is established is a classic pattern: creating a Raw Socket.

Take pimd as an example. During startup, it makes a system call like this:

socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);

This line of code essentially says: "Kernel, open a backdoor straight to the IGMP protocol layer. I don't want the TCP/UDP encapsulation—I want to send IGMP messages directly."
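
As a minimal sketch of that first step, the helper below wraps the same socket() call with basic error reporting. The name open_igmp_socket is a hypothetical one chosen for illustration, not taken from pimd. Raw sockets normally require root or CAP_NET_RAW, so expect the call to fail for an unprivileged user.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_IGMP */

/* Hypothetical helper: open the raw IGMP socket that a multicast
 * routing daemon uses as its control channel to the kernel.
 * Returns the descriptor, or -1 with errno set. */
int open_igmp_socket(void)
{
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
    if (fd < 0) {
        int saved = errno;   /* perror() may overwrite errno */
        perror("socket(AF_INET, SOCK_RAW, IPPROTO_IGMP)");
        errno = saved;
    }
    return fd;
}
```

A daemon would call this once at startup and keep the descriptor for every later setsockopt() command.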

Once it gets this socket file descriptor, the real control has only just begun.


The Handshake: MRT_INIT

With a socket in hand, how does the daemon tell the kernel "I want to enable multicast routing"?

The answer is: setsockopt().

This isn't a regular parameter setting—it's an identity declaration.

When the daemon calls setsockopt(sock, IPPROTO_IP, MRT_INIT, ...), the signal goes straight to the kernel's ip_mroute_setsockopt() function.

While processing the MRT_INIT command, the kernel does two extremely important things:

  1. Saving the secret handshake: The kernel carefully stores a reference to the current socket in the mroute_sk field of the mr_table structure. From this moment on, the kernel recognizes this socket as the sole "commander-in-chief".
  2. Flipping the switch: The kernel automatically sets the /proc/sys/net/ipv4/conf/all/mc_forwarding procfs entry to 1 (via the macro IPV4_DEVCONF_ALL(net, MC_FORWARDING)++).

Wait, there's a subtle detail worth noting here.

That mc_forwarding file is read-only in /proc. You can't change it with echo 1 > ....

Why? Because it's a state, not a configuration.

It's automatically determined by the kernel based on "whether a multicast routing daemon is running." It only flips when MRT_INIT occurs; once the daemon exits, it automatically resets to zero. This design prevents you from forcibly enabling forwarding without a daemon, which would cause unpredictable kernel behavior.
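
Here is a rough user-space sketch of the handshake, assuming the UAPI header <linux/mroute.h> is available. The helper names mrt_init and read_mc_forwarding are hypothetical. The first issues MRT_INIT with an int-sized option value, and the second reads the procfs entry back so you can watch the state change.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/mroute.h>   /* MRT_INIT */

/* Hypothetical helper: declare this socket as the multicast routing
 * control socket. Returns 0 on success, -1 with errno set. */
int mrt_init(int igmp_fd)
{
    int one = 1;   /* the option value is an int */
    return setsockopt(igmp_fd, IPPROTO_IP, MRT_INIT, &one, sizeof(one));
}

/* Hypothetical helper: read the current mc_forwarding state.
 * Returns the value, or -1 if the file cannot be read or parsed. */
int read_mc_forwarding(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/conf/all/mc_forwarding", "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling read_mc_forwarding() before and after a successful mrt_init() should show the value moving from 0 to 1, which is a handy sanity check when debugging a daemon.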


Exclusivity: Only One Boss at a Time

Since we mentioned mroute_sk, there's a hard rule here.

Only one multicast routing daemon can exist at any given time.

Imagine what would happen if two pimd instances were running simultaneously—one saying "turn left" and the other saying "turn right." Who would the kernel listen to?

So, when ip_mroute_setsockopt() handles MRT_INIT, the very first thing it does is check whether mroute_sk is already occupied:

    if (mrt->mroute_sk)
        return -EADDRINUSE;

If a socket has already taken the spot and you try to start a second daemon, the kernel will straight up throw an -EADDRINUSE (Address already in use) at you. This is the kernel saying: "This spot's taken. Beat it."
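
On the user-space side, a daemon can turn that error into something more helpful than a bare errno. The sketch below is one possible way to do it; the function name and the message text are made up for illustration.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map the errno left by a failed MRT_INIT to a
 * readable reason. Only EADDRINUSE gets special treatment here;
 * everything else falls back to strerror(). */
const char *mrt_init_failure_reason(int err)
{
    if (err == EADDRINUSE)
        return "another multicast routing daemon already owns this table";
    return strerror(err);
}
```

A daemon would typically log this message and exit rather than retry, since the slot stays taken for as long as the other daemon is running.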


Adding Interfaces: MRT_ADD_VIF

Once initialization is complete, the daemon's next task is to register physical network interfaces into the multicast routing table.

This is no longer just shouting slogans—it's about issuing an "admission ticket" for each network interface.

The way to issue these tickets is again through setsockopt(), but this time the command changes to MRT_ADD_VIF (Virtual Interface Add).

You need to fill out a form—struct vifctl.

Here's what that "form" looks like:

struct vifctl {
    vifi_t vifc_vifi;             /* index ID of the virtual interface */
    unsigned char vifc_flags;     /* VIFF_ flags */
    unsigned char vifc_threshold; /* TTL threshold */
    unsigned int vifc_rate_limit; /* rate limiter value (not implemented) */
    union {
        struct in_addr vifc_lcl_addr; /* local interface IP address */
        int vifc_lcl_ifindex;         /* local interface index */
    };
    struct in_addr vifc_rmt_addr; /* remote tunnel address (tunnels only) */
};

The fields in this structure are quite minimal, but each one hits the nail on the head. Let's break down a few key ones:

  • vifc_vifi: This is the "codename" you assign to this interface. The kernel doesn't care what you call it, as long as it's a unique integer between 0 and 31.
  • vifc_flags: This determines the nature of the admission ticket.
    • VIFF_TUNNEL: If you set this flag, it means you want to encapsulate packets inside another IP packet for transmission (IPIP tunnel). This is commonly used to traverse public networks that don't support multicast.
    • VIFF_REGISTER: This is a flag dedicated to PIM-SM (Sparse Mode), used to register a special interface.
    • VIFF_USE_IFINDEX: By default, we use IP addresses to identify interfaces. But modern machines have multiple NICs, and IP addresses can be unreliable (a NIC might not have an IP configured). If you set this flag, the kernel reads the union as vifc_lcl_ifindex (the network interface index) and uses that to locate the device instead. This is an advanced feature introduced after kernel 2.6.33.
  • vifc_lcl_addr vs vifc_lcl_ifindex: This is a union. In other words, you either use an IP address to find the device or use the interface index—pick one. We recommend using the index; it's more reliable.
  • vifc_rmt_addr: If it's tunnel mode, this holds the IP address of the machine at the other end of the tunnel.

Once the daemon fills out this structure and throws it to the kernel via setsockopt, the kernel calls vif_add() to hang this device on the routing table.

What about deletion?

Change the command to MRT_DEL_VIF, fill in the vifc_vifi, and the kernel will call vif_delete() to take it down.
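
Filling out and submitting the form might look like the sketch below, which takes the index-based route via VIFF_USE_IFINDEX. The helper names (fill_vifctl_by_ifindex, add_vif, del_vif) are hypothetical, and the code assumes <linux/mroute.h> provides struct vifctl and the MRT_ and VIFF_ constants.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/mroute.h>   /* struct vifctl, MRT_ADD_VIF, MRT_DEL_VIF */

/* Hypothetical helper: fill the form for a non-tunnel vif that is
 * identified by interface index instead of by IP address. */
void fill_vifctl_by_ifindex(struct vifctl *vc, vifi_t vifi,
                            int ifindex, unsigned char ttl_threshold)
{
    memset(vc, 0, sizeof(*vc));
    vc->vifc_vifi = vifi;               /* the codename, 0..31 */
    vc->vifc_flags = VIFF_USE_IFINDEX;  /* read the union as an ifindex */
    vc->vifc_threshold = ttl_threshold;
    vc->vifc_lcl_ifindex = ifindex;
}

/* Hypothetical helper: hand the form to the kernel. */
int add_vif(int igmp_fd, const struct vifctl *vc)
{
    return setsockopt(igmp_fd, IPPROTO_IP, MRT_ADD_VIF, vc, sizeof(*vc));
}

/* Hypothetical helper: deletion only needs the vif index filled in,
 * but the whole structure is still passed. */
int del_vif(int igmp_fd, vifi_t vifi)
{
    struct vifctl vc;

    memset(&vc, 0, sizeof(vc));
    vc.vifc_vifi = vifi;
    return setsockopt(igmp_fd, IPPROTO_IP, MRT_DEL_VIF, &vc, sizeof(vc));
}
```

The interface index can be obtained with if_nametoindex() from <net/if.h>, for example if_nametoindex("eth0") on a machine that has an interface by that name.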


Wrapping Up: MRT_DONE

When the daemon decides to retire through a normal exit, it must finish what it started. (If it is killed instead, it never gets the chance; the kernel then runs the same cleanup when the socket is closed.)

It makes one final call to setsockopt, with the command MRT_DONE.

This triggers the kernel's mrtsock_destruct() method.

The name is quite telling—destruct (destructor).

It does two things:

  1. Clears the mroute_sk in mr_table (sets it to NULL), signaling "the boss is gone, there's no leader now."
  2. Changes the state of mc_forwarding back to 0, turning off multicast forwarding.

At this point, the machine degrades from a "router" back to an ordinary "host".
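
The farewell call is the shortest of the lot. In this sketch the helper name mrt_done is hypothetical, and MRT_DONE is assumed to come from <linux/mroute.h>. No option value is needed, so a NULL pointer with length 0 is passed.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/mroute.h>   /* MRT_DONE */

/* Hypothetical helper: tell the kernel this socket is giving up its
 * role as the multicast routing control socket.
 * Returns 0 on success, -1 with errno set. */
int mrt_done(int igmp_fd)
{
    return setsockopt(igmp_fd, IPPROTO_IP, MRT_DONE, NULL, 0);
}
```

A tidy daemon would call this from its shutdown path and then close() the socket.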


The Vif Device (Virtual Interface Device)

The vifctl we mentioned in the previous section is the "application form" filled out by user space.

What the kernel actually uses is struct vif_device. That's the real "object" living inside the kernel and doing the work.

Multicast routing supports two modes: direct multicast and tunnel multicast.

Regardless of the mode, the kernel uses the same structure—vif_device—to represent it.

You can think of it as a high-level network driver abstraction. If it's in tunnel mode (VIFF_TUNNEL is set), its dev pointer might point to that virtual tunl0 device.

Let's see what it actually looks like:

struct vif_device {
    struct net_device *dev;            /* the real network device in use */
    unsigned long bytes_in, bytes_out;
    unsigned long pkt_in, pkt_out;     /* statistics: packets and bytes received and sent */
    unsigned long rate_limit;          /* traffic shaping (not yet implemented) */
    unsigned char threshold;           /* TTL threshold */
    unsigned short flags;              /* control flags */
    __be32 local, remote;              /* addresses: local is the local address, remote is the tunnel's far end */
    int link;                          /* index of the underlying physical interface */
};

Most fields are self-explanatory—it's essentially an enhanced network interface descriptor.

There are two details to note here:

  1. dev_set_allmulti(dev, 1): When vif_add() is called to add an interface, the kernel not only allocates the vif_device structure but also calls dev_set_allmulti(dev, 1). The purpose of this line of code is to increment the allmulti counter on the underlying physical NIC by 1. It's like telling the NIC driver: "Hey, from now on, don't just help me receive unicast packets. Pass up all multicast packets that come through—I'll sort them out myself." Without this step, the NIC would drop multicast packets not destined for the local machine at the hardware level, and the kernel would never get a chance to forward them.

  2. Cleanup work: When vif_delete() is called, it must call dev_set_allmulti(dev, -1) to decrement the counter back. It's a matter of politeness—"I'm not doing multicast anymore, so stop blindly receiving all those junk packets and go back to power-saving mode."
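
Why a counter instead of a plain on/off flag? The toy model below is user-space code, not the kernel implementation, and its struct and function names are invented for illustration. It shows the idea: several users can each ask for all-multicast mode, and the mode only switches off when the last one lets go.

```c
#include <stdbool.h>

/* Toy stand-in for the relevant part of a network device. */
struct toy_netdev {
    int allmulti;        /* how many users asked for all-multicast */
    bool flag_allmulti;  /* what the NIC is actually told to do */
};

/* Toy version of the counting idea behind dev_set_allmulti(dev, inc):
 * adjust the counter, then derive the on/off state from it. */
void toy_set_allmulti(struct toy_netdev *dev, int inc)
{
    dev->allmulti += inc;
    dev->flag_allmulti = dev->allmulti > 0;
}
```

If two vifs sit on the same device and one is deleted, the counter drops from 2 to 1 and the NIC keeps receiving all multicast traffic for the remaining one. That is exactly why vif_delete() has to pair every +1 with a -1.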


The Host Perspective: Joining and Leaving Groups

Everything we discussed above is what a "router" does—forwarding packets.

But if this machine also wants to watch a multicast stream (for example, it's a video server that both forwards streams and needs to decode the video itself), it has to switch back to "host" mode and join a multicast group.

This process is initiated from the application layer.

If you write network programming in C, this is standard practice:

  1. Create a socket.
  2. Call setsockopt(socket, IPPROTO_IP, IP_ADD_MEMBERSHIP, ...).

The IP_ADD_MEMBERSHIP here is essentially an application form.

The application layer prepares a struct ip_mreq structure (defined in <netinet/in.h>) and fills in two things:

  • imr_multiaddr: The address of the multicast group you want to join (e.g., 239.1.1.1).
  • imr_interface: The identity of your local NIC (IP address), telling the kernel "I want to use this NIC to listen."

After this system call enters the kernel, it ultimately lands in the ip_mc_join_group() method within net/ipv4/igmp.c.

The core logic of ip_mc_join_group() is simple: It hangs the multicast address you want to join onto the mc_list linked list of the network interface structure (in_device).

At the same time, the kernel sends a "report" to the network via the IGMP protocol, telling the router: "Hey, I'm here, I'm interested in 239.1.1.1, send traffic this way."

Of course, what goes in must come out.

When you no longer want to watch, call setsockopt and specify IP_DROP_MEMBERSHIP. The kernel will call ip_mc_leave_group() to remove that address from mc_list and notify the router "I've left the group."


The Limit: A Maximum of 20

There's a limit here that you might not have noticed.

A single socket can join a maximum of 20 multicast groups.

This limit comes from the kernel variable sysctl_igmp_max_memberships, whose default value is 20. It is a tunable default rather than a truly hard-coded constant.

If you get greedy and try to join a 21st group with the same socket, ip_mc_join_group() will mercilessly return -ENOBUFS (No buffer space available).

Why 20?

It's a product of historical baggage and resource balancing. Each membership consumes kernel memory and requires maintaining timers. For ordinary applications, 20 is more than enough; if you really need to join hundreds of groups (like when building a multicast gateway), you should either open multiple sockets to share the load or raise the value in /proc/sys/net/ipv4/igmp_max_memberships.
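
If you go the multiple-socket route, the arithmetic is simple. The sketch below uses two hypothetical helpers: one reads the current limit from procfs, and the other works out how many sockets a given number of groups would need.

```c
#include <stdio.h>

/* Hypothetical helper: read the per-socket membership limit.
 * Returns the value, or -1 if the file cannot be read or parsed. */
int read_igmp_max_memberships(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/igmp_max_memberships", "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical helper: number of sockets needed to hold `groups`
 * memberships at `per_socket_limit` each (rounds up). */
int sockets_needed(int groups, int per_socket_limit)
{
    if (groups <= 0 || per_socket_limit <= 0)
        return 0;
    return (groups + per_socket_limit - 1) / per_socket_limit;
}
```

With the default limit of 20, joining 100 groups takes 5 sockets, and joining 21 groups already takes 2.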

At this point, we have a clear picture: There's a routing table in the kernel, a daemon in user space, and they're connected via setsockopt.

But what's the point of just configuring it? When a packet actually arrives, how does it flow?

In the next section, we'll dive into the Rx Path and see how, when a multicast packet arrives at the NIC, the kernel step by step "delivers" it to the correct output queue.