
4.8 Packet Forwarding

In the previous section, we reassembled the shattered mirror, watching those fragments come back together in ip_defrag() before finally being handed off to the transport layer.

But in the networking world, not all packets are destined for the local machine. Often, a Linux machine acts as a middleman: a router. A packet comes in from the left, and its destination is on the right. In this case, the kernel normally doesn't need to reassemble it (barring special cases); it just needs to make one decision: where should it go next?

This is forwarding.

The logic of forwarding seems simple (catch it, look up the table, send it out), but if you think it's just a "porter's" job, you're underestimating the kernel. Before sending a packet out the door, the kernel faces a host of thorny problems: Is this packet too large? Has its lifespan (TTL) expired? Is fragmentation prohibited? Or even, is this packet some kind of monster that the hardware has already merged together from several smaller packets?

Let's start with ip_forward() and see what really happens on this transit journey.


Core Function: ip_forward()

ip_forward() is the entry point of the forwarding path. After the kernel finishes looking up the routing table in ip_rcv_finish() and realizes "this isn't for me, it's for someone else," the packet ends up here.

First up is an old friend, grabbing the packet's header and routing information:

int ip_forward(struct sk_buff *skb)
{
    struct iphdr *iph; /* Our header */
    struct rtable *rt; /* Route we use */
    struct ip_options *opt = &(IPCB(skb)->opt);

But before we even start processing the logic, the first checkpoint arrives. This checkpoint isn't mandated by the protocol; it's the result of a compromise between hardware and software.

Intercepting LRO: The Trap of Hardware Optimization

You've probably heard of LRO (Large Receive Offload). It's a great thing—to reduce CPU burden, the NIC merges a stream of small packets (like small TCP segments) into one large packet and hands it to the kernel in one go. This is perfect for "receive and process locally" scenarios.

But in a forwarding scenario? It's a disaster.

Imagine the NIC, trying to save effort, merged 10 packets of 1500 bytes into one massive SKB. Now you need to forward it. You check the MTU of the outgoing interface—it's still 1500 bytes. This is awkward: you're holding a 15,000-byte monster, but the exit only allows 1500 bytes through.

Worse still, this monster may have been stitched together seamlessly by the NIC, making it very difficult for the kernel to cleanly tear it back apart.

So, the kernel's stance is clear: in the forwarding path, absolutely do not touch LRO packets.

if (skb_warn_if_lro(skb))
    goto drop;

Behind this check lies the designers' resignation: LRO was never designed with forwarding in mind. Merging can discard per-packet details, so the original packets cannot always be reconstructed. The later GRO (Generic Receive Offload) corrected this shortsightedness by merging only in ways that can be undone exactly, which makes GRO packets safe to forward. LRO, as a legacy of an older era, must be ruthlessly intercepted here.

Router Alert: An Urgent Telegram for Routers

Next, a special option needs to be handled. The IPv4 header has an option called IPOPT_RA (Router Alert). When a packet carries this flag, it means: "Hey, all routers along the path, don't just blindly forward me—stop what you're doing and look at me!"

This is typically used in RSVP (Resource ReSerVation Protocol) or certain multicast scenarios.

How does the kernel handle this? It maintains a linked list called ip_ra_chain, which holds every Raw Socket that has set IP_ROUTER_ALERT via setsockopt(). ip_call_ra_chain() feeds the packet to all Sockets on the list.

You might ask: why not just send it to one Socket?

Because Raw Sockets don't have the concept of port numbers like TCP or UDP; they are protocol-based. If multiple Raw Sockets are interested in the same protocol (like IGMP), they all need to receive a copy.

if (IPCB(skb)->opt.router_alert && ip_call_ra_chain(skb))
    return NET_RX_SUCCESS;

Note that if ip_call_ra_chain() returns a non-zero value, meaning a Raw Socket on the list has taken the packet, ip_forward() returns immediately and the packet does not continue down the forwarding path.
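To make the option itself less abstract, here is a minimal user-space sketch of scanning an IPv4 options area for Router Alert. The type value 148 and the 4-byte length follow RFC 2113; the helper name find_router_alert() is hypothetical, and this is not how the kernel's own option parser is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_END          0   /* End of Option List */
#define OPT_NOOP         1   /* No Operation (one-byte padding) */
#define OPT_ROUTER_ALERT 148 /* Router Alert option type (RFC 2113) */

/*
 * Hypothetical helper: scan the options area of an IPv4 header and
 * return the offset of a well-formed Router Alert option, or -1 if
 * none is present or the option list is malformed.
 */
int find_router_alert(const uint8_t *opts, size_t len)
{
    size_t i = 0;

    assert(opts != NULL || len == 0);

    while (i < len) {
        uint8_t type = opts[i];

        if (type == OPT_END)
            break;
        if (type == OPT_NOOP) {
            i++;
            continue;
        }
        /* Every other option has a length byte covering type + length + data. */
        if (i + 1 >= len)
            return -1;
        uint8_t optlen = opts[i + 1];
        if (optlen < 2 || i + optlen > len)
            return -1;
        if (type == OPT_ROUTER_ALERT && optlen == 4)
            return (int)i;
        i += optlen;
    }
    return -1;
}
```

A caller would pass the bytes that follow the fixed 20-byte header; a non-negative result corresponds to the situation in which the kernel sets opt.router_alert.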

Basic Security Check: Is It Really for Us?

Although the routing table says "forward," we still need to double-check skb->pkt_type. This field is filled in when the NIC driver calls eth_type_trans().

If pkt_type is not PACKET_HOST (i.e., not destined for the local MAC address), it means this packet might be some kind of anomaly involving broadcast or multicast, or there's a link-layer issue. Regardless, since it's not for us, we shouldn't forward it either—just drop it.

if (skb->pkt_type != PACKET_HOST)
    goto drop;
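As a rough illustration of where that value comes from, the sketch below classifies a frame by its destination MAC address. The enum and function names are hypothetical stand-ins for the kernel's PACKET_* constants and eth_type_trans(), and the logic is simplified.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's PACKET_* packet types. */
enum pkt_class { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST, PKT_OTHERHOST };

/*
 * Classify a frame by its destination MAC, roughly the way the text
 * describes eth_type_trans() filling in skb->pkt_type.
 */
enum pkt_class classify_dest_mac(const uint8_t dst[6], const uint8_t own[6])
{
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    assert(dst != NULL && own != NULL);

    if (memcmp(dst, bcast, 6) == 0)
        return PKT_BROADCAST;
    if (dst[0] & 0x01)              /* group bit set: multicast */
        return PKT_MULTICAST;
    if (memcmp(dst, own, 6) == 0)
        return PKT_HOST;
    return PKT_OTHERHOST;           /* addressed to some other station */
}

/* The forwarding check then reduces to a single comparison. */
int may_forward(enum pkt_class c)
{
    return c == PKT_HOST;
}
```

Only frames addressed to the router's own MAC pass; broadcast, multicast, and frames meant for other stations are all refused.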

Race Against Time: The Judgment of TTL

Every IP packet has a countdown timer on its head: TTL (Time To Live). This is a "self-destruct mechanism" designed to prevent packets from looping infinitely in the network.

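Each time a packet passes through a router, its TTL must be decremented by 1. If it reaches 0, the router is obligated to terminate its life and send a death notice (ICMP Time Exceeded) to the sender. Because forwarding itself performs the decrement, a packet that arrives with a TTL of 1 would leave with 0, which is why the check tests for a value of 1 or less rather than exactly 0.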

if (ip_hdr(skb)->ttl <= 1)
    goto too_many_hops;

If the TTL is exhausted, the kernel updates the statistics counter and calls icmp_send() to send an error message:

too_many_hops:
    /* Tell the sender its packet died... */
    IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_INHDRERRORS);
    icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
    goto drop;
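To see how the TTL test plays out across a path, here is a small model that walks a packet through a chain of routers, applying the same comparison at each hop. The function name router_that_drops() is hypothetical; it only mimics the check, not the ICMP generation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: walk a packet through `routers` consecutive hops,
 * applying the same TTL test as ip_forward() at each one. Returns the
 * 1-based index of the router that drops the packet and would report
 * Time Exceeded, or 0 if every router forwards it.
 */
int router_that_drops(uint8_t ttl, int routers)
{
    assert(routers >= 0);

    for (int hop = 1; hop <= routers; hop++) {
        if (ttl <= 1)
            return hop;     /* would reach 0 after the decrement: drop here */
        ttl--;              /* forwarded: TTL decremented on the way out */
    }
    return 0;
}
```

A packet sent with TTL 3 is forwarded by the first two routers and dies at the third, which is the behavior tools like traceroute rely on.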

The Stubbornness of Strict Source Routing

Remember "Strict Source Routing" (SSRR) from the IP options? This is an extremely domineering requirement: the packet must travel exactly along the list of IP addresses specified, without a single misstep.

But reality often slaps you in the face. Suppose the packet has the strict routing option enabled (is_strictroute), but our routing subsystem's lookup reveals that the next hop is a gateway (rt_uses_gateway). This means we need to send the packet to the gateway first, not to the final destination.

This creates a conflict: strict routing doesn't allow passing through unlisted intermediate hops, but to get out the door, we have to go through the gateway.

What to do? It can't be done—just tell the sender "Strict Routing Failed."

rt = skb_rtable(skb);

if (opt->is_strictroute && rt->rt_uses_gateway)
    goto sr_failed;

The failure handling also sends an ICMP message:

sr_failed:
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_SR_FAILED, 0);
    goto drop;

MTU and DF: A Dilemma with No Way Out

Next comes the most classic and maddening check in the forwarding path: MTU (Maximum Transmission Unit).

We look up the MTU of the outgoing route (dst_mtu(&rt->dst)). If the packet in our hands is larger than this MTU, and the packet does not allow fragmentation (the DF flag IP_DF is set), what should we do?

  • Send it out without fragmenting? No, it exceeds the MTU, so the physical layer can't transmit it.
  • Send it out fragmented? No, it explicitly said Don't Fragment.

This is truly a dilemma. The only solution is to give up and tell the sender "you need to make the packet smaller (Fragmentation Needed)."

This is the core step of the PMTUD (Path MTU Discovery) mechanism: by dropping the packet and sending an ICMP message, it forces the sender to reduce the packet size.

if (unlikely(skb->len > dst_mtu(&rt->dst) &&
             !skb_is_gso(skb) && (ip_hdr(skb)->frag_off & htons(IP_DF)))
    && !skb->local_df) {
    IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
              htonl(dst_mtu(&rt->dst)));
    goto drop;
}

The skb_is_gso(skb) check in the code is there to bypass GSO (Generic Segmentation Offload) scenarios. Those packets may look huge, but they have not been segmented yet: they will be cut into MTU-sized pieces later in the transmit path, either in software or by the NIC. We shouldn't accidentally kill them here.

Before Making a Move, Make a Copy

If you survived all the checks above, congratulations—this packet is really about to be forwarded out.

But before making any changes, we must do one thing: copy the SKB (skb_cow).

Why? Because the upcoming operations (decrementing TTL, updating the checksum) will modify the IP header. And this SKB might be shared by others (for example, it was previously peeked at by some tap device), or its memory layout might not allow direct writes.

skb_cow ensures we have a safely writable copy (if we were the only ones using it originally, it might simply be made writable without actually copying the data—hence, Copy-on-Write).

/* We are about to mangle packet. Copy it! */
if (skb_cow(skb, LL_RESERVED_SPACE(rt->dst.dev)+rt->dst.header_len))
    goto drop;
iph = ip_hdr(skb);
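The copy-on-write idea can be sketched in a few lines of user-space C. Everything here (struct shared_data, struct handle, make_private()) is hypothetical; the real skb_cow() also guarantees headroom for the outgoing link-layer header, which this sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared buffer: several handles may point at one data block. */
struct shared_data {
    int           users;     /* how many handles reference this block */
    size_t        len;
    unsigned char bytes[64];
};

struct handle {
    struct shared_data *data;
};

/*
 * Make sure h owns its data exclusively before writing to it.
 * Returns 0 on success, -1 if a private copy could not be allocated.
 */
int make_private(struct handle *h)
{
    assert(h != NULL && h->data != NULL);

    if (h->data->users == 1)
        return 0;                   /* sole owner: nothing to copy */

    struct shared_data *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;

    memcpy(copy, h->data, sizeof(*copy));
    copy->users = 1;
    h->data->users--;               /* drop our reference to the shared block */
    h->data = copy;
    return 0;
}
```

After make_private() succeeds, the caller can decrement a TTL byte in its own copy without the other holder of the original data ever noticing.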

Decrementing TTL and Updating the Checksum

We finally reach this step. ip_decrease_ttl() decrements the TTL and conveniently updates the checksum as well.

Here's a detail: since we only changed one byte, why recalculate the entire checksum? Actually, the kernel leverages a mathematical trick mentioned in RFC 1624—there's no need to traverse the entire header; it only needs to perform a differential update based on the old checksum. Extremely efficient.

/* Decrease ttl after skb cow done */
ip_decrease_ttl(iph);
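The differential update is worth seeing with real numbers. The sketch below is not the kernel's ip_decrease_ttl(); it is a user-space illustration of the RFC 1624 formula, working on header words in host byte order, with hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Full one's-complement checksum over a header given as 16-bit words
 * in host order. The checksum word itself must be zero in the input.
 */
uint16_t full_checksum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    assert(words != NULL);
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Incremental update per RFC 1624, equation 3: HC' = ~(~HC + ~m + m'),
 * where m and m' are the old and new values of the changed 16-bit word.
 */
uint16_t update_checksum(uint16_t old_check, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The TTL is the high byte of the fifth header word, so decrementing it by one changes that word by 0x0100. Feeding the old and new word to update_checksum() gives the same result as recomputing over all ten words, at the cost of three additions.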

ICMP Redirect: The Good Neighbor Guide

The packet is about to be sent out, but the routing subsystem suddenly thinks: "Hey, actually, you don't need me to forward this. It's closer if you just go directly to that machine next door (Next Hop)."

This is the purpose of the ICMP Redirect message. If the route entry has the RTCF_DOREDIRECT flag set, the packet carries no source-route option (opt->srr), and IPsec isn't involved (skb_sec_path), the kernel will be kind enough to tell the sender: "Don't take the long way next time, just go directly to that IP."

/*
 * We now generate an ICMP HOST REDIRECT giving the route
 * we calculated.
 */
if (rt->rt_flags&RTCF_DOREDIRECT && !opt->srr && !skb_sec_path(skb))
    ip_rt_send_redirect(skb);

Priority: Who Goes First?

In the world of QoS (Quality of Service), packets have their ranks. Typically, the sending Socket sets a priority (SO_PRIORITY), and this priority travels along with the SKB all the way down to the driver layer.

But forwarded packets don't have a Socket at all (they came in from the outside)—so who determines their priority?

The kernel uses a table lookup here: rt_tos2priority. It maps the tos (Type of Service) field in the IP header to an internal priority value.

skb->priority = rt_tos2priority(iph->tos);
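A table lookup of this kind can be sketched as follows. The table contents and numeric priority values are assumptions modeled on the kernel's ip_tos2prio[] and TC_PRIO_* constants and should be checked against the source tree; the names tos_to_prio and tos_priority() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Priority bands; numeric values assumed to match the kernel's TC_PRIO_* constants. */
enum {
    PRIO_BESTEFFORT       = 0,
    PRIO_BULK             = 2,
    PRIO_INTERACTIVE_BULK = 4,
    PRIO_INTERACTIVE      = 6
};

/* One entry per value of the 4-bit TOS field (bits 1..4 of the TOS byte). */
static const uint8_t tos_to_prio[16] = {
    PRIO_BESTEFFORT,       PRIO_BESTEFFORT,
    PRIO_BESTEFFORT,       PRIO_BESTEFFORT,
    PRIO_BULK,             PRIO_BULK,
    PRIO_BULK,             PRIO_BULK,
    PRIO_INTERACTIVE,      PRIO_INTERACTIVE,
    PRIO_INTERACTIVE,      PRIO_INTERACTIVE,
    PRIO_INTERACTIVE_BULK, PRIO_INTERACTIVE_BULK,
    PRIO_INTERACTIVE_BULK, PRIO_INTERACTIVE_BULK
};

/* Map the TOS byte of an IPv4 header to an internal priority band. */
uint8_t tos_priority(uint8_t tos)
{
    unsigned idx = (tos & 0x1e) >> 1;   /* keep the TOS bits, drop precedence */

    assert(idx < 16);
    return tos_to_prio[idx];
}
```

Under these assumptions, a packet marked low-delay (TOS 0x10) lands in the interactive band, while one marked for throughput (TOS 0x08) is treated as bulk traffic.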

The Final Kick: Netfilter and ip_forward_finish

All logical processing is done, and the last step is to hand it off. Along the way, it must pass through Netfilter's NF_INET_FORWARD hook point (this is where a firewall's rules for forwarded traffic are applied).

return NF_HOOK(NFPROTO_IPV4, NF_INET_FORWARD, skb, skb->dev,
               rt->dst.dev, ip_forward_finish);

If the firewall lets it through, it enters ip_forward_finish(). There are no more tricks here—just update the statistics, handle IP options (if any), and then call dst_output() to push the packet into the transmit path.

static int ip_forward_finish(struct sk_buff *skb)
{
    struct ip_options *opt = &(IPCB(skb)->opt);

    IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
    IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);

    if (unlikely(opt->optlen))
        ip_forward_options(skb);

    return dst_output(skb);
}

With this, an IP packet has completed its transit mission. It was not reassembled, it was not seen by upper-layer protocols; it simply made a quiet loop through the kernel, had its TTL modified, and was pushed toward the next intersection. This is the non-stop work of a router.


Chapter Summary

In this chapter, we explored every aspect of the IPv4 protocol—from packet construction and header structure to complex IP option handling. We witnessed how the IPv4 protocol handler is registered into the kernel, and we walked through both the receive path and the transmit path in their entirety.

In this process, we had to face the reality of network fragmentation: when a packet exceeds the MTU, how does the kernel slice it up on the sending end and stitch it back together on the receiving end? We saw the slow-path and fast-path fragmentation and reassembly algorithms, and we learned about the security vulnerabilities introduced by the fragmentation mechanism (such as the Teardrop attack).

Finally, we studied IPv4's forwarding mechanism—turning a Linux machine into a router, shuttling packets between different interfaces. We saw which situations cause packets to be ruthlessly dropped, and which situations trigger ICMP redirect messages.

All of this may seem like just shuffling bits, but it is precisely these rules, precise down to the bit level, that form the cornerstone of interconnectivity in the modern Internet.

In the next chapter, we will turn our attention to the behind-the-scenes hero that decides exactly where a packet should go: the IPv4 Routing Subsystem.