4.5 Sending IPv4 Packets
Now, let's switch roles.
Until now, we've been the person at the door unpacking parcels (the receive path)—checking labels, tearing off wrapping, and deciding whether to accept the delivery or forward it to a neighbor. But networks aren't one-way streets; the Linux kernel also needs to actively send things out.
This is the transmit path.
When the transport layer (TCP or UDP) has data ready and wants to hand it off to the link layer for transmission, the IPv4 layer steps in. The goal of this section is to see exactly how the kernel packages a "transport layer invoice" into a standard "IPv4 parcel" and sends it on its way.
Here's an interesting point of divergence: TCP and UDP have fundamentally different attitudes toward sending data in the kernel. You'll see two completely different transmit flows, reflecting the design philosophies of the two protocols.
Two Paths, Two Personalities
Coming down from the transport layer, there are two main ways to send an IPv4 packet. This is clearly separated in the kernel source code (primarily in net/ipv4/ip_output.c).
The first path is for "worrywart" protocols—the classic example being TCP.
The method used is ip_queue_xmit().
Why is TCP called a worrywart? Because it cares deeply about segmentation. TCP has its own complex mechanism for handling data segmentation and doesn't want the IP layer interfering. So, when TCP calls ip_queue_xmit(), it usually comes down with an SKB that it has already decided how to segment.
By the way, ip_queue_xmit() isn't TCP's only exit. For example, when sending a SYN-ACK handshake packet, TCP uses another function, ip_build_and_send_pkt() (see tcp_v4_send_synack). Even within the same protocol, different scenarios call for different tools.
The second path is for "hands-off" protocols—the classic examples being UDP and ICMP.
The method used is ip_append_data().
UDP itself doesn't care about fragmentation; it doesn't even care if the packet is too large. It dumps the data to the IP layer and says, "Here you go, do whatever you want with it."
But here's a detail: the name ip_append_data() is actually misleading, because it does not send the packet. It simply prepares the data and appends it to a queue called sk_write_queue. The actual transmission is triggered by another function: ip_push_pending_frames().
This combo works like this: ip_append_data is responsible for packing everything to be shipped into boxes, sealing them, and piling them by the door; ip_push_pending_frames is the person who calls the courier to come pick them up.
However, after kernel 2.6.39, things changed a bit. In the pursuit of extreme performance (we'll cover lockless transmission later), UDP gained a new fast path using ip_make_skb(). This function merges the "packing" and "calling the courier" steps into one.
There's also an alternative route: Raw Sockets.
Some applications (like ping or nmap) prefer a DIY approach. They construct the IP header right there in user space. In this case, they enable the IP_HDRINCL option. For these packets, the kernel doesn't need ip_queue_xmit or ip_append_data at all; instead, it directly calls raw_send_hdrinc() and throws the ready-made packet to Netfilter's LOCAL_OUT hook:
static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
                           void *from, size_t length,
                           struct rtable **rtp,
                           unsigned int flags)
{
    ...
    /* The user already built the header, so hand it straight to the LOCAL_OUT hook */
    err = NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL,
                  rt->dst.dev, dst_output);
    ...
}
This is why you can manually specify the TTL with a command like ping -t 128 (on Linux): that IP header wasn't generated by the kernel at all; it was generated by you (via the ping tool).
Path 1: ip_queue_xmit() — TCP's Choice
Let's take the simpler path first: ip_queue_xmit(). This is usually TCP's home turf.
Right off the bat, this function has to answer one question: Where is this thing supposed to go?
int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
...
    /* First, confirm that we can route this packet */
    rt = (struct rtable *)__sk_dst_check(sk, 0);
The rtable object here is the routing subsystem's lookup result.
Case 1: The route cache isn't ready yet
If rt is NULL, it means this connection is just starting to send data, or the route cache has expired. We need to look up the routing table.
Before the lookup, there's a minor detour: Strict Source Routing.
Remember the "Strict Source Route" we discussed in the IP options section? If this option is enabled for the packet, its "destination address" isn't actually its final destination, but rather the first hop address specified in the options.
if (rt == NULL) {
    __be32 daddr;

    /* If options are present, use the address specified in the options */
    daddr = inet->inet_daddr;
    if (inet_opt && inet_opt->opt.srr)
        daddr = inet_opt->opt.faddr;
After obtaining the address (whether it's the normal daddr or the SSR hop address), we call ip_route_output_ports() to look up the route:
    /* If the lookup fails, the transport layer's retransmission
       mechanism will retry until it connects or times out */
    rt = ip_route_output_ports(sock_net(sk), fl4, sk,
                               daddr, inet->inet_saddr,
                               inet->inet_dport,
                               inet->inet_sport,
                               sk->sk_protocol,
                               RT_CONN_FLAGS(sk),
                               sk->sk_bound_dev_if);
    if (IS_ERR(rt))
        goto no_route;
    sk_setup_caps(sk, &rt->dst);
}
skb_dst_set_noref(skb, &rt->dst);
If the route lookup fails (e.g., the network is down), we jump to no_route, the packet is dropped, and -EHOSTUNREACH is returned. The upper-layer protocol (like TCP) is then responsible for retrying.
Case 2: Route found, but a conflict is detected
Here's a very subtle pitfall: what happens if "Strict Source Routing" is enabled, but the route resolves to a gateway?
Imagine this: you say "I must strictly follow this path (A→B→C)", but the routing lookup result says "You must first go through gateway G". This is a contradiction. The kernel will outright refuse to send the packet:
if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_uses_gateway)
    goto no_route;
This design makes sense—if we had to accommodate this scenario, the code logic would become extremely complex. It's better to just throw an error and force you to fix your routing table or IP options.
Building the Header
Alright, we've found the route. Now we can pack the box.
At this point, the SKB coming down from the transport layer has its skb->data pointer pointing at the transport layer header (like the TCP header). We need to move the pointer forward to make room for the IP header.
This step is handled by skb_push():
/* We know where it's going; allocate and build the IP header */
skb_push(skb, sizeof(struct iphdr) + (inet_opt ? inet_opt->opt.optlen : 0));
skb_reset_network_header(skb);
iph = ip_hdr(skb);
Next up is filling in the fields. There's a bit of bitwise operations here that might look dizzying at first glance:
*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
What is this doing? It fills the first 16 bits of the IP header (Version + IHL + Type of Service) all at once.
- 4 << 12: the version, 4, placed in the highest 4 bits.
- 5 << 8: the Internet Header Length (IHL), which defaults to 5 (i.e., 20 bytes), placed in the next 4 bits.
- inet->tos & 0xff: the Type of Service, filled into the lower 8 bits.
Then we handle the fragmentation flags (the DF bit):
if (ip_dont_fragment(sk, &rt->dst) && !skb->local_df)
    iph->frag_off = htons(IP_DF);
else
    iph->frag_off = 0;
If the "Don't Fragment" (DF) flag is set, we set the IP_DF flag bit (0x4000) in frag_off. Otherwise, we set it to 0.
After that, it's standard procedure: fill in the TTL, protocol number, and source/destination addresses:
iph->ttl = ip_select_ttl(inet, &rt->dst);
iph->protocol = sk->sk_protocol;
ip_copy_addrs(iph, fl4);
Finally, we must not forget the IP options. If there are options, the header length (IHL) changes:
if (inet_opt && inet_opt->opt.optlen) {
    /* The IHL field is in units of 4 bytes, so optlen is divided by 4 (shifted right by 2) */
    iph->ihl += inet_opt->opt.optlen >> 2;
    /* Fill in the options */
    ip_options_build(skb, &inet_opt->opt, inet->inet_daddr, rt, 0);
}
The line iph->ihl += inet_opt->opt.optlen >> 2 is crucial. The IHL field in the IP header counts "4-byte words." If your options are 20 bytes long (optlen = 20), a right shift by 2 bits gives 5; added to the base of 5, the IHL becomes 10, representing a header length of 40 bytes.
Sending It Out
The packet is built. The final step is to set the packet's ID (used for fragment reassembly) and hand it off to the next layer:
ip_select_ident_more(iph, &rt->dst, sk,
                     (skb_shinfo(skb)->gso_segs ?: 1) - 1);
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;

/* Off it goes */
res = ip_local_out(skb);
At this point, the mission of ip_queue_xmit is complete. This is TCP's most commonly used transmit path.
Path 2: ip_append_data() — UDP's Slow Path
The UDP world is slightly more complex.
Before diving into the code, I need to mention something you might not have heard of: UDP_CORK.
The name is quite vivid. Imagine pouring water (data) into a bottle. If you plug it with a cork (enabling the UDP_CORK option), the water can't flow out and accumulates inside. Only when you pull the cork does the water gush out all at once.
In the kernel, this corresponds to an optimization: when you have lots of small chunks of data to send, it's more efficient to accumulate them into one large packet rather than sending a separate packet for each chunk (lower protocol overhead). This feature was introduced in kernel 2.5.44.
The Logic of ip_append_data()
This function doesn't send directly. Its main job is to copy data from user space into the kernel and attach it to the socket's transmit queue.
Its function signature is quite long, with one parameter standing out:
int ip_append_data(struct sock *sk, struct flowi4 *fl4,
                   int getfrag(void *from, char *to, int offset, int len,
                               int odd, struct sk_buff *skb),
                   ...
This getfrag is a callback function. Since the data is still in user space, how does the kernel move it into the SKB?
- For UDP, this callback is usually ip_generic_getfrag().
- For ICMP, it's icmp_glue_bits().
It's like hiring a moving company—the workers they send (getfrag) are responsible for moving things from the old house (user-space data) into new boxes (SKBs).
Let's look at the code logic:
struct inet_sock *inet = inet_sk(sk);
int err;

/* If this is only a probe (e.g., for PMTU discovery), don't actually send data */
if (flags & MSG_PROBE)
    return 0;
Step 1: Initialize the cork
If this is the first send for this socket (the queue is empty), the kernel needs to initialize some state:
if (skb_queue_empty(&sk->sk_write_queue)) {
    /* Set up the cork, e.g., process the IP options */
    err = ip_setup_cork(sk, &inet->cork.base, ipc, rtp);
    if (err)
        return err;
} else {
    /* The queue already holds data, so this is a follow-up
       fragment and no transport header is needed */
    transhdrlen = 0;
}
ip_setup_cork() locks down information like IP options. Because once you start accumulating packets, if the route or options change midway, the first half of what you've accumulated might not match the second half. So during the "Cork" period, many parameters are frozen.
Step 2: Moving the data
The real heavy lifting is done by __ip_append_data(). This function is extremely complex and handles two scenarios:
- Hardware supports Scatter/Gather (NETIF_F_SG): if the NIC supports SG, the kernel is much happier, because it can use skb_shinfo(skb)->frags to map data pages directly into the SKB, without even needing to copy the data (zero-copy).
- Hardware doesn't support SG: then there's no choice but to copy the data honestly, usually by attaching fragments to skb_shinfo(skb)->frag_list.
There's another detail called MSG_MORE. If the user sends data with this flag (similar to the effect of UDP_CORK, telling the kernel "more data is coming"), __ip_append_data will be more strategic when allocating memory, trying to fill up an entire page.
return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
sk_page_frag(sk), getfrag,
from, length, transhdrlen, flags);
Path 3: ip_make_skb() — Modern UDP's Fast Path
The ip_append_data + ip_push_pending_frames combo mentioned above, while logically clear (accumulate first, send later), has a major problem in the multi-core era: locks.
To protect the sk->sk_write_queue queue, the traditional UDP transmit path often requires holding the socket lock. This is a bottleneck in multi-core, high-concurrency scenarios.
So, in 2.6.39, the kernel introduced a new API: ip_make_skb().
Its design philosophy is: can we avoid touching the socket's shared queue entirely?
It acts like a "temp worker." When UDP_CORK isn't needed (no need to accumulate packets), it assembles and encapsulates the SKB on the local stack or in temporary space, then hands it directly to ip_send_skb() for transmission.
The old path:
UDP → (lock) → ip_append_data → stuff into the queue → (unlock) → ip_push_pending_frames → send
The new path:
UDP → ip_make_skb (build locally) → ip_send_skb → send
This new path completely bypasses the queue that requires locking—hence the so-called "lockless transmit fast path."
Leaving the Local Host
Regardless of which path a packet takes, all of these flows converge on the same exit: ip_local_out() (or, in the case of raw sockets, a direct call to dst_output).
Our next stop presents a very practical problem: what if this parcel is too big and the courier (the NIC) can't carry it?
That is the topic of the next section: Fragmentation.
Before we dive into fragmentation, take a moment to think about this: since the TCP layer already avoids handing packets larger than the MSS to the IP layer, who is actually triggering IP-layer fragmentation? And aside from fragmentation, is there another way? See you in the next section.