11.3 UDP (User Datagram Protocol)
Remember those fields we saw in the msghdr structure in the previous section? msg_iov stores data, and msg_control stores auxiliary information. At the time, you might have thought they were just boring data structure definitions.
Now, these structures are about to get to work.
Let's set aside the most complex part (TCP) for now and start with the "simplest" protocol in the transport layer: UDP (User Datagram Protocol). We call it simple because it does almost nothing extra: it doesn't guarantee delivery, doesn't guarantee ordering, and doesn't establish a connection at all. It's like a thin sheet of wrapping paper over the IP layer, adding little more than the concept of a "port number."
But precisely because of its simplicity, it's the best entry point for understanding data flow in the kernel network stack.
UDP Protocol Overview and Header Structure
The UDP protocol was finalized as early as 1980 in RFC 768. Its design philosophy is "best effort." Many application-layer protocols that demand real-time performance but are less sensitive to packet loss run over UDP, such as RTP (Real-time Transport Protocol) commonly used in VoIP. Dropping a few frames of audio or video just causes a brief glitch or stutter, which is far better than introducing a multi-second delay for retransmission. Although RFC 4571 states that RTP can also run over TCP, that's a workaround for special scenarios like firewall traversal, not the mainstream approach.
Further Reading: UDP-Lite
You might encounter something called UDP-Lite (RFC 3828). It's a variant of UDP that allows computing a checksum over only a portion of the packet (Partial Checksum). This is useful in certain wireless scenarios where, even if the data payload is corrupted, we're still willing to accept and process the packet as long as the header is correct. Most of its implementation reuses UDP code, with the main logic in net/ipv4/udplite.c, but you'll often see its shadow in udp.c as well.
Whether it's standard UDP or UDP-Lite, their header length is fixed at 8 bytes. It's defined in the kernel as follows (include/uapi/linux/udp.h):
struct udphdr {
    __be16  source; // source port
    __be16  dest;   // destination port
    __be16  len;    // length in bytes, header included
    __sum16 check;  // checksum
};
Figure 11-1: IPv4 UDP Header Structure (Shows the layout with source port, destination port, length, and checksum each occupying 16 bits)
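To make the layout concrete, here is a minimal user-space sketch that packs the four 16-bit fields in network byte order. The names toy_udphdr and toy_udp_pack are illustrative and are not kernel symbols, and the checksum is simply left at 0, which over IPv4 means "no checksum computed."

```c
#include <arpa/inet.h>  /* htons */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* User-space mirror of the kernel's struct udphdr (illustrative name). */
struct toy_udphdr {
    uint16_t source; /* source port, network byte order */
    uint16_t dest;   /* destination port, network byte order */
    uint16_t len;    /* header + payload length, network byte order */
    uint16_t check;  /* checksum; 0 means "not computed" over IPv4 */
};

/* Write an 8-byte UDP header for payload_len bytes of data into out.
 * Returns the number of bytes written (always 8). */
static size_t toy_udp_pack(uint8_t out[8], uint16_t sport, uint16_t dport,
                           uint16_t payload_len)
{
    struct toy_udphdr h;

    /* Header plus payload must fit in the 16-bit len field. */
    assert(payload_len <= 0xFFFF - sizeof(h));

    h.source = htons(sport);
    h.dest   = htons(dport);
    h.len    = htons((uint16_t)(sizeof(h) + payload_len));
    h.check  = 0;
    memcpy(out, &h, sizeof(h));
    return sizeof(h);
}
```

For a 10-byte payload the len field holds 18, since the length always counts the 8 header bytes as well.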
Although the header is only 8 bytes long, the kernel needs to do quite a bit of preparation work at startup to correctly fill in those 8 bytes and send them out.
Initialization: Registering with the Kernel
For the UDP protocol to work, it must "register" in two core tables within the kernel.
1. Registering with the Network Layer Protocol Table
First, the kernel defines a udp_protocol object. This is a net_protocol structure whose job is to tell the network layer (IP layer): "Hey, if you receive a packet with protocol number IPPROTO_UDP, please call the handler callback function here." The registration happens in the inet_init() function during system initialization.
static const struct net_protocol udp_protocol = {
    .handler     = udp_rcv,  // receive handler
    .err_handler = udp_err,  // error handler (e.g., for ICMP errors)
    .no_policy   = 1,
    .netns_ok    = 1,        // network-namespace aware
};
In inet_init(), it's added to the global inet_protos array via inet_add_protocol():
static int __init inet_init(void)
{
    ...
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    ...
}
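The idea behind this registration can be modeled in user space as a table of handlers indexed by protocol number. The sketch below is a simplified model, not the kernel's implementation: every toy_ name is hypothetical, and locking and RCU are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PROTOS  256  /* the IPv4 protocol field is 8 bits wide */
#define TOY_IPPROTO_UDP 17   /* IANA protocol number for UDP */

struct toy_net_protocol {
    int (*handler)(const void *pkt); /* receive callback */
};

static const struct toy_net_protocol *toy_protos[TOY_MAX_PROTOS];

/* Register a handler; returns -1 if the slot is already taken. */
static int toy_add_protocol(const struct toy_net_protocol *p, unsigned char num)
{
    assert(p != NULL && p->handler != NULL);
    if (toy_protos[num] != NULL)
        return -1;
    toy_protos[num] = p;
    return 0;
}

/* Dispatch by protocol number, as the IP layer does on local delivery.
 * Returns -1 when nothing is registered for that number. */
static int toy_deliver(unsigned char num, const void *pkt)
{
    const struct toy_net_protocol *p = toy_protos[num];

    return p ? p->handler(pkt) : -1;
}

/* Stand-in for udp_rcv(): accepts every packet. */
static int toy_udp_rcv(const void *pkt)
{
    (void)pkt;
    return 0;
}

static const struct toy_net_protocol toy_udp_protocol = {
    .handler = toy_udp_rcv,
};
```

Calling toy_add_protocol(&toy_udp_protocol, TOY_IPPROTO_UDP) once makes toy_deliver(17, pkt) reach toy_udp_rcv(). A second registration for the same number is refused.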
2. Registering with the Socket Operations Table
Having a packet reception entry point isn't enough. User-space programs interact with the kernel through system calls like socket() and sendmsg(). The kernel needs to know: when a user creates a socket of type SOCK_DGRAM, where should the specific operations (like .sendmsg) be mapped?
This is accomplished through proto_register(&udp_prot, 1). The udp_prot structure is filled with callback function pointers:
struct proto udp_prot = {
    .name       = "UDP",
    .close      = udp_lib_close,
    .connect    = ip4_datagram_connect,
    .disconnect = udp_disconnect,
    .ioctl      = udp_ioctl,
    .setsockopt = udp_setsockopt,
    .getsockopt = udp_getsockopt,
    .sendmsg    = udp_sendmsg,  // the one to focus on
    .recvmsg    = udp_recvmsg,
    .sendpage   = udp_sendpage,
    ...
};
Note: The UDP protocol, along with other core protocols, is initialized at startup via the inet_init() function.
Once initialization is complete, everything is in place. Next, let's see what actually happens in the kernel when a user calls sendmsg to send data.
Sending Packets: A Deep Dive into udp_sendmsg
When a user-space program calls send() or sendmsg() to send UDP data, it ultimately lands on the kernel's udp_sendmsg() function.
int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len)
{
Here we need to introduce a very useful concept: UDP_CORK.
By default, UDP packets are "fire-and-forget": each send call becomes its own datagram, so if you hand the kernel 10 bytes of data, it immediately sends a datagram carrying exactly those 10 bytes. This is fine in most cases. But if you need to combine multiple small write operations into a single larger UDP datagram (for example, when the application layer assembles data in chunks), you need a way to hold data back until the message is complete.
There are two ways to enable this behavior:
- Set the UDP_CORK socket option (introduced in Kernel 2.5.44).
- Pass the MSG_MORE flag in the sendmsg flags.
At the beginning of udp_sendmsg, the kernel first checks whether it needs to "cork" the bottle:
int corkreq = up->corkflag || msg->msg_flags & MSG_MORE;
struct inet_sock *inet = inet_sk(sk);
...
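From user space, the MSG_MORE variant can be observed with a small experiment over loopback. The following is a sketch under the assumption of a Linux host with a working loopback interface. demo_msg_more is a hypothetical helper name, while the socket calls themselves are the standard ones.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Send "hello " with MSG_MORE and then "world" without it, and return the
 * size of the first datagram the receiver sees (-1 on any socket error).
 */
static int demo_msg_more(void)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    char buf[64];
    ssize_t n = -1;
    int rx, tx;

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0 ||
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        goto out;

    /* The first chunk is held back on the sender's write queue. */
    if (sendto(tx, "hello ", 6, MSG_MORE,
               (struct sockaddr *)&addr, sizeof(addr)) != 6)
        goto out;
    /* The second call, without MSG_MORE, pushes both chunks out together. */
    if (sendto(tx, "world", 5, 0,
               (struct sockaddr *)&addr, sizeof(addr)) != 5)
        goto out;

    n = recv(rx, buf, sizeof(buf), 0);
    assert(n <= (ssize_t)sizeof(buf));
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return (int)n;
}
```

The receiver should get one datagram containing both chunks rather than two separate ones. Setting UDP_CORK with setsockopt() and clearing it again achieves the same effect without passing a flag on every call.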
Next come the routine checks. For example, the data length len cannot exceed 65535 bytes. Why? Because the len field in the UDP header is only 16 bits, which can represent a maximum of 65535.
if (len > 0xFFFF)
    return -EMSGSIZE;
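This check is only the coarse limit. Over IPv4 the practical ceiling is lower, because the IP total-length field is also 16 bits and has to cover the IP header and the UDP header too. A short arithmetic sketch (the toy_ names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

enum {
    TOY_IP_MAX_TOTAL = 0xFFFF, /* 16-bit IPv4 total-length field */
    TOY_IP_HDR_MIN   = 20,     /* IPv4 header without options */
    TOY_IP_HDR_MAX   = 60,     /* IPv4 header with the maximum of options */
    TOY_UDP_HDR      = 8       /* fixed UDP header */
};

/* Largest UDP payload that fits in one IPv4 datagram whose IP header
 * is ip_hdr_len bytes long. */
static size_t toy_udp_max_payload(size_t ip_hdr_len)
{
    assert(ip_hdr_len >= TOY_IP_HDR_MIN && ip_hdr_len <= TOY_IP_HDR_MAX);
    return (size_t)TOY_IP_MAX_TOTAL - ip_hdr_len - TOY_UDP_HDR;
}
```

With a plain 20-byte IP header this gives 65535 - 20 - 8 = 65507 bytes of payload.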
Now, we need to know who we're sending to. This involves determining the destination address and port to build the flowi4 object needed for route lookup. The destination port must not be 0: port 0 is reserved in IANA's assigned numbers (a convention going back at least to RFC 1010), so udp_sendmsg() rejects it with -EINVAL.
There are two scenarios here:
Scenario A: Directly Specifying the Destination Address
The user passes a sockaddr_in structure in msg->msg_name.
if (msg->msg_name) {
    struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
    if (msg->msg_namelen < sizeof(*usin))
        return -EINVAL;
    if (usin->sin_family != AF_INET) {
        if (usin->sin_family != AF_UNSPEC)
            return -EAFNOSUPPORT;
    }
    daddr = usin->sin_addr.s_addr;
    dport = usin->sin_port;
    // a destination port of 0 is illegal
    if (dport == 0)
        return -EINVAL;
Scenario B: Using a Connected Socket
If the user doesn't specify an address in msg_name, this socket must have previously called connect(). In this case, the socket's state is marked as TCP_ESTABLISHED (note: UDP using this state doesn't mean it has truly established a connection like TCP; it merely indicates that a default peer address has been specified and it has passed certain kernel checks).
} else {
    if (sk->sk_state != TCP_ESTABLISHED)
        return -EDESTADDRREQ;
    daddr = inet->inet_daddr;
    dport = inet->inet_dport;
    /* Open the fast path for a connected socket. */
    connected = 1;
}
...
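The two scenarios boil down to a small decision function. Below is a user-space sketch of that logic. struct toy_sock and toy_pick_dest are hypothetical stand-ins for the kernel's socket state and for the corresponding part of udp_sendmsg(), and the error codes mirror the excerpts above.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the socket state consulted by udp_sendmsg(). */
struct toy_sock {
    int       connected;  /* nonzero after connect() */
    in_addr_t peer_addr;  /* default peer address, network byte order */
    in_port_t peer_port;  /* default peer port, network byte order */
};

/*
 * Pick the destination for one send. Returns 0 and fills daddr/dport,
 * or a negative errno value.
 */
static int toy_pick_dest(const struct toy_sock *sk,
                         const struct sockaddr_in *name, socklen_t namelen,
                         in_addr_t *daddr, in_port_t *dport)
{
    assert(sk != NULL && daddr != NULL && dport != NULL);

    if (name != NULL) {                    /* Scenario A: explicit address */
        if (namelen < sizeof(*name))
            return -EINVAL;
        if (name->sin_family != AF_INET &&
            name->sin_family != AF_UNSPEC)
            return -EAFNOSUPPORT;
        if (name->sin_port == 0)           /* port 0 is never a valid target */
            return -EINVAL;
        *daddr = name->sin_addr.s_addr;
        *dport = name->sin_port;
        return 0;
    }

    if (!sk->connected)                    /* Scenario B: needs connect() */
        return -EDESTADDRREQ;
    *daddr = sk->peer_addr;
    *dport = sk->peer_port;
    return 0;
}
```

An unconnected socket with no msg_name yields -EDESTADDRREQ, which is the error a user-space program sees when it calls send() on a UDP socket that was never connected.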
Handling Ancillary Data
Remember the msg_control mentioned in the previous section? This is where it comes into play. Users can pass Ancillary Data through it.
Ancillary data is essentially a series of cmsghdr structures (see man 3 cmsg for details). Through it, you can do things that normal parameters can't, such as specifying a source address on an unconnected UDP socket (using IP_PKTINFO).
If msg_controllen is not 0, the kernel calls ip_cmsg_send() to parse these messages and fill in an ipcm_cookie structure.
struct ipcm_cookie {
    __be32                 addr;     // specified source address, etc.
    int                    oif;      // outgoing interface index
    struct ip_options_rcu *opt;      // IP options
    __u8                   tx_flags; // transmit flags
};
The code logic is as follows:
if (msg->msg_controllen) {
    err = ip_cmsg_send(sock_net(sk), msg, &ipc);
    if (err)
        return err;
    if (ipc.opt)
        free = 1;
    connected = 0;
}
...
if (connected)
    rt = (struct rtable *)sk_dst_check(sk, 0);
...
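On the user-space side, the control buffer that ip_cmsg_send() parses is built with the CMSG_* macros from cmsg(3). The sketch below prepares an IP_PKTINFO message that requests a source address. toy_set_pktinfo is a hypothetical helper name, and the code assumes Linux with glibc, where struct in_pktinfo is available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* assumption: glibc, which exposes struct in_pktinfo */
#endif
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Attach an IP_PKTINFO control message to msg, asking the kernel to use
 * src (network byte order) as the source address. ctrl must be suitably
 * aligned and hold at least CMSG_SPACE(sizeof(struct in_pktinfo)) bytes.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int toy_set_pktinfo(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                           in_addr_t src)
{
    struct in_pktinfo pi;
    struct cmsghdr *cm;

    assert(msg != NULL && ctrl != NULL);
    if (ctrl_len < CMSG_SPACE(sizeof(pi)))
        return -1;

    memset(ctrl, 0, ctrl_len);
    msg->msg_control = ctrl;
    msg->msg_controllen = CMSG_SPACE(sizeof(pi));

    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = IPPROTO_IP;
    cm->cmsg_type = IP_PKTINFO;
    cm->cmsg_len = CMSG_LEN(sizeof(pi));

    memset(&pi, 0, sizeof(pi));
    pi.ipi_spec_dst.s_addr = src;  /* requested local source address */
    memcpy(CMSG_DATA(cm), &pi, sizeof(pi));
    return 0;
}
```

A caller would fill msg_name and msg_iov as usual, call this helper, and then pass the msghdr to sendmsg(). Because msg_controllen is then nonzero, the kernel takes the ip_cmsg_send() branch shown above and clears connected.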
Route Lookup
If the socket's cached route entry (rt) is empty, it means a route lookup hasn't been performed yet (or the route cache has expired). In this case, we need to construct a flowi4 object and call ip_route_output_flow() to query the routing table.
if (rt == NULL) {
    struct net *net = sock_net(sk);
    fl4 = &fl4_stack;
    flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
                       RT_SCOPE_UNIVERSE, sk->sk_protocol,
                       inet_sk_flowi_flags(sk) | FLOWI_FLAG_CAN_SLEEP,
                       faddr, saddr, dport, inet->inet_sport);
    security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
    rt = ip_route_output_flow(net, fl4, sk);
    if (IS_ERR(rt)) {
        err = PTR_ERR(rt);
        rt = NULL;
        if (err == -ENETUNREACH)
            IP_INC_STATS_BH(net, IPSTATS_MIB_OUTNOROUTES);
        goto out;
    }
...
Sending Path: Fast and Slow
Kernel 2.6.39 introduced an important optimization: the lockless fast path for sending.
Fast path: If UDP_CORK is not enabled, meaning there's no need to accumulate packets, there's no reason to acquire the heavy socket lock (lock_sock). We directly call ip_make_skb() to build the SKB, and then udp_send_skb() to send it off.
/* Lockless fast path for the non-corking case. */
if (!corkreq) {
    skb = ip_make_skb(sk, fl4, getfrag, msg->msg_iov, ulen,
                      sizeof(struct udphdr), &ipc, &rt,
                      msg->msg_flags);
    err = PTR_ERR(skb);
    if (!IS_ERR_OR_NULL(skb))
        err = udp_send_skb(skb, fl4);
    goto out;
}
Slow path: If corkreq (cork) is enabled, we need to take the lock because it involves state maintenance (such as accumulating length).
lock_sock(sk);
do_append_data:
up->len += ulen;
Then we call ip_append_data(). This function doesn't send anything itself; instead, it copies the data into SKBs on the socket's write queue (sk_write_queue). The actual transmission happens later, when a send arrives without a cork request or the CORK option is cleared: at that point udp_push_pending_frames() is called to trigger sending and, if needed, fragmentation.
err = ip_append_data(sk, fl4, getfrag, msg->msg_iov, ulen,
                     sizeof(struct udphdr), &ipc, &rt,
                     corkreq ? msg->msg_flags | MSG_MORE : msg->msg_flags);
If an error occurs in the middle, we must flush all the accumulated SKBs in the queue; otherwise, we'll have a memory leak.
if (err)
    udp_flush_pending_frames(sk);          // free everything on sk_write_queue
else if (!corkreq)
    err = udp_push_pending_frames(sk);     // actually transmit
else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
    up->pending = 0;
release_sock(sk);
The sending process is like mailing letters: either you write each letter and drop it in the mailbox right away (fast path), or you save up several letters, pack them together, and call a courier (slow path/CORK).
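The split between the two paths can be captured in a toy model. This is a deliberately simplified user-space sketch, not kernel code: struct toy_udp_sock and its helpers are hypothetical, a flat byte buffer stands in for the SKB queue, and locking, routing, and error cleanup are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DGRAM 0xFFFF

struct toy_udp_sock {
    int      corkflag;             /* models the UDP_CORK option */
    size_t   pending_len;          /* bytes queued so far (like up->len) */
    char     queue[TOY_MAX_DGRAM]; /* stands in for sk_write_queue */
    size_t   last_sent_len;        /* size of the last datagram "sent" */
    unsigned sent_count;           /* datagrams emitted so far */
};

/* Emit everything that is queued as one datagram. */
static void toy_push_pending(struct toy_udp_sock *s)
{
    s->last_sent_len = s->pending_len;
    s->sent_count++;
    s->pending_len = 0;
}

/* One send call; `more` models MSG_MORE. Returns len, or -1 if the
 * combined datagram would be too large. */
static long toy_sendmsg(struct toy_udp_sock *s, const void *data,
                        size_t len, int more)
{
    int corkreq;

    assert(s != NULL && (data != NULL || len == 0));
    corkreq = s->corkflag || more;

    if (s->pending_len + len > TOY_MAX_DGRAM)
        return -1;

    if (!corkreq && s->pending_len == 0) {
        /* Fast path: nothing queued, build and send in one step. */
        s->last_sent_len = len;
        s->sent_count++;
        return (long)len;
    }

    /* Slow path: append to the queue; push only if this call is uncorked. */
    if (len > 0)
        memcpy(s->queue + s->pending_len, data, len);
    s->pending_len += len;
    if (!corkreq)
        toy_push_pending(s);
    return (long)len;
}
```

Sending "hello " with more set and then "world" without it leaves sent_count incremented once and last_sent_len at 11, while a lone uncorked send never touches the queue at all.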
Receiving Packets: From the Network Layer to the Socket
Now that we've seen the sending side, let's look at receiving. When the network layer (L3) receives a UDP packet, it calls the udp_rcv() function we registered during initialization. This function is very simple—it's just a pass-through that directly calls __udp4_lib_rcv().
int udp_rcv(struct sk_buff *skb)
{
    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
}
Let's dive into __udp4_lib_rcv().
First, the UDP header, length, source address, and destination address are all extracted from the SKB:
int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, int proto)
{
    struct sock *sk;
    struct udphdr *uh;
    unsigned short ulen;
    struct rtable *rt = skb_rtable(skb);
    __be32 saddr, daddr;
    struct net *net = dev_net(skb->dev);
    ...
    uh = udp_hdr(skb);
    ulen = ntohs(uh->len);
    saddr = ip_hdr(skb)->saddr;
    daddr = ip_hdr(skb)->daddr;
If the packet is a broadcast or multicast packet, the handling logic is quite special and is delegated to __udp4_lib_mcast_deliver(). We'll skip that here.
if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
    return __udp4_lib_mcast_deliver(net, skb, uh,
                                    saddr, daddr, udptable);
For the most common unicast packets, the core task the kernel needs to do is: find the socket.
It looks up the UDP hash table (udp_table), matching by the four-tuple (source IP, source port, destination IP, destination port).
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
if (sk != NULL) {
If we find a match, some application has a socket bound to this port. The next step is to put the packet into that socket's receive queue.
We call udp_queue_rcv_skb() -> sock_queue_rcv_skb() -> __skb_queue_tail() to append the SKB to the tail of sk->sk_receive_queue.
    int ret = udp_queue_rcv_skb(sk, skb);
    sock_put(sk);
    /* A return value > 0 means the input should be resubmitted,
     * but the caller expects -protocol or 0 here. */
    if (ret > 0)
        return -ret;
    return 0; // success
}
...
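The lookup can be pictured as a hash on the destination port followed by a scoring pass over the chain, where a more specific socket (bound to an address, or connected to a peer) beats a wildcard one. The sketch below is a simplified user-space model of that idea. All toy_ names and the table size are assumptions, and the real kernel lookup takes more into account than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 128  /* hypothetical table size; a power of two */

struct toy_usock {
    uint16_t lport;          /* bound local port */
    uint32_t laddr;          /* bound local address, 0 = any */
    uint32_t raddr;          /* connected peer address, 0 = unconnected */
    uint16_t rport;          /* connected peer port, 0 = unconnected */
    struct toy_usock *next;  /* hash chain */
};

static struct toy_usock *toy_table[TOY_HASH_SIZE];

static void toy_hash_insert(struct toy_usock *sk)
{
    unsigned slot;

    assert(sk != NULL);
    slot = sk->lport & (TOY_HASH_SIZE - 1);
    sk->next = toy_table[slot];
    toy_table[slot] = sk;
}

/* Score a candidate: -1 on mismatch, otherwise higher = more specific. */
static int toy_score(const struct toy_usock *sk, uint32_t saddr,
                     uint16_t sport, uint32_t daddr, uint16_t dport)
{
    int score = 1;

    if (sk->lport != dport)
        return -1;
    if (sk->laddr) {
        if (sk->laddr != daddr)
            return -1;
        score++;
    }
    if (sk->raddr) {
        if (sk->raddr != saddr)
            return -1;
        score++;
    }
    if (sk->rport) {
        if (sk->rport != sport)
            return -1;
        score++;
    }
    return score;
}

/* Return the best-matching socket for the four-tuple, or NULL. */
static struct toy_usock *toy_lookup(uint32_t saddr, uint16_t sport,
                                    uint32_t daddr, uint16_t dport)
{
    struct toy_usock *sk, *best = NULL;
    int best_score = -1;

    for (sk = toy_table[dport & (TOY_HASH_SIZE - 1)]; sk; sk = sk->next) {
        int score = toy_score(sk, saddr, sport, daddr, dport);

        if (score > best_score) {
            best = sk;
            best_score = score;
        }
    }
    return best;
}
```

With a wildcard socket and a connected socket both on the same port, packets from the connected peer go to the connected socket and everything else falls through to the wildcard one.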
What if no socket is found?
This means the packet arrived at the machine, but no application is listening on this port. In this case, we can't just silently drop it (unless the checksum is wrong).
- Checksum check: If the checksum is wrong, drop the packet immediately.
    if (udp_lib_checksum_complete(skb))
        goto csum_error;
- Send an ICMP error: If the checksum is fine, it means the address is correct, just no one claims the port. So the kernel sends an ICMP "Destination Unreachable" (Code 3: Port Unreachable) back to the sender, politely telling it: "Stop sending, no one is listening." At the same time, it increments the NoPorts counter in UDP_MIB (visible via netstat -s).
UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
/*
* Hmm. We got an UDP packet to a port to which we
* don't wanna listen. Ignore it.
*/
kfree_skb(skb);
return 0;
Figure 11-2: UDP Reception Flow Diagram (Shows the flow from packet arrival -> hash table lookup -> socket found (enqueue) / socket not found (send ICMP))
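The ICMP reply is visible from user space as well. The following sketch assumes a Linux host with a working loopback interface. demo_port_unreachable is a hypothetical helper name. It sends one datagram to a port that has just been released and reports how the next recv() fails, using a connected socket so that the kernel can attribute the ICMP error to it.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Returns the errno of the recv() that follows a send to a closed port,
 * 0 if recv() unexpectedly succeeds, or -1 if the setup itself fails.
 */
static int demo_port_unreachable(void)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    char buf[16];
    int probe, fd, result = -1;

    probe = socket(AF_INET, SOCK_DGRAM, 0);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (probe < 0 || fd < 0)
        goto out;

    /* Bind a throwaway socket to learn a concrete loopback port. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(probe, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(probe, (struct sockaddr *)&addr, &alen) < 0)
        goto out;
    assert(alen == sizeof(addr));

    /* Connect while the port is still taken, so fd gets a different one. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        goto out;

    /* Release the port: from now on nobody is listening on it. */
    close(probe);
    probe = -1;

    if (send(fd, "ping", 4, 0) != 4)
        goto out;
    result = (recv(fd, buf, sizeof(buf), 0) < 0) ? errno : 0;
out:
    if (probe >= 0)
        close(probe);
    if (fd >= 0)
        close(fd);
    return result;
}
```

On Linux the expected result is ECONNREFUSED, which is how the Port Unreachable message surfaces on a connected UDP socket. A timeout (EAGAIN) would suggest that the ICMP message was filtered or never generated.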
This concludes our kernel journey into UDP. It's refreshingly straightforward: write data, look up route, send; receive data, look up socket, enqueue.
But this is just the calm before the storm. In the next section, we'll face the most complex monster in the networking protocol world—TCP. That's where state machines live, complex timeout retransmissions lurk, and hair-pulling congestion control awaits.