11.4 TCP (Transmission Control Protocol)
If the UDP we discussed in the previous section is a carefree "fire-and-forget" optimist, then the TCP we face in this section is the most severe control freak in the network protocol world.
TCP (Transmission Control Protocol) was born in 1981 with RFC 793. Over the three decades that followed, it was patched and layered with features—to adapt to satellite links, to saturate gigabit fiber, and to survive in congested wireless networks. Today, it is the cornerstone of the internet. The HTTP, SSH, and FTP you use, and even the page you are reading right now, are all carried by TCP underneath.
Compared to UDP's simplicity and brute force, TCP provides a reliable, connection-oriented byte stream service. It doesn't want you to lose packets, and it doesn't want them arriving out of order. To achieve this, it introduced sequence numbers, acknowledgments, state machines, congestion control... All of this combined makes TCP one of the most complex protocols in the kernel. To be honest, to thoroughly explain all of TCP's implementation details, optimization algorithms, and edge cases, the thickness of this book would need to triple.
Here, we pick out only the core skeleton: how a connection is established, how data is sent and received, and the indispensable collection of timers. As for TCP's dazzling array of congestion control algorithms (Cubic, BBR, and so on), although the Linux kernel lets them be swapped in and out at runtime, doing them justice would require a dedicated chapter.
TCP Header: A Much Heavier Backpack Than UDP
Before diving into the kernel implementation, we need to recognize TCP's face. Unlike UDP's lean 8-byte header, TCP's header is 20 bytes even without options, and can reach up to 60 bytes with options. Every bit has its purpose.
Let's see how this header is defined in the kernel:
struct tcphdr {
    __be16  source;
    __be16  dest;
    __be32  seq;
    __be32  ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u16   res1:4,
            doff:4,
            fin:1,
            syn:1,
            rst:1,
            psh:1,
            ack:1,
            urg:1,
            ece:1,
            cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
    __u16   doff:4,
            res1:4,
            cwr:1,
            ece:1,
            urg:1,
            ack:1,
            psh:1,
            rst:1,
            syn:1,
            fin:1;
#else
#error "Adjust your <asm/byteorder.h> defines"
#endif
    __be16  window;
    __sum16 check;
    __be16  urg_ptr;
};
(include/uapi/linux/tcp.h)
We can look at this bunch of fields one by one, like dismantling a clock:
- source / dest: Source and destination ports (16 bits each). This is the transport layer's multiplexing key, determining which process the data belongs to.
- seq: Sequence number (32 bits). This is the cornerstone of TCP's reliability, identifying the byte position in the data stream.
- ack_seq: Acknowledgment number (32 bits). Note that this field is only valid when the ACK flag is 1. It tells the peer: "I have received all data before this, and I expect data with this sequence number next."
- res1: Reserved bits (4 bits), must be 0.
- doff: Data Offset (4 bits). This actually refers to the length of the TCP header, in units of 4 bytes. Because the TCP header is variable-length (it has options), this field must exist to tell the kernel "where the actual data starts." The minimum value is 5 (5×4=20 bytes), and the maximum is 15 (60 bytes).
Next is a row of 1-bit flags, each capable of changing the course of TCP's state machine:
- fin (Finish): "I'm done sending, ready to close."
- syn (Synchronize): Used to synchronize sequence numbers during the three-way handshake.
- rst (Reset): "Something went wrong with the connection, restart immediately." This is the emergency brake in TCP.
- psh (Push): "Stop caching, push the data to the application layer immediately."
- ack: Indicates that the ack_seq field is valid. Except for the first packet of a connection setup, almost all packets carry this flag.
- urg (Urgent): Indicates that the urg_ptr (urgent pointer) field is valid.
- ece (ECN-Echo) and cwr (Congestion Window Reduced): These two flags are related to Explicit Congestion Notification (ECN, RFC 3168), used to notify each other when the network is congested without dropping packets—much more civilized than the old brute-force packet dropping.
Finally, there are several fields managing flow control and verification:
- window: Receive window size (16 bits). This is the valve for flow control, telling the peer: "I have this much space left in my receive buffer, don't send more than this."
- check: Checksum, covering both the header and data.
- urg_ptr: Urgent pointer. Only meaningful when the URG flag is set; it is an offset pointing to the last byte of urgent data.
Figure 11-3: The IPv4 TCP header layout (source port, destination port, sequence number, acknowledgment number, data offset, flags, window, checksum, urgent pointer)
You see, this is much more complex than UDP's four fields. Complexity means overhead, but it also means control. UDP gave up control in exchange for speed, while TCP tightly grips every bit to ensure your packets don't get lost in the network wasteland.
Alright, now that we understand the header, we can dive into the kernel and see how TCP initializes these complex mechanisms.
TCP Initialization: Registering a Complex Soul in the Kernel
Since TCP is so complex, its initialization flow in the kernel naturally can't be as casual as UDP's.
First, we need to define a net_protocol object tcp_protocol and hang it on the kernel's protocol list. This step is similar to UDP, still calling inet_add_protocol():
static const struct net_protocol tcp_protocol = {
    .early_demux = tcp_v4_early_demux,
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};
(net/ipv4/af_inet.c)
Then we register it in inet_init():
static int __init inet_init(void)
{
    . . .
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    . . .
}
(net/ipv4/af_inet.c)
Just registering the protocol isn't enough; TCP also needs to handle socket-level operations. We define a proto object tcp_prot, also registered using proto_register():
struct proto tcp_prot = {
    .name       = "TCP",
    .owner      = THIS_MODULE,
    .close      = tcp_close,
    .connect    = tcp_v4_connect,
    .disconnect = tcp_disconnect,
    .accept     = inet_csk_accept,
    .ioctl      = tcp_ioctl,
    .init       = tcp_v4_init_sock,
    . . .
};
(net/ipv4/tcp_ipv4.c)
Notice the callback function on the .init line: tcp_v4_init_sock.
In the UDP section, there was no comparable .init callback to speak of. Why? Because UDP is simple enough that a freshly created socket needs no special setup. TCP is different.
When you create a TCP socket in user space (socket(AF_INET, SOCK_STREAM, 0)), the kernel ultimately calls tcp_v4_init_sock(). This function calls tcp_init_sock() to do a bunch of dirty work, such as:
- Setting the socket state to TCP_CLOSE.
- Initializing timers (calling tcp_init_xmit_timers()). TCP relies heavily on timers; without them, TCP wouldn't know whether to retransmit or give up.
- Setting the sizes of the send buffer (sk_sndbuf) and receive buffer (sk_rcvbuf):
  - The default send buffer is 16KB (sysctl_tcp_wmem[1]).
  - The default receive buffer is about 87KB (sysctl_tcp_rmem[1]).
  - You can tune these parameters via /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/ipv4/tcp_rmem.
- Initializing the out-of-order queue and the prequeue.
- Initializing the congestion window, setting the initial congestion window to 10 segments (TCP_INIT_CWND), as specified by RFC 6928.
Speaking of timers, they are the power source for TCP's heartbeat. Let's take a dedicated look.
TCP Timers: Guardians of Time
A large part of TCP's reliability is built on "waiting" and "retrying." All of this is managed by the timer mechanism located in net/ipv4/tcp_timer.c. TCP primarily uses four types of timers, each targeting a specific type of anxiety:
- Retransmission Timer: This is the most anxious one. It starts every time a segment is sent. If no ACK is received within the specified time, it assumes the packet is lost and resends it. When a packet is truly lost or eaten by link-layer noise, this is the last lifeline.
- Delayed ACK Timer: This one is more laid-back. After receiving data, TCP doesn't have to reply with an ACK immediately; it can wait a bit (e.g., 200ms) to see if there's any outgoing data it can piggyback the ACK onto. This reduces the number of small packets on the network and improves efficiency.
- Keepalive Timer: This mechanism exists to prevent "zombie connections." Sometimes, when both ends of a connection haven't transmitted data for a long time, an intermediate router might drop the state for one side, or one side might simply lose power. Nobody knows if the other side is still alive. The keepalive timer probes periodically, and if it finds no response, it calls tcp_send_active_reset() to kill the connection.
- Zero Window Probe Timer (also known as the persist timer): This is a classic deadlock-prevention mechanism. If the receiver's buffer is full, it tells the sender: "Window is 0, stop sending." The sender then stops and waits. But there's a huge trap here: what if the receiver frees up space and sends a "window update" packet to notify the sender, but this unfortunate window update packet gets lost halfway? The sender thinks the window is still 0 and keeps waiting; the receiver thinks it has notified and keeps waiting for data. This is a deadlock. The solution is the zero window probe: when the sender sees a zero window, it doesn't just wait idly. Instead, it starts this timer and occasionally sends a small probe to poke the receiver: "Hey, is the window open yet?" After receiving a non-zero window response, it resumes data transmission.
TCP Socket Initialization: Everything Starts with tcp_v4_init_sock
When a user-space program wants to use TCP, it must first call socket() to create a socket of type SOCK_STREAM. At this step, the kernel calls the tcp_v4_init_sock() -> tcp_init_sock() we mentioned earlier.
The reason this callback function is important is that it is the generic initialization entry point. Whether it's IPv4 or IPv6, creating a TCP socket ultimately goes through similar logic (IPv6 goes through tcp_v6_init_sock).
The work it does was briefly mentioned in the previous section, but let's emphasize the key points again:
It transforms a freshly allocated struct sock object from an empty shell into a stateful TCP entity. It sets up buffers, starts timers, and calculates the initial congestion window. Without this step, the subsequent connect() and listen() would be out of the question.
TCP Connection Setup: The Kernel Perspective of the Three-Way Handshake
TCP's connection establishment and teardown are essentially the transitions of a state machine. At any given moment, a socket is in a specific state (such as TCP_LISTEN, TCP_SYN_SENT, etc.). This state is saved in the sk_state member of struct sock.
Textbooks all cover the three-way handshake, but in the kernel, it's not just about exchanging three packets—it's a process of state and memory structure transitions:
1. Client sends SYN: The client calls connect() and sends a SYN packet. At this point, the client socket state changes from TCP_CLOSE to TCP_SYN_SENT.
2. Server receives SYN, sends SYN-ACK: The server is in the TCP_LISTEN state at this time (having called listen()). When it receives the SYN packet, the kernel does something very interesting: it doesn't change the listening socket to a connected state, because the listening socket has to serve all clients. Instead, the kernel creates a new request_sock (a lightweight request socket) to represent this connection being established. This new sock's state is set to TCP_SYN_RECV. Then, the server sends a SYN-ACK packet back to the client.
3. Client receives SYN-ACK, sends ACK: The client receives the SYN-ACK, and its state transitions from TCP_SYN_SENT to TCP_ESTABLISHED. The connection is considered established on the client side. It sends the final ACK.
4. Server receives ACK: The server receives that final ACK, and the request_sock has completed its historical mission. The kernel creates a full child socket based on this request_sock and sets its state to TCP_ESTABLISHED. This new socket is placed into the accept queue, waiting for the application layer to call accept() and retrieve it.
Note:
If you want to find the "master controller" for this state machine transition in the source code, it's the tcp_rcv_state_process() method (located in net/ipv4/tcp_input.c). Whether it's IPv4 or IPv6, handling most state changes (except for the fast path of the ESTABLISHED state) goes through it.
Receiving Packets: When a Network Layer Packet Arrives at the TCP Layer
The connection is established, and data starts flowing. As kernel engineers, we need to care about: when an IP layer packet (struct sk_buff) arrives, how does TCP catch it?
The entry function is tcp_v4_rcv() (net/ipv4/tcp_ipv4.c).
Let's walk through the code:
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct sock *sk;
    . . .
Step 1: Routine Checks and Socket Lookup
First, there are a bunch of basic sanity checks: Is the packet destined for us? Is the length enough for a TCP header?
Then, the most crucial step: finding the Socket. We need to know which process this packet is serving. We call __inet_lookup_skb() to look it up in the hash table.
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    . . .
    if (!sk)
        goto no_tcp_socket;
Here, it first looks for a connected socket in the established hash table; if not found, it looks for a listening socket in the listening hash table. If neither is found, it means this packet was sent blindly, so we drop it.
Step 2: Check if the Socket is Occupied
After finding the socket, a question arises: Is a user-space process currently using this socket?
The kernel uses the sock_owned_by_user() macro to judge this. If it returns 1, it means a user process is holding the lock and operating on this socket (e.g., currently calling read() or write()).
    if (!sock_owned_by_user(sk)) {
        . . .
        {
Case A: Socket is Not Occupied If no one is using it, great, the kernel can process it directly. To optimize performance, the kernel first tries to throw the packet into the prequeue. This is a special queue dedicated to caching packets, waiting for the user process to batch-process them the next time it calls a socket interface, reducing context switches.
If the prequeue is full or the packet isn't suitable for the prequeue, it calls tcp_v4_do_rcv() to go through the normal flow.
            if (!tcp_prequeue(sk, skb))
                ret = tcp_v4_do_rcv(sk, skb);
        }
Case B: Socket is Occupied If a user process is currently using this socket (it's locked), the kernel can't recklessly modify its data structures. To avoid dropping packets, the kernel can only temporarily stuff the packet into the backlog queue.
    } else if (unlikely(sk_add_backlog(sk, skb,
                                       sk->sk_rcvbuf + sk->sk_sndbuf))) {
        bh_unlock_sock(sk);
        NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
        goto discard_and_relse;
    }
}
If even the backlog is full, it can only drop the packet and increment the LINUX_MIB_TCPBACKLOGDROP counter.
Diving into tcp_v4_do_rcv()
No matter which path is taken, the packet ultimately gets sorted here:
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
- If it's in the TCP_ESTABLISHED state (the fast path): it calls tcp_rcv_established(). This is the most commonly taken path and is handled very efficiently.
- If it's in the TCP_LISTEN state: it calls tcp_v4_hnd_req(), which usually handles the arrival of new connections (receiving a SYN or the final ACK of the handshake).
- Other states: it calls the aforementioned head butler, tcp_rcv_state_process(), to handle the various state transitions (e.g., receiving a FIN and entering the close sequence).
Sending Packets: Pushing Data Out
Finally, let's look at sending. When user space calls send() or sendmsg(), the kernel ultimately reaches tcp_sendmsg() (net/ipv4/tcp.c).
This function is much more complex than UDP's sending logic. It's not just a matter of pointing a pointer and being done.
Core tasks of tcp_sendmsg():
- Copying data from user space into kernel space (sk_buffs).
- Handling logic like Nagle's algorithm (deciding whether to send immediately or batch things up).
- Assembling the sk_buff.
- Calling the transport layer's send functions.
The code snippet is as follows:
int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t size)
{
    struct iovec *iov;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int iovlen, flags, err, copied = 0;
    int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
    bool sg;
    long timeo;
    . . .
There's a lot of logic here regarding MSS (Maximum Segment Size) and sk_sndbuf checks.
When the data is finally assembled and sitting in skb, ready to depart, it calls tcp_push_one() -> tcp_write_xmit() -> tcp_transmit_skb().
In tcp_transmit_skb(), the final leap that truly hands the packet over to the network layer is this line:
    . . .
    err = icsk->icsk_af_ops->queue_xmit(skb, &inet->cork.fl);
    . . .
}
(net/ipv4/tcp_output.c)
Here, an icsk_af_ops (INET Connection Socket ops) is used; this is an address-family-oriented operation object. For IPv4 TCP, it points to ipv4_specific, whose queue_xmit callback is the generic ip_queue_xmit().
With this, the TCP layer's processing is complete. The packet is officially handed over to the IP layer, which is the territory of the next layer down (L3).
The world of TCP is bottomless. We have peeled back the layers of connection establishment, timers, and the send/receive packet paths here, but this is just the tip of the iceberg. The good news is that with this foundation, when you go on to understand protocols like SCTP or DCCP, you'll find that they are actually making all sorts of trade-offs and hybrids between the two extremes of TCP and UDP.
In the next section, let's go look at this "hybrid"—SCTP.