11.5 SCTP: The Hybrid Born of Engineering Trade-offs
In the previous section, we left off in TCP's complex and meticulous world, which spares no expense for reliability and ordered delivery. But in an engineer's reality, not every scenario can tolerate TCP's rigidity, nor fully accept UDP's indifference. What we need is a hybrid: TCP's reliability and congestion control, UDP's message boundaries, and on top of both, multi-homing, which neither of them offers.
This is exactly why SCTP (Stream Control Transmission Protocol) exists. Standardized in 2000 as RFC 2960, it was originally created to carry PSTN (Public Switched Telephone Network) signaling over IP. But people soon realized that in general IP networks, especially in LTE scenarios that are extremely sensitive to failures, it handles path failures better than TCP.
Why? Because a TCP connection is pinned to a single pair of addresses and only learns that the path is dead after its retransmissions time out, whereas SCTP actively probes every path it knows about and can move traffic to another one when a path fails.
SCTP's Hybrid Nature
We can think of it as the offspring of a marriage between TCP and UDP:
- It is reliable (like TCP): it features congestion control and flow control (receive window a_rwnd).
- It is message-oriented (like UDP): TCP is a byte stream, meaning your neatly partitioned messages are just a string of data to TCP; SCTP preserves message boundaries, so you send a chunk and you receive a chunk.
- Security upgrade: it uses a four-way handshake instead of TCP's three-way handshake, specifically designed to mitigate SYN flood attacks.
- Multi-homing: endpoints can have multiple IP addresses. If one network cable is cut, it automatically switches to another.
- Multi-streaming: multiple independent data streams run in parallel within a single association. This solves TCP's fatal "head-of-line blocking" problem.
But before we dive into these mechanisms, we need to look at how it plugs itself into the kernel.
5.1 Jumping the Queue: Protocol Initialization
In the last line, the final leap that actually hands the packet to the network layer is this one:
. . .
err = icsk->icsk_af_ops->queue_xmit(skb, &inet->cork.fl);
. . .
}
(net/ipv4/tcp_output.c)
The world of TCP is bottomless. We've peeled back the layers of connection establishment, timers, and packet transmission/reception, but this is just the tip of the iceberg. The good news is that with TCP and UDP as a foundation, understanding SCTP is much easier—it essentially makes engineering trade-offs between these two extremes.
But if the kernel doesn't recognize it, we can't use it.
SCTP's initialization entry point is sctp_init(). Its tasks are tedious: allocating memory, initializing sysctl variables, and most importantly, registering itself with the IP layer (both IPv4 and IPv6).
int sctp_init(void)
{
    int status = -EINVAL;
    . . .
    /* Register with the IPv4 protocol layer first */
    status = sctp_v4_add_protocol();
    if (status)
        goto err_add_protocol;
    /* Then register with the IPv6 protocol layer */
    status = sctp_v6_add_protocol();
    if (status)
        goto err_v6_add_protocol;
    . . .
}
(net/sctp/protocol.c)
This step is no different from protocols like UDP—it's just filling out forms. SCTP defines a net_protocol structure instance, fills in the processing and error callbacks, and hooks itself onto the kernel's protocol list.
static const struct net_protocol sctp_protocol = {
    .handler     = sctp_rcv,    /* Receive entry point */
    .err_handler = sctp_v4_err, /* ICMP error handling */
    .no_policy   = 1,
};
(net/sctp/protocol.c)
The actual registration happens in sctp_v4_add_protocol():
static int sctp_v4_add_protocol(void)
{
    /* Watch for IP address changes; SCTP must hear about every add and delete */
    register_inetaddr_notifier(&sctp_inetaddr_notifier);
    /* Formally hook SCTP into the IP layer; the protocol number is IPPROTO_SCTP */
    if (inet_add_protocol(&sctp_protocol, IPPROTO_SCTP) < 0)
        return -EAGAIN;
    return 0;
}
(net/sctp/protocol.c)
There's a detail worth noting here: register_inetaddr_notifier().
SCTP cares deeply about changes in network interface IP addresses. Because it is "multi-homed," if a local IP suddenly disappears or a new one is added, SCTP must know immediately so it can update its global address list (sctp_local_addr_list) and notify the peer. This notifier is the kernel's pipeline for passing messages.
5.2 Building Blocks and Boxes: Packet Structure and Chunks
SCTP's packet structure is much more "Lego-like" than TCP's. In TCP, everything after the header is data with blurry boundaries; in SCTP, the common header is followed by a bunch of "chunks."
Each SCTP packet consists of a Common Header and several Chunks.
Common Header
This is the ID card for every SCTP packet:
typedef struct sctphdr {
    __be16 source;
    __be16 dest;
    __be32 vtag;     /* Verification Tag */
    __le32 checksum; /* Checksum: CRC32c (Adler-32 in the original RFC 2960) */
} __attribute__((packed)) sctp_sctphdr_t;
(include/linux/sctp.h)
- source / dest: ports, just like TCP.
- vtag: this is SCTP's anti-forgery tag. Each association has a random 32-bit Tag. If a packet arrives with the wrong vtag, the kernel drops it immediately, no questions asked.
- checksum: a CRC32c computed over the entire packet, with this field treated as zero during the computation.
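To make the checksum field a little more concrete, here is a minimal user-space sketch of CRC32c, computed bit by bit. It assumes the RFC 4960 convention that the checksum field (bytes 8 to 11 of the common header) counts as zero while the CRC is calculated. The names `crc32c` and `sctp_packet_crc` are hypothetical, the on-wire byte order of the stored value is not covered, and the kernel relies on its own optimized helpers rather than a loop like this.

```c
#include <stddef.h>
#include <stdint.h>

/* Feed one byte into a running CRC32c (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_update(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Plain CRC32c over a buffer: initial value and final XOR are all ones. */
uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
        crc = crc32c_update(crc, buf[i]);
    return ~crc;
}

/*
 * Hypothetical helper: CRC32c of a whole SCTP packet, treating the
 * 4-byte checksum field at offset 8 as zero.
 */
uint32_t sctp_packet_crc(const uint8_t *pkt, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        uint8_t b = (i >= 8 && i < 12) ? 0 : pkt[i];
        crc = crc32c_update(crc, b);
    }
    return ~crc;
}
```

A receiver built on this sketch would recompute `sctp_packet_crc()` over the received bytes and compare the result with the value carried in the header.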
Chunk Header
Immediately following the common header are the chunks. Chunks have their own headers too:
typedef struct sctp_chunkhdr {
    __u8   type;   /* Chunk type */
    __u8   flags;  /* Flag bits */
    __be16 length; /* Chunk length */
} __packed sctp_chunkhdr_t;
(include/linux/sctp.h)
- type: what kind of chunk is this? Is it data (SCTP_CID_DATA), a connection-establishing INIT, or an error-reporting ABORT? All chunks follow the TLV (Type-Length-Value) format, which guarantees protocol extensibility.
- flags: usually all zeros, but has specific meanings in certain special chunks (like ABORT).
- length: the total length, including the header.
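The TLV layout means a receiver can walk a packet chunk by chunk without understanding every type. The sketch below illustrates that walk in user space. It assumes the RFC 4960 rules that the length field is big-endian, includes the 4-byte chunk header, and excludes the padding that rounds each chunk up to a 4-byte boundary. The function name `sctp_walk_chunks` and the enum names are hypothetical, not kernel identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Chunk type values from RFC 4960, section 3.2 (subset). */
enum {
    CID_DATA = 0, CID_INIT = 1, CID_INIT_ACK = 2, CID_SACK = 3,
    CID_HEARTBEAT = 4, CID_HEARTBEAT_ACK = 5, CID_ABORT = 6,
    CID_COOKIE_ECHO = 10, CID_COOKIE_ACK = 11
};

#define SCTP_COMMON_HDR_LEN 12u
#define SCTP_CHUNK_HDR_LEN   4u

/*
 * Walk the chunks of one SCTP packet and record each chunk type in
 * types[] (up to max_types entries). Returns the number of chunks
 * found, or -1 if the packet is malformed.
 */
int sctp_walk_chunks(const uint8_t *pkt, size_t len,
                     uint8_t *types, size_t max_types)
{
    size_t off = SCTP_COMMON_HDR_LEN;
    int count = 0;

    if (len < SCTP_COMMON_HDR_LEN)
        return -1;

    while (off < len) {
        size_t clen;

        if (len - off < SCTP_CHUNK_HDR_LEN)
            return -1;
        /* Big-endian length that includes the chunk header itself. */
        clen = ((size_t)pkt[off + 2] << 8) | pkt[off + 3];
        if (clen < SCTP_CHUNK_HDR_LEN || clen > len - off)
            return -1;
        if ((size_t)count < max_types)
            types[count] = pkt[off];
        count++;
        /* Skip the padding up to the next 4-byte boundary. */
        off += (clen + 3u) & ~(size_t)3u;
    }
    return count;
}
```

Because unknown types can be skipped using only the length, new chunk types can be added without breaking a walker like this one.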
To process these chunks, the kernel defines a massive object called struct sctp_chunk. It is the basic unit for handling SCTP logic within the kernel. We can think of it as a parcel containing the specific data chunk, labeled with the source and destination, and even aware of which "association" it belongs to.
struct sctp_chunk {
    . . .
    atomic_t refcnt;
    /* Depending on the chunk type, subh points to a different sub-header */
    union {
        __u8 *v;
        struct sctp_datahdr *data_hdr;
        struct sctp_inithdr *init_hdr;
        struct sctp_sackhdr *sack_hdr;
        struct sctp_heartbeathdr *hb_hdr;
        /* ... more types ... */
    } subh;
    struct sctp_chunkhdr *chunk_hdr;
    struct sctphdr *sctp_hdr;
    struct sctp_association *asoc; /* Which association this chunk belongs to */
    /* Receiving endpoint */
    struct sctp_ep_common *rcvr;
    /* Source and destination addresses */
    union sctp_addr source;
    union sctp_addr dest;
    /* Transport path: for an inbound packet, where to send the reply;
     * for an outbound packet, where it is headed */
    struct sctp_transport *transport;
};
(include/net/sctp/structs.h)
5.3 Associations
In TCP, we say "connection," but in SCTP, we say "association."
Why the change in terminology? Because a TCP connection is strictly one-to-one in terms of IP addresses. Between two SCTP endpoints, multiple IP paths might exist simultaneously. Therefore, an "association" is a broader concept than a "connection"—it describes the relationship between two endpoints, regardless of how many network cables lie in between.
The kernel uses struct sctp_association to represent it:
struct sctp_association {
    ...
    sctp_assoc_t assoc_id; /* Association ID */
    /* State cookie, validated during the four-way handshake */
    struct sctp_cookie c;
    /* Information about the peer */
    struct {
        struct list_head transport_addr_list; /* The peer's address list */
        __u16 transport_count;                /* Number of addresses */
        __u16 port;
        /* primary_path: the address used when the association was first set up (home base) */
        struct sctp_transport *primary_path;
        /* active_path: the address currently used for sending data (may have switched) */
        struct sctp_transport *active_path;
    } peer;
    sctp_state_t state; /* Association state machine */
    . . .
};
(include/net/sctp/structs.h)
- assoc_id: the unique ID number for each association.
- peer: this is the profile of the remote endpoint. Note the transport_addr_list: it is a linked list. Because SCTP supports multi-homing, the peer might tell us: "I have IP A, IP B, and IP C. Take your pick."
- primary_path vs active_path: this is the essence of SCTP. primary_path is the "primary path," usually the first address that successfully connects; active_path is the "currently active path." If primary_path goes down, SCTP automatically switches to a backup path, and active_path changes.
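How do we add peer addresses to this association? Through sctp_connectx(). Want to bind to multiple local addresses? Use sctp_bindx(). Both are user-space library functions (shipped with lksctp-tools on Linux) that wrap socket options, not system calls in their own right.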
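The relationship between primary_path and active_path can be sketched in user space as follows. This is a deliberate simplification: `struct path`, `struct peer`, and `choose_active_path` are hypothetical stand-ins for the kernel's transport and association structures, and the kernel's real selection policy weighs more factors than "first alternate that is still up."

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct sctp_transport: one peer address. */
struct path {
    const char *addr;   /* peer IP as text, for readability */
    bool        active; /* false once the path is considered down */
};

/* Simplified stand-in for the peer part of struct sctp_association. */
struct peer {
    struct path *paths;   /* plays the role of transport_addr_list */
    size_t       count;   /* plays the role of transport_count */
    size_t       primary; /* index of the primary path */
};

/*
 * Pick the path to transmit on: the primary while it is up, otherwise
 * the first alternate that is still active. Returns NULL if no path
 * is usable.
 */
struct path *choose_active_path(struct peer *p)
{
    if (p->count == 0)
        return NULL;
    if (p->paths[p->primary].active)
        return &p->paths[p->primary];
    for (size_t i = 0; i < p->count; i++)
        if (p->paths[i].active)
            return &p->paths[i];
    return NULL;
}
```

With three peer addresses, marking the primary inactive makes the function return the second address, which is the behavior the text describes as active_path changing.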
5.4 Establishing Trust: The Four-Way Handshake
TCP uses a three-way handshake; SCTP uses a four-way one. Why the extra step?
Remember when we talked about TCP's SYN Flood attacks? The attacker sends a flood of SYN packets, clogging the server's half-open connection queue. The server waits in vain to establish connections until its resources are exhausted. SCTP's designers decided: before the other party proves they genuinely want to communicate, we will absolutely not allocate precious TCB (Transmission Control Block, i.e., connection control block) resources.
This is the core logic behind the four-way handshake.
First Hop: INIT
Client A wants to talk to Server Z. A sends an INIT chunk.
- A generates a random Tag and places it in the INIT chunk.
- The vtag in the SCTP common header is set to 0.
- A's state changes to SCTP_STATE_COOKIE_WAIT.
Second Hop: INIT-ACK
Z receives the INIT. Z does not establish a TCB or allocate expensive resources. Instead, it does something clever: it generates a State Cookie. This Cookie contains all the information Z needs to remember (such as A's IP, Tag, Z's own Tag, etc.), and it is signed with a keyed message authentication code (MAC) so that nobody else can forge or alter it.
Z sends this Cookie back to A inside an INIT-ACK.
- Z generates its own Tag.
- Z fills the Tag A sent into the SCTP header's vtag (to prove it received it).
- It attaches the State Cookie.
Third Hop: COOKIE-ECHO
A receives the INIT-ACK and the Cookie. A dutifully packs this Cookie, completely unaltered, into a COOKIE-ECHO chunk and sends it back.
- From now on, the vtag in all packets A sends will be filled with the Tag Z gave it.
- A's state changes to SCTP_STATE_COOKIE_ECHOED.
Fourth Hop: COOKIE-ACK
Z receives the COOKIE-ECHO. Z takes out the Cookie, verifies its signature, and finds that it was indeed issued by itself just now, hasn't expired, and the information inside matches up. "Okay, you're legitimate."
Only at this point does Z actually allocate the TCB, establish the struct sctp_association, change its state to SCTP_STATE_ESTABLISHED, and reply with a COOKIE-ACK.
A receives the COOKIE-ACK, and the association is established.
Key takeaway: Throughout this entire process, server Z does not retain any state about this connection until it receives the Cookie-Echo. This is the ultimate weapon against SYN Floods—stateless rejection.
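The stateless trick can be sketched in a few lines of user-space C. Everything here is illustrative: `struct state_cookie`, `make_cookie`, and `cookie_is_valid` are hypothetical names, the cookie layout is invented for the example, and `toy_mac` is a keyed FNV-1a hash used purely as a stand-in. It is NOT cryptographically secure; a real stack would use an HMAC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What the server would otherwise have to remember for every INIT. */
struct state_cookie {
    uint32_t peer_addr;  /* A's IPv4 address */
    uint32_t peer_tag;   /* A's initiate tag */
    uint32_t my_tag;     /* Z's initiate tag */
    uint32_t expires_at; /* absolute expiry time, in seconds */
    uint32_t mac;        /* keyed hash over the fields above */
};

/* Toy keyed hash (FNV-1a mixed with a secret). Stand-in for an HMAC. */
static uint32_t toy_mac(const struct state_cookie *c, uint32_t secret)
{
    const uint32_t fields[4] = {
        c->peer_addr, c->peer_tag, c->my_tag, c->expires_at
    };
    uint32_t h = 2166136261u ^ secret;

    for (size_t i = 0; i < 4; i++) {
        for (int shift = 0; shift < 32; shift += 8) {
            h ^= (fields[i] >> shift) & 0xFFu;
            h *= 16777619u;
        }
    }
    return h ^ secret;
}

/* On INIT: build the cookie, hand it to the client, keep nothing. */
struct state_cookie make_cookie(uint32_t peer_addr, uint32_t peer_tag,
                                uint32_t my_tag, uint32_t now,
                                uint32_t lifetime, uint32_t secret)
{
    struct state_cookie c = { peer_addr, peer_tag, my_tag, now + lifetime, 0 };

    c.mac = toy_mac(&c, secret);
    return c;
}

/* On COOKIE-ECHO: only a valid cookie justifies allocating state. */
bool cookie_is_valid(const struct state_cookie *c, uint32_t now,
                     uint32_t secret)
{
    if (c->mac != toy_mac(c, secret))
        return false;            /* forged or modified */
    return now <= c->expires_at; /* stale cookies are rejected */
}
```

The design point is that the only thing the server stores between INIT and COOKIE-ECHO is its own secret, which is shared by all handshakes. A flood of INITs therefore costs CPU for the hash but no per-connection memory.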
5.5 Packet Reception and OOTB
When a packet arrives at the kernel, the entry point is sctp_rcv().
It first performs routine checks: is the packet long enough? Is the checksum correct?
Then, it encounters a concept very specific to SCTP: OOTB (Out of the Blue).
What is OOTB? It means the packet's format is completely correct and the checksum is valid, but the kernel searches through all its associations and simply cannot find which one it belongs to.
- It might be a late packet from an association that was torn down long ago.
- It might be a randomly sent probe packet.
What do we do when we encounter an OOTB packet? The sctp_rcv_ootb() function takes over. According to RFC 4960, it doesn't just ignore the packet; instead, it reacts based on the chunk type inside the packet. For example, if it's an ABORT chunk, it is discarded; if it's an INIT chunk, it might trigger a new association establishment attempt.
If a matching association is found, the packet is pushed into the association's receive queue, where sctp_assoc_bh_rcv() further processes the state machine.
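The receive-side decision described above can be approximated by the following user-space sketch. It is a simplification under stated assumptions: the lookup keys only on ports (the kernel also uses addresses), the checksum step is omitted, the minimum-length rule of "common header plus one chunk header" is an assumption of the sketch, and the special cases where a mismatched vtag is tolerated (such as an INIT carrying vtag 0) are ignored. The names `struct assoc`, `enum rx_verdict`, and `classify_packet` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for an association in the lookup table. */
struct assoc {
    uint16_t local_port;
    uint16_t peer_port;
    uint32_t my_vtag; /* tag the peer must place in packets sent to us */
};

enum rx_verdict { RX_DROP_SHORT, RX_OOTB, RX_DROP_BAD_VTAG, RX_DELIVER };

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decide what to do with one inbound packet. */
enum rx_verdict classify_packet(const uint8_t *pkt, size_t len,
                                const struct assoc *table, size_t n)
{
    uint16_t src, dst;
    uint32_t vtag;

    /* Need at least the 12-byte common header and a 4-byte chunk header. */
    if (len < 12 + 4)
        return RX_DROP_SHORT;

    src  = get_be16(pkt);
    dst  = get_be16(pkt + 2);
    vtag = get_be32(pkt + 4);

    for (size_t i = 0; i < n; i++) {
        if (table[i].local_port == dst && table[i].peer_port == src)
            return vtag == table[i].my_vtag ? RX_DELIVER
                                            : RX_DROP_BAD_VTAG;
    }
    /* Well-formed, but no association claims it: out of the blue. */
    return RX_OOTB;
}
```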
5.6 Packet Transmission Flow
When user space calls sendmsg() to send data, the kernel reaches sctp_sendmsg().
This is somewhat similar to TCP's packet transmission—it also finds the association and packages the data into chunks. But there's an extra state machine transition in the middle: sctp_primitive_SEND() -> sctp_do_sm().
sctp_do_sm() is a massive function that drives the entire SCTP state machine engine. After a series of complex judgments and side effect processing, the data chunks are ultimately handed off to sctp_packet_transmit(), which packages them into IP packets and sends them out.
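One way to picture a state machine engine of this kind is as a table lookup: the current state and the incoming event select a handler, and the handler decides what happens next. The sketch below shows only that dispatch pattern, using the client-side handshake states as an example. All names (`sm_table`, `sm_dispatch`, the `ST_*` and `EV_*` constants, the handlers) are hypothetical, and the real engine also processes side effects that are left out here.

```c
#include <stddef.h>

enum state { ST_CLOSED, ST_COOKIE_WAIT, ST_COOKIE_ECHOED, ST_ESTABLISHED, ST_MAX };
enum event { EV_ASSOCIATE, EV_RX_INIT_ACK, EV_RX_COOKIE_ACK, EV_MAX };

/* A handler performs its action and returns the next state. */
typedef enum state (*state_fn)(void);

static enum state do_send_init(void)        { return ST_COOKIE_WAIT; }
static enum state do_send_cookie_echo(void) { return ST_COOKIE_ECHOED; }
static enum state do_established(void)      { return ST_ESTABLISHED; }

/* Rows are events, columns are states; empty slots mean "ignore". */
static const state_fn sm_table[EV_MAX][ST_MAX] = {
    [EV_ASSOCIATE]     = { [ST_CLOSED]        = do_send_init },
    [EV_RX_INIT_ACK]   = { [ST_COOKIE_WAIT]   = do_send_cookie_echo },
    [EV_RX_COOKIE_ACK] = { [ST_COOKIE_ECHOED] = do_established },
};

/* Look up the handler for (event, state) and run it if there is one. */
enum state sm_dispatch(enum state cur, enum event ev)
{
    state_fn fn;

    if ((unsigned)cur >= (unsigned)ST_MAX || (unsigned)ev >= (unsigned)EV_MAX)
        return cur;
    fn = sm_table[ev][cur];
    return fn ? fn() : cur;
}
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to see at a glance which events are legal in which state.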
5.7 Heartbeats: Vital Sign Monitoring
Since SCTP supports multi-homing, how does it know if the network cable currently in use has been disconnected?
Through heartbeats.
At regular intervals (defaulting to 30 seconds, adjustable via /proc/sys/net/sctp/hb_interval), SCTP sends a HEARTBEAT chunk to one of the peer's addresses. Upon receiving it, the peer must reply with a HEARTBEAT-ACK.
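If heartbeats on a path go unanswered more times in a row than the configured limit (path_max_retrans, 5 by default), SCTP declares the path down and triggers an active_path switch. This is why mobile networks like LTE favor SCTP: when a path fails, TCP often stalls for a long time, whereas SCTP fails over to a backup path as soon as the threshold is crossed, and operators can tune the threshold and the retransmission timers to make that failover fast.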
Sending heartbeats is handled by sctp_sf_sendbeat_8_3() (the 8_3 in the function name refers to section 8.3 of RFC 4960).
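The failure-detection rule can be sketched as a per-path error counter. This is a user-space illustration, not the kernel's code: `struct hb_path`, `hb_timeout`, and `hb_ack` are hypothetical names, and the threshold of 5 is an assumption that mirrors the Path.Max.Retrans default suggested by RFC 4960.

```c
#include <stdbool.h>

#define PATH_MAX_RETRANS 5 /* assumed threshold, mirrors the RFC default */

struct hb_path {
    int  error_count; /* consecutive unanswered heartbeats */
    bool active;      /* false once the path is declared down */
};

/* Called when a heartbeat timer expires without a HEARTBEAT-ACK. */
void hb_timeout(struct hb_path *p)
{
    p->error_count++;
    /* The path goes inactive once the counter EXCEEDS the limit. */
    if (p->error_count > PATH_MAX_RETRANS)
        p->active = false;
}

/* Called when a HEARTBEAT-ACK arrives: the path is alive again. */
void hb_ack(struct hb_path *p)
{
    p->error_count = 0;
    p->active = true;
}
```

Note that a single HEARTBEAT-ACK resets the counter, so only an unbroken run of losses takes a path out of service.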
5.8 Multi-streaming: Solving Head-of-Line Blocking
This is one of SCTP's most fascinating features.
In the era of HTTP/1.1 pipelining, we suffered from head-of-line blocking: a browser sends requests A, B, and C. If packet A is lost, TCP must wait for A to be successfully retransmitted before it can deliver B and C to the application layer. B and C may have arrived long ago, but because TCP is a byte stream, it must deliver them in order.
SCTP solves this problem.
A single SCTP association can have multiple "streams." Each stream has its own independent sequence number (SSN).
- If a packet in stream 1 is lost, only stream 1 is blocked.
- Data from streams 2 and 3 are still delivered to the application as usual.
This is the significance of sinit_num_ostreams and sinit_max_instreams. During connection establishment, we negotiate: "I have 10 outbound streams, can you receive 10?"
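The per-stream ordering rule can be illustrated with a tiny user-space model. It is a sketch under simplifying assumptions: the stream count is fixed at compile time, messages that cannot be delivered yet are simply refused rather than queued, and the names `struct rx_streams` and `can_deliver` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_STREAMS 4

/* The next stream sequence number (SSN) each inbound stream expects. */
struct rx_streams {
    uint16_t next_ssn[NUM_STREAMS];
};

/*
 * Returns true if a message with this (stream, ssn) can be handed to
 * the application right away, false if it must wait for an earlier
 * message on the SAME stream. Other streams are never consulted.
 */
bool can_deliver(struct rx_streams *rx, uint16_t stream, uint16_t ssn)
{
    if (stream >= NUM_STREAMS || ssn != rx->next_ssn[stream])
        return false;
    rx->next_ssn[stream]++;
    return true;
}
```

If SSN 0 on stream 1 is lost, SSN 1 on stream 1 is held back, but messages on streams 2 and 3 keep flowing because each stream checks only its own counter.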
5.9 Multi-homing: More Than Just Looks
Finally, regarding multi-homing, there is a massive misconception.
Many people assume that as long as they bind() two IP addresses on the server, SCTP automatically gains fault tolerance.
Wrong.
SCTP implements "destination multi-homing". This means that the peer must also tell you it has multiple IP addresses, and you must add all of these addresses to peer.transport_addr_list for fault tolerance to take effect.
If your server knows 5 IP addresses but the client only gives you 1, then if the client loses network connectivity, there is nothing you can do. True fault tolerance means both sides expose multiple legs—if the left leg breaks, we walk on the right leg.
With this, the skeleton of this SCTP hybrid is roughly assembled. It isn't as ubiquitous as TCP, but in specific industrial-grade scenarios (telecommunications, aerospace and defense, high-frequency trading), it is the irreplaceable king.
In the next section, we'll look at the final transport layer protocol in this chapter: DCCP. It attempts to find a finer middle ground between UDP's real-time capabilities and TCP's congestion control.