11.2 Creating Sockets
In the previous section, we skimmed through the basic Socket types like reading a menu. Now, let's open the kernel's doors and see what actually happens inside the machine when you call that simple socket().
It's more complex than it looks—because the kernel needs to handle two things well: act like a file to the upper layers, and act like a network protocol endpoint to the lower layers. To manage these two completely different roles simultaneously, the kernel splits a socket into two structures: struct socket and struct sock.
These two names differ by only a single letter, and this "twin" design has confused countless beginners (including me). Let's break them apart.
Two Structures, Two Faces
There are actually two structures representing a socket in the kernel:
- struct socket: This is the interface facing user space, created by sys_socket().
- struct sock: This is the interface facing the network layer (L3). It lives inside the protocol layer and is protocol-agnostic.
Let's look at them one by one. First up is struct socket, which is the "face" the kernel presents to user space:
struct socket {
    socket_state state;
    kmemcheck_bitfield_begin(type);
    short type;
    kmemcheck_bitfield_end(type);
    unsigned long flags;
    . . .
    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};
Although this structure isn't long, every field is crucial:
- state: The socket's state, for example SS_UNCONNECTED (unconnected) or SS_CONNECTED (connected). When an INET socket is created, its initial state is SS_UNCONNECTED (see the inet_create() method). Once a stream socket (like TCP) successfully connects to a remote host, the state changes to SS_CONNECTED. These enum values are defined in include/uapi/linux/net.h.
- type: The socket type. This corresponds to the second argument you pass when calling socket(), such as SOCK_STREAM (stream) or SOCK_DGRAM (datagram).
- flags: Socket flags. For example, SOCK_EXTERNALLY_ALLOCATED is set when a TUN device allocates the socket itself, rather than through the normal socket() system call (see tun_chr_open() in drivers/net/tun.c).
- file: Points to the file structure associated with this socket. This is why sockets can be operated on with read()/write(): in the kernel's eyes, a socket is just a file.
- sk: This is the key. It points to a struct sock object, and struct sock is the one that actually represents the "network layer interface." When creating a socket, the kernel binds these two objects together. For example, in the IPv4 implementation, the inet_create() method allocates a sock object (sk) and associates it with the current socket object.
- ops: This is a proto_ops structure instance containing a set of callback functions, such as connect, listen, sendmsg, recvmsg, and so on.
  - These callbacks are the concrete implementations of the user space interface. For example, when you call write(), send(), sendto(), or sendmsg() in user space, they all ultimately land on the ops->sendmsg callback in the kernel. The same goes for recvmsg, which handles a series of calls like read() and recv().
  - Each protocol defines its own proto_ops. For TCP, its proto_ops has real inet_listen() and inet_accept() implementations; UDP doesn't need listen or accept at all, so these two callbacks are set to sock_no_listen() and sock_no_accept(). The only action these two functions take is to return -EOPNOTSUPP (operation not supported).
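The difference between the TCP and UDP callback tables can be observed from user space. The following is a minimal sketch, assuming a Linux host where AF_INET sockets can be created; listen_result() is a hypothetical helper name chosen for this illustration, not a kernel or libc function. It calls listen() on a freshly created socket and reports the errno that comes back.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an AF_INET socket of the given type,
 * call listen() on it, and return 0 on success or the errno value
 * that listen() reported.
 */
int listen_result(int type)
{
    int fd = socket(AF_INET, type, 0);
    assert(fd >= 0);        /* assumption: socket creation works here */

    int err = 0;
    if (listen(fd, 8) < 0)
        err = errno;        /* save errno before close() can change it */

    close(fd);
    return err;
}
```

On a typical Linux system, listen_result(SOCK_DGRAM) should report EOPNOTSUPP, the value that sock_no_listen() hands back, while listen_result(SOCK_STREAM) should report 0 because the stream socket has a real listen implementation behind it.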
Diving into the Network Layer: struct sock
Now, let's turn our attention to the lower-level and more complex guy—struct sock.
If struct socket is the "facade," then struct sock is the "engine room." It is the network layer's (L3) representation of a socket. This structure is very long, so we'll only pick out the fields most relevant to our discussion:
struct sock {
    struct sk_buff_head sk_receive_queue;
    int sk_rcvbuf;
    unsigned long sk_flags;
    int sk_sndbuf;
    struct sk_buff_head sk_write_queue;
    . . .
    unsigned int sk_shutdown : 2,
                 sk_no_check : 2,
                 sk_protocol : 8,
                 sk_type     : 16;
    . . .
    void (*sk_data_ready)(struct sock *sk, int bytes);
    void (*sk_write_space)(struct sock *sk);
};
- sk_receive_queue: The receive queue. All incoming packets are hung here first, waiting to be read by user space.
- sk_rcvbuf: The size of the receive buffer (in bytes).
- sk_flags: Various flags, such as SOCK_DEAD or SOCK_DBG.
- sk_sndbuf: The size of the send buffer (in bytes).
- sk_write_queue: The send queue. Packets ready to be sent line up here.

  ⚠️ Note: In the later "TCP Socket Initialization" section, we'll discuss in detail how these two buffers are initialized and how to modify them through the /proc interface. For now, you just need to know they exist.

- sk_no_check: Disable checksum flag. Can be set via the SO_NO_CHECK socket option.
- sk_protocol: Protocol identifier. This value is set based on the third argument you pass when calling socket().
- sk_type: Socket type (this again, because the lower layer also needs to know if you're a stream or a datagram).
- sk_data_ready: A callback function. When new data arrives, the kernel calls it to notify the socket, "Hey, we've got goods."
- sk_write_space: A callback function. When space frees up in the send buffer, the kernel calls it to notify, "You can keep writing."
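Some of these fields can be read back from user space with getsockopt() at the SOL_SOCKET level. The sketch below assumes a Linux host, where SO_TYPE, SO_RCVBUF, and SO_SNDBUF report the socket's type and its receive and send buffer sizes. The struct sock_view type and both helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: read one integer SOL_SOCKET option, -1 on failure. */
int get_int_sockopt(int fd, int optname)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, optname, &value, &len) < 0)
        return -1;
    return value;
}

/* Hypothetical snapshot of a few values the kernel keeps for a socket. */
struct sock_view {
    int type;    /* reported by SO_TYPE   */
    int rcvbuf;  /* reported by SO_RCVBUF */
    int sndbuf;  /* reported by SO_SNDBUF */
};

/* Create a UDP socket and query the three values above. */
struct sock_view inspect_udp_socket(void)
{
    struct sock_view v = { -1, -1, -1 };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    assert(fd >= 0);    /* assumption: socket creation works here */

    v.type = get_int_sockopt(fd, SO_TYPE);
    v.rcvbuf = get_int_sockopt(fd, SO_RCVBUF);
    v.sndbuf = get_int_sockopt(fd, SO_SNDBUF);

    close(fd);
    return v;
}
```

The exact buffer sizes depend on the system's configuration, so the only safe expectation is that both are positive and that the type comes back as SOCK_DGRAM.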
Hands-on Time: The socket() System Call
We've talked theory long enough; now let's look at the actual operation. Here's how you initiate the request in user space:
sockfd = socket(int socket_family, int socket_type, int protocol);
What do these three parameters represent?
- socket_family (address family): For example, AF_INET (IPv4), AF_INET6 (IPv6), or AF_UNIX for local communication.
- socket_type (socket type): SOCK_STREAM (stream), SOCK_DGRAM (datagram), or SOCK_RAW (raw socket).
- protocol (protocol):
  - For TCP, pass 0 or IPPROTO_TCP.
  - For UDP, pass 0 or IPPROTO_UDP.
  - For raw sockets, you must pass an explicit IP protocol identifier here (such as IPPROTO_ICMP), as defined in RFC 1700.
On a successful call, the return value sockfd is a file descriptor. From then on, this fd is your sole key to operating this Socket.
So, how does the kernel catch this call? It is handled by the sys_socket() method. Let's look at a snippet of its source code (with irrelevant error handling removed):
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
    int retval;
    struct socket *sock;
    int flags;
    . . .
    retval = sock_create(family, type, protocol, &sock);
    if (retval < 0)
        goto out;
    . . .
    retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
    if (retval < 0)
        goto out_release;
out:
    . . .
    return retval;
}
What happens here is actually quite clear:
- sock_create(): This does the heavy lifting. It allocates the struct socket itself and then calls the creation method of the requested address family. For IPv4, that's inet_create() (see net/ipv4/af_inet.c).
  - Inside inet_create(), the kernel allocates and initializes the associated struct sock (that is, the sk object) and attaches it to the struct socket mentioned above.
- sock_map_fd(): Once allocation is done, we need to issue a credential. This method returns a file descriptor (fd) and associates the fd with the socket structure just created. This fd is ultimately what ends up in your hands as sockfd.
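The flags that sock_map_fd() receives can be influenced from user space: on Linux, socket(2) documents that SOCK_NONBLOCK and SOCK_CLOEXEC may be OR-ed into the type argument. The following sketch, with a hypothetical helper name and an ad hoc two-bit return encoding, checks whether those bits show up as properties of the returned descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a UDP socket with extra bits OR-ed into
 * the type argument, then inspect the returned descriptor.
 * Result bit 0: O_NONBLOCK is set on the open file.
 * Result bit 1: FD_CLOEXEC is set on the descriptor.
 * Returns -1 if socket() fails.
 */
int fd_flags_after_socket(int extra_type_bits)
{
    int fd = socket(AF_INET, SOCK_DGRAM | extra_type_bits, 0);
    if (fd < 0)
        return -1;

    int status_flags = fcntl(fd, F_GETFL);  /* file status flags */
    int fd_flags = fcntl(fd, F_GETFD);      /* descriptor flags  */
    close(fd);
    assert(status_flags >= 0 && fd_flags >= 0);

    int result = 0;
    if (status_flags & O_NONBLOCK)
        result |= 1;
    if (fd_flags & FD_CLOEXEC)
        result |= 2;
    return result;
}
```

Calling fd_flags_after_socket(SOCK_NONBLOCK | SOCK_CLOEXEC) should report both bits, while fd_flags_after_socket(0) should report neither, which matches the masking with (O_CLOEXEC | O_NONBLOCK) in the snippet above.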
The Data Suitcase: struct msghdr
Now that the Socket is created, the next step is sending and receiving data. Inside the kernel, the core functions handling send and receive are sendmsg() and recvmsg(), respectively. You'll notice that both functions rely heavily on a single structure: struct msghdr.
You can think of it as the "suitcase" data uses when traveling between user space and the kernel. This suitcase doesn't just hold the actual data (msg_iov); it also carries some control information (like which network card to send on, or what the source IP is).
struct msghdr {
    void *msg_name;                 /* Socket name */
    int msg_namelen;                /* Length of name */
    struct iovec *msg_iov;          /* Data blocks */
    __kernel_size_t msg_iovlen;     /* Number of blocks */
    void *msg_control;              /* Per protocol magic (eg BSD file descriptor passing) */
    __kernel_size_t msg_controllen; /* Length of cmsg list */
    unsigned int msg_flags;
};
Every compartment in this suitcase has its purpose:
- msg_name: The destination socket address. This is actually a void* pointer, and you typically need to cast it to struct sockaddr_in* to get the destination IP and port (see its usage in udp_sendmsg()).
- msg_namelen: The length of the address.
- msg_iov: This is an iovec vector, the actual payload area. It supports scatter/gather I/O, meaning you can send multiple non-contiguous memory blocks in one shot without having to concatenate them yourself first.
- msg_iovlen: The number of data blocks.
- msg_control: Auxiliary data. This is the so-called "control information," such as special packet options (for example IP_PKTINFO, covered in later chapters).
- msg_controllen: The length of the auxiliary data.

  ⚠️ Note: The length of the auxiliary buffer the kernel can handle is limited, and this limit is determined by sysctl_optmem_max (path at /proc/sys/net/core/optmem_max). Don't try to stuff a ton of control information in there.

- msg_flags: Message flags. On the send path this field carries flags such as MSG_MORE (meaning "don't rush, there's more data coming, wait before sending"); on the receive path it reports flags about the received message, such as MSG_TRUNC.
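User space has its own struct msghdr with the same field names (the exact types differ slightly from the kernel version shown above), which makes scatter/gather easy to try out. The sketch below is an illustration under stated assumptions: it uses an AF_UNIX datagram socketpair() so it needs no network setup, and gather_send_demo() is a hypothetical name. Two separate buffers go out as one datagram and arrive as one contiguous message.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send two separate buffers as ONE datagram with sendmsg() (gather),
 * then receive that datagram into a single buffer with recvmsg().
 * Returns the number of bytes received, or -1 on error.
 */
ssize_t gather_send_demo(char *out, size_t outlen)
{
    assert(out != NULL);

    char part1[] = "hello, ";
    char part2[] = "world";
    struct iovec send_iov[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };
    size_t total = send_iov[0].iov_len + send_iov[1].iov_len;

    struct msghdr send_msg;
    memset(&send_msg, 0, sizeof(send_msg)); /* no name, no control data */
    send_msg.msg_iov = send_iov;
    send_msg.msg_iovlen = 2;

    struct iovec recv_iov = { .iov_base = out, .iov_len = outlen };
    struct msghdr recv_msg;
    memset(&recv_msg, 0, sizeof(recv_msg));
    recv_msg.msg_iov = &recv_iov;
    recv_msg.msg_iovlen = 1;

    int sv[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    ssize_t n = -1;
    if (sendmsg(sv[0], &send_msg, 0) == (ssize_t)total)
        n = recvmsg(sv[1], &recv_msg, 0);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The sender never concatenates part1 and part2 itself; the two iovec entries describe where the pieces live, and the kernel assembles them into a single message.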
Chapter Roadmap
Alright, now we not only have a Socket in the kernel, but we also know how data gets packed into a msghdr. The preparations are complete.
Next, we'll formally dive into the hands-on implementation of transport layer (L4) protocols. We'll start with the simplest one—UDP. It's the most straightforward of all protocols: no handshakes, no complex retransmission mechanisms, making it the perfect first stepping stone for understanding the kernel's network subsystem.