11.2 Creating Sockets

In the previous section, we skimmed through the basic Socket types like reading a menu. Now, let's open the kernel's doors and see what actually happens inside the machine when you call that simple socket().

It's more complex than it looks—because the kernel needs to handle two things well: act like a file to the upper layers, and act like a network protocol endpoint to the lower layers. To manage these two completely different roles simultaneously, the kernel splits a socket into two structures: struct socket and struct sock.

These two names differ by only a single letter, and this "twin" design has confused countless beginners (including me). Let's break them apart.


Two Structures, Two Faces

There are actually two structures representing a socket in the kernel:

  1. struct socket: This is the interface facing user space, created by sys_socket().
  2. struct sock: This is the network layer (L3) representation of the socket. It lives on the protocol side of the stack. The structure itself is protocol-agnostic, and protocol-specific socket types are built on top of it.

Let's look at them one by one. First up is struct socket, which is the "face" the kernel presents to user space:

struct socket {
    socket_state state;

    kmemcheck_bitfield_begin(type);
    short type;
    kmemcheck_bitfield_end(type);

    unsigned long flags;

    . . .

    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};

Although this structure isn't long, every field is crucial:

  • state: The socket's state. For example, SS_UNCONNECTED (unconnected) or SS_CONNECTED (connected). When an INET socket is created, its initial state is SS_UNCONNECTED (see the inet_create() method). Once a stream socket (like TCP) successfully connects to a remote host, the state changes to SS_CONNECTED. These enum values are defined in include/uapi/linux/net.h.
  • type: The socket type. This corresponds to the second argument you pass when calling socket(), such as SOCK_STREAM (stream) or SOCK_DGRAM (datagram).
  • flags: Socket flags. For example, SOCK_EXTERNALLY_ALLOCATED, which is set when a TUN device allocates a socket, rather than through the normal socket() system call (see tun_chr_open() in drivers/net/tun.c).
  • file: Points to the file structure associated with this socket. This is why Sockets can be operated on with read()/write()—in the kernel's eyes, it's just a file.
  • sk: This is the key. It points to a struct sock object. struct sock is the one that actually represents the "network layer interface." When creating a Socket, the kernel binds these two objects together. For example, in the IPv4 implementation, the inet_create() method allocates a sock object (sk) and associates it with the current socket object.
  • ops: This is a proto_ops structure instance containing a bunch of callback functions, such as connect, listen, sendmsg, recvmsg, and so on.
    • These callbacks are the concrete implementations of the user space interface. For example, when you call write(), send(), sendto(), or sendmsg() in user space, they all ultimately land on the ops->sendmsg callback in the kernel. The same goes for recvmsg, which handles a series of calls like read() and recv().
    • Each protocol defines its own proto_ops. For TCP, its proto_ops has real inet_listen() and inet_accept() implementations; but for UDP, it doesn't need listen or accept at all, so these two callbacks are set to sock_no_listen() and sock_no_accept()—the only action these two functions take is to return -EOPNOTSUPP (operation not supported).
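The difference between the TCP and UDP proto_ops can be observed from user space. The following is a minimal sketch, assuming a Linux system; try_listen() is a hypothetical helper written for this example, not a kernel or libc function. It calls listen() on a freshly created socket and reports the resulting errno.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Creates an AF_INET socket of the given type, calls listen() on it,
 * and returns 0 on success or the errno value on failure.
 * Hypothetical helper for illustration only.
 */
int try_listen(int type)
{
    int fd = socket(AF_INET, type, 0);
    int err = 0;

    if (fd < 0)
        return errno;

    /* For SOCK_DGRAM this is expected to end up in sock_no_listen(). */
    if (listen(fd, 1) < 0)
        err = errno;

    close(fd);
    return err;
}
```

Calling try_listen(SOCK_DGRAM) should yield EOPNOTSUPP, which is the user-space view of the -EOPNOTSUPP returned by sock_no_listen(), while try_listen(SOCK_STREAM) is expected to succeed.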

Diving into the Network Layer: struct sock

Now, let's turn our attention to the lower-level and more complex guy—struct sock.

If struct socket is the "facade," then struct sock is the "engine room." It is the network layer's (L3) representation of a socket. This structure is very long, so we'll only pick out the fields most relevant to our discussion:

struct sock {
    struct sk_buff_head sk_receive_queue;
    int sk_rcvbuf;

    unsigned long sk_flags;

    int sk_sndbuf;
    struct sk_buff_head sk_write_queue;
    . . .
    unsigned int sk_shutdown : 2,
                 sk_no_check : 2,
                 sk_protocol : 8,
                 sk_type     : 16;
    . . .

    void (*sk_data_ready)(struct sock *sk, int bytes);
    void (*sk_write_space)(struct sock *sk);
};
  • sk_receive_queue: The receive queue. All incoming packets are queued here first, waiting to be read by user space.
  • sk_rcvbuf: The size of the receive buffer (in bytes).
  • sk_flags: Various flags, such as SOCK_DEAD or SOCK_DBG.
  • sk_sndbuf: The size of the send buffer (in bytes).
  • sk_write_queue: The send queue. Packets ready to be sent line up here.
    • ⚠️ Note: In the later "TCP Socket Initialization" section, we'll discuss in detail how the two buffer sizes, sk_rcvbuf and sk_sndbuf, are initialized and how to modify them through the /proc interface. For now, you just need to know they exist.

  • sk_no_check: Disable checksum flag. Can be set via the SO_NO_CHECK socket option.
  • sk_protocol: Protocol identifier. This value is set based on the third argument you pass when calling socket().
  • sk_type: Socket type. It mirrors the type field of struct socket, because the lower layer also needs to know whether it is dealing with a stream or a datagram socket.
  • sk_data_ready: A callback function. When new data arrives, the kernel calls it to notify the socket, "Hey, we've got goods."
  • sk_write_space: A callback function. When space frees up in the send buffer, the kernel calls it to notify, "You can keep writing."

Hands-on Time: The socket() System Call

We've talked theory long enough; now let's look at the actual operation. Here's how you initiate the request in user space:

sockfd = socket(int socket_family, int socket_type, int protocol);

What do these three parameters represent?

  • socket_family (address family): For example, AF_INET (IPv4), AF_INET6 (IPv6), or AF_UNIX for local communication.
  • socket_type (socket type): SOCK_STREAM (stream), SOCK_DGRAM (datagram), or SOCK_RAW (raw socket).
  • protocol (protocol):
    • For TCP, pass 0 or IPPROTO_TCP.
    • For UDP, pass 0 or IPPROTO_UDP.
    • For Raw Sockets, you must pass an explicit IP protocol identifier here (such as IPPROTO_ICMP). The protocol numbers were listed in RFC 1700, which has since been replaced by the IANA online registry (see RFC 3232).

On a successful call, the return value sockfd is a file descriptor. From then on, this fd is your sole key to operating this Socket.

So, how does the kernel catch this call? It is handled by the sys_socket() method. Let's look at a snippet of its source code (with irrelevant error handling removed):

SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
    int retval;
    struct socket *sock;
    int flags;

    . . .
    retval = sock_create(family, type, protocol, &sock);
    if (retval < 0)
        goto out;
    . . .
    retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
    if (retval < 0)
        goto out_release;
out:
    . . .
    return retval;
}

What happens here is actually quite clear:

  1. sock_create(): This does the heavy lifting. It first allocates the struct socket, then calls the creation method registered for the address family. For IPv4, that's inet_create() (see net/ipv4/af_inet.c).
    • Inside inet_create(), the kernel allocates and initializes the associated struct sock (that is, the sk object) and links it to the struct socket that was just allocated.
  2. sock_map_fd(): Once allocation is done, we need to issue a credential. This method returns a file descriptor (fd) and associates the fd with the socket structure just created. This fd is ultimately what ends up in your hands as sockfd.
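The flags forwarded to sock_map_fd() have a visible counterpart in user space: on Linux, socket(2) allows SOCK_NONBLOCK and SOCK_CLOEXEC to be ORed into the type argument. The following is a minimal sketch, assuming Linux; struct fd_flags and probe_socket_flags() are hypothetical names for this example. It checks that those bits end up as properties of the file descriptor rather than of the socket type.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* What the kernel reports for a newly created socket (hypothetical struct). */
struct fd_flags {
    int nonblock;  /* 1 if O_NONBLOCK is set on the open file */
    int cloexec;   /* 1 if FD_CLOEXEC is set on the descriptor */
    int type;      /* socket type as reported by SO_TYPE */
};

/*
 * Creates a TCP socket with extra_type_flags ORed into the type argument,
 * records the resulting descriptor flags and socket type in *out, and
 * closes the socket. Returns 0 on success, -1 on failure.
 */
int probe_socket_flags(int extra_type_flags, struct fd_flags *out)
{
    int fd = socket(AF_INET, SOCK_STREAM | extra_type_flags, 0);
    socklen_t len = sizeof(out->type);
    int status_flags, fd_flags, rc;

    if (fd < 0)
        return -1;

    status_flags = fcntl(fd, F_GETFL);
    fd_flags = fcntl(fd, F_GETFD);
    rc = getsockopt(fd, SOL_SOCKET, SO_TYPE, &out->type, &len);
    close(fd);

    if (status_flags < 0 || fd_flags < 0 || rc < 0)
        return -1;

    out->nonblock = (status_flags & O_NONBLOCK) != 0;
    out->cloexec = (fd_flags & FD_CLOEXEC) != 0;
    return 0;
}
```

With SOCK_NONBLOCK | SOCK_CLOEXEC passed in, both flags should be reported as set while the type is still plain SOCK_STREAM; with 0 passed in, neither flag should be set.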

The Data Suitcase: struct msghdr

Now that the Socket is created, the next step is sending and receiving data. Inside the kernel, the core functions handling send and receive are sendmsg() and recvmsg(), respectively. You'll notice that both functions rely heavily on a single structure: struct msghdr.

You can think of it as the "suitcase" data uses when traveling between user space and the kernel. This suitcase doesn't just hold the actual data (msg_iov); it also carries some control information (like which network card to send on, or what the source IP is).

struct msghdr {
    void *msg_name;                /* Socket name */
    int msg_namelen;               /* Length of name */
    struct iovec *msg_iov;         /* Data blocks */
    __kernel_size_t msg_iovlen;    /* Number of blocks */
    void *msg_control;             /* Per protocol magic (eg BSD file descriptor passing) */
    __kernel_size_t msg_controllen; /* Length of cmsg list */
    unsigned int msg_flags;
};

Every compartment in this suitcase has its purpose:

  • msg_name: The socket address; on the send path, this is the destination address. It is a void * pointer, and for IPv4 you typically cast it to struct sockaddr_in * to get the destination IP and port (see its usage in udp_sendmsg()).
  • msg_namelen: The length of the address.
  • msg_iov: This is an iovec vector, the actual payload area. It supports scatter/gather I/O, meaning you can send multiple non-contiguous memory blocks in one shot without having to concatenate them yourself first.
  • msg_iovlen: The number of data blocks.
  • msg_control: Ancillary data. This is the so-called "control information," used for passing special per-packet options (such as IP_PKTINFO, covered in later chapters).
  • msg_controllen: The length of the auxiliary data.
    • ⚠️ Note: The length of the auxiliary buffer the kernel can handle is limited, and this limit is determined by sysctl_optmem_max (path at /proc/sys/net/core/optmem_max). Don't try to stuff a ton of control information in there.

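  • msg_flags: Per-message flags. On the send path these include flags such as MSG_MORE (meaning "don't rush, there's more data coming, wait before sending"). On the receive path, the kernel uses this field to report flags about the received message, such as MSG_TRUNC.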
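The scatter/gather behavior of msg_iov is easiest to see from user space. The following is a minimal sketch, assuming a POSIX system; gather_roundtrip() is a hypothetical helper written for this example. Note that it uses the user-space struct msghdr from <sys/socket.h>, whose field types differ slightly from the kernel definition shown above. Two separate buffers are sent as a single datagram with sendmsg() and read back into one buffer with recvmsg().

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Sends part1 and part2 as ONE datagram without concatenating them first,
 * then receives that datagram into out (NUL-terminated).
 * Returns the number of bytes received, or -1 on failure.
 */
ssize_t gather_roundtrip(const char *part1, const char *part2,
                         char *out, size_t outlen)
{
    int sv[2];
    ssize_t n = -1;

    if (outlen == 0 || socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Two non-contiguous blocks described by one iovec array. */
    struct iovec tx[2] = {
        { .iov_base = (void *)part1, .iov_len = strlen(part1) },
        { .iov_base = (void *)part2, .iov_len = strlen(part2) },
    };
    struct msghdr txmsg;
    memset(&txmsg, 0, sizeof(txmsg));
    txmsg.msg_iov = tx;
    txmsg.msg_iovlen = 2;

    if (sendmsg(sv[0], &txmsg, 0) >= 0) {
        /* Receive side: a single block, leaving room for the NUL byte. */
        struct iovec rx = { .iov_base = out, .iov_len = outlen - 1 };
        struct msghdr rxmsg;
        memset(&rxmsg, 0, sizeof(rxmsg));
        rxmsg.msg_iov = &rx;
        rxmsg.msg_iovlen = 1;

        n = recvmsg(sv[1], &rxmsg, 0);
        if (n >= 0)
            out[n] = '\0';
    }

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

msg_name and msg_control are left zeroed here because the socket pair is already connected and no ancillary data is sent; a UDP sender that addresses each datagram individually would fill in msg_name and msg_namelen as well.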

Chapter Roadmap

Alright, now we not only have a Socket in the kernel, but we also know how data gets packed into a msghdr. The preparations are complete.

Next, we'll formally dive into the hands-on implementation of transport layer (L4) protocols. We'll start with the simplest one—UDP. It's the most straightforward of all protocols: no handshakes, no complex retransmission mechanisms, making it the perfect first stepping stone for understanding the kernel's network subsystem.