
2.2 Kernel Netlink Sockets

In the previous section, we discussed the userspace tools—iproute2, net-tools, and the handy libnl and libmnl libraries. But on the other side of the fence, how does the kernel view all of this?

When you type ip link add in a terminal, what happens inside the kernel is far more complex than simply "creating a virtual network interface." In this section, we dive into the kernel internals to see how this Netlink engine starts up, registers its handlers, and throws messages back to userspace.


Within the kernel's Network Stack, we create several different Netlink sockets, each responsible for handling different types of messages.

For example, the NETLINK_ROUTE socket, which specifically handles routing and link messages, is created inside the rtnetlink_net_init() function. Its code is highly typical—almost a standard template for kernel Netlink initialization:

static int __net_init rtnetlink_net_init(struct net *net)
{
	...
	struct netlink_kernel_cfg cfg = {
		.groups   = RTNLGRP_MAX,
		.input    = rtnetlink_rcv,
		.cb_mutex = &rtnl_mutex,
		.flags    = NL_CFG_F_NONROOT_RECV,
	};

	sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
	...
}

But here is a frequently overlooked detail: this socket is network namespace aware.

Notice the struct net *net parameter in the function signature—that's the network namespace object. Inside this object, there is a member called rtnl, which is a pointer to this rtnetlink socket. After rtnetlink_net_init() calls netlink_kernel_create() to create the socket, it assigns the return value to net->rtnl.

What does this mean? It means that when you configure a network interface in Container A, the kernel only looks for this socket within the struct net corresponding to Container A; the host or other containers remain completely unaffected. Netlink was designed to work with network namespaces from the very beginning, not bolted on later as a patch.


Let's zoom in and look at the parameter list of this netlink_kernel_create() "routine":

struct sock *netlink_kernel_create(struct net *net, int unit,
                                   struct netlink_kernel_cfg *cfg);

The first parameter, net, as we just mentioned, is the network namespace.

The second parameter, unit, is the Netlink protocol number.

  • Want to handle routing? Use NETLINK_ROUTE.
  • Want to handle IPsec? Use NETLINK_XFRM.
  • Want to handle audit logs? Use NETLINK_AUDIT.

There's a catch here: although the kernel defines over 20 protocols, the total number is capped at 32 (MAX_LINKS). This is why Generic Netlink was introduced later—the standard protocol numbers ran out, and a "universal outlet" was needed as a fallback. The full list of protocols is in include/uapi/linux/netlink.h.

The third parameter, cfg, is a configuration structure that tells the kernel: "The socket I want must be built to these specifications."


Let's look at the definition of this structure. Behind every field lies a set of rules:

struct netlink_kernel_cfg {
	unsigned int groups;
	unsigned int flags;
	void (*input)(struct sk_buff *skb);
	struct mutex *cb_mutex;
	void (*bind)(int group);
};

1. groups: Multicast Group Mask

This field specifies the multicast group (or rather, the multicast group mask). In the old days, userspace programs joined multicast groups by setting the nl_groups field inside sockaddr_nl (or by using libnl's nl_join_groups()). However, this approach was limited to 32 groups.

The turning point: starting from kernel 2.6.14, things changed.

You can use setsockopt with the NETLINK_ADD_MEMBERSHIP / NETLINK_DROP_MEMBERSHIP options to join or leave a multicast group. This trick removes the 32-group limit. Under the hood, libnl's nl_socket_add_memberships() and nl_socket_drop_memberships() methods use this new mechanism.

2. flags: The Permission Control Switch

flags is a bitmask that may contain NL_CFG_F_NONROOT_RECV, NL_CFG_F_NONROOT_SEND, or both.

If NL_CFG_F_NONROOT_RECV is set, unprivileged users (non-root) are allowed to bind to a multicast group. You can see this check in the kernel's netlink_bind() code:

static int netlink_bind(struct socket *sock, struct sockaddr *addr,
			int addr_len)
{
	...
	if (nladdr->nl_groups) {
		if (!netlink_capable(sock, NL_CFG_F_NONROOT_RECV))
			return -EPERM;
	}
	...
}

Without this flag set, when an unprivileged user tries to bind to a multicast group, netlink_capable() returns 0, and the kernel immediately returns -EPERM (Permission denied).

Similarly, NL_CFG_F_NONROOT_SEND controls whether unprivileged users are allowed to send multicast messages.

3. input: The Soul of the Callback

This is the most critical field. If the input callback in netlink_kernel_cfg is NULL, this kernel socket will be unable to receive data from userspace (though it can still send data from the kernel to userspace).

For the rtnetlink socket, we specified rtnetlink_rcv as the callback. This means that all messages sent from userspace via the rtnetlink socket will ultimately be handed off to the rtnetlink_rcv() function for processing.

The reverse case: for uevent (kernel event notifications), we only need one-way communication—from the kernel to userspace. So in lib/kobject_uevent.c, the uevent socket configuration looks like this:

static int uevent_net_init(struct net *net)
{
	struct uevent_sock *ue_sk;
	struct netlink_kernel_cfg cfg = {
		.groups = 1,
		.flags  = NL_CFG_F_NONROOT_RECV,
	};

	...
	ue_sk->sk = netlink_kernel_create(net, NETLINK_KOBJECT_UEVENT, &cfg);
	...
}

Notice anything? There is no input callback. It's like a broadcast speaker that only transmits but never listens.

4. cb_mutex: The Mystery of the Lock

cb_mutex is an optional mutex.

If you leave it as NULL, the kernel falls back to the default cb_def_mutex lock, which is kept per protocol in the netlink table (net/netlink/af_netlink.c). In practice, most kernel Netlink sockets leave this field blank and simply use the default.

For example, the uevent socket (NETLINK_KOBJECT_UEVENT) and the audit socket (NETLINK_AUDIT) mentioned earlier both don't bother specifying a lock.

But rtnetlink is an exception. It uses its own lock, rtnl_mutex. Why? Because rtnetlink operations are too frequent and too critical—it doesn't want to contend for a lock with Netlink operations from other subsystems. Another exception is Generic Netlink, which also has its own lock, genl_mutex.


Unregistration and Lookup

When netlink_kernel_create() executes, it calls the netlink_insert() method to register itself in a global table called nl_table. Access to this table is protected by the read-write lock nl_table_lock.

Later, if someone needs to find this socket, they can use the netlink_lookup() method, specifying the protocol number and Port ID to locate it.


Registering Message Handling Callbacks

Having a socket alone isn't enough; you need to tell the kernel: "When you receive this type of message, call this function; when you receive that type, call that function."

This is exactly what rtnl_register() does. Throughout the networking kernel code, these callbacks are registered everywhere.

For example, in rtnetlink_init(), these message handlers are registered:

  • RTM_NEWLINK: Create a new link
  • RTM_DELLINK: Delete a link
  • RTM_GETROUTE: Dump the routing table

And in net/core/neighbour.c, neighbor-related handlers are registered:

  • RTM_NEWNEIGH: Create a new neighbor
  • RTM_DELNEIGH: Delete a neighbor
  • RTM_GETNEIGHTBL: Dump the neighbor table

(We'll dive deep into these operations in Chapters 5 and 7). Additionally, the FIB code, multicast code, and IPv6 code all register their respective callbacks.

Let's look at the prototype of rtnl_register():

extern void rtnl_register(int protocol, int msgtype,
			  rtnl_doit_func,
			  rtnl_dumpit_func,
			  rtnl_calcit_func);

Here is the parameter breakdown:

  1. protocol: The protocol family. If not targeting a specific protocol, this is typically set to PF_UNSPEC. The full list of protocol families is in include/linux/socket.h.
  2. msgtype: The Netlink message type. For example, RTM_NEWLINK, RTM_NEWNEIGH. These are rtnetlink-specific extended types, all defined in include/uapi/linux/rtnetlink.h.
  3. The last three parameters: These are all callback function pointers.
    • doit: Used for "create, delete, update" operations.
    • dumpit: Used for "query/dump" operations.
    • calcit: Used to calculate the required buffer size.

Usually, you only need to specify either doit or dumpit.

Inside the rtnetlink module, there is a large table called rtnl_msg_handlers. This table is indexed by protocol number. Each entry in this table is itself a sub-table, indexed by message type. Finally, the elements of these sub-tables are rtnl_link structures, which hold the pointers to these three callback functions.

So, when you call rtnl_register(), you are essentially plugging a function pointer into a specific cell of this multi-dimensional array.

For example, in net/core/rtnetlink.c, there is a call like this:

rtnl_register(PF_UNSPEC, RTM_NEWLINK, rtnl_newlink, NULL, NULL);

This line means: regardless of the protocol family (PF_UNSPEC), as long as a RTM_NEWLINK message is received, call the rtnl_newlink function.


Conversely, what if the kernel needs to notify userspace?

Sending rtnetlink messages typically uses the rtmsg_ifinfo() method.

For example, when a device is brought up (dev_open()), the kernel creates a new link and then calls:

rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING);

What happens behind the scenes? This can be broken down into four consecutive steps (see Figure 2-2):

  1. Allocate space: Calls nlmsg_new() to allocate a properly sized sk_buff (the buffer used in the kernel to hold network packets).
  2. Build the header: Creates two core objects—the Netlink message header (nlmsghdr) and the immediately following ifinfomsg structure.
  3. Fill in the data: Calls rtnl_fill_ifinfo() to populate the specific network interface information (state, Flags, etc.).
  4. Deliver: Calls rtnl_notify() to send the packet. Internally, this function ultimately calls the generic nlmsg_notify() (defined in net/netlink/af_netlink.c) to actually push the packet out.

Figure 2-2. Sending of rtnetlink messages with the rtmsg_ifinfo() method


So far, we've been discussing the "mechanism"—how sockets are created and how functions are registered. But whether it's the kernel or userspace sending the data, what the packet itself looks like is the essence of the communication.

In the next section, we'll stop circling around the code and instead tear a packet open directly, staring right at the bytes inside. Welcome to the world of Netlink message headers.