14. Namespaces Implementation

Chapter Intro: Invisible Walls

Imagine you're debugging a network service on a server. You follow the documentation and modify the global routing table, hit Enter, and—bam—one second later, the monitoring alerts go off. It's not your service that crashed; it's the database running next door that suddenly lost connectivity.

Why? Because you both share the same routing table.

This is the "default behavior" of traditional Linux systems: all processes live in the same global view. The same network stack, the same filesystem mount tree, the same PID space. This was fine for the single-machine era, but for containerization, microservices, or even just isolating test environments on the same machine, it's like dancing in a minefield.

What we want is a more fine-grained isolation mechanism. Not the heavy, complete isolation of virtual machines, but a lightweight illusion that makes a process think it has the entire system to itself.

Namespaces are Linux's answer. They allow the system to virtualize multiple isolated "global" resource views within a single kernel.

But how does the kernel actually do this? How does it know that the eth0 seen by Process A and the eth0 seen by Process B aren't the same thing? When we create a container, what exactly happens under the hood in the kernel?

In this chapter, we're going to strip away this illusion and look directly at the kernel's skeleton. We'll see what new data structures the kernel introduced to achieve this isolation, what new system calls were added, and—most importantly—how the existing architecture was cleverly refactored to accommodate these new features.

This isn't just the underlying principle of containers; it's the key to understanding modern Linux resource management.


14.1 Namespaces Implementation: When "Global" Is No Longer Global

To date, the Linux kernel implements six types of namespaces. Fitting these features into the existing kernel architecture while maintaining high performance and maintainability is no easy task. Kernel developers made a series of architectural adjustments and additions, primarily to support namespaces at the kernel level and provide an operational interface to user space.

Let's break down these changes.

nsproxy: The Proxy for Performance

First up is a structure called nsproxy (namespace proxy).

You might ask, why not just stuff six pointers into the Process Descriptor task_struct, each pointing to a namespace? That was my first thought, too.

But there's a subtle performance consideration here. The task_struct does need an entry point to namespaces, but if they were six independent pointers, every time fork() creates a child process that inherits the parent's environment, we'd have to increment the reference count (get operation) for all six namespaces individually. On a frequently called path, this overhead is not negligible.

So, nsproxy was introduced as an optimization. It acts like a bundle, packaging five namespace pointers together (note: five, not six).

Why five?

This is where the User Namespace is special. The nsproxy does not contain a pointer to the user_namespace. However, the other five namespace structures each contain a pointer named user_ns that points to the user namespace that owns them. This forms an inverted dependency.

The User Namespace itself hangs under the process's credential structure, cred, which represents the security context of a process. Each task_struct actually carries two cred pointers: cred, the subjective (effective) context used when the task acts on something, and real_cred, the objective context used when something acts on the task. That delves into the details of the security model, which we won't expand on here. You just need to know: the User Namespace is an exception; it's part of the security credentials, not part of nsproxy.

An nsproxy object is created by create_nsproxy() and released by free_nsproxy(). The nsproxy field in the Process Descriptor task_struct points to this proxy structure.

Let's take a look at the definition of nsproxy (include/linux/nsproxy.h):

struct nsproxy {
	atomic_t count;
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children; /* note: renamed after kernel 3.11 */
	struct net *net_ns;
};

Very intuitive. Aside from the missing user_ns, the other five are all there. The count member is an atomic reference counter: create_nsproxy() initializes it to 1, get_nsproxy() increments it, and put_nsproxy() decrements it.

It's like a "combo meal." When copying a process environment, the kernel only needs to copy one pointer to the combo and increment the combo's reference count, rather than keeping a separate tab for every dish inside. This greatly simplifies the logic on the fork() path.

By the way, the pid_ns member was renamed to pid_ns_for_children in kernel 3.11. The new name is more accurate: this pointer designates the PID namespace that the process's future children will be placed in, which is not necessarily the namespace the process itself lives in.

unshare(): Going Solo

With the data structures in place, we still need an interface to manipulate them. The first to take the stage is the unshare() system call.

The name is quite vivid—"I'm not sharing anymore." It allows the current process (or parts of it) to detach from the originally shared namespaces and set up its own shop.

The parameter to unshare() is a bitmask consisting of CLONE_* flags. When you pass a parameter containing the CLONE_NEW* flag, the kernel performs the following two steps:

  1. Create a new territory: It calls unshare_nsproxy_namespaces(), which in turn calls create_new_namespaces(). Based on the flags you specified (e.g., CLONE_NEWNET), it creates a new nsproxy object and mounts the corresponding new namespaces inside this new proxy. Because the parameter is a bitmask, you can "go solo" in several namespaces at once.
  2. Move in: It calls switch_task_namespaces(), pointing the current process's nsproxy pointer to the newly created object.

There is a very special exception here, and it's a major pitfall: CLONE_NEWPID.

If you pass CLONE_NEWPID, you'll find that the PID of the process calling unshare() doesn't change! The one that actually enters the new PID namespace is the first child process created afterwards.

This can be confusing, but once you understand the hierarchical relationship of PID namespaces (PID 1 must be the init process), you'll realize this is necessary—you can't let an already-running process suddenly become PID 1, as that would disrupt the entire process tree management logic. Aside from the PID namespace, unshare() for the other five namespaces takes effect immediately—meaning the caller themselves enters the new space right away.

The implementation of unshare() is in kernel/fork.c.

setns(): I Want to Go Next Door

If unshare() is "building a new house," then setns() is "moving into an old house."

It allows a thread to join an already existing namespace. This is extremely useful when managing containers—for example, if you have a debugging tool that needs to enter a running container's network namespace to capture packets.

Its prototype is: int setns(int fd, int nstype);

There are two parameters:

  • fd: A file descriptor. You might ask, how can a namespace have a file descriptor? Remember the /proc filesystem? The kernel maps the namespaces associated with each process as symbolic links under the /proc/<pid>/ns/ directory. Opening these links gives you a file descriptor representing that namespace.
  • nstype: An optional validation parameter. If you pass 0, the kernel doesn't care what type of namespace fd is and lets you right in. But if you pass CLONE_NEWNET, the kernel will check whether fd corresponds to a network namespace. If the types don't match, it directly returns -EINVAL. This is a great safety lock to prevent you from walking through the wrong door.

The implementation of setns() is in kernel/nsproxy.c.

The Six Factions: CLONE_NEW* Flags

To support these six namespaces, the kernel added six bits to the flags of the clone() system call:

  • CLONE_NEWNS: Mount Namespace. Note that although the name is NEWNS, it was the first to be implemented (2.4.19), and was later adapted to the unified namespace framework.
  • CLONE_NEWUTS: UTS Namespace (hostname and domain name).
  • CLONE_NEWIPC: IPC Namespace (message queues, semaphores).
  • CLONE_NEWPID: PID Namespace (process IDs).
  • CLONE_NEWNET: Network Namespace (network stack).
  • CLONE_NEWUSER: User Namespace (user and group IDs).

clone() was originally used to create processes, but it has been extended: if these flags are included, it creates not just a new process, but a new process with new namespaces.

We'll be using CLONE_NEWNET frequently later in this chapter, so keep this name in mind.

Staking Their Own Territory: Independent Implementations of Subsystems

While the top-level logic resides in the nsproxy and fork code, each subsystem (networking, mount, IPC, etc.) has its own set of namespace implementation logic.

  • The Mount namespace is represented by the mnt_namespace structure.
  • The Network namespace is represented by the net structure (we'll dive into this later).

To create these specific objects, the kernel provides a generic factory method: create_new_namespaces() (located in kernel/nsproxy.c).

It treats the CLONE_NEW* bits in flags as a menu and cooks to order:

static struct nsproxy *create_new_namespaces(unsigned long flags,
	struct task_struct *tsk, struct user_namespace *user_ns,
	struct fs_struct *new_fs)
{
	struct nsproxy *new_nsp;
	int err;

	/* Step 1: build the "combo box" (nsproxy); refcount starts at 1 */
	new_nsp = create_nsproxy();
	if (!new_nsp)
		return ERR_PTR(-ENOMEM);
	...

Once we have the empty box, we start filling it with dishes.

First is the UTS namespace (copy_utsname()):

	new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
	if (IS_ERR(new_nsp->uts_ns)) {
		err = PTR_ERR(new_nsp->uts_ns);
		goto out_uts;
	}
	...

Here's a detail: if CLONE_NEWUTS isn't set in flags, copy_utsname() won't create a new one at all; it simply returns the parent process's uts_ns pointer. This is called "sharing." If the flag is set, it calls clone_uts_ns() to allocate new memory and copies the parent's hostname over.

Next come IPC (copy_ipcs()), PID (copy_pid_ns()), and Network (copy_net_ns()).

For the Network namespace's copy_net_ns(), the logic is exactly the same:

  • CLONE_NEWNET not set? Directly return the parent process's net_ns.
  • Set? Call net_alloc() to allocate a new net structure, initialize it with setup_net(), and finally hang it on the global net_namespace_list list.

	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
	if (IS_ERR(new_nsp->net_ns)) {
		err = PTR_ERR(new_nsp->net_ns);
		goto out_net;
	}
	return new_nsp;
}

It's worth noting that setns() also calls create_new_namespaces() in its implementation, but the first argument it passes is 0. This means it only creates an empty nsproxy box without creating any new namespaces. Afterwards, it finds the already existing namespace based on the passed fd, hangs it in this box, and gives the box to the current thread.

This is why setns() is called "join" rather than "create."

Exit and Destruction: exit_task_namespaces()

All processes must eventually die. When a process exits via do_exit(), the kernel calls exit_task_namespaces() (kernel/nsproxy.c).

This function is extremely simple; it just calls switch_task_namespaces() and sets the process's nsproxy pointer to NULL.

void exit_task_namespaces(struct task_struct *p)
{
	switch_task_namespaces(p, NULL);
}

switch_task_namespaces() decrements the reference count of the old nsproxy (put_nsproxy()). If the count reaches zero, it means no other processes are using it, and this memory is freed. Those specific namespace objects (like net_ns) will also be destroyed along with it, provided their own reference counts have also reached zero.

Finding Namespaces: By PID or FD?

Sometimes in kernel code, we only have a PID and want to know which network namespace this process belongs to. That's where get_net_ns_by_pid() comes in handy. It finds the task_struct via the PID, then traces through nsproxy to reach the net_ns.

Other times, we're holding a file descriptor (fd), which was passed in from user space after being opened from /proc. get_net_ns_by_fd() is responsible for finding the corresponding inode via the fd, and then locating the associated net_namespace.

These two functions are the key bridges connecting user space operations to kernel objects.

/proc/<pid>/ns: Visible Namespaces

To allow user space to manipulate these invisible structures, the kernel exposes six symbolic links under the /proc/<pid>/ns/ directory.

This isn't just for human viewing; more importantly, it's for holding references. As long as this file (or a bind mount pointing to it) remains open, the underlying namespace will not be destroyed, even if there are no processes left inside it. This is crucial for certain operations tasks.

You can use ls -al or readlink to view these links. The format they point to is type:[inode].

ls -al /proc/1/ns/
total 0
dr-x--x--x 2 root root 0 Nov 3 13:32 .
dr-xr-xr-x 8 root root 0 Nov 3 12:17 ..
lrwxrwxrwx 1 root root 0 Nov 3 13:32 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Nov 3 13:32 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Nov 3 13:32 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Nov 3 13:32 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Nov 3 13:32 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Nov 3 13:32 uts -> uts:[4026531838]

Notice the number in the square brackets—that's the unique inode number in the proc filesystem. Each namespace is assigned a unique ID by proc_alloc_inum() when created, which is recycled upon destruction. With this number, you can confirm whether two processes are in the same "room."

To make these proc files work properly, each namespace defines a proc_ns_operations structure. It's full of callback functions: inum is used to return that unique inode number, and install is used to perform the specific mount operation during setns().

  • utsns_operations (kernel/utsname.c)
  • ipcns_operations (ipc/namespace.c)
  • mntns_operations (fs/namespace.c)
  • pidns_operations (kernel/pid_namespace.c)
  • userns_operations (kernel/user_namespace.c)
  • netns_operations (net/core/net_namespace.c)

Initial Namespaces: The Origin of All Things

When the system boots, it's not a blank slate. The kernel predefines a set of "initial namespaces"—the namespaces from God's perspective.

  • init_uts_ns (init/version.c)
  • init_ipc_ns (ipc/msgutil.c)
  • init_pid_ns (kernel/pid.c)
  • init_net (net/core/net_namespace.c)
  • init_user_ns (kernel/user.c)

In addition, there is an initial proxy object, init_nsproxy. This is the "combo box" held by the ancestor of all processes—the init process.

struct nsproxy init_nsproxy = {
	.count			= ATOMIC_INIT(1),
	.uts_ns			= &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
	.ipc_ns			= &init_ipc_ns,
#endif
	.mnt_ns			= NULL, /* empty at first; rootfs is mounted later */
	.pid_ns_for_children	= &init_pid_ns,
#ifdef CONFIG_NET
	.net_ns			= &init_net,
#endif
};

There is one exception here: mnt_ns is initially NULL. The mounting of the rootfs happens at a slightly later stage of the kernel boot process.

A Detailed Look at the Six Factions

Finally, let's quickly run through the resumes of these six heavyweights:

  1. Mount Namespaces (CLONE_NEWNS)

    • Purpose: Isolate filesystem mount points. If you mount a disk inside a container, the host can't see it, and vice versa.
    • History: The first to be implemented (2.4.19), a true veteran.
    • Rules: A newly created namespace inherits a copy of the parent namespace's mount view. Subsequent mount operations are isolated from each other.
    • Advanced Features: Introduced the Shared Subtrees mechanism, which comes with a bunch of complex propagation flags (MS_SHARED, MS_PRIVATE, etc.). These solve cascading reaction problems like "if a parent directory is mounted, should the child directory see it?" The main implementation is in fs/namespace.c.
  2. PID Namespaces (CLONE_NEWPID)

    • Purpose: Isolate process IDs. Most importantly, the first process in every new PID namespace is PID 1.
    • Importance: It's the cornerstone of Container technology. With it, a container can have its own init process, which is responsible for reaping orphan processes.
    • Pitfall: As mentioned earlier, unshare(CLONE_NEWPID) doesn't change the current process's PID; it only affects child processes. Furthermore, PID 1 cannot be killed (SIGKILL is ineffective) unless it's killed from the parent namespace—which will wipe out the entire child namespace. The main implementation is in kernel/pid_namespace.c.
  3. Network Namespaces (CLONE_NEWNET)

    • Purpose: Isolate the network stack. This includes network devices (lo, eth0), routing tables, iptables rules, socket states, etc.
    • Structure: The core is struct net.
    • Degree of Isolation: Very thorough. A newly created netns only has a lo network interface by default; you have to create a veth pair yourself to connect it to the outside. This is the focus of the rest of this chapter. The implementation is in net/core/net_namespace.c.
  4. IPC Namespaces (CLONE_NEWIPC)

    • Purpose: Isolate System V IPC and POSIX message queues.
    • Implication: If you create a message queue inside a container, you can't read it from the host using msgctl.
    • Support: System V IPC was supported earlier (2.6.19), with POSIX message queues following later (2.6.30). The implementation is in ipc/namespace.c.
  5. UTS Namespaces (CLONE_NEWUTS)

    • Purpose: Isolate the hostname and domain name.
    • Origin: UTS stands for UNIX Time-sharing System; the data structure behind the uname() system call is called utsname.
    • Review: The simplest namespace, bar none. You can test it just by changing the hostname. The implementation is in kernel/utsname.c.
  6. User Namespaces (CLONE_NEWUSER)

    • Purpose: Isolate user and group ID mappings. This means you can be root (UID 0) inside a container, but in the host's eyes, you're just a regular user (e.g., UID 1000).
    • Complexity: The most complex namespace. Because it involves changes to the global security model.
    • Features: Allows Capabilities to have different definitions in different namespaces. The implementation is in kernel/user_namespace.c.

User Space Toolchain

Kernel support alone isn't enough; we also need handy tools. This wave of changes drove updates in four major software packages:

  • util-linux:
    • unshare command: Directly invokes the system call, letting you enter a new space right from the shell.
    • nsenter command: Essentially a command-line wrapper for setns(), specifically designed to "slip into" a container for debugging.
  • iproute2:
    • ip netns command: The king of commands for managing network namespaces. It will be in the spotlight for the rest of this chapter.
    • ip link command: Allows you to "move" a physical network interface into another network namespace.
  • ethtool:
    • Supports querying the NETIF_F_NETNS_LOCAL flag. If this flag is set, it means the network interface is a "local specialty" and cannot be moved (like certain special hardware interfaces).
  • iw (wireless):
    • Supports moving wireless interfaces.

The System Call Family Portrait

Finally, let's summarize these three system calls together:

System Call   Action                                                 Typical Use Case
clone()       Create a new process + (optional) new namespaces       Container startup
unshare()     Detach the current process into (partial) new spaces   Isolate the current session
setns()       Join an existing namespace                             Container debugging (nsenter)

Going Deeper

Note that namespaces in the kernel do not have "names."

You might ask, doesn't ip netns add my_net give namespaces a name? No, that's just a bind mount that the iproute2 package hangs under /var/run/netns/ for you. The kernel itself only knows about inode numbers, not names.

This is important. If the kernel had to maintain a global string table for names, it would introduce a whole series of complex concurrency issues like lock contention and deadlocks, and it would make checkpoint/restore (process migration) very troublesome. The current design, relying solely on file descriptors and inode numbers, is both simple and robust.

Before we dive into the most complex Network namespace, let's take a quick look at how the simplest UTS namespace is implemented. It may be small, but it has all the vital organs; once you understand UTS, the others are easy to grasp by analogy.