14.5 Cgroups: When Isolation Meets Resource Contention

Namespaces solve one problem: out of sight, out of mind. In the previous section, we used various Namespaces to lock processes into logical "isolated compartments." They can't see the host's processes, they can't see anyone else's network stack, and they even believe they are PID 1.

But this isolation has a fatal flaw: physical resources are still shared. If a process inside a container goes rogue and starts hogging the CPU or devouring memory, the host will still grind to a halt, and the OOM Killer will still come out hunting. It doesn't care whether you're in a container or not; it only sees that the machine is running out of memory.

This is why we need Cgroups (Control Groups).

If Namespaces are the walls of a compartment, then Cgroups are the electricity meters and current limiters on those walls. They not only let you see who is using power, but also let you pull the plug directly when someone exceeds their quota.

In this section, we will tear down the most complex yet most powerful resource management framework in the Linux kernel. Don't expect it to be as elegant as Namespaces, because "limiting resources" is inherently a grueling task full of compromises and nitty-gritty rules.


5.1 Cgroups: Origins and Design Philosophy

The story of Cgroups began in 2006, initiated by Google engineers Paul Menage and Rohit Seth. Initially, its name was very straightforward: Process Containers. Later, because the term "container" was being used too loosely, it was renamed Control Groups to avoid confusion.

Since entering the mainline kernel in version 2.6.24, it has become the cornerstone of modern Linux infrastructure. Whether it's systemd, which replaced SysV init, the Linux Containers (LXC) we mentioned earlier, Google's internal lmctfy, or even libvirt—they are all dancing on the back of this Cgroups elephant.

5.1.1 Everything is a File

The design of Cgroups follows a very distinct Unix philosophy: since I already know how to operate a filesystem, why do I need new system calls?

Unlike Namespaces, which introduced a bunch of clone() flags, Cgroups did not introduce any new system calls. It implemented a brand-new Virtual File System (VFS) type, simply named cgroup.

This means that all your control over Cgroups—creating groups, limiting resources, and collecting usage statistics—is done entirely through filesystem operations like mkdir, echo (writing files), and cat (reading files).
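To make that concrete, here is a minimal sketch of the workflow, assuming a cgroup v1 hierarchy already mounted under /sys/fs/cgroup and a group name ("demo") chosen purely for illustration:

# Create a group: just a mkdir under the mounted hierarchy
mkdir /sys/fs/cgroup/memory/demo
# Limit a resource: write a value into a control file
echo 64M > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
# Attach a process: write its PID into the tasks file
echo $$ > /sys/fs/cgroup/memory/demo/tasks
# Read back usage statistics
cat /sys/fs/cgroup/memory/demo/memory.usage_in_bytes

No special tools, no syscalls: mkdir, echo, and cat are the whole API.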

The kernel defines this filesystem type:

static struct file_system_type cgroup_fs_type = {
    .name    = "cgroup",
    .mount   = cgroup_mount,
    .kill_sb = cgroup_kill_sb,
};

It's as if you took the console for resource management and turned it directly into a mountable disk. This design is incredibly clever, but it also brings a side effect: its interface is too primitive.

5.1.2 The Coordination Dilemma

Early Cgroups provided a library called libcgroup (libcg), which included tools like cgcreate and cgdelete. Essentially, they just wrapped those file read/write operations for you.

But there was a huge pitfall: each controller is a singleton.

The entire kernel has only one CPU controller for a resource like "CPU." If systemd wants to manage the CPU, and libvirt also wants to manage the CPU, and they both directly manipulate the underlying cgroup filesystem, it ultimately becomes a question of "who overwrites whom." It's like two people fighting to write to the same configuration file, and in the end, nobody knows which version is actually in effect.

This is why the current trend is: don't touch the files directly; let systemd or other high-level daemons manage them. You declare your requirements to the daemon, and it handles the coordination to avoid conflicts.
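Under systemd, for example, you would not echo into memory.limit_in_bytes yourself; you declare the requirement as a unit property and let systemd do the cgroup writes. A hedged sketch (the service name and command are made up; MemoryLimit= and CPUShares= are the cgroup-v1-era property names, on a v2 setup you would use MemoryMax= and CPUWeight=):

# Run a command in a transient scope with a memory cap managed by systemd
systemd-run --scope -p MemoryLimit=200M ./my_batch_job
# Or adjust an existing service without touching /sys/fs/cgroup directly
systemctl set-property my_service.service CPUShares=512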


5.2 Deep into the Kernel: How Cgroups Actually Run

The kernel implementation of Cgroups is insanely complex, but we can break it down into a few key building blocks.

5.2.1 Core Data Structures

First, the kernel introduces a structure called cgroup_subsys. It represents a subsystem (or controller).

Want to manage the CPU? There's cpu_cgroup_subsys. Want to manage memory? There's mem_cgroup_subsys. Want to manage device permissions? There's devices_subsys.

Each subsystem is independent, but they all hang under the same cgroup framework. Here is a partial list of controllers so you can get a feel for its coverage:

  • mem_cgroup_subsys (mm/memcontrol.c): Memory limits, OOM Killer control.
  • cpuset_subsys (kernel/cpuset.c): Binds CPUs and NUMA nodes.
  • devices_subsys (security/device_cgroup.c): Device node (/dev) read/write permission control.
  • freezer_subsys (kernel/cgroup_freezer.c): Freezes/unfreezes processes.
  • net_prio_subsys (net/core/netprio_cgroup.c): Network traffic priority.
  • blkio_subsys (block/blk-cgroup.c): Block device I/O limits.

Besides subsystems, there is another core structure called cgroup. It represents a specific control group—that is, the directory you created using mkdir.

Inside each process (task_struct), there is an added pointer called cgroups, which points to a css_set object. This object holds a bunch of pointers, each pointing to the state of the various subsystems associated with that process.

It's like the process is holding a keychain, where each key corresponds to a resource controller.
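You can inspect this keychain from user space: each line of /proc/<pid>/cgroup names one hierarchy the process is attached to, in the form hierarchy-id:controllers:path. Sample output from a cgroup v1 machine (your hierarchy numbers and paths will differ):

cat /proc/self/cgroup
# 8:memory:/user.slice
# 4:devices:/user.slice
# 2:cpu,cpuacct:/user.slice
# 1:name=systemd:/user.slice/user-1000.slice/session-2.scope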

5.2.2 Mounting and Initialization

When the Cgroups subsystem initializes (cgroup_init), it not only registers the filesystem but also creates a default entry under /sys/fs:

kobject_create_and_add("cgroup", fs_kobj); // creates /sys/fs/cgroup

This usually happens automatically during system boot. Of course, you can mount the cgroup filesystem elsewhere; we'll cover how to do that later.

The kernel has a global array subsys (renamed to cgroup_subsys after kernel 3.11) that contains all registered controllers. Want to see which controllers are enabled on your system?

cat /proc/cgroups

That's much faster than checking the documentation.
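The output is a simple table; on a typical v1 system it looks roughly like this (the numbers are illustrative):

cat /proc/cgroups
# #subsys_name    hierarchy    num_cgroups    enabled
# cpuset          3            1              1
# cpu             2            64             1
# memory          8            105            1
# devices         4            97             1
# freezer         5            1              1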

5.2.3 What Happens When You mkdir

When you execute mkdir my_group under a cgroup mount point, the kernel doesn't just create a directory. The VFS layer intercepts this operation and automatically generates four standard control files in that directory. Regardless of which controller you are using, these four files are standard issue:

  1. notify_on_release: A boolean switch. If set to 1, when the last process in this cgroup exits (is released), the kernel will execute the release_agent script.
  2. cgroup.event_control: Used in conjunction with the eventfd() system call. It allows you to write user-space programs to monitor cgroup events (such as exceeding a threshold) instead of foolishly polling.
  3. tasks: This is the most commonly used file. It records the list of PIDs belonging to this group. Echoing a PID into it grabs that process and puts it into this group.
    • Handled by cgroup_attach_task() in the code.
    • Conversely, to see which group a process belongs to, you can cat /proc/<pid>/cgroup.
  4. cgroup.procs: Similar to tasks, but it operates on the Thread Group ID (TGID). Its granularity is at the process level (all threads of a process must be together), whereas tasks allows throwing different threads of the same process into different cgroups (although this is rare).

Besides these four, there is a special file that only exists in the top-level cgroup (root), called release_agent.

Its value is a path to an executable file. When a child cgroup enables notify_on_release and its last process exits, the kernel will call call_usermodehelper() to execute this script. Note: This mechanism is expensive because it spawns a new process every time. So while systemd still uses it, everyone generally tries to avoid triggering it frequently.
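A minimal sketch of wiring this up by hand, assuming a memory hierarchy mounted at /sys/fs/cgroup/memory and a hypothetical cleanup script at /usr/local/bin/cg_cleanup.sh (the kernel passes the relative path of the emptied cgroup as the script's first argument):

# Register the agent at the ROOT of the hierarchy
echo /usr/local/bin/cg_cleanup.sh > /sys/fs/cgroup/memory/release_agent
# Ask for notification on a child group
mkdir /sys/fs/cgroup/memory/batch
echo 1 > /sys/fs/cgroup/memory/batch/notify_on_release
# When the last task in "batch" exits, the kernel runs roughly:
#   /usr/local/bin/cg_cleanup.sh /batch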

5.2.4 How Controllers Add Their Own Parameters

Each controller (like memory) not only uses the four generic files above but also needs its own specific parameters, such as memory.limit_in_bytes.

This is achieved by defining a cftype structure array. Each controller points its base_cftypes to this array.

Taking the memory controller as an example (mm/memcontrol.c):

static struct cftype mem_cgroup_files[] = {
    {
        .name = "usage_in_bytes",
        .read = mem_cgroup_read,
        ...
    },
    ...
};

struct cgroup_subsys mem_cgroup_subsys = {
    .name = "memory",
    ...
    .base_cftypes = mem_cgroup_files,
};

This way, when you mount the memory controller and create a subdirectory, the kernel will automatically generate files like memory.usage_in_bytes and memory.limit_in_bytes in that directory. Reading these files calls the corresponding .read function, and writing calls the .write function.
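You can watch this happen: create a group under a mounted memory hierarchy and list it, and the cftype-defined files show up right next to the four generic ones. A sketch (the exact file list depends on kernel version and config):

mkdir /sys/fs/cgroup/memory/demo
ls /sys/fs/cgroup/memory/demo
# cgroup.event_control   memory.failcnt           memory.oom_control
# cgroup.procs           memory.limit_in_bytes    memory.usage_in_bytes
# notify_on_release      memory.max_usage_in_bytes
# tasks                  (plus more memory.* files)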


5.3 Hands-on: Working with Cgroups

Just looking at the structures can make you dizzy, so let's get our hands dirty. You'll find that it really is just a bunch of file operations.

5.3.1 The Devices Controller: Stopping Processes from Touching Hardware

Suppose you want to forbid a certain process from accessing /dev/null (this is a weird example, but it illustrates the point well).

First, you need to mount the devices controller (if systemd hasn't already done it for you):

mkdir -p /sys/fs/cgroup/devices
mount -t cgroup -o devices devices /sys/fs/cgroup/devices

Then create a new group:

mkdir /sys/fs/cgroup/devices/0

Enter this directory, and you'll find three new files:

  • devices.deny: Blacklist.
  • devices.allow: Whitelist.
  • devices.list: Current rules.

Check the default rules; they are usually fully open:

cat /sys/fs/cgroup/devices/0/devices.list
# Output: a *:* rwm
# a = all devices, *:* = any major/minor number, rwm = read/write/mknod

Add the current shell process to this group:

echo $$ > /sys/fs/cgroup/devices/0/tasks

Now, we want to forbid everything:

echo a > /sys/fs/cgroup/devices/0/devices.deny

Try writing to /dev/null:

echo "test" > /dev/null
# -bash: /dev/null: Operation not permitted

Boom. Now even /dev/null is untouchable. Add it back:

echo a > /sys/fs/cgroup/devices/0/devices.allow
echo "test" > /dev/null
# succeeds

This is the foundation of container security: you can let processes inside a container believe they have root privileges, but at the Cgroup level, you completely cut off device access permissions. Even if they want to modify /dev/sda, they can't.
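In practice you rarely stop at the all-or-nothing "a" rule; the usual pattern is to deny everything and then whitelist specific device nodes back. For example, to re-allow only /dev/null (a character device, major 1, minor 3), a sketch continuing with the same group:

# Start from a clean blacklist
echo a > /sys/fs/cgroup/devices/0/devices.deny
# Allow read/write/mknod on /dev/null only (char device 1:3)
echo "c 1:3 rwm" > /sys/fs/cgroup/devices/0/devices.allow
cat /sys/fs/cgroup/devices/0/devices.list
# c 1:3 rwm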

5.3.2 The Memory Controller: Taming the OOM Killer

The memory controller has two classic use cases: limiting memory usage, or disabling the OOM Killer.

Create a group and move the current shell into it:

mkdir /sys/fs/cgroup/memory/0
echo $$ > /sys/fs/cgroup/memory/0/tasks

Scenario 1: Disable the OOM Killer

Sometimes you don't want the kernel to kill processes; you'd rather let them freeze or wait. You can turn off OOM killing for this group:

echo 1 > /sys/fs/cgroup/memory/0/memory.oom_control

Scenario 2: Hard Memory Limit

For example, allocate only 20MB of memory to this group:

echo 20M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

Once the processes inside allocate beyond this number (the accounting covers RSS plus page cache, not just heap allocations), the kernel first tries to reclaim pages charged to the group; if that fails, it triggers the group-local OOM Killer, or, with oom_kill_disable set as above, simply pauses the allocating task. This is extremely effective for preventing a runaway buggy program from devouring the host's memory.
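To watch the limit bite, allocate past it from inside the group. A sketch using a python3 one-liner (any memory-hungry command works; this assumes python3 is installed and that the group's OOM killer has not been disabled as in Scenario 1):

# From the shell sitting in .../memory/0 (limit: 20M)
python3 -c 'x = bytearray(100 * 1024 * 1024)'
# Killed            <- typically, the group-local OOM Killer strikes
cat /sys/fs/cgroup/memory/0/memory.failcnt
# a non-zero count of how many times the limit was hit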


5.4 Network Specials: Net_prio and Cls_cgroup

Since we are in the networking section, we must look at the two network-related controllers in Cgroups. They prove just how extensible Cgroups is.

5.4.1 net_prio: Tagging Traffic with Priority Labels

Background: Usually, we use the SO_PRIORITY socket option to set packet priority (like letting VoIP traffic take the fast lane). But this requires modifying the application code.

net_prio's solution: At the kernel network device layer, it attaches a table to each network device (net_device), called priomap.

The structure definition is as follows:

struct netprio_map {
    struct rcu_head rcu;
    u32 priomap_len;
    u32 priomap[]; // an array indexed by the cgroup's netprio index
};

When a process sends a packet, dev_queue_xmit() looks up this table based on the process's cgroup ID and fills the retrieved priority into skb->priority.

Hands-on:

Suppose you want processes running in group 0 to have a priority of 4 when going out through eth1.

mkdir /sys/fs/cgroup/net_prio
mount -t cgroup -o net_prio none /sys/fs/cgroup/net_prio
mkdir /sys/fs/cgroup/net_prio/0
# Format: <interface_name> <priority>
echo "eth1 4" > /sys/fs/cgroup/net_prio/0/net_prio.ifpriomap

As long as you echo the process's PID into tasks, the packets it sends will automatically be tagged with priority 4 when passing through eth1. No application code changes needed.
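You can read the mapping back at any time; the controller also exposes the group's internal index through the read-only net_prio.prioidx file (the numbers below are illustrative):

cat /sys/fs/cgroup/net_prio/0/net_prio.ifpriomap
# lo 0
# eth0 0
# eth1 4
cat /sys/fs/cgroup/net_prio/0/net_prio.prioidx
# 2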

5.4.2 cls_cgroup: Tagging Traffic for Traffic Control

net_prio modifies the priority at the QoS layer, while cls_cgroup is used in conjunction with the tc (Traffic Control) tool.

It allows you to assign a classid to a cgroup (like 1:10), and then match this ID in tc rules to perform bandwidth limiting or shaping.

Hands-on:

  1. Create and mount the controller:

    mkdir /sys/fs/cgroup/net_cls
    mount -t cgroup -o net_cls none /sys/fs/cgroup/net_cls
    mkdir /sys/fs/cgroup/net_cls/0
  2. Set the classid (0x100001 represents 10:1):

    echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
  3. Use the tc command to configure HTB (Hierarchical Token Bucket) rules:

    # Create the root qdisc
    tc qdisc add dev eth0 root handle 10: htb
    # Create a class with a rate of 40mbit
    tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
    # Add a filter: any packet tagged with the cgroup classid (10:1 here) is steered to class 10:1
    tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup

With this, you've achieved traffic shaping based on application grouping (rather than IP ports).
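To confirm the filter actually matched, put a process into the group, generate some outbound traffic, and check the per-class counters. A sketch reusing the group and interface from the example above (the byte/packet figures are illustrative, and the traffic must actually route out through eth0):

# Move this shell into the group, then send something
echo $$ > /sys/fs/cgroup/net_cls/0/tasks
curl -o /dev/null http://example.com/
# Check that packets landed in class 10:1 and are being rate-limited
tc -s class show dev eth0
# class htb 10:1 root prio 0 rate 40Mbit ...
#  Sent 1048576 bytes 712 pkt (dropped 0, overlimits 3 ...)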


5.5 Wrapping Up: The Nitty-Gritty of Mounting

Finally, a note on mount options. By default, everything is under /sys/fs/cgroup, but you can mount it elsewhere.

Command-line parameters:

  • -o all: Mount all controllers.
  • -o none: Mount an empty framework.
  • -o release_agent=/path/to/script: Specify a cleanup script.
  • -o noprefix: Remove filename prefixes. By default, cpuset's file is called cpuset.mems; with noprefix, it becomes mems. This is mainly used by some legacy tools for convenience.
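As a sketch, here are a couple of these options in action (mount points chosen arbitrarily; note that a controller can only belong to one hierarchy, so these mounts fail if systemd has already claimed the controller elsewhere):

# Mount only the cpuset controller, with noprefix, like the legacy cpuset filesystem did
mkdir -p /mnt/cpuset
mount -t cgroup -o cpuset,noprefix none /mnt/cpuset
ls /mnt/cpuset      # shows "cpus", "mems", ... instead of "cpuset.cpus", "cpuset.mems"
# Mount the memory controller with a release agent registered from the start
mkdir -p /mnt/mem
mount -t cgroup -o memory,release_agent=/usr/local/bin/cleanup.sh none /mnt/mem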

Important note: Cgroups and Namespaces are technically orthogonal. You can have Namespaces without Cgroups (isolation only, no limits). You can have Cgroups without Namespaces (resource limits only, no isolation). Historically, there was an attempt to create a ns cgroup to manage Namespaces, but the code was later deleted—because there's no need to forcefully tie them together.

Alright, now we have Namespaces for isolation and Cgroups for limiting. The foundational infrastructure for containers is actually complete. But there are two more network-related odds and ends in the kernel worth looking at: one is Busy Poll Sockets (an optimization for extremely low latency), and the other is PCI/Wake-on-LAN. While these don't directly belong to Cgroups, they are both advanced topics in modern kernel networking.

In the next chapter, we'll dig into both of them.