Chapter 11: CPU Scheduler (Part 2)
Introduction: The Illusion of Control
If you share a Linux server with nine other people, you might assume CPU time is fairly distributed—everyone gets an equal slice, no interference. But this is an illusion. The kernel's CFS scheduler tries its best, but "completely fair" only exists in a vacuum.
Reality is messier: if one person writes a program that frantically forks a bunch of child threads, each devouring CPU, that server effectively becomes their personal property, and for you, it might as well be frozen. The kernel needs a more powerful, more forceful mechanism to break this deadlock—it can't just "try to schedule fairly," it must be able to "hard-limit" usage.
That's the story we're telling in this chapter: from pinning a thread to a specific CPU core (affinity), to locking a group of threads into a resource management cage, to transforming an ordinary Linux system into a hard real-time system. We're moving from "letting the kernel decide" to "making the kernel obey."
This is our second journey into the topic of the Linux kernel CPU (or rather, task) scheduler. In the previous chapter (Chapter 10), we laid the groundwork: we figured out who the actual schedulable unit is (it's threads, not processes), saw through POSIX scheduling policies, and even used tools like perf to witness the scheduler in action. We also learned how to query a thread's scheduling policy and priority, and dug into the modular design inside the scheduler.
Now that we have these cards in hand, we can raise the stakes. In this chapter, we'll dive into the following topics:
- Understanding, querying, and setting CPU affinity masks
- Querying and setting thread scheduling policies and priorities
- Introduction to cgroups (Control Groups)
- Running Linux as an RTOS—A getting started guide
- Other miscellaneous scheduling-related topics
This chapter builds directly on the previous one, so we strongly recommend that you finish Chapter 10 before turning this page.
Technical Prerequisites
We assume you've set up your kernel workspace (see the online chapters of this book for details) and have a virtual machine running Ubuntu 22.04 LTS (or newer, or the latest Fedora) with all the necessary packages installed. If you haven't done this yet, now is a good time.
Additionally, for the best experience, we recommend cloning this book's companion GitHub repository and following along with the code.
The repository URL is here:
https://github.com/PacktPublishing/Linux-Kernel-Programming_2E
Understanding, Querying, and Setting CPU Affinity Masks
task_struct—the root data structure for a thread (or task), packed with dozens of member variables—has several attributes directly tied to scheduling: priority (both nice values and real-time priority), the scheduling class structure pointer, the run queue the thread is on (if any), and so on. (By the way, we covered the details of task_struct back in Chapter 6 when we discussed processes and threads).
Among these attributes, one member is particularly important: the CPU affinity bitmask (the actual structure member is cpumask_t *cpus_ptr. Interestingly, prior to kernel 5.3, it was called cpus_allowed; it was later renamed in this commit: https://github.com/torvalds/linux/commit/3bd3706251ee8ab67e69d9340ac2abdca217e733).
As the name suggests, this bitmask is a string of bits used to indicate which CPU cores this thread (the entity represented by task_struct) is allowed to run on. A diagram makes this most intuitive: suppose we have an 8-core system, a typical CPU affinity mask might look like this:
7 6 5 4 3 2 1 0 <- CPU core number
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0 0 1 1 1 1 1 1 <- affinity bit
In the example above, each cell represents a CPU core. The first row is the core number, and the row below shows the corresponding bit value: if it's 0, the thread cannot run on that core; if it's 1, it can.
So, if the mask value is 0x3f (binary 0011 1111), it means this thread can be scheduled on CPUs 0 through 5, but will never appear on cores 6 and 7.
By default, all mask bits are set to 1. In other words, by default, a thread can run wherever it wants. This makes sense—for example, on a box where the OS sees 8 cores, every living thread has a default CPU affinity mask of binary 1111 1111 (which is hexadecimal 0xff).
Since this mask lives inside task_struct, it tells us: CPU affinity is per-thread. This is also easy to understand—after all, the schedulable entities in the Linux kernel are threads.
At runtime, the scheduler decides which core a thread ultimately lands on. If you think about it, this is implicit by design: each CPU core is associated with a run queue. Every runnable thread queues up in some CPU's run queue; thus it becomes eligible to run, and by default, it runs on the CPU represented by that queue.
Of course, the scheduler has a "load balancing" component that will migrate threads to other CPU cores (i.e., other run queues) when necessary (the kernel threads doing this work are called migration/n, where n is the core number).
The kernel exposes a set of APIs to user space (system calls, specifically sched_{s,g}etaffinity(2), along with their pthread wrapper library functions) that allow applications to "affinitize" threads (or multiple threads) to specific CPU cores on demand.
We can do the same thing from kernel space, setting the affinity for any given kernel thread. For example, if you set the CPU affinity mask to binary 1000 0001 (which is hexadecimal 0x81), it means this thread can only run on core 7 and core 0 (remember, core numbering starts at 0).
(Insert Figure 11.1 image here: Diagram of CPU affinity bitmask)
Although technically you can arbitrarily modify a thread's CPU affinity mask, we strongly advise against doing so carelessly. The kernel scheduler subsystem is well aware of the CPU topology (or domains) and can achieve the best system load balancing.
That being said, in certain specific scenarios, explicitly setting CPU affinity does have benefits:
- Reducing cache invalidation: Ensuring a thread always runs on the same core can significantly reduce cache data bouncing, which is crucial for performance. (We'll dive into CPU caches in Chapter 13 when discussing kernel synchronization).
- Eliminating migration overhead: This directly removes the cost of a thread migrating back and forth between different cores.
- Implementing CPU isolation: This is a strategy that exclusively dedicates a core to a specific thread by explicitly preventing other threads from running on it. This trick is commonly used in time-sensitive real-time systems.
The first two points usually only apply to extreme corner cases. The third point, CPU isolation, is typically a technique used in systems with extreme real-time requirements; while the cost is non-trivial, it's worth it in that context. (By the way, this used to be implemented via the isolcpus= kernel parameter; that's now deprecated, and everyone has switched to using the cpusets cgroup controller).
Now that we understand the theory, let's write a user-space C program to actually query and modify a thread's CPU affinity mask.
Querying and Setting a Thread's CPU Affinity Mask
Talk is cheap. We provide a small user-space C program to query and set the CPU affinity mask of a user-space process (actually a thread). Querying the mask relies on the sched_getaffinity() system call, and setting it relies on its counterpart—sched_setaffinity():
#define _GNU_SOURCE
#include <sched.h>
int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
This uses a dedicated data type called cpu_set_t, which represents that CPU affinity bitmask (the third parameter of the two functions above). There's a catch to it: although cpu_set_t itself is a fixed-size type, the set can also be allocated dynamically (via the CPU_ALLOC() family of macros), sized to however many CPU cores the system actually has.
This CPU mask (of type cpu_set_t) must be initialized to zero before use. The CPU_ZERO() macro does exactly this (there are a few similar helper macros; we recommend checking the man page for CPU_SET(3)).
The second parameter of both system calls above is the size of the CPU mask (we just use sizeof). The first parameter is the PID of the target process (or thread)—whose internals you want to query or set.
Looking at the code alone might not give you a feel for it, so let's run the example code (available on GitHub: ch11/cpu_affinity).
Here's what it looks like running on a native Linux machine with 12 cores:
(Insert Figure 11.2 image here: Demo program displaying the calling process's CPU affinity mask)
In this example, we ran the program without arguments. In this mode, it queries its own CPU affinity mask. We print out the bits of the mask: as you can clearly see in the figure above (Figure 11.2), the output is binary 1111 1111 1111 (equal to hexadecimal 0xfff). This shows that, by default, the process is eligible to run on all 12 cores of the system!
Internally, the program calls the popen() library API, which in turn calls the nproc utility to detect the number of cores. But note that nproc returns the number of cores available to the current process; this might be less than the actual number of cores (online and offline), though usually they are the same. There are several ways the number of available cores can change, the most "orthodox" being through the cgroup cpuset resource controller (we'll cover cgroups later in this chapter).
The core code for querying looks like this (source file is ch11/cpu_affinity/userspc_cpuaffinity.c):
static int query_cpu_affinity(pid_t pid)
{
cpu_set_t cpumask;
CPU_ZERO(&cpumask);
if (sched_getaffinity(pid, sizeof(cpu_set_t), &cpumask) < 0) {
perror("sched_getaffinity() failed");
return -1;
}
disp_cpumask(pid, &cpumask, numcores);
return 0;
}
disp_cpumask() is responsible for drawing the bitmask (we'll leave that for you to look at yourself).
If you pass some arguments to this program—the first argument being the PID of the process (or thread), and the second being the CPU bitmask (in hexadecimal format)—it will attempt to change that process's affinity mask to the value you passed.
Of course, to modify someone else's mask, you must own that process, or have root privileges (more precisely, you need the CAP_SYS_NICE capability).
A quick demo: in Figure 11.3, nproc tells us there are 12 cores. Then, we run our program to query and set the shell (bash) process's CPU affinity mask. Suppose on this 12-core laptop, bash's affinity mask starts out as 0xfff (binary 1111 1111 1111), which is normal; then, we change it to 0xdae (binary 1101 1010 1110), and query it again to verify:
(Insert Figure 11.3 image here: Demo program querying and setting bash's CPU affinity mask to 0xdae)
This gets interesting. First, the program correctly detected that the number of available cores is 12. Then it queried the (default) mask of the bash process (we passed its PID as the first argument), showing 0xfff, flawlessly.
Immediately after, because we passed a second argument—the mask value we wanted to set (0xdae)—the program did exactly that, setting bash's mask to 0xdae.
Now here's the catch: the terminal window we're currently in is this bash process. If you run nproc again now, you'll find it shows 8 instead of 12! This makes perfect sense: the bash process can now only see 8 CPU cores. (Because our program didn't restore the mask when it exited).
The relevant code for setting the CPU affinity mask is as follows:
// ch11/cpu_affinity/userspc_cpuaffinity.c
static int set_cpu_affinity(pid_t pid, unsigned long bitmask)
{
cpu_set_t cpumask;
int i;
printf("\nSetting CPU affinity mask for PID %d now...\n", pid);
CPU_ZERO(&cpumask);
/* Iterate over the given bitmask, setting CPU bits as required */
for (i=0; i<sizeof(unsigned long)*8; i++) {
/* printf("bit %d: %d\n", i, (bitmask >> i) & 1); */
if ((bitmask >> i) & 1)
CPU_SET(i, &cpumask);
}
if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpumask) < 0) {
perror("sched_setaffinity() failed");
return -1;
}
disp_cpumask(pid, &cpumask, numcores);
return 0;
}
In the code snippet above, you can see we first set up the bitmask of cpu_set_t (by looping through each bit; as you know, the expression (bitmask >> i) & 1 is used to test if bit i is 1), and then call the sched_setaffinity() system call to set the new mask for the specified pid.
(Insert image here: Tip icon)
⚠️ Note An important point, and quite deliberate: anyone can query a task's CPU affinity mask, but to set it you must own that task, or have root privileges (more precisely, the CAP_SYS_NICE capability).
Using the taskset Tool for CPU Affinity
Just as we used the convenient user-space tool chrt in the previous chapter to query (or set) a process's (or thread's) scheduling policy and priority, you can also use the taskset user-space tool to query or modify a given process's (or thread's) CPU affinity mask.
Here are two simple examples; note that these examples are run on an x86_64 Linux virtual machine with 6 cores:
- Querying the CPU affinity mask of systemd (PID 1):
$ taskset -p 1
pid 1's current affinity mask: 3f
$
Think about it: 0x3f in binary is 0011 1111, which means that the process/thread (systemd, in this case) can run on any of the 6 cores.
- Running the compiler under taskset's protection, ensuring GCC—and its child processes (assembler and linker)—only run on the first two cores.
The first argument to taskset is the CPU affinity mask (03 is binary 0011):
$ taskset 03 gcc userspc_cpuaffinity.c -o userspc_cpuaffinity -Wall
Done. For full usage details, check the man page for taskset(1). (By the way, as mentioned in the previous chapter, the schedtool(8) tool can also set a given thread's/process's CPU affinity bitmask on the fly).
Setting CPU Affinity Masks on Kernel Threads
As an interesting example, suppose we want to demonstrate a synchronization technique called "Per-CPU variables" (we will indeed learn about this in Chapter 13, and get hands-on in the "Per-CPU—A Kernel Module Example" section). We need to create two kernel threads (kthreads) and ensure they run on different CPU cores.
To do this, we must explicitly set the CPU affinity masks of these two kernel threads to be different and non-overlapping (for simplicity, we'll set only bit 0 in the first kthread's mask (core 0 only), and only bit 1 in the second's (core 1 only), guaranteeing they run on cores 0 and 1 respectively).
But there's a catch... we'll get into that in the next section.
Working Around Unexported Symbol Availability
The problem is that setting CPU affinity from within a module right now is, frankly, a hack. We'll show it here, but we absolutely do not recommend it for production environments.
The reason is that the kernel API we need to set the CPU affinity mask—sched_setaffinity()—exists, but it is unexported. As we learned in the earlier chapters on writing modules, an out-of-tree module (like ours) can only call exported functions (and data). So what do we do?
For many years (I did this in the first edition of this book too!), the "standard" approach used by module developers was to call an existing convenience routine, kallsyms_lookup_name(), to look up any given symbol in the kernel and get its (kernel virtual) address.
Once you have the address, any decent C programmer can use it as a function pointer and call it however they want. This amounts to "cracking" the restriction—the restriction of only being able to call exported functions! (It's a slick trick, but kernel veterans would probably frown upon seeing this.)
Indeed, but starting with kernel version 5.7, the community decided it was time to stop this (foolish) abuse and simply stopped exporting kallsyms_lookup_name() (and the similar kallsyms_on_each_symbol())! (The short commit ID is 0bd476e6c671; you can take a look).
So now what? Don't panic: given root privileges, we can always look up the address of any kernel symbol via the /proc/kallsyms pseudo-file (the root requirement exists for security reasons). Moreover, modern kernels usually have Kernel Address Space Layout Randomization (KASLR) enabled, meaning this value changes on every boot and can't be hardcoded (which is also good for security).
So, we wrote a small wrapper script to do this (the code is here: ch13/3_lockfree/percpu/run; yes, this code actually belongs to Chapter 13), and then pass the retrieved address (the address of sched_setaffinity() found via /proc/kallsyms) as a parameter to the module (ch13/3_lockfree/percpu/percpu_var.c).
Once the module receives the address, it uses it as a function pointer and can successfully call it. Phew!
The function signature of sched_setaffinity() looks like this:
long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
Here is a small piece of the key code—we use the passed-in (via a module parameter named func_ptr) sched_setaffinity() function pointer to set our desired CPU mask:
// ch13/3_lockfree/percpu/percpu_var.c
[ … ]
static unsigned long func_ptr;
module_param(func_ptr, ulong, 0);
unsigned long (*schedsa_ptr)(pid_t, const struct cpumask *);
[ … ]
// Set up the function pointer
schedsa_ptr = (unsigned long (*)(pid_t pid, const struct cpumask *))func_ptr;
[ … ]
/*
 * !HACK! sched_setaffinity() isn't exported, so we can't call it directly;
 * instead, we invoke it via this function pointer.
 */
ret = (*schedsa_ptr)(0, &mask); // 0 => set affinity of self (this thread)
[ … ]
Frankly, this method of "cracking" kernel addresses is unconventional and somewhat controversial, and rather crude from an engineering standpoint, but it does work—especially in demo and experimental scenarios. However, one thing must be kept in mind: this hack is built on unexported interfaces and is a "wild" approach. In serious production environments, or when your code needs long-term maintenance, please avoid relying on such fragile tricks. There's usually a reason the kernel didn't export a function, and bypassing these restrictions is like walking a tightrope—you might reach the end, but you do so at your own risk.
Alright, enough with this "hacker" stuff; let's get back on track. Now that you know how to query/modify a (kernel) thread's CPU affinity mask, let's move to the next logical step: how to programmatically query/modify a thread's scheduling policy and priority! We'll dive into the details in the next section.
Querying and Setting Thread Scheduling Policies and Priorities
In the "Thread Priorities" section of Chapter 10 (CPU Scheduler—Part 1), you learned how to query the scheduling policy and priority of any given thread using the chrt tool (we even demonstrated a simple Bash script to do this). We mentioned there that chrt internally calls the sched_getattr() system call.
Very similarly, setting the scheduling policy and priority can be done via the chrt tool (convenient for use in scripts, for example), or via the sched_setattr() system call in a (user-space) C application. Additionally, the kernel exposes other APIs: sched_{g,s}etscheduler() and their pthread library wrapper APIs pthread_{g,s}etschedparam() (since these are all user-space APIs, we'll leave the specific details and usage to you to explore in their man pages).
Setting Policies and Priorities from Within the Kernel—For Kernel Threads
As you know, the kernel itself is neither a process nor a thread. That said, the Linux kernel does support multithreading, and it does have threads, known as kernel threads (kthreads). Like their user-space counterparts, kernel threads can be created on demand (by the core kernel, device drivers, or kernel modules; the kernel exposes APIs for this).
They are schedulable entities (KSEs, after all!), and of course, each kernel thread has its own task_struct and kernel-mode stack; therefore, just like regular threads, they compete for CPU resources, and their scheduling policies and priorities can also be queried or set programmatically as needed.
(Insert image here: Linux Kernel Programming Part 2 free ebook ad)
(Insert image here: Link about kthread naming)
Which brings us to the point: in user space, the modern recommended system calls for querying and setting thread scheduling attributes are sched_getattr() and sched_setattr(), respectively. In earlier years, the sched_{g,s}etscheduler() pair was used.
The current sched_{g|s}etattr() system call takes a pointer to a struct sched_attr, which contains all the details you might need; check the man page for specifics (https://man7.org/linux/man-pages/man2/sched_setattr.2.html).
So, following the modern approach, one might assume we'd use the kernel implementations of these system calls to do similar work inside the kernel. Not so fast; the kernel community considered the old design—which allowed users (applications) and module developers to happily call these APIs, casually fill in a SCHED_FIFO policy, and conveniently supply a (real-time) priority they thought was reasonable—fundamentally flawed.
Why? Because it easily leads to disaster: for example, two or more SCHED_FIFO threads with the same priority, or using "randomly" chosen priority values—picked without any real thought. This directly messes up CPU scheduling, which in turn messes up resource management.
Therefore, starting with kernel 5.9, the community made the following changes (allow me to quote the commit directly, as it's the best way to convey the message); this is part of commit https://github.com/torvalds/linux/commit/7318d4cc14c8c8a5dde2b0b72ea50fd2545f0b7a:
...
Therefore exposing the priority field is pointless; the kernel is incapable of setting a sensible value, as it lacks the system knowledge required to do so.
Take sched_setscheduler() / sched_setattr() away from modules and replace it with:

- sched_set_fifo(p); to create a FIFO task (priority 50)
- sched_set_fifo_low(p); to create a task that is higher than NORMAL, which ends up being a FIFO task with priority 1.
- sched_set_normal(p, nice); to (re)set the task to normal.

This prevents the proliferation of randomly chosen, meaningless priorities, which don't serve any real purpose anyway.
The system administrator/integrator, the people who have insight into the actual system design and requirements (userspace), can set appropriate priorities when required...
...
Aha; so, dear module authors, we now have to use these three APIs when setting up FIFO threads in the kernel—sched_set_fifo(), sched_set_fifo_low(), and sched_set_normal().
As the commit above says, we trust administrators and/or user-space developers to write user programs and provide correct, meaningful real-time priority values; the kernel (or modules) shouldn't question these decisions—it just executes them (again, this is the "provide mechanism, not policy" design principle at work).
The first two APIs:

- Are wrappers around the kernel's sched_setscheduler_nocheck() function;
- Set the thread's scheduling policy to SCHED_FIFO;
- Set the thread's (real-time) priority to MAX_RT_PRIO/2 (i.e., 50) and 1, respectively.

And sched_set_normal():

- Is a wrapper around sched_setattr_nocheck();
- Sets the thread's scheduling policy to SCHED_NORMAL (same as SCHED_OTHER, meaning non-real-time, driven by the Completely Fair Scheduler (CFS));
- Sets the thread's nice value to the second parameter.
(Insert image here: Note about _nocheck)
Here, the *_nocheck() suffix means the kernel doesn't even bother checking whether the process context running these APIs has sufficient permissions; it just lets it through. (See the comment here: https://elixir.bootlin.com/linux/v6.1.25/source/kernel/sched/core.c#L7742).
Additionally, all three of these APIs are GPL-exported, meaning only modules released under the GNU GPL license can use them.
A Real-World Example—Threaded Interrupt Handlers
A classic use case where the kernel uses kernel threads—which is actually very common—is Threaded Interrupts (work queues are another example). In this case, the kernel must create a dedicated kernel thread and set it to the SCHED_FIFO (soft) real-time scheduling policy with a real-time priority of 50 (a middle-ground value) to properly handle the so-called threaded interrupt.
Let's look at the relevant code path: https://elixir.bootlin.com/linux/v6.1.25/source/kernel/irq/manage.c#L1448
static int
setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
{
struct task_struct *t;
if (!secondary) {
t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
} else {
t = kthread_create(irq_thread, new, "irq/%d-s-%s", irq, new->name);
}
[ … ]
The kthread_create() macro is responsible for creating the kernel thread. Its thread function, irq_thread() (passed as a parameter to the kthread_create() macro), then sets the scheduling policy and priority correctly as it begins execution. The code path is here: https://elixir.bootlin.com/linux/v6.1.25/source/kernel/irq/manage.c#L1286
/* Interrupt handler thread */
static int irq_thread(void *data)
{
struct callback_head on_exit_work;
struct irqaction *action = data;
[ ... ]
sched_set_fifo(current);
[ ... ]
See that! Notice the call to sched_set_fifo(); as we've seen, it sets this kernel thread (the caller, referenced by current) to use the SCHED_FIFO policy with a priority of 50. Done.
(Insert image here: Tip about why IRQ threads use FIFO 50)
Understanding scheduling isn't enough—what if one of those ten people launches a fork bomb? Or what if you just want to gracefully limit the resource usage of a compilation task? That's when we need to bring out the heavy artillery—cgroups. We'll get to that in the next section.
Introduction to cgroups (Control Groups)
In the distant past, the kernel community wrestled with a rather tricky problem: although scheduling algorithms and their implementations—the early 2.6.0 O(1) scheduler, and the slightly later (2.6.23) Completely Fair Scheduler (CFS)—promised so-called "completely fair" scheduling, this wasn't truly "completely fair" in any real sense!
Think about it for a moment: suppose you and 9 other people are logged into a Linux server at the same time. All else being equal, CPU time would be distributed (more or less) fairly among the ten of you; of course, you also know that it's not "people" that actually run on the processor and consume memory, but the processes and threads running on your behalf.
Let's assume for now that it's still (roughly) fair. But suppose you—one of those ten people—wrote a user-space program that recklessly forks a bunch of new threads in a loop, each frantically consuming CPU (perhaps generously allocating memory on the side)! The CPU bandwidth allocation (even via CFS) would no longer be truly fair; your account would effectively monopolize the CPU (and perhaps other system resources, like memory and I/O)!
What was urgently needed was a general-purpose solution that could precisely and effectively manage CPU (and other resource) bandwidth, throttling it (preventing further use) once a specified limit was reached.
Many patch proposals were submitted at the time, discussed, and then tossed into the trash. Eventually, engineers from Google, IBM, and other companies stepped up and submitted a set of patches that brought the modern version of the control groups (cgroups) solution into the Linux kernel (this traces back to version 2.6.24, October 2007. The original conception and implementation were done by Google's Paul Menage and Rohit Seth in 2006).
Simply put, cgroups is a kernel feature that allows system administrators (or anyone with root privileges) to elegantly perform bandwidth allocation and fine-grained resource management for various system resources or controllers (as they are called in cgroup terminology).
Note: using cgroups isn't just for managing processors (CPU bandwidth); it can also manage memory and block I/O bandwidth (and much more), allowing you to partition, allocate, and monitor them with precision based on your project or product needs.
Returning to our earlier example—ten people on a Linux system—if all processes are placed in the same cgroup and the CPU controller is enabled for that cgroup, it will actually achieve fair CPU distribution when facing CPU contention!
Alternatively, as a system administrator, you can get fancier: you can slice the system into several cgroups—one for compiling projects (like a Yocto build), one for the web browser, one for virtual machines, and so on—and then fine-tune and allocate resources (CPU, memory, I/O) to each cgroup as needed!
In fact, almost all modern distributions do exactly this, thanks to the powerful systemd framework (we'll cover this in detail shortly); embedded Linux typically does this too, including Android.
Alright, now you're interested! How do you enable this cgroups feature? Simple—it's a kernel feature that you can enable (or disable) in the usual way when configuring the kernel: make menuconfig, and then enter General setup | Control Group support.
You can try grepping your kernel config file for CGROUP; if needed, modify the config, rebuild the kernel, reboot with the new kernel, and test it. (We covered kernel configuration and building in detail in Chapters 2 and 3).
(Insert image here: Tip that cgroups are enabled by default under systemd)
The good news is: on any modern Linux system running the systemd init framework, cgroups are enabled by default. As just mentioned, you can check which cgroup controllers are enabled by grepping the kernel config file, and modify them as needed; on desktop and server-grade systems, you usually don't need to touch this.
Since its debut in 2.6.24, cgroups has evolved just like every other kernel feature. Over time, the improvements became incompatible with the original design, leading to a new cgroup implementation and release called cgroups v2 (or simply cgroups2; the maintainer is Tejun Heo); it was declared production-ready in the 4.5 kernel series (the old implementation is now referred to as cgroups v1, or the legacy cgroup implementation).
Note that, as of this writing, both versions coexist, albeit with some limitations; many applications and frameworks still use the old cgroups v1 and haven't migrated to v2. But this is changing; even if it's not fully rolled out yet, cgroups2 will become the de facto standard, so you should plan to use it.
In our coverage for this chapter, we'll focus almost exclusively on using the modern version, cgroups v2. The best documentation is the official kernel documentation; here it is for version 6.1: https://www.kernel.org/doc/html/v6.1/admin-guide/cgroup-v2.html. (By the way, the documentation for the latest kernel version is always here: https://docs.kernel.org/admin-guide/cgroup-v2.html).
(Insert image here: Doc links about v1 vs v2)
cgroup Controllers
A cgroup controller is the underlying kernel component responsible for distributing a specific resource (such as CPU cycles, memory, and I/O bandwidth, etc.) within a cgroup hierarchy (a cgroup and its descendants). You can think of it as a kind of "resource limiter" for a given cgroup hierarchy.
The man page for cgroups(7) details the interfaces for the various available (resource) controllers (sometimes called subsystems). The controllers typically available under cgroups v2 are as follows (Table 11.1 shows the v2 stuff; the original v1 implementations of many controllers trace back to 2.6.24):
| Cgroups v2 Controller Name | What it Controls (or Limits or Regulates) | Kernel Version When Enabled |
|---|---|---|
| cpu | CPU bandwidth (cycles) | 4.15 |
| cpuset | CPU affinity and memory node placement (especially useful for large NUMA systems) | 5.0 |
| memory | Memory (RAM) usage | 4.5 |
| io | I/O resource allocation | 4.5 |
| pids | Hard limit on the number of processes in a cgroup | 4.5 |
| devices | Creation and access of device files (only via cgroup BPF programs) | 4.15 |
| rdma | Allocation and accounting of Remote Direct Memory Access (RDMA) resources | 4.11 |
| hugetlb | Limits HugeTLB (large page) usage per cgroup | 5.6 |
| misc | Various miscellaneous items; see documentation | 5.13 |
Table 11.1: Summary of cgroups v2 controllers available on a modern Linux system
We recommend that interested readers consult the official kernel documentation and man pages mentioned above for details; for example, the pids controller is very useful for preventing fork bombs, as it lets you limit how many processes a cgroup (and its descendants) can fork. (A fork bomb is a crude but deadly DoS attack that calls the fork() system call in an infinite loop!)
Next, a very important point: how are kernel cgroups exposed (or interfaced) to user space? Ah, it's the old Linux trick: control groups are exposed through a specially constructed synthetic or pseudo-filesystem! This is the cgroup filesystem, typically mounted at /sys/fs/cgroup.
In cgroups v2, the filesystem type is now called cgroup2 (you can simply run mount | grep cgroup to see it). There's a lot of good stuff to explore in there; we'll see more as we continue...
Let's start with this question: how do I know which controllers are enabled on my system (actually, in the kernel)? Simple:
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
Obviously, it displays a space-separated list of available controllers (I ran this command on an x86_64 Fedora 38 VM. Also note that using /proc/cgroups to peek at controllers is cgroups v1 compatible; don't expect it to work for v2). The exact controllers you see here depend on how the kernel is configured.
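If you need to check for a particular controller from a script, something along these lines works; has_controller is a hypothetical helper (the optional second argument, defaulting to the root cgroup's cgroup.controllers file, exists only to make it easy to test on any system):

```shell
# has_controller: hypothetical helper; succeeds if the named controller
# appears in the given controllers file (default: the root cgroup's).
has_controller() {
    ctrl=$1
    file=${2:-/sys/fs/cgroup/cgroup.controllers}
    [ -r "$file" ] && grep -qw -- "$ctrl" "$file"
}
# Typical usage on a cgroups v2 system:
#   has_controller pids && echo "pids controller is available"
```

The -w (whole-word) flag to grep matters: without it, querying for cpu would also (wrongly) match cpuset.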
In cgroups v2, all controllers are mounted in a single hierarchy (or tree). This differs from cgroups v1, which allowed multiple controllers to be hung under multiple hierarchies or groups.
The modern init framework systemd uses both v1 and v2 cgroups. In fact, it's systemd that automatically mounts the cgroups v2 filesystem at boot (right at /sys/fs/cgroup/).
Exploring the cgroups v2 Hierarchy
Looking under the mount point of the cgroups (v2) pseudo-filesystem (by default, /sys/fs/cgroup) and staring at all those pseudo-files (and folders), you might feel a bit overwhelmed (go ahead, take a peek at Figure 11.4). In this section, we'll explore some of the more interesting and useful nooks and crannies!
First, let's confirm where the cgroups v2 hierarchy is mounted:
$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
Clearly, as expected, under /sys/fs/cgroup. (Curious about what that bunch of mount options in the parentheses means? The documentation is here: https://www.kernel.org/doc/html/v6.1/admin-guide/cgroup-v2.html#mounting).
(Insert image here: Tip that older distros might not have controllers)
Alright, let's start exploring!
In this segment, I'm working on an x86_64 Fedora 38 virtual machine, and I've built and booted a custom 6.1.25 kernel. Let's first get the big picture:
(Insert Figure 11.4 image here: Root directory of the cgroups v2 hierarchy)
At the root cgroup location—/sys/fs/cgroup—you can see several files and folders (needless to say, these are volatile pseudo-file objects, living purely in RAM; note that although it's mounted under /sys, this is the cgroup2 filesystem, not sysfs).
First:
- The "regular" files you see (like cgroup.controllers, cpu.pressure, and so on) are cgroup2 interface files.
- These files are further divided into core interface files and controller interface files; all cgroup.* files are core interface files, cpu.* files belong to the CPU controller, memory.* files to the memory controller, and so on.
- The folders you see represent (finally!) control groups (cgroups)! Among these many folders, you'll find that not all of them are constrained (or rather, not all of them are "populated"). You might wonder who created them; the short answer (at least for the ones that exist by default) is systemd; we'll cover this in detail shortly.
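To make this classification concrete, here's a small sketch that walks a cgroup directory and labels each entry accordingly; classify_cgroup_entries is a hypothetical helper, and the directory argument is parameterized so you can point it anywhere:

```shell
# classify_cgroup_entries: hypothetical helper; labels each entry in a
# cgroup directory as a core interface file (cgroup.*), a controller
# interface file (cpu.*, memory.*, ...) or a child cgroup (a directory).
classify_cgroup_entries() {
    cgdir=${1:-/sys/fs/cgroup}
    for e in "$cgdir"/*; do
        n=$(basename "$e")
        if [ -d "$e" ]; then
            echo "$n: child cgroup (directory)"
        elif [ "${n%%.*}" = "cgroup" ]; then
            echo "$n: core interface file"
        else
            echo "$n: ${n%%.*} controller interface file"
        fi
    done
}
# e.g. (on a real system): classify_cgroup_entries /sys/fs/cgroup
```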
Enabling or Disabling Controllers
Let's first look at a key core interface file: cgroup.controllers. We briefly mentioned this in the previous section. Its contents are the list of controllers available to this cgroup; for the root cgroup, this means which controllers are configured in the kernel. As we just saw, for modern distributions, it's typically something like this: cpuset cpu io memory hugetlb pids rdma misc.
Be careful here: a controller appearing in this list doesn't mean it's already enabled in this cgroup hierarchy; in fact, by default, none of them are enabled!
Enabling a controller means that the constraints on its target resource distribution will take effect on direct children. To enable a controller, you write the string +<controller-name> to the cgroup.subtree_control pseudo-file (conversely, write -<controller-name> to disable it).
So, for example, to enable the CPU and I/O controllers but disable the memory controller (on the current cgroup, thus letting it take effect in its descendants, i.e., the levels below), you would do this (with root privileges):
echo "+cpu +io -memory" > cgroup.subtree_control
So now we know that the cgroup.subtree_control file holds a space-separated list of the controllers that are enabled for resource distribution from the current cgroup down to its children.
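The write-then-verify round trip can be wrapped like this (a sketch: set_subtree_controllers is a hypothetical helper; on a real system you'd run it as root against /sys/fs/cgroup or any other cgroup directory):

```shell
# set_subtree_controllers: hypothetical helper. Writes the given
# "+name"/"-name" tokens to a cgroup's cgroup.subtree_control file;
# requires root on a real system.
set_subtree_controllers() {
    cgdir=$1
    shift
    echo "$@" > "$cgdir/cgroup.subtree_control"
    # Read it back; on a real kernel this now reflects the resulting
    # enabled-for-children controller list
    cat "$cgdir/cgroup.subtree_control"
}
# Real-world usage (as root): set_subtree_controllers /sys/fs/cgroup +cpu +io -memory
```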
The kernel documentation puts it this way (https://docs.kernel.org/admin-guide/cgroup-v2.html):
Top-down Constraint
Resources are distributed top-down and a cgroup can only distribute resources