
Chapter 10: CPU Scheduler (Part 1)

Chapter Introduction

In the previous two chapters, we wrestled with the quagmire of memory management—MGLRU, DAMON, OOM Killer. These names should now sound familiar, if not headache-inducing. We were either dynamically allocating memory or desperately reclaiming it; in short, we were revolving entirely around memory.

Now, let's change the channel.

In this chapter and the next, we'll shift our focus from memory to the CPU. Specifically, CPU scheduling—the mechanism by which the operating system decides "who gets to run on the CPU and who gets kicked off to wait." This isn't just foundational knowledge for writing kernels and drivers; as a system architect, if you want your user-space programs to stay responsive and crash-free under high concurrency, you must understand this stuff.

We'll start from scratch: first, we need to figure out exactly what the kernel is scheduling (processes or threads?), and then understand what those POSIX scheduling policies are all about. Next, we'll pick up perf as our "scalpel" and dissect what's actually running on the CPU—watching with our own eyes as threads jump between cores.

Once we've mastered these visualization tools, we'll peel back the kernel layers like an onion to see how Linux implements complex scheduling logic through "modular scheduling classes." Finally, we'll dive into code-level details to figure out exactly who calls the scheduler functions, when they are called, and what actually happens during a context switch.

But before we dive headfirst into the code, we need to clarify a frequently misunderstood fact: the CPU is a shared resource, but that doesn't mean only processes are fighting for it. Linux has a deeper resource management framework—cgroups (Control Groups). If you looked at Figure 10.1 earlier, you might have noticed a blurry "cgroups" layer at the top. Ignore it for now; we'll dedicate the next chapter to it. But keep in mind that the final scheduling decisions are often the combined result of the scheduler and cgroups.

Ready? This ride might be a bit dizzying, but it's absolutely worth it.


10.1 CPU Scheduler Internals (Part 1): Essential Background Knowledge

Before we look at complex scheduling algorithms, we need to nail down a few of the most basic concepts. These form the foundation for understanding all the logic that follows.

10.1.1 What Exactly is Linux's Kernel Schedulable Entity (KSE)?

Back in Chapter 6, when we talked about processes and threads, we mentioned that for every living user-mode thread, the kernel allocates a task_struct structure, along with a user stack and a kernel stack. Kernel threads also have a task_struct, but they only have a kernel stack.

The key question now is: when the scheduler says "time to switch workers," who exactly is it switching?

In operating system terminology, this thing is called the Kernel Schedulable Entity (KSE).

Many people intuitively think that the "process" is the basic unit of scheduling. But on Linux, this intuition is wrong.

Linux's KSE is the thread, not the process (though, of course, every process contains at least one thread). The scheduler operates at the granularity of threads.

Let's use a hypothetical scenario to "anchor" this understanding.

Suppose your system has only 1 CPU core, running 3 user-space processes:

  • P1: has only 1 thread.
  • P2: has 2 threads.
  • P3: has 5 threads.

On top of that, there are 3 kernel threads running in the kernel.

So, how many entities are fighting for this single CPU core right now?

Not 3 (the number of processes), and not 6 (processes + kernel threads). The answer is 1 + 2 + 5 + 3 = 11 threads.

These 11 threads, if they are all in a "runnable" state (i.e., Ready), will be fighting it out in the CPU's "run queue." As for which process they belong to, the scheduler doesn't care. In the scheduler's eyes, there are only threads.

Analogy Moment 1: Establishing the Mapping You can think of the scheduler as a traffic cop. A process is a convoy, and a thread is each individual vehicle in that convoy. The cop doesn't direct "convoys"; they direct "each vehicle." A convoy might have one car, or it might have ten, but that's irrelevant to the cop—they only look at cars.

Analogy Moment 2: Revealing the Distance But this analogy breaks down when it comes to priority. In real traffic, vehicles in a convoy usually travel together. In Linux, however, different threads (cars) within the same process (convoy) can have one speeding down the highway (high priority) while another is broken down on the roadside (low priority). In the scheduler's eyes, they are independent individuals with no "brotherly loyalty."

Analogy Moment 3: Validating the Takeaway Returning to our traffic cop. When we say "system load is high," it actually means there are 11 cars fighting for lane space (11 threads), not just 3 convoys running. If the 5 cars in convoy P3 are all flooring the gas pedal, the cop will keep focusing on those 5 cars, completely ignoring the lone car from convoy P1.

Now we have it clear: the KSE is the thread. So in the following discussion, we will use the word "thread" instead of "process." If you're used to reading old-school OS textbooks that still talk about "process state machines," please automatically substitute "thread state machines" in your mind from now on.

10.1.2 The Linux Process (Thread) State Machine

Every Linux thread flows between different states during its lifetime. Drawing these states out gives us a state machine.

Since we've established that the KSE is the thread, let's use standard Linux terminology to describe these states. The ps command typically represents these states with a single letter, such as R, S, D, and so on.

Core State List

  1. TASK_RUNNING (R state):

    • Alias: Ready-to-run or Running.
    • Meaning: This is the most easily misunderstood state. It does not mean "currently running." It represents "currently running" OR "ready to run".
    • Key Point: As long as a thread is in this state, it is in the run queue. Linux doesn't bother to distinguish between "running on the CPU" and "queued up waiting to run"; both are R. They are all candidates, ready to hop onto the CPU at any time.
  2. TASK_INTERRUPTIBLE (S state):

    • Alias: Interruptible sleep.
    • Meaning: The thread is sleeping (waiting for an event, like a network packet or disk I/O).
    • Characteristic: Although it's "sleeping," it can be disturbed. If you send it a signal at this point, it will be forcibly woken up to handle the signal.
  3. TASK_UNINTERRUPTIBLE (D state):

    • Alias: Uninterruptible sleep.
    • Meaning: The thread is sleeping, and it is immovable.
    • Characteristic: Usually used for extremely critical I/O waits where it cannot be interrupted by signals. If a large number of processes in your system are in the D state, it usually means the storage system is hung or very slow. This is also why it shows up as D in ps (Dead? No, it's Disk Wait).
  4. TASK_STOPPED (T state):

    • Meaning: The thread has been paused. This usually happens when it receives a SIGSTOP signal or is being traced by a debugger.
  5. TASK_DEAD / TASK_ZOMBIE (X / Z state):

    • Meaning: The thread is dead, or has become a zombie.
    • Zombie: The thread has finished executing, but its parent process hasn't come to "collect its body" (called wait()). This is a half-dead, half-alive state that must be cleaned out of the system.

Dynamic State Transitions

When a thread is freshly created by fork() or pthread_create(), the kernel sets it to TASK_RUNNING (the R state). At this point, it enters the run queue, ready to be picked by the scheduler at any moment.

If this thread needs to wait for a resource (like reading a file), it will call a scheduler-related sleep function, change its state from R to S or D, and remove itself from the run queue, placing itself into a wait queue.

When the event it was waiting for occurs (such as the disk data arriving), the kernel "wakes" it up. This actually does two things:

  1. Removes it from the wait queue.
  2. Puts it back onto the CPU's run queue, changing its state back to TASK_RUNNING.

Note that being woken up does not mean it runs on the CPU immediately. It simply becomes a candidate in the R state again and has to wait in line for the scheduler to pick it.

Code Perspective: __state

How is all of this implemented at the code level? The answer lies in the task_struct structure.

struct task_struct {
// ...
unsigned int __state; // note: previously named 'state'
// ...
};

This member, __state, is simply an integer holding state flag bits. When it equals TASK_RUNNING, the thread is runnable (on a run queue, possibly executing right now); when it equals TASK_INTERRUPTIBLE, the thread is in an interruptible sleep.

⚠️ Pitfall Warning Never assume there is only one global run queue in the system, and never assume there is only one wait queue. Linux maintains a separate run queue for each CPU core. Wait queues, on the other hand, are everywhere—any driver or kernel module can create one whenever it wants.

With this background knowledge, let's look at the specific scheduling policies Linux supports—that is, what rules it uses to pick someone from the run queue.

10.1.3 POSIX Scheduling Policies

You can roughly think of a "scheduling policy" as an "algorithm for selecting talent."

Linux doesn't have just one scheduling algorithm. The POSIX standard mandates that any compliant operating system must implement at least three policies. Linux not only implements these three but adds a few of its own, seamlessly integrating them all through a modular scheduling class design.

First, we need to establish a major premise: every thread in the system, at any given moment, is bound to a specific scheduling policy. This policy can be changed dynamically.

The table below is a cheat sheet for Linux scheduling policies. I recommend skimming it first to get a general impression; we'll pick out the key points for a deep dive later.

  • SCHED_OTHER (also known as SCHED_NORMAL): The default policy—the "ordinary person's" policy, pursuing fairness and overall throughput. It belongs to the CFS (Completely Fair Scheduler) class. Priority range: nice value, -20 (highest) to +19 (lowest); default 0.
  • SCHED_FIFO: A very aggressive real-time policy. Once it gets on the CPU, unless it voluntarily steps down (blocks/exits) or is kicked off by a higher-priority thread, it holds the CPU forever—its time slice is effectively infinite. Priority range: real-time priority 1 to 99; a higher number means more aggressive.
  • SCHED_RR: The round-robin real-time policy. Just as aggressive as FIFO, but it has a time slice (100 ms by default). When the slice expires and another thread of the same priority is waiting, they swap—everyone takes turns. Priority range: real-time priority 1 to 99.
  • SCHED_BATCH: Suited to batch tasks—programs that don't interact with users and just do head-down computation. The kernel preempts them as little as possible, to boost throughput. Priority range: nice value, -20 to +19.
  • SCHED_IDLE: The extremely humble policy—lower than even nice +19. A SCHED_IDLE thread runs only when the CPU has nothing else to do. (The per-core idle thread—swapper, PID 0—plays a similar "run only when idle" role, though the kernel handles it via a dedicated idle scheduling class rather than this policy.)

Table 10.1: Overview of Linux (POSIX-compliant) Scheduling Policies

⚠️ Concept Distinction: Hard Real-Time vs. Soft Real-Time The SCHED_FIFO and SCHED_RR mentioned here are known as soft real-time policies. As a general-purpose operating system (GPOS), Linux is not a hard real-time operating system (RTOS) by default. "Hard real-time" means the deadline must be met, no exceptions; "soft real-time" means "as fast as possible, but a missed deadline is not fatal." You can patch Linux (e.g., with PREEMPT_RT) to move it toward hard real-time, but what we usually run is the general-purpose version.

Deep Dive: SCHED_FIFO vs SCHED_RR

Although both are real-time policies, there is a subtle difference that is crucial in practice:

  • SCHED_FIFO: No concept of a time slice. If two FIFO threads have the same priority (say, both are 50), the one that started first will run continuously until it voluntarily blocks or is interrupted by a higher priority. The other thread of the same priority can only watch helplessly. Consequence: If you design it poorly, threads at the same priority level might starve.
  • SCHED_RR: Has a time slice (Round Robin). Threads of the same priority take turns; when the time slice is up, it moves to the next one. This is a bit friendlier than FIFO.

The Supreme Authority of Interrupts

Here we must insert a "safety brake" mechanism.

Even if your thread is at SCHED_FIFO priority 99 (theoretically the highest), hardware and software interrupts always rank higher. When an interrupt arrives, your thread must obediently get out of the way. This is the last line of defense for system stability.

The Hierarchy of Thread Priorities

Now let's zoom out and look at the overall priority ladder in Linux (see Figure 10.3).

  1. Real-time Priority (1 ~ 99):
    • Used by SCHED_FIFO and SCHED_RR.
    • Rule: As long as there is a real-time thread in the run queue, all SCHED_OTHER (normal threads) must step aside. They are not in the same arena.
  2. Normal Priority (Nice value: -20 ~ +19):
    • Used by SCHED_OTHER.
    • The smaller the Nice value, the higher the priority. 0 is the default.
    • This is essentially a "nice guy" system: the more "Nice" (generous) you are, the more you yield the CPU (lower priority).

⚠️ Practical Experience: Interference from Autogroups Many modern distributions have CONFIG_SCHED_AUTOGROUP enabled. This uses cgroups technology to "boost" the priority of processes running under a terminal window. This can make traditional nice values appear to be broken. If you find that nice values aren't working during testing, check this kernel option.

Now we know what policies are available. Next, we need to see intuitively how these policies actually operate on the CPU.

10.1.4 Visualizing the Scheduling Flow

Multi-core systems allow threads to execute in parallel on different CPUs. This improves throughput but also brings debugging nightmares. If several threads are operating on a piece of shared memory at the same time, who goes first? This is hard to see with the naked eye.

We need a way to see, like watching a replay, which thread occupied which CPU core at which moment in time.

Here are two tools: one is the simple GUI tool gnome-system-monitor, and the other is the deeply powerful perf.

Method 1: Using gnome-system-monitor

This is the system monitor tool that comes with the GNOME desktop environment.

To see something interesting, we need to generate some "load." We will run three CPU-intensive processes and manually pin them to different CPU cores.

Here we'll use a powerful tool: taskset (part of the util-linux package).

Experiment Steps

Assume your system has 6 cores (numbered 0 to 5).

  1. Run a yes process on CPU 1 (this is an infinite loop that prints 'y', a real CPU killer):
    taskset -c 1 yes > /dev/null &
  2. Similarly, run one each on CPU 2 and CPU 3:
    taskset -c 2 yes > /dev/null &
    taskset -c 3 yes > /dev/null &

Now open gnome-system-monitor and click the "Resources" tab.

You will see a screen similar to Figure 10.4: three CPU graphs pegged at 100%. (Note that the GUI usually numbers CPUs starting at 1, so the maxed-out graphs labeled CPU 2, 3, and 4 correspond to cores 1, 2, and 3.) This is concurrency and parallelism made visible.

Use pkill yes to end this madness.

Method 2: Using the perf Tool

If you think gnome-system-monitor is too much of a toy, then perf will absolutely satisfy your inner geek. It is a performance analysis artifact built into the Linux kernel.

⚠️ Prerequisites perf is strongly tied to the kernel version. On Ubuntu, you need to install linux-tools-$(uname -r). If you compiled a custom-modified kernel (like the 6.1 included with the book), there might not be a corresponding perf package. It's recommended to use the distribution's default kernel for this experiment.

1. Command-Line Mode: perf sched map

Let's use the same experimental scenario. For convenience, the book provides a script, ch10/concurrent_exercise/concurrency, that can launch yes processes on all cores at once.

cd ch10/concurrent_exercise
./concurrency 6 # start one yes process on each of the 6 cores

In another terminal window, start recording scheduling events:

sudo perf sched record

Let it run for ten-plus seconds, then press Ctrl+C to stop. This will generate a binary file named perf.data.

Now, let's turn it into a "map" that humans can read:

sudo perf sched map > mymap.txt

Open mymap.txt, and you'll see a stream of characters. It looks like gibberish, but look closely and it's very regular (refer to Figure 10.5).

  • Each row represents a point in time.
  • Each column represents a CPU core.
  • Two-character codes represent threads. For example, A0 is the migration thread, and B0 is our yes thread.

The most striking thing is that if you search for B0..D0..F0... using vim, you'll find these characters appearing on the same line simultaneously (as shown in Figure 10.6). What does this prove? It proves they are truly running in parallel!

2. Graphical Mode: perf timechart

If the character stream makes you dizzy, perf can also generate SVG vector graphics. Open it in a browser, and the visual impact is maxed out.

The command is as follows:

sudo perf timechart -i ./perf.data

This will generate an output.svg file. Open it with Firefox or Chrome (as shown in Figures 10.7 and 10.8).

You will see colored bar charts.

  • Blue represents the Running state.
  • You will clearly see our yes processes spread out across the CPU cores like the Great Wall, running endlessly.
  • Interspersed below are other processes, mostly gray (sleeping state).

This visualization capability is an absolute lifesaver for analyzing performance bottlenecks (like why my application stutters twice a second).


With these intuitive feelings and background knowledge, we can now dive deep into the kernel to see how this complex mechanism is actually designed. In the next section, we will deeply analyze modular scheduling classes.


Exercises

Exercise 1: Understanding

Question: In the Linux scheduler's state machine model, both TASK_UNINTERRUPTIBLE (D state) and TASK_INTERRUPTIBLE (S state) indicate that a process is sleeping. From a system management perspective, analyze: why does Linux need the D state? If a system experiences a large number of processes in the D state that cannot recover, what underlying system problem does this usually imply?

Answer and Analysis

Answer: The D state is used to handle uninterruptible sleep, typically applied when waiting for critical I/O operations (such as disk swap writes), to prevent the process from being interrupted by a signal during data writing, which could lead to data corruption. If a system experiences a large number of unrecoverable processes in the D state, it usually implies a severe blockage in the underlying storage subsystem (such as an unresponsive NFS mount, disk failure, or a "hang" caused by I/O stack congestion).

Analysis: This question tests the understanding of the difference between the S and D states in the process state machine.

  1. Concept Distinction: The S state allows waking up via signals (like the user pressing Ctrl+C) and is suitable for most waits; the D state ignores signals and must wait for the event it's waiting for (like hardware I/O completion) to occur.
  2. Design Intent: The D state guarantees the atomicity of certain critical operations (like writing disk data structures), preventing the process from being unexpectedly terminated while holding a lock or operating on critical hardware paths.
  3. Troubleshooting: The D state is usually brief. If a process is permanently in the D state, it means the resource it is waiting for will never become ready (for example, a server crash causing network I/O to hang). This is known as an "uninterruptible sleep deadlock" and is a fault signal that system administrators must be highly vigilant about.

Exercise 2: Application

Question: Suppose you are developing a high-performance computing application, and one of its threads needs to perform intensive mathematical calculations. By default, this thread uses the SCHED_OTHER policy. To ensure this thread gets the maximum CPU time slice and is preempted by other normal processes as little as possible, without using real-time policies (SCHED_FIFO/RR), how should you adjust its Nice value? What is the corresponding command?

Answer and Analysis

Answer: The Nice value should be adjusted to -20 (highest priority). You can use the command nice -n -20 ./your_program to start the program, or use renice -n -20 -p <PID> to modify the Nice value of a running process.

Analysis: This question tests the practical application of the Nice value within the SCHED_OTHER policy.

  1. Policy Constraints: SCHED_OTHER is the default policy for normal Linux processes, handled by the CFS algorithm. It can never preempt real-time threads (priorities 1-99); within SCHED_OTHER, the nice value is the only priority knob available.
  2. Nice Value Range: -20 to +19. -20 represents the highest priority, and +19 represents the lowest.
  3. Application Scenario: For CPU-intensive tasks, setting the Nice value to -20 causes the CFS scheduler to assign the task a higher weight, so its virtual runtime (vruntime) grows more slowly. It therefore sits further to the left in the red-black tree, gets picked more often, and is preempted less frequently by tasks with higher (nicer) Nice values.

Exercise 3: Application

Question: In a single-core CPU system, there are two threads:

  • Thread A: SCHED_FIFO policy, priority 10, executing an infinite loop (while(1);).
  • Thread B: SCHED_FIFO policy, priority 5, trying to calculate 1+1.
  • The system also has several Interrupt Service Routines (ISRs) running.

Question: Will Thread B get a chance to run? Can the Interrupt Service Routines run? Please explain the scheduler's behavior in detail.

Answer and Analysis

Answer: Thread B will not get a chance to run (until Thread A voluntarily yields the CPU or exits). The Interrupt Service Routines (ISRs) can run normally and preempt Thread A.

Analysis: This question tests the understanding of the SCHED_FIFO preemption model and interrupt priorities.

  1. Inter-Thread Scheduling: SCHED_FIFO is a real-time preemption policy. Thread A (Prio 10) has a higher priority than Thread B (Prio 5). According to the rules, a lower-priority task (B) can only run when the higher-priority task (A) blocks (I/O), exits, or voluntarily gives up the CPU. Since A is an infinite loop with no blocking, B will starve permanently.
  2. Interrupt vs. Process Priority: Interrupts in the Linux kernel (hardware and softirqs) have a higher priority than all processes (including SCHED_FIFO at priority 99). Therefore, even though A occupies the CPU, when a hardware interrupt occurs (like a clock interrupt or a network packet arriving), the CPU will still pause Thread A and switch to executing the ISR. Only after the ISR finishes executing will Thread A's execution resume.
  3. Conclusion: B starves; interrupts are unaffected. This is why you must be careful when writing SCHED_FIFO programs: an unbounded busy loop at real-time priority can effectively soft-hang the system (or at least the core it occupies), since nothing at a lower priority ever runs again.

Exercise 4: Thinking

Question: In the Linux kernel source code, there is a scheduler design principle known as the "golden rule": "Scheduler code must never run in an atomic context (including interrupt context)." Think about this: if this principle is broken and schedule() is allowed to be called to perform a process switch while holding a spinlock or in an interrupt context, what catastrophic consequences would occur? Please provide a deep analysis combining interrupt handling and locking mechanisms.

Answer and Analysis

Answer: If this principle is broken, it would lead to "double-acquiring a lock" or deadlocks, potentially causing the system to deadlock completely or the kernel to crash. The core reason is that schedule() triggers a switch_to, switching to another process that might also be trying to acquire the same spinlock, or triggering another interrupt on the same CPU.

Analysis: This question tests a comprehensive understanding of deep kernel scheduling principles and concurrency control.

  1. Definition of Atomic Context: Atomic context means the kernel is executing critical-section code (holding a spinlock) or is handling an interrupt. In such a context it must not sleep, and process scheduling must not occur.
  2. Deadlock Scenario Analysis (Spinlock Perspective):
    • Suppose process P1 on CPU0 holds spinlock L (preemption is disabled at this point) and enters the critical section.
    • If calling schedule() were allowed while holding the lock, the scheduler would move P1 out of the run queue and swap in process P2.
    • If P2 (running on CPU0) also tries to acquire lock L, it will spin-wait for P1 to release the lock.
    • However, P1 can only release the lock when it is scheduled back to CPU0 and executes.
    • Result: P2 spins waiting for P1 to release the lock, while P1 waits to be scheduled back onto the very CPU that P2 now occupies—a circular wait, i.e., a deadlock. The system becomes unresponsive.
  3. Interrupt Context Analysis:
    • Interrupt handlers (ISRs) do not belong to any process; they "borrow" the current process's stack and resources to run.
    • If schedule() is called inside an ISR, the kernel will attempt to save the "interrupt context" and restore a "process context." This is not only logically chaotic (who is sleeping in the interrupt?), but if the interrupt occurs again (recursive interrupt), the kernel stack could overflow or the state could become completely corrupted.
  4. Design Philosophy: Therefore, Linux strictly requires that all function paths that might cause sleep (and consequently call the scheduler) must be in a non-atomic context. This is the most fundamental baseline for guaranteeing kernel concurrency safety.

Key Takeaways

Linux's Kernel Schedulable Entity (KSE) is the thread, not the process. The scheduler operates at the granularity of threads, and different threads within the same process are completely independent individuals in the scheduler's eyes, with no "brotherly loyalty" or inherent synchronization relationship.

The key to understanding thread state transitions lies in distinguishing TASK_RUNNING from actual execution. The R state merely indicates that the thread is in the run queue (currently executing or ready to execute), whereas a non-R state (like S or D) means the thread has given up the CPU and entered a wait queue. A wake-up operation simply puts it back into the run queue and does not mean it immediately gains the CPU.

Linux strictly divides thread priorities into real-time (1-99) and normal categories. As long as there is a real-time thread in the run queue, all normal threads must step aside regardless of their Nice value. Furthermore, even the highest-priority real-time thread must yield to the absolute authority of hardware and software interrupts.

Although both SCHED_FIFO and SCHED_RR are soft real-time policies, the former, once it occupies the CPU, will run continuously as long as it doesn't block or get preempted by a higher priority, easily causing same-priority threads to starve. The latter introduces a time slice mechanism, guaranteeing fair round-robin among threads of the same priority.

Using tools like perf sched map and perf timechart, we can visualize CPU runtime data into character maps or SVG graphs. This not only intuitively verifies multi-core parallelism (different cores running threads simultaneously) but also serves as a "scalpel" for diagnosing complex scheduling scenarios and performance bottlenecks.