Chapter 5: The Art of Waiting: Timers, Threads, and Workqueues
5.1 The Art of Delaying: How Long Should We Sleep?
When writing programs in user space, waiting is trivially simple. Want your program to pause for a second? sleep(1). Need microsecond precision? usleep(). The operating system handles everything for you: your process goes to sleep, CPU resources are yielded to others, and the OS wakes you up when the time is up.
But inside the kernel, things get tricky.
Imagine you are writing a driver that needs to send commands to a slow hardware device. The hardware manual states: "After writing a command, you must wait at least 5 microseconds before sending the next one."
In user space, this wouldn't be an issue at all. But in the kernel, you immediately face an awkward choice:
- Option A: Busy-waiting. The CPU spins idly, counting cycles. This is certainly precise, but CPU resources are wasted—it's like standing in front of a microwave staring at the countdown, doing absolutely nothing.
- Option B: Process sleep. You tell the scheduler, "I'm going to take a nap," and the CPU moves on to handle other processes. This is efficient, but the question is—can you sleep?
If you are currently running in Interrupt Context, or if you are holding a spinlock, then "sleeping" is strictly forbidden. In scenarios where blocking is not allowed, your only way out is busy-waiting.
This is the core contradiction this chapter aims to resolve: how to handle the passage of time safely and efficiently in the kernel.
In this chapter, we will cover the main mechanisms the kernel provides for waiting and for "doing this later." Each has its own use cases, and the cost of choosing the wrong tool ranges from degraded system performance all the way to a deadlock.
- Short delays: Whether busy-waiting or briefly sleeping, how do we choose correctly?
- Kernel timers: Like an alarm clock that goes off at a future point in time.
- Kernel threads: Handing background tasks to an independent thread that runs at its own pace.
- Workqueues: The most common "deferred execution" mechanism, offloading heavy work to dedicated kernel threads.
These mechanisms are not isolated from each other; they complement one another in different contexts. Understanding when they are available and when they are not is key to writing robust kernel code.
5.2 Time Delays in the Kernel: Busy-Waiting vs. Sleeping
Let's start with the most basic scenario: I need to wait right now.
This requirement is very common, such as for hardware timing constraints. Depending on whether process scheduling is allowed (whether the CPU is allowed to switch to do other things), the kernel divides delay APIs into two distinct categories: atomic delays and blocking delays.
Atomic Delays — I Can't Sleep, I Can Only Count Sheep
When you are in atomic context—for example, handling an interrupt, holding a spinlock, or running with preemption disabled—you absolutely cannot allow the CPU to be scheduled away. If you attempt to put the process to sleep in this state, the kernel will unhesitatingly throw a BUG() at you, or simply deadlock.
In this scenario, we use the *delay() family of functions. Their essence is a busy loop.
The kernel provides three levels of precision for atomic delay APIs:
- ndelay(unsigned long nsecs): Nanosecond-level delay.
- udelay(unsigned long usecs): Microsecond-level delay.
- mdelay(unsigned long msecs): Millisecond-level delay.
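As a concrete illustration, here is a hedged sketch of the hardware scenario from the chapter opening: two commands separated by a mandatory 5 µs gap, issued while holding a spinlock, so busy-waiting is the only option. All device names here (my_dev, REG_CMD, CMD_*) are hypothetical, not a real driver's API.

```c
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/io.h>

/* Hypothetical device; all names and offsets are illustrative. */
struct my_dev {
    spinlock_t lock;
    void __iomem *regs;
};
#define REG_CMD   0x00
#define CMD_RESET 0x01
#define CMD_START 0x02

static void my_dev_send_two_cmds(struct my_dev *dev)
{
    unsigned long flags;

    spin_lock_irqsave(&dev->lock, flags);
    writel(CMD_RESET, dev->regs + REG_CMD);
    udelay(5);      /* busy-wait: sleeping while holding the lock could deadlock */
    writel(CMD_START, dev->regs + REG_CMD);
    spin_unlock_irqrestore(&dev->lock, flags);
}
```

Note that the lock is held across the whole sequence precisely because the hardware forbids interleaving other commands into the 5 µs window.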
How Is It Calculated?
You might wonder how udelay(1) can guarantee waiting for exactly 1 microsecond. CPU frequencies change, so how does it calculate this accurately?
This brings us to the ancient yet important concept of BogoMIPS.
BogoMIPS (Bogus MIPS) is a value calibrated by the kernel at boot time: roughly, how many "do-nothing" empty loops the CPU can execute in one second. The underlying calibration result is stored in the kernel's loops_per_jiffy variable, which records how many such loops fit into a single timer tick (a jiffy).
When you call udelay(), the kernel uses BogoMIPS to calculate how many loops are needed to burn through that delay time.
⚠️ Pitfall Warning
Never use mdelay() for long waits, and above all never for second-scale delays. This pins the CPU at 100% utilization, spinning like a "mad dog" and causing system-wide stuttering. mdelay() is only suitable for extremely short, millisecond-level waits in contexts where scheduling is truly impossible.
Blocking Delays — I'm Going to Sleep, Don't Disturb Me
If your current state is process context and you aren't holding any locks, you can do the more polite thing: yield the CPU.
This is when we use the *sleep() family of functions. This invokes the scheduler, removing the current process from the CPU's Run Queue and placing it into a wait queue, until the time expires and it is woken up.
- usleep_range(unsigned long usecs_min, unsigned long usecs_max)
- msleep(unsigned int msecs)
- ssleep(unsigned int seconds)
Why Is usleep_range() a Range?
You might find it strange that, unlike user-space usleep(), you can't specify a single exact value. The answer is: for power saving and performance optimization.
If you require the system to "wake up precisely at 1000 microseconds," the system might need to use a high-resolution timer, which prevents the CPU from entering deep power-saving modes. If you tell the system, "It's fine if I wake up anywhere between 1000 and 1500 microseconds," the scheduler has much more freedom. It can stretch this short period slightly (this is the concept of timer slack), allowing the CPU to sleep a little longer, reducing wake-up frequency, and lowering power consumption.
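As a concrete (and hypothetical) example, here is how a driver might poll a slow status register from process context. my_dev, my_read_status(), and STATUS_READY are made-up names, and the 400 to 500 µs window is illustrative:

```c
#include <linux/delay.h>
#include <linux/errno.h>

/* Hypothetical: wait for the device to raise its READY bit. */
static int my_wait_ready(struct my_dev *dev)
{
    int retries = 100;

    while (retries--) {
        if (my_read_status(dev) & STATUS_READY)
            return 0;
        /* Process context, no locks held: sleep, and give the timer
         * subsystem 100 us of slack so it can batch wakeups. */
        usleep_range(400, 500);
    }
    return -ETIMEDOUT;
}
```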
Recommended Practices:
- Less than 10 microseconds: use udelay() (busy-waiting).
- 10 microseconds to 20 milliseconds: use usleep_range().
- Greater than 20 milliseconds: use msleep().
msleep() vs. msleep_interruptible()
The traditional msleep() is uninterruptible: once it goes to sleep, nothing short of the timer expiring (or the system going down) will wake it. Signals are simply ignored.
msleep_interruptible(), on the other hand, can be interrupted by signals. This is very useful for kernel threads that need to respond to user-space operations (like Ctrl+C). Its return value is the remaining time in milliseconds (nonzero if it was woken early by a signal).
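A typical pattern is a kernel thread that does periodic work but wakes up promptly when signaled. The sketch below is hypothetical (the 500 ms period and function names are illustrative); note that kernel threads ignore signals unless explicitly allowed:

```c
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/sched/signal.h>

static int my_thread_fn(void *data)
{
    allow_signal(SIGKILL);  /* kthreads ignore signals unless allowed */

    while (!kthread_should_stop()) {
        /* ... do one round of periodic work here ... */

        /* Returns the remaining milliseconds if a signal arrived
         * early; 0 means the full period elapsed undisturbed. */
        if (msleep_interruptible(500))
            break;  /* interrupted by a signal: exit promptly */
    }
    return 0;
}
```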
Let's See the Actual Results — Is the Timing Accurate?
Let's let the code do the talking. We will test these delay functions in a kernel module and use a high-resolution timer (HRT) to benchmark them, seeing exactly how much the actual delays are.
We need to obtain high-precision timestamps. The kernel provides ktime_get_real_ns(), which returns the number of nanoseconds since the Epoch (1970-01-01 00:00:00 UTC).
Code Demo
#include <linux/module.h>
#include <linux/delay.h>
#include <linux/ktime.h>

static int __init delay_test_init(void)
{
    ktime_t start, end;
    s64 actual_time_ns;

    pr_info("Testing delay APIs...\n");

    /* 1. Test mdelay (busy-wait, ~2 ms) */
    start = ktime_get_real_ns();
    mdelay(2);
    end = ktime_get_real_ns();
    actual_time_ns = end - start;
    pr_info("mdelay(2) expected: 2000000 ns, actual: %lld ns\n", actual_time_ns);

    /* 2. Test msleep (sleep, ~20 ms) */
    start = ktime_get_real_ns();
    msleep(20);
    end = ktime_get_real_ns();
    actual_time_ns = end - start;
    pr_info("msleep(20) expected: 20000000 ns, actual: %lld ns\n", actual_time_ns);

    /* 3. Test usleep_range (slack allowed, ~5000-5500 us) */
    start = ktime_get_real_ns();
    usleep_range(5000, 5500);
    end = ktime_get_real_ns();
    actual_time_ns = end - start;
    pr_info("usleep_range(5000, 5500) expected min: 5000000 ns, actual: %lld ns\n",
            actual_time_ns);

    return 0;
}

static void __exit delay_test_exit(void)
{
}

module_init(delay_test_init);
module_exit(delay_test_exit);
MODULE_LICENSE("GPL");
When you load this module and check the dmesg output, you will notice some interesting phenomena:
- mdelay() is extremely precise, because it busy-waits.
- The actual durations of msleep() and usleep_range() are usually longer than the requested values. This is because waking a process takes time (scheduling latency), and even after waking, the process still has to queue for CPU time.
⚠️ Remember: Delays in the kernel are always "at least" this much time, not "exactly" this much time. If you have hard requirements for hardware timing (e.g., "must be less than 10us"), you can only use busy-waiting; if you just want to buffer some data, use sleeping and yield the CPU to someone more important.