14.6 Busy Poll Sockets
In the previous section, we covered Cgroups and Namespaces, the cornerstones of containers. Now, let's shift our focus away from "isolation" and zero in on a topic of extreme performance.
Imagine you are writing a high-frequency trading system, or an edge gateway that requires nanosecond-level response times. Your code runs on a Xeon processor, paired with a top-tier 10GbE NIC—everything looks perfect. But when you stare at the flame graph, you'll notice a red region that sticks out like a sore thumb: context switches.
Every time a packet arrives, the NIC fires an interrupt, forcing the CPU to drop what it's doing, execute the interrupt handler, and then schedule your application to wake up. This entire flow has been optimized to the hilt in modern Linux, but for certain extreme scenarios, it is still too "heavy."
Can we completely bypass interrupts and scheduling?
That's exactly what Busy Poll Sockets are all about. At its core, it's a brute-force approach that trades CPU cycles for time.
The Traditional Dilemma: Sleeping and Waking
In the traditional network stack model, when an application reads a Socket and the receive queue is empty, what does the kernel do?
In blocking mode, the application goes to sleep. It suspends itself, releasing the CPU to other processes. At this point, the application is in a waiting state.
This is elegant, right? No data, so sleep—no CPU wasted. But the cost is latency.
- The application finds no data and calls schedule() to sleep.
- The NIC receives a packet and triggers a hardware interrupt.
- The CPU pauses its current task and jumps to the interrupt handler.
- The driver hands the packet to the protocol stack (L3) and pushes it into the Socket queue.
- The protocol stack wakes up the sleeping application.
- The scheduler performs a context switch and resumes the application's execution.
By the time this entire "combo" finishes, microseconds have slipped away. For low-latency applications, this is unacceptable.
Starting with kernel 3.11 (originally called Low Latency Sockets Poll, later renamed to Busy Poll per Linus's suggestion), Linux introduced a more aggressive approach.
Busy Poll: Refusing to Sleep
The core idea behind Busy Poll is simple: since sleeping and waking take too much time, I just won't sleep.
When an application reads a Socket and finds no data, instead of blocking, it goes straight down to the driver layer to ask: "Do you have any freshly received goods? If so, give them to me directly—skip the interrupt flow."
That's exactly what the ndo_busy_poll callback function does.
You can think of this process like this:
The traditional model is like passively waiting for a delivery: you sleep at home (application blocks), the delivery arrives and knocks on the door (interrupt), and you wake up to answer and sign for it. Busy Poll is like proactively checking the pickup station: you run to the station (driver) every few minutes to ask, "Is there a package for me?" If there is, you take it immediately—no need to wait for the courier to call.
Implementation of the Mechanism
For a NIC driver to support Busy Poll, it needs to implement the ndo_busy_poll callback in the net_device_ops structure.
Take a look at Intel's ixgbe driver (drivers/net/ethernet/intel/ixgbe/ixgbe_main.c); its ixgbe_low_latency_recv() is a typical implementation. This function's task is straightforward:
- Directly check the NIC registers or Ring Buffer.
- If there are packets, extract them and pass them up to the protocol stack (L3).
- It might find packets belonging to other Sockets and handle them along the way (since it's already down here, grabbing an extra one costs little).
- Return the number of packets processed. If there are no packets, return 0.
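To make the shape of such a callback concrete, here is a heavily simplified sketch, loosely modeled on the ixgbe code of that era (the net_device_ops member took a struct napi_struct pointer and returned the packet count). The names my_q_vector, my_qv_lock_poll, and my_clean_rx_ring are hypothetical stand-ins, not real kernel symbols; LL_FLUSH_BUSY is the return code used to say "the interrupt path owns the ring right now."

static int my_busy_poll(struct napi_struct *napi)
{
    // Recover our per-queue context from the napi handle.
    struct my_q_vector *qv = container_of(napi, struct my_q_vector, napi);
    int found;

    // If the interrupt path holds the ring, back off; the kernel
    // simply falls back to the traditional flow for this round.
    if (!my_qv_lock_poll(qv))
        return LL_FLUSH_BUSY;

    // Pull completed descriptors straight off the Rx ring and push
    // the packets up into the protocol stack, with a small budget.
    found = my_clean_rx_ring(qv, 4);

    my_qv_unlock_poll(qv);
    return found; // 0 means "nothing here"; the caller may keep spinning
}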
If the driver doesn't implement this callback, or returns 0, the kernel falls back to the traditional flow. It won't crash; it simply reverts to normal mode.
One More Detail: Preventing "Near Misses"
There is a subtle timing issue here.
Suppose the application just asked the driver "Got any goods?" and the driver said "No" and returned. A nanosecond later, the NIC receives a new packet.
Following traditional logic, the application should now go to sleep and wait for the next interrupt. But this circles right back to the "high latency" problem.
To avoid this awkward "it arrives right after you leave" scenario, Busy Poll introduces a fallback polling period.
Even if the driver returns no data the first time, the kernel doesn't give up immediately. Instead, it continues to linger in the driver layer within a configurable time window. As soon as a new packet arrives, it gets grabbed instantly.
That's why we see a parameter measured in microseconds (µs) in the configuration—it determines how long the application is willing to stubbornly wait in the driver layer for a packet.
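In code terms, the spin window looks roughly like the following sketch. This is an illustration of the logic, not the real implementation (which lives in the kernel's sk_busy_loop()); busy_poll_once() and now_usecs() are hypothetical helpers standing in for the real driver call and clock read.

static bool busy_wait_for_data(struct sock *sk)
{
    // sk_ll_usec holds the per-socket window, set via SO_BUSY_POLL
    // or inherited from the busy_read sysctl.
    unsigned long deadline = now_usecs() + sk->sk_ll_usec;

    do {
        // Ask the driver directly: any freshly received packets?
        if (busy_poll_once(sk) > 0)
            return true;   // got data without ever sleeping
        cpu_relax();       // ease pressure on the core while spinning
    } while (now_usecs() < deadline && !need_resched());

    return false; // window expired: fall back to block-and-interrupt
}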
Comparison: Two Paths to the Same Destination
Let's compare the flow differences between these two modes using Figure 14-1. This will give you an intuitive feel for what the CPU is actually doing.
Figure 14-1 Left: Traditional Receive Flow
1. Application checks the receive queue.
2. No data, block.
3. NIC receives a packet.
4. Driver hands the packet to the protocol layer.
5. Protocol layer/Socket wakes up the application.
6. Application gets the data.
The key here is between steps 2 and 5: interrupt + context switch.
Figure 14-1 Right: Busy Poll Receive Flow
1. Application checks the receive queue.
2. Goes straight down to the driver layer to check for pending packets (polling begins).
3. Meanwhile, the NIC receives a packet.
4. Driver processes this pending packet.
5. Driver hands the packet to the protocol layer.
6. Application gets the data.
Notice the right side:
- No wake-up process in step 5 (the application stays awake the whole time).
- No interrupt handling (the application polls proactively).
- Bypasses context switch and interrupt overhead.
This mode can deliver latency approaching hardware limits. But the cost is: CPU utilization will shoot straight up. If many Sockets have Busy Poll enabled and they compete for polling time on the same CPU core, performance can actually degrade due to resource contention.
Enabling Globally: Using a Sledgehammer to Crack a Nut
To put all Sockets in the system into this "combat mode," you can directly modify kernel parameters.
There are two main /proc/sys/net/core/ parameters:
- busy_read: Controls the Busy Poll duration (in microseconds) during the read() system call.
- busy_poll: Controls the Busy Poll duration (also in microseconds) during the select() and poll() system calls.
By default, they are both 0, meaning disabled.
You can set them to 50 (microseconds). This is a widely recognized starting point that balances latency and CPU consumption.
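On the command line that looks like this (the sysctl names are real; 50 is just the starting value suggested above):

sysctl -w net.core.busy_read=50   # poll up to 50 µs inside read()
sysctl -w net.core.busy_poll=50   # poll up to 50 µs inside select()/poll()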
There's a pitfall to watch out for here:
- For a blocking read, busy_read defines how long to stubbornly poll.
- For a non-blocking read, if the Socket has Busy Poll enabled, the kernel will only poll once before returning to the user. This is because the semantics of a non-blocking call are "don't wait for me."
Precision Strike: Enabling On-Demand
Enabling globally is a bit brute-force: to crack one nut, you swing the sledgehammer and turn on every light in the house while you're at it. A more elegant approach is to let the application specify exactly which Sockets need low latency.
We can achieve this by setting the SO_BUSY_POLL Socket option.
int val = 50; // microseconds to keep polling before giving up
if (setsockopt(sock, SOL_SOCKET, SO_BUSY_POLL, &val, sizeof(val)) < 0)
    perror("setsockopt(SO_BUSY_POLL)");
This step sets the sk_ll_usec field inside the Socket structure (struct sock).
Configuration advice:
If you use SO_BUSY_POLL, we recommend setting the global net.core.busy_read to 0, leaving control entirely to the application. This way, other unrelated applications and services on the system can still follow the traditional power-saving path, rather than suffering because of your high-performance requirements. One caveat: on the kernels of this era, an unprivileged process could only lower the per-socket value below what busy_read had initialized it to, so enabling SO_BUSY_POLL from a zeroed global setting requires CAP_NET_ADMIN.
Tuning and Configuration: Making It Spin Faster
Enabling Busy Poll is just starting the Ferrari; to make it fast on the track, you still need to tune the chassis. Here are a few practical tips:
1. Interrupt Coalescing
We recommend using ethtool -C to increase the NIC's interrupt coalescing time (rx-usecs), for example, to 100.
This sounds counterintuitive—we're chasing low latency, so why delay interrupts?
Because Busy Poll applications proactively fetch data, the role of interrupts is diminished. If interrupts are too frequent, they cause unnecessary context switch overhead. Lengthening the interrupt coalescing time lets the driver handle packets in batches or wait for the application to poll, which yields better results.
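Assuming the interface is eth0, the command looks like this (100 is the example value from above):

ethtool -C eth0 rx-usecs 100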
2. The GRO/LRO Trade-off
Generic Receive Offload (GRO) and Large Receive Offload (LRO) usually improve throughput, but they aggregate packets before delivery, which can add latency and disturb packet ordering, especially when bulk traffic and low-latency traffic mix on the same NIC.
You can try disabling GRO and LRO using ethtool -K:
ethtool -K eth0 gro off lro off
But this isn't a silver bullet; often, keeping them enabled works best, so you have to test this.
3. Affinity and NUMA
This is the most critical point.
- The application thread is pinned to CPU Core A.
- The NIC IRQ (interrupt handling) is pinned to CPU Core B.
These should be two different cores. Furthermore, both cores should ideally be on the same NUMA node as the NIC device (cross-NUMA memory access is slow).
If the application and the IRQ contend for the same core, or if they cross NUMA boundaries, latency jitter will be severe. This is especially fatal when rx-usecs is set very low.
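A sketch of such a layout, with hypothetical core and IRQ numbers (check /proc/interrupts for the real Rx queue IRQ on your machine):

cat /sys/class/net/eth0/device/numa_node   # which NUMA node the NIC sits on
taskset -c 2 ./latency_app                 # pin the application to core 2
echo 8 > /proc/irq/120/smp_affinity        # steer the Rx IRQ to core 3 (mask 0x8)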
4. IOMMU
For ultimate low latency, some veterans will suggest turning off the IOMMU (Input/Output Memory Management Unit). The IOMMU exists for security (DMA protection) and virtualization, but it adds a layer of address translation overhead. On some systems, it might be disabled by default; if it's enabled and you need extreme performance, you can try turning it off (provided you know what you're doing).
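On Intel platforms this is typically done with a kernel boot parameter (amd_iommu=off is the rough analogue on AMD); edit the GRUB config, regenerate it, and reboot:

GRUB_CMDLINE_LINUX="... intel_iommu=off"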
Performance and Trade-offs
What will you see when using Busy Poll?
- Latency drops significantly.
- Jitter decreases significantly.
- Transactions per second (TPS) may increase.
But if you abuse it—enabling Busy Poll for thousands of Sockets—CPU utilization will explode. Because everyone is busy-waiting, the time actually spent doing useful work decreases.
Remember, this is fundamentally a trade-off that exchanges CPU resources for time. In scenarios where low latency is paramount (like HFT or telecom core networks), this deal is worth it; on a regular web server, it might be suicidal.
In this section, we tore through the "aggressive mode" of the network stack. In the next section, we'll shift our focus back to the hardware layer and look at the PCI subsystem and Wake-on-LAN—another way to wake hardware from its sleep, only this time, it happens through an Ethernet cable.