10.8 Quick Reference
The content in this chapter is admittedly brain-bending—the XFRM framework's state machine, the entanglement of policies and states, and NAT-T's "compromises born of necessity."
When you eventually work with this logic in the kernel code, you'll find that having an API map on hand makes things much more manageable. This section is that map.
Rather than listing every API exhaustively (you can find those in the kernel headers), I've organized the key functions, core operations, and critical counters scattered across the code paths into a convenient reference checklist. If you're staring blankly at /proc/net/xfrm_stat while debugging an IPsec connection, or wondering where a specific error code originated while reading xfrm_input(), this section is for you.
Core Methods Cheat Sheet
Most of the time, we deal with four categories of operations: "matching," "lookup," "creation," and "destruction." Below are the core functions you'll repeatedly encounter when reading source code or debugging.
1. Policy Matching and Lookup
All decisions start with traffic matching. The kernel needs to know whether the current packet should be taken over by IPsec.
xfrm_selector_match()
bool xfrm_selector_match(const struct xfrm_selector *sel,
const struct flowi *fl,
unsigned short family);
This is the most basic matching check. It asks: does this specific packet flow (flowi) hit this selector (selector)?
Depending on the protocol family, it calls the underlying __xfrm4_selector_match() (IPv4) or __xfrm6_selector_match() (IPv6). Returning true means "it's a match."
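The matching idea can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel code: struct toy_selector, struct toy_flow, and toy_selector_match() are hypothetical names, only IPv4 is modeled, addresses are host-order integers, and a destination port or protocol of 0 in the selector is treated as a wildcard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for xfrm_selector and flowi (IPv4 only). */
struct toy_selector {
    uint32_t saddr, daddr;      /* host byte order */
    uint8_t  prefixlen_s;       /* source prefix length, 0..32 */
    uint8_t  prefixlen_d;       /* destination prefix length, 0..32 */
    uint16_t dport;             /* 0 = any port (assumption of this sketch) */
    uint8_t  proto;             /* 0 = any protocol (assumption of this sketch) */
};

struct toy_flow {
    uint32_t saddr, daddr;
    uint16_t dport;
    uint8_t  proto;
};

/* True when the first `plen` bits of a and b are equal. */
static bool prefix_match(uint32_t a, uint32_t b, uint8_t plen)
{
    assert(plen <= 32);
    if (plen == 0)
        return true;            /* /0 matches everything; avoids a 32-bit shift */
    uint32_t mask = 0xFFFFFFFFu << (32 - plen);
    return ((a ^ b) & mask) == 0;
}

/* A flow hits the selector only if every field matches. */
static bool toy_selector_match(const struct toy_selector *sel,
                               const struct toy_flow *fl)
{
    return prefix_match(fl->saddr, sel->saddr, sel->prefixlen_s) &&
           prefix_match(fl->daddr, sel->daddr, sel->prefixlen_d) &&
           (sel->dport == 0 || sel->dport == fl->dport) &&
           (sel->proto == 0 || sel->proto == fl->proto);
}
```

With a selector for 10.0.0.0/8 to 192.168.1.0/24 on TCP port 443, a flow from 10.1.2.3 to 192.168.1.5:443 matches, while the same flow sent to 192.168.2.5 does not.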
xfrm_policy_match()
int xfrm_policy_match(const struct xfrm_policy *pol,
const struct flowi *fl,
u8 type,
u16 family,
int dir);
This step goes further than selector matching. It doesn't just check if the selector matched; it also checks whether this policy (policy) can actually be applied to the current flow.
If it returns 0, the policy allows application; otherwise, it returns a negative errno. Note the dir parameter here—it specifies the traffic direction (inbound/outbound/forward), which is critical because the same IP flow might have completely different policies for ingress and egress.
2. Policy Lifecycle Management
Once a policy is configured, it resides in the kernel until manually deleted. Reference counting here is of utmost importance.
xfrm_policy_alloc()
struct xfrm_policy *xfrm_policy_alloc(struct net *net, gfp_t gfp);
Allocates and initializes an XFRM policy. Besides allocating memory, it performs a series of initializations:
- Sets the reference count to 1.
- Initializes the read-write lock.
- Associates it with the specified network namespace (xp_net).
- Sets the timer callback to xfrm_policy_timer().
- Sets the policy queue timer (policy->polq.hold_timer) callback to xfrm_policy_queue_process().
xfrm_pol_hold() / xfrm_pol_put()
void xfrm_pol_hold(struct xfrm_policy *policy);
static inline void xfrm_pol_put(struct xfrm_policy *policy);
Standard reference counting operations.
- hold: Increments the reference count by 1. Means "I'm using it, don't delete it."
- put: Decrements the reference count by 1. When the count drops to 0, it calls xfrm_policy_destroy() to thoroughly destroy the object. This is a deferred destruction mechanism that ensures code paths currently using the policy won't suddenly access a dangling pointer.
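The hold/put pattern can be illustrated with a small userspace model. This is only a sketch: struct toy_policy and the toy_* functions are hypothetical names, C11 atomics stand in for the kernel's reference counter, and the destroyed_flag field exists purely so the destruction can be observed from outside.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of a reference-counted policy object. */
struct toy_policy {
    atomic_int refcnt;
    bool *destroyed_flag;   /* observation hook: set to true on destruction */
};

static struct toy_policy *toy_policy_alloc(bool *destroyed_flag)
{
    struct toy_policy *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    atomic_init(&p->refcnt, 1);     /* the creator holds the first reference */
    p->destroyed_flag = destroyed_flag;
    return p;
}

static void toy_policy_destroy(struct toy_policy *p)
{
    if (p->destroyed_flag)
        *p->destroyed_flag = true;
    free(p);
}

static void toy_pol_hold(struct toy_policy *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

static void toy_pol_put(struct toy_policy *p)
{
    int old = atomic_fetch_sub(&p->refcnt, 1);
    assert(old > 0);                /* a put without a matching hold is a bug */
    if (old == 1)                   /* we just dropped the last reference */
        toy_policy_destroy(p);
}
```

The object survives as long as any holder remains: after one alloc and one hold, the first put leaves it alive and only the second put destroys it.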
xfrm_policy_destroy()
void xfrm_policy_destroy(struct xfrm_policy *policy);
This is the true "endpoint." It removes the timers associated with the policy and frees the memory occupied by the policy. It's typically only called by xfrm_pol_put() when the reference count reaches zero.
3. State and Database Operations
Having a policy just means having "rules"—we still need a specific "plan" (i.e., SA, State). State management directly determines whether packets can be correctly encrypted or decrypted.
xfrm_state_add() / xfrm_state_delete()
int xfrm_state_add(struct xfrm_state *x);
int xfrm_state_delete(struct xfrm_state *x);
- add: Adds a negotiated SA (described by xfrm_state) to the SAD (Security Association Database). This usually happens after IKE negotiation is complete, delivered by a userspace daemon via Netlink.
- delete: Removes the specified SA from the SAD.
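To make the add/delete semantics concrete, here is a toy SAD keyed by the (SPI, destination address, protocol) triple. It is a sketch under assumptions: the fixed-size array, the toy_* names, and the specific errno values returned are choices of this example, not a description of the kernel's hash tables.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SAD_SIZE 8

/* Hypothetical SA record identified by (spi, daddr, proto). */
struct toy_sa {
    uint32_t spi;
    uint32_t daddr;
    uint8_t  proto;     /* e.g. 50 for ESP */
    int      in_use;
};

struct toy_sad {
    struct toy_sa slot[TOY_SAD_SIZE];
};

static struct toy_sa *toy_sad_lookup(struct toy_sad *sad, uint32_t spi,
                                     uint32_t daddr, uint8_t proto)
{
    for (size_t i = 0; i < TOY_SAD_SIZE; i++) {
        struct toy_sa *sa = &sad->slot[i];
        if (sa->in_use && sa->spi == spi &&
            sa->daddr == daddr && sa->proto == proto)
            return sa;
    }
    return NULL;
}

/* Adds an SA; duplicates are rejected (error codes are this sketch's choice). */
static int toy_sad_add(struct toy_sad *sad, uint32_t spi,
                       uint32_t daddr, uint8_t proto)
{
    assert(sad != NULL);
    if (toy_sad_lookup(sad, spi, daddr, proto))
        return -EEXIST;
    for (size_t i = 0; i < TOY_SAD_SIZE; i++) {
        if (!sad->slot[i].in_use) {
            sad->slot[i] = (struct toy_sa){ spi, daddr, proto, 1 };
            return 0;
        }
    }
    return -ENOMEM;     /* table full */
}

static int toy_sad_delete(struct toy_sad *sad, uint32_t spi,
                          uint32_t daddr, uint8_t proto)
{
    struct toy_sa *sa = toy_sad_lookup(sad, spi, daddr, proto);
    if (!sa)
        return -ESRCH;
    sa->in_use = 0;
    return 0;
}
```

A linear scan is used only for brevity; the point is that the triple is the lookup key for add, lookup, and delete alike.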
xfrm_state_alloc()
struct xfrm_state *xfrm_state_alloc(struct net *net);
Allocates a new XFRM state object. This is the first step of an SA's life in the kernel.
__xfrm_state_destroy()
void __xfrm_state_destroy(struct xfrm_state *x);
Note the double underscore prefix __ in the function name—this usually means it's an internal implementation.
It doesn't free memory directly; instead, it adds the state object to XFRM's garbage collection list (GC list) and activates the garbage collector. This is a common kernel optimization technique to avoid performance jitter in frequent allocation and deallocation scenarios.
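The "queue now, free later" idea can be sketched as follows. This is a single-threaded userspace illustration with hypothetical names (toy_gc_state, toy_gc_list, toy_gc_run); the kernel protects its list with locking and runs the collector from a work item, neither of which is modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical state object that can be linked onto a GC list. */
struct toy_gc_state {
    struct toy_gc_state *gc_next;
};

static struct toy_gc_state *toy_gc_list;   /* objects awaiting destruction */

/* Similar in spirit to __xfrm_state_destroy(): queue only, do not free. */
static void toy_state_destroy_deferred(struct toy_gc_state *x)
{
    assert(x != NULL);
    x->gc_next = toy_gc_list;
    toy_gc_list = x;
}

/* Stand-in for the GC worker: frees everything queued, returns the count. */
static int toy_gc_run(void)
{
    int freed = 0;
    struct toy_gc_state *x = toy_gc_list;

    toy_gc_list = NULL;             /* detach the whole list first */
    while (x) {
        struct toy_gc_state *next = x->gc_next;
        free(x);
        freed++;
        x = next;
    }
    return freed;
}
```

The caller of the deferred destroy returns immediately; the actual free happens whenever the collector runs.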
xfrm_state_walk()
int xfrm_state_walk(struct net *net,
struct xfrm_state_walk *walk,
int (*func)(struct xfrm_state *, int, void*),
void *data);
This is an iterator. It traverses all XFRM states (net->xfrm.state_all) within a namespace and calls your provided func callback for each state. If you're writing a kernel module or debugging script that needs to dump all SAs, you'll use this.
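The callback-driven traversal looks roughly like this in miniature. The sketch uses a plain array and hypothetical names (toy_sa_entry, toy_state_walk, toy_count_cb); stopping the walk when the callback returns non-zero is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SA entry; only the SPI is modeled. */
struct toy_sa_entry { uint32_t spi; };

typedef int (*toy_walk_fn)(const struct toy_sa_entry *x, int index, void *data);

/* Calls func for each entry; a non-zero return from func stops the walk
 * and is passed back to the caller. */
static int toy_state_walk(const struct toy_sa_entry *states, size_t n,
                          toy_walk_fn func, void *data)
{
    assert(func != NULL);
    for (size_t i = 0; i < n; i++) {
        int err = func(&states[i], (int)i, data);
        if (err)
            return err;
    }
    return 0;
}

/* Example callback: count the entries visited via the opaque data pointer. */
static int toy_count_cb(const struct toy_sa_entry *x, int index, void *data)
{
    (void)x;
    (void)index;
    (*(int *)data)++;
    return 0;
}
```

The void *data argument is how the caller threads its own context (here a counter) through the iterator without the iterator knowing its type.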
4. Packet Processing Paths
This is where data flows, and where all the preceding "configuration" finally takes effect.
xfrm_input()
int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
The main entry point of the IPsec receive path. When an ESP or AH packet arrives, the protocol handler calls this. It's responsible for looking up the SAD, decrypting, verifying integrity, checking for replay attacks, and finally handing the restored clean IP packet back to the network stack. If an error occurs here, your SSH connection will drop.
esp_input()
int esp_input(struct xfrm_state *x, struct sk_buff *skb);
The specific handler for the IPv4 ESP protocol. After xfrm_input() determines the protocol type is ESP, it hands the work over to this function.
xfrm_lookup() / xfrm_bundle_create()
We spent a lot of time on this in the transmit path section.
- xfrm_lookup(): The core of the transmit path. It looks up the corresponding SA based on the policy and decides which path this packet should take.
- xfrm_bundle_create(): Once the policy and state are both ready, this function creates an xfrm_dst (Bundle), binding the routing and IPsec processing logic together to accelerate the forwarding of subsequent packets.
5. Exception Handling and Black Holes
Not all lookups succeed, and not all SAs can be negotiated in time.
make_blackhole()
static struct dst_entry *make_blackhole(struct net *net,
u16 family,
struct dst_entry *dst_orig);
Remember sysctl_larval_drop?
When a state cannot be found and this parameter is enabled, the kernel doesn't let the packet wait idly; instead, it creates a "blackhole route." Any packet sent to a blackhole route is silently dropped.
This function creates that blackhole. For IPv4, it calls ipv4_blackhole_route().
xdst_queue_output()
int xdst_queue_output(struct sk_buff *skb);
If the kernel chooses to "wait" instead of drop (i.e., sysctl_larval_drop=0), the packet is placed into the policy's wait queue (polq.hold_queue).
This function is responsible for stuffing the packet in. The queue has a length limit (default 100); once full, packets must be dropped.
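The two outcomes described in this subsection (blackhole versus bounded hold queue) can be modeled in a few lines. This is a sketch, not kernel code: the toy_* names and the verdict enum are hypothetical, and the queue is reduced to a length counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_QUEUE_LEN 100   /* mirrors the default limit named above */

enum toy_verdict { TOY_DROP_BLACKHOLE, TOY_QUEUED, TOY_DROP_QUEUE_FULL };

/* Hypothetical hold queue; only the number of held packets is tracked. */
struct toy_hold_queue {
    size_t len;
};

/* Decide what happens to a packet whose SA has not been negotiated yet. */
static enum toy_verdict toy_handle_larval(struct toy_hold_queue *q,
                                          bool larval_drop)
{
    assert(q != NULL);
    if (larval_drop)
        return TOY_DROP_BLACKHOLE;      /* blackhole route: silent drop */
    if (q->len >= TOY_MAX_QUEUE_LEN)
        return TOY_DROP_QUEUE_FULL;     /* queue limit reached */
    q->len++;                           /* hold until the SA appears */
    return TOY_QUEUED;
}
```

With larval_drop disabled, the first 100 packets are held and the 101st is dropped because the queue is full.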
Error Counter Mapping Table: A Debugging Compass
When you're troubleshooting why a VPN won't start or why performance suddenly drops, the XFRM counters in /proc/net/xfrm_stat (available when the kernel is built with CONFIG_XFRM_STATISTICS) are your best clues. But what exactly do names like XfrmInError and XfrmInNoStates correspond to in the kernel logic?
The table below directly connects the kernel symbols (Linux Symbol), SNMP error names, and the methods that trigger them.
Table 10-1: XFRM SNMP MIB Counter Mapping
| Kernel Symbol (Linux Symbol) | SNMP Name | Likely Triggering Call Path | Meaning |
|---|---|---|---|
| LINUX_MIB_XFRMINERROR | XfrmInError | xfrm_input() | General error on the receive path |
| LINUX_MIB_XFRMINBUFFERERROR | XfrmInBufferError | xfrm_input(), __xfrm_policy_check() | No buffer space left |
| LINUX_MIB_XFRMINHDRERROR | XfrmInHdrError | xfrm_input(), __xfrm_policy_check() | Header parsing failure |
| LINUX_MIB_XFRMINNOSTATES | XfrmInNoStates | xfrm_input() | Packet received but no matching SA found |
| LINUX_MIB_XFRMINSTATEPROTOERROR | XfrmInStateProtoError | xfrm_input() | Protocol error (e.g., malformed ESP format) |
| LINUX_MIB_XFRMINSTATEMODEERROR | XfrmInStateModeError | xfrm_input() | Mode mismatch (e.g., tunnel mode packet entering a transport mode SA) |
| LINUX_MIB_XFRMINSTATESEQERROR | XfrmInStateSeqError | xfrm_input() | Sequence number error (packet rejected by the anti-replay check, e.g., outside the replay window) |
| LINUX_MIB_XFRMINSTATEEXPIRED | XfrmInStateExpired | xfrm_input() | SA has already expired |
| LINUX_MIB_XFRMINSTATEMISMATCH | XfrmInStateMismatch | xfrm_input(), __xfrm_policy_check() | SA and policy mismatch |
| LINUX_MIB_XFRMINSTATEINVALID | XfrmInStateInvalid | xfrm_input() | SA is invalid |
| LINUX_MIB_XFRMINTMPLMISMATCH | XfrmInTmplMismatch | __xfrm_policy_check() | Policy template mismatch |
| LINUX_MIB_XFRMINNOPOLS | XfrmInNoPols | __xfrm_policy_check() | No matching policy found |
| LINUX_MIB_XFRMINPOLBLOCK | XfrmInPolBlock | __xfrm_policy_check() | Policy explicitly blocks this traffic |
| LINUX_MIB_XFRMINPOLERROR | XfrmInPolError | __xfrm_policy_check() | Policy processing error |
| LINUX_MIB_XFRMOUTERROR | XfrmOutError | xfrm_output_one(), xfrm_output() | General error on the transmit path |
| LINUX_MIB_XFRMOUTBUNDLEGENERROR | XfrmOutBundleGenError | xfrm_resolve_and_create_bundle() | Bundle generation failed |
| LINUX_MIB_XFRMOUTBUNDLECHECKERROR | XfrmOutBundleCheckError | xfrm_resolve_and_create_bundle() | Bundle check failed |
| LINUX_MIB_XFRMOUTNOSTATES | XfrmOutNoStates | xfrm_lookup() | No SA found during transmission |
| LINUX_MIB_XFRMOUTSTATEPROTOERROR | XfrmOutStateProtoError | xfrm_output_one() | Protocol processing failed |
| LINUX_MIB_XFRMOUTSTATEMODEERROR | XfrmOutStateModeError | xfrm_output_one() | Mode error |
| LINUX_MIB_XFRMOUTSTATESEQERROR | XfrmOutStateSeqError | xfrm_output_one() | Sequence number error (e.g., sequence number overflow) |
| LINUX_MIB_XFRMOUTSTATEEXPIRED | XfrmOutStateExpired | xfrm_output_one() | SA found to be expired during transmission |
| LINUX_MIB_XFRMOUTPOLBLOCK | XfrmOutPolBlock | xfrm_lookup() | Policy prohibits transmission |
| LINUX_MIB_XFRMOUTPOLDEAD | XfrmOutPolDead | N/A | Policy is dead (already marked for deletion) |
| LINUX_MIB_XFRMOUTPOLERROR | XfrmOutPolError | xfrm_bundle_lookup(), xfrm_resolve_and_create_bundle() | Transmit policy error |
| LINUX_MIB_XFRMFWDHDRERROR | XfrmFwdHdrError | __xfrm_route_forward() | Forwarding header error |
| LINUX_MIB_XFRMOUTSTATEINVALID | XfrmOutStateInvalid | xfrm_output_one() | SA is invalid |
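
When watching these counters from a small tool, a lookup helper along these lines may be handy. It is a sketch that assumes the counter text is a sequence of whitespace-separated "Name value" pairs, one per line; toy_find_counter() is a hypothetical name, and the caller is expected to have read the counter file into a NUL-terminated buffer first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scans "Name value" pairs in `text` and returns the value for `name`,
 * or -1 if no such counter is present. */
static long toy_find_counter(const char *text, const char *name)
{
    char key[64];
    unsigned long val;
    int used;

    assert(text != NULL && name != NULL);
    /* %n records how many characters this pair consumed so the next
     * iteration can continue right after it. */
    while (sscanf(text, "%63s %lu%n", key, &val, &used) == 2) {
        if (strcmp(key, name) == 0)
            return (long)val;
        text += used;
    }
    return -1;
}
```

Sampling a counter twice and comparing the values is usually more informative than a single reading, since the counters only ever grow.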
Additional Resources: Tracking Source Code
The Linux kernel's networking subsystem iterates rapidly, and IPsec, as a core security component, is no exception. If you want to track the latest fixes, patches, or new features not yet merged into the mainline, these two Git trees are what you need to watch:
- IPsec git tree: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git
  - Purpose: Fix patches for the IPsec networking subsystem.
  - Basis: Developed based on David Miller's net tree.
- ipsec-next git tree: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git
  - Purpose: New feature modifications for IPsec, targeting inclusion in linux-next.
  - Basis: Developed based on David Miller's net-next tree.
- Maintainers:
  - Steffen Klassert
  - Herbert Xu
  - David S. Miller
If you want to deeply understand why a particular XfrmIn...Error is incrementing, or see what weird NAT-T bug was fixed recently, digging through the commit logs in these Git trees is often much more useful than reading static source code.
Chapter Echoes
This brings us to the end of the IPsec chapter.
Recall our earlier question about "why not just use HTTPS"—the answer should be very clear by now: IPsec operates at the network layer (L3), and it protects the IP packet itself. This means the application layer is completely unaware; your TCP, UDP, and even ICMP protocols can all be transparently encrypted. This is something TLS/SSL cannot do.
By understanding the XFRM framework, you've essentially grasped the essence of the "separation of policy and execution" design within the Linux network stack.
- Policy is the boss (xfrm_policy); it only sets the rules and doesn't do the actual work.
- State is the employee (xfrm_state); it takes the specific keys and algorithms to execute the actual encryption and decryption tasks.
- Bundle is the contract between the boss and the employee, ensuring traffic flows smoothly between routing and encryption.
On the transmit and receive paths, we saw more than just a series of function calls: we witnessed the kernel's careful balance between performance (lookup caching) and security (replay windows, sequence numbers). Even when facing legacy issues like NAT that break the layering model, the kernel demonstrated remarkable adaptability, surviving in a modern internet full of NAT routers by adding a layer of UDP encapsulation (NAT-T).
In the next chapter, we'll enter another equally important and highly discussed domain: Network Filtering and Firewalls (Netfilter). If IPsec puts a bulletproof vest on data packets, then Netfilter is the security checkpoint at the entrance and exit of the building. We'll see how the kernel intercepts traffic at every critical node, and how it lets userspace tools like iptables and nftables call the shots.
Bring the foundational knowledge of the kernel networking subsystem you've learned in this chapter, and the next one will be much easier. See you then.
Exercises
Exercise 1: Understanding
Question: In the Linux kernel's IPsec implementation, suppose a user has configured a Security Policy (SPD entry), but because key negotiation has not yet completed, there is no corresponding Security Association (SA) in the kernel. When the kernel receives a packet matching this policy, how does it handle it? What is the relationship with the sysctl_larval_drop parameter?
Answer and Analysis
Answer: The kernel creates a temporary Acquire State (SPI is 0) and decides the packet's fate based on the value of sysctl_larval_drop: if it is 1 (the default), the packet is dropped (blackhole); if it is 0, the packet is added to the policy queue to wait for SA negotiation to complete (up to 100 packets).
Analysis: When traffic matches a policy but the SA has not yet been established, the kernel uses an Acquire State to record this condition and trigger the IKE daemon to negotiate. The sysctl_larval_drop switch controls what happens to packets in the meantime. If set to 1, the packet is dropped via a blackhole route (make_blackhole()); if set to 0, the kernel buffers packets in the polq queue of the xfrm_policy structure until SA establishment completes (the larval state becomes a mature state). The underlying issue is keeping XFRM policies and states synchronized while negotiation is in flight.
Exercise 2: Application
Question: In a company's network architecture, mobile employees connect to the corporate intranet via IPsec VPN. The employees' public IP addresses change frequently, and they are located behind different NAT devices. To ensure connection stability, IKEv2 negotiation uses NAT-T (NAT Traversal) technology. In the Linux kernel's IPsec receive path, how are UDP-encapsulated ESP packets processed? Please describe the decapsulation flow from the NIC driver to the XFRM framework.
Answer and Analysis
Answer: After a UDP packet arrives, the kernel protocol stack first passes it to the UDP layer for processing. Because UDP encapsulation is used (port 4500), the UDP socket layer hands it over to IPsec's UDP encapsulation receive function xfrm4_udp_encap_rcv(). This function strips the UDP header, extracts the inner ESP packet, and reinjects it into the IP protocol stack as a standard ESP packet, ultimately calling xfrm4_rcv() to enter the general xfrm_input() flow for decryption and verification.
Analysis: This is a typical NAT-T application scenario. NAT devices can only correctly translate the IP addresses and ports in UDP/TCP headers; they cannot handle pure ESP protocol (protocol number 50). Therefore, IPsec encapsulates ESP packets within UDP (port 4500). In the kernel receive path, the xfrm4_udp_encap_rcv() function plays the role of a "decapsulator"—it identifies this as a NAT-T packet, removes the UDP header, restores the original ESP packet, and hands it to the XFRM framework for processing. This allows IPsec traffic to seamlessly traverse NAT devices.
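The "is this really ESP?" decision on the NAT-T port can be sketched as a small classifier. The rules follow RFC 3948: a one-byte 0xFF payload is a NAT keepalive, and IKE messages on this port start with a four-byte all-zero non-ESP marker, so a non-zero first word is read as an ESP SPI. The toy_* names are hypothetical, and this illustrates only the classification step, not the kernel function itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_natt_kind { TOY_NATT_KEEPALIVE, TOY_NATT_ESP, TOY_NATT_IKE };

#define TOY_ESP_HDR_LEN 8   /* SPI (4 bytes) + sequence number (4 bytes) */

/* Classify the UDP payload of a packet received on the NAT-T port. */
static enum toy_natt_kind toy_natt_classify(const uint8_t *payload, size_t len)
{
    uint32_t first_word;

    assert(payload != NULL);
    if (len == 1 && payload[0] == 0xFF)
        return TOY_NATT_KEEPALIVE;          /* NAT keepalive: nothing to decrypt */

    if (len > TOY_ESP_HDR_LEN) {
        /* memcpy avoids an unaligned 32-bit read; only zero/non-zero
         * matters, so byte order is irrelevant here. */
        memcpy(&first_word, payload, sizeof(first_word));
        if (first_word != 0)
            return TOY_NATT_ESP;            /* non-zero SPI: ESP-in-UDP */
    }
    return TOY_NATT_IKE;                    /* zero marker, or too short for ESP */
}
```

Only payloads classified as ESP would have their UDP header stripped and continue into the XFRM receive path; IKE messages stay with the userspace daemon's socket.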
Exercise 3: Thinking
Question: While analyzing network performance, you find that a Linux VPN gateway configured with AES-GCM encryption (supporting Intel AES-NI hardware acceleration) is not achieving the theoretical line-rate throughput of the hardware. Through perf analysis, you discover that CPU time is mainly consumed by spinlock spinning waits, especially under multi-core concurrent processing of numerous short connections. Based on the XFRM framework's data structures, analyze the possible source of the bottleneck and propose an optimization approach.
Answer and Analysis
Answer: The bottleneck most likely lies in the lock contention during lookups in the XFRM Policy Database (SPD) and State Database (SAD). When multiple CPU cores concurrently handle a large number of flows, different cores attempt to simultaneously look up or update hash tables, causing frequent contention on the spinlocks of xfrm_policy or xfrm_state.
Optimization approach: Use the kernel's asynchronous cryptographic interfaces (such as the pcrypt template) to parallelize encryption operations across CPU cores, or move to a kernel version in which the XFRM policy read path uses the Read-Copy-Update (RCU) mechanism, so that readers no longer contend on a shared lock.
Analysis: The XFRM framework relies heavily on hash tables to store policies (xfrm_policy) and states (xfrm_state). Although lookups are hash-based, lock contention can become very severe in high-concurrency, short-connection scenarios. Furthermore, although AES-NI accelerates computation, the packet path traversal (policy lookup, route lookup, Bundle construction) involves accessing a lot of shared data.
The key point in this question is "numerous short connections," which leads to frequent SA lookups and policy matching. If the kernel uses coarse-grained locks or hot locks, performance will degrade. Solutions can be approached from two aspects: first, use concurrent programming techniques (like RCU) to optimize data structure reads; second, use software techniques (like pcrypt) to distribute encryption requests to different CPU cores, or leverage the asynchronous nature of hardware acceleration to avoid wasting CPU compute power on encryption/decryption waits.
Key Takeaways
IPsec aims to establish trusted channels over untrusted networks. Its core mechanism is not to purify the network, but to encrypt and encapsulate data through the ESP (Encapsulating Security Payload) protocol. The ESP protocol supports transport mode (encrypting only the payload) and tunnel mode (encrypting the entire original IP packet); the latter is commonly used in VPNs to hide private address communication. Paired with AEAD algorithms like AES-GCM, it can leverage CPU hardware instruction sets to simultaneously perform encryption and authentication, achieving high-performance data protection.
Key negotiation and parameter configuration are handled in userspace by IKE (Internet Key Exchange) daemons (such as strongSwan), serving as the bridge between user configuration and kernel execution. Compared to the older v1 version, IKEv2 significantly simplifies the handshake process (from 9 messages down to 4) and natively supports NAT traversal and EAP authentication. The negotiated keys are ultimately written into the kernel via the Netlink interface.
The XFRM framework in the kernel is the actual executor of IPsec. It works by maintaining two core databases: the SPD (Security Policy Database) determines which packets need processing based on traffic characteristics (addresses, ports, etc.); the SAD (Security Association Database) stores the specific encryption contexts (SAs), uniquely identified by SPI, destination address, and protocol number. The design of separating policy and state allows for flexible definition of complex encryption rules.
On the transmit path, the xfrm_lookup() function optimizes performance using a flow caching mechanism (Bundle), packaging and reusing policy lookups, route selection, and SA associations to avoid repeated table lookups for every packet. The receive path is driven by xfrm_input(); the system locates the SA based on the SPI, decrypts and strips the ESP header and trailer, modifies the IP protocol header, and reinjects the packet into the protocol stack, making it transparent to upper-layer applications.
IPsec has a natural conflict with NAT devices because NAT cannot modify the transport-layer checksums encrypted by ESP, which would cause connection interruptions. The solution is NAT-T (NAT Traversal), which encapsulates ESP packets within UDP datagrams, allowing NAT devices to handle IPsec data packets just like normal UDP traffic, thereby bypassing the limitation and ensuring connectivity in NAT environments.