
10.6 XFRM Lookup

In the previous section, we watched an encrypted packet endure a grueling journey through the protocol stack's many layers, finally being decrypted back into plaintext. That was the receiving path—a long and winding road.

Now, let's turn around and look at the other direction: sending.

When a local application wants to send a protected IPsec packet, it doesn't look up keys or fill in ESP headers itself. It simply hands the plaintext packet to the kernel, and xfrm_lookup() takes care of all the dirty work.

This function is the heart of the IPsec send path. We want it to beat as fast as possible—after all, every outgoing packet passes through it.

But before we can understand "how fast it is," we need to figure out "what it's looking up."

10.6.1 The Core Decision on the Send Path

The name xfrm_lookup() is honest: it really is a "lookup."

Looking up what?

It's looking for an answer: "How exactly should this packet be sent?"

This answer involves more than just "which router is the next hop" (that's the routing table's job). It also includes "does it need to be encrypted? With which algorithm? Through how many SAs?"

To save us from repeating these tedious steps for every single packet, the XFRM framework introduces a core optimization mechanism: the Bundle.

We can think of a Bundle as a "pre-packaged shipping label"—it bundles together routing information, security policies, and even the number of SAs involved along with their pointers.

But the "shipping label" metaphor is only half-accurate here: a real shipping label is single-use, but a Bundle is reusable. As long as packets belong to the same traffic flow, all subsequent packets can simply copy this label. This is the key to the performance optimization: turning table lookups into cache hits.

To store this "shipping label," the kernel defines the xfrm_dst structure:

struct xfrm_dst {
	union {
		struct dst_entry dst;
		struct rtable rt;
		struct rt6_info rt6;
	} u;
	struct dst_entry *route;                   /* the underlying route */
	struct flow_cache_object flo;              /* flow cache object, used for lookup */
	struct xfrm_policy *pols[XFRM_POLICY_MAX]; /* array of matched policies */
	int num_pols, num_xfrms;                   /* number of policies and of transform layers */
#ifdef CONFIG_XFRM_SUB_POLICY
	struct flowi *origin;                      /* original flow information */
	struct xfrm_selector *partner;             /* sub-policy selector */
#endif
	u32 xfrm_genid;                            /* XFRM generation count (for invalidation detection) */
	u32 policy_genid;                          /* policy generation count */
	u32 route_mtu_cached;                      /* cached MTU */
	u32 child_mtu_cached;
	u32 route_cookie;
	u32 path_cookie;
};

There's a detail hidden here: notice the flo member. It serves as the bridge connecting the XFRM mechanism to the kernel's generic flow cache. We'll see exactly how it works very soon.

10.6.2 The Send Path Entry Point: xfrm_lookup()

The signature of the xfrm_lookup() function looks simple, but there's a lot going on beneath the surface:

struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
                              const struct flowi *fl, struct sock *sk, int flags);

It only handles the Tx (transmit) path. So the first step is to lock in the direction:

u8 dir = policy_to_flow_dir(XFRM_POLICY_OUT);

The logic that follows is a classic layered-lookup pattern: try the cheap checks first, and fall back to the expensive full lookup only when they miss.

Step 1: The Socket Policy Express Lane

The kernel first asks: does this packet come from a privileged user (specifically, a socket with a bound policy)?

If it's locally generated traffic (sk is not NULL) and this socket has an outgoing policy bound to it (sk_policy[OUT]), the kernel takes the "VIP lane" and calls xfrm_sk_policy_lookup().

if (sk && sk->sk_policy[XFRM_POLICY_OUT]) {
	num_pols = 1;
	pols[0] = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl);
	...
}

This step bypasses the global policy lookup, making it extremely efficient. However, most ordinary traffic (like when you haven't specifically configured a socket policy using setsockopt) won't reach this point.

Step 2: Flow Cache Interception

If the socket has no bound policy, the kernel turns to the generic flow cache—this is the most brilliant part of xfrm_lookup().

In the code, it's called like this:

flo = flow_cache_lookup(net, fl, family, dir, xfrm_bundle_lookup, dst_orig);

We can think of this flow_cache_lookup() as a librarian. You hand it a call slip (fl, the flow information) and tell it which book you're looking for.

  • If this is the first time you're borrowing this book, the librarian won't find it in the catalog. So it kicks off a "Resolver" (our callback function xfrm_bundle_lookup) to go into the stacks, find the book (or create one), and record its location.
  • If this is the second time (subsequent data packets), the librarian will notice: "Hey, I just looked this up," and hand you the result directly from the cache.

If it's a cache hit, the flo object we retrieve is actually embedded inside the xfrm_dst structure. Using the container_of macro, we can extract the entire Bundle in one go:

xdst = container_of(flo, struct xfrm_dst, flo);

Once we have the xdst, all the information we need for subsequent steps—routing, policies, SA count—is at our fingertips:

num_pols = xdst->num_pols;
num_xfrms = xdst->num_xfrms;
memcpy(pols, xdst->pols, sizeof(struct xfrm_policy*) * num_pols);
route = xdst->route;

Finally, we point the kernel's generic dst_entry pointer to it, and the lookup is complete:

dst = &xdst->u.dst;

This mechanism guarantees that only the first packet of each flow goes through the full policy matching and lookup process; all subsequent packets hit the fast cache.

10.6.3 The Real Pitfall: When the SA Isn't Ready Yet (Larval State)

But the story doesn't end here. If you're running IPsec, you'll eventually run into a maddening situation: the policy exists, but the SA hasn't been negotiated yet.

At this point, xfrm_bundle_lookup() will find an awkward result: the policy is present, but the corresponding xfrm_state (i.e., the SA) is missing or in a "larval" state.

It's like having your shipping label filled out, but the warehouse has no stock yet. The kernel returns a special Bundle—a Dummy Bundle.

The hallmark of this Bundle is that its route member is NULL.

This brings us to the trickiest if branch in the code:

if (route == NULL && num_xfrms > 0) {
	/* ... xfrm_bundle_lookup() returns a bundle whose route
	 * is NULL only when the templates could not be resolved.
	 * That means the policies are there, but the bundle cannot
	 * be created, because we don't have the xfrm_state yet.
	 * We either wait for the KM (key management) to negotiate
	 * new SAs, or fail with an error. */
	if (net->xfrm.sysctl_larval_drop) {
		...
		return make_blackhole(net, family, dst_orig);
	}
	...
}

Here, a kernel parameter holds the power of life and death: sysctl_larval_drop (corresponding to /proc/sys/net/core/xfrm_larval_drop).
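On a live system this knob can be read or flipped like any other sysctl (writing requires root):

```shell
# Read the current behavior: 1 = drop while larval, 0 = queue and wait
cat /proc/sys/net/core/xfrm_larval_drop

# The same knob via sysctl(8); uncomment the write to switch modes
sysctl net.core.xfrm_larval_drop
# sysctl -w net.core.xfrm_larval_drop=0
```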

Scenario A: sysctl_larval_drop = 1 (Default, Ruthless Mode)

This is the default behavior. If the SA isn't ready, the kernel won't wait around—it drops the packet immediately.

The code calls make_blackhole(). For IPv4, this calls ipv4_blackhole_route().

The name is quite apt—your packet is thrown into a blackhole route. It's like sending data to /dev/null; it vanishes without a trace, and no ICMP error is sent back.

This is usually the behavior we expect: before a VPN tunnel is fully established, leaking plaintext traffic or sending invalid encrypted packets is unsafe.

Scenario B: sysctl_larval_drop = 0 (Patient Mode, Queue and Wait)

If we set this parameter to 0, the kernel becomes much more patient.

It calls xdst_queue_output() and shoves these unprocessable packets into a queue—polq.hold_queue.

This queue can hold up to 100 packets (defined by XFRM_MAX_QUEUE_LEN). The kernel holds these packets and quietly waits for the IKE daemon (like Charon or Pluto) to finish negotiating the SA.

If we're lucky and the SA negotiation succeeds, the kernel flushes all the backlogged packets out at once.

If it takes too long (the timeout defined by xfrm_policy_queue expires), the kernel gives up waiting and calls xfrm_queue_purge() to purge all packets from the queue.

This mode is useful in scenarios with network jitter or frequent reconnections, but in high-throughput environments, it can introduce new problems due to memory consumption.


At this point, the mission of xfrm_lookup() is complete. It successfully transforms a raw dst_entry into a final routing result—which might be encrypted, might be cached, or might have been thrown into a blackhole.

But our IPsec journey isn't over yet. In the real world, networks are often "dirtier" than we imagine—for example, a NAT device might be sitting right in the middle.

In the next section, we'll discuss IPsec's most famous "patch": NAT Traversal (NAT-T), and see how the kernel wraps an extra UDP skin around encrypted packets to fool firewalls that don't understand ESP.