10.3 The XFRM Framework

In the previous section, we discussed the cryptographic algorithms used by IPsec and saw how the kernel squeezes every ounce of CPU performance out of the Crypto API and pcrypt. Algorithms are the muscle, but muscle alone can't get the job done—you also need a skeleton and nervous system to organize these algorithms, telling the kernel when to encrypt, which key to use, and where to send the packet.

That skeleton is the XFRM framework.

The name "XFRM" looks cryptic, but it is simply shorthand for "transform," and that is how it's pronounced. Don't let the abbreviation intimidate you. It's a mechanism in the kernel dedicated to transforming incoming and outgoing packets according to your predefined rules: encrypting, decrypting, encapsulating, and decapsulating them. It originally came from the USAGI project, a pioneering effort from over 20 years ago that aimed to integrate IPv6 and IPsec into the Linux kernel.

A Protocol-Agnostic "Transformer"

The XFRM framework has a core design philosophy: decouple from protocol families.

What does this mean? It means that whether it's IPv4 or IPv6, whether it's encryption or authentication, most of the generic logic (like policy management, state maintenance, and garbage collection) is reused. This shared code lives in the net/xfrm directory. The specific protocol details, like the IPv4 ESP implementation, reside in net/ipv4/esp4.c, while IPv6's lives in net/ipv6/esp6.c, each paired with its own policy module (such as xfrm4_policy.c).

This layered design is clever, but it also means jumping around in the source code can be tiring—the generic logic and protocol-specific implementations are separated.

A Small World Inside Namespaces

Let's pause for a moment and think about network namespaces.

If you're running IPsec inside a container, you definitely don't want Container A's VPN policies leaking into Container B. The XFRM framework is namespace-aware: every network namespace (struct net) has its own independent xfrm member of type struct netns_xfrm.

You can think of this structure as the XFRM framework's "control room" within that namespace. Open up include/net/netns/xfrm.h, and you'll see:

struct netns_xfrm {
    struct hlist_head *state_bydst;
    struct hlist_head *state_bysrc;
    struct hlist_head *state_byspi;
    . . .
    unsigned int state_num;
    . . .

    struct work_struct state_gc_work;

    . . .

    u32 sysctl_aevent_etime;
    u32 sysctl_aevent_rseqth;
    int sysctl_larval_drop;
    u32 sysctl_acq_expires;
};

See all those hash table head pointers? state_bydst, state_bysrc, and state_byspi are the indexes used to find the "Security Associations" (SAs) we'll be looking for later. Then there's state_gc_work, the garbage-collection work item responsible for cleaning up expired states. As for sysctl_larval_drop, that's a crucial parameter we'll encounter later when we discuss "dropping vs. waiting."

How does the XFRM framework start up?

When IPv4 initializes its routing subsystem (ip_rt_init), it conveniently initializes XFRM along the way (calling xfrm_init() and xfrm4_init()). IPv6 does the same. There's nothing fancy here; it's just standard infrastructure "plumbing."

But the XFRM framework doesn't generate keys and policies out of thin air. It needs to listen to the user. How do userspace programs (an IKE daemon such as strongSwan, or the ip command from iproute2) talk to the kernel? Through Netlink.

The specific protocol type is NETLINK_XFRM. During boot, the kernel creates a corresponding Netlink kernel socket:

static int __net_init xfrm_user_net_init(struct net *net)
{
    struct sock *nlsk;
    struct netlink_kernel_cfg cfg = {
        .groups = XFRMNLGRP_MAX,
        .input = xfrm_netlink_rcv,
    };

    nlsk = netlink_kernel_create(net, NETLINK_XFRM, &cfg);
    . . .
    return 0;
}

Look at cfg.input = xfrm_netlink_rcv. When you type ip xfrm state add ... on the command line, this command turns into a Netlink message (such as XFRM_MSG_NEWSA), flies into the kernel, gets caught by xfrm_netlink_rcv(), and is then dispatched to xfrm_user_rcv_msg() for processing.

This is the handle by which userspace controls the kernel.
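The dispatch step can be pictured as a table lookup keyed by message type. The following is a userspace model, not the kernel's code: every `demo_` name, the message numbers, and the error codes returned are assumptions chosen for illustration. It only shows the general idea of routing a received message to a per-type handler.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical message numbers for this sketch; not the kernel's values. */
enum demo_msg_type {
    DEMO_MSG_BASE = 0x10,
    DEMO_MSG_NEWSA = DEMO_MSG_BASE,
    DEMO_MSG_DELSA,
    DEMO_MSG_NEWPOLICY,   /* deliberately left without a handler */
    DEMO_MSG_MAX
};

typedef int (*demo_handler)(const void *payload, size_t len);

int demo_sa_count;        /* stands in for the SA database */

static int demo_add_sa(const void *payload, size_t len)
{
    (void)payload;
    (void)len;
    demo_sa_count++;
    return 0;
}

static int demo_del_sa(const void *payload, size_t len)
{
    (void)payload;
    (void)len;
    if (demo_sa_count == 0)
        return -ESRCH;    /* nothing to delete */
    demo_sa_count--;
    return 0;
}

/* One slot per message type, indexed by (type - DEMO_MSG_BASE). */
static const demo_handler demo_dispatch[DEMO_MSG_MAX - DEMO_MSG_BASE] = {
    [DEMO_MSG_NEWSA - DEMO_MSG_BASE] = demo_add_sa,
    [DEMO_MSG_DELSA - DEMO_MSG_BASE] = demo_del_sa,
};

/* Validate the type, then hand the payload to the matching handler. */
int demo_rcv_msg(int type, const void *payload, size_t len)
{
    demo_handler handler;

    if (type < DEMO_MSG_BASE || type >= DEMO_MSG_MAX)
        return -EINVAL;
    handler = demo_dispatch[type - DEMO_MSG_BASE];
    if (handler == NULL)
        return -EOPNOTSUPP;
    return handler(payload, len);
}
```

Calling `demo_rcv_msg(DEMO_MSG_NEWSA, NULL, 0)` runs the add handler, while an unknown type is rejected before any handler is touched.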

Alright, now the "stage" is set and the lights (Netlink) are on. Next, we need to bring out the two leading actors of this show: Policies and States.


XFRM Policies

Let's start with a question: when the kernel receives a packet or is about to send one, how does it know whether to apply IPsec processing to that packet?

It doesn't guess. It looks up a table. That table is the SPD (Security Policy Database).

In the kernel, a single policy is represented by struct xfrm_policy. You can think of it as a "filter" or a "traffic cop"—it stops the flow of traffic (packets), checks their credentials, and decides whether to let them through, block them, or send them to the encryption shop.

How does this traffic cop identify target vehicles? Through selectors.

Selectors: Precise Matching

Every policy has an xfrm_selector structure, which serves as the policy's eyes:

struct xfrm_selector {
    xfrm_address_t daddr;
    xfrm_address_t saddr;
    __be16 dport;
    __be16 dport_mask;
    __be16 sport;
    __be16 sport_mask;
    __u16 family;
    __u8 prefixlen_d;
    __u8 prefixlen_s;
    __u8 proto;
    int ifindex;
    __kernel_uid32_t user;
};

It has everything here: source address, destination address, source port, destination port, upper-layer protocol (TCP/UDP/ICMP). It even includes the network interface index and user ID.

The kernel uses the xfrm_selector_match() function to compare a packet's flow information against this selector. Only when there's a match does the policy take effect.
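To make the matching rule concrete, here is a minimal userspace sketch of an IPv4-only selector comparison. It is a simplification under stated assumptions: the `demo_` structures are hypothetical stand-ins, all values are in host byte order, the user ID is ignored, and a zero `proto` or `ifindex` is treated as a wildcard.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, IPv4-only stand-in for the selector (host byte order). */
struct demo_selector {
    uint32_t daddr, saddr;
    uint16_t dport, dport_mask;
    uint16_t sport, sport_mask;
    uint8_t  prefixlen_d, prefixlen_s;
    uint8_t  proto;     /* 0 = any protocol (assumption of this sketch) */
    int      ifindex;   /* 0 = any interface (assumption of this sketch) */
};

/* Hypothetical description of one packet's flow. */
struct demo_flow {
    uint32_t daddr, saddr;
    uint16_t dport, sport;
    uint8_t  proto;
    int      oif;
};

/* True when the first prefixlen bits of a and b are equal. */
static bool demo_prefix_match(uint32_t a, uint32_t b, uint8_t prefixlen)
{
    uint32_t mask;

    if (prefixlen == 0)
        return true;            /* /0 matches everything */
    if (prefixlen > 32)
        return false;
    mask = ~(uint32_t)0 << (32 - prefixlen);
    return ((a ^ b) & mask) == 0;
}

bool demo_selector_match(const struct demo_selector *sel,
                         const struct demo_flow *fl)
{
    return demo_prefix_match(fl->daddr, sel->daddr, sel->prefixlen_d) &&
           demo_prefix_match(fl->saddr, sel->saddr, sel->prefixlen_s) &&
           !((fl->dport ^ sel->dport) & sel->dport_mask) &&
           !((fl->sport ^ sel->sport) & sel->sport_mask) &&
           (sel->proto == 0 || sel->proto == fl->proto) &&
           (sel->ifindex == 0 || sel->ifindex == fl->oif);
}
```

A port mask of 0 makes the port comparison always succeed, which is how this sketch expresses "any port."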

Dissecting the Policy Structure

Now let's look at struct xfrm_policy itself. This is a large structure, so we'll only pick out the fields related to "whether traffic can flow" and "whether things will blow up":

struct xfrm_policy {
    . . .
    struct hlist_node bydst;
    struct hlist_node byidx;

    /* This lock only affects elements except for entry. */
    rwlock_t lock;
    atomic_t refcnt;
    struct timer_list timer;

    struct flow_cache_object flo;
    atomic_t genid;
    u32 priority;
    u32 index;
    struct xfrm_mark mark;
    struct xfrm_selector selector;
    struct xfrm_lifetime_cfg lft;
    struct xfrm_lifetime_cur curlft;
    struct xfrm_policy_walk_entry walk;
    struct xfrm_policy_queue polq;
    u8 type;
    u8 action;
    u8 flags;
    u8 xfrm_nr;
    u16 family;
    struct xfrm_sec_ctx *security;
    struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH];
};

Let's dive into a few key fields:

1. refcnt (Reference Count) Any kernel object without a reference count is a ticking time bomb. refcnt is initialized to 1, incremented by xfrm_pol_hold() when used, and decremented by xfrm_pol_put() when done. When it hits zero, the policy is destroyed.
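The hold/put discipline described here can be sketched in userspace with C11 atomics. This is an illustration only: `demo_policy` and its helpers are hypothetical, and the `destroyed_flag` pointer exists purely so a caller can observe when the object is freed.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_policy {
    atomic_int refcnt;
    int *destroyed_flag;   /* hypothetical: set to 1 when the object is freed */
};

/* A new object starts with one reference, owned by the creator. */
struct demo_policy *demo_policy_alloc(int *destroyed_flag)
{
    struct demo_policy *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    atomic_init(&p->refcnt, 1);
    p->destroyed_flag = destroyed_flag;
    return p;
}

/* Take an extra reference before using the object. */
void demo_pol_hold(struct demo_policy *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

/* Drop a reference; whoever drops the last one destroys the object. */
void demo_pol_put(struct demo_policy *p)
{
    if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
        if (p->destroyed_flag != NULL)
            *p->destroyed_flag = 1;
        free(p);
    }
}
```

Every `demo_pol_hold()` must be paired with exactly one `demo_pol_put()`; an unbalanced pair either leaks the object or frees it while it is still in use.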

2. timer (Timer) Policies aren't valid forever. xfrm_policy_timer() monitors the policy's lifespan. Once time is up (the hard limit specified by lft), it does two things:

  1. Deletes the policy (calls xfrm_policy_delete()).
  2. Sends an obituary (XFRM_MSG_POLEXPIRE) to all registered key management programs, telling them, "Buddy, this policy has expired."

3. lft and curlft (Lifetimes)

  • lft (Lifetime Config): These are configuration items. You can set them via the limit parameter on the command line, for example:
    ip xfrm policy add src 172.16.2.0/24 dst 172.16.1.0/24 limit byte-soft 6000 ...
    This sets the soft byte limit to 6000.
  • curlft (Current Lifetime): This is the dashboard, recording how much traffic has passed through. It has four u64 fields:
    • bytes: Total bytes processed (incremented by xfrm_output_one() on the send path, and by xfrm_input() on the receive path).
    • packets: Total packets processed.
    • add_time: Timestamp of when the policy was born.
    • use_time: Timestamp of when it was last used.

If you use ip -stat xfrm policy show, you can see these statistics.
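The relationship between the configured limits and the running counters can be sketched as follows. This is a simplified userspace model, not the kernel's logic: the `demo_` names are hypothetical, only byte and age limits are modeled, and the convention that 0 means "no limit" is an assumption of this sketch.

```c
#include <stdint.h>

enum demo_expiry { DEMO_ALIVE, DEMO_SOFT_EXPIRED, DEMO_HARD_EXPIRED };

/* Configured limits. In this sketch (an assumption) 0 means "no limit". */
struct demo_lifetime_cfg {
    uint64_t soft_byte_limit;
    uint64_t hard_byte_limit;
    uint64_t soft_add_expires_seconds;
    uint64_t hard_add_expires_seconds;
};

/* Running counters, mirroring the four fields described above. */
struct demo_lifetime_cur {
    uint64_t bytes, packets, add_time, use_time;
};

/* Record one processed packet of len bytes at time now (seconds). */
void demo_account(struct demo_lifetime_cur *cur, uint64_t len, uint64_t now)
{
    cur->bytes += len;
    cur->packets++;
    cur->use_time = now;
}

static int demo_over(uint64_t value, uint64_t limit)
{
    return limit != 0 && value >= limit;
}

/* Hard limits win over soft limits; both are checked on bytes and age. */
enum demo_expiry demo_check_expire(const struct demo_lifetime_cfg *cfg,
                                   const struct demo_lifetime_cur *cur,
                                   uint64_t now)
{
    uint64_t age = now - cur->add_time;

    if (demo_over(cur->bytes, cfg->hard_byte_limit) ||
        demo_over(age, cfg->hard_add_expires_seconds))
        return DEMO_HARD_EXPIRED;
    if (demo_over(cur->bytes, cfg->soft_byte_limit) ||
        demo_over(age, cfg->soft_add_expires_seconds))
        return DEMO_SOFT_EXPIRED;
    return DEMO_ALIVE;
}
```

With a soft byte limit of 6000, as in the command above, this model reports a soft expiry once 6000 bytes have been accounted, which is the point where a key manager would typically be asked to renegotiate.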

4. polq (Policy Queue) This is a very interesting feature, and an easy place to fall into a trap. Here's the scenario: A packet arrives, matches a policy, and the policy says to encrypt it. But! The corresponding SA (Security Association) hasn't been established yet (key negotiation isn't complete). What do we do now?

  • Default behavior: Drop it immediately. Call make_blackhole() and the packet is gone.
  • Alternative choice: If you set /proc/sys/net/core/xfrm_larval_drop to 0, the kernel won't drop the packet. Instead, it places the packet on polq.hold_queue to wait.
    • The queue holds a maximum of 100 packets (XFRM_MAX_QUEUE_LEN).
    • This creates a "dummy Bundle" (xfrm_create_dummy_bundle()) as a placeholder.
    • Once key negotiation is complete, these packets will be processed.

⚠️ Warning sysctl_larval_drop defaults to 1 (drop immediately). If you're debugging IPsec in a production environment and find that packets drop completely when key exchange hasn't fully finished, don't panic—check if this parameter is set to 1 first.
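The drop-or-queue decision can be modeled in a few lines. This is a hedged userspace sketch rather than the kernel's implementation: packets are represented by plain integers, the `demo_` names are hypothetical, and the queue limit of 100 simply mirrors the figure quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_QUEUE_LEN 100   /* mirrors the limit quoted in the text */

enum demo_verdict { DEMO_DROPPED, DEMO_QUEUED };

struct demo_hold_queue {
    size_t len;
    int packets[DEMO_MAX_QUEUE_LEN];   /* packet IDs stand in for buffers */
};

/* Called when a packet matches a policy but no SA exists yet. */
enum demo_verdict demo_no_state(struct demo_hold_queue *q,
                                bool larval_drop, int packet_id)
{
    if (larval_drop)
        return DEMO_DROPPED;           /* default: drop immediately */
    if (q->len >= DEMO_MAX_QUEUE_LEN)
        return DEMO_DROPPED;           /* queue is full */
    q->packets[q->len++] = packet_id;
    return DEMO_QUEUED;
}

/*
 * Called once negotiation has finished. A real implementation would
 * transmit each held packet; this sketch only reports how many were
 * released and empties the queue.
 */
size_t demo_sa_ready(struct demo_hold_queue *q)
{
    size_t released = q->len;

    q->len = 0;
    return released;
}
```

With `larval_drop` set to true nothing is ever queued, which matches the symptom described in the warning: packets sent before negotiation completes are simply lost.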

5. action (Action) The policy determines the packet's fate. This field only has two values:

  • XFRM_POLICY_ALLOW (0): Let it through. This usually means "if you can find an SA to encrypt with, use it; if you can't, it depends on your config—it might get dropped, or it might go out in plaintext."
  • XFRM_POLICY_BLOCK (1): Block it. This is like writing type=reject in your config file; the kernel simply kills the packet.

6. xfrm_vec (Template Array) A policy doesn't directly specify "encrypt with key A." Instead, it specifies "templates." xfrm_vec is an array that can hold up to 6 templates (XFRM_MAX_DEPTH). This allows you to do complex things, like "first do ESP encryption, then do IP encapsulation." These templates (xfrm_tmpl) are the bridge connecting policies to specific states.
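One way to picture how templates bridge policies and states is a resolver that walks the template array and finds one matching state per template. The sketch below is a userspace model under stated assumptions: the `demo_` types are hypothetical, states live in a flat array instead of hash tables, and a template matches a state when protocol, mode, and reqid are all equal.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DEPTH 6   /* mirrors the template limit quoted above */

struct demo_tmpl {
    uint8_t  proto;        /* e.g. 50 = ESP, 51 = AH, 108 = IPComp */
    uint8_t  mode;         /* hypothetical encoding: 0 = transport, 1 = tunnel */
    uint32_t reqid;
};

struct demo_state {
    uint8_t  proto;
    uint8_t  mode;
    uint32_t reqid;
    uint32_t spi;
};

struct demo_policy {
    uint8_t xfrm_nr;                       /* templates actually in use */
    struct demo_tmpl xfrm_vec[DEMO_MAX_DEPTH];
};

/*
 * Fill out[] with one state per template, in template order.
 * Returns the number of states found, or -1 if any template has no
 * matching state (the point where key negotiation would be needed).
 */
int demo_tmpl_resolve(const struct demo_policy *pol,
                      const struct demo_state *sad, size_t sad_len,
                      const struct demo_state **out)
{
    size_t i, j;

    if (pol->xfrm_nr > DEMO_MAX_DEPTH)
        return -1;
    for (i = 0; i < pol->xfrm_nr; i++) {
        const struct demo_tmpl *t = &pol->xfrm_vec[i];

        out[i] = NULL;
        for (j = 0; j < sad_len; j++) {
            if (sad[j].proto == t->proto && sad[j].mode == t->mode &&
                sad[j].reqid == t->reqid) {
                out[i] = &sad[j];
                break;
            }
        }
        if (out[i] == NULL)
            return -1;
    }
    return pol->xfrm_nr;
}
```

The order of `out[]` follows the order of the templates, so the caller can apply the transformations one after another.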


XFRM States (Security Associations)

If policies are the "law" dictating what should be done, then states are the "weapons"—the guys actually doing the heavy lifting.

In the kernel, a Security Association (SA) is represented by struct xfrm_state. Note that an SA is unidirectional. If you want bidirectional communication, you need two SAs (one inbound, one outbound).

You can send an XFRM_MSG_NEWSA message to the kernel via the ip xfrm state add command to trigger xfrm_state_add() and create one; to delete it, you send XFRM_MSG_DELSA.

Dissecting the State Structure

struct xfrm_state is a massive structure packed with sensitive information:

struct xfrm_state {
    . . .
    union {
        struct hlist_node gclist;
        struct hlist_node bydst;
    };
    struct hlist_node bysrc;
    struct hlist_node byspi;

    atomic_t refcnt;
    spinlock_t lock;

    struct xfrm_id id; // <-- the SA's ID card
    struct xfrm_selector sel;
    struct xfrm_mark mark;
    u32 tfcpad;

    u32 genid;

    /* Key manager bits */
    struct xfrm_state_walk km;

    /* Parameters of this state. */
    struct {
        u32 reqid;
        u8 mode;
        u8 replay_window;
        u8 aalgo, ealgo, calgo;
        u8 flags;
        u16 family;
        xfrm_address_t saddr;
        int header_len;
        int trailer_len;
    } props;

    struct xfrm_lifetime_cfg lft;

    /* Data for transformer */
    struct xfrm_algo_auth *aalg;
    struct xfrm_algo *ealg;
    struct xfrm_algo *calg;
    struct xfrm_algo_aead *aead;

    /* Data for encapsulator */
    struct xfrm_encap_tmpl *encap;

    /* Data for care-of address */
    xfrm_address_t *coaddr;

    /* IPComp needs an IPIP tunnel for handling uncompressed packets */
    struct xfrm_state *tunnel;

    /* If a tunnel, number of users + 1 */
    atomic_t tunnel_users;

    /* State for replay detection */
    struct xfrm_replay_state replay;
    struct xfrm_replay_state_esn *replay_esn;

    /* Replay detection state at the time we sent the last notification */
    struct xfrm_replay_state preplay;
    struct xfrm_replay_state_esn *preplay_esn;

    /* The functions for replay detection. */
    struct xfrm_replay *repl;
    . . .

    /* Statistics */
    struct xfrm_stats stats;

    struct xfrm_lifetime_cur curlft;
    . . .

    /* Reference to data common to all the instances of this
     * transformer. */
    const struct xfrm_type *type;
    struct xfrm_mode *inner_mode;
    struct xfrm_mode *inner_mode_iaf;
    struct xfrm_mode *outer_mode;

    /* Security context */
    struct xfrm_sec_ctx *security;

    /* Private data of this transformer, format is opaque,
     * interpreted by xfrm_type methods. */
    void *data;
};

There are a few key points here we must highlight:

1. id (Identity) The xfrm_id structure contains three fields: destination address, SPI, and protocol number (AH/ESP/IPCOMP). This triplet logically uniquely identifies an SA. Note the word "logically"—because for faster lookups, the kernel actually builds several hash tables.

2. props (Properties) These are the concrete configuration details:

  • mode: Is it Transport mode or Tunnel mode? This is a core concept in IPsec.
  • replay_window: The window size for anti-replay protection.
  • aalgo, ealgo: The authentication and encryption algorithm IDs.
  • flags: For example, XFRM_STATE_ICMP, used to control certain special behaviors.
  • saddr: The source address of this SA (since it's unidirectional, this is very important).

3. aalg / ealg / aead (Algorithm Pointers) These pointers point to specific algorithm instances. This is where the Crypto API we discussed in the previous section actually gets called. If AES-GCM is configured, the aead pointer will point to the corresponding AEAD algorithm instance.

4. replay (Anti-Replay Protection) IPsec must defend against replay attacks. The kernel uses the replay structure to track sequence numbers: a received packet is dropped if its sequence number falls within the window but has already been seen, or if it is older than the window allows.
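The sliding-window idea behind anti-replay protection can be sketched with a 32-bit bitmap. This is a simplified userspace model, not the kernel's code: the `demo_` names are hypothetical, extended sequence numbers (ESN) are ignored, and the window is capped at 32 packets. Bit n of the bitmap records whether sequence number `seq - n` has been seen.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_replay {
    uint32_t seq;      /* highest sequence number accepted so far */
    uint32_t bitmap;   /* bit n set => (seq - n) was already received */
    uint8_t  window;   /* window size in packets, 0 disables the check */
};

/* Decide whether a received sequence number is acceptable. */
bool demo_replay_check(const struct demo_replay *r, uint32_t seq)
{
    uint32_t diff, win;

    if (r->window == 0)
        return true;               /* replay protection disabled */
    if (seq == 0)
        return false;              /* 0 is never a valid sequence number */
    if (seq > r->seq)
        return true;               /* newer than anything seen so far */
    diff = r->seq - seq;
    win = r->window < 32 ? r->window : 32;
    if (diff >= win)
        return false;              /* older than the window allows */
    return !(r->bitmap & (UINT32_C(1) << diff));   /* reject duplicates */
}

/* Record an accepted sequence number (call only after a successful check). */
void demo_replay_advance(struct demo_replay *r, uint32_t seq)
{
    uint32_t diff;

    if (seq > r->seq) {
        diff = seq - r->seq;
        r->bitmap = (diff < 32) ? ((r->bitmap << diff) | 1) : 1;
        r->seq = seq;
    } else {
        diff = r->seq - seq;
        if (diff < 32)
            r->bitmap |= UINT32_C(1) << diff;
    }
}
```

The check and the update are kept separate on purpose: the window should only advance after the packet has also passed integrity verification, otherwise a forged packet could push the window forward.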

5. type / mode (Polymorphic Support) The type field points to a struct xfrm_type, which holds the protocol-specific operations (how ESP handles output, how ESP handles input, and so on). The inner_mode and outer_mode fields point to struct xfrm_mode objects that implement transport mode or tunnel mode. This is a classic example of object-oriented design in C.
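The "object-oriented C" pattern mentioned here boils down to a structure of function pointers shared by all instances of one protocol. The sketch below is an illustration only: the `demo_` types are hypothetical, and the length arithmetic is a made-up placeholder, not real ESP or AH overhead.

```c
#include <stddef.h>

struct demo_state;

/* The "class": one shared, read-only table of operations per protocol. */
struct demo_type {
    const char *description;
    size_t (*output)(const struct demo_state *x, size_t payload_len);
};

/* The "object": per-SA data plus a pointer to its operations. */
struct demo_state {
    const struct demo_type *type;
    size_t header_len;
};

/* Placeholder arithmetic: header plus a 2-byte trailer (illustrative only). */
static size_t demo_esp_output(const struct demo_state *x, size_t payload_len)
{
    return payload_len + x->header_len + 2;
}

/* Placeholder arithmetic: header only (illustrative only). */
static size_t demo_ah_output(const struct demo_state *x, size_t payload_len)
{
    return payload_len + x->header_len;
}

const struct demo_type demo_esp_type = { "ESP", demo_esp_output };
const struct demo_type demo_ah_type  = { "AH",  demo_ah_output };

/* Generic code never needs to know which protocol it is driving. */
size_t demo_xfrm_output(const struct demo_state *x, size_t payload_len)
{
    return x->type->output(x, payload_len);
}
```

The generic caller stays identical for every protocol; adding a new transformation means adding a new `demo_type` instance, not touching `demo_xfrm_output()`.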


SAD: Security Association Database

Where does the kernel store all the xfrm_state objects? Look back at the netns_xfrm structure we saw earlier and its three hash tables:

  • state_bydst: Hashed by destination address.
  • state_bysrc: Hashed by source address.
  • state_byspi: Hashed by SPI.

It's a bit like using three different keys to open the same door—depending on what clues you have on hand (sometimes you only know the SPI, other times you have the address), you can take different entry points to search.

When the kernel inserts an SA into the database (__xfrm_state_insert()), it computes the hash values and links the state into state_bydst and state_bysrc, and also into state_byspi when the SPI is nonzero.
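A toy version of a database indexed three ways might look like the following. It is a userspace sketch under stated assumptions: the `demo_` names, the tiny table size, and the hash function are all hypothetical, addresses are IPv4 in host byte order, and a state with SPI 0 is left out of the SPI index.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_SIZE 8   /* hypothetical, tiny on purpose */

struct demo_state {
    uint32_t daddr, saddr, spi;
    uint8_t  proto;
    /* One chain link per index, so the same object sits in all tables. */
    struct demo_state *next_bydst, *next_bysrc, *next_byspi;
};

struct demo_sad {
    struct demo_state *bydst[DEMO_HASH_SIZE];
    struct demo_state *bysrc[DEMO_HASH_SIZE];
    struct demo_state *byspi[DEMO_HASH_SIZE];
    unsigned int state_num;
};

/* Hypothetical hash; only meant to spread values over the buckets. */
static unsigned int demo_hash(uint32_t v)
{
    return (v ^ (v >> 16)) % DEMO_HASH_SIZE;
}

void demo_state_insert(struct demo_sad *sad, struct demo_state *x)
{
    unsigned int h;

    h = demo_hash(x->daddr);
    x->next_bydst = sad->bydst[h];
    sad->bydst[h] = x;

    h = demo_hash(x->saddr);
    x->next_bysrc = sad->bysrc[h];
    sad->bysrc[h] = x;

    if (x->spi != 0) {              /* SPI 0 cannot be looked up by SPI */
        h = demo_hash(x->spi);
        x->next_byspi = sad->byspi[h];
        sad->byspi[h] = x;
    }
    sad->state_num++;
}

/* Lookup by the identifying triplet: destination, SPI, protocol. */
struct demo_state *demo_state_lookup(const struct demo_sad *sad,
                                     uint32_t daddr, uint32_t spi,
                                     uint8_t proto)
{
    struct demo_state *x;

    for (x = sad->byspi[demo_hash(spi)]; x != NULL; x = x->next_byspi)
        if (x->spi == spi && x->proto == proto && x->daddr == daddr)
            return x;
    return NULL;
}
```

Because each state carries its own chain links, inserting it into several indexes costs no extra allocation; the same object is simply reachable from several starting points.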

The Special "Acquire State"

Here's an interesting detail: states with an SPI of 0.

Under normal circumstances, an SPI is never 0. So when would it be 0? When you've configured a policy but the key hasn't been negotiated yet (for example, IKEv2 is still handshaking), the kernel creates a temporary, SPI-0 "acquire state" to prevent spamming the userspace daemon with relentless requests.

  • This state doesn't get hung on the state_byspi hash table (since it's all zeros, there's no way to look it up).
  • As long as this state exists, the kernel knows "oh, I'm in the middle of negotiating" and stops sending new Acquire messages.
  • Once negotiation is complete, this temporary state gets replaced by the real SA.
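The throttling effect of the acquire state can be modeled as follows. This is a userspace sketch with hypothetical names: states are keyed by destination address only, a counter stands in for the Acquire messages sent to the key manager, and the timeout that would eventually expire a stale acquire state is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SA 16   /* hypothetical table size */

struct demo_sa {
    bool     in_use;
    uint32_t daddr;
    uint32_t spi;        /* 0 marks a temporary acquire state */
};

struct demo_ctx {
    struct demo_sa sa[DEMO_MAX_SA];
    unsigned int acquires_sent;   /* stands in for messages to userspace */
};

static struct demo_sa *demo_find(struct demo_ctx *c, uint32_t daddr)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_SA; i++)
        if (c->sa[i].in_use && c->sa[i].daddr == daddr)
            return &c->sa[i];
    return NULL;
}

/* Returns true if a usable SA exists for daddr; otherwise asks for one. */
bool demo_output(struct demo_ctx *c, uint32_t daddr)
{
    struct demo_sa *x = demo_find(c, daddr);
    size_t i;

    if (x != NULL)
        return x->spi != 0;       /* acquire state: negotiation in flight */

    for (i = 0; i < DEMO_MAX_SA; i++) {
        if (!c->sa[i].in_use) {
            c->sa[i].in_use = true;
            c->sa[i].daddr = daddr;
            c->sa[i].spi = 0;     /* placeholder until keys arrive */
            c->acquires_sent++;   /* notify the key manager exactly once */
            break;
        }
    }
    return false;
}

/* Negotiation finished: the placeholder becomes (or is replaced by) a real SA. */
void demo_install_sa(struct demo_ctx *c, uint32_t daddr, uint32_t spi)
{
    struct demo_sa *x = demo_find(c, daddr);
    size_t i;

    for (i = 0; x == NULL && i < DEMO_MAX_SA; i++)
        if (!c->sa[i].in_use)
            x = &c->sa[i];
    if (x != NULL) {
        x->in_use = true;
        x->daddr = daddr;
        x->spi = spi;
    }
}
```

However many packets arrive while negotiation is pending, `acquires_sent` only increases once per destination, which is the whole purpose of the placeholder.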

Looking Up SAs

The kernel provides several lookup functions for different scenarios:

  • xfrm_state_lookup(): The most commonly used, looks up state_byspi.
  • xfrm_state_lookup_byaddr(): Looks up state_bysrc.
  • xfrm_state_find(): Looks up state_bydst, typically used on the send path.

At this point, we've thoroughly explored the XFRM skeleton—the Security Policy Database (SPD) and the Security Association Database (SAD). You now know how the kernel decides what to process and which "key" to use for processing.

But this is still just static configuration. In the next section, we'll see how packets flow through these hash tables and structures, ultimately transforming into a stream of ciphertext flying across the internet.