10.4 ESP Implementation (IPv4)

We have seen how the XFRM framework stores policies (SPD) and states (SAD)—like building a house and preparing its ledgers. But how traffic flows in and out of that house, and how the rules in those ledgers are enforced, depends on the specific protocol.

This is where ESP (Encapsulating Security Payload) steps in.

In the previous section, we discussed the skeleton of XFRM; now we fill in the meat. ESP is the most commonly used protocol in the IPsec suite because it handles both encryption and authentication. Its protocol number is 50 (IPPROTO_ESP).

The ESP Protocol View

Before diving into the kernel code, we need to understand what an ESP packet looks like.

According to RFC 4303, ESP places a header in front of the payload and a trailer behind it. Unlike protocols that only add a header, this creates a "sandwich" structure. If we capture an IPv4 packet processed by ESP, the ESP portion has the following layout (Figure 10-1):

+----------+----------+-----------+---------+---------+----------+-----------------+
| SPI (4B) | Seq No.  | Payload   | Padding | Pad Len | Next Hdr | Auth Data (ICV) |
| (Header) | (Header) | (Data)    |         |         |          | (Trailer)       |
+----------+----------+-----------+---------+---------+----------+-----------------+
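
To make the layout concrete, here is a minimal userspace sketch in C that splits a raw ESP packet into the regions shown in the figure. It is not kernel code: struct esp_view, rd_be32() and esp_split() are hypothetical names, the ICV length is passed in by the caller because it depends on the SA's algorithm, and any IV the cipher needs is simply treated as part of the encrypted body.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace view of an ESP packet (layout per RFC 4303). */
struct esp_view {
    uint32_t spi;       /* Security Parameter Index, host byte order */
    uint32_t seq;       /* sequence number, host byte order */
    size_t   body_off;  /* offset of the encrypted region */
    size_t   body_len;  /* payload + padding + pad length + next header */
};

/* Read a 32-bit big-endian (network byte order) value. */
static uint32_t rd_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Split an ESP packet that starts right after the IP header.
 * icv_len is the ICV size used by the SA (an input, not stored in the packet).
 * Returns 0 on success, -1 if the buffer is too short to be valid.
 */
static int esp_split(const uint8_t *pkt, size_t len, size_t icv_len,
                     struct esp_view *out)
{
    const size_t hdr_len = 8;      /* SPI + sequence number */
    const size_t min_trailer = 2;  /* pad length + next header */

    if (len < hdr_len + min_trailer + icv_len)
        return -1;

    out->spi = rd_be32(pkt);
    out->seq = rd_be32(pkt + 4);
    out->body_off = hdr_len;
    out->body_len = len - hdr_len - icv_len;
    return 0;
}
```

For a 36-byte packet that begins with the bytes 00 00 10 01 00 00 00 07 and an SA using a 12-byte ICV, esp_split() reports SPI 0x1001, sequence number 7, and a 16-byte encrypted body starting at offset 8. The pad length and next header bytes only become readable once that body has been decrypted.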

These fields are worth a close look:

SPI (Security Parameter Index): 32 bits. This is the key to unlocking a specific SA. It is used together with the destination IP address and the security protocol to uniquely identify an SA (remember the triplet we mentioned in the last section? The SPI is one part of it).
Sequence Number: 32 bits. This increments by 1 for each packet sent. It is not just a counter; it exists to prevent replay attacks. The receiver maintains a sliding window; if an incoming packet's sequence number is too old or a duplicate, it is dropped outright. An attacker who captures a legitimate packet and retransmits it later gets nowhere.
Payload Data: The actual data we want to transmit, now encrypted into ciphertext.
Padding: 0 to 255 bytes. Encryption algorithms typically require the data length to be a multiple of a specific block size (e.g., 16 bytes for AES). If it falls short, padding is added.
Pad Length: 1 byte. Tells the receiver how many padding bytes were added so they can be stripped after decryption.
Next Header: 1 byte. After decryption, what protocol is inside? TCP, UDP, or ICMP? Check this field.
Authentication Data: The ICV (Integrity Check Value). This is the tamper-proof fingerprint. The sender calculates and embeds it; the receiver recalculates and compares. If they do not match, the packet was altered and is discarded.
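
The sliding window behind the Sequence Number field can be sketched in a few lines of userspace C. This is an illustration in the spirit of RFC 4303, not the kernel's implementation: struct replay_state, replay_check_and_update() and the 32-packet window size are assumptions made for the example. A real receiver would also advance the window only after the ICV has been verified, whereas this sketch merges the check and the update for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPLAY_WINDOW 32u  /* window size in packets (assumed for this sketch) */

struct replay_state {
    uint32_t top;     /* highest sequence number accepted so far */
    uint32_t bitmap;  /* bit i set means (top - i) was already seen */
};

/* Returns true and records seq if it is fresh; false if it must be dropped. */
static bool replay_check_and_update(struct replay_state *rs, uint32_t seq)
{
    uint32_t diff;

    if (seq == 0)
        return false;  /* valid sequence numbers start at 1 */

    if (seq > rs->top) {
        /* Newer than anything seen: slide the window forward. */
        uint32_t shift = seq - rs->top;

        rs->bitmap = (shift < REPLAY_WINDOW) ? (rs->bitmap << shift) : 0;
        rs->bitmap |= 1u;  /* bit 0 marks seq itself */
        rs->top = seq;
        return true;
    }

    diff = rs->top - seq;
    if (diff >= REPLAY_WINDOW)
        return false;  /* too old: fell off the left edge of the window */
    if (rs->bitmap & (1u << diff))
        return false;  /* duplicate: already received */

    rs->bitmap |= 1u << diff;  /* late but fresh: remember it */
    return true;
}
```

Starting from an empty state, sequence numbers 1, 3, 2 are all accepted (2 arrives late but inside the window), while a second copy of 1 or 2 is rejected as a duplicate.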

On Performance: The Evolution of Encryption and Authentication

Although ESP supports encryption-only and authentication-only modes, in the real world almost no one dares to use them. For security, "encrypt and authenticate" is the standard. The traditional approach first encrypts the data with a cipher (like AES-CBC), then calculates the ICV over the result with an HMAC (like HMAC-SHA1 or HMAC-SHA2). This means traversing the data twice.

But hardware has evolved. Modern deployments prefer AEAD (Authenticated Encryption with Associated Data) algorithms, such as AES-GCM. These algorithms merge encryption and authentication into a single operation, which is not only more efficient but also highly parallelizable. If your CPU provides AES acceleration such as the Intel AES-NI instruction set, IPsec throughput of several Gbit/s is well within reach.

IPv4 ESP Initialization

Now that we know what the protocol itself looks like, the next question is: how does the kernel know to "hand protocol number 50 packets to ESP for processing"?

This is a classic bidirectional registration process:

Tell the XFRM framework: "I am ESP, and I have these handler functions (input, output, state initialization, etc.)."
Tell the IPv4 stack: "If you receive a packet with protocol number 50, please call this hook function."

Both of these tasks are accomplished in the esp4_init() function, located at net/ipv4/esp4.c.

Step 1: Define the ESP Type (`xfrm_type`)

First, we define an xfrm_type structure instance named esp_type.

static const struct xfrm_type esp_type =
{
        .description    = "ESP4",
        .owner          = THIS_MODULE,
        .proto          = IPPROTO_ESP,
        .flags          = XFRM_TYPE_REPLAY_PROT,
        .init_state     = esp_init_state,
        .destructor     = esp_destroy,
        .get_mtu        = esp4_get_mtu,
        .input          = esp_input,
        .output         = esp_output
};

This structure acts as ESP's "functional specification."

.proto declares that it handles the ESP protocol.
.flags sets XFRM_TYPE_REPLAY_PROT here, explicitly telling the kernel: "I support anti-replay protection."
Most important are the subsequent callback functions:
- .input and .output: The core logic for packet processing.
- .init_state: When userspace creates a new SA via Netlink (e.g., after key negotiation), this function is called to initialize ESP's private context (such as allocating key memory required by the encryption algorithm).

Next, we need to register this esp_type with the XFRM framework.

Each protocol family (IPv4 or IPv6) has an xfrm_state_afinfo object. This object contains an array called type_map. The registration process essentially hangs this esp_type into IPv4's type_map array.

Here is how the code does it:

        if (xfrm_register_type(&esp_type, AF_INET) < 0) {
                pr_info("%s: can't add xfrm type\n", __func__);
                return -EAGAIN;
        }

If this step succeeds, the kernel makes a note: "Oh, when handling the IPv4 ESP protocol, use the logic defined in esp_type."
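
The idea of "hanging a type into type_map" can be modeled in userspace C as a table indexed by protocol number. The sketch below is a toy, not the kernel's xfrm_register_type(): struct toy_type, toy_type_map and the toy_* functions are hypothetical names, the error codes are chosen for the example, and the locking a real kernel registry needs is left out.

```c
#include <errno.h>
#include <stddef.h>

#define PROTO_MAX 256  /* one slot per 8-bit IP protocol number */
#define PROTO_ESP 50

/* Toy stand-in for struct xfrm_type: a description and a protocol number. */
struct toy_type {
    const char *description;
    unsigned char proto;
};

/* Toy stand-in for the per-address-family type_map array. */
static const struct toy_type *toy_type_map[PROTO_MAX];

/* Claim the slot for t->proto; fails if another type already owns it. */
static int toy_register_type(const struct toy_type *t)
{
    if (toy_type_map[t->proto] != NULL)
        return -EEXIST;
    toy_type_map[t->proto] = t;
    return 0;
}

/* Release the slot, but only if t is the type that currently owns it. */
static int toy_unregister_type(const struct toy_type *t)
{
    if (toy_type_map[t->proto] != t)
        return -ENOENT;
    toy_type_map[t->proto] = NULL;
    return 0;
}

/* Find the handler description for a protocol number, or NULL. */
static const struct toy_type *toy_lookup_type(unsigned char proto)
{
    return toy_type_map[proto];
}
```

After toy_register_type() succeeds for a type with proto set to PROTO_ESP, toy_lookup_type(PROTO_ESP) returns that type, which is the userspace analogue of the kernel "making a note" of which logic handles ESP.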

Step 2: Register the Protocol Handler (`net_protocol`)

Just letting XFRM know is not enough. When a packet arrives from the NIC and reaches the IPv4 stack layer, the stack does not care about XFRM—it only looks at the protocol number in the IP header.

We need to tell the kernel: "If you see the Protocol field in the IP header is IPPROTO_ESP (50), throw this packet to the xfrm4_rcv function."

This is done through the standard IP protocol registration mechanism:

static const struct net_protocol esp4_protocol = {
        .handler        =       xfrm4_rcv,
        .err_handler    =       esp4_err,
        .no_policy      =       1,
        .netns_ok       =       1,
};

        if (inet_add_protocol(&esp4_protocol, IPPROTO_ESP) < 0) {
                pr_info("%s: can't add protocol\n", __func__);
                xfrm_unregister_type(&esp_type, AF_INET);
                return -EAGAIN;
        }

Note the error handling here: if inet_add_protocol fails (for example, because a handler for protocol number 50 is already registered), the code rolls back the previous operation by calling xfrm_unregister_type to unregister the esp_type we just registered. This "leave no trace on failure" style is the standard pattern for kernel module registration.
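
The rollback pattern is easy to see in isolation. The following userspace sketch mimics the two-step shape of the init function with toy registries; every name in it (toy_xfrm_slot_taken, toy_inet_slot_taken, toy_step_register(), toy_esp_init()) is hypothetical and exists only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy registries: each one only remembers whether its slot is taken. */
static bool toy_xfrm_slot_taken;
static bool toy_inet_slot_taken;

/* Claim a slot; fails if it is already occupied. */
static int toy_step_register(bool *slot)
{
    if (*slot)
        return -1;
    *slot = true;
    return 0;
}

static void toy_step_unregister(bool *slot)
{
    *slot = false;
}

/* Two-step init: if step 2 fails, undo step 1 before returning. */
static int toy_esp_init(void)
{
    if (toy_step_register(&toy_xfrm_slot_taken) < 0)
        return -EAGAIN;

    if (toy_step_register(&toy_inet_slot_taken) < 0) {
        toy_step_unregister(&toy_xfrm_slot_taken);  /* roll back step 1 */
        return -EAGAIN;
    }
    return 0;
}
```

If the second slot is already occupied when toy_esp_init() runs, the function returns -EAGAIN and the first slot is free again afterwards, so a failed init leaves no half-registered state behind.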

An interesting detail:

You might have noticed that the .handler here is not ESP's unique esp_input, but rather the generic xfrm4_rcv.

In fact, not only ESP, but the IPv4 AH protocol (net/ipv4/ah4.c) and the IPCOMP protocol (net/ipv4/ipcomp.c) also register xfrm4_rcv as their handler. This is because a lot of the receive-path logic is shared among these three protocols (such as looking up the SAD, checking policies, and handling the replay window), so there is no need for the kernel to implement it three times. xfrm4_rcv handles these common tasks first, and finally, based on the protocol type, jumps to the .input callback in the corresponding xfrm_type (which is the esp_input we defined above) to handle the actual decryption logic.
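
This "one generic entry point, per-protocol callback" design can be modeled in userspace C as follows. It is a deliberately simplified sketch: the toy_* names are hypothetical, the shared work is reduced to a counter, and the callback is chosen directly from the protocol number, whereas the kernel reaches the xfrm_type through the SA it looked up.

```c
#include <stddef.h>

/* Toy packet: a protocol number plus flags recording what was done to it. */
struct toy_pkt {
    unsigned char proto;
    int decrypted;
    int decompressed;
};

/* Toy stand-in for struct xfrm_type, reduced to the .input callback. */
struct toy_xfrm_type {
    unsigned char proto;
    int (*input)(struct toy_pkt *pkt);
};

static int toy_esp_input(struct toy_pkt *pkt)
{
    pkt->decrypted = 1;     /* stands in for ESP decryption */
    return 0;
}

static int toy_ipcomp_input(struct toy_pkt *pkt)
{
    pkt->decompressed = 1;  /* stands in for IPComp decompression */
    return 0;
}

static const struct toy_xfrm_type toy_types[] = {
    { 50,  toy_esp_input },     /* IPPROTO_ESP */
    { 108, toy_ipcomp_input },  /* IPPROTO_COMP */
};

/* Counts how often the shared receive-path work ran. */
static int toy_common_calls;

/* Generic receive hook: shared work first, then the per-protocol callback. */
static int toy_rcv(struct toy_pkt *pkt)
{
    size_t i;

    toy_common_calls++;  /* placeholder for SA lookup, replay check, ... */
    for (i = 0; i < sizeof(toy_types) / sizeof(toy_types[0]); i++) {
        if (toy_types[i].proto == pkt->proto)
            return toy_types[i].input(pkt);
    }
    return -1;  /* no type registered for this protocol */
}
```

Passing a packet with proto 50 through toy_rcv() sets only its decrypted flag, and one with proto 108 sets only decompressed, while the shared step runs once per packet in both cases. That is the benefit the text describes: the common logic is written once and only the last step differs.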

With this, initialization is complete. The pipeline is laid, and all that remains is to wait for the packets to flow.
