9.9 NAT Hook Callbacks and Connection Tracking Extensions

In the previous section, we saw how manip_pkt, our "surgeon," wields its scalpel to modify packet IPs and ports. But a question arises: who calls this function? When does it step in? More importantly, how does it know what to change the packet into—for instance, should it modify the source address (SNAT) or the destination address (DNAT)?

The answer lies in Netfilter's NAT Hook callbacks.

For IPv4, the core code implementing NAT lives in net/ipv4/netfilter/iptable_nat.c (the IPv6 counterpart is ip6table_nat.c). This module registers four callback functions, corresponding to the Netfilter hook points we saw earlier.

9.9.1 Four Callback Entry Points

Let's first take a clear look at this "shift schedule." Table 9-1 lists the Hook callback functions registered by the IPv4 and IPv6 NAT modules.

Table 9-1: IPv4 and IPv6 NAT Callback Functions

Netfilter Hook	IPv4 Callback Function	IPv6 Callback Function
`NF_INET_PRE_ROUTING`	`nf_nat_ipv4_in`	`nf_nat_ipv6_in`
`NF_INET_POST_ROUTING`	`nf_nat_ipv4_out`	`nf_nat_ipv6_out`
`NF_INET_LOCAL_OUT`	`nf_nat_ipv4_local_fn`	`nf_nat_ipv6_local_fn`
`NF_INET_LOCAL_IN`	`nf_nat_ipv4_fn`	`nf_nat_ipv6_fn`

There is an interesting design detail worth noting here.

Among these four callback functions, despite their varying names, nf_nat_ipv4_fn is clearly the core of the core: the other three functions (nf_nat_ipv4_in, nf_nat_ipv4_out, nf_nat_ipv4_local_fn) ultimately pass the buck to it. In other words, nf_nat_ipv4_fn is the "brain" of the NAT engine, while the others are thin wrappers that do a little hook-specific work before and after calling it.
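The delegation pattern can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel source: nat_core stands in for nf_nat_ipv4_fn, nat_out stands in for one of the thin wrappers, and the struct, the verdict enum, and the length pre-check are all illustrative names and choices.

```c
#include <stddef.h>

/* Verdicts named after the kernel's, but defined locally for this sketch. */
enum verdict { VERDICT_DROP = 0, VERDICT_ACCEPT = 1 };

/* Hypothetical stand-in for sk_buff: only the length matters here. */
struct pkt {
    size_t len;
};

#define MIN_IPV4_HDR 20u          /* smallest legal IPv4 header, in bytes */

static unsigned int core_calls;   /* how many times the core function ran */

/* Plays the role of nf_nat_ipv4_fn: all real NAT decisions would live here. */
static enum verdict nat_core(const struct pkt *p)
{
    (void)p;
    core_calls++;
    return VERDICT_ACCEPT;
}

/* Plays the role of a thin wrapper: a small pre-check, then delegate. */
static enum verdict nat_out(const struct pkt *p)
{
    /* Assumed pre-check: a packet too short to hold an IPv4 header
     * is passed through untouched and never reaches the core. */
    if (p->len < MIN_IPV4_HDR)
        return VERDICT_ACCEPT;
    return nat_core(p);
}
```

Calling nat_out with a 10-byte packet returns VERDICT_ACCEPT without touching nat_core, while a 60-byte packet is handed on to it.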

9.9.2 Core Processing Function: nf_nat_ipv4_fn()

Now let's crack open this brain and take a look. The logic of this function isn't short, but every step has its purpose. Let's look at the first half:

static unsigned int nf_nat_ipv4_fn(unsigned int hooknum,
                                   struct sk_buff *skb,
                                   const struct net_device *in,
                                   const struct net_device *out,
                                   int (*okfn)(struct sk_buff *))
{
        struct nf_conn *ct;
        enum ip_conntrack_info ctinfo;
        struct nf_conn_nat *nat;
        /* maniptype == SRC for postrouting. */
        enum nf_nat_manip_type maniptype = HOOK2MANIP(hooknum);

The function starts by declaring its variables and using the HOOK2MANIP macro to convert the current hook number into a manipulation type (maniptype). In Netfilter's design, the POST_ROUTING and LOCAL_IN hooks rewrite the source address (NF_NAT_MANIP_SRC, that is, SNAT), while PRE_ROUTING and LOCAL_OUT rewrite the destination address (NF_NAT_MANIP_DST, that is, DNAT). The macro encodes exactly that mapping.
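The hook-to-manipulation mapping is small enough to restate as a userspace sketch. The enum and function names below are local stand-ins rather than kernel definitions; only the rule itself (source rewriting at POST_ROUTING and LOCAL_IN, destination rewriting elsewhere) is taken from what HOOK2MANIP expresses.

```c
/* Local copy of the hook numbering, for illustration only. */
enum inet_hook {
    HOOK_PRE_ROUTING  = 0,
    HOOK_LOCAL_IN     = 1,
    HOOK_FORWARD      = 2,
    HOOK_LOCAL_OUT    = 3,
    HOOK_POST_ROUTING = 4
};

enum manip_type { MANIP_SRC = 0, MANIP_DST = 1 };

/* Source manipulation happens at POST_ROUTING and LOCAL_IN;
 * every other hook manipulates the destination. */
static enum manip_type hook_to_manip(enum inet_hook hook)
{
    return (hook != HOOK_POST_ROUTING && hook != HOOK_LOCAL_IN)
               ? MANIP_DST
               : MANIP_SRC;
}
```

Written this way, the symmetry is easy to see: a packet's destination is fixed up where it enters the stack (PRE_ROUTING, LOCAL_OUT), and its source is fixed up where it leaves (POST_ROUTING, LOCAL_IN).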

The following assertion is extremely critical; it reveals another implicit convention in the kernel's design:

        /* We never see fragments: conntrack defrags on pre-routing
         * and local-out, and nf_nat_out protects post-routing.
         */
        NF_CT_ASSERT(!ip_is_fragment(ip_hdr(skb)));

This means that by the time a packet reaches the NAT layer, it must be complete. Who do we have to thank for this? The connection tracking layer. It already reassembled the fragments for us when the packet entered PRE_ROUTING or LOCAL_OUT. So in NAT, we don't need to worry about the headaches caused by fragmentation at all—what a relief.
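What "is a fragment" means at the header level can be shown with a short userspace sketch. The function and macro names are illustrative; the bit layout is that of the IPv4 flags and fragment-offset field. A packet counts as a fragment if the "more fragments" flag is set or the fragment offset is non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAG_MF     0x2000u   /* "more fragments" flag */
#define FRAG_OFFSET 0x1FFFu   /* 13-bit fragment offset */

/* frag_off_host: the flags + fragment-offset field of the IPv4 header,
 * already converted to host byte order by the caller. */
static bool is_fragment(uint16_t frag_off_host)
{
    return (frag_off_host & (FRAG_MF | FRAG_OFFSET)) != 0;
}
```

Note that the "don't fragment" bit (0x4000) is deliberately not part of the mask: a packet with only DF set is a complete packet.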

The following code starts dealing with the most core object—the connection tracking entry:

        ct = nf_ct_get(skb, &ctinfo);
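        /* Can't track? It's not due to stress, or conntrack would
         * have dropped it. Hence it's the user's responsibility to
         * packet filter it out, or implement conntrack/NAT for that
         * protocol. ;) --RR
         */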
        if (!ct)
                return NF_ACCEPT;

        /* Don't try to NAT if this packet is not conntracked */
        if (nf_ct_is_untracked(ct))
                return NF_ACCEPT;

Here we try to retrieve the nf_conn object from skb. If we can't get it, or if the packet is marked as "untracked," we let it pass through directly (NF_ACCEPT).

Why let it pass directly? This is defensive programming. If the NAT module can't understand this packet (or if conntrack didn't track it at all), recklessly modifying the address will only make a mess. It's better to do nothing than to do it wrong. The ;) in the original comment is a joke signed "RR", the initials of Rusty Russell (the original author of netfilter/iptables), meaning "this is your problem now, the kernel washes its hands of it."

But if we do get the ct, the real challenge begins—we need to allocate NAT extension space for this connection.

        nat = nfct_nat(ct);
        if (!nat) {
                /* NAT module was loaded late. */
                if (nf_ct_is_confirmed(ct))
                        return NF_ACCEPT;
                nat = nf_ct_ext_add(ct, NF_CT_EXT_NAT, GFP_ATOMIC);
                if (nat == NULL) {
                        pr_debug("failed to add NAT extension\n");
                        return NF_ACCEPT;
                }
        }

The logic here is a bit tricky. We try to get the NAT extension (nf_conn_nat) associated with this connection entry. If we can't get it, it might be because the NAT module was just loaded while the connection already existed.

At this point, there's a fork in the road:

If the connection has already been "confirmed" (nf_ct_is_confirmed is true), it means the connection's state is already stable. Trying to attach a NAT extension to it now is too dangerous, so we let it pass directly.
If the connection hasn't been confirmed yet (for example, this is the first packet), we call nf_ct_ext_add to dynamically add a NAT extension to it.

Note the GFP_ATOMIC flag here: hook callbacks run in the packet-processing path, usually in softirq context, where the code must not sleep.
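The "get the extension, or add it only if the connection is still unconfirmed" decision can be modelled in a few lines of userspace C. Everything here is a hypothetical stand-in: struct conn and struct nat_ext are toy types, and calloc merely plays the role of an allocation that is allowed to fail.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy NAT extension: the contents do not matter for this sketch. */
struct nat_ext {
    int masq_index;
};

/* Toy connection entry. */
struct conn {
    bool confirmed;        /* already inserted into the conntrack table? */
    struct nat_ext *nat;   /* NULL until a NAT extension is attached */
};

/* Returns the NAT extension, attaching one if that is still allowed.
 * NULL means "leave this packet alone". */
static struct nat_ext *get_or_add_nat(struct conn *ct)
{
    if (ct->nat)
        return ct->nat;          /* extension already present */
    if (ct->confirmed)
        return NULL;             /* too late to grow a confirmed entry */
    ct->nat = calloc(1, sizeof(*ct->nat));   /* may fail and return NULL */
    return ct->nat;
}
```

In both NULL cases the caller's reaction is the same as in the kernel excerpt: accept the packet unmodified rather than attempt NAT without state.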

9.9.3 Making Decisions Based on Connection State

After obtaining ct and nat, the code enters a massive switch statement to decide how to handle the packet based on its current state within the connection (ctinfo).

The first case is "related packets":

        switch (ctinfo) {
        case IP_CT_RELATED:
        case IP_CT_RELATED_REPLY:
                if (ip_hdr(skb)->protocol == IPPROTO_ICMP) {
                        if (!nf_nat_icmp_reply_translation(skb, ct, ctinfo,
                                                           hooknum))
                                return NF_DROP;
                        else
                                return NF_ACCEPT;
                }
                /* Fall thru... (Only ICMPs can be IP_CT_IS_REPLY) */

IP_CT_RELATED typically means this is an ICMP error message (like "port unreachable") or a data connection for a protocol like FTP. If it's the ICMP protocol, the kernel calls the specialized nf_nat_icmp_reply_translation to handle it—because ICMP messages embed the original packet's IP header, making them tricky to modify and requiring special care.

If it's not ICMP, the code falls through into the IP_CT_NEW logic. This is our most common "new connection" scenario:

        case IP_CT_NEW:
                /* Seen it before? This can happen for loopback, retrans,
                 * or local packets.
                 */
                if (!nf_nat_initialized(ct, maniptype)) {
                        unsigned int ret;

                        ret = nf_nat_rule_find(skb, hooknum, in, out, ct);
                        if (ret != NF_ACCEPT)
                                return ret;

This is where NAT rules actually take effect. nf_nat_initialized checks whether we have already set up the NAT rules for this direction (source or destination) of the connection. If not, we call nf_nat_rule_find.

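This function runs the packet through the iptables NAT table (ipt_do_table). If it finds a matching rule in the table (such as SNAT --to-source 192.168.1.2), it records the modification action in the nat structure and initializes the necessary reverse mapping logic. If no rule matches, the lookup returns NF_ACCEPT and the packet's addresses are left as they are; the kernel still records a "null" binding for the connection so that later packets skip the table. Any other verdict, such as NF_DROP, is returned to Netfilter immediately.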

But if we have already set up NAT for this connection—for instance, if this is the second packet of a new connection or a retransmitted packet:

                } else {
                        pr_debug("Already setup manip %s for ct %p\n",
                                 maniptype == NF_NAT_MANIP_SRC ? "SRC" : "DST",
                                 ct);
                        if (nf_nat_oif_changed(hooknum, ctinfo, nat, out))
                                goto oif_changed;
                }
                break;

In this case, there's no need to consult the table again (which is highly efficient). All we need to do is check whether the output interface has changed (for example, if routing policies changed and the packet suddenly goes out a different network interface). If it hasn't changed, all is well; if it has, this NAT session might be invalid, and we jump to the oif_changed label to kill it.
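The "consult the table once, then reuse the stored binding" behaviour can be illustrated with a toy model. This is a sketch under stated assumptions, not how the kernel stores rules or bindings: struct snat_rule, struct conn, the table_lookups counter, and the use of 0 as a "no rule matched" sentinel are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SNAT rule: match a source prefix, rewrite to new_src. */
struct snat_rule {
    uint32_t net, mask;    /* matches when (src & mask) == net */
    uint32_t new_src;
};

/* Toy per-connection NAT state. */
struct conn {
    bool     src_initialized;   /* has a source binding been chosen? */
    uint32_t mapped_src;        /* 0 is used here to mean "keep the address" */
};

static unsigned int table_lookups;   /* counts walks of the rule table */

/* First matching rule wins, as in an iptables chain. */
static uint32_t rule_find(const struct snat_rule *rules, size_t n, uint32_t src)
{
    table_lookups++;
    for (size_t i = 0; i < n; i++)
        if ((src & rules[i].mask) == rules[i].net)
            return rules[i].new_src;
    return 0;
}

/* Returns the source address the packet should leave with. */
static uint32_t snat_source(struct conn *ct, const struct snat_rule *rules,
                            size_t n, uint32_t src)
{
    if (!ct->src_initialized) {          /* first packet: consult the table */
        ct->mapped_src = rule_find(rules, n, src);
        ct->src_initialized = true;
    }
    return ct->mapped_src ? ct->mapped_src : src;
}
```

With a single rule mapping 10.0.0.0/8 to 192.168.1.2, two packets of the same connection from 10.0.0.5 both leave with the new source, but table_lookups increases only once.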

The last case is the norm:

        default:
                /* ESTABLISHED */
                NF_CT_ASSERT(ctinfo == IP_CT_ESTABLISHED ||
                             ctinfo == IP_CT_ESTABLISHED_REPLY);
                if (nf_nat_oif_changed(hooknum, ctinfo, nat, out))
                        goto oif_changed;
        }

For connections in the ESTABLISHED state, the logic is very simple: check for interface changes, then let the packet through directly. Because the rules were settled long ago and the mapping relationship has been saved, we just need to execute.
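Taken together, the three branches form a small decision table. The sketch below restates it as a pure function so the outcomes can be checked in isolation. It is a simplification with hypothetical names: the reply variants of each state are collapsed into one value, and the inputs that the kernel derives from ct, nat, and the output device are passed in as plain booleans.

```c
#include <stdbool.h>

/* Collapsed connection states (reply variants folded in). */
enum ct_state { CT_ESTABLISHED, CT_RELATED, CT_NEW };

enum nat_action {
    ACT_ICMP_TRANSLATE,   /* hand off to ICMP-specific translation */
    ACT_RULE_LOOKUP,      /* consult the NAT table, then translate */
    ACT_TRANSLATE,        /* apply the mapping that is already stored */
    ACT_KILL              /* output interface changed: drop and tear down */
};

static enum nat_action nat_decide(enum ct_state state, bool is_icmp,
                                  bool initialized, bool oif_changed)
{
    switch (state) {
    case CT_RELATED:
        if (is_icmp)
            return ACT_ICMP_TRANSLATE;
        /* fall through: non-ICMP related traffic is treated like new */
    case CT_NEW:
        if (!initialized)
            return ACT_RULE_LOOKUP;
        return oif_changed ? ACT_KILL : ACT_TRANSLATE;
    default:   /* established, in either direction */
        return oif_changed ? ACT_KILL : ACT_TRANSLATE;
    }
}
```

For example, nat_decide(CT_NEW, false, false, false) yields ACT_RULE_LOOKUP, while an established connection whose output interface has changed yields ACT_KILL.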

The tail of the function is the actual "hands-on" part, followed by the error path:

        return nf_nat_packet(ct, ctinfo, hooknum, skb);

oif_changed:
        nf_ct_kill_acct(ct, ctinfo, skb);
        return NF_DROP;

nf_nat_packet is the function we analyzed in the previous section. Based on the mapping information recorded in the nat structure, it calls manip_pkt to actually rewrite the IP header and ports.

If we reach oif_changed, it means the environment has changed (for example, a network interface went down or the route switched), and this NAT session is no longer trustworthy. The kernel will ruthlessly call nf_ct_kill_acct to kill this connection and drop the current packet.

9.9.4 Connection Tracking Extensions

You might ask: Since every connection needs to store so much state (IP mappings, port mappings), isn't a lot of memory wasted if most connections in a network don't even need NAT?

That's a great question. Before kernel 2.6.23, this was indeed a problem. Kernel 2.6.23 solved it by introducing the "Connection Tracking Extensions" mechanism.

The core idea behind this mechanism is: allocate on demand.

If you don't load the NAT module, the conntrack layer will absolutely never allocate the memory used to store NAT mapping information. If you want to label connections (for example, using iptables -m connlabel), as long as you haven't loaded the label module, it won't occupy that memory either.
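The allocate-on-demand idea can be sketched with a toy connection object whose extension slots stay empty until someone asks for them. This is only a model: the kernel's real layout differs, and the names ext_add, ext_find, conn_free_exts, and the enum values are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical extension identifiers. */
enum ext_id { EXT_NAT, EXT_ACCT, EXT_TSTAMP, EXT_NUM };

/* Toy connection entry: one optional slot per extension type. */
struct conn {
    void *ext[EXT_NUM];   /* all NULL until a module asks for space */
};

/* Attach zeroed storage for one extension.
 * Returns NULL if it is already attached or memory is exhausted. */
static void *ext_add(struct conn *ct, enum ext_id id, size_t size)
{
    if (ct->ext[id])
        return NULL;
    ct->ext[id] = calloc(1, size);
    return ct->ext[id];
}

/* Look up an extension; NULL means this connection never paid for it. */
static void *ext_find(const struct conn *ct, enum ext_id id)
{
    return ct->ext[id];
}

/* Release every attached extension. */
static void conn_free_exts(struct conn *ct)
{
    for (int i = 0; i < EXT_NUM; i++) {
        free(ct->ext[i]);
        ct->ext[i] = NULL;
    }
}
```

A connection that only has accounting attached returns NULL from ext_find(ct, EXT_NAT): no NAT storage exists for it at all.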

Extension Registration and Attachment

Every module that wants to extend conntrack functionality needs to define an nf_ct_ext_type structure and call nf_ct_extend_register() to register it with the conntrack core. Unregistering is done via nf_ct_extend_unregister().

Each extension module should also provide a function that attaches its extension to an nf_conn object. This function is typically called from init_conntrack(), when a new connection tracking entry is initialized.

For example:

The timestamp extension module provides nf_ct_tstamp_ext_add()
The label extension module provides nf_ct_labels_ext_add()

The infrastructure for this entire extension system is implemented in net/netfilter/nf_conntrack_extend.c.
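The register/unregister half of the mechanism amounts to a small table of extension types indexed by id. The sketch below is a userspace model with hypothetical names (ext_register, ext_unregister, struct ext_type); it is not the signature or behaviour of nf_ct_extend_register(), and the choice of error codes is an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical extension identifiers. */
enum ext_id { EXT_NAT, EXT_ACCT, EXT_TSTAMP, EXT_NUM };

/* Toy description of one extension type. */
struct ext_type {
    enum ext_id id;
    size_t len;     /* bytes this extension needs per connection */
};

/* One slot per id; NULL means "no module has claimed this id". */
static const struct ext_type *ext_types[EXT_NUM];

/* Claim an id. Fails if the id is out of range or already taken. */
static int ext_register(const struct ext_type *t)
{
    if ((unsigned int)t->id >= EXT_NUM)
        return -EINVAL;
    if (ext_types[t->id])
        return -EBUSY;
    ext_types[t->id] = t;
    return 0;
}

/* Release an id, but only if this type is the one that holds it. */
static void ext_unregister(const struct ext_type *t)
{
    if ((unsigned int)t->id < EXT_NUM && ext_types[t->id] == t)
        ext_types[t->id] = NULL;
}
```

A module would register its type once at load time and unregister it at unload time; per-connection attachment is a separate step performed by the module's own attach helper.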

As of this writing, the kernel ships the following extension modules (all located in the net/netfilter/ directory):

nf_conntrack_timestamp.c: Records when a connection was seen; also useful for debugging.
nf_conntrack_timeout.c: Allows setting different timeout values for different protocols or connections.
nf_conntrack_acct.c: Connection accounting, counting the traffic bytes and packet numbers for each connection.
nf_conntrack_ecache.c: Event cache, used to send connection state change events to userspace (working with tools like conntrackd).
nf_conntrack_labels.c: Labels connections (like a bitmap), used for complex policy matching.
nf_conntrack_helper.c: This is the Helper mentioned earlier, providing support for protocols like FTP and SIP that require "special handling."

We can think of these extensions as adding attachments to a case file (the nf_conn) in a police station (conntrack). The default case file only contains a slip of basic suspect information (the tuple). If you need to monitor their phone calls (timestamp) or calculate their expenses (acct), you slip the corresponding little notes inside. If the case doesn't need them, don't slip them in—lest your filing cabinet (memory) explode.
