
7.3 The ARP Protocol (IPv4)

At the end of the previous section, we left a cliffhanger: when the kernel is ready to send a packet but finds the neighbor table completely empty, what exactly happens?

Now we officially enter the world of IPv4 to answer this question. In IPv4, this question has a name: ARP (Address Resolution Protocol).

Although the ARP protocol itself was defined as early as 1982 in RFC 826, in the Linux kernel, it is deeply intertwined with the neighbor subsystem, evolving into a rather subtle mechanism. It is not as simple as "send a request, receive a reply"—it is mixed with caching policies, device compatibility, and even delay logic designed to prevent broadcast storms.

Let's start with the basics of the ARP protocol itself and see how it bridges the huge gap between IP addresses and MAC addresses.

ARP Protocol Basics and Header Structure

In the Ethernet world, every device has two identities: a Layer 3 IP address and a Layer 2 MAC address.

When the kernel wants to send a packet, it holds the destination IP (dst_ip), but before stuffing it into an Ethernet frame, it must first know the peer's MAC address. Without it, the packet cannot go out.

This is the reason ARP exists: it is responsible for shouting "Who owns IP address X?" and then waiting for X's owner to raise their hand and reply "Me, my MAC is Y."

This process is very intuitive at the protocol level:

Request: The source host sends a broadcast packet containing the target IP address.
Reply: If a host finds that this IP belongs to it, it sends a unicast packet back containing its own MAC address.

In the eyes of the Linux kernel, all ARP information is packed into a structure called arphdr. This structure is the generic format definition of the ARP protocol:

struct arphdr {
    __be16          ar_hrd;         /* Hardware address format (hardware type)    */
    __be16          ar_pro;         /* Protocol address format (protocol type)    */
    unsigned char   ar_hln;         /* Hardware address length (6 bytes for MAC)  */
    unsigned char   ar_pln;         /* Protocol address length (4 bytes for IPv4) */
    __be16          ar_op;          /* ARP opcode (request/reply)                 */
#if 0
    /* These fields immediately follow the header on the wire but are not members of struct arphdr */
    unsigned char           ar_sha[ETH_ALEN];       /* sender hardware address      */
    unsigned char           ar_sip[4];              /* sender IP address            */
    unsigned char           ar_tha[ETH_ALEN];       /* target hardware address      */
    unsigned char           ar_tip[4];              /* target IP address            */
#endif
};

Note: The fields in that #if 0 block are not members of arphdr. Logically, they immediately follow the ARP header, but in the kernel code, they must be read by manually offsetting the pointer. We will see this detail later in arp_process().

For quick translation, here are the key fields:

ar_hrd: Hardware type. Ethernet is 0x01. The kernel has a bunch of ARPHRD_XXX macros defining this.
ar_pro: Protocol type. For IPv4, this is 0x0800 (which is ETH_P_IP).
ar_hln: Hardware address length. An Ethernet MAC address is 6 bytes.
ar_pln: Protocol address length. An IPv4 address is 4 bytes.
ar_op: Opcode. ARPOP_REQUEST (1) indicates a request, ARPOP_REPLY (2) indicates a reply.
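To make the field list concrete, here is a minimal user-space sketch that fills in an Ethernet/IPv4 ARP request. The struct eth_arp_packet and the function build_arp_request() are hypothetical names invented for this illustration; they are not kernel definitions. The sketch assumes a compiler that supports the packed attribute (GCC or Clang).

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htons() */

/* Hypothetical user-space mirror of an Ethernet/IPv4 ARP packet:
 * the five arphdr fields plus the four trailing address fields. */
struct eth_arp_packet {
    uint16_t ar_hrd;      /* hardware type, network byte order */
    uint16_t ar_pro;      /* protocol type, network byte order */
    uint8_t  ar_hln;      /* hardware address length           */
    uint8_t  ar_pln;      /* protocol address length           */
    uint16_t ar_op;       /* opcode, network byte order        */
    uint8_t  ar_sha[6];   /* sender hardware address           */
    uint8_t  ar_sip[4];   /* sender IP address                 */
    uint8_t  ar_tha[6];   /* target hardware address           */
    uint8_t  ar_tip[4];   /* target IP address                 */
} __attribute__((packed));

/* Fill in a "who has tip?" request. THA stays zeroed because the
 * target's MAC is exactly what the request is asking for. */
void build_arp_request(struct eth_arp_packet *p,
                       const uint8_t sha[6],
                       const uint8_t sip[4],
                       const uint8_t tip[4])
{
    memset(p, 0, sizeof(*p));
    p->ar_hrd = htons(1);        /* Ethernet            */
    p->ar_pro = htons(0x0800);   /* IPv4 (ETH_P_IP)     */
    p->ar_hln = 6;
    p->ar_pln = 4;
    p->ar_op  = htons(1);        /* ARPOP_REQUEST       */
    memcpy(p->ar_sha, sha, 6);
    memcpy(p->ar_sip, sip, 4);
    memcpy(p->ar_tip, tip, 4);
}
```

With 6-byte hardware addresses and 4-byte protocol addresses, the packed struct is 28 bytes: 8 bytes of fixed header followed by 20 bytes of addresses.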

Figure 7-1 shows what a typical Ethernet ARP packet looks like:

(Figure 7-1: ARP packet layout. Hardware type, protocol type, address lengths, and opcode, followed by SHA, SIP, THA, and TIP.)

Different Neighbors, Different Fates

In the kernel code, ARP does not treat all devices equally. Not all network devices need ARP (such as pure point-to-point PPP links), and not all devices support hardware header caching.

To handle these differences, the kernel defines four neigh_ops instances, which determine the behavior pattern of neighbor entries:

arp_direct_ops: Used for devices that do not need ARP. The header_ops for such devices is usually NULL. When sending packets, it directly calls neigh_direct_output(), which is essentially a wrapper around dev_queue_xmit().
arp_generic_ops: Used for devices that do not support hardware header caching.
arp_hh_ops: This is the standard configuration for most Ethernet devices. It uses eth_header_cache() to cache L2 headers, accelerating packet transmission.
arp_broken_ops: Used for certain legacy or special non-standard devices (such as ROSE, AX25, NETROM).

Which one to choose is determined in arp_constructor() based on the characteristics of net_device. For a standard Ethernet NIC, ether_setup() will set header_ops to eth_header_ops, which means the kernel will initialize the cache callback, ultimately matching arp_hh_ops.
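The selection rule can be summarized with a small user-space model. This is only a sketch of the decision described above, not the body of arp_constructor(): struct model_dev, enum hw_type, and pick_neigh_ops() are hypothetical names, and the three boolean-style inputs are simplified stand-ins for the real net_device fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel's device fields. */
enum hw_type { HW_ETHER, HW_AX25, HW_ROSE, HW_NETROM, HW_OTHER };

struct model_dev {
    enum hw_type type;
    bool has_header_ops;    /* models dev->header_ops != NULL        */
    bool has_header_cache;  /* models a header cache callback exists */
};

/* Returns the name of the neigh_ops instance the text associates
 * with this kind of device. */
const char *pick_neigh_ops(const struct model_dev *dev)
{
    if (!dev->has_header_ops)
        return "arp_direct_ops";       /* no L2 header, no ARP needed */

    switch (dev->type) {
    case HW_AX25:
    case HW_ROSE:
    case HW_NETROM:
        return "arp_broken_ops";       /* legacy special cases */
    default:
        break;
    }

    /* Header caching available: take the fast path ops. */
    return dev->has_header_cache ? "arp_hh_ops" : "arp_generic_ops";
}
```

For a model of a standard Ethernet NIC (header ops present, cache callback present), the function returns "arp_hh_ops", matching the outcome described above.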

Initiating a Request: When the Kernel Doesn't Know the MAC Address

Now, let's shift our focus to the sending path.

When a packet reaches the network layer (L3) exit ip_finish_output2(), it must cross the L3-to-L2 boundary. At this point, the kernel only holds the next-hop IP address.

It first checks the ARP table (arp_tbl):

/* net/ipv4/ip_output.c */
static inline int ip_finish_output2(struct sk_buff *skb)
{
    struct dst_entry *dst = skb_dst(skb);
    struct rtable *rt = (struct rtable *)dst;
    struct net_device *dev = dst->dev;
    struct neighbour *neigh;
    u32 nexthop;

    /* ... unrelated code omitted ... */

    /* Get the next-hop IP */
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);

    /* Try to find it in the neighbor table */
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);

    /* Not found: create a new neighbor entry (its state at this point is likely NUD_NONE) */
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

    if (!IS_ERR(neigh)) {
        /* Try to send through the neighbor */
        int res = dst_neigh_output(dst, neigh, skb);
        /* ... */
    }
    /* ... */
}

Here is a key point: if no neighbor is found, the kernel creates a new neighbor entry. But at this point, this entry does not yet have an L2 address bound to it. It is just an empty shell.

The main event follows in dst_neigh_output(). If this is the first time sending a packet, the neighbor's state (nud_state) is definitely not NUD_CONNECTED. This means the kernel cannot send the data packet directly; it must stop and resolve the address first:

/* include/net/dst.h */
static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
                                   struct sk_buff *skb)
{
    const struct hh_cache *hh;

    /* ... timestamp update logic ... */

    hh = &n->hh;
    /* Fast path: only when the state is NUD_CONNECTED and a cached header exists */
    if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
        return neigh_hh_output(hh, skb);
    else
        /* Otherwise call the output method (for ARP, usually neigh_resolve_output) */
        return n->output(n, skb);
}

For the ARP protocol, the n->output at this point points to neigh_resolve_output(). This function does something very important but often overlooked: it temporarily stores the data packet.

It calls neigh_event_send(), and ultimately through __skb_queue_tail(&neigh->arp_queue, skb), hangs your data packet onto the neighbor structure's arp_queue queue.

⚠️ Don't let the queue overflow. The arp_queue has a length limit. If ARP never resolves, subsequent packets keep piling up in it, and once it is full the kernel starts dropping packets. From user space, pings simply fail without any error: the packets vanish into this black hole.
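The overflow behavior can be pictured with a tiny bounded queue. This is a sketch under stated assumptions, not the kernel's implementation: the names pending_queue, pending_enqueue(), and pending_peek() are hypothetical, the limit of 3 is purely illustrative, and the drop-oldest policy is an assumption about how the overflow is resolved.

```c
#include <stddef.h>

#define PENDING_MAX 3   /* illustrative limit, not the kernel's value */

struct pending_queue {
    int    pkts[PENDING_MAX];  /* packet ids stand in for sk_buffs */
    size_t head;               /* index of the oldest packet       */
    size_t len;                /* number of queued packets         */
    unsigned long discards;    /* packets dropped due to overflow  */
};

/* Queue a packet while resolution is pending. When the queue is
 * full, the oldest packet is discarded to make room (assumed policy). */
void pending_enqueue(struct pending_queue *q, int pkt_id)
{
    if (q->len == PENDING_MAX) {
        q->head = (q->head + 1) % PENDING_MAX;  /* drop the oldest */
        q->len--;
        q->discards++;
    }
    q->pkts[(q->head + q->len) % PENDING_MAX] = pkt_id;
    q->len++;
}

/* Oldest packet still waiting, or -1 if the queue is empty. */
int pending_peek(const struct pending_queue *q)
{
    return q->len ? q->pkts[q->head] : -1;
}
```

Nothing in this model reports a drop back to the sender; only the discards counter grows. That is the "black hole" effect: the application sees no error, just missing packets.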

After temporarily storing the data packet, the kernel triggers a resolution action. This is usually done by the timer callback neigh_timer_handler calling neigh_probe():

/* net/core/neighbour.c */
static void neigh_probe(struct neighbour *neigh)
        __releases(neigh->lock)
{
    struct sk_buff *skb = skb_peek(&neigh->arp_queue);
    /* ... */

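    /* Call the protocol-specific solicit method; for ARP this is arp_solicit */
    neigh->ops->solicit(neigh, skb);

    atomic_inc(&neigh->probes);

    /* Free the skb handed to solicit(). In the kernel source the elided lines
     * above swap skb for a private copy, so the original stays on arp_queue. */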
    kfree_skb(skb);
}

Constructing the ARP Request: arp_solicit()

Now we arrive at arp_solicit(), the place where the ARP request packet is truly built. There are a few interesting details here regarding source address selection, which are the root cause of many network failures.

When sending an ARP request, we need to fill in the source IP (saddr). Which one do we choose?

The kernel provides a sysctl parameter arp_announce to control this logic:

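0 (default): Use any local address, configured on any interface.
1: Try to use an address within the target's subnet. If none exists, fall back to level 2.
2: Always use the best local address for the target, ignoring the source address of the packet that triggered the request.

This parameter can be set globally under /proc/sys/net/ipv4/conf/all/arp_announce and per interface under /proc/sys/net/ipv4/conf/<interface>/arp_announce; the larger of the two values takes effect.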
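Before reading the kernel code, it may help to see the three levels as a plain decision function. The following is a user-space sketch, not kernel code: struct saddr_hint, effective_arp_announce(), and pick_arp_saddr() are hypothetical names, the fallback argument stands in for whatever address the kernel would select for the interface on its own, and the "larger value wins" rule is an assumption taken from the kernel's sysctl documentation.

```c
#include <stdint.h>
#include <stdbool.h>

/* Effective level: the larger of the "all" and per-interface values
 * (assumption based on the kernel's ip-sysctl documentation). */
int effective_arp_announce(int conf_all, int conf_iface)
{
    return conf_all > conf_iface ? conf_all : conf_iface;
}

/* Hypothetical inputs describing the packet that triggered resolution. */
struct saddr_hint {
    uint32_t pkt_saddr;     /* source IP of the queued packet        */
    bool     is_local;      /* pkt_saddr is one of our own addresses */
    bool     same_subnet;   /* pkt_saddr shares a subnet with target */
};

/* Returns the sender IP to place in the ARP request. */
uint32_t pick_arp_saddr(int level, const struct saddr_hint *h,
                        uint32_t fallback)
{
    switch (level) {
    case 1:   /* prefer a source in the target's subnet */
        if (h->is_local && h->same_subnet)
            return h->pkt_saddr;
        return fallback;
    case 2:   /* do not use the packet's source at all */
        return fallback;
    case 0:   /* any local address is acceptable */
    default:
        return h->is_local ? h->pkt_saddr : fallback;
    }
}
```

On a multi-homed host this is where surprises come from: at level 0 the request may carry a source IP that belongs to a different interface, which some peers or switches refuse to answer.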

Here is how the code handles it:

/* net/ipv4/arp.c */
static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
{
    __be32 saddr = 0;
    struct net_device *dev = neigh->dev;
    __be32 target = *(__be32 *)neigh->primary_key;
    /* ... */

    switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
    default:
    case 0: /* Default: any local IP will do */
        if (skb && inet_addr_type(dev_net(dev), ip_hdr(skb)->saddr) == RTN_LOCAL)
            saddr = ip_hdr(skb)->saddr;
        break;
    case 1: /* Prefer an address in the same subnet */
        if (!skb)
            break;
        saddr = ip_hdr(skb)->saddr;
        if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
            /* Check whether saddr and target are on the same link */
            if (inet_addr_onlink(in_dev, target, saddr))
                break;
        }
        saddr = 0; /* Nothing suitable: clear it so it is reselected below */
        break;
    case 2: /* Do not use the packet's source; always select below */
        break;
    }

    /* If saddr is still unset, let the kernel pick the most suitable one */
    if (!saddr)
        saddr = inet_select_addr(dev, target, RT_SCOPE_LINK);

After selecting the source IP, there is another question: is this a broadcast request or a unicast request?

Normally, we don't know the peer's MAC, so we can only broadcast. However, the kernel supports a "unicast probe" mechanism. If there is an old entry in the neighbor table (the state is invalid but a MAC record exists), the kernel might first try a unicast probe, which can reduce broadcast storms on the LAN.

    probes -= neigh->parms->ucast_probes;
    if (probes < 0) {
        /* Unicast probes are still available: send to the last known MAC
         * (and log if the entry is not in a NUD_VALID state) */
        if (!(neigh->nud_state & NUD_VALID))
            pr_debug("trying to ucast probe in NUD_INVALID\n");
        neigh_ha_snapshot(dst_ha, neigh, dev);
        dst_hw = dst_ha;
    }
    /* ... */

Finally, it calls arp_send() to send the packet out. If it is a broadcast, the dst_hw parameter is NULL:

    arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
             dst_hw, dev->dev_addr, NULL);
}
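The unicast-versus-broadcast choice made in arp_solicit() boils down to one subtraction. The sketch below models only that arithmetic; probe_is_unicast() is a hypothetical helper, and it ignores any other probe budgets the real function may consult.

```c
#include <stdbool.h>

/* Mirrors the arithmetic in the excerpt: while fewer probes have been
 * sent than the configured unicast budget, the next probe is unicast
 * to the last known MAC; after that, probes are broadcast. */
bool probe_is_unicast(int probes_sent, int ucast_probes)
{
    return probes_sent - ucast_probes < 0;
}
```

With a unicast budget of 3, the probes numbered 0, 1, and 2 go to the cached MAC, and probe 3 onward falls back to broadcast. With a budget of 0, every probe is a broadcast.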

Send Encapsulation: arp_send()

arp_send() is responsible for allocating the SKB, filling in the headers, and throwing it to the driver layer.

There is a very critical check here: IFF_NOARP.

/* net/ipv4/arp.c */
void arp_send(int type, int ptype, __be32 dest_ip,
              struct net_device *dev, __be32 src_ip,
              const unsigned char *dest_hw, const unsigned char *src_hw,
              const unsigned char *target_hw)
{
    struct sk_buff *skb;

    /* Bail out if ARP is disabled on this device */
    if (dev->flags & IFF_NOARP)
        return;

    /* Allocate an SKB and fill in the ARP header */
    skb = arp_create(type, ptype, dest_ip, dev, src_ip,
                     dest_hw, src_hw, target_hw);
    if (skb == NULL)
        return;

    /* Pass through Netfilter, then transmit */
    arp_xmit(skb);
}

Which devices get the IFF_NOARP flag?

Manually disabled by the administrator: ip link set eth0 arp off.
Tunnel devices (like IPIP), PPP devices, etc. Because these links are point-to-point, or don't run Ethernet at all, they don't need ARP.

Reception and Processing: arp_process()

Sending the packet out is only half the battle; we still need to receive the reply. The ARP receive entry point is arp_rcv().

It first performs some validity checks (whether the device supports ARP, whether the packet length is sufficient, whether it is sent to loopback), and then calls arp_process() to enter the core processing logic.

arp_process() has a heavy workload, handling three situations:

Requests destined for the local machine: Needs a reply.
Replies destined for the local machine: Needs to update the neighbor table.
Requests that need forwarding (Proxy ARP): Needs to reply on behalf of someone else.

Step 1: Parsing the Header

Because the arphdr structure does not contain the variable-length address data, the kernel must manually calculate offsets to extract the data. This is a typical style in network stack programming:

static int arp_process(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct arphdr *arp;
    unsigned char *arp_ptr;
    __be32 sip, tip; /* Source IP, Target IP */
    unsigned char *sha; /* Source Hardware Addr */

    arp = arp_hdr(skb);
    arp_ptr = (unsigned char *)(arp + 1); /* Skip past struct arphdr */

    /* Extract SHA (sender MAC) */
    sha = arp_ptr;
    arp_ptr += dev->addr_len;

    /* Extract SIP (sender IP) */
    memcpy(&sip, arp_ptr, 4);
    arp_ptr += 4;

    /* Skip THA (target MAC); it is not used here */
    arp_ptr += dev->addr_len;

    /* Extract TIP (target IP) */
    memcpy(&tip, arp_ptr, 4);

    /* Drop invalid requests for multicast or loopback targets */
    if (ipv4_is_multicast(tip) || ...)
        goto out;
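The same pointer walk can be reproduced in user space against a raw byte buffer. This is a sketch with hypothetical names (struct arp_addrs, parse_arp_addrs()); unlike the kernel excerpt, it also performs its own length check, since there is no arp_rcv() in front of it to validate the packet.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ARP_FIXED_LEN 8   /* hrd(2) + pro(2) + hln(1) + pln(1) + op(2) */

struct arp_addrs {
    const uint8_t *sha;   /* points into the packet buffer */
    const uint8_t *tha;   /* points into the packet buffer */
    uint32_t sip;         /* network byte order            */
    uint32_t tip;         /* network byte order            */
};

/* Walk the variable-length tail of an ARP packet carrying IPv4
 * addresses. Returns 0 on success, -1 if the buffer is too short. */
int parse_arp_addrs(const uint8_t *pkt, size_t len, size_t hw_len,
                    struct arp_addrs *out)
{
    const uint8_t *p;

    if (len < ARP_FIXED_LEN + 2 * (hw_len + 4))
        return -1;

    p = pkt + ARP_FIXED_LEN;    /* skip the fixed header */

    out->sha = p;               /* sender hardware address */
    p += hw_len;
    memcpy(&out->sip, p, 4);    /* sender IP; memcpy avoids unaligned loads */
    p += 4;
    out->tha = p;               /* target hardware address */
    p += hw_len;
    memcpy(&out->tip, p, 4);    /* target IP */

    return 0;
}
```

Passing the hardware address length as a parameter mirrors the kernel's use of dev->addr_len: the offsets of SIP, THA, and TIP all depend on it, which is exactly why these fields cannot be fixed members of the header struct.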

Step 2: Handling DAD (Duplicate Address Detection) Requests

Before handling normal business, the kernel first takes a look at the source IP: sip.

If sip is 0, this is a special ARP Probe (RFC 2131), used for DAD (Duplicate Address Detection).

Although IPv6 mandates DAD, in IPv4 it is optional and is typically triggered by a tool such as arping. If you run arping -D to detect IP conflicts, the packets it sends carry a SIP of 0.

If the kernel receives this packet and is the owner of the Target IP, it must reply. Note: when replying, the kernel usually does not add the requester to its own neighbor table, because the requester's IP is clearly not yet determined.
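The rule for probes can be written down as a two-field decision. This is a simplified model with hypothetical names (struct request_decision, decide_request()); it deliberately leaves out the sysctl policies and Proxy ARP and only captures the "reply but do not learn" behavior described above.

```c
#include <stdbool.h>
#include <stdint.h>

struct request_decision {
    bool reply;          /* send an ARP reply                       */
    bool learn_sender;   /* record the sender in the neighbor cache */
};

/* `sip` is the sender IP from the request; `tip_is_local` says whether
 * the target IP belongs to this host. */
struct request_decision decide_request(uint32_t sip, bool tip_is_local)
{
    struct request_decision d = { false, false };

    if (!tip_is_local)
        return d;                 /* not ours: nothing to do in this model */

    d.reply = true;               /* tell the prober the address is taken */
    d.learn_sender = (sip != 0);  /* a probe carries no usable sender IP  */
    return d;
}
```

For a probe (sip of 0) aimed at a local address, the model replies and learns nothing; for an ordinary request, it both replies and records the sender.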

Step 3: Handling Requests Destined for the Local Machine

This is the most common scenario: tip is my IP.

The kernel first checks the routing table to confirm that tip is a local address (RTN_LOCAL). Then, it faces two decision points:

Ignore policy (arp_ignore): Should I reply?
Filter policy (arp_filter): Is the reply legitimate?

The arp_ignore parameter determines how "aloof" the kernel is:

0: Reply as long as the target IP is local, regardless of which interface it is configured on.
1: Reply only when the target IP is configured on the receiving interface (this keeps addresses on other interfaces from answering).
2 and above: Stricter still. Level 2, for example, also requires the sender's IP to be in the same subnet as the target on that interface.
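A user-space model of the first three levels makes the differences easier to compare. This is a sketch, not the kernel's arp_ignore(): struct if_addr and arp_ignore_model() are hypothetical, addresses are kept in host byte order for readability, and any level other than 0, 1, or 2 is simply treated as "ignore" here.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical description of one IPv4 address on the receiving
 * interface; all values are in host byte order for simplicity. */
struct if_addr {
    uint32_t addr;
    uint32_t mask;
};

/* Returns true if the request should be ignored. `tip` is assumed to
 * have been confirmed as a local address somewhere on the machine. */
bool arp_ignore_model(int level, const struct if_addr *addrs, int n,
                      uint32_t sip, uint32_t tip)
{
    int i;

    if (level == 0)
        return false;             /* answer for any local address */

    for (i = 0; i < n; i++) {
        if (addrs[i].addr != tip)
            continue;             /* tip is not this address */
        if (level == 1)
            return false;         /* tip is on the receiving interface */
        if (level == 2 &&
            (sip & addrs[i].mask) == (tip & addrs[i].mask))
            return false;         /* and the sender is in the same subnet */
    }
    return true;                  /* no configured address qualified */
}
```

The addrs array describes only the interface the request arrived on, which is what makes level 1 stricter than level 0: a target IP that lives on another interface is local to the machine but absent from this list.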

    if (addr_type == RTN_LOCAL) {
        int dont_send;

        dont_send = arp_ignore(in_dev, sip, tip);

        /* If the ignore policy did not block it, consult the filter policy */
        if (!dont_send && IN_DEV_ARPFILTER(in_dev))
            dont_send = arp_filter(sip, tip, dev);

        if (!dont_send) {
            /* First add the sender to the neighbor table with state
             * NUD_STALE. This is passive learning.
             */
            n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
            if (n) {
                /* Send the reply */
                arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip, sha,
                         dev->dev_addr, sha);
                neigh_release(n);
            }
        }
        /* ... */
    }

There is a very interesting mechanism here called passive learning. We usually think that MAC addresses are learned only by sending requests, but in fact, when the kernel handles an incoming ARP packet (whether a Request or a Reply), it opportunistically records the sender's information. It is like someone knocking on your door to ask for directions: you answer them while also remembering their face, which makes them easier to find next time.

Step 4: Handling Proxy ARP

If tip is not my IP, but my machine has forwarding enabled, and the routing table lookup shows that this packet should be forwarded through me, then I have the opportunity to become a proxy.

arp_fwd_proxy() and arp_fwd_pvlan() are used to determine this scenario. If you configure Proxy ARP, the kernel will reply to ARP requests on behalf of the target host, disguising itself as the target.

This is rarely used at home, but it is very useful in some legacy network migration scenarios or special gateway deployments.

Step 5: Updating the Neighbor Table

Regardless of whether it just replied to a request, the kernel finally updates the neighbor table based on the received packet.

    /* Look up the sender's neighbor entry */
    n = __neigh_lookup(&arp_tbl, &sip, dev, 0);

    /* ... various checks, including whether to accept unsolicited replies ... */

    if (n) {
        int state = NUD_REACHABLE;
        int override;

        /* If several different replies arrive within locktime, trust only
         * the first one; this keeps the entry from flapping */
        override = time_after(jiffies, n->updated + n->parms->locktime);

        /* Only a unicast Reply may set the state to REACHABLE;
         * anything else counts as STALE */
        if (arp->ar_op != htons(ARPOP_REPLY) ||
            skb->pkt_type != PACKET_HOST)
            state = NUD_STALE;

        /* Apply the update */
        neigh_update(n, sha, state,
                     override ? NEIGH_UPDATE_F_OVERRIDE : 0);
        neigh_release(n);
    }
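The two decisions in this update step, which state to record and whether an existing MAC may be overwritten, can be isolated in a few lines of user-space C. The names below (enum nud_model, learned_state(), tick_after(), may_override()) are hypothetical; tick_after() is written in the spirit of the kernel's time_after() and relies on the usual two's-complement conversion to compare free-running counters safely across wraparound.

```c
#include <stdbool.h>

enum nud_model { MODEL_STALE, MODEL_REACHABLE };

/* Only a reply addressed directly to this host proves two-way
 * reachability; everything else is recorded as stale. */
enum nud_model learned_state(bool is_reply, bool addressed_to_host)
{
    return (is_reply && addressed_to_host) ? MODEL_REACHABLE : MODEL_STALE;
}

/* Wrap-safe "a is later than b" for a free-running tick counter. */
bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* An existing MAC may be overwritten only once the lock time since
 * the last update has elapsed. */
bool may_override(unsigned long now, unsigned long updated,
                  unsigned long locktime)
{
    return tick_after(now, updated + locktime);
}
```

With an entry updated at tick 100 and a lock time of 100 ticks, a conflicting reply at tick 150 is not allowed to override the MAC, while one at tick 201 is.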

Summary

At this point, we have traced the complete lifecycle of ARP in the kernel: from the sending path's "broadcasting a shout when you can't find anyone," to the receiving path's "remembering a face when you hear a shout," to the handling of various edge cases (DAD, Proxy ARP, arp_filter).

This is neighbor discovery in the IPv4 era. It is simple and efficient, but full of hidden dangers—there is no verification mechanism, and anyone can claim to be the gateway.

This is why the next chapter will look at IPv6's NDISC. You will find that although it does roughly the same thing, it upgrades this simple and crude shouting mechanism into a much more rigorous protocol—at the same time, a much more complex one.

But before we go there, let's make sure we really understand how this "simple" shouting match keeps our IPv4 networks running every day.
