Chapter 7: The Linux Neighbouring Subsystem

This chapter explores how the Linux kernel maintains link-layer address mappings, along with the implementation details of the ARP/NDISC protocols within the kernel.

Chapter Prologue: When IP Meets MAC

There is a class of problems that appear to be about "route reachability" on the surface, but are actually about "unknown addresses."

What we are dealing with in this chapter is precisely this "translation layer" in the network stack—the one most easily overlooked.

Imagine this scenario: your machine has just sent a SYN packet, attempting to establish a TCP connection. The routing table has been consulted, and the next-hop IP is clearly right next door on the same subnet, yet the network card flatly refuses to send the packet out. What is it waiting for? It is waiting for an answer—what exactly is the MAC address corresponding to that IP address?

This is the entire reason the neighbouring subsystem exists. Whether it is ARP in IPv4 or NDISC in IPv6, they are essentially doing the same thing: establishing a trust relationship from L3 to L2 on a Layer 2 network full of uncertainties.

This trust is fragile. Early Ethernet was like a room full of imposters; anyone could stand up and shout "I am the gateway" (this is the famous ARP spoofing). The kernel must remain vigilant at all times—not only establishing mappings but also continuously verifying, updating, and quickly cleaning up garbage when connections become invalid.

Our task in this chapter is to open this black box. We will see how the kernel allocates neighbour entries, how it handles pressure from the network stack (GC), and how it carefully handles those special addresses (multicast, broadcast). This is not just about ARP—it is a comprehensive showcase of caching strategies, state machines, and concurrency management.


7.1 Creating and Destroying Neighbours

The Neighbouring Subsystem Core

Let's start with the most fundamental action: when the kernel decides "I need to talk to this IP," what does it actually do?

This involves the creation of the neighbour structure. In the kernel code, the core entry point for this action is __neigh_create().

Core Method __neigh_create()

The signature of this method is not long, but every parameter is crucial:

struct neighbour *__neigh_create(struct neigh_table *tbl,
                                 const void *pkey,
                                 struct net_device *dev,
                                 bool want_ref)
  • tbl: Pointer to the neighbour table (e.g., the ARP table arp_tbl).
  • pkey: The protocol layer key (the destination IP address under IPv4). This is the unique index for looking up a neighbour.
  • dev: The outbound network device interface.
  • want_ref: Flag indicating whether to increment the reference count.

First, the kernel needs to allocate a neighbour object. This might sound like a simple kmalloc, but in reality, while calling neigh_alloc(), it is also engaged in an intense resource negotiation.

Step 1: The Tug-of-War Between neigh_alloc() and Garbage Collection

When you try to apply for a new neighbour entry, the first thing the neigh_alloc() method does is increment the current entry count of the neighbour table:

static struct neighbour *neigh_alloc(struct neigh_table *tbl, struct net_device *dev)
{
    struct neighbour *n = NULL;
    unsigned long now = jiffies;
    int entries;

    entries = atomic_inc_return(&tbl->entries) - 1;

This single line of atomic_inc_return is a double-edged sword. Once the counter is incremented, the kernel must check if it has crossed the boundary. Two lines of defense are set up here: gc_thresh2 and gc_thresh3.

  • gc_thresh3 (hard limit): Defaults to 1024. This is the absolute ceiling for the number of entries.
  • gc_thresh2 (soft limit): Defaults to 512. This is the threshold that triggers garbage collection (GC).

The following logic determines whether your new neighbour "lives" or "dies":

if (entries >= tbl->gc_thresh3 ||
    (entries >= tbl->gc_thresh2 &&
     time_after(now, tbl->last_flush + 5 * HZ))) {
        if (!neigh_forced_gc(tbl) &&
            entries >= tbl->gc_thresh3)
                goto out_entries;
}

This code is worth pausing to read carefully. There are two conditions that trigger synchronous garbage collection (neigh_forced_gc):

  1. The number of entries reaches or exceeds gc_thresh3 (the hard limit).
  2. Or, the number of entries reaches gc_thresh2 (the soft limit), and 5 seconds have passed since the last cleanup (5 * HZ).

There is a subtle failure path here: if forced GC is triggered but fails to free any entries (neigh_forced_gc() returns 0) while the count is still at or above gc_thresh3, the kernel abandons the allocation and jumps to the out_entries label (which decrements the counter and returns NULL). This means that under high load, if you do not raise these thresholds, new connections can fail outright with a "neighbour table overflow."

Step 2: Protocol-Specific Initialization

Assume we pass the GC test and obtain a clean block of neighbour memory. Next, the kernel must populate the details of this structure based on the protocol type (IPv4 or IPv6). This is accomplished by calling the constructor callback registered in the neighbour table.

For IPv4, this is arp_constructor(); for IPv6, it is ndisc_constructor().

These two constructors share several extremely important responsibilities: handling special cases that do not require ARP.

1. Multicast Address Handling

Multicast traffic does not require ARP resolution. If you want to send a packet to 224.0.0.1, you don't need to ask "who has 224.0.0.1?" because the destination MAC can be derived algorithmically from the multicast group address itself.

In arp_constructor():

/* pseudocode illustrating the logic */
if (IS_MULTICAST(key)) {
    arp_mc_map(key, n->ha, dev, 0);
    n->nud_state = NUD_NOARP;
}

It calls arp_mc_map() to calculate the corresponding MAC address, fills it into the ha field, and sets the state to NUD_NOARP (no ARP needed).

The IPv6 logic is similar, handled by ndisc_constructor() calling ndisc_mc_map().

2. Broadcast Address Handling

Broadcast addresses (like 255.255.255.255) also do not need ARP. In arp_constructor():

/* pseudocode illustrating the logic */
if (type == RTN_BROADCAST) {
    memcpy(n->ha, dev->broadcast, dev->addr_len);
    n->nud_state = NUD_NOARP;
}

It directly copies the network device's broadcast address into the neighbour structure, likewise marking it as NUD_NOARP.

⚠️ Warning: IPv6 has no traditional broadcast concept. Although it has ff02::1 (all-nodes multicast) and ff02::2 (all-routers multicast), in the kernel implementation, they are classified as multicast, not broadcast. Therefore, there is no such broadcast logic in the IPv6 neighbouring subsystem.

3. Device-Level Special Handling

The constructor also has two special call points. While uncommon, they are very useful at critical moments:

  • ndo_neigh_construct(): This is a callback in the net_device operation set. Currently, it is almost only implemented in the clip (Classical IP over ATM) driver. The neighbour discovery logic for the ancient ATM technology is extremely complex and requires driver intervention.
  • neigh_parms->neigh_setup(): This is for the Bonding (network device bonding) driver. In Bond mode, a virtual network device corresponds to multiple physical ports, and neighbour parameters require special configuration.

Step 3: Resizing the Hash Table

Neighbour entries are stored in a hash table. When the network is extremely busy—for example, when you are handling thousands of concurrent connections simultaneously—the hash table might suffer from severe collisions or become full. When creating an entry, the kernel checks whether a resize is needed:

struct neighbour *__neigh_create(struct neigh_table *tbl, const void *pkey,
                                 struct net_device *dev, bool want_ref)
{
    ...

    /* hash_shift determines the hash table size: 1 << hash_shift */
    if (atomic_read(&tbl->entries) > (1 << nht->hash_shift))
        nht = neigh_hash_grow(tbl, nht->hash_shift + 1);
    ...
}

If the number of entries exceeds the current hash bucket size (1 << shift), neigh_hash_grow() is called to allocate a larger hash table and migrate the old data over. This is an expensive operation that can block.

Step 4: Wrapping Up

Finally, __neigh_create() enters the wrapping-up phase.

If the caller passed in want_ref == true, the reference count is incremented. At the same time, the kernel initializes the confirmed field:

n->confirmed = jiffies - (n->parms->base_reachable_time << 1);

This line of code looks strange at first glance: why set the confirmed time to the past?

If you understand base_reachable_time as "how long a positive confirmation is considered valid," then backdating confirmed by twice that amount tells the kernel: "Although this neighbour entry is new, it must be verified right away." This is for security purposes: we do not want a freshly created entry to be treated as "trusted" and used long-term merely because it has not expired yet. We need it to enter the state-machine verification process as soon as possible.

Finally, the dead flag is set to 0, indicating the object is alive, and then this neighbour object is hung on the global hash table.

Destruction: neigh_release() and neigh_destroy()

Neighbour objects are managed based on reference counting. When a certain kernel module no longer cares about this neighbour, it calls neigh_release().

When the reference count drops to zero, neigh_destroy() is called to free the memory. But there is a sanity check here: neigh_destroy() inspects the dead flag.

If the dead flag is still 0, the object is still linked into the neighbour table, and the kernel refuses to destroy it (it logs a warning and bails out instead of freeing live data).

This usually happens in certain asynchronous paths: an object is in use, and suddenly a deletion event is received (for example, userspace deleted an ARP entry). At this point, the kernel will first mark dead = 1, but the actual memory release must wait until the moment everyone lets go (the refcnt drops to zero). This is a typical deferred destruction strategy.


Chapter Echoes

In this section, we merely set the stage. Allocating memory, setting flags, handling hash tables—these seemingly boring initialization codes actually define the physical form of the neighbouring subsystem. The gc_thresh thresholds cap how many neighbours the system can track at once, while the constructor callbacks dictate how different protocols (IPv4/IPv6) plug their own special logic into this generic framework.

But now that we have the memory and the entries are built, they are still empty. Who fills in the real MAC address? Who initiates that broadcast "WHO HAS" request, and when?

In the next section, we will "activate" these static data structures and see how ARP requests are constructed, sent, and ultimately drive the entry state from NUD_INCOMPLETE to NUD_REACHABLE.