3.0 The Network's Nervous System: Why We Need ICMP
Imagine managing a massive, globally distributed package delivery system.
Billions of packages flow through this network every day. As the manager, you don't need to physically handle every package, but you absolutely need a mechanism to tell you: is something wrong with the system?
If a sorting center suddenly goes offline, if a route becomes completely blocked, or if a sender fills in the wrong address, you need to know immediately. You can't wait until packages vanish into thin air to start investigating—that would be too late.
The designers of the internet faced the exact same problem. The IP protocol is essentially a "best-effort" delivery service—it handles sending but doesn't care about the result. Without a feedback mechanism, the network would be an undebuggable black box: packets would simply disappear, with no one knowing why or where to look.
That is the very purpose of ICMP. It acts as the nervous system of the internet, dedicated to delivering error reports, diagnostic information, and control messages.
You use it every day. When you type ping google.com to test connectivity, you're using ICMP; when you use traceroute to trace the routers a packet passes through, you're also leveraging ICMP's behavioral characteristics.
In this chapter, we're going to take this "nervous system" apart. We won't be looking at how user tools are used, but rather at what exactly happens inside the kernel when it receives these peculiar packets.
3.1 ICMPv4: The Diagnostic Core of IPv4
The Two Faces of ICMPv4 Messages
ICMPv4 messages fall into two broad categories: Error Messages and Information Messages (also known as Query Messages in RFC 1812).
The most famous tool—ping—is essentially a user-space test program (usually included in the iputils package). Its working principle is very straightforward: it opens a Raw Socket, sends an ICMP_ECHO (Echo Request) message, and then waits for the other end to reply with an ICMP_ECHOREPLY (Echo Reply).
Another commonly used tool, Traceroute, is used to probe the path a packet takes from a host to a destination IP. Its design is quite clever, leveraging a field in the IP header—TTL (Time To Live). This field represents how many more routers (hops) a packet can pass through. Traceroute exploits the following rule: when a forwarding device receives a packet whose TTL has been decremented to 0, it must drop the packet and send back an ICMP_TIME_EXCEEDED message.
Traceroute starts by sending a packet with a TTL of 1. The first-hop router receives it, the TTL reaches zero, and the router sends back an ICMP Time Exceeded message. Next, the TTL is set to 2 and the second-hop router replies, and so on, until a probe reaches the final destination, which answers for itself (with an ICMP Echo Reply when ICMP probes are used, or an ICMP Port Unreachable for the classic UDP probes). From these intermediate ICMP reports, Traceroute can piece together the sequence of routers along the path.
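The TTL trick can be illustrated with a small user-space simulation. This is only a toy model under stated assumptions: the path is a fixed number of routers followed by the destination, and the names `send_probe`, `trace_path`, and the reply enum are hypothetical, not kernel or traceroute identifiers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reply kinds for this toy model (not kernel constants). */
enum probe_reply { REPLY_TIME_EXCEEDED, REPLY_FROM_DESTINATION };

struct probe_result {
    int hop;                 /* 1-based index of the node that answered */
    enum probe_reply reply;
};

/*
 * Simulate one probe sent with the given initial TTL along a path of
 * `routers` forwarding hops followed by the destination host.
 * Each router decrements the TTL; the one that brings it to zero answers.
 */
struct probe_result send_probe(int routers, int ttl)
{
    struct probe_result res;

    assert(routers >= 0 && ttl >= 1);
    for (int hop = 1; hop <= routers; hop++) {
        if (--ttl == 0) {
            res.hop = hop;
            res.reply = REPLY_TIME_EXCEEDED;
            return res;
        }
    }
    res.hop = routers + 1;   /* the TTL survived every router */
    res.reply = REPLY_FROM_DESTINATION;
    return res;
}

/* Raise the TTL one step at a time until the destination answers. */
int trace_path(int routers)
{
    for (int ttl = 1;; ttl++) {
        struct probe_result r = send_probe(routers, ttl);

        printf("ttl=%d answered by hop %d%s\n", ttl, r.hop,
               r.reply == REPLY_FROM_DESTINATION ? " (destination)" : "");
        if (r.reply == REPLY_FROM_DESTINATION)
            return ttl;      /* TTL that was needed to reach the target */
    }
}
```

Calling `trace_path(3)` walks through TTL values 1 to 4: the first three probes are answered by routers, the fourth by the destination.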
Although we'll discuss this in detail in later chapters, it's worth mentioning that starting from kernel version 3.0, Linux introduced a new feature: ICMP Sockets (Ping Sockets). This allows regular users (non-root) to send Ping requests by creating a non-Raw Socket, eliminating the need for the ping tool to set the dangerous setuid root bit.
Initialization: inet_init() and icmp_sk_init()
ICMPv4 initialization doesn't happen during driver loading because it is part of the kernel's core Network Stack and cannot be compiled as a Kernel Module. All of this happens early in the kernel boot process.
The starting point of the entire process is in inet_init(). In this function, the kernel does two major things: first, it registers the ICMP handler with the protocol stack, and second, it prepares for the transmission of ICMP messages.
Registering the Protocol Handler
Just as we saw in the TCP/UDP chapters, ICMP is also a protocol, and it must be registered in the kernel's protocol dispatch array. This happens in inet_init():
static const struct net_protocol icmp_protocol = {
    .handler     = icmp_rcv,
    .err_handler = icmp_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};
- icmp_rcv: This is the core callback function. When the IP layer receives a packet with a protocol field of IPPROTO_ICMP (0x1), this function is ultimately called.
- no_policy: Setting this flag to 1 means IPsec policy checks are not required. This is an important optimization: for example, in ip_local_deliver_finish(), the kernel sees this flag and skips xfrm4_policy_check(), because for ICMP control messages, security policies are usually not the primary concern.
- netns_ok: Setting this to 1 indicates that this protocol is network Namespace aware. If set to 0, inet_add_protocol() will directly fail and return -EINVAL.
The registration process is straightforward:
if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
    pr_crit("%s: Cannot add ICMP protocol\n", __func__);
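To make the idea of a protocol dispatch array concrete, here is a minimal user-space sketch. It assumes a simplified model: `proto_entry`, `proto_table`, `add_protocol`, and `deliver` are hypothetical stand-ins, and the slot is claimed with a plain check rather than the atomic operation a kernel would need.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROTOS 256          /* one slot per 8-bit IP protocol number */
#define PROTO_ICMP 1            /* same value as IPPROTO_ICMP */

/* Simplified, hypothetical stand-in for struct net_protocol. */
struct proto_entry {
    int (*handler)(const unsigned char *pkt, size_t len);
};

const struct proto_entry *proto_table[MAX_PROTOS];

/* Claim a slot; fail with -1 if another handler already owns it. */
int add_protocol(const struct proto_entry *prot, unsigned char protocol)
{
    assert(prot != NULL);
    if (proto_table[protocol] != NULL)
        return -1;
    proto_table[protocol] = prot;
    return 0;
}

/* Look up the handler for an incoming packet's protocol field. */
int deliver(unsigned char protocol, const unsigned char *pkt, size_t len)
{
    const struct proto_entry *prot = proto_table[protocol];

    if (prot == NULL)
        return -1;              /* nobody registered for this protocol */
    return prot->handler(pkt, len);
}

/* Toy receive handler that simply consumes the packet. */
int toy_icmp_rcv(const unsigned char *pkt, size_t len)
{
    (void)pkt;
    (void)len;
    return 0;
}

const struct proto_entry toy_icmp_protocol = { .handler = toy_icmp_rcv };
```

Registering `toy_icmp_protocol` under `PROTO_ICMP` succeeds once; a second attempt on the same slot returns -1, and delivery to an unregistered protocol number fails the lookup.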
Creating Sockets for Transmission
Just knowing how to receive isn't enough; the kernel also needs to know how to send ICMP packets. This is where icmp_sk_init() comes in.
There is an interesting design detail here: the kernel creates a dedicated Raw Socket for each CPU.
int __net_init icmp_sk_init(struct net *net)
{
    . . .
    for_each_possible_cpu(i) {
        struct sock *sk;

        /* One kernel-internal raw ICMP socket per possible CPU */
        err = inet_ctl_sock_create(&sk, PF_INET,
                                   SOCK_RAW, IPPROTO_ICMP, net);
        if (err < 0)
            goto fail;
        net->ipv4.icmp_sk[i] = sk;
        . . .
        sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
        inet_sk(sk)->pmtudisc = IP_PMTUDISC_DONT;
    }
    . . .
}
Why go to this trouble? Because in a multi-core system, if all CPUs contend for a single Socket to send data, lock contention would be severe. Through the icmp_sk(struct net *net) method, the kernel can quickly obtain the Socket corresponding to the current CPU and use it directly for icmp_push_reply() to send data.
Here we see two configurations:
- SOCK_USE_WRITE_QUEUE: Indicates the use of a write queue.
- IP_PMTUDISC_DONT: Disables PMTU (Path MTU) discovery on this socket, so outgoing messages are sent without the DF (Don't Fragment) bit. For control messages like ICMP error notifications, delivery matters most: if a message is too large for some link, it is better to let it be fragmented than to have it dropped.
ICMPv4 Message Header: struct icmphdr
The header of every ICMP packet shares the same skeleton, but the payload varies.
The header contains: Type (8 bits), Code (8 bits), Checksum (16 bits), and a 32-bit variable part. The exact content of these 32 bits depends on the type and code.
struct icmphdr {
    __u8        type;
    __u8        code;
    __sum16     checksum;
    union {
        struct {
            __be16  id;
            __be16  sequence;
        } echo;             /* Echo Request / Echo Reply */
        __be32  gateway;    /* Redirect */
        struct {
            __be16  __unused;
            __be16  mtu;
        } frag;             /* Fragmentation Needed */
    } un;
};
Immediately following this header is usually the IP header and part of the payload of the original packet that triggered this ICMP message. RFC 1812 specifies that this should include as much of the original datagram as possible, provided the resulting ICMP datagram does not exceed 576 bytes. That figure comes from RFC 791, which requires every IPv4 host to be able to accept datagrams of up to 576 bytes, so an ICMP message of this size can always be received.
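The header layout and its checksum can be exercised in user space. The sketch below is an illustration, not kernel code: it builds an Echo Request in a buffer and fills in the standard Internet checksum (RFC 1071). The names `echo_hdr`, `inet_checksum`, and `build_echo_request` are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

#define ECHO_REQUEST 8   /* value of ICMP_ECHO */

/* Local copy of the echo layout so the sketch does not depend on system headers. */
struct echo_hdr {
    uint8_t  type;
    uint8_t  code;
    uint16_t checksum;
    uint16_t id;         /* network byte order */
    uint16_t sequence;   /* network byte order */
};

/* RFC 1071 Internet checksum: one's-complement sum of 16-bit big-endian words. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;          /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);  /* fold carries back in */
    return htons((uint16_t)~sum);
}

/* Fill buf with an Echo Request; returns the total message length. */
size_t build_echo_request(uint8_t *buf, size_t cap, uint16_t id, uint16_t seq,
                          const void *payload, size_t payload_len)
{
    struct echo_hdr hdr = { .type = ECHO_REQUEST, .code = 0 };
    size_t total = sizeof(hdr) + payload_len;

    assert(payload != NULL && total <= cap);
    hdr.id = htons(id);
    hdr.sequence = htons(seq);
    memcpy(buf, &hdr, sizeof(hdr));          /* checksum field is still zero */
    memcpy(buf + sizeof(hdr), payload, payload_len);

    hdr.checksum = inet_checksum(buf, total);
    memcpy(buf, &hdr, sizeof(hdr));          /* write the header again, now with checksum */
    return total;
}
```

A receiver can validate a message by running `inet_checksum()` over the whole thing, checksum field included: an intact message yields 0.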
Dispatch Mechanism: The icmp_pointers Array
How does the kernel know whether a received ICMP packet should be handed to Ping or to the routing table?
The answer is table lookup.
The kernel defines a global array, icmp_pointers, which uses the ICMP message type as an index and stores the corresponding icmp_control structure.
struct icmp_control {
    void (*handler)(struct sk_buff *skb);
    short error;    /* whether this type is classified as an error message */
};
static const struct icmp_control icmp_pointers[NR_ICMP_TYPES+1];
NR_ICMP_TYPES is the maximum ICMP type number (18).
If the error field of a certain type is 1, it means it is an Error Message (such as Destination Unreachable); if it is 0 (implicit), it is an Information Message (such as Echo).
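A table-driven dispatch of this kind can be sketched in plain C with designated initializers. This is a simplified model, not the kernel's table: the handlers take no packet and merely report which one ran, only four types are filled in, and names such as `control`, `pointers`, and `dispatch` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_TYPES 18   /* mirrors NR_ICMP_TYPES from the text */

/* ICMP type numbers used in this sketch. */
enum { T_ECHOREPLY = 0, T_DEST_UNREACH = 3, T_ECHO = 8, T_TIMESTAMPREPLY = 14 };

/* What a handler did; stands in for real packet processing. */
enum action { ACT_NONE, ACT_PING_RCV, ACT_UNREACH, ACT_ECHO, ACT_DISCARD };

struct control {
    enum action (*handler)(void);   /* simplified: the real handler takes an skb */
    short error;                    /* 1 for error messages, 0 for informational */
};

enum action h_ping_rcv(void) { return ACT_PING_RCV; }
enum action h_unreach(void)  { return ACT_UNREACH; }
enum action h_echo(void)     { return ACT_ECHO; }
enum action h_discard(void)  { return ACT_DISCARD; }

/* Unlisted types keep a NULL handler and error == 0. */
const struct control pointers[NR_TYPES + 1] = {
    [T_ECHOREPLY]      = { .handler = h_ping_rcv },
    [T_DEST_UNREACH]   = { .handler = h_unreach, .error = 1 },
    [T_ECHO]           = { .handler = h_echo },
    [T_TIMESTAMPREPLY] = { .handler = h_discard },
};

/* Run the handler for a type; ACT_NONE means the type is not handled here. */
enum action dispatch(unsigned int type)
{
    if (type > NR_TYPES || pointers[type].handler == NULL)
        return ACT_NONE;
    return pointers[type].handler();
}

/* Error/informational classification is a simple table lookup. */
bool is_error_type(unsigned int type)
{
    assert(type <= NR_TYPES);
    return pointers[type].error != 0;
}
```

The type number doubles as the array index, so both dispatch and classification are constant-time lookups with a single bounds check.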
Next, let's look at a few key handler entries, which are the soul of ICMP processing logic.
1. ping_rcv(): More Than Just an Echo
Before kernel 3.0, Ping replies (ICMP_ECHOREPLY) were handled directly by user-space Raw Sockets. Back then, ip_local_deliver_finish() first delivered the packet to any matching Raw Socket. Since the Raw Socket that sent the ping had already received the reply, the ICMP protocol handler had nothing left to do with it.
But after the introduction of the ICMP Sockets mechanism, things changed. You no longer need root privileges to create a non-Raw ICMP Socket (such as socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)). Since the sender is not a Raw Socket, the returning reply cannot be matched to a Raw Socket, and naturally, no one would receive it.
To solve this problem, a hook called ping_rcv() was specifically attached to ICMP_ECHOREPLY in icmp_pointers.
This function is implemented in net/ipv4/ping.c. Interestingly, this file is dual-stack, meaning it handles both IPv4 Echo Replies and IPv6 ones (ICMPV6_ECHO_REPLY).
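From user space, a ping socket is requested with an ordinary socket() call. The sketch below assumes a Linux system; `open_ping_socket` is a hypothetical helper name. Whether the call succeeds depends on local configuration: on Linux, access is gated by the net.ipv4.ping_group_range sysctl, so a failure here does not by itself indicate a bug.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Try to open an unprivileged ICMP ("ping") socket for IPv4 or IPv6.
 * Returns the descriptor, or -1 with errno set (for example EACCES when
 * the caller's group is outside net.ipv4.ping_group_range).
 */
int open_ping_socket(int family)
{
    int proto, fd, saved;

    assert(family == AF_INET || family == AF_INET6);
    proto = (family == AF_INET) ? IPPROTO_ICMP : IPPROTO_ICMPV6;

    /* SOCK_DGRAM rather than SOCK_RAW: this is what avoids needing root. */
    fd = socket(family, SOCK_DGRAM, proto);
    if (fd < 0) {
        saved = errno;
        fprintf(stderr, "ping socket unavailable: %s\n", strerror(saved));
        errno = saved;      /* preserve the error for the caller */
    }
    return fd;
}
```

A caller that gets a valid descriptor can send an Echo Request with sendto() and read the matching reply with recvfrom(), then close() the descriptor as usual.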
2. icmp_discard(): Silence is Golden
For some messages, the kernel doesn't need to do anything upon receiving them and simply drops them.
The most typical example is ICMP_TIMESTAMPREPLY (Timestamp Reply). Although ICMP has a timestamp feature, modern networks have long used NTP (Network Time Protocol) for time synchronization, which offers higher precision and more functionality. The ICMP Timestamp is mostly a legacy feature.
Address Mask related messages are also discarded. Previously, hosts would use ICMP_ADDRESS to ask routers for the subnet mask, but now everyone uses DHCP, making this feature obsolete.
3. icmp_unreach(), icmp_redirect(), and Others
- icmp_unreach(): Handles ICMP_DEST_UNREACH (Destination Unreachable), ICMP_TIME_EXCEEDED (Time Exceeded), ICMP_PARAMETERPROB (Parameter Problem), etc. Two common sources of Time Exceeded messages are:
  - TTL reaching zero: In ip_forward(), the TTL is decremented by 1 on each forward. Once it reaches zero, the router calls icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0) and drops the packet.
  - Fragment reassembly timeout: If fragment reassembly times out, ip_expire() sends an ICMP_TIME_EXCEEDED message with code ICMP_EXC_FRAGTIME.
- icmp_redirect(): Handles ICMP_REDIRECT (Redirect). According to RFC 1122, hosts should not send redirects; only gateways do. Previously, this handler would call ip_rt_redirect() to update the routing table, but starting from kernel 3.6, that logic was moved into the protocol handling code, and icmp_redirect() now only performs sanity checks and dispatches the message. This is an architectural adjustment the kernel made to address security risks (such as ICMP redirect attacks).
- icmp_echo(): Handles ICMP_ECHO requests. Unless sysctl_icmp_echo_ignore_all is set, it changes the packet's type to ICMP_ECHOREPLY and calls icmp_reply() to send it back.
Receiving Flow: The Rigorous Logic of icmp_rcv()
All ICMPv4 packets ultimately converge in the icmp_rcv() function. Although this function doesn't directly handle specific business logic (that's the job of individual handlers), it performs a substantial amount of security checks.
- Statistics and Checksum: First, it increments the InMsgs counter. Next, the checksum must verify. If it doesn't, the function increments InCsumErrors and InErrors and frees the packet with kfree_skb(). One detail: icmp_rcv() always returns 0, even when the checksum is wrong. A negative return value would tell ip_local_deliver_finish() to resubmit the packet to another protocol handler, whereas a corrupted ICMP packet should simply disappear quietly.
- Type Check: Checks that the message type is valid (no greater than NR_ICMP_TYPES). RFC 1122 requires that a message of unknown type be silently discarded, so the kernel increments InErrors and drops the packet.
- Broadcast and Multicast Suppression: If the received packet is destined for a broadcast or multicast address, and it's an Echo Request or Timestamp Request, the kernel checks sysctl_icmp_echo_ignore_broadcasts.
  - This is to prevent network storms. If someone pings the entire network via broadcast and every machine replies, the network would be paralyzed. This switch is on by default, so such requests are ignored.
- Final Dispatch: Finds the corresponding callback function via icmp_pointers[type].handler and executes it.
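The order of these checks can be modeled as a small classification function. This is a sketch under assumptions, not the kernel's icmp_rcv(): checksum verification is reduced to a boolean input, the counters are a hypothetical `icmp_stats` struct named after the SNMP fields, and the broadcast drop is counted as an input error in this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_TYPES     18   /* mirrors NR_ICMP_TYPES */
#define T_ECHO        8
#define T_TIMESTAMP  13

/* Hypothetical counters named after the SNMP fields mentioned above. */
struct icmp_stats { unsigned long in_msgs, in_csum_errors, in_errors; };

enum verdict { V_DISPATCH, V_DROP_CSUM, V_DROP_TYPE, V_DROP_BCAST };

struct rx_info {
    unsigned int type;     /* ICMP type from the header */
    bool csum_ok;          /* result of checksum verification */
    bool to_broadcast;     /* destination was broadcast or multicast */
};

/* Apply the validation steps in order and say what happens to the packet. */
enum verdict classify(const struct rx_info *rx, bool ignore_broadcasts,
                      struct icmp_stats *st)
{
    assert(rx != NULL && st != NULL);
    st->in_msgs++;                       /* counted before any validation */

    if (!rx->csum_ok) {
        st->in_csum_errors++;
        st->in_errors++;
        return V_DROP_CSUM;
    }
    if (rx->type > NR_TYPES) {           /* unknown type: drop silently */
        st->in_errors++;
        return V_DROP_TYPE;
    }
    if (rx->to_broadcast && ignore_broadcasts &&
        (rx->type == T_ECHO || rx->type == T_TIMESTAMP)) {
        st->in_errors++;                 /* assumption: counted as an input error */
        return V_DROP_BCAST;
    }
    return V_DISPATCH;                   /* caller would index the handler table */
}
```

Only a packet that survives all three checks reaches the handler table; every rejected packet is dropped without any reply to the sender.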
Sending Flow: icmp_send() and icmp_reply()
The kernel primarily uses two pathways to send ICMP messages:
- Proactive error reporting: Such as icmp_send(). Called when network anomalies occur (e.g., port unreachable, fragmentation needed).
- Passive replies: Such as icmp_reply(). Used to reply to Echo or Timestamp requests.
Regardless of the pathway, both ultimately call icmp_push_reply() and rely on the previously mentioned per-CPU Socket to pass data into the IP layer.
Structure: struct icmp_bxm
Before sending, the kernel uses struct icmp_bxm to construct the message to be sent. This acts as a temporary "assembly workshop".
struct icmp_bxm {
    struct sk_buff *skb;        /* original packet that triggered the message */
    int offset;                 /* network header offset */
    int data_len;               /* payload length */
    struct {
        struct icmphdr icmph;   /* ICMP header */
        __be32 times[3];        /* timestamps (used for Timestamp messages) */
    } data;
    int head_len;               /* total header length */
    struct ip_options_data replyopts;   /* IP options */
};
Rate Limiting: Don't Let the Network Explode
If ICMP messages are sent without restriction, it can easily trigger an "avalanche". For example, if a router is flapping and receives tens of thousands of erroneous packets per second, and it replies with an ICMP unreachable for each one, it would crash itself.
The kernel implements rate limiting through icmpv4_xrlim_allow().
Rate limiting is only skipped under the following circumstances:
- The message type is unknown (this is very rare).
- It's a PMTU discovery message (ICMP_FRAG_NEEDED). This must be sent as soon as possible; otherwise the sender never learns the path MTU and the connection can stall.
- The device is loopback.
- This ICMP type is not enabled in the rate mask.
In all other cases, the kernel calls inet_peer_xrlim_allow() to check if too many ICMP packets have been sent to that destination.
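Per-destination limiting of this kind is commonly built as a token bucket. The sketch below is a user-space illustration loosely modeled on that idea, not a copy of the kernel routine: `peer_limit`, `xrlim_allow`, `type_is_limited`, and the burst factor of 6 are assumptions for this example, and time is passed in as an abstract tick count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_FACTOR 6   /* assumed burst allowance, in units of `interval` */

/* Hypothetical per-destination state. */
struct peer_limit {
    uint64_t tokens;     /* accumulated credit, in clock ticks */
    uint64_t last;       /* tick of the previous check */
};

/* Token bucket: one message costs `interval` ticks of accumulated credit. */
bool xrlim_allow(struct peer_limit *p, uint64_t now, uint64_t interval)
{
    uint64_t tokens;

    assert(p != NULL && now >= p->last);
    if (interval == 0)
        return true;                       /* limiting disabled */

    tokens = p->tokens + (now - p->last);  /* earn credit for elapsed time */
    p->last = now;
    if (tokens > BURST_FACTOR * interval)
        tokens = BURST_FACTOR * interval;  /* cap the burst size */

    if (tokens >= interval) {
        p->tokens = tokens - interval;     /* spend credit for this message */
        return true;
    }
    p->tokens = tokens;
    return false;
}

/* Only types whose bit is set in the mask are subject to limiting. */
bool type_is_limited(unsigned int type, uint32_t ratemask)
{
    return type < 32 && ((UINT32_C(1) << type) & ratemask) != 0;
}
```

With an interval of 1000 ticks, a destination that has been quiet for a long time can receive a burst of six messages back to back, after which further messages are refused until enough time has passed. A mask such as 0x1818 selects types 3, 4, 11, and 12, so Echo Replies (type 0) are never limited under it.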
Diving into Sending Scenarios: Destination Unreachable
Let's look at a few specific scenarios that trigger icmp_send(), which are very helpful for debugging network issues.
Code 2: Protocol Unreachable (ICMP_PROT_UNREACH)
Scenario: You send a packet with the protocol field in the IP header set to 137 (hypothetically), but the kernel has no handler registered for protocol number 137.
In ip_local_deliver_finish(), the kernel looks up the table inet_protos[protocol] and finds it's NULL. If no Raw Socket claims it, the kernel triggers:
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0);
Meaning: "I wanted to handle this packet, but I couldn't find a protocol (like TCP/UDP) that can handle it, so it's unreachable."
Code 3: Port Unreachable (ICMP_PORT_UNREACH)
Scenario: You send a UDP packet to port 8888 on a certain machine, but that machine has no program listening on port 8888.
The UDP protocol layer looks up the Socket in __udp4_lib_rcv() and finds nothing. If the checksum is correct, it will reply with:
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
This is typically the most common ICMP error message you will encounter.
Code 4: Fragmentation Needed (ICMP_FRAG_NEEDED)
Scenario: The key to PMTU discovery.
You want to send a large 9000-byte packet, but it passes through a router whose egress MTU is only 1500, and the DF (Don't Fragment) bit in your packet header is set to 1.
In ip_forward(), the kernel discovers that skb->len > dst_mtu(&rt->dst) and the DF flag is set. Since it cannot fragment, it can only drop the packet and notify:
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
          htonl(dst_mtu(&rt->dst)));
Note the last parameter here: it tells the sender the correct MTU value. This is how PMTU works.
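How the MTU travels inside the 32-bit variable part of the header can be shown with a local copy of the header layout. This is a user-space sketch: `icmp_hdr`, `must_send_frag_needed`, `make_frag_needed`, and `advertised_mtu` are hypothetical names, and the struct uses fixed-width types in place of the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Local copy of the ICMP header layout shown earlier (fixed-width types). */
struct icmp_hdr {
    uint8_t  type;
    uint8_t  code;
    uint16_t checksum;
    union {
        struct { uint16_t id, sequence; } echo;
        uint32_t gateway;
        struct { uint16_t unused, mtu; } frag;
    } un;
};

/* A forwarder must report instead of fragmenting when DF is set. */
bool must_send_frag_needed(uint32_t pkt_len, uint32_t egress_mtu, bool df_set)
{
    return pkt_len > egress_mtu && df_set;
}

/* Build the header a router would send back (checksum left at zero here). */
struct icmp_hdr make_frag_needed(uint32_t egress_mtu)
{
    struct icmp_hdr h = { .type = 3 /* Destination Unreachable */,
                          .code = 4 /* Fragmentation Needed */ };

    assert(egress_mtu <= 0xffff);
    h.un.gateway = htonl(egress_mtu);   /* the 32-bit info word, big-endian */
    return h;
}

/* The sender reads the next-hop MTU from the low 16 bits of that word. */
uint16_t advertised_mtu(const struct icmp_hdr *h)
{
    return ntohs(h->un.frag.mtu);
}
```

Because the value is written as one big-endian 32-bit word, the upper 16 bits land in the unused field and the lower 16 bits in mtu, so a 9000-byte packet hitting a 1500-byte link results in a message whose advertised MTU reads back as 1500.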
Code 5: Source Route Failed (ICMP_SR_FAILED)
Scenario: Using Strict Source Routing.
If ip_forward() finds that the packet carries a strict source route, which means the next listed hop (say, Gateway A) must be reachable directly, but the routing table says that hop can only be reached through some other gateway, this error is triggered. Although source routing is rarely used today, the kernel still retains this logic.
Summary
ICMPv4 may look simple, but it is actually a crucial part of the network's self-regulation mechanism.
- It is the "exception handling mechanism" of the IP layer.
- It is the cornerstone of user diagnostic tools (Ping/Traceroute).
- Through rate limiting and specific processing logic, it protects itself from being abused.
Next, we will enter the world of IPv6. You will find that, although the names are similar, in IPv6, ICMP is endowed with a more core responsibility—it is the foundation of Neighbor Discovery, almost entirely replacing the functionality of ARP.