4.9 Quick Reference

Now that we've completed this journey, let's pause and pack our backpack.

In this chapter, we read through a lot of code and traced numerous paths—from how a packet is constructed, to how it is carefully fragmented, to how it navigates the kernel's forwarding path. Along the way, we encountered a multitude of function names, macro definitions, and structures—like tools scattered on the ground. Now we need to pick them up and sort them back into the toolbox.

This isn't a "glossary of terms" meant for passing an exam; it's a cheat sheet. Later, when you're debugging network issues, staring blankly at memory in crash tools, or struggling to recall "what was that function that handles the fragment forwarding option called?" in your code, coming back to this section will help you find that crucial clue.

Listed below are the "protagonists" and "supporting cast" that made frequent appearances in this chapter.

Method Quick Reference

Listed here are the core methods of the IPv4 layer.

Transmit Path: From Transport Layer to Network Layer

These functions are responsible for pushing data from the transport layer (L4) to the network layer (L3). Although they all share the same destination, their points of origin and methods of departure vary.

int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl);

Used for: TCPv4 data transmission.

This is the go-to method for TCP. It handles encapsulating TCP segments into IP packets and sending them off.

int ip_append_data(struct sock *sk, struct flowi4 *fl4, ... );

Used for: UDPv4 (Corked mode) and ICMPv4.

This is a "data accumulation" process. When you use the UDP_CORK option with UDP, or when sending ICMP messages, the kernel doesn't send a packet for every tiny piece of data. Instead, it uses this method to accumulate data until ip_push_pending_frames() is called to send it.

struct sk_buff *ip_make_skb(struct sock *sk, struct flowi4 *fl4, ... );

Historical context: Introduced in kernel 2.6.39.

Why was it introduced? To implement a lockless fast path for UDP transmission. When is it called? When you haven't set the UDP_CORK option, UDP prefers to call this method directly, preparing and sending the data in one shot to avoid locking overhead.

int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);

Function: The universal "mover".

This is a generic callback function specifically responsible for copying data from user space to a specified location in an SKB. If you are writing your own protocol and don't want to reinvent the wheel, you can use it directly.

static int icmp_glue_bits(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);

Function: ICMPv4's dedicated getfrag callback.

When the ICMP module calls ip_append_data(), it passes this function in as the getfrag parameter. At its core, it still copies data, but it incorporates ICMP-specific logic.

Receive and Routing: A Packet's Destination

A packet has arrived—what next? These functions answer that question.

int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev);

Identity: IPv4's "head receptionist".

This is the main receive handler for IPv4 packets. All incoming IPv4 packets make this their first stop. It performs the most basic sanity checks (header length, version number, checksum) and then hands the packet off to Netfilter at the PRE_ROUTING hook.
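The checks described above can be modeled in user space. This is a minimal sketch over a raw byte buffer, not the kernel's code: the function names are hypothetical, and the real ip_rcv() works on a struct sk_buff and performs further checks not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over the header, folded to 16 bits and inverted.
 * Over a header whose checksum field is already filled in, the result
 * is 0 when the header is intact. len is expected to be even. */
static uint16_t hdr_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical model of the basic header checks: returns 1 if the
 * header looks sane, 0 if the packet should be dropped. */
static int ipv4_header_sane(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (len < 20)
        return 0;                      /* shorter than a minimal header */
    unsigned version = pkt[0] >> 4;    /* high nibble of byte 0 */
    unsigned ihl = pkt[0] & 0x0F;      /* low nibble, in 4-byte units */
    if (version != 4 || ihl < 5)
        return 0;
    if (len < ihl * 4u)
        return 0;                      /* options run past the buffer */
    return hdr_checksum(pkt, ihl * 4u) == 0;
}
```

A header that fails any of these tests corresponds to the drop path on which the kernel counts a header error.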

int ip_local_deliver(struct sk_buff *skb);

Final stop: Local upper-layer protocols.

When a routing lookup confirms that this packet is destined for the local machine, the receive path (ip_rcv, then ip_rcv_finish) ultimately hands the packet to this function. If necessary, it performs reassembly first, and then knocks on the door of the L4 protocol (TCP/UDP).

int ip_forward(struct sk_buff *skb);

Transit station: The forwarding path.

If you have configured Linux as a router, this function is the core. It decrements the TTL by 1, updates the checksum, and then pushes the packet to the next egress.

int ip_route_input_noref(...)

This is the executor of the routing lookup on the receive path. It is the judge that decides whether the packet goes to ip_local_deliver or to ip_forward.

Multicast: Niche but Important

Multicast has its own set of logic; while the path is similar, the handling functions are specialized.

int ip_mr_input(struct sk_buff *skb);

Function: Handle incoming multicast packets.

If this is a multicast packet, the kernel calls this to decide whether to deliver the packet to the local machine or forward it to interfaces specified in the MFC table.

static int ipmr_queue_xmit(struct net *net, struct mr_table *mrt, ...);

Function: Multicast's dedicated send method.

int ip_mr_forward(struct net *net, struct mr_table *mrt, ...);

Function: Multicast forwarding.

Fragmentation: Cutting and Pasting

This is the part most prone to errors and the most demanding of attention to detail.

int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));

Function: The master of fragmentation.

When a packet's length exceeds the MTU and fragmentation is allowed, this function is called. It contains fast and slow path processing logic, slicing a large packet into smaller fragments that fit the MTU.

int ip_defrag(struct sk_buff *skb, u32 user);

Function: The master of reassembly.

On the receiving end, it is responsible for reassembling these scattered fragments back into a complete packet based on their ID, source address, and protocol.

Parameter user: This parameter tells us who is requesting the reassembly (e.g., the normal IP layer or Netfilter connection tracking). The full list of definitions is in the ip_defrag_users enum within include/net/ip.h.
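To make the matching rule concrete, here is a small sketch of a reassembly lookup key. The struct and function names are hypothetical. RFC 791 identifies the fragments of one datagram by Identification, source, destination, and protocol; including the user value in the comparison is an assumption based on the user parameter described above, so that different requesters do not share a reassembly queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key type for illustration only. */
struct frag_key {
    uint16_t id;        /* IPv4 Identification field */
    uint32_t saddr;     /* source address */
    uint32_t daddr;     /* destination address */
    uint8_t  protocol;  /* L4 protocol number */
    uint32_t user;      /* who asked for reassembly */
};

/* Two fragments belong to the same datagram only if every field matches. */
static bool frag_key_equal(const struct frag_key *a, const struct frag_key *b)
{
    assert(a != NULL && b != NULL);
    return a->id == b->id &&
           a->saddr == b->saddr &&
           a->daddr == b->daddr &&
           a->protocol == b->protocol &&
           a->user == b->user;
}
```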

bool ip_is_fragment(const struct iphdr *iph);

Function: Identify a fragment at a glance.

If this packet is merely a fragment (the MF flag is 1 or the FragOffset is not 0), it returns true.
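The test above is a single mask operation. The sketch below reproduces it in user space; the SK_-prefixed constants and the function name are local stand-ins chosen for this example, with the mask values taken from the frag_off layout (MF is bit 13, the offset is the low 13 bits).

```c
#include <arpa/inet.h>   /* htons */
#include <assert.h>      /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

#define SK_IP_MF     0x2000   /* More Fragments flag */
#define SK_IP_OFFSET 0x1FFF   /* 13-bit fragment offset */

static_assert((SK_IP_MF & SK_IP_OFFSET) == 0,
              "flag and offset bits must not overlap");

/* frag_off_net is the field exactly as it sits in the header,
 * that is, in network byte order. */
static bool is_fragment(uint16_t frag_off_net)
{
    return (frag_off_net & htons(SK_IP_MF | SK_IP_OFFSET)) != 0;
}
```

Note that a packet with only DF set is not a fragment: DF lies outside the mask.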

bool skb_has_frag_list(const struct sk_buff *skb);

Function: Check the SKB's fragment linked list.

This might be the most confusing name. An SKB has two fragmentation methods: one is the page array (frags[]), and the other is the SKB linked list (frag_list). This function checks the latter.

Historical rename: It used to be called skb_has_frags(), but was renamed in kernel 2.6.37. Why? Because it was too misleading, making people think it was checking the former. The improved clarity of the new name has saved the sanity of countless engineers.

IP Options: Those Features Dusty with History

Modern networks rarely use IP options, but the kernel still retains complete processing code for them.

int ip_options_compile(struct net *net, struct ip_options *opt, struct sk_buff *skb);

Function: Parsing.

It parses the option byte stream in the IP header into an internal ip_options object that the kernel can understand.

void ip_options_fragment(struct sk_buff *skb);

Function: Option cleanup during fragmentation.

When fragmentation occurs, some options (like Record Route) don't need to be copied to every fragment. This function replaces options that don't need copying with IPOPT_NOOP. Note that it is only called when processing the first fragment.
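The cleanup can be pictured as one pass over the option bytes. This is a user-space sketch over a plain buffer, under the assumption that options follow the usual type/length/value layout; the macro and function names are hypothetical, and the kernel version additionally resets its parsed ip_options bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_END       0              /* end of option list */
#define OPT_NOOP      1              /* one-byte padding option */
#define OPT_COPIED(t) ((t) & 0x80)   /* highest bit of the type byte */

/* Overwrite every option whose Copied flag is clear with NOOP bytes,
 * so the header keeps its length but loses first-fragment-only options. */
static void scrub_uncopied_options(uint8_t *opts, size_t len)
{
    assert(opts != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        if (opts[i] == OPT_END)
            return;
        if (opts[i] == OPT_NOOP) {   /* single byte, no length field */
            i++;
            continue;
        }
        if (i + 1 >= len)
            return;                  /* no room for the length byte */
        size_t optlen = opts[i + 1];
        if (optlen < 2 || optlen > len - i)
            return;                  /* malformed: stop, leave the rest */
        if (!OPT_COPIED(opts[i]))
            memset(&opts[i], OPT_NOOP, optlen);
        i += optlen;
    }
}
```

Record Route (type 7) has the Copied flag clear and is blanked out, while Loose Source Route (type 131) has it set and survives.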

void ip_options_build(struct sk_buff *skb, struct ip_options *opt, __be32 daddr, struct rtable *rt, int is_frag);

Function: Construction.

Writes the parsed ip_options object back into the IPv4 header. The parameter is_frag is actually 0 in all calls.

void ip_forward_options(struct sk_buff *skb);

Function: Handle options during forwarding.

ip_rcv_options(struct sk_buff *skb);

Function: Handle options during receiving.

int ip_options_rcv_srr(struct sk_buff *skb);

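Function: Handle the source route option on receive.

If the packet carries a source route option (the Strict Source Route case is the one discussed in this chapter), this function checks that the packet is following the path the option specifies and picks the next hop from it.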

int ip_call_ra_chain(struct sk_buff *skb);

Function: Handle the Router Alert option.

int ip_options_get_from_user(struct net *net, ...);

Function: Retrieve options from user space.

When you set IP_OPTIONS via the setsockopt() system call, the kernel uses this to bring your settings in.

Low-level and Auxiliary

static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, ...);

Function: The Raw Socket express.

When you use a Raw Socket and set the IP_HDRINCL option (meaning "I will construct the IP header myself"), the kernel calls this method. It directly calls dst_output(), bypassing much of the kernel's automatic encapsulation logic.

int ip_decrease_ttl(struct iphdr *iph);

Function: Decrement TTL by 1 and update the checksum.

Don't underestimate it. It is called every time during forwarding. Because the TTL has changed, the checksum must be updated as well. Instead of recomputing it over the whole header, the function uses the incremental update algorithm described in RFC 1141 (later refined by RFC 1624), making it highly efficient.
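Why an incremental update is enough: the TTL is the high byte of the 16-bit word it shares with the protocol field, so decrementing it lowers that word by 0x0100, and the one's-complement checksum rises by 0x0100 with an end-around carry. The sketch below applies that idea to a raw header buffer in user space; the function name and the byte-buffer interface are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decrement the TTL (byte 8) and patch the checksum (bytes 10-11)
 * without touching the rest of the header. Returns the new TTL. */
static uint8_t decrease_ttl(uint8_t *hdr)
{
    assert(hdr != NULL && hdr[8] > 0);
    uint32_t check = (uint32_t)hdr[10] << 8 | hdr[11];
    check += 0x0100;               /* compensate for TTL going down by 1 */
    check += (check >= 0xFFFF);    /* end-around carry */
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)check;      /* truncation drops the carried-out bit */
    return --hdr[8];
}
```

For a header with TTL 0x40 and checksum 0xb861, the result is TTL 0x3f and checksum 0xb961, the same value a full recomputation over the modified header produces.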

int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk, ...);

Function: Send a SYN ACK.

TCPv4 specific. You can check out tcp_v4_send_synack() in net/ipv4/tcp_ipv4.c; that's exactly what it calls.

Macro Quick Reference

Finally, here are a few key macros that often hide in the details of the code.

IPCB(skb)

Function: Get the control block.

It returns the inet_skb_parm object stored in the skb->cb control buffer. Hidden inside this object is the ip_options object we need to access.

FRAG_CB(skb)

Function: Fragmentation-specific control block.

It returns the ipfrag_skb_cb object stored in the skb->cb control buffer. This is where the fragmentation module stashes its private data inside the SKB.

int NF_HOOK(uint8_t pf, unsigned int hook, struct sk_buff *skb, ...)

Function: The gateway to Netfilter hooks.

This is the core Netfilter macro. The first parameter, pf, is the protocol family (IPv4 is NFPROTO_IPV4). The second parameter is one of the five hook points (PRE_ROUTING, LOCAL_IN, etc.). If the registered hooks don't drop the packet, it calls the final okfn callback to continue the flow.

int NF_HOOK_COND(..., bool cond)

Function: Conditional Netfilter hook.

Same as above, but with an additional boolean parameter, cond. It only actually calls the Netfilter hooks when cond is true; when cond is false, it skips the hooks and calls okfn directly. This is a performance optimization.
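The control flow of the two macros can be modeled with ordinary function pointers. Everything in this sketch is a hypothetical stand-in (struct pkt, the verdict enum, the -1 return for a dropped packet); the real macros operate on struct sk_buff and take more arguments. It only illustrates the rule "run the hooks if cond holds, then call okfn unless the packet was dropped."

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only. */
struct pkt { int delivered; };
enum verdict { VERDICT_DROP, VERDICT_ACCEPT };
typedef enum verdict (*hook_fn)(struct pkt *);
typedef int (*ok_fn)(struct pkt *);

/* Model of the conditional hook: passing cond == true gives the
 * behavior of the unconditional form. */
static int run_hook_cond(hook_fn hook, struct pkt *p, ok_fn okfn, bool cond)
{
    assert(hook != NULL && okfn != NULL && p != NULL);
    if (cond && hook(p) == VERDICT_DROP)
        return -1;        /* a hook dropped the packet; okfn never runs */
    return okfn(p);       /* hooks skipped, or packet accepted */
}

/* Example hook and continuation used to exercise the model. */
static enum verdict drop_everything(struct pkt *p) { (void)p; return VERDICT_DROP; }
static int deliver(struct pkt *p) { p->delivered++; return 0; }
```

With cond false, even a hook that drops everything is bypassed and the packet continues through okfn.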

IPOPT_COPIED(option)

Function: Check option flag bits.

It returns the Copied flag (the highest bit) in the option type. If it is 1, it means this option must be copied to all fragments; if it is 0, it only appears in the first fragment.
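The option type byte packs three fields: the Copied flag (1 bit), the option class (2 bits), and the option number (5 bits). The helpers below are a sketch with locally defined names; unlike the macro, which yields the masked bit itself, opt_copied() normalizes the result to 0 or 1.

```c
#include <assert.h>      /* static_assert (C11) */
#include <stdint.h>

#define SK_OPT_COPY        0x80
#define SK_OPT_CLASS_MASK  0x60
#define SK_OPT_NUMBER_MASK 0x1F

static_assert((SK_OPT_COPY | SK_OPT_CLASS_MASK | SK_OPT_NUMBER_MASK) == 0xFF,
              "the three fields cover the whole type byte");

static int opt_copied(uint8_t type)      { return (type & SK_OPT_COPY) != 0; }
static unsigned opt_class(uint8_t type)  { return (type & SK_OPT_CLASS_MASK) >> 5; }
static unsigned opt_number(uint8_t type) { return type & SK_OPT_NUMBER_MASK; }
```

For example, Record Route (type 7) is not copied, Loose Source Route (type 131) and Strict Source Route (type 137) are copied, and Timestamp (type 68) is class 2, number 4, not copied.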

Chapter Summary

With this, we have completed the full journey through the IPv4 stack.

This is not just a pile of functions and macros. Behind these APIs lie the trade-offs of network protocol design: Why is fragmentation done on the sending side while reassembly is done on the receiving side? Why are IP options barely used anymore, yet the kernel still spends significant effort to maintain compatibility? Why does the Raw Socket allow users to construct their own headers?

What we built in this chapter is not just knowledge about IPv4, but a "protocol implementer's" perspective. You are no longer just a user calling send(); you see how the kernel navigates through bitstreams, walking the tightrope between security and efficiency.

In the next chapter, we will turn our attention to the "brain" of this massive machine: the routing subsystem. A packet may know how to travel, but how does it know where to go? What exactly does the routing table look like, the one that decides the fork in the road between ip_local_deliver and ip_forward? See you in the next chapter.

Exercises

Exercise 1: understanding

Question: In the struct iphdr structure, the Internet Header Length (ihl) field occupies 4 bits. Suppose the kernel reads an IPv4 packet where the value of iph->ihl is 7. What is the IP header length of this packet in bytes? Does it contain IP options?

Answer and Explanation

Answer: 28 bytes; it contains IP options.

Explanation: ihl represents the header length in 4-byte units, so header length = ihl * 4. When ihl=7, the length is 7 * 4 = 28 bytes. The fixed part of an IPv4 header is 20 bytes (corresponding to ihl=5). Any header larger than 20 bytes implies the existence of IP options (the optional part). Since 28 > 20, this packet contains 8 bytes of IP options.
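The same arithmetic as a tiny helper. The function and constant names are made up for this sketch; the only facts it relies on are the 4-byte unit and the 5 to 15 range of a valid ihl.

```c
#include <assert.h>      /* static_assert (C11) */

#define IPV4_MIN_IHL 5    /* 20-byte fixed header */
#define IPV4_MAX_IHL 15   /* largest value a 4-bit field can hold */

static_assert(IPV4_MAX_IHL * 4 == 60,
              "a 4-bit ihl caps the header at 60 bytes");

/* Returns the number of option bytes, or -1 for an invalid ihl. */
static int ipv4_options_len(unsigned ihl)
{
    if (ihl < IPV4_MIN_IHL || ihl > IPV4_MAX_IHL)
        return -1;
    return (int)(ihl * 4) - IPV4_MIN_IHL * 4;
}
```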

Exercise 2: understanding

Question: When the kernel function ip_rcv() processes a received IPv4 packet, it first performs a series of sanity checks. When a check fails (such as the version number not being 4 or a checksum error), which statistics counter does the kernel update? (Please provide the macro definition name)

Answer and Explanation

Answer: IPSTATS_MIB_INHDRERRORS

Explanation: As the ip_rcv() walkthrough earlier in the chapter showed, when iph->ihl < 5 or iph->version != 4, the code jumps to the inhdr_error label, where "the packet is dropped and the statistics (IPSTATS_MIB_INHDRERRORS) are updated." This counter records the number of received IP packets with header errors.

Exercise 3: application

Question: Suppose you need to develop a program that sends custom IP packets through a Raw Socket, and you do not want the packet to be fragmented by routers during transmission (e.g., for Path MTU Discovery testing). Which flag in the IPv4 header's frag_off field should you manually set? Please provide the constant name of this flag and its corresponding hexadecimal value.

Answer and Explanation

Answer: Flag name: IP_DF (Don't Fragment flag); Hexadecimal value: 0x4000.

Explanation: The upper 3 bits of the 16-bit frag_off field are flags, and the pattern 010 is DF (Don't Fragment). Because those three bits sit at the top of the field, 010 followed by thirteen zero bits gives the mask 0x4000, which is the value of IP_DF in include/net/ip.h (0x02 would only be the value of the 3-bit flag group taken in isolation). The field is stored in network byte order, so it is written as htons(IP_DF). After setting this bit, if the packet encounters a link with an MTU smaller than the packet length, the router will drop the packet and return an ICMP Fragmentation Needed message, which is exactly what the Path MTU Discovery mechanism requires.
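When you build the header yourself, the DF bit has to land in the top bits of the 16-bit frag_off field, which occupies bytes 6 and 7 of the header in network byte order. A sketch, assuming a raw header buffer and a locally defined constant with the mask value 0x4000 (the function name is hypothetical):

```c
#include <arpa/inet.h>   /* htons */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_IP_DF 0x4000   /* Don't Fragment: flag pattern 010 in the top 3 bits */

/* Set DF in the frag_off field of a raw IPv4 header, preserving the
 * other flag and offset bits. memcpy avoids unaligned access. */
static void set_dont_fragment(uint8_t *hdr)
{
    assert(hdr != NULL);
    uint16_t frag_off;
    memcpy(&frag_off, hdr + 6, sizeof frag_off);
    frag_off |= htons(SK_IP_DF);
    memcpy(hdr + 6, &frag_off, sizeof frag_off);
}
```

On the wire this yields the bytes 0x40 0x00 for an otherwise empty frag_off field, regardless of host endianness.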

Exercise 4: application

Question: When the kernel calls ip_forward() to process a packet in the forwarding path, it calls the ip_decrease_ttl() function. Besides decrementing the TTL field in the IPv4 header by 1, what other critical operation must this function perform to maintain the packet's validity?

Answer and Explanation

Answer: Recalculate and update the IPv4 header checksum.

Explanation: The IPv4 header checksum (the check field) covers the entire header. When the TTL field changes, the original checksum is no longer valid. As listed in the quick reference, ip_decrease_ttl() is responsible for decrementing the TTL in the IPv4 header by 1 and updating the checksum. If only the TTL is modified without updating the checksum, the next hop receiving this packet will drop it due to a checksum error.

Exercise 5: thinking

Question: When designing the network stack, the Linux kernel distinguishes between two main transmit helper functions, ip_append_data() and ip_push_pending_frames(), and later introduced ip_make_skb() for the UDP fast path. Please comparatively analyze why the "lockless transmission" design of ip_make_skb() can provide better performance than the traditional mode in a Symmetric Multiprocessing (SMP) environment? What type of overhead does this design primarily avoid in the traditional path?

Answer and Explanation

Answer: ip_make_skb() integrates data preparation and SKB construction in a single call and builds the packet on a private queue rather than on the socket's shared write queue, so the non-corked UDP path can construct and send a packet without taking the socket lock. The traditional mode (ip_append_data() paired with ip_push_pending_frames()) performs queue management and fragmentation calculations while holding the socket lock, which leads to lock contention when several threads send on the same socket. The fast path therefore primarily avoids the overhead of acquiring the socket lock and of doing heavy processing while it is held, thereby improving concurrent throughput.

Explanation: This is an in-depth analysis question. ip_append_data() prepares data but does not send it; it has to be paired with ip_push_pending_frames(). That means maintaining the socket's send queue state, which involves locking. ip_make_skb() was introduced in kernel 2.6.39 to "implement a lockless transmission fast path." Its core advantage lies in decoupling the packet construction logic from the socket's state management, allowing most of the work to be completed without lock contention. In high-concurrency UDP scenarios, reducing the hold time of the socket lock is the key to improving performance, because lock contention leads to CPU cache invalidation and process queuing.

Key Takeaways

Although the IPv4 header seems to be only 20 bytes, every bit hides a design pitfall. The kernel uses struct iphdr to describe this structure, where the ihl field defines the actual length of the header in 4-byte units, allowing the IPv4 header to vary between 20 and 60 bytes to accommodate options. The Type of Service (TOS) field has evolved multiple times and is now primarily used for DSCP traffic control and Explicit Congestion Notification (ECN). The fragmentation mechanism multiplexes the 16-bit frag_off field: the lower 13 bits record the offset in 8-byte units, while the upper 3 bits are flags (one reserved bit, then DF (Don't Fragment) and MF (More Fragments)). Although this bit-field multiplexing saves space, it also requires strict masking during parsing; otherwise it is extremely easy to be misled by magic numbers.
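As a worked example of that masking, the sketch below splits a host-order frag_off value into its flags and a byte offset. The struct, function, and SK_-prefixed constant names are local to this example; the mask values follow from the bit layout just described.

```c
#include <assert.h>      /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

#define SK_IP_DF     0x4000   /* Don't Fragment */
#define SK_IP_MF     0x2000   /* More Fragments */
#define SK_IP_OFFSET 0x1FFF   /* offset, in 8-byte units */

static_assert((SK_IP_DF | SK_IP_MF | SK_IP_OFFSET) == 0x7FFF,
              "bit 15 is the reserved flag");

struct frag_info {
    bool df;
    bool mf;
    unsigned byte_offset;   /* offset already multiplied by 8 */
};

/* frag_off_host must already be converted with ntohs(). */
static struct frag_info decode_frag_off(uint16_t frag_off_host)
{
    struct frag_info fi;
    fi.df = (frag_off_host & SK_IP_DF) != 0;
    fi.mf = (frag_off_host & SK_IP_MF) != 0;
    fi.byte_offset = (unsigned)(frag_off_host & SK_IP_OFFSET) * 8u;
    return fi;
}
```

For instance, 0x20B9 decodes to MF set, DF clear, and an offset of 185 * 8 = 1480 bytes, which is what the second fragment of a datagram split for a 1500-byte MTU typically carries.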

The receiving process of an IPv4 packet is not accomplished in a single step; it is a relay race with checks at every level. The ip_rcv function plays the role of a "gatekeeper," strictly responsible for verifying the header format, version number, and checksum. Once an anomaly is detected (such as ihl being less than 5), it drops the packet immediately. After passing these checks, the packet is handed over to ip_rcv_finish, whose core task is to perform a routing lookup. This step queries the routing subsystem, encapsulates the result in a dst_entry, and depending on the destination address, points the packet's input callback pointer to either the local delivery function ip_local_deliver, the forwarding function ip_forward, or the multicast handling function ip_mr_input, thereby deciding whether the packet "stays" or "moves on."

The processing logic for multicast packets is more complex than for unicast because it needs to simultaneously determine the dual identity of "receiver" and "forwarder." During the routing lookup phase, the kernel calls ip_check_mc_rcu to check if the local interface has subscribed to the multicast group, while also checking if multicast forwarding is enabled (controlled by the user-space pimd daemon manipulating the kernel switch mc_forwarding). If it is destined for the local machine, the packet enters ip_local_deliver and is ultimately delivered to the socket; if it needs to be forwarded, it enters ip_mr_input. Interestingly, even for multicast forwarding, the kernel strictly adheres to the fundamental laws of IP networking, calling ip_decrease_ttl to decrement the TTL and recalculate the checksum before sending, to prevent routing loops.

IP options, which rarely appear in modern networks, nonetheless have a complete compilation and processing mechanism inside the kernel. Because options can cause router performance degradation, and during fragmentation only some options (determined by the Copied Flag) are copied to subsequent fragments, handling them is extremely tedious. The kernel parses the raw byte stream into a struct ip_options structure via ip_options_compile. For options like Record Route (RR) and Timestamp, the kernel needs to dynamically fill in the current interface IP or time, which modifies the header and invalidates the checksum. Furthermore, for security reasons, the system disables Source Routing (SSRR/LSRR) options by default, preventing attackers from using them for IP spoofing or bypassing firewalls.

On the transmit path, TCP and UDP exhibit completely different personalities, using the ip_queue_xmit and ip_append_data mechanisms, respectively. As a "worrywart" protocol, TCP prefers to manage its own data segmentation. It usually carries an already assembled SKB downstream, requiring only a quick routing cache lookup to send. UDP, on the other hand, acts like a "hands-off" manager, dumping all data to the IP layer, where ip_append_data is responsible for page-level assembly, data merging, and header construction in kernel space, before ip_push_pending_frames finally sends it out uniformly. This design difference reflects the two protocols' different allocations of reliability control: TCP wants to master every detail, while UDP pursues efficiency and minimalism.
