
14.11 The PCI Subsystem

In the previous section, we covered the kernel's notification chains, a decoupling mechanism that lives entirely in software.

Now, let's shift our focus from pure software logic back to the physical world. Looking at that network card plugged into a server, or that Wi-Fi module soldered onto a development board, you might ask: how does the kernel discover this hardware? And how does it know whether the board is from Intel or Realtek?

For the vast majority of modern network devices, the answer points to the same foundation — the PCI subsystem.

Not All Network Cards Are Created Equal

First, let's be clear: not all network interfaces are PCI devices.

In the embedded world, many network interfaces are directly integrated into the SoC (System on Chip). They hang off the CPU's internal buses (such as AHB, AXI, or PLB) rather than the PCI bus. The initialization and handling of these devices are completely different, and the discussion in this section does not apply to them.

But for x86 servers, desktop PCs, and many embedded boards, network cards (especially today's mainstream PCIe cards) are indeed PCI devices. Particularly since the introduction of the PCI Express (PCIe) standard in the early 2000s, the original parallel PCI bus has gradually given way to serial PCIe, which not only brought higher bandwidth but also changed how software interacts with the bus.

Configuration Space: The Device's "ID Card"

Every PCI device, no matter how complex, has a configuration space (largely read-only identification data, though some fields are writable, as we'll see below).

  • For legacy PCI devices, this space is at least 256 bytes.
  • For PCI-X 2.0 and PCI Express devices, this space is extended to 4096 bytes (extended configuration space).

You can think of this space as the device's "ID card" and "manual." It records the vendor ID, device ID, class code, base address registers, interrupt line, and a series of other critical fields. Without reading this space, the kernel has no idea what the device is, let alone how to drive it.

When you type the lspci command in your terminal, what you see is the decoded version of this information. If you prefer the raw hexadecimal data (much closer to what the kernel sees), add one of the -x variants:

  • lspci -xxx: Displays a hex dump of the standard PCI configuration space.
  • lspci -xxxx: Displays a hex dump of the extended PCI configuration space.

In kernel code, reading and writing these fields can't be done by simply dereferencing memory; we must go through the PCI API. Linux provides three pairs of accessors, at 8-bit, 16-bit, and 32-bit granularity:

Reading configuration space:

static inline int pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val);
static inline int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val);
static inline int pci_read_config_dword(const struct pci_dev *dev, int where, u32 *val);

Writing configuration space:

static inline int pci_write_config_byte(const struct pci_dev *dev, int where, u8 val);
static inline int pci_write_config_word(const struct pci_dev *dev, int where, u16 val);
static inline int pci_write_config_dword(const struct pci_dev *dev, int where, u32 val);
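
Here is a minimal sketch of these accessors in use (the function name show_ids is invented for illustration); note that real drivers rarely read these two fields themselves, since the PCI core caches them in struct pci_dev:

#include <linux/pci.h>

/* Read the vendor and device IDs straight from configuration space.
 * PCI_VENDOR_ID (offset 0x00) and PCI_DEVICE_ID (offset 0x02) are
 * standard offsets defined in include/uapi/linux/pci_regs.h. */
static void show_ids(struct pci_dev *pdev)
{
    u16 vendor, device;

    pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
    pci_read_config_word(pdev, PCI_DEVICE_ID, &device);
    pci_info(pdev, "vendor=0x%04x device=0x%04x\n", vendor, device);
}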

Finding Your Driver: pci_device_id

How does the kernel know which driver to assign to which network card?

It relies on matching. Every PCI vendor fills in specific values in the configuration space's vendor, device, and class fields. The kernel's PCI subsystem uses these values to identify the device.

In driver code, we use a pci_device_id structure to describe "which devices I can support." This structure is defined in include/linux/mod_devicetable.h:

struct pci_device_id {
    __u32 vendor, device;       /* Vendor and device ID or PCI_ANY_ID */
    __u32 subvendor, subdevice; /* Subsystem ID's or PCI_ANY_ID */
    __u32 class, class_mask;    /* (class,subclass,prog-if) triplet */
    kernel_ulong_t driver_data; /* Data private to the driver */
};

Here, vendor and device are the core fields; for the vast majority of drivers, filling in just these two is sufficient. If you see PCI_ANY_ID, it means "I don't care about this value; anything matches."
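
As an illustration, a match table for a hypothetical card (the mynic names and IDs are invented) typically looks like this; the PCI_DEVICE() helper fills in the two core fields and wildcards the sub-IDs:

#include <linux/module.h>
#include <linux/pci.h>

#define MYNIC_VENDOR_ID 0x1234  /* hypothetical */
#define MYNIC_DEVICE_ID 0x5678  /* hypothetical */

static const struct pci_device_id mynic_ids[] = {
    { PCI_DEVICE(MYNIC_VENDOR_ID, MYNIC_DEVICE_ID) },
    { }                         /* terminating all-zero entry */
};
MODULE_DEVICE_TABLE(pci, mynic_ids); /* lets udev/modprobe autoload the module */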

The Driver Skeleton: struct pci_driver

The core of a PCI driver is a pci_driver object. It acts as a contract between the driver and the kernel, defining what the driver is called, what it supports, and what to do when a device is inserted or removed.

Let's look at its skeleton (defined in include/linux/pci.h):

struct pci_driver {
    ...
    const char *name;
    const struct pci_device_id *id_table; /* must be non-NULL for probe to be called */
    int (*probe)(struct pci_dev *dev, const struct pci_device_id *id); /* New device inserted */
    void (*remove)(struct pci_dev *dev); /* Device removed (NULL if not a hot-plug capable driver) */
    int (*suspend)(struct pci_dev *dev, pm_message_t state); /* Device suspended */
    ...
    int (*resume)(struct pci_dev *dev); /* Device woken up */
    ...
};

Here are a few key fields (a minimal instance follows the list):

  • name: The driver name, human-readable.
  • id_table: This is the pci_device_id array we just mentioned. Through it, the driver tells the kernel: "I manage these devices." If you don't fill in this table, the probe function will never be called.
  • probe: This is the main event. When the kernel finds a device that matches your id_table, it calls this function. You need to do all initialization work here: requesting resources, mapping memory, registering the network device, etc.
  • remove: Called when the device is removed (or the driver is unloaded). Its job is usually to clean up all resources allocated in probe. If you don't clean up, you'll get memory leaks.
  • suspend / resume: Power management callbacks. Triggered when the device enters a low-power state or is woken up.
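
Putting these together, and continuing the hypothetical mynic example from above, a minimal pci_driver instance might look like this:

static int mynic_probe(struct pci_dev *pdev, const struct pci_device_id *id);
static void mynic_remove(struct pci_dev *pdev);

static struct pci_driver mynic_driver = {
    .name     = "mynic",    /* appears under /sys/bus/pci/drivers/ */
    .id_table = mynic_ids,  /* the match table defined earlier */
    .probe    = mynic_probe,
    .remove   = mynic_remove,
};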

The Device Incarnation: struct pci_dev

How does the kernel represent a specific PCI device? Through struct pci_dev.

This structure is very large and contains all the dynamic information about the device. Let's pick out a few core fields (defined in include/linux/pci.h):

struct pci_dev {
    ...
    unsigned short vendor;
    unsigned short device;
    unsigned short subsystem_vendor;
    unsigned short subsystem_device;
    ...
    struct pci_driver *driver;  /* which driver has allocated this device */
    ...
    pci_power_t current_state;  /* Current operating state. In ACPI-speak,
                                   this is D0-D3, D0 being fully functional,
                                   and D3 being off. */
    struct device dev;          /* Generic device interface */
    int cfg_size;               /* Size of configuration space */
    unsigned int irq;           /* IRQ assigned to the device */
};

As you can see, it contains both static information read from the configuration space (vendor, device) and state allocated at kernel runtime (driver, current_state, irq). That embedded struct device dev is standard practice in the Linux Device Model — through it, PCI devices can be attached to the unified /sys/devices tree.
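
Because dev is embedded rather than pointed to, the kernel can recover the enclosing pci_dev from a generic struct device with container_of(). The PCI layer wraps this as the to_pci_dev() macro in include/linux/pci.h:

#define to_pci_dev(n) container_of(n, struct pci_dev, dev)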

Registration and Initialization Flow

Getting a PCI driver up and running follows this standard flow (a code sketch follows the list):

  1. Define a pci_driver object: Fill in the name, ID table, and callback functions.
  2. Register the driver: Call pci_register_driver().
    • This is typically done in the driver's module_init().
    • Once called, the PCI core layer immediately scans the devices on the bus to see if any match your id_table. If there's a match, probe is called right away.
  3. Initialize the device in probe:
    • Call pci_enable_device(): This is a critical step. It wakes the device (if it's sleeping) and activates the device's I/O and memory resources. Without this call, the device remains dead.
    • Call request_irq(): Register the interrupt handler so you can process packet arrival signals from the network card.
    • DMA setup: Allocate DMA buffers.
  4. Unregister the driver: Call pci_unregister_driver().
    • Typically called in module_exit().
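
Here is a sketch of steps 2 through 4 for the hypothetical mynic driver, supplying the probe and remove bodies promised earlier; the interrupt handler is a stub, and error handling is trimmed to the essentials:

#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/pci.h>

static irqreturn_t mynic_interrupt(int irq, void *dev_id)
{
    /* acknowledge the NIC and schedule RX/TX processing here */
    return IRQ_HANDLED;
}

static int mynic_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err;

    err = pci_enable_device(pdev); /* wake the device, enable I/O and memory */
    if (err)
        return err;

    err = request_irq(pdev->irq, mynic_interrupt, IRQF_SHARED, "mynic", pdev);
    if (err) {
        pci_disable_device(pdev);
        return err;
    }

    /* DMA buffer allocation and register_netdev() would follow here */
    return 0;
}

static void mynic_remove(struct pci_dev *pdev)
{
    free_irq(pdev->irq, pdev);  /* undo probe's work, in reverse order */
    pci_disable_device(pdev);
}

static int __init mynic_init(void)
{
    return pci_register_driver(&mynic_driver); /* may call probe immediately */
}
module_init(mynic_init);

static void __exit mynic_exit(void)
{
    pci_unregister_driver(&mynic_driver);
}
module_exit(mynic_exit);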

A Small DMA Detail: Coherent Memory

The high performance of PCI network cards relies entirely on DMA (Direct Memory Access). The device reads and writes memory directly, without the CPU moving data.

But there's a pitfall here: cache coherency. The CPU may write data that is still sitting in its cache (say, L1) and hasn't reached main memory yet. If the network card DMAs from that memory at this point, it reads stale data, and a corrupted packet goes out on the wire.

The standard solution to this problem is using dma_alloc_coherent() / dma_free_coherent().

void *dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flag);

Memory allocated by this API is "cache-coherent." That means the CPU and the device always see the same view of this memory. You don't need to manually call operations like cache flush — the kernel handles it for you.
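
A minimal sketch of allocating a descriptor ring this way (RING_SIZE, the function names, and the global variables are invented for illustration):

#include <linux/dma-mapping.h>
#include <linux/pci.h>

#define RING_SIZE 4096          /* hypothetical ring size, in bytes */

static void *ring_virt;         /* address the CPU uses */
static dma_addr_t ring_dma;     /* address the device uses */

static int alloc_ring(struct pci_dev *pdev)
{
    ring_virt = dma_alloc_coherent(&pdev->dev, RING_SIZE,
                                   &ring_dma, GFP_KERNEL);
    if (!ring_virt)
        return -ENOMEM;
    /* hand ring_dma to the NIC; touch the ring via ring_virt on the CPU side */
    return 0;
}

static void free_ring(struct pci_dev *pdev)
{
    dma_free_coherent(&pdev->dev, RING_SIZE, ring_virt, ring_dma);
}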

You can see this in many high-performance network card drivers, such as e1000_alloc_ring_dma() in Intel's e1000e driver (drivers/net/ethernet/intel/e1000e/netdev.c).

⚠️ Warning Don't just casually use kmalloc for DMA buffers, unless you know exactly what you're doing and know when to call dma_map_single / dma_unmap_single. Using dma_alloc_coherent has slightly more overhead, but it's safe and worry-free.
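
For contrast, here is a hedged sketch of the streaming approach that warning alludes to: mapping an existing buffer for a single device access (the function and its callers are hypothetical):

#include <linux/dma-mapping.h>

static int send_buffer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE); /* flushes caches as needed */
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... point the device at 'handle' and wait for the DMA to finish ... */

    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
    return 0;
}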

Note Single Root I/O Virtualization (SR-IOV) is a cool PCI feature that allows a single physical device to masquerade as multiple virtual devices (VFs). This is extremely useful in virtualization environments, as it allows virtual network cards to be passed through directly to virtual machines. For details, see the kernel documentation Documentation/PCI/pci-iov-howto.txt.


Wake-On-LAN (WOL): Remote Wake-Up

Sometimes you want to remotely wake up a machine after it's been shut down — that's what Wake-On-LAN (WOL) does.

It allows a machine in a soft-off state to be woken by a special network packet. By default, this feature is disabled, because nobody wants their computer to suddenly light up in the middle of the night due to a broadcast packet.

In Linux, for a network device driver to support WOL, it needs to define a set_wol() callback in the ethtool_ops object.
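
A minimal sketch of that driver side, for a card that only supports MagicPacket (the mynic names are hypothetical, and a real get_wol would report actual hardware state):

#include <linux/ethtool.h>
#include <linux/netdevice.h>

static void mynic_get_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
{
    wol->supported = WAKE_MAGIC;    /* we can only wake on MagicPacket */
    wol->wolopts = 0;               /* report what is currently enabled */
}

static int mynic_set_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
{
    if (wol->wolopts & ~WAKE_MAGIC)
        return -EOPNOTSUPP;         /* refuse wake-up modes we can't do */
    /* program the NIC's wake-up filter registers here */
    return 0;
}

static const struct ethtool_ops mynic_ethtool_ops = {
    .get_wol = mynic_get_wol,
    .set_wol = mynic_set_wol,
};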

You can use the ethtool command to check and configure this:

  • ethtool <ethX>: Check whether the network card supports WOL.
  • ethtool -s eth1 wol g: Enable WOL and specify that it should only respond to MagicPacket (a special frame format defined by AMD).

How do you send this wake-up packet? You can use the ether-wake tool from the net-tools package. When you send a MagicPacket, the target machine's network card (which usually stays in a low-power standby state even when the host is shut down) detects it and triggers the motherboard to power on.

There's a classic implementation of this in Realtek's 8139cp driver (drivers/net/ethernet/realtek/8139cp.c).


Link Aggregation: From Bonding to Teaming

If you need higher bandwidth or redundancy, you typically bind two (or more) network cards together. This is Link Aggregation.

In the past, we used the bonding driver (drivers/net/bonding). It's stable, but all of its logic lives in the kernel, which bloats the module and means any change to the policy logic requires patching and recompiling kernel code.

The new approach is called the Teaming Driver.

Its core idea is: push logic down, move control up.

  • The kernel driver part is only responsible for the most essential packet forwarding.
  • Complex control logic (such as LACP protocol calculations, port selection policies) is handed off to the user-space daemon teamd.
  • teamd and the kernel driver communicate through the libteam library, with Generic Netlink underneath (which we discussed in Chapter 2).

The Teaming driver supports four modes, all under drivers/net/team:

  1. loadbalance (drivers/net/team/team_mode_loadbalance.c): Hash-based load balancing across ports; this is also the mode used for LACP (the 802.3ad standard), with teamd doing the protocol work.
  2. activebackup (drivers/net/team/team_mode_activebackup.c): One active, multiple backups. Only one port is working while the others stand by; if the active one goes down, a backup immediately takes over.
  3. broadcast (drivers/net/team/team_mode_broadcast.c): Simple and brute-force; every packet is duplicated and sent out of all ports.
  4. roundrobin (drivers/net/team/team_mode_roundrobin.c): Round-robin distribution, with no user-space intervention needed.

This project is primarily developed by Jiri Pirko. For more information, check out http://libteam.org/.


The PPPoE Protocol: Behind Dial-Up Internet

Finally, let's briefly mention PPPoE (Point-to-Point Protocol over Ethernet).

Although fiber optics is widespread now, if you've used early ADSL broadband, you're definitely familiar with this protocol. Its core purpose is to simulate a point-to-point connection over a multi-access network like Ethernet, enabling user authentication and billing.

PPPoE is divided into two phases:

1. Discovery Phase

This is for the client (your computer) to find the ISP's access device (Access Concentrator, AC). It's like walking into a building and having to find the landlord who collects the rent.

This process involves a four-step handshake:

  • PADI (PPPoE Active Discovery Initiation):
    • The host broadcasts: "Is anyone here? I want to get online."
    • Code: 0x09, Session ID: 0x0000.
  • PADO (PPPoE Active Discovery Offer):
    • The AC (landlord) hears it and replies with a unicast: "I'm the landlord, I'm here."
    • Code: 0x07, Session ID: 0x0000.
  • PADR (PPPoE Active Discovery Request):
    • The host selects a specific AC and sends a request: "I want to establish a connection with you."
    • Code: 0x19, Session ID: 0x0000.
  • PADS (PPPoE Active Discovery Session-confirmation):
    • The AC replies: "Approved, here's your assigned ID."
    • Code: 0x65, Session ID: <a unique non-zero ID>.

After this, both sides communicate using this Session ID. If either side wants to disconnect, it sends a PADT (PPPoE Active Discovery Terminate) packet (Code: 0xa7).

All five of these packet types (PADI, PADO, PADR, PADS, and PADT) carry an Ethernet Type of 0x8863.

2. Session Phase

Once Discovery is complete, both sides enter the Session phase.

At this point, the Ethernet Type changes to 0x8864. Each frame still carries the 6-byte PPPoE header, but its payload is no longer discovery tags: it is a PPP frame, starting with the standard 2-byte PPP protocol field.
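
For reference, here is a simplified view of that 6-byte header, adapted from struct pppoe_hdr in include/uapi/linux/if_pppox.h (the real definition splits the first byte into endian-dependent bitfields):

#include <linux/types.h>

struct pppoe_hdr_sketch {
    __u8   ver_type;  /* version (4 bits) + type (4 bits); 0x11 in practice */
    __u8   code;      /* 0x09/0x07/0x19/0x65/0xa7 in Discovery, 0x00 in Session */
    __be16 sid;       /* session ID, assigned by PADS */
    __be16 length;    /* payload length */
} __attribute__((packed));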

In this phase, you can run the PPP protocol — authenticate with PAP or CHAP, negotiate link parameters with LCP, and obtain an IP address with IPCP. This is why dial-up software ultimately gets an IP address.

The key to understanding PPPoE lies in that 6-byte header. If you capture packets and see 0x8863 and 0x8864, you know that a PPPoE story is unfolding on that network.