2.2 Technical Preparation and the Communication Path Landscape

Before we dive in, we need to make sure our tools are ready.

Technical Preparation

Assuming you've followed the "Preparation" section in the preface, you should have a virtual machine running Ubuntu 18.04 LTS (or a newer stable release) with all the necessary packages installed. If you haven't done this yet, we strongly recommend setting it up first—sharpening the axe before chopping the wood.

To get the most out of this book, we also highly recommend setting up your working environment and cloning this book's GitHub repository (https://github.com/PacktPublishing/Linux-Kernel-Programming-Part-2). We are going to get our hands dirty, not just read about it.

The User-Kernel Communication Path Landscape

As we mentioned in the introduction, the core task of this chapter is to establish efficient information transfer between kernel-space components (usually device drivers, but potentially any kernel module) and user-space processes (or threads). Before we start writing code, let's take stock of the tools at our disposal.

You can think of a user-space component as a C program, a shell script (we often showcase these two in the book), or even a Python or Perl script.

In the previous chapter, we touched upon the edges of this topic: the system call API. This is the fundamental pathway for user-space applications to interact with the kernel (including device drivers). In the last chapter, you learned how to write a simple character device driver and how user-space programs pass data via the read(2) and write(2) system calls. This triggers the VFS to invoke the read/write method in your driver, and you completed the data transfer using the copy_{from|to}_user() API.

At this point, you might ask: Isn't this already done? What else is there to learn?

Ah, there's a lot more.

The reality is that beyond the standard read/write, there are a bunch of other interface technologies. They all inevitably rely on system calls—after all, there's no other way to synchronously enter the kernel from user space. But different paths offer different scenery. The goal of this chapter is to show you all these paths. Of course, no single method is a silver bullet; which one you choose depends on whether your project calls for a scalpel or an axe.

Here are the communication methods we will cover in this chapter:

Via the traditional procfs interface
Via sysfs
Via debugfs
Via netlink sockets
Via the ioctl(2) system call

We will use code examples to thoroughly explain the ins and outs of these technologies. In addition, we'll briefly discuss how practical they are in debugging scenarios.

2.3 Via the procfs Interface

In this section, we'll discuss the proc filesystem (procfs) and how to use it to bridge user space and kernel space. It is a powerful and easy-to-program interface that was once commonly used to report status and to debug core kernel subsystems.

But before we start, we need to pour some cold water on this: Starting with Linux 2.6, if you want to contribute code to the mainline kernel, this interface is off-limits for driver authors—it is strictly reserved for internal kernel use. Despite this, for the sake of a complete knowledge base, we still need to cover it.

Understanding the proc Filesystem

Linux has a virtual filesystem called proc, which is mounted by default at /proc.

The first thing to understand about procfs is a counter-intuitive fact: its contents are not on disk. They reside in RAM and are therefore volatile. The files and directories you see under /proc are all pseudo-files that kernel code sets up for procfs. To hint at this, the kernel (almost) always reports the size of these files as 0:

$ mount | grep -w proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
$ ls -l /proc/
total 0
dr-xr-xr-x  8 root  root          0 Jan 27 11:13 1/
dr-xr-xr-x  8 root  root          0 Jan 29 08:22 10/
-r--r--r--  1 root  root          0 Jan 29 08:22 consoles
-r--r--r--  1 root  root          0 Jan 29 08:19 cpuinfo
-r--r--r--  1 root  root          0 Jan 29 08:20 devices
[...]
-r--r--r--  1 root  root          0 Jan 29 08:22 vmstat
-r--r--r--  1 root  root          0 Jan 29 08:22 zoneinfo
$

We can summarize a few key points about procfs:

Objects under /proc (files, directories, symlinks, etc.) are all pseudo-objects that live in RAM!

Directories under /proc: directories with integer names represent the processes currently alive on the system. The directory name is the process's PID (technically the TGID; the companion guide, Linux Kernel Programming, covers the difference between the two).

This /proc/PID/ folder contains all the detailed information about that process. For example, for the init or systemd process (always PID 1), you can find all its details (attributes, open files, memory layout, child processes, etc.) under /proc/1/.

As an example, let's get a root shell on our x86_64 virtual machine and see what's inside /proc/1/:

(Screenshot showing the contents of the /proc/1 directory goes here)

For complete details on these pseudo-files and folders under /proc/<PID>/..., you can check the man page for proc(5) (man 5 proc), which is highly recommended reading.
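
To make the point about integer-named directories concrete, here is a minimal user-space C sketch (not part of the book's code repository) that decides whether a /proc entry name looks like a PID and counts such entries in a directory. The function names is_pid_dirname() and count_pid_entries() are our own, chosen purely for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Returns 1 if 'name' looks like a /proc/<PID> directory name, that is,
 * it is non-empty and consists only of decimal digits; returns 0 otherwise. */
int is_pid_dirname(const char *name)
{
    assert(name != NULL);
    if (*name == '\0')
        return 0;
    for (const char *p = name; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}

/* Counts the entries in 'dirpath' whose names look like PIDs.
 * Returns the count, or -1 if the directory cannot be opened. */
int count_pid_entries(const char *dirpath)
{
    DIR *dir;
    struct dirent *de;
    int n = 0;

    assert(dirpath != NULL);
    dir = opendir(dirpath);
    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        if (is_pid_dirname(de->d_name))
            n++;
    }
    closedir(dir);
    return n;
}
```

Calling count_pid_entries("/proc") on a typical Linux system should give roughly the number of processes currently alive; the exact figure naturally changes from moment to moment.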

Note: The specific contents under /proc depend on the kernel version and (CPU) architecture. Typically, x86_64 will have the richest set of contents.

The Purpose of procfs

procfs serves two main purposes:

It provides a simple interface that allows developers, system administrators (or anyone) to peek deep into the kernel internals to obtain internal information about processes, the kernel itself, and even hardware. Using this interface only requires basic knowledge of shell commands like cd, cat, echo, and ls.
As the root user (and sometimes even as the owner), you can write data to certain pseudo-files under /proc/sys to dynamically adjust kernel parameters. This feature is called sysctl. For example, you can adjust various IPv4 network parameters under /proc/sys/net/ipv4/.

Modifying a proc-based tunable is very simple. As an example, let's change the maximum number of threads allowed on the system. Run the following command as root:

# cat /proc/sys/kernel/threads-max
15741
# echo 10000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
10000
#

That's it. However, the above change is volatile: it lasts only until the next reboot. How do we make it permanent? The short answer is: use the sysctl(8) utility. Please refer to its man page for specific details.
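
The same query can be made from a C program, since a sysctl tunable is just a pseudo-file. The following is a minimal user-space sketch, assuming the tunable holds a single decimal integer; read_tunable() is a hypothetical helper name, not an existing library call.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Reads a single decimal integer from the pseudo-file at 'path'
 * (for example "/proc/sys/kernel/threads-max") into *out.
 * Returns 0 on success or a negative errno-style value on failure. */
int read_tunable(const char *path, long *out)
{
    FILE *fp;
    int ret = 0;

    assert(path != NULL && out != NULL);
    fp = fopen(path, "r");
    if (!fp)
        return -errno;          /* e.g. -ENOENT or -EACCES */
    if (fscanf(fp, "%ld", out) != 1)
        ret = -EINVAL;          /* content was not an integer */
    fclose(fp);
    return ret;
}
```

For instance, read_tunable("/proc/sys/kernel/threads-max", &val) should leave in val the same number that cat printed above.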

Ready to write some procfs interface code? Not so fast—the next section explains why this might not be a good idea.

procfs is Off-Limits for Driver Authors

Although we can use procfs to interact with user-space applications, there is a critical point to note!

You must realize that procfs, like many similar facilities in the kernel, is part of the ABI (Application Binary Interface). The kernel community does not guarantee that it will remain stable or maintain its current form forever, just like internal kernel APIs and data structures. In fact, starting with the 2.6 kernel, the kernel heavyweights have made it very clear—device driver authors (and similar roles) should not use procfs for their own purposes, whether for debugging or anything else.

Honestly, this rule is quite counter-intuitive—after all, it's quite useful and the API is simple, but the kernel community simply won't allow it.

In the early 2.6 Linux days, using proc for this kind of thing (referred to by the kernel community as "abuse," since proc is designed strictly for internal kernel use!) was quite common. Since procfs is considered off-limits, what facilities should driver authors use to communicate with user-space processes? Driver authors should use the sysfs facility to export interfaces. Actually, it's not just sysfs; you have several options: sysfs, debugfs, netlink sockets, and the ioctl system call. We will cover these in detail later in this chapter.

But wait a moment. The reality is that this rule about "driver authors shouldn't use procfs" primarily targets the community. This means that if you plan to upstream your driver or kernel module to the mainline kernel—that is, contributing code under the GPLv2 license—then all community rules absolutely apply. If you don't plan to upstream, it's really up to you. Of course, following the kernel community's guidelines and rules is always a good practice, and we strongly recommend that you do so.

Using the procfs Interface from User Space

As kernel module or device driver developers, we can actually create our own entries under /proc and use them as a simple interface to user space. How do we do this? The kernel provides a set of APIs for us to create directories and files under procfs. In this section, we'll learn how to use them.

Basic procfs API

We won't dive into the gory details of the procfs API set here; we'll only cover what's enough for you to understand and use it. For the details, look to the ultimate resource: the kernel source tree. The routines we'll discuss are all exported, so they are available to driver authors like you. Also, as mentioned earlier, all procfs file objects are actually pseudo-objects, meaning they only exist in RAM.

Here, we assume you already know how to design and implement a simple LKM; for more details, please refer to Chapters 4 and 5 of this book's companion guide, Linux Kernel Programming.

Let's explore a few simple procfs APIs that help you do a few key things—create a directory under procfs, create (pseudo) files inside it, and remove them. Before doing any of this, make sure to include the relevant header file: #include <linux/proc_fs.h>.

Step 1: Create a folder

First, we need a "room" to put our things in. Let's create a directory named name under /proc:

struct proc_dir_entry *proc_mkdir(const char *name,
                         struct proc_dir_entry *parent);

The first parameter is the directory name, and the second parameter is the pointer to the parent directory under which it should be created. Passing NULL means creating it under the root directory, which is /proc. Save the return value, as subsequent APIs will typically need it.

The proc_mkdir_data() routine also allows you to pass a data item along (void *); note that it is exported via EXPORT_SYMBOL_GPL.

Step 2: Create a file

Now that the room is built, let's put some files in it. Let's create a procfs (pseudo) file named /proc/parent/name:

struct proc_dir_entry *proc_create(const char *name, umode_t mode,
                         struct proc_dir_entry *parent,
                         const struct file_operations *proc_fops);

The most critical parameter here is the pointer to struct file_operations, which we introduced in the previous chapter. You need to fill it with "method" implementations (we'll detail this shortly). This is incredibly powerful: through the fops structure, you set up "callback" functions in your driver (or kernel module), and the kernel's procfs layer honors them. When a user-space process reads your proc file, the VFS invokes your driver's .read method; when a user-space application writes to it, the VFS invokes your driver's .write method. (Note that on kernels 5.6 and later, the last parameter of proc_create() is a const struct proc_ops * rather than a file_operations pointer; the code in this section targets the 5.4 LTS kernel.)

Step 3: Clean up the battlefield

Finally, if you don't want to keep these things around, you can use remove_proc_entry():

void remove_proc_entry(const char *name, struct proc_dir_entry *parent);

This API removes the specified /proc/name entry and frees it (if it's not in use); similarly (and usually more conveniently), you can use the remove_proc_subtree() API to delete an entire subtree under /proc at once (typically used during cleanup or error handling).

The Four procfs Files We Will Create

To clearly demonstrate how to use procfs as an interface technology, our kernel module will create a directory under /proc. Within this directory, it will create four procfs (pseudo) files. Note that, by default, all procfs files have owner:group attributes of root:root.

We will create a directory called /proc/procfs_simple_intf and, under it, four (pseudo) files. The following table lists their names and attributes:

procfs file name	Read (R) callback: action on a user-space read	Write (W) callback: action on a user-space write	File permissions
llkdproc_debug_level	Returns the current value of the global variable `debug_level` to user space	Updates the global variable `debug_level` to the value written from user space	0644
llkdproc_show_pgoff	Returns the kernel's `PAGE_OFFSET` value to user space	(no write callback)	0444
llkdproc_show_drvctx	Returns the current contents of the driver "context" structure, `drv_ctx`, to user space	(no write callback)	0440
llkdproc_config1	Returns the current value of the context member `gdrvctx->config1` (also treated as the debug level) to user space	Updates `gdrvctx->config1` to the value written from user space	0644

Later, we will look at the APIs and actual code for creating the procfs_simple_intf directory under /proc and the aforementioned files within it. (Due to space constraints, we won't show all the code, only the parts related to getting and setting the "debug level"; this is fine, as the remaining code is conceptually very similar.)

Trying Out the procfs for Dynamically Controlling debug_level

First, let's look at the "driver context" data structure we'll be using, which we'll rely on throughout this chapter (and actually used in the previous chapter as well):

// ch2/procfs_simple_intf/procfs_simple_intf.c
[ ... ]
/* Borrowed from ch1; the 'driver context' data structure;
 * all relevant 'state info' reg the driver and (fictional) 'device'
 * is maintained here.
 */
struct drv_ctx {
    int tx, rx, err, myword, power;
    u32 config1; /* treated as equivalent to 'debug level' of our driver */
    u32 config2;
    u64 config3;
#define MAXBYTES   128
    char oursecret[MAXBYTES];
};
static struct drv_ctx *gdrvctx;
static int debug_level;    /* 'off' (0) by default ... */

Here we can also see a global integer variable named debug_level; this will provide dynamic control over the project's debugging verbosity. The debug level is assigned a range of [0-2], with the following meanings:

0 means no debug information (default).
1 means medium debug verbosity.
2 means high debug verbosity.

The beauty of this entire design—and the real point of this section—is that we will be able to query and set this debug_level variable from user space via the procfs interface we created! This will allow end users (usually requiring root privileges for security reasons) to dynamically change the debug level at runtime (which is a fairly common feature in many products).

Before diving into the code details, let's do a quick test run to give you an idea of what to expect:

Here we use our custom lkm convenience wrapper script to build and insmod(8) the kernel module (from the ch2/procfs_simple_intf directory in this book's source tree):

$ cd <booksrc>/ch2/procfs_simple_intf
$ ../../lkm procfs_simple_intf          <-- builds the kernel module
Version info:
[...]
[24826.234323] procfs_simple_intf:procfs_simple_intf_init():321: 
proc dir (/proc/procfs_simple_intf) created
[24826.240592] procfs_simple_intf:procfs_simple_intf_init():333: 
proc file 1 (/proc/procfs_simple_intf/llkdproc_debug_level) created
[24826.245072] procfs_simple_intf:procfs_simple_intf_init():348: 
proc file 2 (/proc/procfs_simple_intf/llkdproc_show_pgoff) created
[24826.248628] procfs_simple_intf:alloc_init_drvctx():218: 
allocated and init the driver context structure
[24826.251784] procfs_simple_intf:procfs_simple_intf_init():368: 
proc file 3 (/proc/procfs_simple_intf/llkdproc_show_drvctx) created
[24826.255145] procfs_simple_intf:procfs_simple_intf_init():378: 
proc file 4 (/proc/procfs_simple_intf/llkdproc_config1) created
[24826.259203] procfs_simple_intf initialized
$

Here we built and inserted the kernel module; the dmesg(1) output shows the kernel printks. One of the procfs files we created, llkdproc_debug_level, belongs to the dynamic debug feature. (Since these are pseudo-files, their size shows as 0 bytes.)

Now, let's test it by querying the current value of debug_level:

$ cat /proc/procfs_simple_intf/llkdproc_debug_level
debug_level:0
$

Great, it's 0—the default value, as expected. Now, let's change it to 2:

$ sudo sh -c "echo 2 > /proc/procfs_simple_intf/llkdproc_debug_level"
$ cat /proc/procfs_simple_intf/llkdproc_debug_level
debug_level:2
$

Note that we must execute the echo as root here. Clearly, the debug level did change (to 2)! If we try to write an out-of-range value, it will be caught (and the debug_level variable's value will reset to the last valid value), as shown below:

$ sudo sh -c "echo 5 > /proc/procfs_simple_intf/llkdproc_debug_level"
sh: echo: I/O error
$ dmesg
[...]
[ 6756.415727] procfs_simple_intf: trying to set invalid value for
debug_level [allowed range: 0-2]; resetting to previous (2)

Exactly, the behavior is as expected. However, the question arises: how is this achieved at the code level? Read on!
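
What the shell's echo does here can also be done directly from C, which lets us see the actual error code the driver's write callback returned. The sketch below is a user-space illustration under that assumption; write_value() is a hypothetical helper of ours, not part of the book's repository.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes the string 'val' to the file at 'path' with a single write(2).
 * Returns 0 on success, or a negative errno value (for example -EACCES
 * when permission is denied, or whatever error the driver's write
 * callback returned). */
int write_value(const char *path, const char *val)
{
    int fd, ret = 0;
    size_t len;
    ssize_t n;

    assert(path != NULL && val != NULL);
    len = strlen(val);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;
    n = write(fd, val, len);
    if (n < 0)
        ret = -errno;           /* error reported by the kernel side */
    else if ((size_t)n != len)
        ret = -EIO;             /* short write: treat as an I/O error */
    if (close(fd) < 0 && ret == 0)
        ret = -errno;
    return ret;
}
```

Run as root, write_value("/proc/procfs_simple_intf/llkdproc_debug_level", "2\n") should return 0; passing "5\n" should instead return the negative errno that the driver's write method handed back.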

Dynamically Controlling debug_level via procfs

Let's answer the above question—how is it implemented in the code? It's actually quite simple:

First, in the kernel module's init code, we must create our procfs directory, using the kernel module's name:

static struct proc_dir_entry *gprocdir;
[...]
gprocdir = proc_mkdir(OURMODNAME, NULL);

Also in the init code, create the procfs file that controls the "debug level" item:

// ch2/procfs_simple_intf/procfs_simple_intf.c
[...]
#define PROC_FILE1           "llkdproc_debug_level"
#define PROC_FILE1_PERMS     0644
[...]
static int __init procfs_simple_intf_init(void)
{
    int stat = 0;
    [...]
    /* 1. Create the PROC_FILE1 proc entry under the parent dir OURMODNAME;
     * this will serve as the 'dynamically view/modify debug_level' 
     * (pseudo) file */
    if (!proc_create(PROC_FILE1, PROC_FILE1_PERMS, gprocdir,
                     &fops_rdwr_dbg_level)) {
    [...]
    pr_debug("proc file 1 (/proc/%s/%s) created\n", OURMODNAME, PROC_FILE1);
    [...]

Here we used the proc_create() API to create the procfs file and "hooked" it up with the provided file_operations structure.

The fops structure (technically struct file_operations) is the key data structure here. As we learned in Chapter 1, Writing a Simple misc Character Device Driver, this is where we assign functionality to various file operations on a device, or in this case, a procfs file. Here is our code to initialize the fops:

static const struct file_operations fops_rdwr_dbg_level = {
    .owner = THIS_MODULE,
    .open = myproc_open_dbg_level,
    .read = seq_read,
    .write = myproc_write_debug_level,
    .llseek = seq_lseek,
    .release = single_release,
};

The .open method of the fops points to a function we must define:

static int myproc_open_dbg_level(struct inode *inode, struct file *file)
{
    return single_open(file, proc_show_debug_level, NULL);
}

Using the kernel's single_open() API, we register the fact that whenever this file is read (ultimately via a user-space read(2) system call), procfs will call back our proc_show_debug_level() routine (the second parameter passed to single_open()).

We won't dive deep into the internal implementation of the single_open() API here; if you're curious, you can look it up yourself: fs/seq_file.c:single_open().

There's some historical context here; don't dig too deep, just know that the old way procfs worked was problematic. Specifically, you couldn't transfer more than a page of data (via read or write) unless you manually iterated over the content. The sequence iterator functionality introduced in 2.6.12 fixed these issues. Now, using single_open() together with its counterparts (the built-in seq_read(), seq_lseek(), and single_release() kernel functions) is the simpler and more correct way to use procfs.

So, what happens when user space writes to the proc file (via the write(2) system call)? Simple: in the code above, we registered the myproc_write_debug_level() function as the fops_rdwr_dbg_level.write method, meaning that whenever this (pseudo) file is written to, this function is called back. We'll look at the read callback first and then return to the write callback.

Our read (show) callback function, registered via single_open(), is as follows:

/* Our proc file 1: displays the current value of debug_level */
static int proc_show_debug_level(struct seq_file *seq, void *v)
{
    if (mutex_lock_interruptible(&mtx))
        return -ERESTARTSYS;
    seq_printf(seq, "debug_level:%d\n", debug_level);
    mutex_unlock(&mtx);
    return 0;
}

seq_printf() is conceptually similar to the familiar sprintf() API. It correctly "prints"—that is, writes—the data supplied to it into the seq_file object. When we say "print" here, what we really mean is that it effectively passes the data buffer to the user-space process or thread that initiated the read system call that brought us here, essentially transferring the data to user space.

Oh, by the way, what are those mutex_{un}lock*() APIs for? They are for a very critical task—locking. We'll discuss locks in detail in Chapter 6, Kernel Synchronization—Part 1 and Chapter 7, Kernel Synchronization—Part 2; for now, just understand that these are necessary synchronization primitives.

Our write callback function registered via fops_rdwr_dbg_level.write is as follows:

#define DEBUG_LEVEL_MIN     0
#define DEBUG_LEVEL_MAX     2
[...]
/* proc file 1 : modify the driver's debug_level global variable as
   per what user space writes */
static ssize_t myproc_write_debug_level(struct file *filp,
                const char __user *ubuf, size_t count, loff_t *off)
{
   char buf[12];
   int ret = count, prev_dbglevel;
   [...]
   prev_dbglevel = debug_level;
   // < ... validity checks (not shown here) ... >
   /* Get the user mode buffer content into the kernel (into 'buf') */
   if (copy_from_user(buf, ubuf, count)) {
        ret = -EFAULT;
        goto out;
   }
   [...]
   ret = kstrtoint(buf, 0, &debug_level); /* update it! */
   if (ret)
        goto out;
   if (debug_level < DEBUG_LEVEL_MIN || debug_level > DEBUG_LEVEL_MAX) {
        [...]
        debug_level = prev_dbglevel;
        ret = -EFAULT;
        goto out;
   }
   /* just for fun, let's say that our drv ctx 'config1'
      represents the debug level */
   gdrvctx->config1 = debug_level;
   ret = count;
out:
   mutex_unlock(&mtx);
   return ret;
}

In our write method implementation (notice how similar its structure is to the write method of a character device driver), we perform some validity checks, then copy the data the user-space process wrote to us using the usual copy_from_user() function (recall how we used the echo command to write to the procfs file). Then we use the kernel's built-in kstrtoint() API (one of several similar kstrto*() conversion routines) to convert the string buffer into an integer, storing the result in our global variable, debug_level. After a range check, if everything is fine, we also (purely as an example) set the driver context's config1 member to the same value, and then return the number of bytes written (count) to signal success.
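
The parse-then-validate logic is easy to experiment with in user space. The following sketch mirrors it with strtol(3) standing in for kstrtoint(); it is an approximation under that assumption, not the driver's code, and parse_debug_level() is a name we made up. Unlike the driver, it parses into a temporary and only then updates the caller's variable, and it returns -EINVAL for bad input, which is our choice for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEBUG_LEVEL_MIN 0
#define DEBUG_LEVEL_MAX 2

/* User-space analogue of the driver's write path: parse 'buf' as an
 * integer and update *level only when the value is within range.
 * Returns 0 on success or -EINVAL; on failure *level is left unchanged. */
int parse_debug_level(const char *buf, int *level)
{
    char *end;
    long val;

    assert(buf != NULL && level != NULL);
    errno = 0;
    val = strtol(buf, &end, 0);   /* base 0, as in kstrtoint(buf, 0, ...) */
    if (errno != 0 || end == buf)
        return -EINVAL;           /* overflow, or no digits at all */
    if (*end == '\n')
        end++;                    /* tolerate the newline that echo appends */
    if (*end != '\0')
        return -EINVAL;           /* trailing garbage */
    if (val < DEBUG_LEVEL_MIN || val > DEBUG_LEVEL_MAX)
        return -EINVAL;           /* outside the allowed range */
    *level = (int)val;
    return 0;
}
```

Parsing into a temporary means the stored level never holds an out-of-range value, even briefly, so there is nothing to roll back on failure.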

The rest of the kernel module code is very similar: we need to set up the functionality for the remaining three procfs files. We leave this to you to review in detail and try out.

One more quick demo: let's set debug_level to 1, and then dump the driver context structure (via the third procfs file we created):

$ cat /proc/procfs_simple_intf/llkdproc_debug_level
debug_level:0
$ sudo sh -c "echo 1 > /proc/procfs_simple_intf/llkdproc_debug_level"

Alright, now the value of the debug_level variable should be 1. Now let's dump the driver context structure:

$ cat /proc/procfs_simple_intf/llkdproc_show_drvctx
cat: /proc/procfs_simple_intf/llkdproc_show_drvctx: Permission denied
$ sudo cat /proc/procfs_simple_intf/llkdproc_show_drvctx
prodname:procfs_simple_intf
tx:0,rx:0,err:0,myword:0,power:1
config1:0x1,config2:0x48554a5f,config3:0x424c0a52
oursecret:AhA xxx
$

We need root privileges to do this. Once done, we can clearly see all the members of the drv_ctx data structure. Not only that, but we also verified that the config1 member now has a value of 1 (config1:0x1 in the output above), thereby reflecting the "debug level" as designed.

Also note that the output here is intentionally generated in a highly parseable format for user space, almost in a JSON style. Of course, as a small exercise, you could try making it standard JSON.
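
To show why this format is convenient for user space, here is a minimal C sketch that pulls one numeric field out of a line of comma-separated key:value pairs like the ones above. The helper lookup_field() is hypothetical (our own name), and the sketch assumes numeric values only, so a string field such as oursecret is reported as not found.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Looks up 'key' in a line of comma-separated key:value pairs such as
 * "tx:0,rx:0,err:0,myword:0,power:1" and stores its numeric value in *out.
 * Returns 0 if the key was found with a numeric value, -1 otherwise. */
int lookup_field(const char *line, const char *key, long *out)
{
    size_t klen;
    const char *p;

    assert(line != NULL && key != NULL && out != NULL);
    klen = strlen(key);
    p = line;
    while (*p != '\0') {
        /* 'p' is at the start of a pair; does it begin with "key:"? */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
            const char *start = p + klen + 1;
            char *end;
            long val = strtol(start, &end, 0);  /* base 0 accepts 0x... */

            if (end == start)
                return -1;      /* no digits after the colon */
            *out = val;
            return 0;
        }
        /* Skip ahead to the character after the next comma, if any. */
        p = strchr(p, ',');
        if (p == NULL)
            break;
        p++;
    }
    return -1;
}
```

With the output shown earlier, lookup_field("tx:0,rx:0,err:0,myword:0,power:1", "power", &val) sets val to 1.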

Many recent IoT products use RESTful APIs for communication, and the data they exchange is usually formatted as JSON. It pays to get into the habit of designing kernel-to-user (and user-to-kernel) communication around easily parseable formats like JSON.

With this, you've learned how to create a procfs directory, files within it, and most importantly, how to create and use read/write callback functions so that when user-mode processes read or write your proc files, you can respond appropriately from deep within the kernel. As mentioned earlier, due to space constraints, we won't describe the code for creating and using the remaining three procfs files. This is conceptually very similar to what we just covered. We expect you to read through the code and try it out yourself!

Miscellaneous procfs APIs

Let's wrap up this section with a few remaining miscellaneous procfs APIs. You can use the proc_symlink() function to create a symbolic or soft link under /proc.

Next, the proc_create_single_data() API can be quite useful; it's a "shortcut" that you can use when you only need to attach a "read" method to a procfs file:

struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
        struct proc_dir_entry *parent, 
        int (*show)(struct seq_file *, void *),
        void *data);

Using this API eliminates the need for a separate fops data structure. We could have used this function to create and handle our second procfs file—the llkdproc_show_pgoff file:

... proc_create_single_data(PROC_FILE2, PROC_FILE2_PERMS, gprocdir,
        proc_show_pgoff, NULL) ...

When read from user space, the kernel's VFS and proc layer code path will invoke the method we registered in our module—which is the proc_show_pgoff() function—inside which we simply call seq_printf() to send the value of PAGE_OFFSET to user space:

seq_printf(seq, "%s:PAGE_OFFSET:0x%px\n", OURMODNAME, (void *)PAGE_OFFSET);

Note: Regarding the proc_create_single_data API:

You can leverage the fifth parameter to pass an arbitrary data item to the read callback (where it can be retrieved via the seq_file member private, very much like how we used filp->private_data in the previous chapter).

Several (usually older) drivers in the mainline kernel do indeed use this function to create their procfs interfaces. These include the RTC driver (which sets up an entry under /proc/driver/rtc). The SCSI megaraid driver (drivers/scsi/megaraid) uses this routine no less than 10 times when setting up its proc interface (when a certain config option is enabled; it's enabled by default).

Careful! I found that on Ubuntu 18.04 LTS systems running the distribution (default) kernel, this API—proc_create_single_data()—isn't even available, so the build will fail. On our custom "vanilla" 5.4 LTS kernel, it works perfectly fine.

Additionally, there is indeed some documentation for the procfs APIs we mentioned here, though this documentation is often intended for internal use rather than for modules: https://www.kernel.org/doc/html/latest/filesystems/api-summary.html#the-proc-filesystem.

So, as we mentioned earlier, with procfs APIs, your mileage may vary (YMMV)! Test your code thoroughly before release. The best practice is probably to follow the kernel community guidelines and just say no to procfs as a driver interface technology. Don't worry—we'll cover better alternatives later in this chapter!

This completes our introduction to using procfs as a useful communication interface. Now, let's move on to the one more suitable for drivers—the sysfs interface.

2.4 Via the sysfs Interface

A key feature of the 2.6 Linux kernel release was the advent of the so-called "Modern Device Model." Essentially, a complex set of tree-structured hierarchical data structures models all devices on the system. In reality, it goes far beyond that; the sysfs tree contains the following (among other things):

Every bus on the system (which can be virtual or pseudo-buses)
Every device on every bus
Every device driver bound to a device on a bus

Therefore, not just peripheral devices, but the underlying system buses, the devices on those buses, and the device drivers bound (or about to be bound) to the devices, are all created at runtime and maintained by the device model. As a typical driver author, these internal mechanisms are invisible to you; you don't really need to worry about them. At system boot, and whenever a new device becomes visible, the driver core (part of the built-in kernel machinery) generates the required pseudo-files under the sysfs tree. (Conversely, when a device is removed or unplugged, its entries disappear from the tree).

Looking back at the "Via the procfs Interface" section, using procfs for device driver interface purposes isn't really the right approach, at least for code intended for mainline. So, what is the right approach? Ah, creating sysfs (pseudo) files is considered the "correct way" for device drivers to interface with user space.

sysfs is a virtual filesystem typically mounted under the /sys directory. Like procfs, it is a tree of information (about devices and more) that the kernel exports to user space.

You can think of sysfs as a "real-time medical report" for your devices—like the kind a doctor would give you.

But the analogy breaks down in one respect: a real medical report doesn't change once it's printed, whereas sysfs is dynamic, with the kernel updating the data at any time. Moreover, this report is "charged per item": each file should expose exactly one value. This one-value-per-file rule is more than a loose habit; it is a firm convention of the sysfs interface.
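
One practical consequence of the one-value-per-file rule is that user-space code to read a sysfs attribute is tiny: read one line and you have the whole value. A minimal sketch follows; read_attr() is a hypothetical helper of ours, and the example path mentioned afterward is only an assumption about what exists on your system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the single value held by a sysfs-style attribute file at 'path'
 * into 'buf' (of size 'len'), stripping the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_attr(const char *path, char *buf, size_t len)
{
    FILE *fp;
    char *ok;

    assert(path != NULL && buf != NULL && len > 0);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    ok = fgets(buf, (int)len, fp);  /* one line is the whole attribute */
    fclose(fp);
    if (!ok)
        return -1;
    buf[strcspn(buf, "\n")] = '\0'; /* drop the newline, if present */
    return 0;
}
```

For example, on a system with a loopback interface, read_attr("/sys/class/net/lo/mtu", buf, sizeof(buf)) should leave the interface's MTU, as text, in buf.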

The following screenshot shows the contents of /sys, clearly illustrating this point:

(Screenshot showing the contents of the /sys directory goes here)

Creating sysfs (Pseudo) Files in Code

One way to create pseudo (or virtual) files under sysfs is via the device_create_file() API. Its signature is as follows:

/* drivers/base/core.c */
int device_create_file(struct device *dev,
                       const struct device_attribute *attr);

Let's look at its two parameters one by one; first, there is a pointer to a struct device. The second parameter is a pointer to a device attribute structure; we will explain and manipulate it shortly (in the "Setting up device attributes and creating sysfs files" section). For now, let's focus only on the first parameter—the device structure.

This looks quite intuitive—a device is represented by a metadata structure called device (it's part of the driver core; you can find its full definition in the include/linux/device.h header file).

Back to the medical report analogy: To add a line to your report, you first need to confirm which "department" you are listed under.

Note that when you write (or work with) a "real" device driver, it's highly likely that a generic device structure already exists or is about to be created. This usually happens when the device is registered; the underlying device structure typically exists as a member of a device-specific structure. For example, structures such as platform_device, pci_dev, net_device, usb_device, i2c_client, and serial_port all have an embedded struct device member. Therefore, you can pass a pointer to that embedded device structure to the API to create files under sysfs. Rest assured, you'll see this done in code shortly! So, let's get our hands on a device structure by creating a simple "platform device." You'll learn how to do this in the next section!
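
The idea of an embedded struct device can be sketched in plain user-space C. The types below (fake_device, fake_platform_device) are hypothetical stand-ins, not the kernel's real definitions, and the container_of() macro is a simplified version of the kernel macro of the same name, which the text above does not cover. The sketch shows both directions: taking the address of the embedded member to pass to a generic API, and getting back from that member to the enclosing device-specific structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's types, for illustration only. */
struct fake_device {
    const char *init_name;
};

struct fake_platform_device {
    const char *name;
    int id;
    struct fake_device dev;     /* the embedded generic device */
};

/* Simplified version of the kernel's container_of(): given a pointer to a
 * member, step back by the member's offset to reach the enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* From the embedded generic device, recover the device-specific structure. */
struct fake_platform_device *to_fake_pdev(struct fake_device *dev)
{
    assert(dev != NULL);
    return container_of(dev, struct fake_platform_device, dev);
}
```

Given a struct fake_platform_device pdev, the expression &pdev.dev is what you would hand to a generic API, and to_fake_pdev(&pdev.dev) returns &pdev again.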

Creating a Simple Platform Device

Obviously, to create a (pseudo) file under sysfs, we need some way to get a pointer to a struct device to use as the first parameter for device_create_file(). However, for our demo sysfs driver here, we don't actually have any real device, and thus no struct device available!

So, can't we create an artificial or pseudo-device and use it? We can, but how, and more importantly, why do we have to? It's crucial to understand this: the modern Linux Device Model (LDM) is built upon three key components: an underlying bus must exist, devices live on the bus, and devices are "bound to" and driven by a device driver. (We mentioned this in the "A Brief Note on the Linux Device Model" section of Chapter 1, Writing a Simple misc Character Device Driver.)

All of these must be registered with the driver core. Now, don't worry about the bus and the bus driver that drives the bus; they will be internally registered and handled by the kernel's driver core subsystem. However, when there is no real device, we will have to create a pseudo-device in order to work within this model. Again, there are several ways to do this, but we will create a platform device. This device will "live" on a pseudo-bus (that is, a bus that only exists in software) known as the platform bus.

Platform Devices

A simple but important aside: platform devices are commonly used to represent various devices within a SoC (System on Chip) on embedded boards. A SoC is typically a very complex chip that integrates various components into the silicon. Besides the processing units (CPU/GPU), it may host several peripherals, including Ethernet MACs, USB, multimedia, serial UARTs, clocks, I2C, SPI, flash chip controllers, and more. The reason we need to enumerate these components as platform devices is that there are no physical buses inside the SoC; hence, the platform bus is used.

In the past, the code used to instantiate these SoC platform devices was kept in "board" files (or several files) within the kernel source (arch/<arch>/...). Because it became too bloated, it was moved out of the pure kernel source and into a useful hardware description format called the Device Tree (in Device Tree Source files, i.e., DTS files, which ship with the kernel source tree).

On our Ubuntu 18.04 LTS guest virtual machine, let's look at the platform devices under sysfs:

$ ls /sys/devices/platform/
alarmtimer  'Fixed MDIO bus.0'   intel_pmc_core.0   platform-framebuffer.0
reg-dummy
serial8250 eisa.0  i8042  pcspkr power rtc_cmos uevent
$

The Bootlin website (formerly Free Electrons) provides superb material on embedded Linux, drivers, and more. This link on their site provides excellent material on the LDM: https://bootlin.com/pub/conferences/2019/elce/opdenacker-kernel-programming-device-model/.

Back to the driver: we bring our (artificial) platform device into existence by registering it with the (already existing) platform bus driver via the platform_device_register_simple() API. Once we do this, the driver core generates the necessary sysfs directory and some boilerplate sysfs entries (or files). Here, in the init code of our sysfs demo driver, we set up a platform device, keeping it as simple as possible, by registering it with the driver core:

// ch2/sysfs_simple_intf/sysfs_simple_intf.c
#include <linux/platform_device.h>
static struct platform_device *sysfs_demo_platdev;
[...]
#define PLAT_NAME    "llkd_sysfs_simple_intf_device"
/* id of -1: a single instance, so the name gets no ".N" suffix;
 * NULL, 0: we pass no resources */
sysfs_demo_platdev =
     platform_device_register_simple(PLAT_NAME, -1, NULL, 0);
[...]

The platform_device_register_simple() API returns a pointer to a struct platform_device. One of the members of this structure is struct device dev. We now have what we were looking for: a device structure. Also note that the effect of this registration is immediately visible within sysfs: the new platform device appears there, along with some boilerplate sysfs objects created by the driver core. Let's compile and insmod our kernel module to see:

$ cd <...>/ch2/sysfs_simple_intf
$ make && sudo insmod ./sysfs_simple_intf.ko
[...]
$ ls -l /sys/devices/platform/llkd_sysfs_simple_intf_device/
total 0
-rw-r--r-- 1 root root 4.0K Feb 15 20:22 driver_override
-rw-r--r-- 1 root root 4.0K Feb 15 20:22 llkdsysfs_debug_level
-r--r--r-- 1 root root 4.0K Feb 15 20:22 llkdsysfs_pgoff
-r--r--r-- 1 root root 4.0K Feb 15 20:22 llkdsysfs_pressure
-r--r--r-- 1 root root 4.0K Feb 15 20:22 modalias
drwxr-xr-x 2 root root 0 Feb 15 20:22 power/
lrwxrwxrwx 1 root root 0 Feb 15 20:22 subsystem -> ../../../bus/platform/
-rw-r--r-- 1 root root 4.0K Feb 15 20:21 uevent
$

There are different ways to create a struct device; the generic one is to call the device_create() API. Alternatively, you can create sysfs files without needing a device structure at all, by creating a "kobject" and calling the sysfs_create_file() API. (Links to tutorials using both methods can be found in the "Further reading" section.) Here, we prefer the "platform device" approach because it's closer to how actual drivers are written.

There is another valid approach. In Chapter 1, Writing a Simple misc Character Device Driver, we built a simple character driver that conformed to the kernel's misc framework. There, we instantiated a struct miscdevice; once it is registered (via the misc_register() API), its this_device member (of type struct device *) gives us a valid device pointer! Therefore, we could have simply extended our previous misc device driver and used it here. However, in order to learn a bit about platform drivers, we chose the platform device approach instead. (We leave extending the previous misc device driver so that it creates and uses sysfs files as an exercise for you.)

Back to our driver: in the cleanup code, mirroring what the init code did, we must unregister our platform device:

platform_device_unregister(sysfs_demo_platdev);

Now, let's put all this knowledge together and look at the code that actually generates the sysfs files, along with their read and write callback functions!

Putting It All Together—Setting Up Device Attributes and Creating sysfs Files

As we mentioned at the beginning of this section, the device_create_file() API is the one we'll use to create sysfs files:

int device_create_file(struct device *dev, const struct device_attribute *attr);

In the previous section, you learned how to obtain the device structure (the first parameter of our API). Now, let's figure out how to initialize and use the second parameter; that is, the device_attribute structure. This structure is defined as follows:

// include/linux/device.h
struct device_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device *dev, struct device_attribute *attr, 
                    char *buf);
    ssize_t (*store)(struct device *dev, struct device_attribute *attr, 
                     const char *buf, size_t count);
};

The first member, attr, essentially consists of the sysfs file's name and its mode (permission bitmask). The other two members are function pointers ("virtual functions," similar to those in the file operations or fops structure):

show: Represents the read callback function
store: Represents the write callback function
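To show how the two callbacks cooperate, here is a user-space model of the structure and a show/store pair. Everything in it is a mock-up under stated assumptions: the structures are simplified copies, MOCK_PAGE_SIZE stands in for the page-sized buffer that sysfs hands to show, the 0 to 2 validity range is arbitrary, and strtol() stands in for the kernel's own string-to-integer helpers. It is a sketch of the pattern, not the demo driver's actual code.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>      /* ssize_t */

#define MOCK_PAGE_SIZE 4096 /* assumption: stand-in for the sysfs buffer size */

/* Simplified stand-ins for the kernel structures (NOT the real definitions) */
struct device { int unused; };
struct attribute { const char *name; unsigned short mode; };
struct device_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device *dev, struct device_attribute *attr,
                    char *buf);
    ssize_t (*store)(struct device *dev, struct device_attribute *attr,
                     const char *buf, size_t count);
};

static int debug_level;     /* the single value this mock "file" exposes */

/* Read callback: format exactly one value into buf, return bytes written */
static ssize_t debug_level_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
{
    (void)dev; (void)attr;
    return snprintf(buf, MOCK_PAGE_SIZE, "%d\n", debug_level);
}

/* Write callback: validate the input, update the value, return bytes consumed */
static ssize_t debug_level_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
    char *end;
    long val;

    (void)dev; (void)attr;
    errno = 0;
    val = strtol(buf, &end, 0);     /* buf must be a NUL-terminated string */
    if (errno != 0 || end == buf || val < 0 || val > 2)
        return -EINVAL;             /* reject junk and out-of-range input */
    debug_level = (int)val;
    return (ssize_t)count;
}

/* Manual initialization of the attribute: name, mode and both callbacks */
struct device_attribute mock_attr_debug_level = {
    .attr  = { .name = "debug_level", .mode = 0644 },
    .show  = debug_level_show,
    .store = debug_level_store,
};
```

Note the return-value convention the sketch follows: show returns the number of bytes it placed in the buffer, while store returns the number of bytes it consumed on success or a negative errno value on failure.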

Our task is to initialize this device_attribute structure to set up the sysfs file. While you can always initialize it manually, there's an easier path: the kernel provides (several) macros to initialize struct device_attribute; one of them is the DEVICE_ATTR() macro:

// include/linux/device.h
#define DEVICE_ATTR(_name, _mode, _show, _store) \
   struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)

Note the token-pasting operation performed by dev_attr_##_name: the ## preprocessor operator concatenates tokens, ensuring that the name of the structure variable becomes dev_attr_<name>, where <name> is the first parameter passed to DEVICE_ATTR. The actual workhorse macro, __ATTR(), supplies the initializer for this device_attribute structure in the preprocessed code; it is here that stringification comes into play, turning the same name into the string that becomes the sysfs file's name (via __stringify()):

// include/linux/sysfs.h
#define __ATTR(_name, _mode, _show, _store) { \
    .attr = {.name = __stringify(_name), \
    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
    .show = _show, \
    .store = _store, \
}
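The two preprocessor operators involved are easy to confuse, so here is a small user-space sketch that separates them: ## pastes tokens to form an identifier, while # turns a token into a string literal. The MOCK_ macros and struct mock_attribute are hypothetical, simplified imitations written for this illustration; they are not the kernel's definitions.

```c
/* Two-level stringification, the same trick a __stringify()-style macro uses */
#define MOCK_STRINGIFY_1(x) #x
#define MOCK_STRINGIFY(x)   MOCK_STRINGIFY_1(x)

/* Simplified stand-in for an attribute: just a file name and a mode */
struct mock_attribute {
    const char *name;
    unsigned short mode;
};

/*
 * '##' pastes tokens: it forms the variable name mock_attr_<name>.
 * '#' (inside MOCK_STRINGIFY) stringifies: it forms the string "<name>".
 */
#define MOCK_ATTR(_name, _mode)                         \
    struct mock_attribute mock_attr_##_name = {         \
        .name = MOCK_STRINGIFY(_name),                  \
        .mode = (_mode),                                \
    }

/*
 * Expands to:
 *   struct mock_attribute mock_attr_pressure = {
 *       .name = "pressure", .mode = (0444),
 *   };
 */
MOCK_ATTR(pressure, 0444);

/* The pasted identifier is an ordinary variable that code can refer to */
const char *mock_pressure_attr_name(void)
{
    return mock_attr_pressure.name;
}
```

The same parameter therefore serves two purposes: it names the C variable that the driver later passes by address, and it names the file that appears in sysfs.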

Additionally, the kernel defines extra simple wrapper macros on top of these to specify the mode (permissions for
