Tour of the Linux kernel source

The HyperNews Linux KHG Discussion Pages

Tour of the Linux kernel source

By Alessandro Rubini, rubini@pop.systemy.it

This chapter tries to explain the Linux source code in an orderly manner, trying to help the reader to achieve a good understanding of how the source code is laid out and how the most relevant unix features are implemented. The target is to help the experienced C programmer who is not accustomed to Linux in getting familiar with the overall Linux design. That's why the chosen entry point for the kernel tour is the kernel own entry point: system boot.

A good understanding of C language is required to understand this material, as well as some familiarity with both Unix concepts and the PC architecture. However, no C code will appear in this chapter, but rather pointers to the actual code. The finest issues of kernel design are explained in other chapters of this guide, while this chapter tends to remain an informal overview.

Any pathname for files referenced in this chapter is referred to the main source-tree directory, usually /usr/src/linux.

[NEW] Most of the information reported here is taken from the source code of Linux release 1.0. Nonetheless, references to later versions are provided at times. Any paragraph within the tour with the [NEW] image in front of it is meant to underline changes the kernel has undergone after the 1.0 release. If no such paragraph is present, then no changes occurred up to release 1.0.9-1.1.76.

[MORE] Sometimes a paragraph like this occurs in the text. It is a pointer to the right sources to get more information on the subject just covered. Needless to say, the source is the primary source.

Booting the system

When the PC is powered up, the 80x86 processor finds itself in real mode and executes the code at address 0xFFFF0, which corresponds to a ROM-BIOS address. The PC BIOS performs some tests on the system and initializes the interrupt vector at physical address 0. After that it loads the first sector of a bootable device to 0x7C00, and jumps to it. The device is usually the floppy or the hard drive. The preceding description is quite a simplified one, but it's all that's needed to understand the kernel initial workings.

The very first part of the Linux kernel is written in 8086 assembly language (boot/bootsect.S). When run, it moves itself to absolute address 0x90000, loads the next 2 kBytes of code from the boot device to address 0x90200, and the rest of the kernel to address 0x10000. The message ``Loading...'' is displayed during system load. Control is then passed to the code in boot/Setup.S, another real-mode assembly source.

The setup portion identifies some features of the host system and the type of vga board. If requested to, it asks the user to choose the video mode for the console. It then moves the whole system from address 0x10000 to address 0x1000, enters protected mode and jumps to the rest of the system (at 0x1000).

The next step is kernel decompression. The code at 0x1000 comes from zBoot/head.S which initializes registers and invokes decompress_kernel(), which in turn is made up of zBoot/inflate.c, zBoot/unzip.c and zBoot/misc.c. The decompressed data goes to address 0x100000 (1 Meg), and this is the main reason why Linux can't run with less than 2 megs ram. [It's been done in 1 MB with uncompressed kernels; see Memory Savers--ED]

[MORE] Encapsulation of the kernel in a gzip file is accomplished by Makefile and utilities in the zBoot directory. They are interesting files to look at.

[NEW] Kernel release 1.1.75 moved the boot and zBoot directories down to arch/i386/boot. This change is meant to allow true kernel builds for different architectures. Nonetheless, I'll stick to i386-specific information.

Decompressed code is executed at address 0x1010000 [Maybe I've lost track of physical addresses, here, as I don't know very well gas source code], where all the 32-bit setup is accomplished: IDT, GDT and LDT are loaded, the processor and coprocessor are identified, and paging is setup; eventually, the routine start_kernel is invoked. The source for the above operations is in boot/head.S. It is probably the trickiest code in the whole kernel.

Note that if an error occurs during any of the preceding steps, the computer will lockup. The OS can't deal with errors when it isn't yet fully operative.

start_kernel() resides in init/main.c, and never returns. Anything from now on is coded in C language, left aside interrupt management and system call enter/leave (well, most of the macros embed assembly code, too).

Spinning the wheel

After dealing with all the tricky questions, start_kernel() initializes all the parts of the kernel, specifically:

Sets the memory bounds and calls paging_init().
Initializes the traps, IRQ channels and scheduling.
Parses the command line.
If requested to, allocates a profiling buffer.
Initializes all the device drivers and disk buffering, as well as other minor parts.
Calibrates the delay loop (computes the ``BogoMips'' number).
Checks if interrupt 16 works with the coprocessor.

Finally, the kernel is ready to move_to_user_mode(), in order to fork the init process, whose code is in the same source file. Process number 0 then, the so-called idle task, keeps running in an infinite idle loop.

The init process tries to execute /etc/init, or /bin/init, or /sbin/init.

If none of them succeeds, code is provided to execute ``/bin/sh /etc/rc'' and fork a root shell on the first terminal. This code dates back to Linux 0.01, when the OS was made by the kernel alone, and no login process was available.

After exec()ing the init program from one of the standard places (let's assume we have one of them), the kernel has no direct control on the program flow. Its role, from now on is to provide processes with system calls, as well as servicing asynchronous events (such as hardware interrupts). Multitasking has been setup, and it is now init which manages multiuser access by fork()ing system daemons and login processes.

Being the kernel in charge of providing services, the tour will proceed by looking at those services (the ``system calls''), as well as by providing general ideas about the underlying data structures and code organization.

How the kernel sees a process

From the kernel point of view, a process is an entry in the process table. Nothing more.

The process table, then, is one of the most important data structures within the system, together with the memory-management tables and the buffer cache. The individual item in the process table is the task_struct structure, quite a huge one, defined in include/linux/sched.h. Within the task_struct both low-level and high-level information is kept--ranging from the copy of some hardware registers to the inode of the working directory for the process.

The process table is both an array and a double-linked list, as well as a tree. The physical implementation is a static array of pointers, whose length is NR_TASKS, a constant defined in include/linux/tasks.h, and each structure resides in a reserved memory page. The list structure is achieved through the pointers next_task and prev_task, while the tree structure is quite complex and will not be described here. You may wish to change NR_TASKS from the default vaue of 128, but be sure to have proper dependency files to force recompilation of all the source files involved.

After booting is over, the kernel is always working on behalf of one of the processes, and the global variable current, a pointer to a task_struct item, is used to record the running one. current is only changed by the scheduler, in kernel/sched.c. When, however, all procecces must be looked at, the macro for_each_task is used. It is conderably faster than a sequential scan of the array, when the system is lightly loaded.

A process is always running in either ``user mode'' or ``kernel mode''. The main body of a user program is executed in user mode and system calls are executed in kernel mode. The stack used by the process in the two execution modes is different--a conventional stack segment is used for user mode, while a fixed-size stack (one page, owned by the process) is used in kernel mode. The kernel stack page is never swapped out, because it must be available whenever a system call is entered.

System calls, within the kernel, exist as C language functions, their `official' name being prefixed by `sys_'. A system call named, for example, burnout invokes the kernel function sys_burnout().

[MORE] The system call mechanism is described in chapter 3 of this guide. Looking at for_each_task and SET_LINKS, in include/linux/sched.h can help understanding the list and tree structures in the process table.

Creating and destroying processes

A unix system creates a process though the fork() system call, and process termination is performed either by exit() or by receiving a signal. The Linux implementation for them resides in kernel/fork.c and kernel/exit.c.

Forking is easy, and fork.c is short and ready understandable. Its main task is filling the data structure for the new process. Relevant steps, apart from filling fields, are:

getting a free page to hold the task_struct
finding an empty process slot (find_empty_process())
getting another free page for the kernel_stack_page
copying the father's LDT to the child
duplicating mmap information of the father

sys_fork() also manages file descriptors and inodes.

[NEW] The 1.0 kernel offers some vestigial support to threading, and the fork() system call shows some hints to that. Kernel threads is work-in-progress outside the mainstream kernel.

Exiting from a process is trickier, because the parent process must be notified about any child who exits. Moreover, a process can exit by being kill()ed by another process (these are Unix features). The file exit.c is therefore the home of sys_kill() and the vairious flavours of sys_wait(), in addition to sys_exit().

The code belonging to exit.c is not described here--it is not that interesting. It deals with a lot of details in order to leave the system in a consistent state. The POSIX standard, then, is quite demanding about signals, and it must be dealt with.

Executing programs

After fork()ing, two copies of the same program are running. One of them usually exec()s another program. The exec() system call must locate the binary image of the executable file, load and run it. The word `load' doesn't necessarily mean ``copy in memory the binary image'', as Linux supports demand loading.

The Linux implementation of exec() supports different binary formats. This is accomplished through the linux_binfmt structure, which embeds two pointers to functions--one to load the executable and the other to load the library, each binary format representing both the executable and the library. Loading of shared libraries is implemented in the same source file as exec() is, but let's stick to exec() itself.

The Unix systems provide the programmer with six flavours of the exec() function. All but one of them can be implemented as library functions, and theLinux kernel implements sys_execve() alone. It performs quite a simple task: loading the head of the executable, and trying to execute it. If the first two bytes are ``#!'', then the first line is parsed and an interpreter is invoked, otherwise the registered binary formats are sequentially tried.

The native Linux format is supported directly within fs/exec.c, and the relevant functions are load_aout_binary and load_aout_library. As for the binaries, the function loading an ``a.out'' executable ends up either in mmap()ing the disk file, or in calling read_exec(). The former way uses the Linux demand loading mechanism to fault-in program pages when they're accessed, while the latter way is used when memory mapping is not supported by the host filesystem (for example the ``msdos'' filesystem).

[NEW] Late 1.1 kernels embed a revised msdos filesystem, which supports mmap(). Moreover, the struct linux_binfmt is a linked list rather than an array, to allow loading a new binary format as a kernel module. Finally, the structure itself has been extended to access format-related core-dump routines.

Accessing filesystems

It is well known that the filesystem is the most basic resource in a Unix system, so basic and ubiquitous that it needs a more handy name--I'll stick to the standard practice of calling it simply ``fs''.

I'll assume the reader already knows the basic Unix fs ideas--access permissions, inodes, the superblock, mounting and umounting. Those concepts are well explained by smarter authors than me within the standard Unix literature, so I won't duplicate their efforts and I'll stick to Linux specific issues.

While the first Unices used to support a single fs type, whose structure was widespread in the whole kernel, today's practice is to use a standardized interface between the kernel and the fs, in order to ease data interchange across architectures. Linux itself provides a standardized layer to pass information between the kernel and each fs module. This interface layer is called VFS, for ``virtual filesystem''.

Filesystem code is therefore split into two layers: the upper layer is concerned with the management of kernel tables and data structures, while the lower layer is made up of the set of fs-dependent functions, and is invoked through the VFS data structures. All the fs-independent material resides in the fs/*.c files. They address the following issues:

Managing the buffer chache (buffer.c);
Responding to the fcntl() and ioctl() system calls (fcntl.c and ioctl.c);
Mapping pipes and fifos on inodes and buffers (fifo.c, pipe.c);
Managing file- and inode-tables (file_table.c, inode.c);
Locking and unlocking files and records (locks.c);
Mapping names to inodes (namei.c, open.c);
Implementing the tricky select() function (select.c);
Providing information (stat.c);
mounting and umounting filesystems (super.c);
exec()ing executables and dumping cores (exec.c);
Loading the various binary formats (bin_fmt*.c, as outlined above).

The VFS interface, then, consists of a set of relatively high-level operations which are invoked from the fs-independent code and are actually performed by each filesystem type. The most relevant structures are inode_operations and file_operations, though they're not alone: other structures exist as well. All of them are defined within include/linux/fs.h.

The kernel entry point to the actual file system is the structure file_system_type. An array of file_system_types is embodied within fs/filesystems.c and it is referenced whenever a mount is issued. The function read_super for the relevant fs type is then in charge of filling a struct super_block item, which in turn embeds a struct super_operations and a struct type_sb_info. The former provides pointers to generic fs operations for the current fs-type, the latter embeds specific information for the fs-type.

[NEW] The array of filesystem types has been turned in a linked list, to allow loading new fs types as kernel modules. The function (un-)register_filesystem is coded within fs/super.c.

Quick Anatomy of a Filesystem Type

The role of a filesystem type is to perform the low-level tasks used to map the relatively high level VFS operations on the physical media (disks, network or whatever). The VFS interface is flexible enough to allow support for both conventional Unix filesystems and exotic situations such as the msdos and umsdos types.

Each fs-type is made up of the following items, in addition to its own directory:

An entry in the file_systems[] array (fs/filesystems.c);
The superblock include file (include/linux/type_fs_sb.h);
The inode include file (include/linux/type_fs_i.h);
The generic own include file (include/linux/type_fs.h});
Two #include lines within include/linux/fs.h, as well as the entries in struct super_block and struct inode.

The own directory for the fs type contains all the real code, responsible of inode and data management.

[MORE] The chapter about procfs in this guide uncovers all the details about low-level code and VFS interface for that fs type. Source code in fs/procfs is quite understandable after reading the chapter.

We'll now look at the internal workings of the VFS mechanism, and the minix filesystem source is used as a working example. I chose the minix type because it is small but complete; moreover, any other fs type in Linux derives from the minix one. The ext2 type, the de-facto standard in recent Linux installations, is much more complex than that and its exploration is left as an exercise for the smart reader.

When a minix-fs is mounted, minix_read_super fills the super_block structure with data read from the mounted device. The s_op field of the structure will then hold a pointer to minix_sops, which is used by the generic filesystem code to dispatch superblock operations.

Chaining the newly mounted fs in the global system tree relies on the following data items (assuming sb is the super_block structure and dir_i points to the inode for the mount point):

sb->s_mounted points to the root-dir inode of the mounted filesystem (MINIX_ROOT_INO);
dir_i->i_mount holds sb->s_mounted;
sb->s_covered holds dir_i

Umounting will eventually be performed by do_umount, which in turn invokes minix_put_super.

Whenever a file is accessed, minix_read_inode comes into play; it fills the system-wide inode structure with fields coming form minix_inode. The inode->i_op field is filled according to inode->i_mode and it is responsible for any further operation on the file. The source for the minix functions just described are to be found in fs/minix/inode.c.

The inode_operations structure is used to dispatch inode operations (you guessed it) to the fs-type specific kernel functions; the first entry in the structure is a pointer to a file_operations item, which is the data-management equivalent of i_op. The minix fs-type allows three instances of inode-operation sets (for direcotries, for files and for symbolic links) and two instances of file-operation sets (symlinks don't need one).

Directory operations (minix_readdir alone) are to be found in fs/minix/dir.c; file operations (read and write) appear within fs/minix/file.c and symlink operations (reading and following the link) in fs/minix/symlink.c.

The rest of the minix directory implements the following tasks:

bitmap.c manages allocation and freeing of inodes and blocks (the ext2 fs, otherwise, has two different source files);
fsynk.c is responsible for the fsync() system calls--it manages direct, indirect and double indirect blocks (I assume you know about them, it's common Unix knowledge);
namei.c embeds all the name-related inode operations, such as creating and destroying nodes, renaming and linking;
truncate.c performs truncation of files.

The console driver

Being the main I/O device on most Linux boxes, the console driver deserves some attention. The source code related to the console, as well as the other character drivers, is to be found in drivers/char, and we'll use this very directory as our referenece point when naming files.

Console initialization is performed by the function tty_init(), in tty_io.c. This function is only concerned in getting major device numbers and calling the init function for each device set. con_init(), then is the one related to the console, and resides in console.c.

[NEW] Initialization of the console has changed quite a lot during 1.1 evolution. console_init() has been detatched from tty_init(), and is called directly by ../../main.c. The virtual consoles are now dynamically allocated, and quite a good deal of code has changed. So, I'll skip the details of initialization, allocation and such.

How file operations are dispatched to the console

This paragraph is quite low-level, and can be happily skipped over.

Needless to say, a Unix device is accessed though the filesystem. This paragraph details all steps from the device file to the actual console functions. Moreover, the following information is extracted from the 1.1.73 source code, and it may be slightly different from the 1.0 source.

When a device inode is opened, the function chrdev_open() (or blkdev_open(), but we'll stich to character devices) in ../../fs/devices.c gets executed. This function is reached by means of the structure def_chr_fops, which in turn is referenced by chrdev_inode_operations, used by all the filesystem types (see the previous section about filesystems).

chrdev_open takes care of specifying the device operations by substituting the device specific file_operations table in the current filp and calls the specific open(). Device specific tables are kept in the array chrdevs[], indexed by the majour device number, and filled by the same ../../fs/devices.c.

If the device is a tty one (aren't we aiming at the console?), we come to the tty drivers, whose functions are in tty_io.c, indexed by tty_fops. Thus, tty_open() calls init_dev(), which allocates any data structure needed by the device, based on the minor device number.

The minor number is also used to retrieve the actual driver for the device, which has been registered through tty_register_driver(). The driver, then, is still another structure used to dispatch computation, just like file_ops; it is concerned with writing and controlling the device. The last data structure used in managing a tty is the line discipline, described later. The line discipline for the console (and any other tty device) is set by initialize_tty_struct(), invoked by init_dev.

Everything we touched in this paragraph is device-independent. The only console-specific particular is that console.c, has registered its own driver during con_init(). The line discipline, on the contrary, in independent of the device.

[MORE] The tty_driver structure is fully explained within <linux/tty_driver.h>.

[NEW] The above information has been extracted from 1.1.73 source code. It isn't unlikely for your kernel to be somewhat different (``This information is subject to change without notice'').

Writing to the console

When a console device is written to, the function con_write gets invoked. This function manages all the control characters and escape sequences used to provide applications with complete screen management. The escape sequences implemented are those of the vt102 terminal; This means that your environment should say TERM=vt102 when you are telnetting to a non-Linux host; the best choice for local activities, however, is TERM=console because the Linux console offers a superset of vt102 functionality.

con_write(), thus, is mostly made up of nested switch statements, used to handle a finite state automaton interpreting escape sequences one character at a time. When in normal mode, the character being printed is written directly to the video memory, using the current attr-ibute. Within console.c, all the fields of struct vc are made accessible through macros, so any reference to (for example) attr, does actually refer to the field in the structure vc_cons[currcons], as long as currcons is the number of the console being referred to.

[NEW] Actually, vc_cons in newer kernels is no longer an array of structures , it now is an array of pointers whose contents are kmalloc()ed. The use of macros greatly simplified changing the approach, because much of the code didn't need to be rewritten.

Actual mapping and unmapping of the console memory to screen is performed by the functions set_scrmem() (which copies data from the console buffer to video memory) and get_scrmem (which copies back data to the console buffer). The private buffer of the current console is physically mapped on the actual video RAM, in order to minimize the number of data transfers. This means that get- and set-_scrmem() are static to console.c and are called only during a console switch.

Reading the console

Reading the console is accomplished through the line-discipline. The default (and unique) line discipline in Linux is called tty_ldisc_N_TTY. The line discipline is what ``disciplines input through a line''. It is another function table (we're used to the approach, aren't we?), which is concerned with reading the device. With the help of termios flags, the line discipline is what controls input from the tty: raw, cbreak and cooked mode; select(); ioctl() and so on.

The read function in the line discipline is called read_chan(), which reads the tty buffer independently of whence it came from. The reason is that character arrival through a tty is managed by asynchronous hardware interrupts.

[MORE] The line discipline N_TTY is to be found in the same tty_io.c, though later kernels use a different n_tty.c source file.

The lowest level of console input is part of keyboard management, and thus it is handled within keyboard.c, in the function keyboard_interrupt().

Keyboard management

Keyboard management is quite a nightmare. It is confined to the file keyboard.c, which is full of hexadecimal numbers to represent the various keycodes appearing in keyboards of different manifacturers.

I won't dig in keyboard.c, because no relevant information is there to the kernel hacker.

[MORE] For those readers who are really interested in the Linux keyboard, the best approach to keyboard.c is from the last line upward. Lowest level details occur mainly in the first half of the file.

Switching the current console

The current console is switched through invocation of the function change_console(), which resides in tty_io.c and is invoked by both keyboard.c and vt.c (the former switches console in response to keypresses, the latter when a program requests it by invoking an ioctl() call).

The actual switching process is performed in two steps, and the function complete_change_console() takes care of the second part of it. Splitting the switch is meant to complete the task after a possible handshake with the process controlling the tty we're leaving. If the console is not under process control, change_console() calls complete_change_console() by itself. Process intervertion is needed to successfully switch from a graphic console to a text one and viceversa, and the X server (for example) is the controlling process of its own graphic console.

The selection mechanism

``selection'' is the cut and paste facility for the Linux text consoles. The mechanism is mainly handled by a user-level process, which can be instantiated by either selection or gpm. The user-level program uses ioctl() on the console to tell the kernel to highlight a region of the screen. The selected text, then, is copied to a selection buffer. The buffer is a static entity in console.c. Pasting text is accomplished by `manually' pushing characters in the tty input queue. The whole selection mechanism is protected by #ifdef so users can disable it during kernel configuration to save a few kilobytes of ram.

Selection is a very-low-level facility, and its workings are hidden from any other kernel activity. This means that most #ifdef's simply deals with removing the highlight before the screen is modified in any way.

[NEW] Newer kernels feature improved code for selection, and the mouse pointer can be highlighted independently of the selected text (1.1.32 and later). Moreover, from 1.1.73 onward a dynamic buffer is used for selected text rather than a static one, making the kernel 4kB smaller.

`ioctl()`ling the device

The ioctl() system call is the entry point for user processes to control the behaviour of device files. Ioctl management is spawned by ../../fs/ioctl.c, where the real sys_ioctl() resides. The standard ioctl requests are performed right there, other file-related requests are processed by file_ioctl() (same source file), while any other request is dispatches to the device-specific ioctl() function.

The ioctl material for console devices resides in vt.c, because the console driver dispatches ioctl requests to vt_ioctl().

[NEW] The information above refer to 1.1.7x. The 1.0 kernel doesn't have the ``driver'' table, and vt_ioctl() is pointed to directly by the file_operations() table.

Ioctl material is quite confused, indeed. Some requests are related to the device, and some are related to the line discipline. I'll try to summarize things for the 1.0 and the 1.1.7x kernels. Anything happened in between.

The 1.1.7x series features the following approach: tty_ioctl.c implements only line discipline requests (namely n_tty_ioctl(), which is the only n_tty function outside of n_tty.c), while the file_operations field points to tty_ioctl() in tty_io.c. If the request number is not resolved by tty_ioctl(), it is passed along to tty->driver.ioctl or, if it fails, to tty->ldisc.ioctl. Driver-related stuff for the console it to be found in vt.c, while line discipline material is in tty_ioctl.c.

In the 1.0 kernel, tty_ioctl() is in tty_ioctl.c and is pointed to by generic tty file_operations. Unresolved requests are passed along to the specific ioctl function or to the line-discipline code, in a way similar to 1.1.7x.

Note that in both cases, the TIOCLINUX request is in the device-independent code. This implies that the console selection can be set by ioctlling any tty (set_selection() always operates on the foreground console), and this is a security hole. It is also a good reason to switch to a newer kernel, where the problem is fixed by only allowing the superuser to handle the selection.

A variety of requests can be issued to the console device, and the best way to know about them is to browse the source file vt.c.

Messages