Friday, May 25, 2007

Linux Linked List Implementation

Before we go on to examine the implementation of wait queues, we must acquaint ourselves with the standard Linux doubly-linked list implementation. Wait queues (as well as everything else in Linux) make heavy use of it, and it is known in kernel jargon as the "list.h implementation" because the most relevant file is include/linux/list.h.

The fundamental data structure here is struct list_head:

struct list_head {
        struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
        struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
        (ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

#define list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)

The first three macros initialise an empty list by pointing both the next and prev pointers back at the head itself. C syntactical restrictions dictate which one to use where: LIST_HEAD_INIT() initialises a structure member within a declaration, LIST_HEAD() declares and initialises a static variable in one step, and INIT_LIST_HEAD() initialises a list head at run time inside a function.

The macro list_entry() gives access to an individual list element, for example (from fs/file_table.c:fs_may_remount_ro()):

struct super_block {
        ...
        struct list_head s_files;
        ...
} *sb = &some_super_block;

struct file {
        ...
        struct list_head f_list;
        ...
} *file;

struct list_head *p;

for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
        struct file *file = list_entry(p, struct file, f_list);
        /* do something to 'file' */
}

A good example of the use of the list_for_each() macro is in the scheduler, where we walk the run queue looking for the process with the highest goodness:

static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;

list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p)) {
                int weight = goodness(p, this_cpu, prev->active_mm);
                if (weight > c)
                        c = weight, next = p;
        }
}

Here, p->run_list is declared as struct list_head run_list inside the task_struct structure and serves as the anchor linking the task into the list. Removing an element from the list and adding one (to the head or tail of the list) is done by the list_del()/list_add()/list_add_tail() macros. The examples below add and remove a task from the run queue:

static inline void del_from_runqueue(struct task_struct * p)
{
        nr_running--;
        list_del(&p->run_list);
        p->run_list.next = NULL;
}

static inline void add_to_runqueue(struct task_struct * p)
{
        list_add(&p->run_list, &runqueue_head);
        nr_running++;
}

static inline void move_last_runqueue(struct task_struct * p)
{
        list_del(&p->run_list);
        list_add_tail(&p->run_list, &runqueue_head);
}

static inline void move_first_runqueue(struct task_struct * p)
{
        list_del(&p->run_list);
        list_add(&p->run_list, &runqueue_head);
}

Wednesday, May 16, 2007

Interrupts

Interrupts Exceptions and Traps

Normally, processes are asleep, waiting on some event. When that event happens, these processes are called into action. Remember, it is the responsibility of the sched process to free memory when a process runs short of it. So, it is not until memory is needed that sched starts up.

How does sched know that memory is needed? When a process makes reference to a place in its virtual memory space that does not yet exist in physical memory, a page fault occurs. Faults belong to a group of system events called exceptions. An exception is simply something that occurs outside of what is normally expected. Faults (exceptions) can occur either before or during the execution of an instruction.

For example, if an instruction that is not yet in memory needs to be read, the exception (page fault) occurs before the instruction starts being executed. On the other hand, if the instruction is supposed to read data from a virtual memory location that isn't in physical memory, the exception occurs during the execution of the instruction. In cases like these, once the missing memory location is loaded into physical memory, the CPU can start the instruction.

Traps are exceptions that occur after an instruction has been executed. For example, attempting to divide by zero generates an exception. In this case, however, it makes no sense to restart the instruction: every time we try to run it, it comes up with the same divide-by-zero exception. All memory references are read before the instruction executes, so there is nothing to reload and retry.

It is also possible for processes to generate exceptions intentionally. These programmed exceptions are called software interrupts.

When any one of these exceptions occurs, the system must react to the exception. To react, the system will usually switch to another process to deal with the exception, which means a context switch. In our discussion of process scheduling, I mentioned that at every clock tick the priority of every process is recalculated. To make those calculations, something other than those processes has to run.

In Linux, the system timer (or clock) is programmed to generate a hardware interrupt 100 times a second (as defined by the HZ system parameter). The interrupt is accomplished by sending a signal to a special chip on the motherboard called an interrupt controller. (We go into more detail about these in the section on hardware.) The interrupt controller then sends an interrupt to the CPU. When the CPU receives this signal, it knows that the clock tick has occurred and it jumps to a special part of the kernel that handles the clock interrupt. Scheduling priorities are also recalculated within this same section of code.

Because the system might be doing something more important when the clock generates an interrupt, you can turn interrupts off using "masking". In other words, there is a way to mask out interrupts. Interrupts that can be masked out are called maskable interrupts. An example of something more important than the clock would be accepting input from the keyboard. This is why clock ticks are lost on systems with a lot of users inputting a lot of data. As a result, the system clock appears to slow down over time.

Sometimes events occur on the system that you want to know about no matter what. Imagine what would happen if memory was bad. If the system was in the middle of writing to the hard disk when it encountered the bad memory, the results could be disastrous. If the system recognizes the bad memory, the hardware generates an interrupt to alert the CPU. If the CPU is told to ignore all hardware interrupts, it would ignore this one. Instead, the hardware has the ability to generate an interrupt that cannot be ignored, or "masked out", called a non-maskable interrupt. Non-maskable interrupts are generically referred to as NMIs.

When an interrupt or an exception occurs, it must be dealt with to ensure the integrity of the system. How the system reacts depends on whether it was an exception or interrupt. In addition, what happens when the hard disk generates an interrupt is going to be different than when the clock generates one.

Within the kernel is the Interrupt Descriptor Table (IDT), which is a list of descriptors (pointers) that point to the functions that handle the particular interrupt or exception. These functions are called the interrupt or exception handlers. When an interrupt or exception occurs, it has a particular value, called an identifier or vector. Table 0-2 contains a list of the defined interrupt vectors.

Table 0-2. Interrupt Vectors

Id      Description
 0      Divide error
 1      Debug exception
 2      Non-maskable interrupt
 3      Breakpoint
 4      Overflow
 5      Bounds check
 6      Invalid opcode
 7      Coprocessor not available
 8      Double fault
 9      (reserved)
10      Invalid TSS
11      Segment not present
12      Stack exception
13      General protection fault
14      Page fault
15      (reserved)
16      Coprocessor error
17      Alignment error (80486)
18-31   (reserved)
32-255  External (HW) interrupts

These numbers are actually indices into the IDT. When an interrupt, exception, or trap occurs, the system knows which number corresponds to that event. It then uses that number as an index into the IDT, which in turn points to the appropriate area of memory for handling the event.

It is possible for devices to share interrupts; that is, multiple devices on the system can be (and often are) configured to use the same interrupt. In fact, certain kinds of computers are designed to allow devices to share interrupts (I'll talk about them in the hardware section). If the interrupt number is an offset into a table of pointers to interrupt routines, how does the kernel know which one to call?

As it turns out, there are two IDTs: one for shared interrupts and one for non-shared interrupts. During a kernel rebuild (more on that later), the kernel determines whether the interrupt is shared. If it is, it places the pointer to that interrupt routine into the shared IDT. When an interrupt is generated, the interrupt routine for each of these devices is called. It is up to the interrupt routine to check whether the associated device really generated an interrupt. The order in which they are called is the order in which they are linked.

When an exception happens in user mode, the process passes through a trap gate. At this point, the CPU no longer uses the process's user stack, but rather the system stack within that process's task structure. (Each task structure has a portion set aside for the system stack.) At this point, the process is operating in system (kernel) mode, that is, at the highest privilege level, 0.

The kernel treats interrupts very similarly to the way it treats exceptions: all the general purpose registers are pushed onto the system stack and a common interrupt handler is called. The current interrupt priority is saved and the new priority is loaded. This prevents interrupts at lower priority levels from interrupting the kernel while it handles this interrupt. Then the real interrupt handler is called.

Because an exception is not fatal, the process will return from whence it came. It is possible that a context switch occurs immediately on return from kernel mode. This might be the result of an exception with a lower priority. Because it could not interrupt the process in kernel mode, it had to wait until it returned to user mode. Because the exception has a higher priority than the process when it is in user mode, a context switch occurs immediately after the process returns to user mode.

It is abnormal for another exception to occur while the process is in kernel mode. Even a page fault can be considered a software event. Because the entire kernel is in memory all the time, a page fault should not happen. When a page fault does happen when in kernel mode, the kernel panics. Special routines have been built into the kernel to deal with the panic to help the system shut down as gracefully as possible. Should something else happen to cause another exception while the system is trying to panic, a double panic occurs.

This may sound confusing because I just said that a context switch could occur as the result of another exception. What this means is that the exception occurred in user mode, so there must be a jump to kernel mode. This does not mean that the process continues in kernel mode until it is finished. It may (depending on what it is doing) be context-switched out. If another process has run before the first one gets its turn on the CPU again, that process may generate the exception.

Unlike exceptions, another interrupt could possibly occur while the kernel is handling the first one (and therefore is in kernel mode). If the second interrupt has a higher priority than the first, a context switch will occur and the new interrupt will be handled. If the second interrupt has the same or lower priority, then the kernel will "put it on hold." These are not ignored, but rather saved (queued) to be dealt with later.

Thread Library changes between 2.4 and 2.6 Linux Kernels

Notes:

For most application developers, the majority of the changes made to the Linux kernel between the 2.4 and 2.6 kernel families have little direct impact. Most kernel changes only manifest themselves through increased system performance and capacity. Kernel and system changes that affect how applications spawn and manage other processes and threads are a significant exception to this rule.

The 2.6 Linux kernel introduces a new, improved threading model that is implemented through the Native POSIX Thread Library (NPTL). The adoption of a new threading model has significant implications for developers, system run-time libraries such as the GNU C library (glibc), shared application libraries, and so on. This white paper provides an overview of basic threading concepts, discusses new and existing Linux threading models, and then highlights the sorts of application changes that you might have to make to existing multi-threaded applications in order to enable them to work correctly with NPTL under the 2.6 Linux kernel.

Threading 101


On multi-processing systems such as Linux, the concept of one process creating another is fundamental. The most obvious example is a shell such as the Bash shell, the standard Linux command interpreter. The shell executes applications in response to user commands, and can either start commands and wait for them to finish or execute them as separate, concurrent processes (commonly referred to as "running in the background").

One process typically creates another through a set of fork() and exec() function calls. The initial process calls the fork() function, creating a clone of itself (known as a child process) that inherits the entire execution environment of the process that created it. The fork() call returns the child's process identifier (PID) in the parent process and 0 in the child process. The child process then uses the exec() call to execute some other command, which totally replaces its existing execution context with that of the exec()'d command. Meanwhile, the parent process typically either exits immediately or waits until the child process returns its status. A simple example of this is the following:

pid_t pid;

if ((pid = fork()) != 0) {
        /* parent code (pid is the child's PID; -1 means fork failed) */
        exit(1);
} else {
        /* child code */
        execl("command", "command", "arg1", "arg2", (char *)NULL);
        printf("Should never get here...\n");  /* reached only if execl fails */
        exit(-1);
}

This model for process creation is simple and easy to use, but requires substantial system overhead. Child processes inherit a separate copy of the complete state of their parent, including its address space, global variables, open file descriptors, and so on. This consumes a fair amount of system resources, makes it relatively complex for child processes to share resources or communicate with each other, and is extremely resource-intensive for use at the operating system level.

Threads are similar to processes, but with some notable exceptions that make them far more suitable for use in complex, modern systems. Threads are essentially sub-processes that share the majority of the resources of their parent process, and thus require a lesser amount of system resources. A traditional way of thinking about threads is by referring to them as lightweight processes. All of the threads created by a process share the same address space, global variables, open file descriptors, pending alarms, signals and signal handlers, and system accounting information. Each thread maintains its own program counter, registers, stack, and state information.

A key differentiator between different thread implementations is the interaction between kernel threads, user-space processes, and user-space threads. This will be discussed in more detail later in this white paper.

Linux Threading Models


The standard Linux threading library in all stable versions of Linux prior to 2.6 is known as LinuxThreads. This library has been supported for use with the GNU C library, glibc, since glibc 2.0, and is largely POSIX-compliant.

NOTE: The primary libraries produced when compiling the LinuxThreads and Native POSIX Thread Library source code are libpthread.so and libpthread.a (for POSIX thread library). For this reason, the terms LinuxThreads and pthreads have historically been used interchangeably, which is confusing in light of the adoption of NPTL. This whitepaper uses the terms LinuxThreads and NPTL to clearly differentiate between the two threading libraries and their capabilities. The actual library names were preserved to minimize changes to existing Makefiles and linking commands.

LinuxThreads has a variety of performance, scalability, and usability limitations. It uses a compile-time setting for the number of threads that a single process can create, and uses a per-process manager thread to create and coordinate between all threads owned by each process. This significantly increases the overhead of creating and destroying threads. Signal handling is done on a per-process, rather than a per-thread, basis, though each thread has a separate process ID. Issues such as these, in combination with design issues such as an asymmetric relationship between kernel and user-space threads and the lack of per-thread synchronization primitives for inter-thread communication and resource sharing, place some fundamental limitations on the number of threads that can simultaneously be created and do meaningful work in a LinuxThreads implementation.

The next threading model introduced for Linux was IBM's Next Generation POSIX Threads (NGPT) project. This was an external threading library that worked in conjunction with the LinuxThreads package, but provided additional POSIX compliance and better performance than the standard LinuxThreads package. This threading package is still available for use with 2.4 and earlier Linux kernels, but is no longer actively being worked on due to the new threading model introduced during the 2.5 kernel development series and made publicly available in the 2.6 kernel tree.

The Native POSIX Thread Library replaces LinuxThreads and NGPT in the 2.5 and later kernels. NPTL brings high-performance threading support to Linux, and provides the foundation required by multi-threaded enterprise applications such as database systems, high-capacity and high-load web and mail servers, and so on. NPTL was developed as part of the 2.5 kernel development process and was integrated with Linux run-time elements such as glibc at that time. NPTL is the strategic direction for Linux threading in the future.

Some Linux distributions, such as later versions of Red Hat Linux, have backported NPTL to earlier kernels and have even made the threading environment for specific processes selectable through an environment variable (LD_ASSUME_KERNEL). On systems that support this feature, the variable is set via a command such as the following:

# export LD_ASSUME_KERNEL=2.4.1

This is a clever way to enable some existing applications that rely on LinuxThreads to continue to work in an NPTL environment, but is a short-term solution. To make the most of the design and performance benefits provided by NPTL, you should update the code for any existing applications that use threading.

The next sections discuss the implications of the changes between the LinuxThreads and NPTL threading implementations as far as your applications are concerned. The first section discusses development changes related to the updated toolchains used with the 2.6 kernel. A subsequent section highlights aspects of your applications that you must change, due to differences between the LinuxThreads and NPTL implementations, in order to make the most of the increased power and performance of NPTL.

Recompiling Applications for 2.6 and NPTL

Though many applications will migrate from Linux 2.4 to 2.6 without recompilation, the addition of Native POSIX Thread Library technology will require minor modifications in most threaded applications. A subsequent section of this whitepaper, "Updating Applications for NPTL", discusses the primary situations where recompilation may be required.

Those applications that do require recompilation may be affected by the updated compilers included with distributions such as TimeSys Linux that are based around a 2.6-based kernel. TimeSys ships its 2.6 Reference Distributions with version 3.3.2 of the C and C++ compilers from the GNU Compiler Collection (GCC), and uses an updated version of the binutils package that makes up the remainder of a standard Linux toolchain (providing the assembler, linker, library archiver, and so on).

Simply using a 2.6-based kernel does not mean that you are automatically using NPTL. To determine the threading library that a system uses, you can execute the getconf command (part of the glibc package) to examine the GNU_LIBPTHREAD_VERSION configuration variable, as in the following example:

# getconf GNU_LIBPTHREAD_VERSION
linuxthreads-0.10

If your system uses NPTL, the command returns the NPTL version that your system is using, as in the following example:

# getconf GNU_LIBPTHREAD_VERSION
nptl-0.60

If you are building your own toolchain to work with a 2.6 system you must, of course, make sure that you've built the toolchain on a 2.6 system, and that you have enabled NPTL support in the version of the C library used by your toolchain. You will also need recent source code for your C library. For example, if you are building glibc with NPTL support, version 2.3.3 or better of the glibc source is recommended. Enabling NPTL support when building a C library such as glibc is analogous to the mechanism used for enabling LinuxThreads support: this is done by using the configuration option --enable-add-ons=nptl when configuring glibc, rather than --enable-add-ons=linuxthreads. The latest NPTL source code is available here. You will also need a recent version of GCC (3.3.2 or better suggested for x86 platforms, 3.4 suggested for PPC platforms), and a version of binutils that supports the Call Frame Information (CFI) directive (2.14.90.0.7 or better suggested; available here).

In general, if your applications were compiled with earlier versions of GCC, you may notice changes in warning messages, code size, and the options provided by the GCC compiler and the rest of the toolchain. It is especially important to be aware of the potential for increased code size when you are recompiling embedded applications. You may need to take advantage of additional or new optimization options in order to continue to fit your existing applications into resource-constrained environments. Newer versions of GCC are also increasingly conformant to various C and C++ specifications, and are therefore likely to complain about aspects of your application code that earlier versions of these compilers ignored. In general, new warnings from updated compilers should be seen as an opportunity to weed out latent defects.

Newer versions of GCC have also deprecated some machine-specific options (specified using -m) in favor of more general optimization options (specified using -f). This may require that you update values such as the CFLAGS options specified in application Makefiles.

Finally, newer versions of GCC implement optimizations that require them to notice possible alias situations that may have been ignored by earlier compilers. The aliasing issues can be resolved (at some performance cost) by using the no strict aliasing compiler option (-fno-strict-aliasing).

Updating Applications for NPTL

The changes between 2.4's threading support and NPTL provide significant design and performance improvements. NPTL is far more compliant with the POSIX specification than the LinuxThreads package was under 2.4. NPTL also supports useful features such as mutexes that are shared among threads, simplifying resource sharing, conflict prevention, and increasing parallelism. Finally, NPTL is vastly more efficient than 2.4's threading support. Some of the standard performance metrics used at TimeSys have shown NPTL implementations to be up to three orders of magnitude faster than the same code using LinuxThreads.

Some of the more complex changes that you may have to make in your application logic when moving to NPTL are changes related to NPTL's improved support for POSIX signals and signal handling. LinuxThreads implemented generic Unix-style threads, but was limited by various implementation details. NPTL is a POSIX-compliant threading implementation, and therefore handles signals better both between processes and between all threads within those processes. With NPTL, signals can now be sent from one thread to another, rather than simply on a per-process basis. Signals can also use arguments to transfer information from one thread to another.

Using NPTL also requires that you make changes to existing code that needs to be able to uniquely identify specific threads. Under LinuxThreads, each thread had a unique process ID (PID). Under NPTL, each thread shares the PID of its parent process, and the getpid() function therefore returns the same process ID for all threads in a process. With NPTL, a thread's thread ID must be used to uniquely identify individual threads.

Changes to thread and process IDs also mean that the functions traditionally used to spawn processes must now be thread-aware. For example, the exec() functions are now thread-aware, so that a thread inherits the PID of its caller. However, any application that depended on having all threads survive an exec() will need some modification. The parent of a multi-threaded process is notified of a child's termination only when the entire process terminates. Thread-related changes have also been made to the behavior of the related fork() calls. For example, functions registered with the pthread_atfork() function are no longer run when a vfork() occurs.

In addition to changing the internals of thread identification, NPTL does away with the LinuxThreads notion of a manager thread, simplifying the process/thread relationship and eliminating what was essentially administrative overhead under LinuxThreads. This may require application changes if, for example, your application kept track of the number of threads running on its behalf or looked for the manager thread as a signal target.

Finally, using a new threading library means that certain threading functions that were available under LinuxThreads are no longer supported under the NPTL. For example, the pthread_kill_other_threads_np() function is no longer available. This function was used to simulate POSIX-conformant exec() functions. Since the exec() functions themselves are now POSIX conformant, the helper function is not necessary.

For additional information about the design and implementation of the NPTL, a general design and philosophy document is available here.

Conclusion

Changing the kernel that your computer system runs is not necessarily an easy task, but is eminently doable. The white papers in this series have highlighted the issues involved in configuring the 2.6 kernel, updating device drivers, migrating desktop and custom systems, and updating applications. Certified 2.6-based distributions targeted for embedded use are already available from vendors such as TimeSys, whose 2.6 reference distribution was the first 2.6-based distribution for PPC systems. High-quality commercial software such as TimeSys TimeStorm IDE and TimeStorm Linux Developers Suite (LDS) is designed to help you migrate any Linux kernel, device driver, application, or deployed system to take advantage of the power of the 2.6 kernel and its updated packages, threading methodology, and so on.

Linux is a shining example of the power of the Open Source movement as a positive force for change in the software industry. The Linux kernel, the core of any Linux distribution, is constantly evolving to incorporate new technologies and improve performance, scalability, support, and usability. The 2.6 kernel increases the spectrum of systems for which Linux is well-suited, from PDAs, process control systems, and set-top boxes all the way to enterprise servers. The cost, power, and supportability advantages offered by Linux are more obvious than ever in today's business market and fast-paced technical environment.

Monday, May 14, 2007

Linux kernel threads in Device Drivers

Purpose
This example shows how to create and stop a kernel thread.
The driver is implemented as a loadable module. In the init_module() routine, five kernel threads are created. These kernel threads sleep for one second, wake up, print a message, and fall asleep again. On unload of the module (cleanup_module), the kernel threads are killed.
The example has been tested with Linux kernel 2.4.2 on Intel (uniprocessor only) and Alpha platforms (COMPAQ Personal Workstation 500au (uniprocessor), DS20 and ES40 (SMP)).

A version for the 2.2 kernel can be found here. Note: depending on the context of the creator of the threads, the new threads may inherit properties from the parent that you do not want. The new version avoids this by having keventd create the threads. The 2.2 kernel does not have a keventd, so this approach is not implementable there.
Functions in example

* start_kthread: creates a new kernel thread. Can be called from any process context, but not from interrupt context. The function blocks until the thread has started.
* stop_kthread: stops the thread. Can be called from any process context except the thread to be terminated; cannot be called from interrupt context. The function blocks until the thread has terminated.
* init_kthread: sets up the environment of the new thread. Must be called from within the created thread.
* exit_kthread: must be called by the thread to be terminated on exit.

Creation of new Thread

A new thread is created with kernel_thread(). The thread inherits properties from its parent. To make sure that we do not get any weird properties, we let keventd create the new thread.
The new thread is created with start_kthread(). It uses a semaphore to block until the new thread is running: a down() blocks the start_kthread() routine until the corresponding up() call in init_kthread() is executed.
The new thread must call init_kthread() in order to let the creator continue.

Stop of new Thread

stop_kthread() sets a flag that the thread uses to determine whether to die, and sends a SIGKILL to the thread. This signal causes the thread to be woken up. On wakeup it checks the flag and then terminates itself by calling exit_kthread() and returning from the thread function. Using a semaphore, the stop_kthread() function blocks until the thread has terminated.

Initialization of new Thread

Within the newly created thread, init_kthread() needs to be called. This function sets a signal mask, initialises a wait queue and the termination flag, and sets a new name for the thread. With an up() call it notifies the creator that the setup is done.


Exit of new Thread


When the thread receives the notification to terminate itself, it calls the exit_kthread() function. It notifies the stop_kthread() function that it has terminated with an up() call.

The new Thread itself

The new thread is implemented in the example_thread() function. It runs an endless loop (for (;;)). In the loop it falls asleep with the interruptible_sleep_on_timeout() function. It comes out of this function either when the timeout expires or when a signal is caught.
The "work" in the thread is to print out a message with printk.
Kernel Versions
The example has been tested on 2.4.2.
Example Device Driver Code
The example consists of four files: kthread.h, kthread.c, thread_drv.c and a Makefile
kthread.h

#ifndef _KTHREAD_H
#define _KTHREAD_H
#include
#include

#include
#include
#include
#include

#include
#include

/* a structure to store all information we need
   for our thread */
typedef struct kthread_struct
{
        /* private data */

        /* Linux task structure of thread */
        struct task_struct *thread;
        /* Task queue needed to launch thread */
        struct tq_struct tq;
        /* function to be started as thread */
        void (*function) (struct kthread_struct *kthread);
        /* semaphore needed on start and creation of thread. */
        struct semaphore startstop_sem;

        /* public data */

        /* queue thread is waiting on. Gets initialized by
           init_kthread, can be used by thread itself.
        */
        wait_queue_head_t queue;
        /* flag to tell thread whether to die or not.
           When the thread receives a signal, it must check
           the value of terminate and call exit_kthread and terminate
           if set.
        */
        int terminate;
        /* additional data to pass to kernel thread */
        void *arg;
} kthread_t;

/* prototypes */

/* start new kthread (called by creator) */
void start_kthread(void (*func)(kthread_t *), kthread_t *kthread);

/* stop a running thread (called by "killer") */
void stop_kthread(kthread_t *kthread);

/* setup thread environment (called by new thread) */
void init_kthread(kthread_t *kthread, char *name);

/* cleanup thread environment (called by thread upon receiving termination signal) */
void exit_kthread(kthread_t *kthread);

#endif

kthread.c

#include <linux/config.h>
#include <linux/version.h>

#if defined(MODVERSIONS)
#include <linux/modversions.h>
#endif
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/tqueue.h>
#include <linux/interrupt.h>
#include <linux/smp_lock.h>

#include <asm/signal.h>
#include <asm/semaphore.h>

#include "kthread.h"

/* private functions */
static void kthread_launcher(void *data)
{
kthread_t *kthread = data;
kernel_thread((int (*)(void *))kthread->function, (void *)kthread, 0);

}

/* public functions */

/* create a new kernel thread. Called by the creator. */
void start_kthread(void (*func)(kthread_t *), kthread_t *kthread)
{
/* initialize the semaphore:
we start with the semaphore locked. The new kernel
thread will setup its stuff and unlock it. This
control flow (the one that creates the thread) blocks
in the down operation below until the thread has reached
the up() operation.
*/
init_MUTEX_LOCKED(&kthread->startstop_sem);

/* store the function to be executed in the data passed to
the launcher */
kthread->function=func;

/* create the new thread by running a task through keventd */

/* initialize the task queue structure */
kthread->tq.sync = 0;
INIT_LIST_HEAD(&kthread->tq.list);
kthread->tq.routine = kthread_launcher;
kthread->tq.data = kthread;

/* and schedule it for execution */
schedule_task(&kthread->tq);

/* wait till it has reached the setup_thread routine */
down(&kthread->startstop_sem);

}

/* stop a kernel thread. Called by the removing instance */
void stop_kthread(kthread_t *kthread)
{
if (kthread->thread == NULL)
{
printk("stop_kthread: killing non existing thread!\n");
return;
}

/* this function needs to be protected with the big
kernel lock (lock_kernel()). The lock must be
grabbed before changing the terminate
flag and released after the down() call. */
lock_kernel();

/* initialize the semaphore. We lock it here; the
exit_kthread call of the thread to be terminated
will unlock it. As soon as we see the semaphore
unlocked, we know that the thread has exited.
*/
init_MUTEX_LOCKED(&kthread->startstop_sem);

/* We need to do a memory barrier here to be sure that
the flags are visible on all CPUs.
*/
mb();

/* set flag to request thread termination */
kthread->terminate = 1;

/* We need to do a memory barrier here to be sure that
the flags are visible on all CPUs.
*/
mb();
kill_proc(kthread->thread->pid, SIGKILL, 1);

/* block till thread terminated */
down(&kthread->startstop_sem);

/* release the big kernel lock */
unlock_kernel();

/* now we are sure the thread is in zombie state. We
notify keventd to clean the process up.
*/
kill_proc(2, SIGCHLD, 1);

}

/* initialize new created thread. Called by the new thread. */
void init_kthread(kthread_t *kthread, char *name)
{
/* lock the kernel. A new kernel thread starts without
the big kernel lock, regardless of the lock state
of the creator (the lock depth is *not* inherited)
*/
lock_kernel();

/* fill in thread structure */
kthread->thread = current;

/* set signal mask to what we want to respond */
siginitsetinv(&current->blocked, sigmask(SIGKILL)|sigmask(SIGINT)|sigmask(SIGTERM));

/* initialise wait queue */
init_waitqueue_head(&kthread->queue);

/* initialise termination flag */
kthread->terminate = 0;

/* set name of this process (max 15 chars + terminating 0!) */
snprintf(current->comm, sizeof(current->comm), "%s", name);

/* let others run */
unlock_kernel();

/* tell the creator that we are ready and let him continue */
up(&kthread->startstop_sem);

}

/* cleanup of thread. Called by the exiting thread. */
void exit_kthread(kthread_t *kthread)
{
/* we are terminating */

/* lock the kernel, the exit will unlock it */
lock_kernel();
kthread->thread = NULL;
mb();

/* notify the stop_kthread() routine that we are terminating. */
up(&kthread->startstop_sem);
/* the kernel_thread that called clone() does a do_exit here. */

/* there is no race here between execution of the "killer" and real termination
of the thread (race window between up and do_exit), since both the
thread and the "killer" function are running with the kernel lock held.
The kernel lock will be freed after the thread has exited, so the code
is really not executed anymore as soon as the unload function gets
the kernel lock back.
The init process may not have made the cleanup of the process here,
but the cleanup can be done safely with the module unloaded.
*/

}

thread_drv.c

#include <linux/config.h>
#include <linux/version.h>

#include <linux/module.h>
#if defined(MODVERSIONS)
#include <linux/modversions.h>
#endif

#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/sched.h>

#include "kthread.h"

#define NTHREADS 5

/* the variable that contains the thread data */
kthread_t example[NTHREADS];

/* prototype for the example thread */
static void example_thread(kthread_t *kthread);

/* load the module */
int init_module(void)
{
int i;

/* create new kernel threads */
for (i = 0; i < NTHREADS; i++)
start_kthread(example_thread, &example[i]);

return(0);
}

/* remove the module */
void cleanup_module(void)
{
int i;

/* terminate the kernel threads */
for (i = 0; i < NTHREADS; i++)
stop_kthread(&example[i]);

return;
}

/* this is the thread function that we are executing */
static void example_thread(kthread_t *kthread)
{
/* setup the thread environment */
init_kthread(kthread, "example thread");

printk("hi, here is the kernel thread\n");

/* an endless loop in which we are doing our work */
for(;;)
{
/* fall asleep for one second */
interruptible_sleep_on_timeout(&kthread->queue, HZ);

/* We need to do a memory barrier here to be sure that
the flags are visible on all CPUs.
*/
mb();

/* here we are back from sleep, either due to the timeout
(one second), or because we caught a signal.
*/
if (kthread->terminate)
{
/* we received a request to terminate ourself */
break;
}

/* this is normal work to do */
printk("example thread: thread woke up\n");
}
/* here we go only in case of termination of the thread */

/* cleanup the thread, leave */
exit_kthread(kthread);

/* returning from the thread here calls the exit functions */
}

Makefile

# set to your kernel tree
KERNEL = /usr/src/linux

# get the Linux architecture. Needed to find proper include file for CFLAGS
ARCH=$(shell uname -m | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ -e s/arm.*/arm/ -e s/sa110/arm/)
# set default flags to compile module
CFLAGS = -D__KERNEL__ -DMODULE -I$(KERNEL)/include
CFLAGS+= -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing

all: thread_mod.o

# get configuration of kernel
include $(KERNEL)/.config
# modify CFLAGS with architecture specific flags
include $(KERNEL)/arch/${ARCH}/Makefile

# enable the module versions, if configured in kernel source tree
ifdef CONFIG_MODVERSIONS
CFLAGS+= -DMODVERSIONS -include $(KERNEL)/include/linux/modversions.h
endif
# enable SMP, if configured in kernel source tree
ifdef CONFIG_SMP
CFLAGS+= -D__SMP__
endif

# note: we are compiling the driver object file and then linking
# we link it into the module. With just one object file as in
# this example this is not needed. We can just load the object
# file produced by gcc
# link the thread driver module
thread_mod.o: thread_drv.o kthread.o
ld -r -o thread_mod.o thread_drv.o kthread.o
# compile the kthread object file
kthread.o: kthread.c kthread.h
gcc $(CFLAGS) -c kthread.c
# compile the thread driver
thread_drv.o: thread_drv.c kthread.h
gcc $(CFLAGS) -c thread_drv.c

clean:
rm -f *.o

Bugs
The code assumes that keventd is running with PID 2.

Friday, May 11, 2007

Idea on Writing a device driver in linux

Does the idea of writing a Linux device driver sound difficult? If
you have some basic programming experience, the task is
simpler than you think. Get started with this quick primer on
device driver programming.

What do I need to know about writing drivers?

Basic knowledge of kernel compilation, a good deal of C programming
experience under Linux and, lastly, a working command of the kernel's
data structures, such as linked lists, are essential.

The first thing a programmer must know before attempting to write a
driver is how the Linux kernel source compiles, paying attention
to the compilation process (the gcc compiler flags).

Choosing the device type

a) Block drivers

A block device is something that can host a filesystem, such as a disk. A
block device can only be accessed in multiples of a block, where a block
is usually one kilobyte of data.

b) Character drivers

A character device is one that can be accessed like a file, and a char
driver is in charge of implementing this behaviour. This driver implements
the open, close, read and write system calls. The console and parallel
ports are examples of char devices.

Device drivers in Linux are known as modules and can be loaded dynamically
into the kernel using the insmod command.

A single module can be compiled alone, and also can be linked to the
kernel (here, care has to be taken on the type of driver).

eg: A simple module

#define MODULE
#include <linux/module.h>

int init_module (void) /* Loads a module in the kernel */
{
printk("Hello kernel\n");
return 0;
}

void cleanup_module(void) /* Removes module from kernel */
{
printk("GoodBye Kernel\n");
}

Compiling the module

# gcc -c hello.c
# insmod hello.o

The output is

Hello kernel

# rmmod hello.o

GoodBye Kernel

How does init_module work?


init_module loads the relocated module image into kernel space and runs
the module's init function.

The module image begins with a module structure and is followed by code
and data as appropriate.

The module structure is defined as follows:

struct module
{
unsigned long size_of_struct;
struct module *next;
const char *name;
unsigned long size;
long usecount;
unsigned long flags;
unsigned int nsyms;
unsigned int ndeps;
struct module_symbol *syms;
struct module_ref *deps;
struct module_ref *refs;
int (*init)(void);
void (*cleanup)(void);
const struct exception_table_entry *ex_table_start;
const struct exception_table_entry *ex_table_end;
#ifdef __alpha__
unsigned long gp;
#endif
};


All of the pointer fields, with the exception of next and refs, are
expected to point within the module body and be initialized as appropriate
for kernel space, i.e. relocated with the rest of the module.

Return Values

On success, zero is returned. On error, -1 is returned
and errno is set appropriately.

Errors

EPERM The user is not the superuser.

ENOENT No module by that name exists.

EINVAL Some image slot filled in incorrectly, image->name
does not correspond to the original module name,
some image->deps entry does not correspond to a
loaded module, or some other similar inconsistency.

EBUSY The module's initialization routine failed.

EFAULT name or image is outside the program's accessible
address space.

How does cleanup_module work?


cleanup_module attempts to remove an unused loadable module entry. If
name is NULL, all unused modules marked auto clean will be removed.

Return Values

On success, zero is returned. On error, -1 is returned and errno is
set appropriately.

Errors

EPERM The user is not the superuser.

ENOENT No module by that name exists.

EINVAL name was the empty string.

EBUSY The module is in use.

EFAULT name is outside the program's accessible address
space.

A similar skeleton module in the book Linux Device Drivers is called skull, short
for Simple Kernel Utility for Loading Localities.

General flags used for compiling any driver are

-D__KERNEL__ -DMODULE -O -Wall -I$(INCLUDEDIR)

Note: The INCLUDEDIR should contain the header files of the kernel source.

Module code has to be recompiled for each version of the kernel that it
will be linked to. Each module defines a symbol called kernel_version,
which is defined in <linux/module.h>. In case of a version mismatch, use
the insmod -f (force) option to load the module.

Wednesday, May 9, 2007

Allocating Memory in the Kernel

Unfortunately for kernel developers, allocating memory in the kernel is not as simple as allocating memory in user space. A number of factors contribute to the complication, among them:

The kernel is limited to about 1GB of virtual and physical memory.
The kernel's memory is not pageable.
The kernel usually wants physically contiguous memory.
Often, the kernel must allocate the memory without sleeping.
Mistakes in the kernel have a much higher price than they do elsewhere.

Although easy access to an abundance of memory certainly is not a luxury to the kernel, a little understanding of the issues can go a long way toward making the process relatively painless.
A General-Purpose Allocator

The general interface for allocating memory inside of the kernel is kmalloc():

#include <linux/slab.h>
void * kmalloc(size_t size, int flags);

It should look familiar: it is pretty much the same as user space's malloc(), after all, except that it takes a second argument, flags. Let's ignore flags for a second and see what we recognize. First off, size is the same here as in malloc(): it specifies the size in bytes of the allocation. Upon successful return, kmalloc() returns a pointer to size bytes of memory. The alignment of the allocated memory is suitable for storage of and access to any type of object. As with malloc(), kmalloc() can fail, and you must check its return value against NULL. Let's look at an example:

struct falcon *p;
p = kmalloc(sizeof (struct falcon), GFP_KERNEL);
if (!p)
/* the allocation failed - handle appropriately */

Flags

The flags field controls the behavior of memory allocation. We can divide flags into three groups: action modifiers, zone modifiers and types. Action modifiers tell the kernel how to allocate memory. They specify, for example, whether the kernel can sleep (that is, whether the call to kmalloc() can block) in order to satisfy the allocation. Zone modifiers, on the other hand, tell the kernel from where the request should be satisfied. For example, some requests may need to be satisfied from memory that hardware can access through direct memory access (DMA). Finally, type flags specify a type of allocation. They group together relevant action and zone modifiers into a single mnemonic. In general, instead of specifying multiple action and zone modifiers, you specify a single type flag.

Table 1 is a listing of the action modifiers, and Table 2 is a listing of the zone modifiers. Many different flags can be used; allocating memory in the kernel is nontrivial. It is possible to control many aspects of memory allocation in the kernel. Your code should use the type flags and not the individual action and zone modifiers. The two most common flags are GFP_ATOMIC and GFP_KERNEL. Nearly all of your kernel memory allocations should specify one of these two flags.

Table 1. Action Modifiers
Flag Description
__GFP_COLD The kernel should use cache cold pages.
__GFP_FS The kernel can start filesystem I/O.
__GFP_HIGH The kernel can access emergency pools.
__GFP_IO The kernel can start disk I/O.
__GFP_NOFAIL The kernel can repeat the allocation.
__GFP_NORETRY The kernel does not retry if the allocation fails.
__GFP_NOWARN The kernel does not print failure warnings.
__GFP_REPEAT The kernel repeats the allocation if it fails.
__GFP_WAIT The kernel can sleep.

Table 2. Zone Modifiers
Flag Description
__GFP_DMA Allocate only DMA-capable memory.
No flag Allocate from wherever available.

The GFP_ATOMIC flag instructs the memory allocator never to block. Use this flag in situations where the code cannot sleep, where it must remain atomic, such as interrupt handlers, bottom halves and process context code that is holding a lock. Because the kernel cannot block the allocation and try to free up sufficient memory to satisfy the request, an allocation specifying GFP_ATOMIC has a lesser chance of succeeding than one that does not. Nonetheless, if your current context is incapable of sleeping, it is your only choice. Using GFP_ATOMIC is simple:

struct wolf *p;
p = kmalloc(sizeof (struct wolf), GFP_ATOMIC);
if (!p)
/* error */


Conversely, the GFP_KERNEL flag specifies a normal kernel allocation. Use this flag in code executing in process context without any locks. A call to kmalloc() with this flag can sleep; thus, you must use this flag only when it is safe to do so. The kernel utilizes the ability to sleep in order to free memory, if needed. Therefore, allocations that specify this flag have a greater chance of succeeding. If insufficient memory is available, for example, the kernel can block the requesting code and swap some inactive pages to disk, shrink the in-memory caches, write out buffers and so on.

Sometimes, as when writing an ISA device driver, you need to ensure that the memory allocated is capable of undergoing DMA. For ISA devices, this is memory in the first 16MB of physical memory. To ensure that the kernel allocates from this specific memory, use the GFP_DMA flag. Generally, you would use this flag in conjunction with either GFP_ATOMIC or GFP_KERNEL; you can combine flags with a binary OR operation. For example, to instruct the kernel to allocate DMA-capable memory and to sleep if needed, do:

char *buf;
/* we want DMA-capable memory,
* and we can sleep if needed */
buf = kmalloc(BUF_LEN, GFP_DMA | GFP_KERNEL);
if (!buf)
/* error */

Table 3 is a listing of the type flags, and Table 4 shows to which type flag each action and zone modifier equates. The header <linux/gfp.h> defines all of the flags.

Table 3. Types
Flag Description
GFP_ATOMIC The allocation is high-priority and does not sleep. This is the flag to use in interrupt handlers, bottom halves and other situations where you cannot sleep.

GFP_DMA This is an allocation of DMA-capable memory. Device drivers that need DMA-capable memory use this flag.

GFP_KERNEL This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep.

GFP_NOFS This allocation might block and might initiate disk I/O, but it does not initiate a filesystem operation. This is the flag to use in filesystem code when you cannot start another filesystem operation.

GFP_NOIO This allocation might block, but it does not initiate block I/O. This is the flag to use in block layer code when you cannot start more block I/O.

GFP_USER This is a normal allocation and might block. This flag is used to allocate memory for user-space processes.

Table 4. Composition of the Type Flags
Flag Value
GFP_ATOMIC __GFP_HIGH
GFP_NOIO __GFP_WAIT
GFP_NOFS (__GFP_WAIT | __GFP_IO)
GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_DMA __GFP_DMA

Returning Memory

When you are finished accessing the memory allocated via kmalloc(), you must return it to the kernel. This job is done using kfree(), which is the counterpart to user space's free() library call. The prototype for kfree() is:

#include <linux/slab.h>
void kfree(const void *objp);

kfree()'s usage is identical to the user-space variant. Assume p is a pointer to a block of memory obtained via kmalloc(). The following command, then, would free that block and return the memory to the kernel:

kfree(p);

As with free() in user space, calling kfree() on a block of memory that already has been freed or on a pointer that is not an address returned from kmalloc() is a bug, and it can result in memory corruption. Always balance allocations and frees to ensure that kfree() is called exactly once on the correct pointer. Calling kfree() on NULL is checked for explicitly and is safe, although it is not necessarily a sensible idea.

Let's look at the full allocation and freeing cycle:

struct sausage *s;
s = kmalloc(sizeof (struct sausage), GFP_KERNEL);
if (!s)
return -ENOMEM;
/* ... */
kfree(s);

Allocating from Virtual Memory

The kmalloc() function returns physically and therefore virtually contiguous memory. This is a contrast to user space's malloc() function, which returns virtually but not necessarily physically contiguous memory. Physically contiguous memory has two primary benefits. First, many hardware devices cannot address virtual memory. Therefore, in order for them to be able to access a block of memory, the block must exist as a physically contiguous chunk of memory. Second, a physically contiguous block of memory can use a single large page mapping. This minimizes the translation lookaside buffer (TLB) overhead of addressing the memory, as only a single TLB entry is required.

Allocating physically contiguous memory has one downside: it is often hard to find physically contiguous blocks of memory, especially for large allocations. Allocating memory that is only virtually contiguous has a much larger chance of success. If you do not need physically contiguous memory, use vmalloc():


#include <linux/vmalloc.h>
void * vmalloc(unsigned long size);

You then return memory obtained with vmalloc() to the system by using vfree():

#include <linux/vmalloc.h>
void vfree(void *addr);

Here again, the usage of vmalloc() and vfree() is identical to that of user space's malloc() and free() functions:

struct black_bear *p;
p = vmalloc(sizeof (struct black_bear));
if (!p)
/* error */
/* ... */
vfree(p);

In this particular case, vmalloc() might sleep.

Many allocations in the kernel can use vmalloc(), because few allocations need to appear contiguous to hardware devices. If you are allocating memory that only software accesses, such as data associated with a user process, there is no need for the memory to be physically contiguous. Nonetheless, few allocations in the kernel use vmalloc(). Most choose to use kmalloc(), even if it's not needed, partly for historical and partly for performance reasons. Because the TLB overhead for physically contiguous pages is reduced greatly, the performance gains often are well appreciated. Despite this, if you need to allocate tens of megabytes of memory in the kernel, vmalloc() is your best option.
A Small Fixed-Size Stack

Unlike user-space processes, code executing in the kernel has neither a large nor a dynamically growing stack. Instead, each process in the kernel has a small fixed-size stack. The exact size of the stack is architecture-dependent. Most architectures allocate two pages for the stack, so the stack is 8KB on 32-bit machines.

Because of the small stack, allocations that are large, automatic and on-the-stack are discouraged. Indeed, you never should see anything such as this in kernel code:

#define BUF_LEN 2048
void rabbit_function(void)
{
char buf[BUF_LEN];
/* ... */
}

Instead, the following is preferred:

#define BUF_LEN 2048
void rabbit_function(void)
{
char *buf;
buf = kmalloc(BUF_LEN, GFP_KERNEL);
if (!buf)
/* error! */
/* ... */
}

You seldom see the equivalent of this in user space, because there is rarely a reason to perform a dynamic memory allocation when you know the allocation size at the time you write the code. In the kernel, however, you should use dynamic memory any time the allocation size is larger than a handful of bytes or so. This helps prevent stack overflow, which ruins everyone's day.
Conclusion

With a little understanding, getting a hold of memory in the kernel is demystified and not too much more difficult to do than it is in user space. A few simple rules of thumb can go a long way:

Decide whether you can sleep (that is, whether the call to kmalloc() can block). If you are in an interrupt handler, in a bottom half, or if you hold a lock, you cannot. If you are in process context and do not hold a lock, you probably can.

If you can sleep, specify GFP_KERNEL.

If you cannot sleep, specify GFP_ATOMIC.

If you need DMA-capable memory (for example, for an ISA or broken PCI device), specify GFP_DMA.

Always check for and handle a NULL return value from kmalloc().

Do not leak memory; make sure you call kfree() somewhere.

Ensure that you do not race and call kfree() multiple times and that you never access a block of memory after you free it.

Tuesday, May 8, 2007

Kmalloc Internals

Notes:

Understanding this requires a minimal understanding of the Linux kernel. Sometimes I may use the names of functions that are commonly known. If you don't know what they are, a quick Google search or a grep through the include directory should find them for you. All of the cache-related functions I mention can be found in the file /usr/src/linux/mm/slab.c. Some of the numbers I give are only applicable on IA-32 architectures. I've tried to bold function names and data types, and italicize defined constants, but odds are I missed a few. In some of the code shown I have cut out debugging code. I have ignored some of the SMP code because I don't have an SMP system.

Overview:

Any operating system kernel needs a way to manage physical memory. Device drivers and various other parts of the kernel need to be able to allocate chunks of memory to store data. This memory comes from system RAM, and when one speaks of system RAM, it is always addressed in PAGE_SIZE chunks. On some architectures this corresponds to 4096 bytes, but this size varies: for example, Alphas use 8192 bytes for a page. However, when some driver or other kernel function needs to dynamically allocate memory, it does not always need an entire page (or several). Thus, there is a need for some entity that can manage system RAM and dole it out to those who request it in varying sized chunks. Such an allocator should carefully balance two goals: speed, and minimizing the amount of fragmented, unusable memory left over in allocated chunks. The latter is especially important, since any fragmented memory is wasted system RAM that will not be reclaimed as long as the memory chunk it belongs to is in use. This job is performed by the kmalloc() function and its counterpart kfree().

Kmalloc doesn't do much:

I lied a bit when I said that kmalloc() had the job of managing memory. In fact, it is just an interface to the real memory manager, the kmem cache allocator. So, in order to understand kmalloc(), what is really needed is to first understand the cache allocation scheme. That will be the focus of this discussion. After understanding that, you will see that kmalloc() is merely a user of the cache allocator's services.

Cache overview:

The idea behind a cache allocator is quite simple: divide up memory into pools of objects, where each pool contains objects of the same size, and different pools contain objects of other sizes. Then, when someone requests a block of memory of a given size, find the first pool large enough to hold a block of that size and hand one over. And that's about all there is to it. Underneath this frame is a bit more complexity. Some things to consider: how large should each pool be? If we run out of objects in one pool, how do we create more? How do we track which objects in the pool are free or in use? How many pools should we have, and how large should the gaps be between pool sizes? The question of tracking free and in-use objects is especially important to us because, as you may guess, keeping track of free and allocated memory involves some sort of control structures. If these control structures lie in a place that we can overwrite, exploitation may be possible. The mixing of control and data channels is at the root (well, bad code is the real root) of most exploitation techniques.

As said, memory is divided up into pools of objects. One of these pools is referred to as a cache, and there are many of these caches in the kernel. The key data structure behind all of this is the kmem_cache_t data type. It is used to hold all sorts of information about the cache it refers to, such as the size of each object and the number of objects found in each pool. There exists one of these data types for each different cache in the system. Drivers and the like are able to create their own caches of a size they choose. If a driver does this, it will create its cache by calling the kmem_cache_create() function. This function creates a cache of objects whose size is provided by the caller, and then it inserts this cache into the global list of created caches called the cache_cache. The cache_cache is itself a static kmem_cache_t object, and it is used to hold all of the system-wide caches; therefore the size of objects in its pool is sizeof(kmem_cache_t). You can view all of the caches in the system by looking at the file /proc/slabinfo. In addition to this global 'cache of caches', we also have the general caches. The general caches are an array of caches, each sized a power of 2 greater than the last one. On IA-32 these sizes start at 32 bytes and double until 131,072 bytes. For each size there are two caches: a cache for normal memory, and a cache for DMA-capable memory. On IA-32, ISA DMA memory must lie in the first 16MB of physical memory (a 24-bit address). This is how kmalloc() works: when kmalloc() is called, all it does is search through the general caches until it finds a suitably sized cache, then it calls __kmem_cache_alloc() to grab an object from that cache and returns it to the caller. Similarly, kfree() just returns the object to its cache by calling __kmem_cache_free().

Caches and slabs:

The key data structure behind caches is the kmem_cache_t type. It has numerous members, here are some the most important ones:

struct list_head slabs_full;
struct list_head slabs_partial;
struct list_head slabs_free;
The lists of slabs for this cache. These are explained below.

unsigned int objsize;
The size of each object in this cache.

unsigned int flags;
Flags for this cache. Determine certain optimizations, and where the object tracking information is stored.

unsigned int num;
The number of objects stored on each slab.

spinlock_t spinlock;
To protect the structure from concurrent access by CPUs.

unsigned int gfporder;
The base-2 logarithm of the number of pages to allocate for each slab (each slab is 2^gfporder pages).

unsigned int gfpflags;
The flags passed to get_free_pages() when allocating memory, used to specify that DMA memory is needed,
or that the caller does not want to be put to sleep if there is no memory available.

char name[CACHE_NAMELEN];
The name of the cache, will be output in /proc/slabinfo along with usage statistics.

struct list_head next;
When this cache is user created and belongs to the global cache_cache list, this sews it into the list.

void (*ctor)(void *, kmem_cache_t *, unsigned long);
void (*dtor)(void *, kmem_cache_t *, unsigned long);
The constructor/destructor can be passed by the creator of the cache, and will run whenever a slab is created/freed.

There are some other fields used to keep statistics, as well as some SMP specific fields, and a couple we will see later.

In order to organize the objects in each cache, the kernel relies on slabs. A slab hosts a number of objects, and possibly the control information needed to manage the objects stored in the slab. Each cache consists of multiple slabs, and each slab consists of multiple objects. A slab itself is merely a number of pages of physical memory allocated by the get_free_pages() function. In order to organize all of its slabs, a cache manages three separate doubly linked lists of slabs, using the standard kernel list implementation. You saw the heads of these three lists in the above variables. One list is for slabs that have no objects on them, the free list, another list is for slabs that are partially full of objects, the partial list, and the third list is for slabs that are completely full of objects, the full list. Each slab, represented by the slab_t type, contains a list pointer used to sew all the slabs in a list together. That is, each slab allocated for a cache will reside in one of the three lists. There can be many slabs in a list, and they are sewn together via pointers embedded in each slab, with the head of the list in the kmem_cache_t object. Here is a visualization of this:



The large boxes are slabs. Each slab in a list also contains forward and backward pointers to the corresponding slabs in the list; these are not shown.


Slab and object management:

So, as we can see, the slab is where all of the objects are actually stored. In addition to storing objects, the slab is also responsible for managing the objects it contains. The control information consists of two structures, slab_t and kmem_bufctl_t, shown here:

typedef struct slab_s {
struct list_head list; /* linkage into free/full/partial lists */
unsigned long colouroff; /* offset in bytes of objects into first page of slab */
void *s_mem; /* pointer to where the objects start */
unsigned int inuse; /* num of objs allocated in the slab */
kmem_bufctl_t free; /*index of the next free object */
} slab_t;

typedef unsigned int kmem_bufctl_t; /* tracks indexes of free objects, starting at base slab_t->s_mem */

Allocating objects:

When an object is allocated, the lists of slabs are searched in the following order: first the partial list, taking the first slab found; if that list is empty, the free list, again taking the first slab; and if that list is also empty, a new slab is allocated, added to the free list, and used for the object.

The kmem_bufctl_t type is used to track which objects in the slab are free. There is an array of these; its size equals the number of objects stored in the slab, and it is stored directly after the slab_t structure of each slab. Each member of this array initially holds its own index + 1, so index 0 contains 1, index 1 contains 2, and so on. The value stored at each index is itself an object index: multiplied by the object size, it gives the offset from s_mem at which the corresponding object lies. When an object is needed from the slab, this code is run:

#define slab_bufctl(slabp)  ((kmem_bufctl_t *)(((slab_t*)slabp)+1))

void *objp; /* the object that will be returned to caller */
slab_t *slabp = current_slab;
kmem_cache_t *cachep = current_cache;

objp = slabp->s_mem + slabp->free*cachep->objsize;
slabp->free=slab_bufctl(slabp)[slabp->free];

The slabp->free member is initially set to 0, so on the first object allocation you can see that it returns the memory pointed to by slabp->s_mem, which is the first object in the slab. Then slabp->free is updated with the value stored at index slabp->free in the bufctl array. Since the bufctl array is stored directly after the slab_t structure, the slab_bufctl() macro just returns the start of that array.

Freeing objects:

At first this scheme may not make much sense, but once you see how objects are returned to the slab it turns out to be quite clever. Here is how an object is returned:
    void *objp = pointer to current object being freed;
slab_t *slabp = slab object belongs to;
kmem_cache_t *cachep = current_cache;

/* get the index of this object offset from s_mem */
unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;

slab_bufctl(slabp)[objnr] = slabp->free;
slabp->free = objnr;
Since objp points to the object being freed, and slabp->s_mem points to the base address where objects begin in this slab, we subtract and divide by the object size to get the index of this object. The bufctl entry for this object is set to the index that would have been used for the next allocation, and the next index to be used is set to the object just freed. On a fresh slab this lets the bufctl array be traversed sequentially, handing out objects from the lowest address upward; after frees, the most recently freed object is handed out first. By saving our place each time we can move from the start of the bufctl list to the end without any other state. Once we get to the end of the bufctl array, we know the slab is full.

Managing slab lists:

The last index of the kmem_bufctl_t array contains the value BUFCTL_END. When slabp->free reaches this value, we know the slab is full, as illustrated by this code that follows the allocation code shown above:
    /* is this slab full? add it to the full list if so */
if (unlikely(slabp->free == BUFCTL_END)) {
list_del(&slabp->list);
list_add(&slabp->list, &cachep->slabs_full);
}

We check to see if this allocation makes the slab full, and if so, we remove the slab from the partial or free list and add it to the full list.

In addition to the above, we also need to know when a slab moves from full to partially full, from partially full to free, and from free to partially full. The first two cases are handled by the following code, which executes shortly after the code above when freeing an object:
    /* just freed an object, so fixup slab chains. is slab now empty? is slab not full anymore? */
int inuse = slabp->inuse;
if (unlikely(!--slabp->inuse)) {
/* Was partial or full, now empty. */
list_del(&slabp->list);
list_add(&slabp->list, &cachep->slabs_free);
} else if (unlikely(inuse == cachep->num)) {
/* Was full. */
list_del(&slabp->list);
list_add(&slabp->list, &cachep->slabs_partial);
}

The inuse member holds the number of objects currently being used in the slab. When this hits 0 we remove it from the list it was on and add it to the free list. The cachep->num member holds the maximum number of objects a slab can hold, so if the slab was previously full, since we are removing an object we add it to the partially full list.

The third case, slabs moving from free to partially full, is handled whenever we add an object to a free slab. It is just a couple of lines of code, not worth showing; it is found in the kmem_cache_alloc_one() macro if you're interested.


So where is the slab control info stored:

Finally, what we've been waiting for... Well, it is one of two places, and unfortunately neither of them is very friendly for exploitation. If a slab contains objects that are >= (PAGE_SIZE>>3), 512 bytes on IA-32, or if the caller to kmem_cache_create() specifies the CFLGS_OFF_SLAB flag (which the kmalloc() caches DO NOT), then the slab control info will be stored OFF of the slab, in a cache of its own. However, if neither of those situations occur, then the slab control info will be stored BEFORE all of the objects in the slab. The following illustrates that:








So, in the first case the control info is stored off-slab, and we have no angle at all; this covers any kmalloc()'d memory greater than or equal to 512 bytes. In the second case, the slab control is stored BEFORE the objects, and in addition we will never find two slabs allocated one after another, unless by sheer luck from a call to get_free_pages() when allocating the slabs. So, unless we have a buffer underflow, exploitation does not seem possible. Even in the event of an underflow, we would have no idea where our object is located in the slab. Ignoring this though, if we can keep traversing up the slab until we hit the slab_t structure, then some sort of exploitation is possible. We can overwrite the list pointers and point them to a fake slab we create in the overflowed area. Then, when this slab is worked on by one of the list functions, we can write nearly arbitrary values to arbitrary memory, similar to malloc-style exploitation. A problem, though, is that we are destroying the list this slab is chained on, and unless we can work some magic to repair this damage, a system crash would follow soon after exploitation. I didn't pursue this angle much further, because the circumstances needed to bring it about are not very likely. Given the right situation it would be possible, but extremely difficult, to exploit. The other unlikely and nearly impossible to predict chance of exploitation lies in the fact that off-slab control structures are kept in one of the general caches. Through some simple math we can accurately predict which cache contains the control info for all memory allocations greater than or equal to 512 bytes. This control info will be stored in a slab along with other objects, so if we happened to have an object located at a lower memory address on the same slab as one of these control blocks, and that buffer can be overflowed, then we can manipulate the slab_t data.
However, you would have no idea if this situation was actually occurring without tracing every single call to kmalloc() from system startup, so I don't think that's a viable strategy.

If all you wanted to know was whether exploitation is possible, you can stop reading here. If you want an in-depth explanation of how the rest of the cache allocation scheme works, then continue on. The remainder of this paper will explain all of the functions found in mm/slab.c, in order to understand cache management from start to finish. What follows is some short descriptions of functions, and source code with my inlined comments.
Note: If you aren't going to read the rest, you may be interested in viewing this kmalloc performance table

Cache Data Structures:

There are three important global structures involved with caches: the cache_cache, the semaphore that protects concurrent access to the cache chain, and the array of general caches used by kmalloc(). Here they are with some inlined comments:
/* an array of these structures represents the memory available for kmalloc() */
typedef struct cache_sizes {
size_t cs_size; /* the size of the cache objects */
kmem_cache_t *cs_cachep; /* used for regular kmalloc'd memory */
kmem_cache_t *cs_dmacachep; /* used for DMA kmalloc'd memory */
} cache_sizes_t;

/* each of these has a cache for DMA and regular memory */
static cache_sizes_t cache_sizes[] = {
#if PAGE_SIZE == 4096
{ 32, NULL, NULL},
#endif
{ 64, NULL, NULL},
{ 128, NULL, NULL},
{ 256, NULL, NULL},
{ 512, NULL, NULL},
{ 1024, NULL, NULL},
{ 2048, NULL, NULL},
{ 4096, NULL, NULL},
{ 8192, NULL, NULL},
{ 16384, NULL, NULL},
{ 32768, NULL, NULL},
{ 65536, NULL, NULL},
{131072, NULL, NULL},
{ 0, NULL, NULL}
};

/* internal cache of cache description objs
* this holds all of the caches themselves through the next list pointer not shown here */

static kmem_cache_t cache_cache = {
slabs_full: LIST_HEAD_INIT(cache_cache.slabs_full),
slabs_partial: LIST_HEAD_INIT(cache_cache.slabs_partial),
slabs_free: LIST_HEAD_INIT(cache_cache.slabs_free),
objsize: sizeof(kmem_cache_t),
flags: SLAB_NO_REAP,
spinlock: SPIN_LOCK_UNLOCKED,
colour_off: L1_CACHE_BYTES, /* 32 */
name: "kmem_cache",
};

/* Guard access to the cache-chain. */
static struct semaphore cache_chain_sem;

Startup, Cache Initialization

On system startup, two things need to be done. The cache_cache needs to be initialized by determining the number of objects per slab, and each of the caches in the array of general caches needs to be created. The first task is accomplished in the kmem_cache_init() function, and the second via kmem_cache_sizes_init(). Setting up the number of objects that will fit in a slab is accomplished with the kmem_cache_estimate() function, shown below with comments. The general cache initialization just calls kmem_cache_create() to create each of the different sized general caches and is not worth showing.
/* Calc the num objs, wastage, and bytes left over for a given slab size.
* @param gfporder - slab size in 2^n form
* @param size - the size of a single cache object
* @param flags - flags for allocation
* @param left_over - how many bytes will be wasted in slab
* @param num - how many objects will fit in the slab */

static void kmem_cache_estimate (unsigned long gfporder, size_t size,
int flags, size_t *left_over, unsigned int *num)

{
int i;
size_t wastage = PAGE_SIZE << gfporder; /* total size being asked for */
size_t extra = 0;
size_t base = 0;

/* if we're not storing control info off the slab, add size of control
* structs to the size */

if (!(flags & CFLGS_OFF_SLAB)) {
base = sizeof(slab_t);
extra = sizeof(kmem_bufctl_t);
}
/* calc the number of objects that will fit inside the slab, including the
* base slab_t and a kmem_bufclt_t for each as well */

i = 0;
while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
i++;
if (i > 0)
i--;

/* this limit is absurdly huge... 0xfffffffe */
if (i > SLAB_LIMIT)
i = SLAB_LIMIT;

/* return number objects that fit, and total space wasted */
*num = i;
wastage -= i*size;
wastage -= L1_CACHE_ALIGN(base+i*extra);
*left_over = wastage;
}

This function is used to determine how many objects will fit on slabs of a given order, and how much space will be wasted at the end of the slab. The caller passes in the gfporder parameter, the object size, and cache flags. The pointers it passes are filled in by the function. When this is called by kmem_cache_init() it is only to determine how many objects fit in each slab. However, we'll see that when this function is called later on by the kmem_cache_create() function it is called in a loop that tries to minimize wastage.


Getting/Freeing Physical Memory:

As mentioned before, internally the cache allocator relies on the get_free_pages() function to allocate pages of memory for its slabs. Each slab consists of some 2^n number of physical pages. It is that simple, and not worth showing. The two functions kmem_getpages() and kmem_freepages() are used to allocate pages for a new slab, or free pages for a slab being destroyed.

Slab Creation and Destruction:

After a slab has been created, it must be initialized. Two functions do this, kmem_cache_slabmgmt() and kmem_cache_init_objs(). The first one aligns the slab_t pointer on a proper boundary if the control is on-slab, or if it is off-slab it allocates an area for it using kmem_cache_alloc(). This area will be one of the general areas used for kmalloc(). It also initializes the s_mem member to point to the base where objects start. The second function has two roles: calling the constructor for each object on the slab if the user provided one at cache creation time, and also setting the kmem_bufctl_t array. This second part merely initializes each index as described above, to its index + 1, and the final index to BUFCTL_END.

Destroying a slab is equally simple. If a destructor was specified, then we call it for each object in the slab. Then we free the pages used by the slab with kmem_freepages(). If the slab control data was stored off-slab, we free it from its cache.

Cache Creation and Destruction:

Now that we have a good understanding of the lower levels of cache management, we can discuss the upper layers. There are two functions responsible for creating/removing caches from the kernel, kmem_cache_create() and kmem_cache_destroy(); I'll leave it up to you to decide which does what. Rather than describe what they do, I'll show the code for each with my comments alongside the developers'. Also, note that when a cache is first created it has NO slabs in it; upon the first allocation of an object a slab will be created by the kmem_cache_grow() function we'll see later.
/**
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
* @size: The size of objects to be created in this cache.
* @offset: The offset to use within the page.
* @flags: SLAB flags
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
*
* Returns a ptr to the cache on success, NULL on failure.
* Cannot be called within a int, but can be interrupted.
* The @ctor is run when new pages are allocated by the cache
* and the @dtor is run before the pages are handed back.
* The flags are
*
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
* to catch references to uninitialised memory.
*
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
* for buffer overruns.
*
* %SLAB_NO_REAP - Don't automatically reap this cache when we're under
* memory pressure.
*
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
* cacheline. This can be beneficial if you're counting cycles as closely
* as davem.
*/

kmem_cache_t *
kmem_cache_create (const char *name, size_t size, size_t offset,
unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))

{
const char *func_nm = KERN_ERR "kmem_create: ";
size_t left_over, align, slab_size;
kmem_cache_t *cachep = NULL;

/*
* Sanity checks... these are all serious usage bugs.
*/

if ((!name) ||
((strlen(name) >= CACHE_NAMELEN - 1)) ||
in_interrupt() ||
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor) ||
(offset < 0 || offset > size))
BUG();

/*
* Always checks flags, a caller might be expecting debug
* support which isn't available. we check to make sure the caller hasn't specified
* any flags that are not part of the allowed mask.
*/

BUG_ON(flags & ~CREATE_MASK);

/* Get cache's description obj. each cache that the user creates is part of the master cache_cache. this cache
* holds kmem_cache_t objects */

cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
if (!cachep)
goto opps;
memset(cachep, 0, sizeof(kmem_cache_t));

/* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
*/

if (size & (BYTES_PER_WORD-1)) {
size += (BYTES_PER_WORD-1);
size &= ~(BYTES_PER_WORD-1);
printk("%sForcing size word alignment - %s\n", func_nm, name);
}

/* how should it be aligned, sizeof(void *) or L1CS */
align = BYTES_PER_WORD;
if (flags & SLAB_HWCACHE_ALIGN)
align = L1_CACHE_BYTES;

/* Determine if the slab management is 'on' or 'off' slab. if size is large,
* place management in it's own cache. this means we can't exploit here */

if (size >= (PAGE_SIZE>>3))
flags |= CFLGS_OFF_SLAB;

if (flags & SLAB_HWCACHE_ALIGN) {
/* Need to adjust size so that objs are cache aligned. */
/* Small obj size, can get at least two per cache line. */
/* FIXME: only power of 2 supported, was better */
while (size < align/2)
align /= 2;
size = (size+align-1)&(~(align-1));
}

/* Calc size (in pages) of slabs, and the num of objs per slab.
* This could be made much more intelligent. For now, try to avoid
* using high page-orders for slabs. When the gfp() funcs are more
* friendly towards high-order requests, this should be changed.
* order starts out at 0 and increments.
* this loop is where we really use the estimate function that was shown above. we go around the loop trying to
* find an order of pages that doesn't waste a lot of slab space. each time around the loop we increment the order.
* remember that order is the base 2 log(2^n) of how many pages for the slab. when we find an acceptable amount
* of space left over, or when we max out at order of 5, we break out of the loop. the estimate function has filled
* in the number of objects per slab for us.
*/

do {
unsigned int break_flag = 0;
cal_wastage:
kmem_cache_estimate(cachep->gfporder, size, flags,
&left_over, &cachep->num);
if (break_flag)
break;
if (cachep->gfporder >= MAX_GFP_ORDER)/* order == 5, 32 pages */
break;
if (!cachep->num) /* we couldn't fit any objects on slab */
goto next;

/* obey limit on max # objects for off_slab caches */
if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit) {
/* Oops, this num of objs will cause problems. */
cachep->gfporder--;
break_flag++;
goto cal_wastage;
}

/*
* Large num of objs is good, but v. large slabs are currently
* bad for the gfp()s... this is only 2 pages per slab
*/

if (cachep->gfporder >= slab_break_gfp_order)
break;

/* WTF?? doesn't this seem huge? guess it's the best you can do.. */
if ((left_over*8) <= (PAGE_SIZE<<cachep->gfporder))
break; /* Acceptable internal fragmentation. */
next:
cachep->gfporder++;
} while (1);

/* couldn't fit any objects into this size slab; they must be huge */
if (!cachep->num) {
printk("kmem_cache_create: couldn't create cache %s.\n", name);
kmem_cache_free(&cache_cache, cachep);
cachep = NULL;
goto opps;
}
/* total size of slab control objects aligned for L1, includes:
* bufctl_t for each object, and slab_t struct */

slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t)+sizeof(slab_t));

/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
* colouring involves offsetting the slab_t structure into the slab area to maximize cache alignment
*/

if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
flags &= ~CFLGS_OFF_SLAB;
left_over -= slab_size;
}

/* Offset must be a multiple of the alignment. */
offset += (align-1);
offset &= ~(align-1);
if (!offset)
offset = L1_CACHE_BYTES;
cachep->colour_off = offset;
cachep->colour = left_over/offset;

/* init remaining fields */
/* for single page slabs we do some optimizing for lookup, this isn't
* currently in use in this code */

if (!cachep->gfporder && !(flags & CFLGS_OFF_SLAB))
flags |= CFLGS_OPTIMIZE;

cachep->flags = flags;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
spin_lock_init(&cachep->spinlock);
cachep->objsize = size;
INIT_LIST_HEAD(&cachep->slabs_full);
INIT_LIST_HEAD(&cachep->slabs_partial);
INIT_LIST_HEAD(&cachep->slabs_free);

/* off slab control structs use regular cache bins. the find_general function just picks out
* one of the general caches used by kmalloc() that is large enough to hold our info */

if (flags & CFLGS_OFF_SLAB)
cachep->slabp_cache = kmem_find_general_cachep(slab_size,0);
cachep->ctor = ctor;
cachep->dtor = dtor;
/* Copy name over so we don't have problems with unloaded modules */
strcpy(cachep->name, name);

#ifdef CONFIG_SMP
if (g_cpucache_up)
enable_cpucache(cachep);
#endif
/* make sure this cache doesn't already exist in master cache, if it does == trouble */
down(&cache_chain_sem);
{
struct list_head *p;

list_for_each(p, &cache_chain) {
kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);

/* The name field is constant - no lock needed. */
if (!strcmp(pc->name, name))
BUG();
}
}

/* add our cache into master list */
list_add(&cachep->next, &cache_chain);
up(&cache_chain_sem);
opps:
return cachep;
}

/**
* kmem_cache_destroy - delete a cache
* @cachep: the cache to destroy
*
* Remove a kmem_cache_t object from the slab cache.
* Returns 0 on success.
*
* It is expected this function will be called by a module when it is
* unloaded. This will remove the cache completely, and avoid a duplicate
* cache being allocated each time a module is loaded and unloaded, if the
* module doesn't have persistent in-kernel storage across loads and unloads.
*
* The caller must guarantee that noone will allocate memory from the cache
* during the kmem_cache_destroy().
*/

int kmem_cache_destroy (kmem_cache_t * cachep)
{
if (!cachep || in_interrupt() || cachep->growing)
BUG();

/* Find the cache in the chain of caches, and remove it from the list w/ semaphore held. */
down(&cache_chain_sem);
/* clock_searchp is used in a different function that frees up extra cache memory.
* it points to the last cache that was freed up. so if was pointing to
* the cache being destroyed, we just move it on to the next cache in chain. */

if (clock_searchp == cachep)
clock_searchp = list_entry(cachep->next.next,
kmem_cache_t, next);
list_del(&cachep->next);
up(&cache_chain_sem);

/* if there are still some slabs left in cache then add cache back to list
* and return to caller. the shrink function as we'll see goes through the free slab list
* and destroys all of the slabs. it returns non-zero if there are any slabs left in either the
* full or partial list. in that case, that means the cache is still in use and we can't destroy it */

if (__kmem_cache_shrink(cachep)) {
printk(KERN_ERR "kmem_cache_destroy: Can't free all objects %p\n",
cachep);
down(&cache_chain_sem);
list_add(&cachep->next,&cache_chain);
up(&cache_chain_sem);
return 1;
}
#ifdef CONFIG_SMP
{
int i;
for (i = 0; i < NR_CPUS; i++)
kfree(cachep->cpudata[i]);
}
#endif
/* remove this cache from the master cache */
kmem_cache_free(&cache_cache, cachep);

return 0;
}

Cache Allocation and Deallocation:

Once a cache is created we can allocate and free objects from it. This is handled by a couple of wrapper functions, kmem_cache_alloc() and kmem_cache_free(), and a number of functions they call internally. We've already seen part of the code that runs above, and after you read the rest of this document it will be simple for you to look at these functions yourself. To give a brief overview:

Allocation:

The kmem_cache_alloc_one() macro is what gets called inside __kmem_cache_alloc(), which is called by kmem_cache_alloc(). Inside the macro, the following takes place: we check for a slab on the partial list; if there is none, we check the free list; if that is also empty, we create a new slab with the kmem_cache_grow() function we'll look at next. After this happens, we get the object with the kmem_cache_alloc_one_tail() function. We already saw all of the code for this function when discussing allocating objects.

Freeing:

After disabling local interrupts, __kmem_cache_free() is called. On UP (uniprocessor), it then calls kmem_cache_free_one(). We looked at nearly all of the code for this function when discussing freeing objects. The one part we didn't see yet will be clear after reading the rest of this paper; after doing so you can go back and look at the rest.

Important Note: How do we determine what slab an object belongs on?

You may have been wondering this already, and here is how it happens. Recall that a slab is merely a chunk of contiguous pages. Every page in the system is represented by a struct page. This structure has a number of fields in it, such as a reference count and a number of flags that determine what the page is being used for, among other things. One of the members in this structure is a struct list_head, and it can be used by whoever owns the page for whatever they need it for. If you are not familiar with the kernel list implementation, there is plenty of documentation on it, and the file include/linux/list.h is nearly self-explanatory. Here is the struct list_head type:

struct list_head {
struct list_head *next, *prev;
};

It gets embedded into any structure that is part of a list. However, in this case we don't use it for a list. What the slab creation code does is use those two pointers to store pointers to the kmem_cache_t and the slab_t that this slab belongs to. It does so using these macros:
/* Macros for storing/retrieving the cachep and or slab from the
* global 'mem_map'. These are used to find the slab an obj belongs to.
* With kfree(), these are used to find the cache which an obj belongs to.
* all take a struct page * as an argument, and a pointer to a cache_t or a
* pointer to a slab_t
*/

#define SET_PAGE_CACHE(pg,x) ((pg)->list.next = (struct list_head *)(x))
#define GET_PAGE_CACHE(pg) ((kmem_cache_t *)(pg)->list.next)
#define SET_PAGE_SLAB(pg,x) ((pg)->list.prev = (struct list_head *)(x))
#define GET_PAGE_SLAB(pg) ((slab_t *)(pg)->list.prev)
These macros are used in functions we see next. For EVERY page in the slab, this information gets stored. Then, when we have a pointer to an object we can get a pointer to its associated struct page using the virt_to_page() function, which takes a kernel virtual address and returns a pointer to its corresponding page structure.

Cache Growing and Shrinking:

When a cache is first created, it has no slabs in it. When a cache has filled up all of its free and partial slabs, it has no room left for more objects. When either of these events occurs, the cache needs to allocate more slabs; this is handled by kmem_cache_grow(). When a cache is being destroyed, we need to destroy all of its free slabs; this is done by the kmem_cache_shrink() function. The grow function adds a slab to the cache's free list. The shrink function removes all slabs from the free list and disposes of their memory. Here are both of those functions with comments.
/*
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
* @param cachep - the cache to grow
* @param flags - slab flags
* @return 1 success, 0 fail - gotta love those consistent return vals
*/

static int kmem_cache_grow (kmem_cache_t * cachep, int flags)
{
slab_t *slabp;
struct page *page;
void *objp;
size_t offset;
unsigned int i, local_flags;
unsigned long ctor_flags;
unsigned long save_flags;

/* Be lazy and only check for valid flags here,
* keeping it out of the critical path in kmem_cache_alloc().
*/

if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
BUG();
/* some caches are not allowed to grow, this checks for that */
if (flags & SLAB_NO_GROW)
return 0;

/*
* The test for missing atomic flag is performed here, rather than
* the more obvious place, simply to reduce the critical path length
* in kmem_cache_alloc(). If a caller is seriously mis-behaving they
* will eventually be caught here (where it matters).
* if you're in an interrupt state, you cannot sleep. by passing the SLAB_ATOMIC flag
* you are saying "don't put me to sleep". so if this isn't provided and we're in interrupt, BUG
*/

if (in_interrupt() && (flags & SLAB_LEVEL_MASK) != SLAB_ATOMIC)
BUG();

/* remember that cache creators can pass a constructor and destructor that will be called
* at slab creation/deletion. These functions get passed a mask of flags, some people use the
* same function for constr/destr, so we pass them a flag letting them know it's constr. we also
* pass them the atomic flag if necessary, so they know not to sleep
*/

ctor_flags = SLAB_CTOR_CONSTRUCTOR;
local_flags = (flags & SLAB_LEVEL_MASK);
if (local_flags == SLAB_ATOMIC)
/*
* Not allowed to sleep. Need to tell a constructor about
* this - it might need to know...
*/

ctor_flags |= SLAB_CTOR_ATOMIC;

/* About to mess with non-constant members - lock. */
spin_lock_irqsave(&cachep->spinlock, save_flags);

/* Get colour for the slab, and cal the next value. this is used to align the slab_t.
* i'd be lying if i said i knew exactly what it means
*/

offset = cachep->colour_next;
cachep->colour_next++;
if (cachep->colour_next >= cachep->colour)
cachep->colour_next = 0;
offset *= cachep->colour_off;

/* let others know this cache has recently grown, we'll see this later in the 'reaper' code */
cachep->dflags |= DFLGS_GROWN;

/* lock the slab from being reaped or shrunk. we set the growing flag, and other functions test for
* it to make sure that we dont ever try to shrink or destroy a cache in the midst of growing */

cachep->growing++;
spin_unlock_irqrestore(&cachep->spinlock, save_flags);

/* A series of memory allocations for a new slab.
* Neither the cache-chain semaphore, or cache-lock, are
* held, but the incrementing c_growing prevents this
* cache from being reaped or shrunk.
* Note: The cache could be selected in for reaping in
* kmem_cache_reap(), but when the final test is made the
* growing value will be seen.
*/


/* Get mem for the objs. */
if (!(objp = kmem_getpages(cachep, flags)))
goto failed;

/* Get slab management. We talked about this func earlier, it just sets up slab_t structure
* and returns us a pointer to it. the struct may be in the slab or in a cache -in which case
* this fn. would have allocated an object for it via kmem_cache_create() */

if (!(slabp = kmem_cache_slabmgmt(cachep, objp, offset, local_flags)))
goto opps1;

/* Nasty!!!!!! I hope this is OK. well all the pages are contiguous right?
* set the slab bit for each page. use each pages list pointers to point to
* the cache and slab so we can easily get them when freeing. here we see the
* macros that were shown above. */

i = 1 << cachep->gfporder; /* get total number of pages */
page = virt_to_page(objp); /* get the struct page for the base address */

/* pages will be laid out sequentially, so we go thru all of them and setup cache/slab pointers
* so that we can easily get them when freeing an object. also set the bit in the flags for the page
* that means this page is being used as part of a slab. */

do {
SET_PAGE_CACHE(page, cachep);
SET_PAGE_SLAB(page, slabp);
PageSetSlab(page);
page++;
} while (--i);

/* init this slabs objects. call constructors, and set up kmem_bufctl_t array */
kmem_cache_init_objs(cachep, slabp, ctor_flags);

/* clear growing flag, add the slab to free list -under lock protection */
spin_lock_irqsave(&cachep->spinlock, save_flags);
cachep->growing--;

/* Make slab active: add it to the free list. The failures counter isn't used elsewhere; reserved for future use? */
list_add_tail(&slabp->list, &cachep->slabs_free);
STATS_INC_GROWN(cachep);
cachep->failures = 0;

spin_unlock_irqrestore(&cachep->spinlock, save_flags);
return 1;
opps1:
kmem_freepages(cachep, objp);
failed:
spin_lock_irqsave(&cachep->spinlock, save_flags);
cachep->growing--;
spin_unlock_irqrestore(&cachep->spinlock, save_flags);
return 0;
}

There are a couple of wrappers around this next function that I won't show. The wrappers just take the cache's spinlock to prevent concurrent access while it is being modified, and return a meaningful value to the caller depending on which wrapper is called. View them yourself.
/*
* Called with the &cachep->spinlock held, leaves with it held
* this frees any slabs that are in the free list
* @param cachep - the cache to shrink
* @return - # slabs freed
*/

static int __kmem_cache_shrink_locked(kmem_cache_t *cachep)
{
slab_t *slabp;
int ret = 0;

/* all we do here is iterate through the cache's free list and destroy all of the
 * slabs that are on it.
 * don't try to shrink if it is in the midst of growing */

while (!cachep->growing) {
struct list_head *p;

/* if it is empty then stop here. when the prev/next pointers are pointing to the
 * list head itself, that means that the list is empty */

p = cachep->slabs_free.prev;
if (p == &cachep->slabs_free)
break;

/* remove this slab from the list */
slabp = list_entry(cachep->slabs_free.prev, slab_t, list);

list_del(&slabp->list);

/* free its memory. we call conditional_schedule() so that if someone else was waiting for
* pages to be freed, they'll hopefully get a chance to run and grab what we just freed. note
* that we first need to drop the spinlock before sleeping in schedule()
*/

spin_unlock_irq(&cachep->spinlock);
kmem_slab_destroy(cachep, slabp);
conditional_schedule(); /* Can take 30 milliseconds */
ret++;
spin_lock_irq(&cachep->spinlock);
}
return ret;
}

Cache Reaping:

When the page allocator is running low on free pages, it needs a way to reclaim as many pages as it can. One of the ways it does this is 'cache reaping': we try to find a cache with a significant number of pages on its free list, and if we find one, we free all of those pages. This is performed by the kmem_cache_reap() function. It traverses all of the caches via the cache_cache linked list, looking for a cache that can give back at least 10 pages. If it doesn't find one with 10 or more free pages, it takes the best it could get. Once it has made its choice, it destroys half of the chosen cache's free slabs. This function is not called from the cache management code; it is called by do_try_to_free_pages() or __alloc_pages().

/**
* kmem_cache_reap - Reclaim memory from caches.
* @gfp_mask: the type of memory required.
* @return number of pages freed
*
* Called from do_try_to_free_pages() and __alloc_pages()
*/

int kmem_cache_reap (int gfp_mask)
{
slab_t *slabp;
kmem_cache_t *searchp;
kmem_cache_t *best_cachep;
unsigned int best_pages;
unsigned int best_len;
unsigned int scan;
int ret = 0;

/* we're traversing the cache_cache chain, we need to protect from concurrent access */
/* can we sleep or not? lock sem accordingly */
if (gfp_mask & __GFP_WAIT)
down(&cache_chain_sem);
else
if (down_trylock(&cache_chain_sem))
return 0;

/* these are used to maintain our search state */
scan = REAP_SCANLEN; /* traverse at most 10 caches*/
best_len = 0; /* saves the greatest number of free slabs */
best_pages = 0; /* saves the greatest number of free pages */
best_cachep = NULL; /* pointer to the cache with best stats */
searchp = clock_searchp; /* starts at cache_cache.next, each time this function is called it starts where it left off last time */

/* search thru at most 10 caches, trying to free slabs from them */
do {
unsigned int pages;
struct list_head* p;
unsigned int full_free;

/* It's safe to test this without holding the cache-lock. some caches can't be reaped from*/
if (searchp->flags & SLAB_NO_REAP)
goto next;

/* we don't try and free from a cache that is in the midst of growing in kmem_cache_grow() */
spin_lock_irq(&searchp->spinlock);
if (searchp->growing)
goto next_unlock;

/* if it has recently grown, clear the flag but don't reap from it. next time around it's fair game though */
if (searchp->dflags & DFLGS_GROWN) {
searchp->dflags &= ~DFLGS_GROWN;
goto next_unlock;
}
#ifdef CONFIG_SMP
{
cpucache_t *cc = cc_data(searchp);
if (cc && cc->avail) {
__free_block(searchp, cc_entry(cc), cc->avail);
cc->avail = 0;
}
}
#endif

/* ok got a cache, count how many free slabs it has by traversing free list */
full_free = 0;
p = searchp->slabs_free.next;
while (p != &searchp->slabs_free) {
slabp = list_entry(p, slab_t, list);

full_free++;
p = p->next;
}

/*
* Try to avoid slabs with constructors and/or
* more than one page per slab (as it can be difficult
* to get high orders from gfp()).
*/

pages = full_free * (1 << searchp->gfporder); /* pages = num slabs * pages per slab */

/* each of these tests scales the # of pages down by 20% to make the cache look
 * less attractive (the +1 keeps a tiny cache from rounding down to nothing). */

if (searchp->ctor)
pages = (pages*4+1)/5;
if (searchp->gfporder)
pages = (pages*4+1)/5;

/* update best variables. if we have enough pages (10), then this is our victim, else keep looping round */
if (pages > best_pages) {
best_cachep = searchp;
best_len = full_free;
best_pages = pages;
/* ok, this cache is our victim, get a pointer to it and break out of the loop */
if (pages >= REAP_PERFECT/* 10 */) {
clock_searchp = list_entry(searchp->next.next,
kmem_cache_t,next);
goto perfect;
}
}
next_unlock:
spin_unlock_irq(&searchp->spinlock);
next:
searchp = list_entry(searchp->next.next,kmem_cache_t,next);
} while (--scan && searchp != clock_searchp);

/* if we're here we didn't find a perfect match, but may have something */
clock_searchp = searchp;

if (!best_cachep)
/* couldn't find anything to reap */
goto out;

spin_lock_irq(&best_cachep->spinlock);

perfect: /* when we jump here we already have the spinlock locked */
/* free only 50% of the free slabs */
best_len = (best_len + 1)/2;

/* traverse free slab list, pull off slabs, and free them */
for (scan = 0; scan < best_len; scan++) {
struct list_head *p;

/* is someone trying to grow at same time?? */
if (best_cachep->growing)
break;
p = best_cachep->slabs_free.prev;

/* when we hit here the list is now empty */
if (p == &best_cachep->slabs_free)
break;
slabp = list_entry(p,slab_t,list);

list_del(&slabp->list);
STATS_INC_REAPED(best_cachep);

/* Safe to drop the lock. The slab is no longer linked to the
 * cache. we call conditional_schedule() here so that whoever is waiting on the pages can grab them
 */

spin_unlock_irq(&best_cachep->spinlock);
kmem_slab_destroy(best_cachep, slabp);
conditional_schedule(); /* try_to_free_pages() */
spin_lock_irq(&best_cachep->spinlock);
}

/* unlock the cache, and return the total number of pages freed */
spin_unlock_irq(&best_cachep->spinlock);
ret = scan * (1 << best_cachep->gfporder);
out:
up(&cache_chain_sem);
return ret;
}

Kmalloc() and Kfree():

You're probably saying to yourself, "wait a minute... wasn't the topic of this article kmalloc()? So why haven't we seen the code for it?" Well, I could show it here, but it's absurdly simple now that you know the rest. You're encouraged to go see it for yourself, all 10 lines of it.