Wednesday, July 25, 2007

Configuration options on 2.6.x kernels

On the 2.6.x kernels there are four main frontend programs: config, menuconfig, and xconfig.

config is the least user-friendly option as it merely presents a series of questions that must be answered sequentially. Alas, if an error is made you must begin the process from the top. Pressing Enter will accept the default entry which is in upper case.

oldconfig will read the defaults from an existing .config and rewrite necessary links and files. Use this option if you've made minor changes to source files or need to script the rebuild process.

menuconfig is an ncurses based frontend. Your system must have the ncurses-devel libraries installed in order to use this utility. As the help text at the top of the screen indicates, use the arrow keys to navigate the menu. Press Enter to select sub-menus. Press the highlighted letter on each option to jump directly to that option. To build an option directly into the kernel, press Y. To disable an option entirely, press N. To build an option as a loadable module, press M. You can also access content-specific help screens by pressing ? on each page or selecting HELP from the lower menu. Figure 1 in the Section called Configuring the 2.4.x kernels shows an example screen from the 2.4.x kernel series.

xconfig is a graphical frontend using qconf by Roman Zippel. It requires the qt and X libraries to build and use. The interface is intuitive and customizable. Online help is automatically shown for each kernel configuration option. It also can show dependency information for each module which can help diagnose build errors. Figure 3 shows an example of the xconfig screen. From the online help:


For each option, a blank box indicates the feature is disabled, a check
indicates it is enabled, and a dot indicates that it is to be compiled
as a module. Clicking on the box will cycle through the three states.
If you do not see an option (e.g., a device driver) that you believe
should be present, try turning on Show All Options under the Options menu.
Although there is no cross reference yet to help you figure out what other
options must be enabled to support the option you are interested in, you can
still view the help of a grayed-out option.

--qconf help

Figure 3. make xconfig

Once you have decided which configuration option to use, start the process with make followed by either config, menuconfig, or xconfig. For example:

$ make menuconfig


The system will take a few moments to build the configuration utility. Next you will be presented with the configuration menus. Though similar to the 2.4.x series, the 2.6.x menu is more logically organized with better grouping of sub-modules. Following are some of the top level configuration options in the 2.6 kernel.

Code Maturity Level Options

This option allows configuration of alpha-quality software or obsoleted drivers. It is best to disable this option if the kernel is intended for a stable production system. If you require an experimental feature in the kernel, such as a driver for new hardware, then enable this option but be aware that it "may not meet the normal level of reliability" as more rigorously tested code.
General Setup

This section contains options for sysctl support, a feature allowing run-time configuration of the kernel. A new feature, kernel .config support, allows the complete kernel configuration to be viewed during run-time. This addresses many requests to be able to see what features were compiled into the kernel.
Loadable Module Support

You will almost certainly want to enable module support. If you will need third-party kernel modules you will also need to enable Set Version Information on All Module Symbols.
Processor Type and Features

This is perhaps the most important configuration choice. we determin our processor type by examining /proc/cpuinfo and we use that information here to select the appropriate processor. Included in this submenu are features such as Preemptible Kernel which can improve desktop responsiveness, Symmetric Multi-processing Support for machines with multiple CPUs, and High Memory Support for machines with more than 1G of RAM.
Power Management Options

Found here are options for ACPI and CPU Frequency Scaling which can dramatically improve laptop power usage. Read the Documentation/power file for more information.
Bus Options ( PCI, PCMCIA, EISA, MCA, ISA)

Here are found options for all system bus devices. On modern machines the ISA and MCA support can often be disabled.
Executable File Formats

Interesting features here include the kernel support for miscellaneous binaries, which can allow seamless operation of non-Linux binaries with a little userland help.
Device Drivers

All the device-driver configuration options that were previously scattered throughout the 2.4.x menus are now neatly organized under this option. Features such as SCSI support, graphic card optimizations, sound, USB and other hardware are configured here.
File Systems

Found here are options for filesystems which are supported by the kernel, such as EXT2 and ReiserFS. It is best to build support for the root filesystems directly into the kernel rather than as a module.
Security Options

Interesting options here include support for NSA Security Enhanced Linux and other, somewhat experimental, features to increase security.

Tuesday, July 24, 2007

Dynamic Kernels: Modularized Device Drivers

Although this is very old article, gives u lot of knowledge about the device drivers.
Some of the things mentioned here changed by this day.

By Alessandro Rubini on Fri, 1996-03-01 02:00. SysAdmin
This is the first in a series of four articles co-authored by Alessandro Rubini and Georg Zezchwitz which present a practical approach to writing Linux device drivers as kernel loadable modules. This installment presents and introduction to thte topic, preparing the reader to understand next month's installment.
Kernel modules are a great feature of recent Linux kernels. Although most users feel modules are only a way to free some memory by keeping the floppy driver out of the kernel most of the time, the real benefit of using modules is support for adding additional devices without patching the kernel source. In the next few Kernel Korners Georg Zezschwitz and I will try to introduce the ``art'' of writing a powerful module---while avoiding common design errors.
What is a device?
A device driver is the lowest level of the software that runs on a computer, as it is directly bound to the hardware features of the device.
The concept of ``device driver'' is quite abstract, actually, and the kernel can be considered like it was a big device driver for a device called ``computer''. Usually, however, you don't consider your computer a monolithic entity, but rather a CPU equipped with peripherals. The kernel can thus be considered as an application running on top of the device drivers: each driver manages a single part of the computer, while the kernel-proper builds process scheduling and file-system access on top of the available devices.
See Figure 1.
A few mandatory drivers are ``hardwired'' in the kernel, such as the processor driver and the the memory driver; the others are optional, and the computer is usable both with and without them---although a kernel with neither the console driver nor the network driver is pointless for a conventional user.
The description above is somehow a simplistic one, and slightly philosophical too. Real drivers interact in a complex way and a clean distinction among them is sometimes difficult to achieve.
In the Unix world things like the network driver and a few other complex drivers belong to the kernel, and the name of device driver is reserved to the low-level software interface to devices belonging to the following three groups:
character devices
Those which can be considered files, in that they can be read-from and/or written-to. The console (i.e. monitor and keyboard) and the serial/parallel ports are examples of character devices. Files like /dev/tty0 and /dev/cua0 provide user access to the device. A char device usually can only be accessed sequentially.
block devices
Historically: devices which can be read and written only in multiples of the block-size, often 512 or 1024 bytes. These are devices on which you can mount a filesystem, most notably disks. Files like /dev/hda1 provide access to the devices. Blocks of block devices are cached by the buffer cache. Unix provides uncached character devices corresponding to block devices, but Linux does not.
network interfaces
Network interfaces don't fall in the device-file abstraction. Network interfaces are identified by means of a name (such as eth0 or plip1) but they are not mapped to the filesystem. It would be theoretically possible, but it is impractical from a programming and performance standpoint; a network interface can only transfer packets, and the file abstraction does not efficiently manage structured data like packets.

The description above is rather sketchy, and each flavour of Unix differs in some details about what is a block device. It doesn't make too much a difference, actually, because the distinction is only relevant inside the kernel, and we aren't going to talk about block drivers in detail.
What is missing in the previous representation is that the kernel also acts as a library for device drivers; drivers request services from the kernel. Your module will be able to call functions to perform memory allocation, filesystem access, and so on.
As far as loadable modules are concerned, any of the three driver-types can be constructed as a module. You can also build modules to implement filesystems, but this is outside of our scope.
These columns will concentrate on character device drivers, because special (or home-built) hardware fits the character device abstraction most of the time. There are only a few differences between the three types, and so to avoid confusion, we'll only cover the most common type.
You can find an introduction to block drivers in issues 9, 10, and 11 of Linux Journal, as well as in the Linux Kernel Hackers' Guide. Although both are slightly outdated, taken together with these columns, they should give you enough information to get started.
What is a module?
A module is a code segment which registers itself with the kernel as a device driver, is called by the kernel in order to communicate with the device, and in turn invokes other kernel functions to accomplish its tasks. Modules utilize a clean interface between the ``kernel proper'' and the device, which both makes the modules easy to write and keeps the kernel source code from being cluttered.
The module must be compiled to object code (no linking; leave the compiled code in .o files), and then loaded into the running kernel with insmod. The insmod program is a runtime linker, which resolves any undefined symbols in the module to addresses in the running kernel by means of the kernel symbol table.
This means that you can write a module much like a conventional C-language program, and you can call functions you don't define, in the same way you usually call printf() and fopen() in your application. However, you can count only on a minimal set of external functions, which are the public functions provided by the kernel. insmod will put the right kernel-space addresses in your compiled module wherever your code calls a kernel function, then insert the module into the running Linux kernel.
If you are in doubt whether a kernel-function is public or not, you can look for its name either in the source file /usr/src/linux/kernel/ksyms.c or in the run-time table /proc/ksyms.
To use make to compile your module, you'll need a Makefile as simple as the following one:

TARGET = myname

ifdef DEBUG
# -O is needed, because of "extern inline"
# Add -g if your gdp is patched and can use it
CFLAGS = -O -DDEBUG_$(TARGET) -D__KERNEL__ -Wall
else
CFLAGS = -O3 -D__KERNEL__ -fomit-frame-pointer
endif

all: $(TARGET).o

As you see, no special rule is needed to build a module, only the correct value for CFLAGS. I recommend that you include debugging support in your code, because without patches, gdb isn't able to take advantage of the symbol information provided by the -g flag to a module while it is part of the running kernel.
Debugging support will usually mean extra code to print messages from within the driver. Using printk() for debugging is powerful, and the alternatives are running a debugger on the kernel, peeking in /dev/mem, and other extremely low-level techniques. There are a few tools available on the Internet to help use these other techniques, but you need to be conversant with gdb and be able to read real kernel code in order to benefit from them. The most interesting tool at time of writing is kdebug-1.1, which lets you use gdb on a running kernel, examining or even changing kernel data structures (including those in loaded kernel modules) while the kernel is running. Kdebug is available for ftp from sunsite.unc.edu and its mirrors under /pub/Linux/kernel.
Just to make things a little harder, the kernel equivalent of the standard printf() function is called printk(), because it does not work exactly the same as printf(). Before 1.3.37, conventional printk()'s generated lines in /var/adm/messages, while later kernels will dump them to the console. If you want quiet logging (only within the messages file, via syslogd) you must prepend the symbol KERN_DEBUG to the format string. KERN_DEBUG and similar symbols are simply strings, which get concatenated to your format string by the compiler. This means that you must not put a comma between KERN_DEBUG and the format string. These symbols can be found in , and are documented there. Also, printk() does not support floating point formats.
Remember, that syslog write to the messages file as soon as possible, in order to save all messages on disk in case of a system crash. This means that an over-printk-ed module will slow down perceptibly, and will fill your disk in a short time.
Almost any module misbehaviour will generate an [cw]Oops[ecw] message from the kernel. An Oops is what happens when the kernel gets an exception from kernel code. In other words, Oopses are the equivalent of segmentation faults in user space, though no core file is generated. The result is usually the sudden destruction of the responsible process, and a few lines of low-level information in the messages file. Most Oops messages are the result of dereferencing a NULL pointer.
This way to handle disasters is a friendly one, and you'll enjoy it whenever your code is faulty: most other Unices produce a kernel panic instead. This does not mean that Linux never panics. You must be prepared to generate panics whenever you write functions that operate outside of a process context, such as within interrupt handlers and timer callbacks.
The scarce, nearly unintelligible information included with the [cw]Oops[ecw] message represents the processor state when the code faulted, and can be used to understand where the error is. A tool called ksymoops is able to print more readable information out of the oops, provided you have a kernel map handy. The map is what is left in /usr/src/linux/System.map after a kernel compilation. Ksymoops was distributed within util-linux-2.4, but was removed in 2.5 because it has been included in the kernel distribution during the linux-1.3 development.
If you really understand the Oops message, you can use it as you want, like invoking gdb off-line to disassemble the whole responsible function. if you understand neither the Oops nor the ksymoops output, you'd better add some more debugging printk() code, recompile, and reproduce the bug.
The following code can ease management of debugging messages. It must reside in the module's public include file, and will work for both kernel code (the module) and user code (applications). Note however that this code is gcc-specific. Not too big a problem for a kernel module, which is gcc-dependent anyway. This code was suggested by Linus Torvalds, as an enhancement over my previous ansi-compliant approach.
#ifndef PDEBUG
# ifdef DEBUG_modulename
# ifdef __KERNEL__
# define PDEBUG(fmt, args...) printk (KERN_DEBUG fmt , ## args)
# else
# define PDEBUG(fmt, args...) fprintf (stderr, fmt , ## args)
# endif
# else
# define PDEBUG(fmt, args...)
# endif
#endif

#ifndef PDEBUGG
# define PDEBUGG(fmt, args...)
#endif

After this code, every PDEBUG("any %i or %s...\n", i, s); in the module will result in a printed message only if the code is compiled with -DDEBUG_modulename, while PDEBUGG() with the same arguments will expand to nothing. In user mode applications, it works the same, except that the message is printed to stderr instead of the messages file.
Using this code, you can enable or disable any message by removing or adding a single G character.
Writing Code
Let's look at what kind of code must go inside the module. The simple answer is ``whatever you need''. In practice, you must remember that the module is kernel code, and must fit a well-defined interface with the rest of Linux.
Usually, you start with header inclusion. And you begin to have contraints: you must always define the __KERNEL__ symbol before including any header unless it is defined in your makefile, and you must only include files pertaining to the and hierarchies. Sure, you can include your module-specific header, but never, ever, include library specific files, such as or .
The code fragment in Listing 1 represents the first lines of source of a typical character driver. If you are going to write a module, it will be easier to cut and paste these lines from existing source rather than copying them by hand from this article.
#define __KERNEL__ /* kernel code */

#define MODULE /* always as a module */
#include /* can't do without it */
#include /* and this too */

/*
* Then include whatever header you need.
* Most likely you need the following:
*/
#include /* ulong and friends */
#include /* current, task_struct, other goodies */
#include /* O_NONBLOCK etc. */
#include /* return values */
#include /* request_region() */
#include /* system name and global items */
#include /* kmalloc, kfree */

#include /* inb() inw() outb() ... */
#include /* unreadable, but useful */

#include "modulename.h" /* your own material */

After including the headers, there comes actual code. Before talking about specific driver functionality---most of the code---it is worth noting that there exist two module-specific functions, which must be defined in order for the module to be loaded:
int init_module (void);
void cleanup_module (void);

The first is in charge of module initialization (looking for the related hardware and registering the driver in the appropriate kernel tables), while the second is in charge of releasing any resources the module has allocated and deregistering the driver from the kernel tables.
If these functions are not there, insmod will fail to load your module.
The init_module() function returns 0 on success and a negative value on failure. The cleanup_module() function returns void, because it only gets invoked when the module is known to be unloadable. A kernel module keeps a usage count, and cleanup_module() is only called when that counter's value is 0 (more on this later on).
Skeletal code for these two functions will be presented in the next installment. Their design is fundamental for proper loading and unloading of the module, and a few details must dealt with. So here, I'll introduce you to each of the details, so that next month I can present the structure without explaining all the details.

contd....

Saturday, July 21, 2007

Hardware and Software IRQs

Hardware Interrupts (Hard IRQs)

Timer ticks, network cards and keyboard are examples of real hardware which produce interrupts at any time. The kernel runs interrupt handlers, which services the hardware. The kernel guarantees that this handler is never re-entered: if another interrupt arrives, it is queued (or dropped). Because it disables interrupts, this handler has to be fast: frequently it simply acknowledges the interrupt, marks a `software interrupt' for execution and exits.
You can tell you are in a hardware interrupt, because in_irq() returns true.


Software Interrupt Context: Bottom Halves, Tasklets, softirqs

Whenever a system call is about to return to userspace, or a hardware interrupt handler exits, any `software interrupts' which are marked pending (usually by hardware interrupts) are run (kernel/softirq.c).
Much of the real interrupt handling work is done here. Early in the transition to SMP, there were only `bottom halves' (BHs), which didn't take advantage of multiple CPUs. Shortly after we switched from wind-up computers made of match-sticks and snot, we abandoned this limitation.
include/linux/interrupt.h lists the different BH's. No matter how many CPUs you have, no two BHs will run at the same time. This made the transition to SMP simpler, but sucks hard for scalable performance. A very important bottom half is the timer BH (include/linux/timer.h): you can register to have it call functions for you in a given length of time.
2.3.43 introduced softirqs, and re-implemented the (now deprecated) BHs underneath them. Softirqs are fully-SMP versions of BHs: they can run on as many CPUs at once as required. This means they need to deal with any races in shared data using their own locks. A bitmask is used to keep track of which are enabled, so the 32 available softirqs should not be used up lightly. (Yes, people will notice).
tasklets (include/linux/interrupt.h) are like softirqs, except they are dynamically-registrable (meaning you can have as many as you want), and they also guarantee that any tasklet will only run on one CPU at any time, although different tasklets can run simultaneously (unlike different BHs).
You can tell you are in a softirq (or bottom half, or tasklet) using the in_softirq() macro (include/asm/softirq.h).

Wednesday, July 11, 2007

GCC Optimization

Here's what the O options mean in GCC, why some optimizations aren't optimal after all and how you can make specialized optimization choices for your application.

In this article, we explore the optimization levels provided by the GCC compiler toolchain, including the specific optimizations provided in each. We also identify optimizations that require explicit specifications, including some with architecture dependencies. This discussion focuses on the 3.2.2 version of gcc (released February 2003), but it also applies to the current release, 3.3.2.

Levels of Optimization

Let's first look at how GCC categorizes optimizations and how a developer can control which are used and, sometimes more important, which are not. A large variety of optimizations are provided by GCC. Most are categorized into one of three levels, but some are provided at multiple levels. Some optimizations reduce the size of the resulting machine code, while others try to create code that is faster, potentially increasing its size. For completeness, the default optimization level is zero, which provides no optimization at all. This can be explicitly specified with option -O or -O0.
Level 1 (-O1)

The purpose of the first level of optimization is to produce an optimized image in a short amount of time. These optimizations typically don't require significant amounts of compile time to complete. Level 1 also has two sometimes conflicting goals. These goals are to reduce the size of the compiled code while increasing its performance. The set of optimizations provided in -O1 support these goals, in most cases. These are shown in Table 1 in the column labeled -O1. The first level of optimization is enabled as:

gcc -O1 -o test test.c



Table 1. GCC optimizations and the levels at which they are enabled.

Any optimization can be enabled outside of any level simply by specifying its name with the -f prefix, as:

gcc -fdefer-pop -o test test.c



We also could enable level 1 optimization and then disable any particular optimization using the -fno- prefix, like this:

gcc -O1 -fno-defer-pop -o test test.c



This command would enable the first level of optimization and then specifically disable the defer-pop optimization.
Level 2 (-O2)

The second level of optimization performs all other supported optimizations within the given architecture that do not involve a space-speed trade-off, a balance between the two objectives. For example, loop unrolling and function inlining, which have the effect of increasing code size while also potentially making the code faster, are not performed. The second level is enabled as:

gcc -O2 -o test test.c



Table 1 shows the level -O2 optimizations. The level -O2 optimizations include all of the -O1 optimizations, plus a large number of others.
Level 2.5 (-Os)

The special optimization level (-Os or size) enables all -O2 optimizations that do not increase code size; it puts the emphasis on size over speed. This includes all second-level optimizations, except for the alignment optimizations. The alignment optimizations skip space to align functions, loops, jumps and labels to an address that is a multiple of a power of two, in an architecture-dependent manner. Skipping to these boundaries can increase performance as well as the size of the resulting code and data spaces; therefore, these particular optimizations are disabled. The size optimization level is enabled as:

gcc -Os -o test test.c



In gcc 3.2.2, reorder-blocks is enabled at -Os, but in gcc 3.3.2 reorder-blocks is disabled.
Level 3 (-O3)

The third and highest level enables even more optimizations (Table 1) by putting emphasis on speed over size. This includes optimizations enabled at -O2 and rename-register. The optimization inline-functions also is enabled here, which can increase performance but also can drastically increase the size of the object, depending upon the functions that are inlined. The third level is enabled as:

gcc -O3 -o test test.c



Although -O3 can produce fast code, the increase in the size of the image can have adverse effects on its speed. For example, if the size of the image exceeds the size of the available instruction cache, severe performance penalties can be observed. Therefore, it may be better simply to compile at -O2 to increase the chances that the image fits in the instruction cache.
Architecture Specification

The optimizations discussed thus far can yield significant improvements in software performance and object size, but specifying the target architecture also can yield meaningful benefits. The -march option of gcc allows the CPU type to be specified (Table 2).

Table 2. x86 Architectures
Target CPU Types -march= Type
i386 DX/SX/CX/EX/SL i386
i486 DX/SX/DX2/SL/SX2/DX4 i486
487 i486
Pentium pentium
Pentium MMX pentium-mmx
Pentium Pro pentiumpro
Pentium II pentium2
Celeron pentium2
Pentium III pentium3
Pentium 4 pentium4
Via C3 c3
Winchip 2 winchip2
Winchip C6-2 winchip-c6
AMD K5 i586
AMD K6 k6
AMD K6 II k6-2
AMD K6 III k6-3
AMD Athlon athlon
AMD Athlon 4 athlon
AMD Athlon XP/MP athlon
AMD Duron athlon
AMD Tbird athlon-tbird

The default architecture is i386. GCC runs on all other i386/x86 architectures, but it can result in degraded performance on more recent processors. If you're concerned about portability of an image, you should compile it with the default. If you're more interested in performance, pick the architecture that matches your own.

Let's now look at an example of how performance can be improved by focusing on the actual target. Let's build a simple test application that performs a bubble sort over 10,000 elements. The elements in the array have been reversed to force the worst-case scenario. The build and timing process is shown in Listing 1.

Listing 1. Effects of Architecture Specification on a Simple Application


[mtj@camus]$ gcc -o sort sort.c -O2
[mtj@camus]$ time ./sort
real 0m1.036s
user 0m1.030s
sys 0m0.000s
[mtj@camus]$ gcc -o sort sort.c -O2 -march=pentium2
[mtj@camus]$ time ./sort
real 0m0.799s
user 0m0.790s
sys 0m0.010s
[mtj@camus]$

By specifying the architecture, in this case a 633MHz Celeron, the compiler can generate instructions for the particular target as well as enable other optimizations available only to that target. As shown in Listing 1, by specifying the architecture we see a time benefit of 237ms (23% improvement).

Although Listing 1 shows an improvement in speed, the drawback is that the image is slightly larger. Using the size command (Listing 2), we can identify the sizes of the various sections of the image.

Listing 2. Size Change of the Application from Listing 1


[mtj@camus]$ gcc -o sort sort.c -O2
[mtj@camus]$ size sort
text data bss dec hex filename
842 252 4 1098 44a sort
[mtj@camus]$ gcc -o sort sort.c -O2 -march=pentium2
[mtj@camus]$ size sort
text data bss dec hex filename
870 252 4 1126 466 sort
[mtj@camus]$

From Listing 2, we can see that the instruction size (text section) of the image increased by 28 bytes. But in this example, it's a small price to pay for the speed benefit.
Math Unit Optimizations

Some specialized optimizations require explicit definition by the developer. These optimizations are specific to the i386 and x86 architectures. A math unit, for one, can be specified, although in many cases it is automatically defined based on the specification of a target architecture. Possible units for the -mfpmath= option are shown in Table 3.

Table 3. Math Unit Optimizations
Option Description
387 Standard 387 Floating Point Coprocessor
sse Streaming SIMD Extensions (Pentium III, Athlon 4/XP/MP)
sse2 Streaming SIMD Extensions II (Pentium 4)

The default choice is -mfpmath=387. An experimental option is to specify both sse and 387 (-mfpmath=sse,387), which attempts to use both units.
Alignment Optimizations

In the second optimization level, we saw that a number of alignment optimizations were introduced that had the effect of increasing performance but also increasing the size of the resulting image. Three additional alignment optimizations specific to this architecture are available. The -malign-int option allows types to be aligned on 32-bit boundaries. If you're running on a 16-bit aligned target, -mno-align-int can be used. The -malign-double controls whether doubles, long doubles and long-longs are aligned on two-word boundaries (disabled with -mno-align-double). Aligning doubles provides better performance on Pentium architectures at the expense of additional memory.

Stacks also can be aligned by using the option -mpreferred-stack-boundary. The developer specifies a power of two for alignment. For example, if the developer specified -mpreferred-stack-boundary=4, the stack would be aligned on a 16-byte boundary (the default). On the Pentium and Pentium Pro targets, stack doubles should be aligned on 8-byte boundaries, but the Pentium III performs better with 16-byte alignment.
Speed Optimizations

For applications that utilize standard functions, such as memset, memcpy or strlen, the -minline-all-stringops option can increase performance by inlining string operations. This has the side effect of increasing the size of the image.

Loop unrolling occurs in the process of minimizing the number of loops by doing more work per iteration. This process increases the size of the image, but it also can increase its performance. This option can be enabled using the -funroll-loops option. For cases in which it's difficult to understand the number of loop iterations, a prerequisite for -funroll-loops, all loops can be unrolled using the -funroll-all-loops optimization.

A useful option that has the disadvantage of making an image difficult to debug is -momit-leaf-frame-pointer. This option keeps the frame pointer out of a register, which means less setup and restore of this value. In addition, it makes the register available for the code to use. The optimization -fomit-frame-pointer also can be useful.

When operating at level -O3 or having -finline-functions specified, the size limit of the functions that may be inlined can be specified through a special parameter interface. The following command illustrates capping the size of the functions to inline at 40 instructions:

gcc -o sort sort.c --param max-inline-insns=40



This can be useful to control the size by which an image is increased using -finline-functions.
Code Size Optimizations

The default stack alignment is 4, or 16 words. For space-constrained systems, the default can be minimized to 8 bytes by using the option -mpreferred-stack-boundary=2. When constants are defined, such as strings or floating-point values, these independent values commonly occupy unique locations in memory. Rather than allow each to be unique, identical constants can be merged together to reduce the space that's required to hold them. This particular optimization can be enabled with -fmerge-constants.
Graphics Hardware Optimizations

Depending on the specified target architecture, certain other extensions are enabled. These also can be enabled or disabled explicitly. Options such as -mmmx and -m3dnow are enabled automatically for architectures that support them.
Other Possibilities

We've discussed many optimizations and compiler options that can increase performance or decrease size. Let's now look at some fringe optimizations that may provide a benefit to your application.

The -ffast-math optimization provides transformations likely to result in correct code but it may not adhere strictly to the IEEE standard. Use it, but test carefully.

When global common sub-expression elimination is enabled (-fgcse, level -O2 and above), two other options may be used to minimize load and store motions. Optimizations -fgcse-lm and -fgcse-sm can migrate loads and stores outside of loops to reduce the number of instructions executed within the loop, therefore increasing the performance of the loop. Both -fgcse-lm (load-motion) and -fgcse-sm (store-motion) should be specified together.

The -fforce-addr optimization forces the compiler to move addresses into registers before performing any arithmetic on them. This is similar to the -fforce-mem option, which is enabled automatically in optimization levels -O2, -Os and -O3.

A final fringe optimization is -fsched-spec-load, which works with the -fschedule-insns optimization, enabled at -O2 and above. This optimization permits the speculative motion of some load instructions to minimize execution stalls due to data dependencies.