Monday, February 25, 2008

Writing your own Preprocessors


Snort Preprocessors Development Kickstart

Last updated: November 29th 2005
Authors: Andrés Felipe Arboleda Torres, Charles Edward Bedón Cortázar


What is a Snort Preprocessor?
Preprocessors are pluggable components of Snort, introduced in version 1.5. They sit just after the protocol-analysis module and before the detection engine, and they do not depend on rules. They are called whenever a packet arrives, but just once per packet; detection plugins, on the other hand, do depend on rules and may be applied many times to a single packet. SPPs can be used in different ways: they can look for a specific behavior (portscan, flowportscan), provide support for further analysis by other components (like flow), or just collect certain information, like perfmonitor.
More information can be found at http://www.snort.org/docs/snort_htmanuals/htmanual_2.4/node11.html
When should I code a Snort Preprocessor and not a Detection Plugin?
/*TODO*/
A Hello World Preprocessor
These are the steps to follow to build your first preprocessor (assuming the unpacked Snort source tree is at $SNORT_DIR):

1. Open the templates for your code at $SNORT_DIR/templates: two files, spp_template.c and spp_template.h
2. Create your own files, following the convention of prefixing the preprocessor's name with "spp_". These files should be created in $SNORT_DIR/src/preprocessors
3. The header file:
spp_helloworld.h

/*Include guard: keeps this header from being processed more than once*/
#ifndef __SPP_HELLOWORLD_H__
#define __SPP_HELLOWORLD_H__
/*Demonstrative (won't be used in this example later): you can define structs to share info if, for example, this preprocessor will be used by another one. */
typedef struct _helloStruct{
    int x;
    char *string;
} helloStruct;
/*Demonstrative. Declared extern here and defined in the .c file, so including this header from several places does not cause multiple definitions*/
extern int helloCounter;
/*This is the actually important prototype. This function will be called when initializing SPPs*/
void SetupHelloWorld(void);
/*Demonstrative. Remember: important if this preprocessor is intended to support another one later*/
void myHelloWorldFunction(char *str);
/*Demonstrative*/
extern helloStruct myHelloWorldStruct;
#endif /* __SPP_HELLOWORLD_H__ */
4. The source file:
spp_helloworld.c

/*config.h is generated by "autoheader" when configuring the project (information about the OS, available libraries, ...); it's located at $SNORT_DIR. Useful for knowing what you can and cannot do*/
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

/*Definition of u_char*/
#include <sys/types.h>

/*Functions Add* and RegisterPreprocessor */
#include "plugbase.h"

/*Definition of Packet*/
#include "decode.h"

/*Our header file */
#include "spp_helloworld.h"

/*Demostrative */
helloStruct myHelloWorldStruct;

/*Prototypes of internal functions*/

/*Initialization function (usually takes parameters from the config file, calls an argument-parsing function and initializes local variables) */
void HelloInit(u_char *args);
/*What the preprocessor actually does*/
void HelloFunc(Packet *p);
/*What to do when a termination order is received*/
void HelloCleanExitFunction(int signal, void *arg);
/*What to do when a restart order is received*/
void HelloRestartFunction(int signal, void *arg);

/*Called from plugbase.c*/
void SetupHelloWorld(void){
/* Here you could (in fact you *should*) use the macro DEBUG_WRAP defined in debug.h; or, if you really want this message to appear anyway (not only in debug mode; where it ends up depends on the type of message logging, it could even be registered in the syslog), use the LogMessage function from util.h.
NOTE: this function is always called, even if this preprocessor has not been included in the config file*/
printf("Let's see if Hello World Preprocessor is in the config file...\n");
/*Receives the name of your preprocessor (the one in the config file) and a function pointer to the initialization code */
RegisterPreprocessor("helloworld", HelloInit);
}

/*NOTE: This function is called once, and only if the preprocessor was added to the config file. See comments above*/
void HelloInit(u_char *args){
printf("Hello world Preprocessor is being initialized...\n");
/*"args" is a string with the arguments from the config file (remember the preprocessor definition syntax: "preprocessor <name>: <args>"). Arguments can use any separator; $SNORT_DIR/src/mstring.h offers a good collection of functions, e.g. mSplit, to extract the right values. That file contains some functions not included in libc.*/
printf("These arguments were received --> %s\n", args);
/*Add this preprocessor to the preprocessor list. The argument is a function pointer to the code to be executed every time a packet arrives*/
AddFuncToPreprocList(HelloFunc);
/*What to do when a termination order is received*/
AddFuncToCleanExitList(HelloCleanExitFunction, NULL);
/*What to do when a restart order is received*/
AddFuncToRestartList(HelloRestartFunction, NULL);
}

/*NOTE: This function is called whenever a packet is received*/
void HelloFunc(Packet *p){
printf("Hello world Preprocessor is being called...\n");
/*Just a little demonstration. Is the packet TCP?*/
printf((p->tcph == NULL) ?
"This packet does not have any TCP header\n" : "This is a valid TCP packet\n");
}

/*TODO*/
void HelloRestartFunction(int signal, void *arg){
}

/*Called when Ctrl+C is hit*/
void HelloCleanExitFunction(int signal, void *arg){
/* Free pointers and other resources, for example. You can also print some stats on stdout or generate more logs*/
}

5. Edit the file $SNORT_DIR/src/plugbase.c: Add these lines-->
plugbase.c
In the includes section:
/* built-in preprocessors */
#include "preprocessors/spp_portscan.h"
(...)
/*Our preprocessor*/
#include "preprocessors/spp_helloworld.h"
(...)

In the function InitPreprocessors:
void InitPreprocessors(){
(...)
SetupHttpInspect();
/*Our preprocessor*/
SetupHelloWorld();
SetupFlow();
(...)
}
6. Edit $SNORT_DIR/src/preprocessors/Makefile (if already generated) or $SNORT_DIR/src/preprocessors/Makefile.in (before running configure).
Makefile or Makefile.in
In the libspp_a_SOURCES section:

(...)
libspp_a_SOURCES = spp_arpspoof.c spp_arpspoof.h spp_bo.c spp_bo.h \
(...)
#Our preprocessor
spp_helloworld.c spp_helloworld.h \
(...)
#And add this for the linker in the section "am_libspp_a_OBJECTS"
(...)str_search.$(OBJEXT) spp_portscanai.$(OBJEXT)
#Our preprocessor
spp_helloworld.$(OBJEXT)
(...)
7. Compile ... (nothing is over til it's over)
8. Add our pp in the config file:
etc/snort.conf
(...)
preprocessor flow: stats_interval 0 hash 2
#Our preprocessor
preprocessor helloworld: hello world args

preprocessor stream4_reassemble
preprocessor stream4
(...)

Stuff Used

* Snort 2.2.0(successfully used under v2.4.1)
* Anjuta IDE 1.2.2(KDevelop 3.4 too)
* Debian Sarge 3.1 kernel 2.6.13

Bibliography

* http://www.snort.org/docs/

To Do

* Complete the sections marked as "TODO" ... heh, excuse us, we're a bit impatient and couldn't wait until everything was finished before releasing the docs... ;)
* Include a little FAQ with information about commonly used functions and data structures
* Do an example of collaborative work between two or more SPPs

Happy Snortin'!!!!

Tuesday, February 19, 2008

Stream5 on Snort ( Good Article by Richard Bejtlich)

It's important for value-added resellers and consultants to understand how Snort detects security events. Stream5 is a critical aspect of the inspection and detection equation. A powerful Snort preprocessor, Stream5 addresses several aspects of network-centric traffic inspection. Sourcefire calls Stream5 a "target-based" system, meaning it can perform differently depending on the directives passed to it. These directives tell Stream5 to inspect traffic based on its understanding of differences of behavior in TCP/IP stacks. However, if Stream5 isn't configured properly, customers may end up with a Snort installation that is running but not providing much real value. In this edition of Snort Report I survey a specific aspect of Stream5, found in Snort 2.7.x and 2.8.x.

If you've never read Insertion, evasion and denial of service: Eluding network intrusion detection by Thomas Ptacek and Timothy Newsham, then you absolutely must do so now. Rather than repeat the theory and consequences of their landmark paper, I'll show how features of the most current version of Snort address problems identified by the authors almost ten years ago.

Snort 2.7.0.1 installation

First, I install Snort 2.7.0.1 on FreeBSD 6.2.

Next, I edit my snort.conf file to include the changes indicated by the diff below. Note the deletion of all of the rule locations and the addition of a single rule for testing purposes.

With the changes made to snort.conf, I run Snort in testing mode to ensure I didn't forget anything.

Since Snort operated normally, I'm ready to trigger an alert.

Testing Snort on normal traffic

I start Snort and tell it to listen to the local interface while sending alerts to the screen.

taosecurity:/usr/local/snort-2.7.0.1# bin/snort -i lnc0 -c snort.conf -A console

On a second machine I connect to the Snort sensor's OpenSSH port and enter a text string to trigger the alert.

tws:~# nc -v 192.168.2.103 22
192.168.2.103: inverse host lookup failed: Unknown host
(UNKNOWN) [192.168.2.103] 22 (ssh) open
SSH-2.0-OpenSSH_4.2p1 FreeBSD-20060930
This is a test of the emergency broadcast system.
Protocol mismatch.

Snort reports seeing the traffic as I expected.

09/03-14:31:34.315118 [**] [1:2000001:0] TEST String to Port 22 TCP [**] [Priority: 0] {TCP} 192.168.2.106:2099 -> 192.168.2.103:22

The traffic looks similar to this:

Notice the entire test message is contained in one segment. This is normal traffic, and it will serve as a baseline for future scenarios.

Using Fragroute

I use Fragroute to exercise the new stream preprocessor, Stream5, present in Snort 2.7.x and beyond. I discuss Fragroute in Detect events without Snort IDS rules so I won't repeat the introduction here.

I create a fragroute.conf file with these contents:

tcp_seg 16
print

These directives tell Fragroute to chop up TCP segments into 16 byte fragments and print the traffic to the console.

Next, I start Fragroute on a Debian 4.0 Linux host with IP address 192.168.2.106. For all subsequent cases, 192.168.2.106 will be the "attacker" or client, and 192.168.2.103 will be the "victim" and also the machine inspecting traffic with Snort.

tws:~# fragroute -f fragroute.conf 192.168.2.103
fragroute: tcp_seg -> print

When I repeat my test, connecting to port 22 and entering the "This is a test..." string, Snort doesn't alert. What could be wrong? The traffic was chopped into segments as expected.

Notice the test message is broken into four segments.

The key is found in the Snort startup messages. Look at this excerpt:

Stream5 TCP Policy config:
Reassembly Policy: FIRST
Timeout: 30 seconds
Min ttl: 1
Options:
Static Flushpoint Sizes: YES
Reassembly Ports:
21 client (Footprint)
23 client (Footprint)
25 client (Footprint)
...truncated...

Port 22 isn't being reassembled, although ports 21, 23, 25 and others are. To correct this problem I modify the entry for stream5_tcp in snort.conf to look like this:

preprocessor stream5_tcp: policy first, use_static_footprint_sizes, \
ports both 22

Repeating my test, I find my alert as expected.

09/03-16:29:07.070801 [**] [1:2000001:0] TEST String to Port 22 TCP [**] [Priority: 0] {TCP} 192.168.2.106:2415 -> 192.168.2.103:22

This is our first clue that it's important to understand how your inspection product is configured and how it makes decisions.

Overlapping segments

It's time to make life more interesting. I edit fragroute.conf to add overlapping fragments.

tcp_seg 16 new
print

Here "new" means Fragroute will send garbage first, then correct data last. If the recipient TCP/IP stack favors new data over old, then the connection will be OK because the garbage fragmented data will be discarded prior to being sent to the OpenSSH daemon. (The OpenSSH daemon will ignore it anyway, but if a legitimate OpenSSH connection attempt was being fragmented it would survive the "new" directive.)

At this point I am really interested to see if Snort and Stream5 tell me that someone is potentially attempting an insertion or evasion attack by introducing overlapping fragments.

When I run Snort and the test again, I don't get any alerts from Snort. What could be wrong this time?

Let's return to the stream5_tcp directive:

preprocessor stream5_tcp: policy first, use_static_footprint_sizes, \
ports both 22

Here "policy first" is the problem. According to the Stream5 README, "policy first" means "favor first overlapped segment." We want behavior that BSD supports, since the OpenSSH server is running FreeBSD. In other words, Snort is passively reassembling the traffic it sees to the OpenSSH server, but it's making decisions that don't match how the FreeBSD OS on the OpenSSH server reassembles overlapping fragments.

If we change the stream5_tcp directive we will have better results.

preprocessor stream5_tcp: policy bsd, use_static_footprint_sizes, \
ports both 22

Sure enough, when we re-run our test we get an alert:

09/03-16:44:40.729492 [**] [1:2000001:0] TEST String to Port 22 TCP [**] [Priority: 0] {TCP} 192.168.2.106:3340 -> 192.168.2.103:22

This time, Snort generates an alert because Stream5 reassembled the fragmented traffic in the same way that the underlying FreeBSD OS did. Incidentally, it's irrelevant that Snort is running on the FreeBSD OS itself. Snort makes its own TCP fragmentation and overlap reassembly decisions; it does not look to the underlying OS for answers.

Generic overlapping fragment detection

What if you want to detect any suspicious use of overlapping fragments? In my experience it's uncommon for normal TCP traffic to contain overlapping segments like the following:

That traffic is similar to the packets manipulated by Fragroute in the previous example. As you can see, the last packet contains the beginning of the test message. However, the first packet contains garbage ("b6reOo34j0rojpB.") that overlaps with bytes 17-32 (16 bytes) of the real test message. Because the FreeBSD OS of the OpenSSH server favored new data (the last packet) over old data (the first packet), the correct message was passed to layer 7.

It would be helpful to receive a message from Stream5 whenever odd behavior like this occurs on the wire.

We can modify the stream5_tcp directive one more time to detect this sort of odd activity.

preprocessor stream5_tcp: policy bsd, use_static_footprint_sizes, \
ports both 22, detect_anomalies, overlap_limit 1

Notice the inclusion of the "detect_anomalies, overlap_limit 1" option. (Thank you to beenph and VictorK in IRC for passing this information to me!) These options will cause Stream5 to alert when it sees at least one overlapping TCP segment.

In case one, I send the following string but Snort remains silent:

This is a test test test test test test test test test test

In contrast, the following string causes Snort to trigger:

This is a test test test test test test test test test test test

Stream5 produces an alert like this:

09/03-16:58:22.263117 [**] [129:7:1] Limit on number of overlapping TCP packets reached [**] [Priority: 3] {TCP} 192.168.2.106:4556 -> 192.168.2.103:22

It's good to get at least one alert on overlapping TCP fragments, but why did we only see one alert?

Why one and not the other?

Why did Snort stay quiet for case one, but react for case two?

The difference lies in the number of packets generated. To demonstrate this, I reproduced the critical elements of the two cases as captured by TCPdump. In each listing, the first number is a packet number for easy reference. Next is P, for the TCP "Push" flag. The number1:number2(number3) field means "sequence number of the first byte in this segment : sequence number of the first byte in the NEXT segment (number of bytes of data in this segment)". In each case "ack 40" means the sender expects the other side of the conversation to next send data whose first byte will be number 40. All of these numbers are relative (not absolute) sequence numbers.

(Note that number2 is NOT the sequence number of the last byte of data in the segment. For packet 1 below, 32 is the last number. Count with me if you're confused: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32. 33 would be the 17th byte, which is wrong. This is a common misconception in books on TCP/IP.)

Here are the sequence numbers for the traffic sent in the first case.

1. P 17:33(16) ack 40
2. P 33:49(16) ack 40
3. P 49:61(12) ack 40
4. P 1:33(32) ack 40

Packet 4 is the only overlapping fragment, since it contains bytes 17-32 which were already sent in fragment 1.

Here are the sequence numbers for the traffic sent in the second case.

1. P 17:33(16) ack 40
2. P 49:65(16) ack 40
3. P 65:66(1) ack 40
4. P 1:33(32) ack 40
5. P 33:65(32) ack 40

Packets 4 and 5 are overlapping fragments. Packet 4 contains bytes 1-32; packet 1 already sent bytes 17-32. Packet 5 contains bytes 33-64; packet 2 already sent bytes 49-64. Therefore, unlike the previous case, this trace contains two overlapping fragments. Because Stream5's "overlap_limit 1" option was set and two fragments (case two) is greater than one fragment (case one), case two triggered Stream5 while case one did not.

I hope this edition of the Snort Report motivated you to think about how you can use Stream5 to detect Fragroute-type TCP fragment overlap attacks. Like most every Snort Report, this edition only scratched the surface of another powerful component of Snort.

Friday, September 21, 2007

SCTP (New Protocol within TCP/IP)

INTRODUCTION

SCTP is a reliable transport protocol operating over IP. SCTP is more akin to TCP than UDP; however, it offers additional features beyond TCP while still supporting much of the same functionality. So SCTP is connection oriented and implements the same congestion/flow control. Detection of data corruption, loss of data and duplication of data is achieved by using checksums and sequence numbers. A selective retransmission mechanism is applied to correct loss or corruption of data.

In the TCP/IP network model, SCTP resides in the transport layer, alongside TCP and UDP. The transport layer handles communication among programs in a network. This involves accepting data from the application layer, and repackaging it (perhaps fragmenting the data) so it may be passed on to the network layer. In addition the transport layer will also ensure the data arrives correctly on the other end. A transport protocol in essence is a set of rules that govern how data is sent between communicating nodes.

The network layer handles basic communication, addressing and routing; IP is usually used. The application layer consists of end-user applications, and the link layer defines network hardware and device drivers.


Comparing SCTP to TCP

SCTP and TCP have some distinct differences, yet also many similarities. In this section we will explore their similarities and then discuss the major ways in which they differ.

Similarities

Startup: to establish an association between two nodes, both protocols will exchange a series of messages. There are differences in the way these messages are exchanged and their format, but they serve the same purpose: to set up an end-to-end relationship (an association or connection).

Reliability and Ordering: both SCTP and TCP implement mechanisms to ensure the successful delivery of user datagrams. This includes reliable and ordered data delivery.

Congestion Control: this is a critical element in any transport protocol. It regulates the flow of data entering the network, limiting it to accommodate occurrences of congestion. SCTP and TCP use the same congestion control mechanism: Additive Increase, Multiplicative Decrease (AIMD) congestion window management.

Closing Down: both protocols have two different close procedures, a graceful close and an abortive one. The graceful close will ensure that all user data in the queue will be delivered before the association is terminated. The abortive close occurs during errors.

Differences

There are two key differences between TCP and SCTP:

· Multihoming

· Multistreaming

These are new features in SCTP and are what really set the two protocols apart.

Multihoming: an essential property of SCTP is its support of multi-homed nodes, i.e. nodes which can be reached under several IP addresses. If we allow SCTP nodes to support more than one IP address, during network failure data can be rerouted to alternative destination IP addresses. This makes the nodes more tolerant against physical network failures and other problems of that kind.

Multistreaming: an effective way to limit Head-of-Line blocking. The benefit of having multiple independent data streams is that if a packet is lost in one stream, the remaining unaffected streams can continue to send data while the affected stream blocks to wait for the retransmission. In TCP, if a packet is lost, the connection effectively grinds to a halt while it waits for the retransmission to be sent.


An SCTP association is equivalent to a TCP connection; both represent an end-to-end relationship between two transmitting nodes.

Multistreaming can be achieved in TCP, however it involves opening multiple TCP connections which each act as a stream to send data. This differs from multistreaming in SCTP where all the streams reside in a single association. Opening multiple TCP connections is TCP-unfriendly, which means that a pair of communicating nodes will obtain a larger proportion of the available channel bandwidth. Thus, SCTP is more TCP-friendly in this regard.

Although multihoming and multistreaming may be where SCTP and TCP differ most, the two protocols exhibit other differences, which are also important to discuss.

Security at Startup: SCTP and TCP both carry out an exchange of messages to establish an end-to-end relationship. The way these messages are sent, however, is different. Traditional TCP uses a three-way handshake, whereas SCTP uses a four-way handshake. A signed state cookie is involved in the SCTP four-way handshake, which helps protect against denial of service attacks.

A denial of service attack is where resources are tied up on the server side so that it is impossible to respond to legitimate connections. The attacker issues vast amounts of SYN requests (a message requesting set-up of a connection) to the server and when it receives the SYN, ACK it simply discards it, not bothering to respond with an ACK. This causes the server to retain the partial state that was allocated after the SYN request, and if carried out repetitively will lead to a denial of service.

SCTP protects against denial of service attacks with the use of a cookie. The cookie is bundled with the INIT-ACK from the server to the client. The server does not record the association or keep a transmission control block (TCB), rather it derives the TCB from the cookie, which is sent back from the client inside the COOKIE-ECHO. Since it has no knowledge of the association till the client responds with a COOKIE-ECHO, it becomes resilient to denial of service attacks.

There may at first appear to be an overhead to sending four messages, however user data can be bundled in the last two SCTP packets.

Data Delivery: Data transmission in TCP is byte-stream oriented; in SCTP, it is message-oriented. In TCP, data is transported as a consecutive stream of bytes between two endpoints, so user message boundaries are not preserved on the wire. Parts of one message may be sent with parts of another message in a single data packet. This means that some kind of message delineation is required by the application to inform the receiver of the message length and the amount to read. The receiving application will need to do some complex buffering and framing to reconstruct the messages.

SCTP, in contrast, makes an explicit demarcation of user message boundaries. Each message is delivered as a complete read, which lifts a lot of the work off the application layer. An exception to this is when the message is larger than the maximum packet size. Even then, parts of two user messages will never be put into a single data packet.

Unordered Delivery: SCTP allows for data to be sent reliably but unordered. This has benefits when dealing with large amounts of independent transactions, e.g. components in a web page. TCP has no such facility.

SACKs: All acknowledgements in SCTP are done with SACKs. They are useful as they indicate if there are any gaps in the transmission, i.e. missing blocks. TCP does not make explicit use of SACKs but can be configured to support them. However, TCP can only report four missing data packets in a SACK; SCTP allows for much larger amounts to be reported.

Closing Association: Despite the fact that both TCP and SCTP have graceful close mechanisms, there is a slight difference in what these mechanisms permit. TCP allows what is known as the “half-closed” state, where one endpoint stays open while the other endpoint closes. SCTP does not allow this, both endpoints must close when the shutdown primitive is issued. One reason for not putting the half-closed state in SCTP was the lack of use of it: very few applications require it.

Thursday, September 6, 2007

Strict Aliasing

What is strict aliasing?
Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other).

Examples

Pointers to different built in types do not alias:

int16_t* foo;
int32_t* bar;

The compiler will assume that *foo and *bar never refer to the same location.
Pointers to aggregate or union types with differing tags do not alias:

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Foo;

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Bar;

Foo* foo;
Bar* bar;

The compiler will assume that *foo and *bar never refer to the same location, even though the contents of the structures are the same.

Pointers to aggregate or union types which differ only by name may alias:

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Foo;

typedef Foo Bar;

Foo* foo;
Bar* bar;

The compiler will assume that *foo and *bar may refer to the same location, and will not perform the optimizations described below.

Benefits to The Strict Aliasing Rule
When the compiler cannot assume that two objects are not aliased, it must act very conservatively when accessing memory. For example:

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Sample;

void
test( uint32_t* values,
      Sample* uniform,
      uint64_t count )
{
  uint64_t i;

  for (i=0;i<count;i++)
  {
    values[i] += (uint32_t)uniform->b;
  }
}

Compiled with -fno-strict-aliasing -O3 -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test:
    li 10, 0          # i = 0
    cmpld 7, 10, 5    # done = (i==count)
    bgelr- 7          # if (done) return
    mtctr 5           # ctr = count
.L8:
    sldi 11, 10, 2    # offset = i * 4
    lwz 9, 4(4)       # b = *(uniform+4)
    addi 10, 10, 1    # i++
    lwzx 5, 11, 3     # value = *(values+offset)
    add 0, 5, 9       # value = value + b
    stwx 0, 11, 3     # *(values+offset) = value
    bdnz .L8          # if (ctr--) goto .L8
    blr               # return

In this case uniform->b must be loaded during each iteration of the loop. This is because the compiler cannot be certain that values does not overlap b in memory. If, in fact, they do overlap, the programmer would expect that uniform->b would be properly updated and the values stored into the values array adjusted accordingly. The only method for the compiler to guarantee these results is reloading uniform->b at every iteration.

It was noted that this case is extremely uncommon in most code, so the decision was made to presume objects of different types are not aliased and to be more aggressive with optimizations. The fact that this presumption would break some existing code was surely discussed in detail. It must have been decided that those most likely to use memory-aliasing techniques for optimization are few, and that those who do are the most willing and capable of making the necessary changes.

The result, even for this small case, can make a significant performance impact. Compiled with -fstrict-aliasing -Wstrict-aliasing=2 -O3 -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test:
    li 11,0          # i = 0
    cmpld 7,11,5     # done = (i == count)
    bgelr- 7         # if (done) return
    lhz 4,2(4)       # b = uniform.b
    mtctr 5          # ctr = count
.L8:
    sldi 9,11,2      # offset = i * 4
    addi 11,11,1     # i++
    lwzx 5,9,3       # value = *(values+offset)
    add 0,5,4        # value = value + b
    stwx 0,9,3       # *(values+offset) = value
    bdnz .L8         # if (ctr--) goto .L8
    blr              # return

The load of b is now only done once, outside the loop. For more examples of optimizations for non-aliasing memory see: Demystifying The Restrict Keyword
Casting Compatible Types
Aliases are permitted for types that only differ by qualifier or sign.

uint32_t
test( uint32_t a )
{
  uint32_t* const a0 = &a;
  uint32_t* volatile a1 = &a;
  int32_t* a2 = (int32_t*)&a;
  int32_t* const a3 = (int32_t*)&a;
  int32_t* volatile a4 = (int32_t*)&a;
  const int32_t* const a5 = (int32_t*)&a;

  (*a0)++;
  (*a1)++;
  (*a2)++;
  (*a3)++;
  (*a4)++;

  return (*a5);
}

In this case a0-a5 are all valid aliases of a and this function will return (a + 5).
GCC has two flags to enable warnings related to strict aliasing. -Wstrict-aliasing enables warnings for most common errors related to type-punning. -Wstrict-aliasing=2 attempts to warn about a larger class of cases, however false positives may be returned.
Casting through a union (1)
The most commonly accepted method of converting one type of object to another is by using a union type as in this example:

typedef union
{
  uint32_t u32;
  uint16_t u16[2];
}
U32;

uint32_t
swap_words( uint32_t arg )
{
  U32 in;
  uint16_t lo;
  uint16_t hi;

  in.u32 = arg;
  hi = in.u16[0];
  lo = in.u16[1];
  in.u16[0] = lo;
  in.u16[1] = hi;

  return (in.u32);
}

This method is not properly called casting at all (although it may be called type-punning), as the value is simply copied into a union which permits aliasing among its members. From a performance point of view, this method relies on the ability of the optimizer to remove the redundant stores and loads. When using recent versions of GCC, if the transformation is reasonably simple, it is very likely that the compiler will be able to remove the redundancies and produce an optimal code sequence.
Strictly speaking, reading a member of a union different from the one written to is undefined in ANSI/ISO C99 except in the special case of type-punning to a char*, similar to the example below: Casting to char*. However, it is an extremely common idiom and is well-supported by all major compilers. As a practical matter, reading and writing to any member of a union, in any order, is acceptable practice.
For example, when compiled with GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8), the argument is simply rotated 16 bits.

swap_words:
    rlwinm r3,r3,16,0xffffffff
    blr

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU, the loads and stores are removed but the instruction sequence is less than optimal.

swap_words:
    slwi 4,3,16      ; hi = arg << 16
    rldicl 3,3,48,48 ; lo = arg >> 16
    or 0,4,3         ; out = hi | lo
    rldicl 3,0,0,32  ; final = out & 0xffffffff
    blr


In order to generate reasonably good code across both the GCC3 and GCC4 families, use C99-style initializers:

uint32_t
swap_words( uint32_t arg )
{
  U32 in = { .u32=arg };
  U32 out = { .u16[0]=in.u16[1],
              .u16[1]=in.u16[0] };

  return (out.u32);
}

Compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

swap_words:
    stwu 1,-16(1)            ; Push stack
    rlwinm 3,3,16,0xffffffff ; Rotate 16 bits
    addi 1,1,16              ; Pop stack
    blr

It is a peculiarity of the 32 bit build of GCC 3.4.1 for the Cell PPU that the stack is always pushed and popped regardless of whether or not it is used.
This method is most valuable for use with primitive types which can be returned by value, because it relies on making a complete copy of the object (by value) and then removing the redundancies. With more complex aggregate or union types, copying may be done on the stack or through a call to memcpy, and the redundancies are harder to eliminate.
Casting through a union (2)
Casting proper may be done between a pointer to a type and a pointer to an aggregate or union type which contains a member of a compatible type, as in the following example:

uint32_t
swap_words( uint32_t arg )
{
    U32* in = (U32*)&arg;
    uint16_t lo = in->u16[0];
    uint16_t hi = in->u16[1];

    in->u16[0] = hi;
    in->u16[1] = lo;

    return (in->u32);
}

Here in is a pointer to a U32 union, which contains the member u32 of type uint32_t; this is compatible with arg, which is also of type uint32_t.
The above source when compiled with GCC 4.0 with the -Wstrict-aliasing=2 flag enabled will generate a warning. This warning is an example of a false positive. This type of cast is allowed and will generate the appropriate code (see below). It is documented clearly that -Wstrict-aliasing=2 may return false positives.
Compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8),

swap_words:
    stw r3,24(r1)  ; Store arg
    lhz r0,24(r1)  ; Load hi
    lhz r2,26(r1)  ; Load lo
    sth r0,26(r1)  ; Store result[1] = hi
    sth r2,24(r1)  ; Store result[0] = lo
    lwz r3,24(r1)  ; Load result
    blr            ; Return

GCC is extremely poor at combining loads and stores done through a pointer to a union type as can be seen from the generated code above. The output is a very naive interpretation of the source and would perform badly compared to the previous examples on most architectures.

However, once this fact is accounted for, this method can be very useful. Rather than copying the argument by value, which is problematic on large or complex structures, a pointer can be passed in and the value modified directly. If the loads and stores can be combined in the source the results will usually be excellent.
"But when the address of a variable is taken, doesn't the compiler force it to be stored in memory rather than in a register?"

Yes, both a store and a load may then be generated as part of the trace. However, when alias analysis is done, it can be determined that the object cannot be changed through any other mechanism, so the load and store may be marked as redundant and removed.
Do not rely on the compiler to combine loads and stores. The programmer is always better equipped to make those decisions based on alignment concerns and complex instruction penalty rules.

void
swap_words( uint16_t* arg )
{
    U32* combined = (U32*)arg;
    uint32_t start = combined->u32;
    uint32_t lo = start >> 16;
    uint32_t hi = start << 16;
    uint32_t final = lo | hi;

    combined->u32 = final;
}

Compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8),

swap_words:
    lwz r0,0(r3)               ; Load arg
    rlwinm r0,r0,16,0xffffffff ; Rotate 16 bits
    stw r0,0(r3)               ; Store arg
    blr                        ; Return

If the above source is called as a non-inline function, there will be a significant penalty on most architectures waiting for the load before the rotate and the store on return.
If the above source is called as an inline function, it can safely be assumed that the load and store will be removed by the compiler as redundant.
In C99, a static inline function, which may be included in a header file, differs from automatic inlining in that the function may be defined multiple times (e.g. included by multiple source files). Each definition of a static inline function must be identical.

static inline void
swap_words( uint16_t* arg )
{
    U32* combined = (U32*)arg;
    uint32_t start = combined->u32;
    uint32_t lo = start >> 16;
    uint32_t hi = start << 16;
    uint32_t final = lo | hi;

    combined->u32 = final;
}

With some care, this method is the most appropriate for modifying large or complex structures through aliases of multiple types.
Casting through a union (3)
Occasionally a programmer may encounter the following INVALID method for creating an alias with a pointer of a different type:

typedef union
{
    uint16_t* sp;
    uint32_t* wp;
} U32P;

uint32_t
swap_words( uint32_t arg )
{
    U32P in = { .wp = &arg };
    const uint16_t hi = in.sp[0];
    const uint16_t lo = in.sp[1];

    in.sp[0] = lo;
    in.sp[1] = hi;

    return ( arg ); /* <-- RESULT IS UNDEFINED */
}

The problem with this method is that although U32P does in fact say that sp is an alias for wp, it says nothing about the relationship between the values pointed to by sp and wp. This differs in a critical way from "Casting Through a Union (1)" and "Casting Through a Union (2)", which both define aliases for the values being pointed to, not the pointers themselves.

The presumption of strict aliasing remains true: Two pointers of different types are assumed, except in a few very limited conditions specified in the C99 standard, not to alias. This is not one of those exceptions.
The above source when compiled with GCC 3.4.1 or GCC 4.0 with the -Wstrict-aliasing=2 flag enabled will NOT generate a warning. This should serve as an example to always check the generated code. Warnings are often helpful hints, but they are by no means exhaustive and do not always detect when a programmer makes an error. Like any piece of software, a compiler has limits. Knowing them can only be helpful.
For example, when compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8),

0swap_words: ; RETURNS ARG UNCHANGED
1 lhz r0,24(r1) ; Load lo from stack (What value?!)
2 lhz r2,26(r1) ; Load hi from stack (What value?!)
3 stw r3,24(r1) ; Store arg to stack
4 sth r0,26(r1) ; Store hi to stack
5 sth r2,24(r1) ; Store lo to stack
6 blr ; Return

In this case, notice that because hi, lo and arg are assumed not to alias, the resulting instruction order renders the function useless:

* [Line 1]: lo is loaded from the stack before anything is stored to the stack
* [Line 2]: hi is loaded from the stack before anything is stored to the stack
* [Line 3]: arg is stored to the stack, but this value will not be read.
* [Line 4]: hi is stored to the stack, but this value will not be read.
* [Line 5]: lo is stored to the stack, but this value will not be read.

Or when compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

swap_words:       # RETURNS ARG UNCHANGED
    stw 3,48(1)   # Store arg to stack
    lhz 9,48(1)   # Load hi
    lhz 0,50(1)   # Load lo
    lwz 3,48(1)   # Load arg
    sth 0,48(1)   # Store hi to stack
    sth 9,50(1)   # Store lo to stack
    blr           # Return

Or when compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

swap_words:        # RETURNS ARG UNCHANGED
    stwu 1,-16(1)  # Push stack
    addi 1,1,16    # Pop stack
    blr            # Return

Casting to char*
It is always presumed that a char* may alias any object. It is therefore quite safe, if perhaps somewhat suboptimal (on architectures with wide loads and stores), to cast a pointer of any type to a char* type.

uint32_t
swap_words( uint32_t arg )
{
    char* const cp = (char*)&arg;
    const char c0 = cp[0];
    const char c1 = cp[1];
    const char c2 = cp[2];
    const char c3 = cp[3];

    cp[0] = c2;
    cp[1] = c3;
    cp[2] = c0;
    cp[3] = c1;

    return (arg);
}

The converse is not true. Casting a char* to a pointer of any type other than char* and dereferencing it is usually a violation of the strict aliasing rule.
In other words, casting from a pointer of one type to a pointer of an unrelated type through a char* is undefined.

uint32_t
test( uint32_t arg )
{
    char* const cp = (char*)&arg;
    uint16_t* const sp = (uint16_t*)cp;

    sp[0] = 0x0001;
    sp[1] = 0x0002;

    return (arg);
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test:
    stw 3, 48(1)  # arg stored to stack
    li 0, 1       # hi = 0x0001
    li 9, 2       # lo = 0x0002
    lwz 3, 48(1)  # result = loaded from stack
    sth 0, 48(1)  # store hi to stack
    sth 9, 50(1)  # store lo to stack
    blr           # return (result) <-- RETURNS ARG UNCHANGED

As clarified by Pinskia, it is not dereferencing a char* per se that is specifically recognized as a potential alias of any object, but any address referring to a char object. This includes an array of char objects, as in the following example, which will also break the strict aliasing assumption.

char const cp[4] = { arg0, arg1, arg2, arg3 };
uint16_t* const sp = (uint16_t*)cp;

sp[0] = 0x0001;
sp[1] = 0x0002;

GCC RULE BREAKING
GCC allows type-punned values to be dereferenced at independent locations in memory (i.e. different objects) when the source of the lvalue is not directly known.

void
set_value( uint64_t* c,
           uint32_t a_val,
           uint16_t b_val )
{
    uint32_t* a = (uint32_t*)c;
    uint16_t* b = (uint16_t*)c;

    a[0] = a_val; // <--- Address of c + 0
    b[2] = b_val; // <--- Address of c + 4
    b[3] = b_val; // <--- Address of c + 6
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

set_value:
    stw 4,0(3)  # (c+0) = a_val
    sth 5,6(3)  # (c+6) = b_val
    sth 5,4(3)  # (c+4) = b_val
    blr         # return (c)

Note that any use of c[0] here would be even more clearly undefined, because it would alias the uses of a and b.

void
set_value( uint64_t* c,
           uint32_t a_val,
           uint16_t b_val )
{
    uint32_t* a = (uint32_t*)c;
    uint16_t* b = (uint16_t*)c;

    a[0] = a_val; // < Address of c + 0
    b[2] = b_val; // < Address of c + 4
    b[3] = b_val; // < Address of c + 6

    // WHAT VALUE THIS WOULD PRINT IS UNDEFINED
    printf("c = 0x%08x\n", c[0] );
}

However, when set_value is inlined (perhaps automatically), the source of c may be known; GCC will then assume the values do not alias, may reduce the expression differently, and may generate completely different code.

static inline void
set_value( uint64_t* c,
           uint32_t a_val,
           uint16_t b_val )
{
    uint32_t* a = (uint32_t*)c;
    uint16_t* b = (uint16_t*)c;

    a[0] = a_val; // <--- Address of c + 0
    b[2] = b_val; // <--- Address of c + 4
    b[3] = b_val; // <--- Address of c + 6
}

int64_t
test( int64_t a
     ,int64_t b
     ,uint32_t hi32
     ,uint16_t lo16 )
{
    int64_t c = a + b;

    set_value( &c, hi32, lo16 );

    return (c);
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test:
    add 3,3,4  # c = (a+b)
    blr        # return (c)

In this case because the object c is never accessed through any valid aliases in set_value, the expression is reduced out.
The above example will NOT currently generate any warnings with -Wstrict-aliasing=2 and will simply generate different results depending on whether or not the expression is inlined. This is another good reason to always double check the generated code. Also, when writing unit tests, it is a good idea to test a function both as an inline function and an extern function.
With GCC, strict aliasing warnings are more likely to be generated at the point where an address is taken (e.g. uint16_t* a = (uint16_t*)&b;) than with pre-existing pointers (e.g. uint16_t* a = (uint16_t*)b_ptr;). Take special care when type-punning pre-existing pointers.
Perhaps surprisingly, illegal aliasing within a loop generates completely different results. It is probably not completely accidental though, as most of the historical arguments against strict aliasing have revolved around optimized versions of functions like memset and memcpy which would cast the data to the widest available register size to minimize the trips to and from memory.

void
set_value( uint64_t* c,
           uint32_t a_val,
           uint16_t b_val,
           uint32_t count )
{
    uint32_t* a = (uint32_t*)c;
    uint16_t* b = (uint16_t*)c;
    uint32_t i = 0;

    for (i = 0; i < count; i++)
    {
        a[0] = a_val;
        b[2] = b_val;
        b[3] = b_val;
    }
}

As expected from the previous example above, this should still generate the "expected" result:

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

set_value:
    cmpwi 0, 6, 0   # done = (count == 0)
    stwu 1, -16(1)  # Push stack
    mr 9, 3         # Copy c
    beq- 0, .L7     # if (done) goto .L7
    mtctr 6         # i = count
.L8:
    stw 4, 0(9)     # a[0] = a_val
    addi 9, 9, 4    # a++
    sth 5, 4(3)     # b[2] = b_val
    sth 5, 6(3)     # b[3] = b_val
    addi 3, 3, 4    # b+=2
    bdnz .L8        # if (i) goto .L8
.L7:
    addi 1, 1, 16   # Pop stack
    blr             # return

When called inline, the previous example would suggest that the compiler, assuming c is not aliased, would also return (a + b):

int64_t
test_loop( int64_t a,
           int64_t b,
           uint32_t hi32,
           uint16_t lo16,
           uint32_t count )
{
    static int64_t c[ C_COUNT ];

    c[0] = a + b;

    set_value( c, hi32, lo16, count );

    return (c[0]);
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test_loop:
    lis 12, c.0@ha    # cloc = location of c
    mr. 0, 9          # i = count
    la 11, c.0@l(12)  # c = *cloc
    addc 10, 4, 6     # c1 = addlo (a,b)
    adde 9, 3, 5      # c2 = addhi (a,b)
    stwu 1, -16(1)    # Push stack
    stw 9, 0(11)      # c[0].hi = c2
    mr 6, 11          # a = c
    stw 10, 4(11)     # c[0].lo = c1
    mr 9, 11          # b = c
    beq- 0, .L19      # if (i==0) goto .L19
    mtctr 0           # i = count
.L20:
    stw 7, 0(9)       # a[0] = hi32
    addi 9, 9, 4      # a++
    sth 8, 4(6)       # b[2] = lo16
    sth 8, 6(6)       # b[3] = lo16
    addi 6, 6, 4      # b+=2
    bdnz .L20         # if (i) goto .L20
.L19:
    la 9, c.0@l(12)   # c = *cloc
    addi 1, 1, 16     # Pop stack
    lwz 3, 0(9)       # result.hi = c[0].hi
    lwz 4, 4(9)       # result.lo = c[0].lo
    blr               # return (result)

The result is clearly different from the original version without the loop.

It is not the existence of the loop in the source that changes the transformation, but rather the existence of a loop after the initial optimization passes. For example, GCC is fairly good at optimizing (unrolling) loops with a fixed iteration count. Examine the following example:

int64_t
test_noloop( int64_t a,
             int64_t b,
             uint32_t hi32,
             uint16_t lo16 )
{
    int64_t c = a + b;

    set_value( &c, hi32, lo16, 1 );

    return (c);
}

It wouldn't be completely outrageous to expect the above example to generate similar, albeit unrolled, code. That is unless you know to expect simple loop transformations to be done fairly early in the compilation process and alias analysis to be done later. When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test_noloop:          # <--- RETURNS (A+B)
    stwu 1,-16(1)     # Push stack
    addc 4,4,6        # c.lo = addlo(a,b)
    adde 3,3,5        # c.hi = addhi(a,b)
    addi 1,1,16       # Pop stack
    blr               # return (c)

The existence of a loop around accessed aliases, and whether or not the iteration count is known at compile time, may impact the generated code. Tests should include both constant and extern'd iteration counts.
What is surprising is that the 64 bit build of the same version of the same compiler generates different results. When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test_loop:
    li 10, 0             # i = 0
    cmplw 7, 10, 7       # done = (i==count)
    add 4, 3, 4          # sum = a + b
    ld 3, .LC0@toc(2)    # cloc = location of c
    std 4, 0(3)          # c[0] = sum
    mr 9, 3              # a = c
    mr 11, 3             # b = c
    bge- 7, .L18         # if (done) goto .L18
.L22:
    addi 0, 10, 1        # i++
    stw 5, 0(11)         # a[0] = hi32
    rldicl 10, 0, 0, 32  # i = i & 0xffffffff
    sth 6, 4(9)          # b[2] = lo16
    sth 6, 6(9)          # b[3] = lo16
    cmplw 7, 10, 7       # done = (i==count)
    addi 11, 11, 4       # a++
    addi 9, 9, 4         # b+= 2
    blt+ 7, .L22         # if (!done) goto .L22
.L18:
    ld 3,0(3)            # result = c[0]
    blr                  # return (result)

This indicates that there are significant non-obvious side-effects to building GCC as 32 bits versus 64 bits that someone might want to look into.
The platform, version number and build date (i.e. the output of gcc --version) is not sufficient information for compatibility testing. To be thorough, unit tests should be run across all builds of the same compiler version, if more than one is known to exist.
C99 Standard
This article has been pretty relaxed with its use of terminology, and there is always room for some interpretation when reading a standard. There are many additional cases not covered above, as well as compiler-specific issues to consider. Those interested in up-to-date, definitive information on the C standard should refer to ISO/IEC 9899:TC2 [open-std.org]. Here is the most relevant text from section "6.5 Expressions":


An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

* a type compatible with the effective type of the object,
* a qualified version of a type compatible with the effective type of the object,
* a type that is the signed or unsigned type corresponding to the effective type of the object,
* a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
* an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
* a character type.

Note the use of types like uint64_t and uint32_t in the above examples. For decades programmers have been creating their own integer types and reworking their header files for each platform simply to get consistent integer sizes across multiple architectures. This is because the standard does not guarantee types like int or short to be of any particular width; it only guarantees their sizes relative to each other. But finally, with C99, the debate is over. Standard width integers are now defined in stdint.h. Always use this header, and if your implementation does not have it (e.g. Microsoft), there are portable public domain versions available (e.g. This stdint.h can be used for Win32).

SUMMARY


* Strict aliasing means that two objects of different types cannot refer to the same location in memory. Enable this option in GCC with the -fstrict-aliasing flag. Be sure that all code can safely run with this rule enabled. Enable strict aliasing related warnings with -Wstrict-aliasing, but do not expect to be warned in all cases.
* In order to discover aliasing problems as quickly as possible, -fstrict-aliasing should always be included in the compilation flags for GCC. Otherwise problems may only be visible at the highest optimization levels where it is the most difficult to debug.

Be wary of code that requires the use of -fno-strict-aliasing

Thursday, August 30, 2007

GCSE

Optimisers:

* Global constant and copy propagation.
* Global common subexpression elimination.


EGCS has two implementations of global common subexpression elimination. The first is based on the classic algorithm found in most compiler texts and is generally referred to as GCSE or classic GCSE.

The second implementation is commonly known as partial redundancy elimination (PRE). PRE is a more powerful scheme for eliminating redundant computations throughout a function. Our PRE optimizer is currently based on Fred Chow's thesis.

The difference between GCSE and PRE is best illustrated with an example flow graph (please forgive the poor graphics, I'm no HTML or graphics guru).

[flow graph omitted: "exp1" is computed in blocks B2, B3 and B6; block B6 is also reachable along the path entry, B1, B7, B8, B5, B6]
This example has a computation of "exp1" in blocks B2, B3 and B6. Assume that the remaining blocks do not affect the evaluation of "exp1".

Classic GCSE will only eliminate the computation of "exp1" found in block B3 since the evaluation in block B6 can be reached via a path which does not have an earlier computation of "exp1" (entry, B1, B7, B8, B5, B6).

PRE will eliminate the evaluations of "exp1" in blocks B3 and B6; to make the evaluation in B6 redundant, PRE will insert an evaluation of "exp1" in block B8.

While PRE is a more powerful optimization, it will tend to increase code size to improve runtime performance. Therefore, when optimizing for code size, the compiler will run the classic GCSE optimizer instead of the PRE optimizer.

Global constant/copy propagation and global common subexpression elimination are enabled by default with the -O2 optimization flag. They can also be enabled with the -fgcse flag.

Thursday, August 16, 2007

sk_buff Structure

The sk_buff structure ('skb') is actually only used for storing the
metadata corresponding to a packet. The packet's data is not stored
inside the sk_buff structure itself, but in a separate buffer that is
pointed to by skb->head. The skb->end member points one byte past the
end of this data buffer.

An important design requirement for sk_buffs is being able to add data
at the end as well as at the front of the packet. As a packet travels
downwards through the network stack, each layer will usually want to
add its own header in front of the packet, and it would be nice if we
could avoid reallocating and/or copying the entire data portion of the
packet around to make more space at the front of the buffer every time
we want to do this.

To achieve this goal, the packet data is not necessarily stored at the
front of the data buffer, but some space between the front of the buffer
and the front of the packet is left unused. skb->data and skb->tail
are two extra pointers that point to the beginning and one byte past
the end of the currently used portion of the data buffer, respectively.
Both are guaranteed to point somewhere within the data buffer.
('skb->head <= skb->data <= skb->tail <= skb->end')

+----------+------------------+----------------+
| headroom | packet data | tailroom |
+----------+------------------+----------------+
skb->head skb->data skb->tail skb->end

(sorry if the figure doesn't come up well)
The function skb_headroom(skb) calculates 'skb->data - skb->head', and
indicates how many bytes we can add to the front of the packet without
having to reallocate the buffer. Similarly, skb_tailroom(skb) calculates
'skb->end - skb->tail' and indicates how many bytes we can add to the
end of the packet before having to reallocate.

Adding data to and removing data from the front of the buffer is done
with skb_push and skb_pull, respectively. These wrappers do some sanity
checks to make sure the relevant constraints on the four pointers are
maintained.

When an sk_buff is allocated by alloc_skb, skb->{head,data,tail} are all
initialised to point to the start of the data buffer. Depending on what
the skb will be used for, the caller will usually want to reserve some
headroom in anticipation of expansion of the data buffer towards the
front. This is done by calling skb_reserve().

Rough Notes on Linux Networking Stack

Table of Contents

1. Existing Optimizations
2. Packet Copies
3. ICMP Ping/Pong : Function Calls
4. Transmit Interrupts and Flow Control
5. NIC driver callbacks and ifconfig
6. Protocol Structures in the Kernel
7. skb_clone() vs. skb_copy()
8. NICs and Descriptor Rings
9. How much networking work does the ksoftirqd do?
10. Packet Requeues in Qdiscs
11. Links
12. Specific TODOs
References

1. Existing Optimizations

A great deal of thought has gone into Linux networking
implementation and many optimizations have made their way to the
kernel over the years. Some prime examples include:
* NAPI - Receive interrupts are coalesced to reduce chances of a
livelock. Thus, now each packet receive does not generate an
interrupt. Required modifications to device driver interface.
Has been in the stable kernels since 2.4.20.
* Zero-Copy TCP - Avoids the overhead of kernel-to-userspace and
userspace-to-kernel packet copying.
http://builder.com.com/5100-6372-1044112.html describes this in
some detail.

2. Packet Copies

When a packet is received, the device uses DMA to put it in main
memory (let's ignore non-DMA or non-NAPI code and drivers). An skb
is constructed by the poll() function of the device driver. After
this point, the same skb is used throughout the networking stack,
i.e., the packet is almost never copied within the kernel (it is
copied when delivered to user-space).

This design is borrowed from BSD and UNIX SVR4 - the idea is to
allocate memory for the packet only once. The packet has 4 primary
pointers - head, end, data, tail into the packet data (character
buffer). head points to the beginning of the packet - where the link
layer header starts. end points to the end of the packet. data
points to the location the current networking layer can start
reading from (i.e., it changes as the packet moves up from the link
layer, to IP, to TCP). Finally, tail is where the current protocol
layer can begin writing data to (see alloc_skb(), which sets head,
data, tail to the beginning of allocated memory block and end to
data + size).

Other implementations refer to head, end, data, tail as base, limit,
read, write respectively.

There are some instances where a packet needs to be duplicated. For
example, when running tcpdump the packet needs to be sent to the
userspace process as well as to the normal IP handler. Actually, in
this case too, a copy can be avoided since the contents of the
packet are not being modified. So instead of duplicating the packet
contents, skb_clone() is used to increase the reference count of a
packet. skb_copy() on the other hand actually duplicates the
contents of the packet and creates a completely new skb.

See also: http://oss.sgi.com/archives/netdev/2005-02/msg00125.html

A related question: When a packet is received, are the tail and end
pointers equal? Answer: NO. This is because memory for packets
received is allocated before the packet is received, and the address
and size of this memory is communicated to the NIC using receive
descriptors - so that when it is actually received the NIC can use
DMA to transfer the packet to main memory. The size allocated for a
received packet is a function of the MTU of the device. The size of
an Ethernet frame actually received could be anything less than the
MTU. Thus, tail of a received packet will point to the end of the
received data while end will point to the end of the memory
allocated for the packet.

3. ICMP Ping/Pong : Function Calls

Code path (functions called) when an ICMP ping is received (and
corresponding pong goes out), for linux 2.6.9: First the packet is
received by the NIC, and its interrupt handler will ultimately cause
net_rx_action() to be called (NAPI, [1]). This will call the device
driver's poll function which will submit packets (skb's) to the
networking stack via netif_receive_skb. The rest is outlined below:
1. ip_rcv() --> ip_rcv_finish()
2. dst_input() --> skb->dst->input = ip_local_deliver()
3. ip_local_deliver() --> ip_local_deliver_finish()
4. ipprot->handler = icmp_rcv()
5. icmp_pointers[ICMP_ECHO].handler == icmp_echo() -- At this point
I guess you could say that the "receive" path is complete, the
packet has reached the top. Now the outbound (down the stack)
journey begins.
6. icmp_reply() -- Might want to look into the checks this function
does
7. icmp_push_reply()
8. ip_push_pending_frames()
9. dst_output() --> skb->dst->output = ip_output()
10. ip_output() --> ip_finish_output() --> ip_finish_output2()
11. dst->neighbour->output ==

4. Transmit Interrupts and Flow Control

Transmit interrupts are generated after every packet transmission
and this is key to flow control. However, this does have significant
performance implications under heavy transmit-related I/O (imagine a
packet forwarder where the number of transmitted packets is equal to
the number of received ones). Each device provides a means to slow
down transmit (Tx) interrupts. For example, Intel's e1000 driver
exposes "TxIntDelay" that allows transmit interrupts to be delayed
in units of 1.024 microseconds. The default value is 64; thus,
under heavy transmission, interrupts are spaced 65.536
microseconds apart. Imagine the number of transmissions that can
take place in this time.

5. NIC driver callbacks and ifconfig

Interfaces are configured using the ifconfig command. Many of these
commands will result in a function of the NIC driver being called.
For example, ifconfig eth0 up should result in the device driver's
open() function being called (open is a member of struct
net_device). ifconfig communicates with the kernel through ioctl()
on any socket. The requests are a struct ifreq (see
/usr/include/net/if.h and
http://linux.about.com/library/cmd/blcmdl7_netdevice.htm). Thus,
ifconfig eth0 up will result in the following:
1. A socket (of any kind) is opened using socket()
2. A struct ifreq is prepared with ifr_ifname set to "eth0"
3. An ioctl() with request SIOCGIFFLAGS is done to get the current
flags and then the IFF_UP and IFF_RUNNING flags are set with
another ioctl() (with request SIOCSIFFLAGS).
4. Now we're inside the kernel. sock_ioctl() is called, which in
turn calls dev_ioctl() (see net/socket.c and net/core/dev.c)
5. dev_ioctl() --> ... --> dev_open() --> driver's open()
implementation.

6. Protocol Structures in the Kernel

There are various structs in the kernel which consist of function
pointers for protocol handling. Different structures correspond to
different layers of protocols as well as whether the functions are
for synchronous handling (e.g., when recv(), send() etc. system
calls are made) or asynchronous handling (e.g., when a packet
arrives at the interface and it needs to be handled). Here is what I
have gathered about the various structures so far:
* struct packet_type - includes instantiations such as
ip_packet_type, ipv6_packet_type etc. These provide low-level,
asynchronous packet handling. When a packet arrives at the
interface, the driver ultimately submits it to the networking
stack by a call to netif_receive_skb(), which iterates over the
list of registered packet handlers and submits the skb to them.
For example, ip_packet_type.func = ip_rcv, so ip_rcv() is where
one can say the IP protocol first receives a packet that has
arrived at the interface. Packet-types are registered with the
networking stack by a call to dev_add_pack().
* struct net_proto_family - includes instantiations such as
inet_family_ops, packet_family_ops etc. Each net_proto_family
structure handles one type of address family (PF_INET etc.).
This structure is associated with a BSD socket (struct socket)
and not the networking layer representation of sockets (struct
sock). It essentially provides a create() function which is
called in response to the socket() system call. The
implementation of create() for each family typically allocates
the struct sock and also associates other synchronous operations
(see struct proto_ops below) with the socket. To cut a long
story short - net_proto_family provides the protocol-specific
part of the socket() system call. (NOTE: Not all BSD sockets
will have a networking socket associated with it. For example,
unix sockets (the PF_UNIX address family).
unix_family_ops.create = unix_create does not allocate a struct
sock). The net_proto_family structure is registered with the
networking stack by a call to sock_register().
* struct proto_ops - includes instantiations such as
inet_stream_ops, inet_dgram_ops, packet_ops etc. These provide
implementations of networking layer synchronous calls
(connect(), bind(), recvmsg(), ioctl() etc. system calls). The
ops member of the BSD socket structure (struct socket) points to
the proto_ops associated with the socket. Unlike the above two
structures, there is no function that explicitly registers a
struct proto_ops with the networking stack. Instead, the
create() implementation of struct net_proto_family just sets the
ops field of the BSD socket to the appropriate proto_ops
structure.
* struct proto - includes instantiations such as tcp_prot,
udp_prot, raw_prot. These provide protocol handlers inside a
network family. It seems that currently this means only over-IP
protocols as I could find only the above three instantiations.
These also provide implementations for synchronous calls. The
sk_prot field of the networking socket (struct sock) points to
such a structure. The sk_prot field would get set by the create
function in struct net_proto_family and the functions provided
will be called by the implementations of functions in the struct
proto_ops structure. For example, inet_family_ops.create =
inet_create allocates a struct sock and would set sk_prot =
udp_prot in response to a socket(PF_INET, SOCK_DGRAM, 0); system
call. A recvfrom() system call made on the socket would then
invoke inet_dgram_ops.recvmsg = sock_common_recvmsg, which calls
sk_prot->recvmsg = udp_recvmsg. Like proto_ops, struct proto
instances aren't explicitly "registered" with the networking
stack using a function, but are "registered" by the BSD socket
create() implementation in the struct net_proto_family.
* struct net_protocol - includes instantiations such as
tcp_protocol, udp_protocol, icmp_protocol etc. These provide
asynchronous packet receive routines for IP protocols. Thus,
this structure is specific to the inet-family of protocols.
Handlers are registered using inet_add_protocol(). This
structure is used by the IP-layer routines to hand off to a
layer 4 protocol. Specifically, the IP handler (ip_rcv()) will
invoke ip_local_deliver_finish() for packets that are to be
delivered to the local host. ip_local_deliver_finish() uses a
hash table (inet_protos) to decide which function to pass the
packet to based on the protocol field in the IP header. The hash
table is populated by the call to inet_add_protocol().

7. skb_clone() vs. skb_copy()

When a packet needs to be delivered to two separate handlers (for
example, the IP layer and tcpdump), then it is "cloned" by
incrementing the reference count of the packet instead of being
"copied". Now, though the two handlers are not expected to modify
the packet contents, they can change the data pointer. So, how do we
ensure that processing by one of the handlers doesn't mess up the
data pointer for the other?

A. Umm... skb_clone means that there are separate head, tail, data,
end etc. pointers. The difference between skb_copy() and skb_clone()
is precisely this - the former copies the packet completely, while
the latter uses the same packet data but separate pointers into the
packet.

8. NICs and Descriptor Rings

NOTE: Using the Intel e1000, driver source version 5.6.10.1, as an
example. Each transmission/reception has a descriptor - a "handle"
used to access buffer data somewhat like a file descriptor is a
handle to access file data. The descriptor format would be NIC
dependent as the hardware understands and reads/writes to the
descriptor. The NIC maintains a circular ring of descriptors, i.e.,
the number of descriptors for TX and RX is fixed (TxDescriptors,
RxDescriptors module parameters for the e1000 kernel module) and the
descriptors are used like a circular queue.

Thus, there are three structures:
* Descriptor Ring (struct e1000_desc_ring) - The list of
descriptors. So, ring[0], ring[1] etc. are individual
descriptors. The ring is typically allocated just once and thus
the DMA mapping of the ring is "consistent". Each descriptor in
the ring will thus have a fixed DMA and memory address. In the
e1000, the device registers TDBAL, TDBAH, TDLEN stand for
"Transmit Descriptors Base Address Low", "High" and "Length" (in
bytes of all descriptors). Similarly, there are RDBAL, RDBAH,
RDLEN
* Descriptors (struct e1000_rx_desc and struct e1000_tx_desc) -
Essentially, this stores the DMA address of the buffer which
contains actual packet data, plus some other accounting
information such as the status (transmission successful?
receive complete? etc.), errors etc.
* Buffers - Now actual data cannot have a "consistent" DMA
mapping, meaning we cannot ensure that all skbuffs for a
particular device always have some specific memory addresses
(those that have been setup for DMA). Instead, "streaming" DMA
mappings need to be used. Each descriptor thus contains the DMA
address of a buffer that has been setup for streaming mapping.
The hardware uses that DMA address to pickup a packet to be sent
or to place a received packet. Once the kernel's stack picks up
the buffer, it can allocate new resources (a new buffer) and
tell the NIC to use that buffer next time by setting up a new
streaming mapping and putting the new DMA handle in the
descriptor.
The e1000 uses a struct e1000_buffer as a wrapper around the
actual buffer. The DMA mapping however is setup only for
skb->data, i.e., where raw packet data is to be placed.

9. How much networking work does the ksoftirqd do?

Consider what the NET_RX_SOFTIRQ does:
1. Each softirq invocation (do_softirq()) processes up to
net.core.netdev_max_backlog x MAX_SOFTIRQ_RESTART packets, if
available. The default values lead to 300 x 10 = 3000 pkts.
2. Every interrupt calls do_softirq() when exiting (irq_exit()) -
including the timer interrupt and NMIs too?
3. Default transmit/receive ring sizes on the NIC are less than
3000 (the e1000 for example defaults to 256 and can have at most
4096 descriptors on its ring)

Thus, the number of times ksoftirqd will be switched in/out depends
on how much processing is done by do_softirq() invocations on
irq_exit(). If the softirq handling on interrupt is able to clean up
the NIC ring faster than a new packet comes in, then ksoftirqd won't
be doing anything. Specifically, if the inter-packet-gap is greater
than the time it takes to pick-up and process a single packet from
the NIC, then ksoftirqd will not be scheduled (provided the number
of descriptors on the NIC is less than 3000).

Without going into details, some quick experimental verification:
Machine A continuously generates UDP packets for Machine B which is
running a "sink" application, i.e., it just loops on a recvfrom().
When the size of the packet sent from A was 60 bytes (and
inter-packet gap averaged 1.5µs), then the ksoftirqd thread on B
observed a total of 375 context switches (374 involuntary and 1
voluntary). When the packet size was 1280 bytes (and now
inter-packet gap increased almost 7 times to 10µs) then the
ksoftirqd thread was NEVER scheduled (0 context switches). The
single voluntary context switch in the former case probably happened
after all packets were processed (i.e., the sender stopped sending
and the receiver processed all that it got).

10. Packet Requeues in Qdiscs

The queueing discipline (struct Qdisc) provides a requeue().
Typically, packets are dequeued from the qdisc and submitted to the
device driver (the hard_start_xmit function in struct net_device).
However, at times it is possible that the device driver is "busy",
so the dequeued packet must be "requeued". "Busy" here means that
the xmit_lock of the device was held. It seems that this lock is
acquired at two places: (1) qdisc_restart() and (2) dev_watchdog().
The former handles packet dequeueing from the qdisc, acquiring the
xmit_lock and then submitting the packet to the device driver
(hard_start_xmit()) or alternatively requeuing the packet if the
xmit_lock was already held by someone else. The latter is invoked
asynchronously and periodically - it's part of the watchdog timer
mechanism.

My understanding is that two threads cannot be in qdisc_restart()
for the same qdisc at the same time, however the xmit_lock may have
been acquired by the watchdog timer function causing a requeue.

11. Links

This is just a dump of links that might be useful.
* http://www.spec.org and SpecWeb http://www.spec.org/web99/
* linux-net and netdev mailing lists:
http://www.ussg.iu.edu/hypermail/linux/net/ and
http://oss.sgi.com/projects/netdev/archive/
* Linux Traffic Control HOWTO

12. Specific TODOs

* Study watchdog timer mechanism and figure out how flow control
is implemented in the receive and transmit side


Friday, August 10, 2007

volatile

The use of volatile is poorly understood by many programmers. This is not surprising, as most C texts dismiss it in a sentence or two.

Have you experienced any of the following in your C/C++ embedded code?

* Code that works fine-until you turn optimization on
* Code that works fine-as long as interrupts are disabled
* Flaky hardware drivers
* Tasks that work fine in isolation-yet crash when another task is enabled

If you answered yes to any of the above, it's likely that you didn't use the C keyword volatile. You aren't alone. The use of volatile is poorly understood by many programmers. This is not surprising, as most C texts dismiss it in a sentence or two.

volatile is a qualifier that is applied to a variable when it is declared. It tells the compiler that the value of the variable may change at any time-without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. However, before we examine them, let's take a look at the syntax.

Syntax

To declare a variable volatile, include the keyword volatile before or after the data type in the variable definition. For instance both of these declarations will declare foo to be a volatile integer:

volatile int foo;
int volatile foo;

Now, it turns out that pointers to volatile variables are very common. Both of these declarations declare foo to be a pointer to a volatile integer:

volatile int * foo;
int volatile * foo;

Volatile pointers to non-volatile variables are very rare (I think I've used them once), but I'd better go ahead and give you the syntax:

int * volatile foo;

And just for completeness, if you really must have a volatile pointer to a volatile variable, then:

int volatile * volatile foo;

Incidentally, for a great explanation of why you have a choice of where to place volatile and why you should place it after the data type (for example, int volatile * foo), consult Dan Saks' column, "Top-Level cv-Qualifiers in Function Parameters" (February 2000, p. 63).

Finally, if you apply volatile to a struct or union, the entire contents of the struct/union are volatile. If you don't want this behavior, you can apply the volatile qualifier to the individual members of the struct/union.

Use

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

* Memory-mapped peripheral registers
* Global variables modified by an interrupt service routine
* Global variables within a multi-threaded application

Peripheral registers

Embedded systems contain real hardware, usually with sophisticated peripherals. These peripherals contain registers whose values may change asynchronously to the program flow. As a very simple example, consider an 8-bit status register at address 0x1234. It is required that you poll the status register until it becomes non-zero. The naive and incorrect implementation is as follows:

UINT1 * ptr = (UINT1 *) 0x1234;

// Wait for register to become non-zero.
while (*ptr == 0);
// Do something else.

This will almost certainly fail as soon as you turn the optimizer on, since the compiler will generate assembly language that looks something like this:

mov ptr, #0x1234
mov a, @ptr
loop bz loop

The rationale of the optimizer is quite simple: having already read the variable's value into the accumulator (on the second line), there is no need to reread it, since the value will always be the same. Thus, in the third line, we end up with an infinite loop. To force the compiler to do what we want, we modify the declaration to:

UINT1 volatile * ptr =
(UINT1 volatile *) 0x1234;

The assembly language now looks like this:

mov ptr, #0x1234
loop mov a, @ptr
bz loop

The desired behavior is achieved.

Subtler problems tend to arise with registers that have special properties. For instance, a lot of peripherals contain registers that are cleared simply by reading them. Extra (or fewer) reads than you intend can cause quite unexpected results in these cases.

Interrupt service routines

Interrupt service routines often set variables that are tested in main line code. For example, a serial port interrupt may test each received character to see if it is an ETX character (presumably signifying the end of a message). If the character is an ETX, the ISR might set a global flag. An incorrect implementation of this might be:

int etx_rcvd = FALSE;

void main()
{
...
while (!etx_rcvd)
{
// Wait
}
...
}

interrupt void rx_isr(void)
{
...
if (ETX == rx_char)
{
etx_rcvd = TRUE;
}
...
}

With optimization turned off, this code might work. However, any half-decent optimizer will "break" the code. The problem is that the compiler has no idea that etx_rcvd can be changed within an ISR. As far as the compiler is concerned, the expression !etx_rcvd is always true, and, therefore, you can never exit the while loop. Consequently, all the code after the while loop may simply be removed by the optimizer. If you are lucky, your compiler will warn you about this. If you are unlucky (or you haven't yet learned to take compiler warnings seriously), your code will fail miserably. Naturally, the blame will be placed on a "lousy optimizer."

The solution is to declare the variable etx_rcvd to be volatile. Then all of your problems (well, some of them anyway) will disappear.

Multi-threaded applications

Despite the presence of queues, pipes, and other scheduler-aware communications mechanisms in real-time operating systems, it is still fairly common for two tasks to exchange information via a shared memory location (that is, a global). When you add a pre-emptive scheduler to your code, your compiler still has no idea what a context switch is or when one might occur. Thus, another task modifying a shared global is conceptually identical to the problem of interrupt service routines discussed previously. So all shared global variables should be declared volatile. For example:

int cntr;

void task1(void)
{
cntr = 0;
while (cntr == 0)
{
sleep(1);
}
...
}

void task2(void)
{
...
cntr++;
sleep(10);
...
}

This code will likely fail once the compiler's optimizer is enabled. Declaring cntr to be volatile is the proper way to solve the problem.

Final thoughts

Some compilers allow you to implicitly declare all variables as volatile. Resist this temptation, since it is essentially a substitute for thought. It also leads to potentially less efficient code.

Also, resist the temptation to blame the optimizer or turn it off. Modern optimizers are so good that I cannot remember the last time I came across an optimization bug. In contrast, I come across failures to use volatile with depressing frequency.

If you are given a piece of flaky code to "fix," perform a grep for volatile. If grep comes up empty, the examples given here are probably good places to start looking for problems.

Wednesday, July 25, 2007

Configuration options on 2.6.x kernels

On the 2.6.x kernels there are four main frontend programs: config, oldconfig, menuconfig, and xconfig.

config is the least user-friendly option as it merely presents a series of questions that must be answered sequentially. Alas, if an error is made you must begin the process from the top. Pressing Enter will accept the default entry which is in upper case.

oldconfig will read the defaults from an existing .config and rewrite necessary links and files. Use this option if you've made minor changes to source files or need to script the rebuild process.

menuconfig is an ncurses based frontend. Your system must have the ncurses-devel libraries installed in order to use this utility. As the help text at the top of the screen indicates, use the arrow keys to navigate the menu. Press Enter to select sub-menus. Press the highlighted letter on each option to jump directly to that option. To build an option directly into the kernel, press Y. To disable an option entirely, press N. To build an option as a loadable module, press M. You can also access content-specific help screens by pressing ? on each page or selecting HELP from the lower menu. Figure 1 in the Section called Configuring the 2.4.x kernels shows an example screen from the 2.4.x kernel series.

xconfig is a graphical frontend using qconf by Roman Zippel. It requires the qt and X libraries to build and use. The interface is intuitive and customizable. Online help is automatically shown for each kernel configuration option. It also can show dependency information for each module which can help diagnose build errors. Figure 3 shows an example of the xconfig screen. From the online help:


For each option, a blank box indicates the feature is disabled, a check
indicates it is enabled, and a dot indicates that it is to be compiled
as a module. Clicking on the box will cycle through the three states.
If you do not see an option (e.g., a device driver) that you believe
should be present, try turning on Show All Options under the Options menu.
Although there is no cross reference yet to help you figure out what other
options must be enabled to support the option you are interested in, you can
still view the help of a grayed-out option.

--qconf help

Figure 3. make xconfig

Once you have decided which configuration option to use, start the process with make followed by either config, menuconfig, or xconfig. For example:

$ make menuconfig


The system will take a few moments to build the configuration utility. Next you will be presented with the configuration menus. Though similar to the 2.4.x series, the 2.6.x menu is more logically organized with better grouping of sub-modules. Following are some of the top level configuration options in the 2.6 kernel.

Code Maturity Level Options

This option allows configuration of alpha-quality software or obsolete drivers. It is best to disable this option if the kernel is intended for a stable production system. If you require an experimental feature in the kernel, such as a driver for new hardware, then enable this option, but be aware that such code "may not meet the normal level of reliability" of more rigorously tested code.
General Setup

This section contains options for sysctl support, a feature allowing run-time configuration of the kernel. A new feature, kernel .config support, allows the complete kernel configuration to be viewed during run-time. This addresses many requests to be able to see what features were compiled into the kernel.
Loadable Module Support

You will almost certainly want to enable module support. If you will need third-party kernel modules you will also need to enable Set Version Information on All Module Symbols.
Processor Type and Features

This is perhaps the most important configuration choice. Determine your processor type by examining /proc/cpuinfo and use that information here to select the appropriate processor. Included in this submenu are features such as Preemptible Kernel which can improve desktop responsiveness, Symmetric Multi-processing Support for machines with multiple CPUs, and High Memory Support for machines with more than 1G of RAM.
Power Management Options

Found here are options for ACPI and CPU Frequency Scaling which can dramatically improve laptop power usage. Read the Documentation/power file for more information.
Bus Options (PCI, PCMCIA, EISA, MCA, ISA)

Here are found options for all system bus devices. On modern machines the ISA and MCA support can often be disabled.
Executable File Formats

Interesting features here include the kernel support for miscellaneous binaries, which can allow seamless operation of non-Linux binaries with a little userland help.
Device Drivers

All the device-driver configuration options that were previously scattered throughout the 2.4.x menus are now neatly organized under this option. Features such as SCSI support, graphic card optimizations, sound, USB and other hardware are configured here.
File Systems

Found here are options for filesystems which are supported by the kernel, such as EXT2 and ReiserFS. It is best to build support for the root filesystem directly into the kernel rather than as a module.
Security Options

Interesting options here include support for NSA Security Enhanced Linux and other, somewhat experimental, features to increase security.

Tuesday, July 24, 2007

Dynamic Kernels: Modularized Device Drivers

Although this is a very old article, it gives you a lot of knowledge about device drivers.
Some of the things mentioned here have changed since.

By Alessandro Rubini on Fri, 1996-03-01 02:00. SysAdmin
This is the first in a series of four articles co-authored by Alessandro Rubini and Georg Zezschwitz which present a practical approach to writing Linux device drivers as kernel loadable modules. This installment presents an introduction to the topic, preparing the reader to understand next month's installment.
Kernel modules are a great feature of recent Linux kernels. Although most users feel modules are only a way to free some memory by keeping the floppy driver out of the kernel most of the time, the real benefit of using modules is support for adding additional devices without patching the kernel source. In the next few Kernel Korners Georg Zezschwitz and I will try to introduce the ``art'' of writing a powerful module---while avoiding common design errors.
What is a device?
A device driver is the lowest level of the software that runs on a computer, as it is directly bound to the hardware features of the device.
The concept of ``device driver'' is quite abstract, actually, and the kernel can be considered as one big device driver for a device called ``computer''. Usually, however, you don't consider your computer a monolithic entity, but rather a CPU equipped with peripherals. The kernel can thus be considered as an application running on top of the device drivers: each driver manages a single part of the computer, while the kernel proper builds process scheduling and file-system access on top of the available devices.
See Figure 1.
A few mandatory drivers are ``hardwired'' in the kernel, such as the processor driver and the memory driver; the others are optional, and the computer is usable both with and without them---although a kernel with neither the console driver nor the network driver is pointless for a conventional user.
The description above is somewhat simplistic, and slightly philosophical too. Real drivers interact in a complex way and a clean distinction among them is sometimes difficult to achieve.
In the Unix world things like the network driver and a few other complex drivers belong to the kernel, and the name of device driver is reserved to the low-level software interface to devices belonging to the following three groups:
character devices
Those which can be considered files, in that they can be read-from and/or written-to. The console (i.e. monitor and keyboard) and the serial/parallel ports are examples of character devices. Files like /dev/tty0 and /dev/cua0 provide user access to the device. A char device usually can only be accessed sequentially.
block devices
Historically: devices which can be read and written only in multiples of the block-size, often 512 or 1024 bytes. These are devices on which you can mount a filesystem, most notably disks. Files like /dev/hda1 provide access to the devices. Blocks of block devices are cached by the buffer cache. Unix provides uncached character devices corresponding to block devices, but Linux does not.
network interfaces
Network interfaces don't fall into the device-file abstraction. Network interfaces are identified by means of a name (such as eth0 or plip1) but they are not mapped to the filesystem. It would be theoretically possible, but it is impractical from a programming and performance standpoint; a network interface can only transfer packets, and the file abstraction does not efficiently manage structured data like packets.

The description above is rather sketchy, and each flavour of Unix differs in some details about what is a block device. It doesn't make too much of a difference, actually, because the distinction is only relevant inside the kernel, and we aren't going to talk about block drivers in detail.
What is missing in the previous representation is that the kernel also acts as a library for device drivers; drivers request services from the kernel. Your module will be able to call functions to perform memory allocation, filesystem access, and so on.
As far as loadable modules are concerned, any of the three driver-types can be constructed as a module. You can also build modules to implement filesystems, but this is outside of our scope.
These columns will concentrate on character device drivers, because special (or home-built) hardware fits the character device abstraction most of the time. There are only a few differences between the three types, and so to avoid confusion, we'll only cover the most common type.
You can find an introduction to block drivers in issues 9, 10, and 11 of Linux Journal, as well as in the Linux Kernel Hackers' Guide. Although both are slightly outdated, taken together with these columns, they should give you enough information to get started.
What is a module?
A module is a code segment which registers itself with the kernel as a device driver, is called by the kernel in order to communicate with the device, and in turn invokes other kernel functions to accomplish its tasks. Modules utilize a clean interface between the ``kernel proper'' and the device, which both makes the modules easy to write and keeps the kernel source code from being cluttered.
The module must be compiled to object code (no linking; leave the compiled code in .o files), and then loaded into the running kernel with insmod. The insmod program is a runtime linker, which resolves any undefined symbols in the module to addresses in the running kernel by means of the kernel symbol table.
This means that you can write a module much like a conventional C-language program, and you can call functions you don't define, in the same way you usually call printf() and fopen() in your application. However, you can count only on a minimal set of external functions, which are the public functions provided by the kernel. insmod will put the right kernel-space addresses in your compiled module wherever your code calls a kernel function, then insert the module into the running Linux kernel.
If you are in doubt whether a kernel-function is public or not, you can look for its name either in the source file /usr/src/linux/kernel/ksyms.c or in the run-time table /proc/ksyms.
To use make to compile your module, you'll need a Makefile as simple as the following one:

TARGET = myname

ifdef DEBUG
# -O is needed, because of "extern inline"
# Add -g if your gdb is patched and can use it
CFLAGS = -O -DDEBUG_$(TARGET) -D__KERNEL__ -Wall
else
CFLAGS = -O3 -D__KERNEL__ -fomit-frame-pointer
endif

all: $(TARGET).o

As you see, no special rule is needed to build a module, only the correct value for CFLAGS. I recommend that you include debugging support in your code, because without patches, gdb isn't able to take advantage of the symbol information provided by the -g flag to a module while it is part of the running kernel.
Debugging support will usually mean extra code to print messages from within the driver. Using printk() for debugging is powerful, and the alternatives are running a debugger on the kernel, peeking in /dev/mem, and other extremely low-level techniques. There are a few tools available on the Internet to help use these other techniques, but you need to be conversant with gdb and be able to read real kernel code in order to benefit from them. The most interesting tool at time of writing is kdebug-1.1, which lets you use gdb on a running kernel, examining or even changing kernel data structures (including those in loaded kernel modules) while the kernel is running. Kdebug is available for ftp from sunsite.unc.edu and its mirrors under /pub/Linux/kernel.
Just to make things a little harder, the kernel equivalent of the standard printf() function is called printk(), because it does not work exactly the same as printf(). Before 1.3.37, conventional printk()'s generated lines in /var/adm/messages, while later kernels dump them to the console. If you want quiet logging (only within the messages file, via syslogd) you must prepend the symbol KERN_DEBUG to the format string. KERN_DEBUG and similar symbols are simply strings, which get concatenated to your format string by the compiler. This means that you must not put a comma between KERN_DEBUG and the format string. These symbols can be found in <linux/kernel.h>, and are documented there. Also, printk() does not support floating point formats.
Remember that syslogd writes to the messages file as soon as possible, in order to save all messages on disk in case of a system crash. This means that an over-printk-ed module will slow down perceptibly, and will fill your disk in a short time.
Almost any module misbehaviour will generate an Oops message from the kernel. An Oops is what happens when the kernel gets an exception from kernel code. In other words, Oopses are the equivalent of segmentation faults in user space, though no core file is generated. The result is usually the sudden destruction of the responsible process, and a few lines of low-level information in the messages file. Most Oops messages are the result of dereferencing a NULL pointer.
This way to handle disasters is a friendly one, and you'll enjoy it whenever your code is faulty: most other Unices produce a kernel panic instead. This does not mean that Linux never panics. You must be prepared to generate panics whenever you write functions that operate outside of a process context, such as within interrupt handlers and timer callbacks.
The scarce, nearly unintelligible information included with the Oops message represents the processor state when the code faulted, and can be used to understand where the error is. A tool called ksymoops is able to print more readable information out of the oops, provided you have a kernel map handy. The map is what is left in /usr/src/linux/System.map after a kernel compilation. Ksymoops was distributed within util-linux-2.4, but was removed in 2.5 because it has been included in the kernel distribution during the linux-1.3 development.
If you really understand the Oops message, you can use it as you want, like invoking gdb off-line to disassemble the whole responsible function. If you understand neither the Oops nor the ksymoops output, you'd better add some more debugging printk() code, recompile, and reproduce the bug.
The following code can ease management of debugging messages. It must reside in the module's public include file, and will work for both kernel code (the module) and user code (applications). Note however that this code is gcc-specific. Not too big a problem for a kernel module, which is gcc-dependent anyway. This code was suggested by Linus Torvalds, as an enhancement over my previous ANSI-compliant approach.
#ifndef PDEBUG
# ifdef DEBUG_modulename
# ifdef __KERNEL__
# define PDEBUG(fmt, args...) printk (KERN_DEBUG fmt , ## args)
# else
# define PDEBUG(fmt, args...) fprintf (stderr, fmt , ## args)
# endif
# else
# define PDEBUG(fmt, args...)
# endif
#endif

#ifndef PDEBUGG
# define PDEBUGG(fmt, args...)
#endif

After this code, every PDEBUG("any %i or %s...\n", i, s); in the module will result in a printed message only if the code is compiled with -DDEBUG_modulename, while PDEBUGG() with the same arguments will expand to nothing. In user mode applications, it works the same, except that the message is printed to stderr instead of the messages file.
Using this code, you can enable or disable any message by removing or adding a single G character.
Writing Code
Let's look at what kind of code must go inside the module. The simple answer is "whatever you need". In practice, you must remember that the module is kernel code, and must fit a well-defined interface with the rest of Linux.
Usually, you start with header inclusion. And here you begin to have constraints: you must always define the __KERNEL__ symbol before including any header unless it is defined in your makefile, and you must only include files pertaining to the <linux/...> and <asm/...> hierarchies. Sure, you can include your module-specific header, but never, ever, include library-specific files, such as <stdio.h> or other C library headers.
The code fragment in Listing 1 represents the first lines of source of a typical character driver. If you are going to write a module, it will be easier to cut and paste these lines from existing source rather than copying them by hand from this article.
#define __KERNEL__          /* kernel code */

#define MODULE              /* always as a module */
#include <linux/module.h>   /* can't do without it */
#include <linux/version.h>  /* and this too */

/*
 * Then include whatever header you need.
 * Most likely you need the following:
 */
#include <linux/types.h>    /* ulong and friends */
#include <linux/sched.h>    /* current, task_struct, other goodies */
#include <linux/fcntl.h>    /* O_NONBLOCK etc. */
#include <linux/errno.h>    /* return values */
#include <linux/ioport.h>   /* request_region() */
#include <linux/utsname.h>  /* system name and global items */
#include <linux/malloc.h>   /* kmalloc, kfree */

#include <asm/io.h>         /* inb() inw() outb() ... */
#include <asm/segment.h>    /* unreadable, but useful */

#include "modulename.h"     /* your own material */

After including the headers comes the actual code. Before talking about specific driver functionality, which accounts for most of the code, it is worth noting that there exist two module-specific functions, which must be defined in order for the module to be loaded:
int init_module (void);
void cleanup_module (void);

The first is in charge of module initialization (looking for the related hardware and registering the driver in the appropriate kernel tables), while the second is in charge of releasing any resources the module has allocated and deregistering the driver from the kernel tables.
If these functions are not there, insmod will fail to load your module.
The init_module() function returns 0 on success and a negative value on failure. The cleanup_module() function returns void, because it only gets invoked when the module is known to be unloadable. A kernel module keeps a usage count, and cleanup_module() is only called when that counter's value is 0 (more on this later on).
Skeletal code for these two functions will be presented in the next installment. Their design is fundamental for proper loading and unloading of the module, and a few details must be dealt with. So here I'll introduce each of those details, so that next month I can present the structure without pausing to explain them.

contd....