This preprocessor measures Snort’s real-time and theoretical maximum performance. Whenever this preprocessor is
turned on, it should have an output mode enabled, either “console” which prints statistics to the console window or
“file” with a file name, where statistics get printed to the specified file name. By default, Snort’s real-time statistics
are processed. This includes:
• Time Stamp
• Drop Rate
• Mbits/Sec (wire) [duplicated below for easy comparison with other rates]
• Alerts/Sec
• K-Pkts/Sec (wire) [duplicated below for easy comparison with other rates]
• Avg Bytes/Pkt (wire) [duplicated below for easy comparison with other rates]
• Pat-Matched [percent of data received that Snort processes in pattern matching]
• Syns/Sec
• SynAcks/Sec
• New Sessions Cached/Sec
• Sessions Del fr Cache/Sec
• Current Cached Sessions
• Max Cached Sessions
• Stream Flushes/Sec
• Stream Session Cache Faults
• Stream Session Cache Timeouts
• New Frag Trackers/Sec
• Frag-Completes/Sec
• Frag-Inserts/Sec
• Frag-Deletes/Sec
• Frag-Auto Deletes/Sec [memory DoS protection]
• Frag-Flushes/Sec
• Frag-Current [number of current Frag Trackers]
• Frag-Max [max number of Frag Trackers at any time]
• Frag-Timeouts
• Frag-Faults
Number of CPUs [*** Only if compiled with LINUX SMP ***, the next three appear for each CPU]
• CPU usage (user)
• CPU usage (sys)
• CPU usage (Idle)
• Mbits/Sec (wire) [average mbits of total traffic]
• Mbits/Sec (ipfrag) [average mbits of IP fragmented traffic]
• Mbits/Sec (ipreass) [average mbits Snort injects after IP reassembly]
• Mbits/Sec (tcprebuilt) [average mbits Snort injects after stream4 reassembly]
• Mbits/Sec (applayer) [average mbits seen by rules and protocol decoders]
• Avg Bytes/Pkt (wire)
• Avg Bytes/Pkt (ipfrag)
• Avg Bytes/Pkt (ipreass)
• Avg Bytes/Pkt (tcprebuilt)
• Avg Bytes/Pkt (applayer)
• K-Pkts/Sec (wire)
• K-Pkts/Sec (ipfrag)
• K-Pkts/Sec (ipreass)
• K-Pkts/Sec (tcprebuilt)
• K-Pkts/Sec (applayer)
• Total Packets Received
• Total Packets Dropped (not processed)
• Total Packets Blocked (inline)
The following options can be used with the performance monitor:
• flow - Prints out statistics about the type of traffic and protocol distributions that Snort is seeing. This option
can produce large amounts of output.
• events - Turns on event reporting. This prints out statistics as to the number of signatures that were matched
by the setwise pattern matcher (non-qualified events) and the number of those matches that were verified with
the signature flags (qualified events). This shows the user if there is a problem with the rule set that they are
running.
• max - Turns on the theoreticalmaximumperformance that Snort calculates given the processor speed and current
performance. This is only valid for uniprocessor machines, since many operating systems don’t keep accurate
kernel statistics for multiple CPUs.
• console - Prints statistics at the console. This is enabled by default.
• file - Prints statistics in a comma-delimited format to the file that is specified. Not all statistics are output to
this file. You may also use snortfile which will output into your defined Snort log directory. Both of these
directives can be overridden on the command line with the -Z or --perfmon-file options.
• pktcnt - Adjusts the number of packets to process before checking for the time sample. This boosts performance,
since checking the time sample reduces Snort’s performance. By default, this is 10000.
• time - Represents the number of seconds between intervals.
• accumulate or reset - Defines which type of drop statistics are kept by the operating system. By default,
accumulate is used.
• atexitonly - Dump stats for entire life of Snort.
Examples
preprocessor perfmonitor: time 30 events flow file stats.profile max \
console pktcnt 10000
preprocessor perfmonitor: time 300 file /var/tmp/snortstat pktcnt 10000
The output of the log file is in fomr of numbers separated by commas.
Graphs can be generated by the perfmon-graph perl script which is located at
http://www.mtsac.edu/~jgau/Download/src/. It requires the rrdtool to be installed in the system which can be downloaded from the same link.
Wednesday, April 30, 2008
Tuesday, April 22, 2008
PCAP - API for Packet Capturing
Getting Started: The format of a pcap application
The first thing to understand is the general layout of a pcap sniffer. The flow of code is as follows:
1. We begin by determining which interface we want to sniff on. In Linux this may be something like eth0, in BSD it may be xl1, etc. We can either define this device in a string, or we can ask pcap to provide us with the name of an interface that will do the job.
2. Initialize pcap. This is where we actually tell pcap what device we are sniffing on. We can, if we want to, sniff on multiple devices. How do we differentiate between them? Using file handles. Just like opening a file for reading or writing, we must name our sniffing "session" so we can tell it apart from other such sessions.
3. In the event that we only want to sniff specific traffic (e.g.: only TCP/IP packets, only packets going to port 23, etc) we must create a rule set, "compile" it, and apply it. This is a three phase process, all of which is closely related. The rule set is kept in a string, and is converted into a format that pcap can read (hence compiling it.) The compilation is actually just done by calling a function within our program; it does not involve the use of an external application. Then we tell pcap to apply it to whichever session we wish for it to filter.
4. Finally, we tell pcap to enter it's primary execution loop. In this state, pcap waits until it has received however many packets we want it to. Every time it gets a new packet in, it calls another function that we have already defined. The function that it calls can do anything we want; it can dissect the packet and print it to the user, it can save it in a file, or it can do nothing at all.
5. After our sniffing needs are satisfied, we close our session and are complete.
This is actually a very simple process. Five steps total, one of which is optional (step 3, in case you were wondering.) Let's take a look at each of the steps and how to implement them.
Setting the device
This is terribly simple. There are two techniques for setting the device that we wish to sniff on.
The first is that we can simply have the user tell us. Consider the following program:
#include
#include
int main(int argc, char *argv[])
{
char *dev = argv[1];
printf("Device: %s\n", dev);
return(0);
}
The user specifies the device by passing the name of it as the first argument to the program. Now the string "dev" holds the name of the interface that we will sniff on in a format that pcap can understand (assuming, of course, the user gave us a real interface).
The other technique is equally simple. Look at this program:
#include
#include
int main(int argc, char *argv[])
{
char *dev, errbuf[PCAP_ERRBUF_SIZE];
dev = pcap_lookupdev(errbuf);
if (dev == NULL) {
fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
return(2);
}
printf("Device: %s\n", dev);
return(0);
}
In this case, pcap just sets the device on its own. "But wait, Tim," you say. "What is the deal with the errbuf string?" Most of the pcap commands allow us to pass them a string as an argument. The purpose of this string? In the event that the command fails, it will populate the string with a description of the error. In this case, if pcap_lookupdev() fails, it will store an error message in errbuf. Nifty, isn't it? And that's how we set our device.
Opening the device for sniffing
The task of creating a sniffing session is really quite simple. For this, we use pcap_open_live(). The prototype of this function (from the pcap man page) is as follows:
pcap_t *pcap_open_live(char *device, int snaplen, int promisc, int to_ms,
char *ebuf)
The first argument is the device that we specified in the previous section. snaplen is an integer which defines the maximum number of bytes to be captured by pcap. promisc, when set to true, brings the interface into promiscuous mode (however, even if it is set to false, it is possible under specific cases for the interface to be in promiscuous mode, anyway). to_ms is the read time out in milliseconds (a value of 0 means no time out; on at least some platforms, this means that you may wait until a sufficient number of packets arrive before seeing any packets, so you should use a non-zero timeout). Lastly, ebuf is a string we can store any error messages within (as we did above with errbuf). The function returns our session handler.
To demonstrate, consider this code snippet:
#include
...
pcap_t *handle;
handle = pcap_open_live(somedev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
return(2);
}
This code fragment opens the device stored in the strong "somedev", tells it to read however many bytes are specified in BUFSIZ (which is defined in pcap.h). We are telling it to put the device into promiscuous mode, to sniff until an error occurs, and if there is an error, store it in the string errbuf; it uses that string to print an error message.
A note about promiscuous vs. non-promiscuous sniffing: The two techniques are very different in style. In standard, non-promiscuous sniffing, a host is sniffing only traffic that is directly related to it. Only traffic to, from, or routed through the host will be picked up by the sniffer. Promiscuous mode, on the other hand, sniffs all traffic on the wire. In a non-switched environment, this could be all network traffic. The obvious advantage to this is that it provides more packets for sniffing, which may or may not be helpful depending on the reason you are sniffing the network. However, there are regressions. Promiscuous mode sniffing is detectable; a host can test with strong reliability to determine if another host is doing promiscuous sniffing. Second, it only works in a non-switched environment (such as a hub, or a switch that is being ARP flooded). Third, on high traffic networks, the host can become quite taxed for system resources.
Filtering traffic
Often times our sniffer may only be interested in specific traffic. For instance, there may be times when all we want is to sniff on port 23 (telnet) in search of passwords. Or perhaps we want to highjack a file being sent over port 21 (FTP). Maybe we only want DNS traffic (port 53 UDP). Whatever the case, rarely do we just want to blindly sniff all network traffic. Enter pcap_compile() and pcap_setfilter().
The process is quite simple. After we have already called pcap_open_live() and have a working sniffing session, we can apply our filter. Why not just use our own if/else if statements? Two reasons. First, pcap's filter is far more efficient, because it does it directly with the BPF filter; we eliminate numerous steps by having the BPF driver do it directly. Second, this is a lot easier :)
Before applying our filter, we must "compile" it. The filter expression is kept in a regular string (char array). The syntax is documented quite well in the man page for tcpdump; I leave you to read it on your own. However, we will use simple test expressions, so perhaps you are sharp enough to figure it out from my examples.
To compile the program we call pcap_compile(). The prototype defines it as:
int pcap_compile(pcap_t *p, struct bpf_program *fp, char *str, int optimize,
bpf_u_int32 netmask)
The first argument is our session handle (pcap_t *handle in our previous example). Following that is a reference to the place we will store the compiled version of our filter. Then comes the expression itself, in regular string format. Next is an integer that decides if the expression should be "optimized" or not (0 is false, 1 is true. Standard stuff.) Finally, we must specify the net mask of the network the filter applies to. The function returns -1 on failure; all other values imply success.
After the expression has been compiled, it is time to apply it. Enter pcap_setfilter(). Following our format of explaining pcap, we shall look at the pcap_setfilter() prototype:
int pcap_setfilter(pcap_t *p, struct bpf_program *fp)
This is very straightforward. The first argument is our session handler, the second is a reference to the compiled version of the expression (presumably the same variable as the second argument to pcap_compile()).
Perhaps another code sample would help to better understand:
#include
...
pcap_t *handle; /* Session handle */
char dev[] = "rl0"; /* Device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE]; /* Error string */
struct bpf_program fp; /* The compiled filter expression */
char filter_exp[] = "port 23"; /* The filter expression */
bpf_u_int32 mask; /* The netmask of our sniffing device */
bpf_u_int32 net; /* The IP of our sniffing device */
if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
fprintf(stderr, "Can't get netmask for device %s\n", dev);
net = 0;
mask = 0;
}
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
return(2);
}
if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {
fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
This program preps the sniffer to sniff all traffic coming from or going to port 23, in promiscuous mode, on the device rl0.
You may notice that the previous example contains a function that we have not yet discussed. pcap_lookupnet() is a function that, given the name of a device, returns its IP and net mask. This was essential because we needed to know the net mask in order to apply the filter. This function is described in the Miscellaneous section at the end of the document.
It has been my experience that this filter does not work across all operating systems. In my test environment, I found that OpenBSD 2.9 with a default kernel does support this type of filter, but FreeBSD 4.3 with a default kernel does not. Your mileage may vary.
The actual sniffing
At this point we have learned how to define a device, prepare it for sniffing, and apply filters about what we should and should not sniff for. Now it is time to actually capture some packets.
There are two main techniques for capturing packets. We can either capture a single packet at a time, or we can enter a loop that waits for n number of packets to be sniffed before being done. We will begin by looking at how to capture a single packet, then look at methods of using loops. For this we use pcap_next().
The prototype for pcap_next() is fairly simple:
u_char *pcap_next(pcap_t *p, struct pcap_pkthdr *h)
The first argument is our session handler. The second argument is a pointer to a structure that holds general information about the packet, specifically the time in which it was sniffed, the length of this packet, and the length of his specific portion (incase it is fragmented, for example.) pcap_next() returns a u_char pointer to the packet that is described by this structure. We'll discuss the technique for actually reading the packet itself later.
Here is a simple demonstration of using pcap_next() to sniff a packet.
#include
#include
int main(int argc, char *argv[])
{
pcap_t *handle; /* Session handle */
char *dev; /* The device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE]; /* Error string */
struct bpf_program fp; /* The compiled filter */
char filter_exp[] = "port 23"; /* The filter expression */
bpf_u_int32 mask; /* Our netmask */
bpf_u_int32 net; /* Our IP */
struct pcap_pkthdr header; /* The header that pcap gives us */
const u_char *packet; /* The actual packet */
/* Define the device */
dev = pcap_lookupdev(errbuf);
if (dev == NULL) {
fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
return(2);
}
/* Find the properties for the device */
if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
fprintf(stderr, "Couldn't get netmask for device %s: %s\n", dev, errbuf);
net = 0;
mask = 0;
}
/* Open the session in promiscuous mode */
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
return(2);
}
/* Compile and apply the filter */
if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {
fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
/* Grab a packet */
packet = pcap_next(handle, &header);
/* Print its length */
printf("Jacked a packet with length of [%d]\n", header.len);
/* And close the session */
pcap_close(handle);
return(0);
}
This application sniffs on whatever device is returned by pcap_lookupdev() by putting it into promiscuous mode. It finds the first packet to come across port 23 (telnet) and tells the user the size of the packet (in bytes). Again, this program includes a new call, pcap_close(), which we will discuss later (although it really is quite self explanatory).
The other technique we can use is more complicated, and probably more useful. Few sniffers (if any) actually use pcap_next(). More often than not, they use pcap_loop() or pcap_dispatch() (which then themselves use pcap_loop()). To understand the use of these two functions, you must understand the idea of a callback function.
Callback functions are not anything new, and are very common in many API's. The concept behind a callback function is fairly simple. Suppose I have a program that is waiting for an event of some sort. For the purpose of this example, let's pretend that my program wants a user to press a key on the keyboard. Every time they press a key, I want to call a function which then will determine that to do. The function I am utilizing is a callback function. Every time the user presses a key, my program will call the callback function. Callbacks are used in pcap, but instead of being called when a user presses a key, they are called when pcap sniffs a packet. The two functions that one can use to define their callback is pcap_loop() and pcap_dispatch(). pcap_loop() and pcap_dispatch() are very similar in their usage of callbacks. Both of them call a callback function every time a packet is sniffed that meets our filter requirements (if any filter exists, of course. If not, then all packets that are sniffed are sent to the callback.)
The prototype for pcap_loop() is below:
int pcap_loop(pcap_t *p, int cnt, pcap_handler callback, u_char *user)
The first argument is our session handle. Following that is an integer that tells pcap_loop() how many packets it should sniff for before returning (a negative value means it should sniff until an error occurs). The third argument is the name of the callback function (just its identifier, no parentheses). The last argument is useful in some applications, but many times is simply set as NULL. Suppose we have arguments of our own that we wish to send to our callback function, in addition to the arguments that pcap_loop() sends. This is where we do it. Obviously, you must typecast to a u_char pointer to ensure the results make it there correctly; as we will see later, pcap makes use of some very interesting means of passing information in the form of a u_char pointer. After we show an example of how pcap does it, it should be obvious how to do it here. If not, consult your local C reference text, as an explanation of pointers is beyond the scope of this document. pcap_dispatch() is almost identical in usage. The only difference between pcap_dispatch() and pcap_loop() is that pcap_dispatch() will only process the first batch of packets that it receives from the system, while pcap_loop() will continue processing packets or batches of packets until the count of packets runs out. For a more in depth discussion of their differences, see the pcap man page.
Before we can provide an example of using pcap_loop(), we must examine the format of our callback function. We cannot arbitrarily define our callback's prototype; otherwise, pcap_loop() would not know how to use the function. So we use this format as the prototype for our callback function:
void got_packet(u_char *args, const struct pcap_pkthdr *header,
const u_char *packet);
Let's examine this in more detail. First, you'll notice that the function has a void return type. This is logical, because pcap_loop() wouldn't know how to handle a return value anyway. The first argument corresponds to the last argument of pcap_loop(). Whatever value is passed as the last argument to pcap_loop() is passed to the first argument of our callback function every time the function is called. The second argument is the pcap header, which contains information about when the packet was sniffed, how large it is, etc. The pcap_pkthdr structure is defined in pcap.h as:
struct pcap_pkthdr {
struct timeval ts; /* time stamp */
bpf_u_int32 caplen; /* length of portion present */
bpf_u_int32 len; /* length this packet (off wire) */
};
These values should be fairly self explanatory. The last argument is the most interesting of them all, and the most confusing to the average novice pcap programmer. It is another pointer to a u_char, and it points to the first byte of a chunk of data containing the entire packet, as sniffed by pcap_loop().
But how do you make use of this variable (named "packet" in our prototype)? A packet contains many attributes, so as you can imagine, it is not really a string, but actually a collection of structures (for instance, a TCP/IP packet would have an Ethernet header, an IP header, a TCP header, and lastly, the packet's payload). This u_char pointer points to the serialized version of these structures. To make any use of it, we must do some interesting typecasting.
First, we must have the actual structures define before we can typecast to them. The following are the structure definitions that I use to describe a TCP/IP packet over Ethernet.
/* Ethernet addresses are 6 bytes */
#define ETHER_ADDR_LEN 6
/* Ethernet header */
struct sniff_ethernet {
u_char ether_dhost[ETHER_ADDR_LEN]; /* Destination host address */
u_char ether_shost[ETHER_ADDR_LEN]; /* Source host address */
u_short ether_type; /* IP? ARP? RARP? etc */
};
/* IP header */
struct sniff_ip {
u_char ip_vhl; /* version << 4 | header length >> 2 */
u_char ip_tos; /* type of service */
u_short ip_len; /* total length */
u_short ip_id; /* identification */
u_short ip_off; /* fragment offset field */
#define IP_RF 0x8000 /* reserved fragment flag */
#define IP_DF 0x4000 /* dont fragment flag */
#define IP_MF 0x2000 /* more fragments flag */
#define IP_OFFMASK 0x1fff /* mask for fragmenting bits */
u_char ip_ttl; /* time to live */
u_char ip_p; /* protocol */
u_short ip_sum; /* checksum */
struct in_addr ip_src,ip_dst; /* source and dest address */
};
#define IP_HL(ip) (((ip)->ip_vhl) & 0x0f)
#define IP_V(ip) (((ip)->ip_vhl) >> 4)
/* TCP header */
struct sniff_tcp {
u_short th_sport; /* source port */
u_short th_dport; /* destination port */
tcp_seq th_seq; /* sequence number */
tcp_seq th_ack; /* acknowledgement number */
u_char th_offx2; /* data offset, rsvd */
#define TH_OFF(th) (((th)->th_offx2 & 0xf0) >> 4)
u_char th_flags;
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_PUSH 0x08
#define TH_ACK 0x10
#define TH_URG 0x20
#define TH_ECE 0x40
#define TH_CWR 0x80
#define TH_FLAGS (TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG|TH_ECE|TH_CWR)
u_short th_win; /* window */
u_short th_sum; /* checksum */
u_short th_urp; /* urgent pointer */
};
Note: On my Slackware Linux 8 box (stock kernel 2.2.19) I found that code using the above structures would not compile. The problem, as it turns out, was in include/features.h, which implements a POSIX interface unless _BSD_SOURCE is defined. If it was not defined, then I had to use a different structure definition for the TCP header. The more universal solution, that does not prevent the code from working on FreeBSD or OpenBSD (where it had previously worked fine), is simply to do the following:
#define _BSD_SOURCE 1
prior to including any of your header files. This will ensure that a BSD style API is being used. Again, if you don't wish to do this, then you can simply use the alternative TCP header structure, which I've linked to here, along with some quick notes about using it.
So how does all of this relate to pcap and our mysterious u_char pointer? Well, those structures define the headers that appear in the data for the packet. So how can we break it apart? Be prepared to witness one of the most practical uses of pointers (for all of those new C programmers who insist that pointers are useless, I smite you).
Again, we're going to assume that we are dealing with a TCP/IP packet over Ethernet. This same technique applies to any packet; the only difference is the structure types that you actually use. So let's begin by defining the variables and compile-time definitions we will need to deconstruct the packet data.
/* ethernet headers are always exactly 14 bytes */
#define SIZE_ETHERNET 14
const struct sniff_ethernet *ethernet; /* The ethernet header */
const struct sniff_ip *ip; /* The IP header */
const struct sniff_tcp *tcp; /* The TCP header */
const char *payload; /* Packet payload */
u_int size_ip;
u_int size_tcp;
And now we do our magical typecasting:
ethernet = (struct sniff_ethernet*)(packet);
ip = (struct sniff_ip*)(packet + SIZE_ETHERNET);
size_ip = IP_HL(ip)*4;
if (size_ip < 20) {
printf(" * Invalid IP header length: %u bytes\n", size_ip);
return;
}
tcp = (struct sniff_tcp*)(packet + SIZE_ETHERNET + size_ip);
size_tcp = TH_OFF(tcp)*4;
if (size_tcp < 20) {
printf(" * Invalid TCP header length: %u bytes\n", size_tcp);
return;
}
payload = (u_char *)(packet + SIZE_ETHERNET + size_ip + size_tcp);
How does this work? Consider the layout of the packet data in memory. The u_char pointer is really just a variable containing an address in memory. That's what a pointer is; it points to a location in memory.
For the sake of simplicity, we'll say that the address this pointer is set to is the value X. Well, if our three structures are just sitting in line, the first of them (sniff_ethernet) being located in memory at the address X, then we can easily find the address of the structure after it; that address is X plus the length of the Ethernet header, which is 14, or SIZE_ETHERNET.
Similarly if we have the address of that header, the address of the structure after it is the address of that header plus the length of that header. The IP header, unlike the Ethernet header, does not have a fixed length; its length is given, as a count of 4-byte words, by the header length field of the IP header. As it's a count of 4-byte words, it must be multiplied by 4 to give the size in bytes. The minimum length of that header is 20 bytes.
The TCP header also has a variable length; its length is given, as a number of 4-byte words, by the "data offset" field of the TCP header, and its minimum length is also 20 bytes.
So let's make a chart:
Variable Location (in bytes)
sniff_ethernet X
sniff_ip X + SIZE_ETHERNET
sniff_tcp X + SIZE_ETHERNET + {IP header length}
payload X + SIZE_ETHERNET + {IP header length} + {TCP header length}
The sniff_ethernet structure, being the first in line, is simply at location X. sniff_ip, who follows directly after sniff_ethernet, is at the location X, plus however much space the Ethernet header consumes (14 bytes, or SIZE_ETHERNET). sniff_tcp is after both sniff_ip and sniff_ethernet, so it is location at X plus the sizes of the Ethernet and IP headers (14 bytes, and 4 times the IP header length, respectively). Lastly, the payload (which doesn't have a single structure corresponding to it, as its contents depends on the protocol being used atop TCP) is located after all of them.
So at this point, we know how to set our callback function, call it, and find out the attributes about the packet that has been sniffed. It's now the time you have been waiting for: writing a useful packet sniffer. Because of the length of the source code, it is not included in the body of this document.
The first thing to understand is the general layout of a pcap sniffer. The flow of code is as follows:
1. We begin by determining which interface we want to sniff on. In Linux this may be something like eth0, in BSD it may be xl1, etc. We can either define this device in a string, or we can ask pcap to provide us with the name of an interface that will do the job.
2. Initialize pcap. This is where we actually tell pcap what device we are sniffing on. We can, if we want to, sniff on multiple devices. How do we differentiate between them? Using file handles. Just like opening a file for reading or writing, we must name our sniffing "session" so we can tell it apart from other such sessions.
3. In the event that we only want to sniff specific traffic (e.g.: only TCP/IP packets, only packets going to port 23, etc) we must create a rule set, "compile" it, and apply it. This is a three phase process, all of which is closely related. The rule set is kept in a string, and is converted into a format that pcap can read (hence compiling it.) The compilation is actually just done by calling a function within our program; it does not involve the use of an external application. Then we tell pcap to apply it to whichever session we wish for it to filter.
4. Finally, we tell pcap to enter it's primary execution loop. In this state, pcap waits until it has received however many packets we want it to. Every time it gets a new packet in, it calls another function that we have already defined. The function that it calls can do anything we want; it can dissect the packet and print it to the user, it can save it in a file, or it can do nothing at all.
5. After our sniffing needs are satisfied, we close our session and are complete.
This is actually a very simple process. Five steps total, one of which is optional (step 3, in case you were wondering.) Let's take a look at each of the steps and how to implement them.
Setting the device
This is terribly simple. There are two techniques for setting the device that we wish to sniff on.
The first is that we can simply have the user tell us. Consider the following program:
#include
#include
int main(int argc, char *argv[])
{
char *dev = argv[1];
printf("Device: %s\n", dev);
return(0);
}
The user specifies the device by passing the name of it as the first argument to the program. Now the string "dev" holds the name of the interface that we will sniff on in a format that pcap can understand (assuming, of course, the user gave us a real interface).
The other technique is equally simple. Look at this program:
#include
#include
int main(int argc, char *argv[])
{
char *dev, errbuf[PCAP_ERRBUF_SIZE];
dev = pcap_lookupdev(errbuf);
if (dev == NULL) {
fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
return(2);
}
printf("Device: %s\n", dev);
return(0);
}
In this case, pcap just sets the device on its own. "But wait, Tim," you say. "What is the deal with the errbuf string?" Most of the pcap commands allow us to pass them a string as an argument. The purpose of this string? In the event that the command fails, it will populate the string with a description of the error. In this case, if pcap_lookupdev() fails, it will store an error message in errbuf. Nifty, isn't it? And that's how we set our device.
Opening the device for sniffing
The task of creating a sniffing session is really quite simple. For this, we use pcap_open_live(). The prototype of this function (from the pcap man page) is as follows:
pcap_t *pcap_open_live(char *device, int snaplen, int promisc, int to_ms,
char *ebuf)
The first argument is the device that we specified in the previous section. snaplen is an integer which defines the maximum number of bytes to be captured by pcap. promisc, when set to true, brings the interface into promiscuous mode (however, even if it is set to false, it is possible under specific cases for the interface to be in promiscuous mode, anyway). to_ms is the read time out in milliseconds (a value of 0 means no time out; on at least some platforms, this means that you may wait until a sufficient number of packets arrive before seeing any packets, so you should use a non-zero timeout). Lastly, ebuf is a string we can store any error messages within (as we did above with errbuf). The function returns our session handler.
To demonstrate, consider this code snippet:
#include
...
pcap_t *handle;
handle = pcap_open_live(somedev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
return(2);
}
This code fragment opens the device stored in the strong "somedev", tells it to read however many bytes are specified in BUFSIZ (which is defined in pcap.h). We are telling it to put the device into promiscuous mode, to sniff until an error occurs, and if there is an error, store it in the string errbuf; it uses that string to print an error message.
A note about promiscuous vs. non-promiscuous sniffing: The two techniques are very different in style. In standard, non-promiscuous sniffing, a host is sniffing only traffic that is directly related to it. Only traffic to, from, or routed through the host will be picked up by the sniffer. Promiscuous mode, on the other hand, sniffs all traffic on the wire. In a non-switched environment, this could be all network traffic. The obvious advantage to this is that it provides more packets for sniffing, which may or may not be helpful depending on the reason you are sniffing the network. However, there are regressions. Promiscuous mode sniffing is detectable; a host can test with strong reliability to determine if another host is doing promiscuous sniffing. Second, it only works in a non-switched environment (such as a hub, or a switch that is being ARP flooded). Third, on high traffic networks, the host can become quite taxed for system resources.
Filtering traffic
Often times our sniffer may only be interested in specific traffic. For instance, there may be times when all we want is to sniff on port 23 (telnet) in search of passwords. Or perhaps we want to highjack a file being sent over port 21 (FTP). Maybe we only want DNS traffic (port 53 UDP). Whatever the case, rarely do we just want to blindly sniff all network traffic. Enter pcap_compile() and pcap_setfilter().
The process is quite simple. After we have already called pcap_open_live() and have a working sniffing session, we can apply our filter. Why not just use our own if/else if statements? Two reasons. First, pcap's filter is far more efficient, because it does it directly with the BPF filter; we eliminate numerous steps by having the BPF driver do it directly. Second, this is a lot easier :)
Before applying our filter, we must "compile" it. The filter expression is kept in a regular string (char array). The syntax is documented quite well in the man page for tcpdump; I leave you to read it on your own. However, we will use simple test expressions, so perhaps you are sharp enough to figure it out from my examples.
To compile the program we call pcap_compile(). The prototype defines it as:
int pcap_compile(pcap_t *p, struct bpf_program *fp, char *str, int optimize,
bpf_u_int32 netmask)
The first argument is our session handle (pcap_t *handle in our previous example). Following that is a reference to the place we will store the compiled version of our filter. Then comes the expression itself, in regular string format. Next is an integer that decides if the expression should be "optimized" or not (0 is false, 1 is true. Standard stuff.) Finally, we must specify the net mask of the network the filter applies to. The function returns -1 on failure; all other values imply success.
After the expression has been compiled, it is time to apply it. Enter pcap_setfilter(). Following our format of explaining pcap, we shall look at the pcap_setfilter() prototype:
int pcap_setfilter(pcap_t *p, struct bpf_program *fp)
This is very straightforward. The first argument is our session handler, the second is a reference to the compiled version of the expression (presumably the same variable as the second argument to pcap_compile()).
Perhaps another code sample would help to better understand:
#include
...
pcap_t *handle; /* Session handle */
char dev[] = "rl0"; /* Device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE]; /* Error string */
struct bpf_program fp; /* The compiled filter expression */
char filter_exp[] = "port 23"; /* The filter expression */
bpf_u_int32 mask; /* The netmask of our sniffing device */
bpf_u_int32 net; /* The IP of our sniffing device */
if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
fprintf(stderr, "Can't get netmask for device %s\n", dev);
net = 0;
mask = 0;
}
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
return(2);
}
if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {
fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
This program preps the sniffer to sniff all traffic coming from or going to port 23, in promiscuous mode, on the device rl0.
You may notice that the previous example contains a function that we have not yet discussed. pcap_lookupnet() is a function that, given the name of a device, returns its IP and net mask. This was essential because we needed to know the net mask in order to apply the filter. This function is described in the Miscellaneous section at the end of the document.
It has been my experience that this filter does not work across all operating systems. In my test environment, I found that OpenBSD 2.9 with a default kernel does support this type of filter, but FreeBSD 4.3 with a default kernel does not. Your mileage may vary.
The actual sniffing
At this point we have learned how to define a device, prepare it for sniffing, and apply filters about what we should and should not sniff for. Now it is time to actually capture some packets.
There are two main techniques for capturing packets. We can either capture a single packet at a time, or we can enter a loop that waits for n number of packets to be sniffed before being done. We will begin by looking at how to capture a single packet, then look at methods of using loops. For this we use pcap_next().
The prototype for pcap_next() is fairly simple:
u_char *pcap_next(pcap_t *p, struct pcap_pkthdr *h)
The first argument is our session handler. The second argument is a pointer to a structure that holds general information about the packet, specifically the time in which it was sniffed, the length of this packet, and the length of his specific portion (incase it is fragmented, for example.) pcap_next() returns a u_char pointer to the packet that is described by this structure. We'll discuss the technique for actually reading the packet itself later.
Here is a simple demonstration of using pcap_next() to sniff a packet.
#include
#include
int main(int argc, char *argv[])
{
pcap_t *handle; /* Session handle */
char *dev; /* The device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE]; /* Error string */
struct bpf_program fp; /* The compiled filter */
char filter_exp[] = "port 23"; /* The filter expression */
bpf_u_int32 mask; /* Our netmask */
bpf_u_int32 net; /* Our IP */
struct pcap_pkthdr header; /* The header that pcap gives us */
const u_char *packet; /* The actual packet */
/* Define the device */
dev = pcap_lookupdev(errbuf);
if (dev == NULL) {
fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
return(2);
}
/* Find the properties for the device */
if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
fprintf(stderr, "Couldn't get netmask for device %s: %s\n", dev, errbuf);
net = 0;
mask = 0;
}
/* Open the session in promiscuous mode */
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
return(2);
}
/* Compile and apply the filter */
if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {
fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
/* Grab a packet */
packet = pcap_next(handle, &header);
/* Print its length */
printf("Jacked a packet with length of [%d]\n", header.len);
/* And close the session */
pcap_close(handle);
return(0);
}
This application sniffs on whatever device is returned by pcap_lookupdev() by putting it into promiscuous mode. It finds the first packet to come across port 23 (telnet) and tells the user the size of the packet (in bytes). Again, this program includes a new call, pcap_close(), which we will discuss later (although it really is quite self explanatory).
The other technique we can use is more complicated, and probably more useful. Few sniffers (if any) actually use pcap_next(). More often than not, they use pcap_loop() or pcap_dispatch() (which then themselves use pcap_loop()). To understand the use of these two functions, you must understand the idea of a callback function.
Callback functions are not anything new, and are very common in many API's. The concept behind a callback function is fairly simple. Suppose I have a program that is waiting for an event of some sort. For the purpose of this example, let's pretend that my program wants a user to press a key on the keyboard. Every time they press a key, I want to call a function which then will determine that to do. The function I am utilizing is a callback function. Every time the user presses a key, my program will call the callback function. Callbacks are used in pcap, but instead of being called when a user presses a key, they are called when pcap sniffs a packet. The two functions that one can use to define their callback is pcap_loop() and pcap_dispatch(). pcap_loop() and pcap_dispatch() are very similar in their usage of callbacks. Both of them call a callback function every time a packet is sniffed that meets our filter requirements (if any filter exists, of course. If not, then all packets that are sniffed are sent to the callback.)
The prototype for pcap_loop() is below:
int pcap_loop(pcap_t *p, int cnt, pcap_handler callback, u_char *user)
The first argument is our session handle. Following that is an integer that tells pcap_loop() how many packets it should sniff for before returning (a negative value means it should sniff until an error occurs). The third argument is the name of the callback function (just its identifier, no parentheses). The last argument is useful in some applications, but many times is simply set as NULL. Suppose we have arguments of our own that we wish to send to our callback function, in addition to the arguments that pcap_loop() sends. This is where we do it. Obviously, you must typecast to a u_char pointer to ensure the results make it there correctly; as we will see later, pcap makes use of some very interesting means of passing information in the form of a u_char pointer. After we show an example of how pcap does it, it should be obvious how to do it here. If not, consult your local C reference text, as an explanation of pointers is beyond the scope of this document. pcap_dispatch() is almost identical in usage. The only difference between pcap_dispatch() and pcap_loop() is that pcap_dispatch() will only process the first batch of packets that it receives from the system, while pcap_loop() will continue processing packets or batches of packets until the count of packets runs out. For a more in depth discussion of their differences, see the pcap man page.
Before we can provide an example of using pcap_loop(), we must examine the format of our callback function. We cannot arbitrarily define our callback's prototype; otherwise, pcap_loop() would not know how to use the function. So we use this format as the prototype for our callback function:
void got_packet(u_char *args, const struct pcap_pkthdr *header,
const u_char *packet);
Let's examine this in more detail. First, you'll notice that the function has a void return type. This is logical, because pcap_loop() wouldn't know how to handle a return value anyway. The first argument corresponds to the last argument of pcap_loop(). Whatever value is passed as the last argument to pcap_loop() is passed to the first argument of our callback function every time the function is called. The second argument is the pcap header, which contains information about when the packet was sniffed, how large it is, etc. The pcap_pkthdr structure is defined in pcap.h as:
struct pcap_pkthdr {
struct timeval ts; /* time stamp */
bpf_u_int32 caplen; /* length of portion present */
bpf_u_int32 len; /* length this packet (off wire) */
};
These values should be fairly self explanatory. The last argument is the most interesting of them all, and the most confusing to the average novice pcap programmer. It is another pointer to a u_char, and it points to the first byte of a chunk of data containing the entire packet, as sniffed by pcap_loop().
But how do you make use of this variable (named "packet" in our prototype)? A packet contains many attributes, so as you can imagine, it is not really a string, but actually a collection of structures (for instance, a TCP/IP packet would have an Ethernet header, an IP header, a TCP header, and lastly, the packet's payload). This u_char pointer points to the serialized version of these structures. To make any use of it, we must do some interesting typecasting.
First, we must have the actual structures define before we can typecast to them. The following are the structure definitions that I use to describe a TCP/IP packet over Ethernet.
/* Ethernet addresses are 6 bytes */
#define ETHER_ADDR_LEN 6
/* Ethernet header */
struct sniff_ethernet {
u_char ether_dhost[ETHER_ADDR_LEN]; /* Destination host address */
u_char ether_shost[ETHER_ADDR_LEN]; /* Source host address */
u_short ether_type; /* IP? ARP? RARP? etc */
};
/* IP header */
struct sniff_ip {
u_char ip_vhl; /* version << 4 | header length >> 2 */
u_char ip_tos; /* type of service */
u_short ip_len; /* total length */
u_short ip_id; /* identification */
u_short ip_off; /* fragment offset field */
#define IP_RF 0x8000 /* reserved fragment flag */
#define IP_DF 0x4000 /* dont fragment flag */
#define IP_MF 0x2000 /* more fragments flag */
#define IP_OFFMASK 0x1fff /* mask for fragmenting bits */
u_char ip_ttl; /* time to live */
u_char ip_p; /* protocol */
u_short ip_sum; /* checksum */
struct in_addr ip_src,ip_dst; /* source and dest address */
};
#define IP_HL(ip) (((ip)->ip_vhl) & 0x0f)
#define IP_V(ip) (((ip)->ip_vhl) >> 4)
/* TCP header */
struct sniff_tcp {
u_short th_sport; /* source port */
u_short th_dport; /* destination port */
tcp_seq th_seq; /* sequence number */
tcp_seq th_ack; /* acknowledgement number */
u_char th_offx2; /* data offset, rsvd */
#define TH_OFF(th) (((th)->th_offx2 & 0xf0) >> 4)
u_char th_flags;
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_PUSH 0x08
#define TH_ACK 0x10
#define TH_URG 0x20
#define TH_ECE 0x40
#define TH_CWR 0x80
#define TH_FLAGS (TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG|TH_ECE|TH_CWR)
u_short th_win; /* window */
u_short th_sum; /* checksum */
u_short th_urp; /* urgent pointer */
};
Note: On my Slackware Linux 8 box (stock kernel 2.2.19) I found that code using the above structures would not compile. The problem, as it turns out, was in include/features.h, which implements a POSIX interface unless _BSD_SOURCE is defined. If it was not defined, then I had to use a different structure definition for the TCP header. The more universal solution, that does not prevent the code from working on FreeBSD or OpenBSD (where it had previously worked fine), is simply to do the following:
#define _BSD_SOURCE 1
prior to including any of your header files. This will ensure that a BSD style API is being used. Again, if you don't wish to do this, then you can simply use the alternative TCP header structure, which I've linked to here, along with some quick notes about using it.
So how does all of this relate to pcap and our mysterious u_char pointer? Well, those structures define the headers that appear in the data for the packet. So how can we break it apart? Be prepared to witness one of the most practical uses of pointers (for all of those new C programmers who insist that pointers are useless, I smite you).
Again, we're going to assume that we are dealing with a TCP/IP packet over Ethernet. This same technique applies to any packet; the only difference is the structure types that you actually use. So let's begin by defining the variables and compile-time definitions we will need to deconstruct the packet data.
/* ethernet headers are always exactly 14 bytes */
#define SIZE_ETHERNET 14
const struct sniff_ethernet *ethernet; /* The ethernet header */
const struct sniff_ip *ip; /* The IP header */
const struct sniff_tcp *tcp; /* The TCP header */
const char *payload; /* Packet payload */
u_int size_ip;
u_int size_tcp;
And now we do our magical typecasting:
ethernet = (struct sniff_ethernet*)(packet);
ip = (struct sniff_ip*)(packet + SIZE_ETHERNET);
size_ip = IP_HL(ip)*4;
if (size_ip < 20) {
printf(" * Invalid IP header length: %u bytes\n", size_ip);
return;
}
tcp = (struct sniff_tcp*)(packet + SIZE_ETHERNET + size_ip);
size_tcp = TH_OFF(tcp)*4;
if (size_tcp < 20) {
printf(" * Invalid TCP header length: %u bytes\n", size_tcp);
return;
}
payload = (u_char *)(packet + SIZE_ETHERNET + size_ip + size_tcp);
How does this work? Consider the layout of the packet data in memory. The u_char pointer is really just a variable containing an address in memory. That's what a pointer is; it points to a location in memory.
For the sake of simplicity, we'll say that the address this pointer is set to is the value X. Well, if our three structures are just sitting in line, the first of them (sniff_ethernet) being located in memory at the address X, then we can easily find the address of the structure after it; that address is X plus the length of the Ethernet header, which is 14, or SIZE_ETHERNET.
Similarly if we have the address of that header, the address of the structure after it is the address of that header plus the length of that header. The IP header, unlike the Ethernet header, does not have a fixed length; its length is given, as a count of 4-byte words, by the header length field of the IP header. As it's a count of 4-byte words, it must be multiplied by 4 to give the size in bytes. The minimum length of that header is 20 bytes.
The TCP header also has a variable length; its length is given, as a number of 4-byte words, by the "data offset" field of the TCP header, and its minimum length is also 20 bytes.
So let's make a chart:
Variable Location (in bytes)
sniff_ethernet X
sniff_ip X + SIZE_ETHERNET
sniff_tcp X + SIZE_ETHERNET + {IP header length}
payload X + SIZE_ETHERNET + {IP header length} + {TCP header length}
The sniff_ethernet structure, being the first in line, is simply at location X. sniff_ip, who follows directly after sniff_ethernet, is at the location X, plus however much space the Ethernet header consumes (14 bytes, or SIZE_ETHERNET). sniff_tcp is after both sniff_ip and sniff_ethernet, so it is location at X plus the sizes of the Ethernet and IP headers (14 bytes, and 4 times the IP header length, respectively). Lastly, the payload (which doesn't have a single structure corresponding to it, as its contents depends on the protocol being used atop TCP) is located after all of them.
So at this point, we know how to set our callback function, call it, and find out the attributes about the packet that has been sniffed. It's now the time you have been waiting for: writing a useful packet sniffer. Because of the length of the source code, it is not included in the body of this document.
Thursday, April 17, 2008
Named pipes
A very useful Linux feature is named pipes which enable different processes to communicate.
One of the fundamental features that makes Linux and other Unices useful is the “pipe”. Pipes allow separate processes to communicate without having been designed explicitly to work together. This allows tools quite narrow in their function to be combined in complex ways.
A simple example of using a pipe is the command:
ls | grep x
When bash examines the command line, it finds the vertical bar character | that separates the two commands. Bash and other shells run both commands, connecting the output of the first to the input of the second. The ls program produces a list of files in the current directory, while the grep program reads the output of ls and prints only those lines containing the letter x.
The above, familiar to most Unix users, is an example of an “unnamed pipe”. The pipe exists only inside the kernel and cannot be accessed by processes that created it, in this case, the bash shell. For those who don't already know, a parent process is the first process started by a program that in turn creates separate child processes that execute the program.
The other sort of pipe is a “named” pipe, which is sometimes called a FIFO. FIFO stands for “First In, First Out” and refers to the property that the order of bytes going in is the same coming out. The “name” of a named pipe is actually a file name within the file system. Pipes are shown by ls as any other file with a couple of differences:
% ls -l fifo1
prw-r--r-- 1 andy users 0 Jan 22 23:11 fifo1|
The p in the leftmost column indicates that fifo1 is a pipe. The rest of the permission bits control who can read or write to the pipe just like a regular file. On systems with a modern ls, the | character at the end of the file name is another clue, and on Linux systems with the color option enabled, fifo| is printed in red by default.
On older Linux systems, named pipes are created by the mknod program, usually located in the /etc directory. On more modern systems, mkfifo is a standard utility. The mkfifo program takes one or more file names as arguments for this task and creates pipes with those names. For example, to create a named pipe with the name pipe1 give the command:
mkfifo pipe
The simplest way to show how named pipes work is with an example. Suppose we've created pipe as shown above. In one virtual console1, type:
ls -l > pipe1
and in another type:
cat < pipe
Voila! The output of the command run on the first console shows up on the second console. Note that the order in which you run the commands doesn't matter.
If you haven't used virtual consoles before, see the article “Keyboards, Consoles and VT Cruising” by John M. Fisk in the November 1996 Linux Journal.
If you watch closely, you'll notice that the first command you run appears to hang. This happens because the other end of the pipe is not yet connected, and so the kernel suspends the first process until the second process opens the pipe. In Unix jargon, the process is said to be “blocked”, since it is waiting for something to happen.
One very useful application of named pipes is to allow totally unrelated programs to communicate with each other. For example, a program that services requests of some sort (print files, access a database) could open the pipe for reading. Then, another process could make a request by opening the pipe and writing a command. That is, the “server” can perform a task on behalf of the “client”. Blocking can also happen if the client isn't writing, or the server isn't reading.
Pipe Madness
Create two named pipes, pipe1 and pipe2. Run the commands:
echo -n x | cat - pipe1 > pipe2 &
cat pipe1
On screen, it will not appear that anything is happening, but if you run top (a command similar to ps for showing process status), you'll see that both cat programs are running like crazy copying the letter x back and forth in an endless loop.
After you press ctrl-C to get out of the loop, you may receive the message “broken pipe”. This error occurs when a process writing to a pipe when the process reading the pipe closes its end. Since the reader is gone, the data has no place to go. Normally, the writer will finish writing its data and close the pipe. At this point, the reader sees the EOF (end of file) and executes the request.
Whether or not the “broken pipe” message is issued depends on events at the exact instant the ctrl-C is pressed. If the second cat has just read the x, pressing ctrl-C stops the second cat, pipe1 is closed and the first cat stops quietly, i.e., without a message. On the other hand, if the second cat is waiting for the first to write the x, ctrl-C causes pipe2 to close before the first cat can write to it, and the error message is issued. This sort of random behavior is known as a “race condition”.
Command Substitution
Bash uses named pipes in a really neat way. Recall that when you enclose a command in parenthesis, the command is actually run in a “subshell”; that is, the shell clones itself and the clone interprets the command(s) within the parenthesis. Since the outer shell is running only a single “command”, the output of a complete set of commands can be redirected as a unit. For example, the command:
(ls -l; ls -l) >ls.out
writes two copies of the current directory listing to the file ls.out.
Command substitution occurs when you put a < or > in front of the left parenthesis. For instance, typing the command:
cat <(ls -l)
results in the command ls -l executing in a subshell as usual, but redirects the output to a temporary named pipe, which bash creates, names and later deletes. Therefore, cat has a valid file name to read from, and we see the output of ls -l, taking one more step than usual to do so. Similarly, giving >(commands) results in Bash naming a temporary pipe, which the commands inside the parenthesis read for input.
If you want to see whether two directories contain the same file names, run the single command:
cmp <(ls /dir1) <(ls /dir2)
The compare program cmp will see the names of two files which it will read and compare.
Command substitution also makes the tee command (used to view and save the output of a command) much more useful in that you can cause a single stream of input to be read by multiple readers without resorting to temporary files—bash does all the work for you. The command:
ls | tee >(grep foo | wc >foo.count) \
>(grep bar | wc >bar.count) \
| grep baz | wc >baz.count
counts the number of occurrences of foo, bar and baz in the output of ls and writes this information to three separate files. Command substitutions can even be nested:
cat <(cat <(cat <(ls -l))))
works as a very roundabout way to list the current directory.
As you can see, while the unnamed pipes allow simple commands to be strung together, named pipes, with a little help from bash, allow whole trees of pipes to be created. The possibilities are limited only by your imagination.
One of the fundamental features that makes Linux and other Unices useful is the “pipe”. Pipes allow separate processes to communicate without having been designed explicitly to work together. This allows tools quite narrow in their function to be combined in complex ways.
A simple example of using a pipe is the command:
ls | grep x
When bash examines the command line, it finds the vertical bar character | that separates the two commands. Bash and other shells run both commands, connecting the output of the first to the input of the second. The ls program produces a list of files in the current directory, while the grep program reads the output of ls and prints only those lines containing the letter x.
The above, familiar to most Unix users, is an example of an “unnamed pipe”. The pipe exists only inside the kernel and cannot be accessed by processes that created it, in this case, the bash shell. For those who don't already know, a parent process is the first process started by a program that in turn creates separate child processes that execute the program.
The other sort of pipe is a “named” pipe, which is sometimes called a FIFO. FIFO stands for “First In, First Out” and refers to the property that the order of bytes going in is the same coming out. The “name” of a named pipe is actually a file name within the file system. Pipes are shown by ls as any other file with a couple of differences:
% ls -l fifo1
prw-r--r-- 1 andy users 0 Jan 22 23:11 fifo1|
The p in the leftmost column indicates that fifo1 is a pipe. The rest of the permission bits control who can read or write to the pipe just like a regular file. On systems with a modern ls, the | character at the end of the file name is another clue, and on Linux systems with the color option enabled, fifo| is printed in red by default.
On older Linux systems, named pipes are created by the mknod program, usually located in the /etc directory. On more modern systems, mkfifo is a standard utility. The mkfifo program takes one or more file names as arguments for this task and creates pipes with those names. For example, to create a named pipe with the name pipe1 give the command:
mkfifo pipe
The simplest way to show how named pipes work is with an example. Suppose we've created pipe as shown above. In one virtual console1, type:
ls -l > pipe1
and in another type:
cat < pipe
Voila! The output of the command run on the first console shows up on the second console. Note that the order in which you run the commands doesn't matter.
If you haven't used virtual consoles before, see the article “Keyboards, Consoles and VT Cruising” by John M. Fisk in the November 1996 Linux Journal.
If you watch closely, you'll notice that the first command you run appears to hang. This happens because the other end of the pipe is not yet connected, and so the kernel suspends the first process until the second process opens the pipe. In Unix jargon, the process is said to be “blocked”, since it is waiting for something to happen.
One very useful application of named pipes is to allow totally unrelated programs to communicate with each other. For example, a program that services requests of some sort (print files, access a database) could open the pipe for reading. Then, another process could make a request by opening the pipe and writing a command. That is, the “server” can perform a task on behalf of the “client”. Blocking can also happen if the client isn't writing, or the server isn't reading.
Pipe Madness
Create two named pipes, pipe1 and pipe2. Run the commands:
echo -n x | cat - pipe1 > pipe2 &
cat
On screen, it will not appear that anything is happening, but if you run top (a command similar to ps for showing process status), you'll see that both cat programs are running like crazy copying the letter x back and forth in an endless loop.
After you press ctrl-C to get out of the loop, you may receive the message “broken pipe”. This error occurs when a process writing to a pipe when the process reading the pipe closes its end. Since the reader is gone, the data has no place to go. Normally, the writer will finish writing its data and close the pipe. At this point, the reader sees the EOF (end of file) and executes the request.
Whether or not the “broken pipe” message is issued depends on events at the exact instant the ctrl-C is pressed. If the second cat has just read the x, pressing ctrl-C stops the second cat, pipe1 is closed and the first cat stops quietly, i.e., without a message. On the other hand, if the second cat is waiting for the first to write the x, ctrl-C causes pipe2 to close before the first cat can write to it, and the error message is issued. This sort of random behavior is known as a “race condition”.
Command Substitution
Bash uses named pipes in a really neat way. Recall that when you enclose a command in parenthesis, the command is actually run in a “subshell”; that is, the shell clones itself and the clone interprets the command(s) within the parenthesis. Since the outer shell is running only a single “command”, the output of a complete set of commands can be redirected as a unit. For example, the command:
(ls -l; ls -l) >ls.out
writes two copies of the current directory listing to the file ls.out.
Command substitution occurs when you put a < or > in front of the left parenthesis. For instance, typing the command:
cat <(ls -l)
results in the command ls -l executing in a subshell as usual, but redirects the output to a temporary named pipe, which bash creates, names and later deletes. Therefore, cat has a valid file name to read from, and we see the output of ls -l, taking one more step than usual to do so. Similarly, giving >(commands) results in Bash naming a temporary pipe, which the commands inside the parenthesis read for input.
If you want to see whether two directories contain the same file names, run the single command:
cmp <(ls /dir1) <(ls /dir2)
The compare program cmp will see the names of two files which it will read and compare.
Command substitution also makes the tee command (used to view and save the output of a command) much more useful in that you can cause a single stream of input to be read by multiple readers without resorting to temporary files—bash does all the work for you. The command:
ls | tee >(grep foo | wc >foo.count) \
>(grep bar | wc >bar.count) \
| grep baz | wc >baz.count
counts the number of occurrences of foo, bar and baz in the output of ls and writes this information to three separate files. Command substitutions can even be nested:
cat <(cat <(cat <(ls -l))))
works as a very roundabout way to list the current directory.
As you can see, while the unnamed pipes allow simple commands to be strung together, named pipes, with a little help from bash, allow whole trees of pipes to be created. The possibilities are limited only by your imagination.
Linux Signal Handling Model
Signals are used to notify a process or thread of a particular event. Many computer science researchers compare signals with hardware interrupts, which occur when a hardware subsystem, such as a disk I/O (input/output) interface, generates an interrupt to a processor when the I/O completes. This event in turn causes the processor to enter an interrupt handler, so subsequent processing can be done in the operating system based on the source and cause of the interrupt.
UNIX guru W. Richard Stevens aptly describes signals as software interrupts. When a signal is sent to a process or thread, a signal handler may be entered (depending on the current disposition of the signal), which is similar to the system entering an interrupt handler as the result of receiving an interrupt.
Operating system signals actually have quite a history of design changes in the signal code and various implementations of UNIX. This was due in part to some deficiencies in the early implementation of signals, as well as the parallel development work done on different versions of UNIX, primarily BSD UNIX and AT&T System V. James Cox, Berny Goodheart and W. Richard Stevens cover these details in their respective well-known books, so they don't need to be repeated here.
Implementation of correct and reliable signals has been in place for many years now, where an installed signal handler remains persistent and is not reset by the kernel. The POSIX standards provided a fairly well-defined set of interfaces for using signals in code, and today the Linux implementation of signals is fully POSIX-compliant. Note that reliable signals require the use of the newer sigaction interface, as opposed to the traditional signal call.
The occurrence of a signal may be synchronous or asynchronous to the process or thread, depending on the source of the signal and the underlying reason or cause. Synchronous signals occur as a direct result of the executing instruction stream, where an unrecoverable error (such as an illegal instruction or illegal address reference) requires an immediate termination of the process. Such signals are directed to the thread which caused the error with its execution stream. As an error of this type causes a trap into a kernel trap handler, synchronous signals are sometimes referred to as traps.
Asynchronous signals are external to (and in some cases, unrelated to) the current execution context. One obvious example would be the sending of a signal to a process from another process or thread via a kill(2), _lwp_kill(2) or sigsend(2) system call, or a thr_kill(3T), pthread_kill(3T) or sigqueue(3R) library invocation. Asynchronous signals are also aptly referred to as interrupts.
Every signal has a unique signal name, an abbreviation that begins with SIG (SIGINT for interrupt signal, for example) and a corresponding signal number. Additionally, for all possible signals, the system defines a default disposition or action to take when a signal occurs. There are four possible default dispositions:
* Exit: forces the process to exit.
* Core: forces the process to exit and create a core file.
* Stop: stops the process.
* Ignore: ignores the signal; no action taken.
A signal's disposition within a process's context defines what action the system will take on behalf of the process when a signal is delivered. All threads and LWPs (lightweight processes) within a process share the signal disposition, which is processwide and cannot be unique among threads within the same process.
Signal Table in my lat blog provides a complete list of signals, along with a description and default action. The data structures in the kernel to support signals in Linux are to be found in the task structure. Here are the most common elements of said structure pertaining to signals:
* current-->sig are the signal handlers.
* sigmask_lock is a per-thread spinlock which protects the signal queue and atomicity of other signal operations.
* current-signal and current-blocked contain a bitmask (currently 64 bits long, but freely expandable) of pending and permanently blocked signals.
* sigqueue and sigqueue_tail is a double-linked list of pending signals—Linux has RT signals which can be queued as well. “Traditional” signals are internally mapped to RT signals.
Signal Description and Default Action
The disposition of a signal can be changed from its default, and a process can arrange to catch a signal and invoke a signal-handling routine of its own or ignore a signal that may not have a default disposition of Ignore. The only exceptions are SIGKILL and SIGSTOP; their default dispositions cannot be changed. The interfaces for defining and changing signal disposition are the signal and sigset libraries and the sigaction system call. Signals can also be blocked, which means the process has temporarily prevented delivery of a signal. Generation of a signal that has been blocked will result in the signal remaining as pending to the process until it is explicitly unblocked or the disposition is changed to Ignore. The sigprocmask system call will set or get a process's signal mask, the bit array inspected by the kernel to determine if a signal is blocked or not. thr_setsigmask and pthread_sigmask are the equivalent interfaces for setting and retrieving the signal mask at the user-threads level.
I mentioned earlier that a signal may originate from several different places for a variety of different reasons. The first three signals listed in Table 1—SIGHUP, SIGINT and SIGQUIT—are generated by a keyboard entry from the controlling terminal (SIGINT and SIGHUP) or if the control terminal becomes disconnected (SIGHUP—use of the nohup command makes processes “immune” from hangups by setting the disposition of SIGHUP to Ignore).
Other terminal I/O-related signals include SIGSTOP, SIGTTIN, SIGTTOU and SIGTSTP. For the signals originating from a keyboard command, the actual key sequence that generates the signals, usually CTRL-C, is defined within the parameters of the terminal session, typically via stty(1) which results in a SIGINT being sent to a process, and has a default disposition of Exit.
User tasks in Linux, created via explicit calls to either thr_create or pthread_create, all have their own signal masks. Linux threads call clone with CLONE_SIGHAND; this shares all signal handlers between threads via sharing the current->sig pointer. Delivered signals are unique to a thread.
In some operating systems, such as Solaris 7, signals generated as a result of a trap (SIGFPE, SIGILL, etc.) are sent to the thread that caused the trap. Asynchronous signals are delivered to the first thread found not blocking the signal. In Linux, it is almost exactly the same. Synchronous signals happening in the context of a given thread are delivered to that thread.
Asynchronous in-kernel signals (e.g., asynchronous network I/O) is delivered to the thread that generated the asynchronous I/O. Explicit user-generated signals get delivered to the right thread as well. However, if CLONE_PID is used, all places that use the PID to deliver a signal will behave in a “weird” way; the signal gets randomly delivered to the first thread in the pidhash. Linux threads don't use CLONE_PID, so there is no such problem if you are using the pthreads.h thread API.
When a signal is sent to a user task, for example, when a user-space program accesses an illegal page, the following happens:
* page_fault (entry.S) in the low-level page-fault handler.
* do_page_fault (fault.c) fetches i386-specific parameters of the fault and does basic validation of the memory range involved.
* handle_mm_fault (memory.c) is generic MM (memory management) code (i386-independent), which gets called only if the memory range (VMA) exists. The MM reads the page table entry and uses the VMA to find out whether the memory access is legal or not.
{
int fault = handle_mm_fault(tsk, vma, address, write);
if (fault < 0) goto out_of_memory;
if (!fault) goto do_sigbus;
}
...
do_sigbus:
up(&mm->mmap_sem);
/*
* Send a sigbus, regardless of whether we
* were in kernel or user mode.
*/
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
force_sig(SIGBUS, tsk);
Listing 1
The case we are interested in now is when the access was illegal (e.g., a write was attempted to a read-only mapping): handle_mm_fault returns 0 to do_page_fault in this case. As you can see from Listing 1, locking of the MM is very finely grained (and it better be this way); the mm->mmap_sem, per-MM semaphore, is used (which typically varies from process to process).
force_sig(SIGBUS,current) is used to “force” the SIGBUS signal on the faulting task. force_sig delivers the signal even if the process has attempted to ignore SIGBUS.
force_sig fills out the signal event structure and queues it into the process's signal queue (current->sigqueue and current->sigqueue_tail). The signal queue holds an indefinite number of queued signals. The semantics of “classic” signals are that follow-up signals are ignored—this is emulated in the signal code kernel/signal.c. “Generic” (or RT) signals can be queued arbitrarily; there are reasonable limits to the length of the signal queue.
The signal is queued, and current-signal is updated. Now comes the tricky part: the kernel returns to user space. Return to user space happens from do_page_fault=>page_fault (entry.S), then the low-level exit code in entry.S is executed in this order:
page_fault=>(called do_page_fault)=>error_code=>
ret_from_exception=>(checks if return to user space)=>
ret_with_reschedule=>(sees that current->signal is nonzero)
=>calls do_signal
Next, do_signal unqueues the signal to be executed. In this case, it's SIGBUS.
Then handle_signal is called with the “unqueued” signal (which can potentially hold extra event information in case of real-time signals/messages).
Next called is setup_frame, where all user-space registers are saved and the kernel stack frame return address is modified to point to the handler of the installed signal handler. A small sequence of code jumper is put on the user stack (obviously, the code first makes sure the user stack is valid) which will return us to kernel space once the signal handler has finished. (See Listing 2.)
{
err |= __put_user(frame->retcode, &frame->pretcode);
/* This is popl %eax ; movl $,%eax ; int $0x80 */
err |= __put_user(0xb858, (short *)(frame->retcode+0));
err |= __put_user(__NR_sigreturn, (int *)(frame->retcode+2));
err |= __put_user(0x80cd, (short *)(frame->retcode+6));
}
Listing 2
Careful: this area is one of the least-understood pieces of the Linux kernel, and for good reason; it is really tough code to read and follow.
The popl %eax ; movl $,%eax ; int $0x80 x86 assembly sequence calls sys_sigret, which later on will restore the kernel stack frame return address to point to the original (faulting) user address.
What is all this magic good for? Well, first the kernel has to guarantee that signal handlers get called properly and the original state is restored. The kernel also has to deal with binary compatibility issues. Linux guarantees that on the IA-32 (Intel x86) architecture, we can run any iBC86-compliant binary code. Speed is also an issue.
restore_all:
RESTORE_ALL
#define RESTORE_ALL \
popl %ebx; \
popl %ecx; \
popl %edx; \
popl %esi; \
popl %edi; \
popl %ebp; \
popl %eax; \
1: popl %ds; \
2: popl %es; \
addl $4,%esp; \
3: iret;
Listing 3
Finally, we return to entry.S again, but current-signal is already cleared, so we do not execute do_signal but jump to restore_all as shown in Listing 3. restore.all executes the “iret” that brings us into user space. Suddenly, we are magically executing the signal handler.
Did you get lost yet? No? Here is some more magic. Once the signal handler finishes (it does an assembly “ret” like all well-behaving functions), it will execute the small jumper function we have set up on the user stack. Again we return to the kernel, but now we execute the sys_sigreturn system call, which lives in arch/i386/kernel/signal.c as well. It essentially executes the following code section:
if (restore_sigcontext(regs, &frame->sc, &eax))
goto badframe;
return eax;
The above code restores the exact user-register contents into the kernel stack frame (including the return address and flags register) and executes a normal ret_from_syscall, bringing us back to the original faulting code. Hopefully the SIGBUS handler has fixed the problem of why we were faulting.
Now, while reading the above description, you might think this is awfully complex and slow. It actually isn't; lmbench reveals that Linux has the fastest signal-handler installation and execution performance by far of any UNIX running:
moon:~/l> ./lat_sig install
Signal handler installation: 1.688 microseconds
moon:~/l> ./lat_sig catch
Signal handler overhead: 3.186 microseconds
Best of all, it scales linearly on SMP:
moon:~/l> ./lat_sig catch & ./lat_sig catch &
Signal handler overhead: 3.264 microseconds
Signal handler overhead: 3.248 microseconds
moon:~/l> ./lat_sig install & ./lat_sig install &
Signal handler installation: 1.721 microseconds
Signal handler installation: 1.689 microseconds
Signals and Interrupts, A Perfect Couple
Signals can be sent from system calls, interrupts and bottom-half handlers (see sidebar) alike; there is no difference. In other words, the Linux signal queue is interrupt-safe, as strange and recursive as that sounds, so it's fairly flexible.
Bottom-Half Handlers
An interesting signal-delivery case, however, is on SMP. Imagine a thread is executing on one processor, and it gets an asynchronous event (e.g., synchronous socket I/O signal) from an IRQ handler (or another process) on another CPU. In that case, we send a cross-CPU message to the running process, so there is no latency in signal delivery. (The speed of cross-CPU delivery is about five microseconds on a Pentium II 350MHz.)
Conclusions
Once again, we notice how Linux is actually the technology leader in important kernel aspects such as scheduling, interrupt handling and signals handling. This also proves the conjecture that the Linux developer community is collectively more capable and more resourceful than any private corporation's R&D department could ever be.
UNIX guru W. Richard Stevens aptly describes signals as software interrupts. When a signal is sent to a process or thread, a signal handler may be entered (depending on the current disposition of the signal), which is similar to the system entering an interrupt handler as the result of receiving an interrupt.
Operating system signals actually have quite a history of design changes in the signal code and various implementations of UNIX. This was due in part to some deficiencies in the early implementation of signals, as well as the parallel development work done on different versions of UNIX, primarily BSD UNIX and AT&T System V. James Cox, Berny Goodheart and W. Richard Stevens cover these details in their respective well-known books, so they don't need to be repeated here.
Implementation of correct and reliable signals has been in place for many years now, where an installed signal handler remains persistent and is not reset by the kernel. The POSIX standards provided a fairly well-defined set of interfaces for using signals in code, and today the Linux implementation of signals is fully POSIX-compliant. Note that reliable signals require the use of the newer sigaction interface, as opposed to the traditional signal call.
The occurrence of a signal may be synchronous or asynchronous to the process or thread, depending on the source of the signal and the underlying reason or cause. Synchronous signals occur as a direct result of the executing instruction stream, where an unrecoverable error (such as an illegal instruction or illegal address reference) requires an immediate termination of the process. Such signals are directed to the thread which caused the error with its execution stream. As an error of this type causes a trap into a kernel trap handler, synchronous signals are sometimes referred to as traps.
Asynchronous signals are external to (and in some cases, unrelated to) the current execution context. One obvious example would be the sending of a signal to a process from another process or thread via a kill(2), _lwp_kill(2) or sigsend(2) system call, or a thr_kill(3T), pthread_kill(3T) or sigqueue(3R) library invocation. Asynchronous signals are also aptly referred to as interrupts.
Every signal has a unique signal name, an abbreviation that begins with SIG (SIGINT for interrupt signal, for example) and a corresponding signal number. Additionally, for all possible signals, the system defines a default disposition or action to take when a signal occurs. There are four possible default dispositions:
* Exit: forces the process to exit.
* Core: forces the process to exit and create a core file.
* Stop: stops the process.
* Ignore: ignores the signal; no action taken.
A signal's disposition within a process's context defines what action the system will take on behalf of the process when a signal is delivered. All threads and LWPs (lightweight processes) within a process share the signal disposition, which is processwide and cannot be unique among threads within the same process.
Signal Table in my lat blog provides a complete list of signals, along with a description and default action. The data structures in the kernel to support signals in Linux are to be found in the task structure. Here are the most common elements of said structure pertaining to signals:
* current-->sig are the signal handlers.
* sigmask_lock is a per-thread spinlock which protects the signal queue and atomicity of other signal operations.
* current-signal and current-blocked contain a bitmask (currently 64 bits long, but freely expandable) of pending and permanently blocked signals.
* sigqueue and sigqueue_tail is a double-linked list of pending signals—Linux has RT signals which can be queued as well. “Traditional” signals are internally mapped to RT signals.
Signal Description and Default Action
The disposition of a signal can be changed from its default, and a process can arrange to catch a signal and invoke a signal-handling routine of its own or ignore a signal that may not have a default disposition of Ignore. The only exceptions are SIGKILL and SIGSTOP; their default dispositions cannot be changed. The interfaces for defining and changing signal disposition are the signal and sigset libraries and the sigaction system call. Signals can also be blocked, which means the process has temporarily prevented delivery of a signal. Generation of a signal that has been blocked will result in the signal remaining as pending to the process until it is explicitly unblocked or the disposition is changed to Ignore. The sigprocmask system call will set or get a process's signal mask, the bit array inspected by the kernel to determine if a signal is blocked or not. thr_setsigmask and pthread_sigmask are the equivalent interfaces for setting and retrieving the signal mask at the user-threads level.
I mentioned earlier that a signal may originate from several different places for a variety of different reasons. The first three signals listed in Table 1—SIGHUP, SIGINT and SIGQUIT—are generated by a keyboard entry from the controlling terminal (SIGINT and SIGHUP) or if the control terminal becomes disconnected (SIGHUP—use of the nohup command makes processes “immune” from hangups by setting the disposition of SIGHUP to Ignore).
Other terminal I/O-related signals include SIGSTOP, SIGTTIN, SIGTTOU and SIGTSTP. For the signals originating from a keyboard command, the actual key sequence that generates the signals, usually CTRL-C, is defined within the parameters of the terminal session, typically via stty(1) which results in a SIGINT being sent to a process, and has a default disposition of Exit.
User tasks in Linux, created via explicit calls to either thr_create or pthread_create, all have their own signal masks. Linux threads call clone with CLONE_SIGHAND; this shares all signal handlers between threads via sharing the current->sig pointer. Delivered signals are unique to a thread.
In some operating systems, such as Solaris 7, signals generated as a result of a trap (SIGFPE, SIGILL, etc.) are sent to the thread that caused the trap. Asynchronous signals are delivered to the first thread found not blocking the signal. In Linux, it is almost exactly the same. Synchronous signals happening in the context of a given thread are delivered to that thread.
Asynchronous in-kernel signals (e.g., asynchronous network I/O) is delivered to the thread that generated the asynchronous I/O. Explicit user-generated signals get delivered to the right thread as well. However, if CLONE_PID is used, all places that use the PID to deliver a signal will behave in a “weird” way; the signal gets randomly delivered to the first thread in the pidhash. Linux threads don't use CLONE_PID, so there is no such problem if you are using the pthreads.h thread API.
When a signal is sent to a user task, for example, when a user-space program accesses an illegal page, the following happens:
* page_fault (entry.S) in the low-level page-fault handler.
* do_page_fault (fault.c) fetches i386-specific parameters of the fault and does basic validation of the memory range involved.
* handle_mm_fault (memory.c) is generic MM (memory management) code (i386-independent), which gets called only if the memory range (VMA) exists. The MM reads the page table entry and uses the VMA to find out whether the memory access is legal or not.
{
int fault = handle_mm_fault(tsk, vma, address, write);
if (fault < 0) goto out_of_memory;
if (!fault) goto do_sigbus;
}
...
do_sigbus:
up(&mm->mmap_sem);
/*
* Send a sigbus, regardless of whether we
* were in kernel or user mode.
*/
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
force_sig(SIGBUS, tsk);
Listing 1
The case we are interested in now is when the access was illegal (e.g., a write was attempted to a read-only mapping): handle_mm_fault returns 0 to do_page_fault in this case. As you can see from Listing 1, locking of the MM is very finely grained (and it better be this way); the mm->mmap_sem, per-MM semaphore, is used (which typically varies from process to process).
force_sig(SIGBUS,current) is used to “force” the SIGBUS signal on the faulting task. force_sig delivers the signal even if the process has attempted to ignore SIGBUS.
force_sig fills out the signal event structure and queues it into the process's signal queue (current->sigqueue and current->sigqueue_tail). The signal queue holds an indefinite number of queued signals. The semantics of “classic” signals are that follow-up signals are ignored—this is emulated in the signal code kernel/signal.c. “Generic” (or RT) signals can be queued arbitrarily; there are reasonable limits to the length of the signal queue.
The signal is queued, and current-signal is updated. Now comes the tricky part: the kernel returns to user space. Return to user space happens from do_page_fault=>page_fault (entry.S), then the low-level exit code in entry.S is executed in this order:
page_fault=>(called do_page_fault)=>error_code=>
ret_from_exception=>(checks if return to user space)=>
ret_with_reschedule=>(sees that current->signal is nonzero)
=>calls do_signal
Next, do_signal unqueues the signal to be executed. In this case, it's SIGBUS.
Then handle_signal is called with the “unqueued” signal (which can potentially hold extra event information in case of real-time signals/messages).
Next called is setup_frame, where all user-space registers are saved and the kernel stack frame return address is modified to point to the handler of the installed signal handler. A small sequence of code jumper is put on the user stack (obviously, the code first makes sure the user stack is valid) which will return us to kernel space once the signal handler has finished. (See Listing 2.)
{
err |= __put_user(frame->retcode, &frame->pretcode);
/* This is popl %eax ; movl $,%eax ; int $0x80 */
err |= __put_user(0xb858, (short *)(frame->retcode+0));
err |= __put_user(__NR_sigreturn, (int *)(frame->retcode+2));
err |= __put_user(0x80cd, (short *)(frame->retcode+6));
}
Listing 2
Careful: this area is one of the least-understood pieces of the Linux kernel, and for good reason; it is really tough code to read and follow.
The popl %eax ; movl $,%eax ; int $0x80 x86 assembly sequence calls sys_sigret, which later on will restore the kernel stack frame return address to point to the original (faulting) user address.
What is all this magic good for? Well, first the kernel has to guarantee that signal handlers get called properly and the original state is restored. The kernel also has to deal with binary compatibility issues. Linux guarantees that on the IA-32 (Intel x86) architecture, we can run any iBC86-compliant binary code. Speed is also an issue.
restore_all:
RESTORE_ALL
#define RESTORE_ALL \
popl %ebx; \
popl %ecx; \
popl %edx; \
popl %esi; \
popl %edi; \
popl %ebp; \
popl %eax; \
1: popl %ds; \
2: popl %es; \
addl $4,%esp; \
3: iret;
Listing 3
Finally, we return to entry.S again, but current-signal is already cleared, so we do not execute do_signal but jump to restore_all as shown in Listing 3. restore.all executes the “iret” that brings us into user space. Suddenly, we are magically executing the signal handler.
Did you get lost yet? No? Here is some more magic. Once the signal handler finishes (it does an assembly “ret” like all well-behaving functions), it will execute the small jumper function we have set up on the user stack. Again we return to the kernel, but now we execute the sys_sigreturn system call, which lives in arch/i386/kernel/signal.c as well. It essentially executes the following code section:
if (restore_sigcontext(regs, &frame->sc, &eax))
goto badframe;
return eax;
The above code restores the exact user-register contents into the kernel stack frame (including the return address and flags register) and executes a normal ret_from_syscall, bringing us back to the original faulting code. Hopefully the SIGBUS handler has fixed the problem of why we were faulting.
Now, while reading the above description, you might think this is awfully complex and slow. It actually isn't; lmbench reveals that Linux has the fastest signal-handler installation and execution performance by far of any UNIX running:
moon:~/l> ./lat_sig install
Signal handler installation: 1.688 microseconds
moon:~/l> ./lat_sig catch
Signal handler overhead: 3.186 microseconds
Best of all, it scales linearly on SMP:
moon:~/l> ./lat_sig catch & ./lat_sig catch &
Signal handler overhead: 3.264 microseconds
Signal handler overhead: 3.248 microseconds
moon:~/l> ./lat_sig install & ./lat_sig install &
Signal handler installation: 1.721 microseconds
Signal handler installation: 1.689 microseconds
Signals and Interrupts, A Perfect Couple
Signals can be sent from system calls, interrupts and bottom-half handlers (see sidebar) alike; there is no difference. In other words, the Linux signal queue is interrupt-safe, as strange and recursive as that sounds, so it's fairly flexible.
Bottom-Half Handlers
An interesting signal-delivery case, however, is on SMP. Imagine a thread is executing on one processor, and it gets an asynchronous event (e.g., synchronous socket I/O signal) from an IRQ handler (or another process) on another CPU. In that case, we send a cross-CPU message to the running process, so there is no latency in signal delivery. (The speed of cross-CPU delivery is about five microseconds on a Pentium II 350MHz.)
Conclusions
Once again, we notice how Linux is actually the technology leader in important kernel aspects such as scheduling, interrupt handling and signals handling. This also proves the conjecture that the Linux developer community is collectively more capable and more resourceful than any private corporation's R&D department could ever be.
Tuesday, April 15, 2008
Basics of unix signals
SIGNALS
Signals offer another way to transition between Kernel and User Space. While system calls are synchronous calls originating from User Space, signals are asynchronous messages coming from Kernel space. Signals are always delivered by the Kernel but they can be initiated by:
* other processes on the system (using the kill command/system call)
* the process itself. This includes hardware exceptions triggered by a process: when a program executes an illegal instruction, such as dividing a number by zero or attempting to access a memory zone that has not been allocated yet, the hardware detects it and a signal is sent to the faulty program.
* the Kernel. The Kernel also use signals to notify a process of some system events, such as the arrival of out-of-band data. In the same way, when a program sets a system alarm, the Kernel sends a signal to the process every time a timer expires (e.g. every 10 seconds).
So a signal is an asynchronous message, but what happens exactly when a process receives it? Well… it depends. For each signal, a process can instruct the Kernel to either:
*Ignore this signal: In which case this signal has absolutely no effect on the process. Ignoring the signal must be explicitly requested before the signal is delivered. Also, some signals cannot be ignored.
*Catch this signal: In which case the Kernel will call a custom routine, as defined by this process when delivering the signal. The process must explicitly register this custom routine before the signal is delivered. The signal-catching function is traditionally called a custom signal handler.
*Let the default action apply: For each signal the system defines a default action that will be called if the process did not explicitly request to ignore or catch this signal. The default signal handler typically but not always terminates the process (we will cover the default actions for all common UNIX signals later in this article). Letting the default action apply is the implicit system behavior, but it can also be requested explicitly by a process.
The overall dynamic for signal delivery is quite simple:
1. When a process receives a signal that is not ignored, the program immediately interrupts its current execution flow5.
2. Control is then transferred to a dedicated signal handler, a custom one defined by the process or the system default.
3. Once the signal handler completes, the program resumes where it was originally interrupted.
In practice, though, the mechanics used by the Kernel to send a signal are more involved and consist of two distinct steps: generating and delivering the signal.
The Kernel generates a signal for a process simply by setting a flag that indicates the type of the signal that was received. More precisely, each process has a dedicated bitfield used to store pending signals; For the system, generating a signal is just a matter of updating the bit corresponding to the signal type in this bitfield structure. At this stage, the signal is said to be pending.
Before transferring control back to a process in user mode, the Kernel always checks the pending signals for this process. This check must happen in Kernel space because some signals can never be ignored by a process – namely SIGSTOP and SIGKILL (you trigger SIGKILL with the infamous kill -9 command).
When a pending signal is detected by the Kernel, the system will deliver the signal by performing one of the following actions:
* if the signal is SIGKILL the system does not switch back to user mode. It processes the signal in Kernel mode and terminates the process. This is why kill -9 is such a bulletproof way to terminate a misbehaving process.
* if the signal is SIGSTOP the system also stays in Kernel mode. Its suspends the process and puts it to sleep.
* if the process did not register any custom handler for this signal, the default system action is taken. If the default action is to ignore the signal, no action is taken, and the system just switches back to user mode and transfers control to the process. If the default action is not to ignore the signal, the system remains in Kernel mode and the process will exit, dump core, or be suspended. For instance, the default behavior for the SIGSEGV signal is to dump a core file and terminate the process, so that one can analyze the bug that triggered the segmentation fault.
* if the process registered a custom handler for the signal, the Kernel transfers control to the process and the custom signal handler is executed in user mode. At this point, the program is the one responsible for handling the signal properly.
A crucial point here is to realize that the Kernel triggers the signal handler, when the signal is delivered, not when the signal is generated. As signal delivery only happens when the system schedules the target process as active in a multitasking system (just before switching back to User Mode) there can be a significant delay between signal generation and delivery.
Finally a process has one last option when it comes to signals. It can instruct the Kernel to block the delivery of a specific signal. If a signal is blocked, the system still generates it and the signal is considered pending. Nevertheless the Kernel will not deliver a blocked signal until the process unblocks it. Signal blocking is typically used in critical sections of code that must not be interrupted.
Name Number Default Action Semantics
SIGHUP 1 Terminate Hangup detected on controlling terminal or death of controlling process
SIGINT 2 Terminate Interrupt from keyboard. Usually terminate the process. Can be triggered by Ctrl-C
SIGQUIT 3 Core dump Quit from keyboard. Usually causes the process to terminate and dump core. Cab be triggered by Ctrl-\
SIGILL 4 Core dump The process has executed an illegal hardware instruction.
SIGTRAP 5 Core dump Trace/breakpoint trap. Hardware fault.
SIGABRT 6 Core dump Abort signal from abort(3)
SIGFPE 8 Core dump Floating point exception such as dividing by zero or a floating point overflow.
SIGKILL 9 Terminate Sure way to terminate (kill) a process. Cannot be caught or ignored.
SIGSEGV 11 Core dump The process attempted to access an invalid memory reference.
SIGPIPE 13 Terminate Broken pipe: Sent to a process writing to a pipe or a socket with no reader (most likely the reader has terminated).
SIGALRM 14 Terminate Timer signal from alarm(2)
SIGTERM 15 Terminate Termination signal. The kill command send this signal by default, when no explicit signal type is provided.
SIGUSR1 30,10,16 Terminate First user-defined signal, designed to be used by application programs which can freely define its semantics.
SIGUSR2 31,12,17 Terminate Second user-defined signal, designed to be used by application programs which can freely define its semantics.
SIGCHLD 20,17,18 Ignore Child stopped or terminated
SIGCONT 19,18,25 Continue / Ignore Continue if stopped
SIGSTOP 17,19,23 Stop Sure way to stop a process: cannot be caught or ignored. Used for non interactive job-control while SIGSTP is the interactive stop signal.
SIGTSTP 18,20,24 Stop Interactive signal used to suspend process execution. Usually generated by typing Ctrl-Z in a terminal.
SIGTTIN 21,21,26 Stop A background process attempt to read from its controlling terminal (tty input).
SIGTTOU 22,22,27 Stop A background process attempt to write to its controlling terminal (tty output).
SIGIO 23,29,22 Terminate Asynchronous I/O now event.
SIGBUS 10,7,10 Core dump Bus error (bad memory access)
SIGPOLL Terminate Signals an event on a pollable device.
SIGPROF 27,27,29 Terminate Expiration of a profiling timer set with setitimer.
SIGSYS 12,-,12 Core dump Invalid system call. The Kernel interpreted a processor instruction as a system call, but its argument is invalid.
SIGURG 16,23,21 Ignore Urgent condition on socket (e.g. out-of-band data).
SIGVTALRM 26,26,28 Terminate Expiration of a virtual interval timer set with setitimer.
SIGXCPU 24,24,30 Core dump CPU soft time limit exceeded (Resource limits).
SIGXFSZ 25,25,31 Core dump File soft size limit exceeded (Resource limits).
SIGWINCH 28,28,20 Ignore Informs a process of a change in associated terminal window size.
Signals offer another way to transition between Kernel and User Space. While system calls are synchronous calls originating from User Space, signals are asynchronous messages coming from Kernel space. Signals are always delivered by the Kernel but they can be initiated by:
* other processes on the system (using the kill command/system call)
* the process itself. This includes hardware exceptions triggered by a process: when a program executes an illegal instruction, such as dividing a number by zero or attempting to access a memory zone that has not been allocated yet, the hardware detects it and a signal is sent to the faulty program.
* the Kernel. The Kernel also use signals to notify a process of some system events, such as the arrival of out-of-band data. In the same way, when a program sets a system alarm, the Kernel sends a signal to the process every time a timer expires (e.g. every 10 seconds).
So a signal is an asynchronous message, but what happens exactly when a process receives it? Well… it depends. For each signal, a process can instruct the Kernel to either:
*Ignore this signal: In which case this signal has absolutely no effect on the process. Ignoring the signal must be explicitly requested before the signal is delivered. Also, some signals cannot be ignored.
*Catch this signal: In which case the Kernel will call a custom routine, as defined by this process when delivering the signal. The process must explicitly register this custom routine before the signal is delivered. The signal-catching function is traditionally called a custom signal handler.
*Let the default action apply: For each signal the system defines a default action that will be called if the process did not explicitly request to ignore or catch this signal. The default signal handler typically but not always terminates the process (we will cover the default actions for all common UNIX signals later in this article). Letting the default action apply is the implicit system behavior, but it can also be requested explicitly by a process.
The overall dynamic for signal delivery is quite simple:
1. When a process receives a signal that is not ignored, the program immediately interrupts its current execution flow5.
2. Control is then transferred to a dedicated signal handler, a custom one defined by the process or the system default.
3. Once the signal handler completes, the program resumes where it was originally interrupted.
In practice, though, the mechanics used by the Kernel to send a signal are more involved and consist of two distinct steps: generating and delivering the signal.
The Kernel generates a signal for a process simply by setting a flag that indicates the type of the signal that was received. More precisely, each process has a dedicated bitfield used to store pending signals; For the system, generating a signal is just a matter of updating the bit corresponding to the signal type in this bitfield structure. At this stage, the signal is said to be pending.
Before transferring control back to a process in user mode, the Kernel always checks the pending signals for this process. This check must happen in Kernel space because some signals can never be ignored by a process – namely SIGSTOP and SIGKILL (you trigger SIGKILL with the infamous kill -9 command).
When a pending signal is detected by the Kernel, the system will deliver the signal by performing one of the following actions:
* if the signal is SIGKILL the system does not switch back to user mode. It processes the signal in Kernel mode and terminates the process. This is why kill -9 is such a bulletproof way to terminate a misbehaving process.
* if the signal is SIGSTOP the system also stays in Kernel mode. Its suspends the process and puts it to sleep.
* if the process did not register any custom handler for this signal, the default system action is taken. If the default action is to ignore the signal, no action is taken, and the system just switches back to user mode and transfers control to the process. If the default action is not to ignore the signal, the system remains in Kernel mode and the process will exit, dump core, or be suspended. For instance, the default behavior for the SIGSEGV signal is to dump a core file and terminate the process, so that one can analyze the bug that triggered the segmentation fault.
* if the process registered a custom handler for the signal, the Kernel transfers control to the process and the custom signal handler is executed in user mode. At this point, the program is the one responsible for handling the signal properly.
A crucial point here is to realize that the Kernel triggers the signal handler, when the signal is delivered, not when the signal is generated. As signal delivery only happens when the system schedules the target process as active in a multitasking system (just before switching back to User Mode) there can be a significant delay between signal generation and delivery.
Finally a process has one last option when it comes to signals. It can instruct the Kernel to block the delivery of a specific signal. If a signal is blocked, the system still generates it and the signal is considered pending. Nevertheless the Kernel will not deliver a blocked signal until the process unblocks it. Signal blocking is typically used in critical sections of code that must not be interrupted.
Name Number Default Action Semantics
SIGHUP 1 Terminate Hangup detected on controlling terminal or death of controlling process
SIGINT 2 Terminate Interrupt from keyboard. Usually terminate the process. Can be triggered by Ctrl-C
SIGQUIT 3 Core dump Quit from keyboard. Usually causes the process to terminate and dump core. Cab be triggered by Ctrl-\
SIGILL 4 Core dump The process has executed an illegal hardware instruction.
SIGTRAP 5 Core dump Trace/breakpoint trap. Hardware fault.
SIGABRT 6 Core dump Abort signal from abort(3)
SIGFPE 8 Core dump Floating point exception such as dividing by zero or a floating point overflow.
SIGKILL 9 Terminate Sure way to terminate (kill) a process. Cannot be caught or ignored.
SIGSEGV 11 Core dump The process attempted to access an invalid memory reference.
SIGPIPE 13 Terminate Broken pipe: Sent to a process writing to a pipe or a socket with no reader (most likely the reader has terminated).
SIGALRM 14 Terminate Timer signal from alarm(2)
SIGTERM 15 Terminate Termination signal. The kill command send this signal by default, when no explicit signal type is provided.
SIGUSR1 30,10,16 Terminate First user-defined signal, designed to be used by application programs which can freely define its semantics.
SIGUSR2 31,12,17 Terminate Second user-defined signal, designed to be used by application programs which can freely define its semantics.
SIGCHLD 20,17,18 Ignore Child stopped or terminated
SIGCONT 19,18,25 Continue / Ignore Continue if stopped
SIGSTOP 17,19,23 Stop Sure way to stop a process: cannot be caught or ignored. Used for non interactive job-control while SIGSTP is the interactive stop signal.
SIGTSTP 18,20,24 Stop Interactive signal used to suspend process execution. Usually generated by typing Ctrl-Z in a terminal.
SIGTTIN 21,21,26 Stop A background process attempt to read from its controlling terminal (tty input).
SIGTTOU 22,22,27 Stop A background process attempt to write to its controlling terminal (tty output).
SIGIO 23,29,22 Terminate Asynchronous I/O now event.
SIGBUS 10,7,10 Core dump Bus error (bad memory access)
SIGPOLL Terminate Signals an event on a pollable device.
SIGPROF 27,27,29 Terminate Expiration of a profiling timer set with setitimer.
SIGSYS 12,-,12 Core dump Invalid system call. The Kernel interpreted a processor instruction as a system call, but its argument is invalid.
SIGURG 16,23,21 Ignore Urgent condition on socket (e.g. out-of-band data).
SIGVTALRM 26,26,28 Terminate Expiration of a virtual interval timer set with setitimer.
SIGXCPU 24,24,30 Core dump CPU soft time limit exceeded (Resource limits).
SIGXFSZ 25,25,31 Core dump File soft size limit exceeded (Resource limits).
SIGWINCH 28,28,20 Ignore Informs a process of a change in associated terminal window size.
Subscribe to:
Posts (Atom)