Monday, September 8, 2008

Good Networking Q and A.

* Q: What should happen upon receipt of a RST packet containing data?
* A: The host should accept it. It was suggested that a RST could contain ASCII text explaining the error, but no standard was ever established.
* Credit: Kris Katterjohn

* Q: What is the longest acceptable delay for an ACK?
* A: 0.5 seconds.
* Credit: Kris Katterjohn

* Q: An ICMP error message must include what data from the original erring datagram?
* A: The IP header and at least 8 bytes of payload.
* Credit: Kris Katterjohn

* Q: What is the major difference between Windows tracert and Van Jacobson-style UNIX traceroute?
* A: Windows tracert uses ICMP where Van Jacobson-style traceroute uses UDP.
* Credit: Chris Royle

* Q: Name three modern OSes that still use the mbuf structure, described in detail in TCP/IP Illustrated.
* A: FreeBSD, OpenBSD, NetBSD
* Credit: Sam Elstob

* Q: What is cheapernet?
* A: Coaxial cable with a diameter no longer suited to hit burglars with. The original ethernet cables were 1/2" in diameter and quite rigid.
* Credit: Jörn Engel

* Q: How many colours are in an 8-wire Cat5 cable?
* A: Four. The other four wires are colour/white striped.
* Credit: Jörn Engel

* Q: What are the four colours in Cat5 cables?
* A: Orange, Blue, Brown, Green
* Credit: Jörn Engel

* Q: How does a Windows machine react if ports above 1080 are blocked by a router?
* A: A regular user can surf for roughly 2 minutes after booting, then has to reboot.
* Credit: Jörn Engel

* Q: What happens if a network card's transmit interrupts are blocked in Linux?
* A: Usually nothing. But on packet loss, retransmits can get delayed indefinitely.
* Credit: Jörn Engel

* Q: Who wrote the original Ping program?
* A: Mike Muuss
* Credit: Kris Katterjohn

* Q: What's the base UDP port number used in the original Traceroute program?
* A: 33434 (32768 + 666)
* Credit: Kris Katterjohn

* Q: What's the minimum value allowed for the Header Length field in a valid IPv4 header?
* A: 5 (5 * 4 = 20)
* Credit: Kris Katterjohn

* Q: Which RFC introduced the TCP Timestamps option?
* A: RFC 1323
* Credit: Kris Katterjohn

* Q: While computing the checksum of a TCP packet, what happens to the checksum?
* A: It's replaced (or filled) with zeros.
* Credit: Kris Katterjohn

* Q: What is the Ethernet type for the Banyan Vines protocol?
* A: 0BAD
* Credit: Marcel Kaszap

* Q: Name at least two colors of packet-dropping mechanism that are NOT an acronym or an abbreviation.
* A: BLUE, GREEN, PURPLE, WHITE (RED and BLACK are acronyms).
* Credit: Jonathan Day

* Q: What was PC-IP?
* A: The first implementation of TCP/IP on an IBM PC.
* Credit: Bhyrava Prasad

* Q: What semi-famous hacker tool uses port 31337?
* A: Back Orifice, from Cult of the Dead Cow.
* Credit: Ken D'Ambrosio

* Q: What two ICMP types should never be blocked?
* A: ICMP type 3, Destination Unreachable, especially code 4, "fragmentation needed but don't fragment bit set" (necessary for path MTU discovery) and ICMP type 11, time exceeded (so you can use traceroute from inside the network and get replies).
* Credit: Arjan van de Ven

* Q: What is the typical MTU for an RFC 1149 transmission?
* A: From RFC 1149, Carrier Pigeon Internet Protocol: "The MTU is variable, and paradoxically, generally increases with increased carrier age. A typical MTU is 256 milligrams."
* Credit: Joe Nygard

* Q: What data link layer algorithm is described by an "algorhyme" in the original paper? Extra credit for reciting the first two lines.
* A: The spanning tree algorithm by Dr. Radia Perlman
* A: The first two lines are: I think that I shall never see / A graph more lovely than a tree.
* Credit: Christina Zeeh

* Q: What is the minimum length of an Ethernet packet, and why is there a minimum length?
* A: 64 bytes. It must be this long so that a collision can be detected.
* Credit: Ted T'so

* Q: What is the 'Stretch ACK violation' documented in RFC 2525?
* A: When using delayed ACKs, the receiver sends ACKs less often than once for every second full-sized (MSS) segment from the sender, potentially degrading performance.
* Credit: Rob Braun

* Q: Name at least three official DNS resource record types.
* A: Any three of A, CNAME, HINFO, MX, NS, PTR, SOA, TXT, WKS, RT, NULL, AXFR, MAILB, MAILA, KX, KEY, SIG, NXT, PX, NSAP, NSAP-PTR, RP, AFSDB, RT, GPOS, DNAME, AAAA, SRV, LOC, EID, NIMLOC, ATMA, NAPTR, CERT, SINK, OPT, APL, TKEY, TSIG, IXFR, Deprecated: MB, MD, MF, Experimental: MINFO, MR, MG, X25
* Credit: Ulrich Durholz, Rob Braun

* Q: What is the maximum amount of data in a UDP packet over IPv6?
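* A: 65527 bytes (65535 maximum payload length - 8 UDP header). The 40-byte IPv6 header is not counted in the IPv6 payload length field, and jumbograms allow even more.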
* Credit: Rob Braun

* Q: What is the minimum IPv6 datagram size that a host must be able to receive?
* A: 1280 bytes.
* Credit: Rob Braun

* Q: What is the IANA reserved Ethernet MAC address range for IP Multicast?
* A: 01:00:5e.
* Credit: Rob Braun, Michael Dupuis
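
The 01:00:5e prefix is only half of the story: the low 23 bits of the IPv4 group address fill the rest of the MAC address. The following is a minimal sketch of that mapping; the function name `multicast_mac` is made up for illustration.

```python
import ipaddress

def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF      # keep only the low 23 bits of the group
    mac = (0x01005E << 24) | low23    # 01:00:5e prefix, then a 0 bit, then 23 bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))

print(multicast_mac("224.0.1.1"))     # the NTP group
```

Because only 23 of the 28 group bits survive, 32 different groups share each MAC address; 224.0.1.1 and 224.128.1.1 map to the same one.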

* Q: Name one of the Ethernet patent (#4,063,220) holders.
* A: Robert Metcalfe, David Boggs, Charles Thacker, or Butler Lampson. (Metcalfe and Lampson are generally credited for the invention.)
* Credit: Rob Braun

* Q: What is the MAC address prefix for DECnet addresses?
* A: AA:00:04:00
* Credit: Rob Braun

* Q: Who wrote the original traceroute program?
* A: Van Jacobson
* Credit: Warren Postma

* Q: What feature of IP is central to most traceroute implementations?
* A: The TTL (Time To Live) field. Most traceroutes send packets with artificially small TTLs and use the ICMP Time Exceeded responses from intermediate hosts to trace the route to a host.
* Credit: Warren Postma

* Q: Why was traceroute originally implemented using UDP packets rather than ICMP echo requests?
* A: In 1988, many TCP/IP stacks didn't return ICMP Time Exceeded responses to ICMP packets, but would for UDP packets.
* Credit: Warren Postma

* Q: What is RED, Random Early Detection?
* A: A router queue management algorithm used for congestion avoidance. Once it detects "incipient congestion," the router randomly discards packets with a probability based on the average queue size.
* Credit: Ivar Alm
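
A rough sketch of the idea behind RED, assuming a simplified form: an exponentially weighted average of the queue length, and a drop probability that rises linearly between two thresholds. The class name, the threshold values and the weight are illustrative assumptions, and the refinement that spaces drops out by counting packets since the last drop is left out.

```python
import random

class RedQueue:
    """Simplified RED: drop decision based on the average queue size."""

    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.002, rng=None):
        self.min_th = min_th      # below this average, never drop
        self.max_th = max_th      # at or above this average, always drop
        self.max_p = max_p        # drop probability just below max_th
        self.weight = weight      # EWMA weight for the average
        self.avg = 0.0
        self.rng = rng or random.Random()

    def drop_probability(self) -> float:
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def on_arrival(self, queue_len: int) -> bool:
        """Update the average; return True if the arriving packet is dropped."""
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        return self.rng.random() < self.drop_probability()
```

The small default weight makes the average react slowly, so a short burst does not trigger drops while sustained queue growth does.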

* Q: What application uses TCP port 666?
* A: Doom.
* Credit: Jim Wilson

* Q: What is "ships in the night" routing?
* A: When you run two or more routing protocols on the same router.
* Credit: Jim Wilson

* Q: What does CRC stand for?
* A: Cyclic Redundancy Check.
* Credit: Jim Wilson

* Q: What IP network is reserved for internal testing?
* A: Anything with a netid (first octet) of 127.
* Credit: Jim Wilson

* Q: What are class D networks used for?
* A: Multicasting.
* Credit: Jim Wilson

* Q: What is bootp an abbreviation for?
* A: Bootstrap protocol.
* Credit: Jim Wilson

* Q: What is a runt packet?
* A: A packet that is shorter than the minimum packet length as defined by the protocol it is using.
* Credit: Jim Wilson

* Q: As of RFC 1349, how many values can the TOS field in an IPv4 header have?
* A: 5 (4 bit wide field, only one may be set at a time, 0 is valid).
* Credit: Jim Wilson

* Q: What is the H.323 protocol used for?
* A: Video or teleconferencing ("Packet-based multimedia communications systems").
* Credit: Jim Wilson

* Q: What OSI model layer does IP most closely resemble?
* A: The network layer, layer 3.
* Credit: Jim Wilson

* Q: Why do IP packets have a TTL (Time To Live) field?
* A: To prevent a packet being retransmitted forever in the case of a routing loop.
* Credit: Jim Wilson

* Q: What experimental protocol might be able to fulfill RFC 1122's requirement of "SHOULD: able to leap tall buildings at a single bound?"
* A: CPIP, Carrier Pigeon Internet Protocol.
* Credit: Helge Haftig

* Q: What are the Dave Clark Five?
* A: RFCs 813 through 817.
* Credit: Telsa Gwynne

* Q: What was the first remotely operated non-computer appliance to be connected to the Internet?
* A: A toaster (controlled using SNMP).
* Credit: Richard Lightman

* Q: What is CPIP?
* A: Carrier Pigeon Internet Protocol (see RFC 1149).
* Credit: SL Baur

* Q: What common multicast group uses the address 224.0.1.1?
* A: NTP (Network Time Protocol).
* Credit: Nivedita Singhvi

* Q: What is the only field that is different between a regular ARP packet and a gratuitous ARP packet?
* A: The target IP.
* Credit: Nivedita Singhvi

* Q: What error is returned if a UDP datagram is received and has a checksum error?
* A: None. It is silently discarded.
* Credit: Nivedita Singhvi

* Q: What is the minimum IP datagram size that a host must be able to receive?
* A: 576 bytes.
* Credit: Nivedita Singhvi

* Q: When is the transmitted UDP checksum 0?
* A: When the sender did not compute it.
* Credit: Nivedita Singhvi

* Q: Which is the only field used twice in the UDP checksum calculation?
* A: UDP length.
* Credit: Nivedita Singhvi

* Q: Why is a pad byte of 0 occasionally appended for the UDP checksum calculation?
* A: Because the checksum algorithm requires an even number of bytes.
* Credit: Nivedita Singhvi

* Q: What are the 5 fields of a UDP pseudoheader?
* A: Source IP, destination IP, zero, protocol, UDP length.
* Credit: Nivedita Singhvi

* Q: Which parts of the packet does the UDP checksum cover?
* A: UDP pseudoheader, UDP header, UDP data.
* Credit: Nivedita Singhvi
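
The UDP checksum answers above (pseudoheader fields, pad byte, zero checksum field, coverage) fit together as in the sketch below. The function names are made up for illustration. It assumes IPv4 and follows the usual convention that a computed checksum of 0 is sent as 0xFFFF, since 0 on the wire means "not computed".

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"                      # pad byte, used only for the sum
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> int:
    udp_len = 8 + len(payload)
    # Pseudoheader: source IP, destination IP, zero, protocol, UDP length.
    pseudo = struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                         socket.inet_aton(dst_ip), 0, socket.IPPROTO_UDP, udp_len)
    # Real UDP header with the checksum field set to zero while computing.
    header = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    csum = internet_checksum(pseudo + header + payload)
    return csum or 0xFFFF                    # 0 is reserved for "no checksum"

print(hex(udp_checksum("192.0.2.1", "198.51.100.7", 40000, 53, b"hello")))
```

Note that the UDP length appears twice in the summed data, once in the pseudoheader and once in the header itself.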

* Q: Which parts of the packet does the IP checksum cover?
* A: The IP header.
* Credit: Nivedita Singhvi

* Q: What is the maximum amount of data in a UDP packet over IPv4?
* A: 65507 bytes (65535 - 20 IP header - 8 UDP header).
* Credit: Nivedita Singhvi

* Q: Who was the first individual member of the Internet Society?
* A: Jon Postel, narrowly beating Steve Wolff.
* Credit: Matthew Wilcox

* Q: Why hasn't RFC 1149 been ratified?
* Hint: RFC 1149 specifies an unusual encapsulation of IP.
* A: The Avian Transmission Protocol has only been implemented once so far: http://www.blug.linux.no/rfc1149/
* Credit: Matthew Wilcox

* Q: How many identical acks need to be received for fast retransmit to occur?
* A: 4 (3 duplicate + original).
* Credit: Nivedita Singhvi
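
The counting can be sketched as follows, assuming a simplified model in which every repeated ACK number counts as a duplicate (a real stack also checks that the segment carries no data and does not change the window). The function name and the sample ACK numbers are made up for illustration.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the indices in `acks` where fast retransmit would trigger."""
    triggers = []
    last_ack, dup_count = None, 0
    for i, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1                   # another duplicate of last_ack
            if dup_count == threshold:       # third duplicate = fourth identical ACK
                triggers.append(i)
        else:
            last_ack, dup_count = ack, 0     # new ACK number: start counting again
    return triggers

# The ACK for 2000 arrives four times: the original plus three duplicates.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 3000]))
```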

* Q: Under what circumstances is the TCP checksum incorrect, on a well-formed, in-flight packet?
* A: When the packet is using the IP source routing option (the destination IP changes along the route, which is used to calculate the TCP checksum).
* Credit: Rusty Russell

* Q: How many bytes total are in a standard sized ICMP echo request packet?
* A: 84 bytes (56 data, 8 ICMP header, 20 IP header).
* Credit: Linda J. Laubenheimer

* Q: What does "IETF" stand for?
* A: Internet Engineering Task Force.
* Credit: Linda J. Laubenheimer

* Q: What does SLIP stand for?
* A: Serial Line Internet Protocol.
* Credit: Ben Sittler

* Q: What is the TCP retransmission ambiguity problem?
* A: An ACK arrives after a retransmit - was it sent in response to the initial transmit or the retransmit?
* Credit: Craig Latta

* Q: Name one way to solve the TCP retransmission ambiguity problem.
* A: Use the Eifel detection algorithm.
* A: Enable timestamps (which is what Eifel does).
* Credit: Craig Latta

* Q: When is an IGMP report timer cancelled?
* A: When the host receives an IGMP report for the same group (with a matching destination IP).
* A: When more than one host is a member of the same group on the same network.
* Credit: Craig Latta

* Q: How many bits are in an "A" type DNS resource record?
* A: 112, plus the owner name.
* Credit: Craig Latta

* Q: What is archived at www.kohala.com?
* A: Richard Stevens' website.
* Credit: Telsa Gwynne

* Q: What does the tcp_close_wait_interval configuration option really do in Solaris?
* A: Sets the duration of the TIME_WAIT state.
* Credit: Laurel Fan

* Q: What is the range of class B IP addresses?
* A: 128.0.0.0 through 191.255.255.255.
* Credit: Annie White

* Q: What is the significance in networking of the amateur radio callsign KA9Q?
* A: It's the callsign of Phil Karn, who worked on SLIP, congestion control and TCP over amateur radio.
* Credit: Alan Cox

* Q: Sally Floyd was heavily involved in the design of which TCP enhancement?
* A: ECN, see RFC 3168.
* Credit: Alan Cox

* Q: Who said: "The IETF already has more than enough RFCs that codify the obvious, make stupidity illegal, support truth, justice, and the IETF way, and generally demonstrate the author is a brilliant and valuable Contributor to The Standards Process"?
* A: Vernon Schryver, on the mailing list for the tcp-impl IETF working group.
* Credit: Alan Cox

* Q: What is the minimum MTU that allows any IP datagram to pass?
* A: 68 bytes.
* Credit: Alan Cox

* Q: What is a syncookie and who invented it?
* A: Syn cookies help avert synflood attacks by forcing all of the TCP state into the client, invented by Dan Bernstein.
* Credit: Alan Cox

* Q: Van Jacobson claimed that the TCP receive packet processing fast-path could be done in how many instructions?
* A: 30. (33 on Sparc, due to "compiler brain damage.")
* Credit: Alan Cox

* Q: What would happen if you implemented the TCP URG pointer according to the standard?
* Hint: RFC 1122 cleared up the confusion generated by RFC 793.
* A: You would lose the last byte of urgent data because the other host implements BSD-style urgent pointers, which point to the byte following the last byte of urgent data.
* Credit: Alan Cox

* Q: What is an LFN (spelled L-F-N, pronounced "elephant")?
* A: Long Fat Network, defined in RFC 1072.
* Credit: Alan Cox

* Q: Where was John Nagle working when he invented the "Nagle algorithm?"
* A: Ford Motor Company.
* Credit: Alan Cox

* Q: Who invented Tinygram Avoidance?
* A: John Nagle. (Tinygram Avoidance is also known as the "Nagle algorithm.")
* Credit: Alan Cox

* Q: What is the sub-group FHE within the IETF?
* A: Facial hairius extremis, spotted at IETF conferences and noted in RFC 2323, "IETF Identification and Security Guidelines."
* Credit: Alan Cox

* Q: Under what circumstances should you return error number 418: "I'm a teapot"?
* A: Any attempt to brew coffee with a teapot according to RFC 2324, "Hyper Text Coffee Pot Control Protocol."
* Credit: Alan Cox

* Q: Who found additional problems beyond those in RFC 1337, "TIME-WAIT Assassination Hazards in TCP" which have yet (as of Feb 2002) to be fixed?
* Hint: He died after publishing the initial draft.
* A: Ian Heavens
* Credit: Alan Cox

* Q: Private networks came from RFC 1597. Which later RFC claims this is a bad idea?
* A: RFC 1627, "Network 10 Considered Harmful."
* Credit: Alan Cox

* Q: What makes it very difficult for any network stack to claim "strict compliance" to RFC 1122?
* A: Its requirement of "SHOULD: able to leap tall buildings at a single bound."
* Credit: Telsa Gwynne

* Q: Who said "If you know what you are doing, three layers is enough; if you don't, even seventeen levels won't help?"
* A: Mike Padlipsky (or MAP).
* Credit: Larry McVoy

* Q: Which OSI networking model layers do TCP and IP correspond to?
* A: They don't. (Any answer with any kind of equivocation should be accepted.)
* Credit: Danny Quist

* Q: Who invented NAT (Network Address Translation)?
* A: Paul Francis (but he credits Van Jacobson for the concept).
* Credit: Dimitar (Mitko) Haralanov

* Q: How many hosts should be on a network with a 255.255.255.192 subnet mask?
* A: 62 (64 - (broadcast address and network address))
* Credit: Ryan Snyder
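
The same arithmetic can be checked with Python's standard `ipaddress` module. This is a small sketch; the helper name `usable_hosts` is made up, and it assumes the classic rule that the network and broadcast addresses are not assignable.

```python
import ipaddress

def usable_hosts(netmask: str) -> int:
    """Number of assignable host addresses for a dotted-quad netmask."""
    net = ipaddress.IPv4Network(f"0.0.0.0/{netmask}")
    # Subtract the network address and the broadcast address.
    return max(net.num_addresses - 2, 0)

print(usable_hosts("255.255.255.192"))
```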

* Q: How many bytes are in an IPv4 header without options?
* A: 20.
* Credit: Val Henson

* Q: Name one of the men described as "The Father of the Internet."
* A: Any of: Vinton Cerf (TCP/IP co-designer), Robert Kahn (TCP/IP co-designer), Jon Postel (started IANA), Al Gore (made encouraging noises)
* Credit: Val Henson

* Q: What does TCP/IP stand for?
* A: Transmission Control Protocol/Internet Protocol.
* Credit: Val Henson

* Q: How many layers are in the OSI networking model?
* A: 7.
* Credit: Val Henson

* Q: Name a network address designated for private network use.
* A: 10.0.0.0, 192.168.0.0, or 172.16.0.0
* Credit: Val Henson

* Q: Name two TCP header options.
* A: Any 2 of maximum segment size (MSS), window scale factor, timestamp, noop, SACK, and end of options list.
* Credit: Val Henson

* Q: What is the MSL as defined in RFC 793?
* A: 2 minutes (but is usually implemented as 30 seconds).
* Credit: Val Henson

* Q: Name two ways to exit the TIME_WAIT state.
* A: 2MSL timeout, TIME_WAIT assassination (receive a RST), or receive a SYN with greater sequence number. Note: TIME_WAIT assassination is not permitted by RFC 1337.
* Credit: Val Henson

* Q: What is the TCP state that can only be reached through a simultaneous close?
* A: CLOSING
* Credit: Val Henson

* Q: What year was the first IETF meeting held?
* A: 1986.
* Credit: Val Henson

* Q: Name all 7 layers of the OSI network model.
* A: Physical, Data Link, Network, Transport, Session, Presentation, and Application.
* Credit: Val Henson

* Q: Why are many network services assigned odd ports?
* A: The precursor to TCP and UDP was NCP, which was simplex and required 2 ports for one connection. When duplex protocols arrived, the even port numbers were abandoned.
* Credit: Val Henson

* Q: Which two of these three protocols are the most similar: IPv4, IPv6, or CLNP?
* A: IPv4 and CLNP. (CLNP stands for ConnectionLess Network Protocol, and is basically IPv4 with larger addresses.)
* Credit: Val Henson

* Q: In TCP, simultaneous open is a transition between which two TCP states?
* A: From SYN_SENT to SYN_RCVD.
* Credit: Val Henson

* Q: What is the ICMP type field for an Echo Request?
* A: 8.
* Credit: Val Henson

* Q: What is the ICMP type field for an Echo Reply?
* A: 0.
* Credit: Val Henson

* Q: What is the maximum number of IP addresses recordable by the IP Record Route option?
* A: 9.
* Credit: Val Henson
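
The limit of 9 follows from the size of the IPv4 header. A worked version of the arithmetic, with constant names chosen for illustration:

```python
MAX_IHL_WORDS = 15   # the Header Length field is 4 bits wide
BASE_HEADER = 20     # bytes in an IPv4 header without options
RR_OVERHEAD = 3      # option type, option length and pointer bytes
ADDR_LEN = 4         # bytes per recorded IPv4 address

def record_route_capacity() -> int:
    max_header = MAX_IHL_WORDS * 4             # 60 bytes at most
    option_space = max_header - BASE_HEADER    # 40 bytes left for options
    return (option_space - RR_OVERHEAD) // ADDR_LEN

print(record_route_capacity())
```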

* Q: If one end of a TCP connection crashes, and the other end doesn't attempt to send any data, is the resultant TCP connection half-open or half-closed from the point of view of the host that's still up?
* A: Half-open.
* Credit: Val Henson

* Q: In a Christmas tree packet, which TCP flag bits are turned on?
* A: SYN, URG, PSH, and FIN (all of them).
* Credit: Val Henson

* Q: TCP was defined in which RFC?
* A: RFC 793, "Transmission Control Protocol."
* Credit: Val Henson

* Q: What is silly window syndrome?
* A: In TCP, when the receiving end continually advertises a tiny window, resulting in data being sent in very small packets.
* Credit: Val Henson

* Q: If the remote host of a TCP connection does not advertise an MSS, what is the assumed MSS?
* A: 536 bytes over IPv4, 1220 over IPv6, although most implementations default to 512 and 1024 respectively.
* Credit: Val Henson

* Q: How is the initial path MTU of a TCP connection calculated?
* A: min (outgoing interface MTU, remote advertised MSS)
* Credit: Val Henson

* Q: Why was the "AAAA" DNS resource record type created?
* A: For IPv6 addresses.
* Credit: Val Henson

* Q: What is the maximum amount of data allowed in an IPv4 packet?
* A: 65515 bytes (65535 - 20, max total length minus 20 bytes of header)
* Credit: Val Henson

* Q: What was the default send and receive buffer size in 4.3BSD?
* A: 2048 bytes.
* Credit: Val Henson

* Q: In TCP, when is the sender limited by the congestion window?
* A: When using the slow start algorithm, after packet loss has occurred.
* Credit: Val Henson

* Q: In most implementations of TCP, what byte does the urgent pointer point to?
* A: The byte following the last byte of urgent data.
* Credit: Val Henson

* Q: What service runs on port 6667?
* A: Internet Relay Chat (IRC)
* Credit: Val Henson

* Q: Name three methods or algorithms related to congestion control in TCP.
* A: Congestion window, Vegas, Reno, NewReno, SACK, DSACK, FACK, Eifel algorithm, ECN, RTO
* Credit: Val Henson

* Q: The ECN field uses which bits in the byte that contains the IPv4 TOS field?
* A: Bits 6 and 7.
* Credit: Val Henson
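
Bits 6 and 7 use the RFC convention in which bit 0 is the most significant bit, so they are the two low-order bits of the byte. A minimal sketch of splitting the byte, assuming the codepoint names from RFC 3168; the function name `split_tos` is made up.

```python
from typing import Tuple

# ECN codepoints as named in RFC 3168.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}

def split_tos(tos: int) -> Tuple[int, int]:
    """Split the former TOS byte into (DSCP, ECN)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS byte must fit in 8 bits")
    dscp = tos >> 2       # bits 0-5: Differentiated Services codepoint
    ecn = tos & 0b11      # bits 6-7: ECN field
    return dscp, ecn

dscp, ecn = split_tos(0xBB)
print(dscp, ECN_NAMES[ecn])
```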

* Q: What's wrong with the standard way of estimating RTO (Retransmission TimeOut)?
* A: It places too much weight on the variance of round trip times.
* Credit: Val Henson

* Q: What is the minimum data payload in a ping of death?
* A: 65508 bytes (65536 - 20 IP header - 8 ICMP header)
* Credit: Val Henson

* Q: What's the MTU of HiPPI (High Performance Parallel Interface)?
* A: 65280 bytes
* Credit: Val Henson

* Q: How large is an entire AAL/5 encapsulated ATM cell?
* A: 53 bytes (48 data + 5 header)
* Credit: Val Henson

* Q: Distance vector and link state are two types of what kind of protocol?
* A: Routing protocols
* Credit: Val Henson

* Q: The IPv4 fields formerly known as TOS (Type Of Service) and precedence are now called what?
* A: The DS (Differentiated Services) and ECN (Explicit Congestion Notification) fields.
* Credit: Val Henson

* Q: Why do most traceroute implementations NOT use the IP Record Route option to find intermediate hosts?
* A: The IP Record Route option can only record 9 intermediate hosts, which is too few for many routes in the modern Internet.
* Credit: Val Henson

* Q: Which spelling is correct, Van Jacobson or Van Jacobsen?
* A: Van Jacobson.
* Credit: Val Henson

* Q: Who invented the spanning tree algorithm?
* A: Dr. Radia Perlman
* Credit: Val Henson

Monday, May 5, 2008

Snort: A Few Q&A by Richard Bejtlich

In this edition of the Snort Report, I address some of the questions frequently asked by service providers who are users or potential users of Snort. I say "potential users" because some people hear about Snort and wonder if it can solve a particular problem. Here I hope to provide realistic expectations for service providers using Snort.

1. Can I use Snort to protect a network from denial-of-service attacks?

Before answering many of these questions it's important to define terms and reveal assumptions. A denial-of-service (DoS) attack consumes one or more computing resources (bandwidth, memory, CPU cycles, hard drive space or other information system components). Sometimes DoS attacks are initiated by a single party, while others are so-called distributed DoS or DDoS attacks.

DDoS attacks enlist more than one aggressor to assault a victim. The first popular DoS attacks were clever resource consumption attacks against memory (e.g., the SYN floods of the mid-1990s), but since the late 1990s DDoS attacks that consume bandwidth have been prevalent. Less popular, but still damaging, are application-centric DoS attacks, whereby regular activity (like retrieving a Web page) is repeated to the point that the victim's operation is impaired.

What can Snort do about DDoS attacks? Snort's Vulnerability Research Team publishes a set of rules named ddos.rules. This file contains a small set of signatures for detecting activity caused by older DoS tools like Tribe Flood Network, Shaft, Trinoo and Stacheldraht. Emerging Threats publishes bleeding-dos.rules, which contains a greater variety of rules. However, the question remains: What good are rules like these?

When users or potential users ask if Snort protects against DoS attacks, they usually want to know if Snort can deflect or mitigate bandwidth consumption attacks. The answer to this question is probably no. When deployed as an offline, passive device, there is little or nothing Snort can do to stop or reduce a bandwidth-consuming SYN flood, for example. Snort can potentially report seeing many SYN segments, but it won't improve the situation. The rules packaged in ddos.rules and bleeding-dos.rules are designed to either detect DoS agent command-and-control or possibly identify certain types of attacks that subvert but do not breach a target.

When deployed as an inline, active device, Snort acts as a so-called intrusion prevention system and can, in some cases, stop DoS attacks. For example, an intruder may use a malicious packet to cause a vulnerable Cisco router to reboot or freeze. An inline Snort deployment could identify and filter the malicious packet, thereby "protecting" the router. If the intruder switched to a SYN flood or other bandwidth consumption attack against the router, however, Snort would most likely not be able to counter the attack -- at least not on its own.

2. Can Snort decode encrypted traffic?

Let's assume that encrypted traffic means Secure Sockets Layer (SSL) or Transport Layer Security (TLS) as used by HTTPS, or Secure Shell protocol 2 as used by OpenSSH.

The short answer is no, Snort cannot decode encrypted traffic. An intruder who attacks a Web server in the clear on port 80 TCP might be detected by Snort. The same intruder who attacks the same Web server in an encrypted channel on port 443 TCP will not be detected by Snort. An intruder who displays the contents of a password file via a Telnet session on port 23 TCP might be detected by Snort. The same intruder who displays the same password file via an SSH session on port 22 TCP will not be detected by Snort.

Now, in some circumstances it's possible to decode HTTPS sessions. This is not done natively by vanilla Snort -- it must be handled by an external program. See my blog post on Wireshark Display Filters and SSL, especially the comments, for more details.

Generally speaking, a stand-alone Snort instance can inspect traffic in an encrypted channel if the traffic is subjected to a man-in-the-middle (MITM) attack. In other words, traffic is encrypted while traveling from the client to the MITM. Once the traffic reaches the MITM, it is unencrypted while Snort inspects it. Then, traffic is re-encrypted before traveling from the MITM to the server. (The reverse happens as well.) Such a setup must be intentionally designed and implemented by the network and security architects and accepted by management and users.

Also note that Snort cannot inspect Web pages that are Gzip-encoded. This bandwidth-saving technique is perfectly legitimate, but it shields Web page contents from Snort's gaze. Uncompressing Gzip-encoded content on the fly would be prohibitively expensive, although not impossible.

3. Can Snort detect layer 2 attacks?

Generally speaking, Snort is a layer 3 and above detection system. This means Snort inspects and acts upon IP packet details, like source and destination IP addresses, time to live (TTL), IP ID and so on. This excludes MAC addresses, Ethertype, VLAN IDs and other details found before the start of the layer 3 header.

Snort does contain an "arpspoof" preprocessor, but the code has always been marked "experimental." I don't know of anyone who uses it in production. Most users who want to detect layer 2 network events use layer 2-specific tools like Arpwatch.

4. Can Snort log flows or sessions?

This question, like the others, indicates the hope that Snort can accomplish a goal best left to specialized tools. Let's assume the question indicates a desire to log details of TCP sessions. Snort's Stream4 preprocessor does include a "keepstats" option that records session statistics for TCP flows. An earlier version of Sguil relied on this data. Unfortunately, this capability is limited to TCP traffic. All other protocols are ignored.

Note that Stream4 is being deprecated in favor of Stream5. Stream5 does not offer a "keepstats" function, although Stream5 does track UDP "sessions" for Snort's own detection purposes.

To log flows or sessions, use a stand-alone tool like Argus. If you're already using Sguil, take a look at the Security Analyst Network Connection Profiler (SANCP), which logs session details for many protocols. A third option is to collect NetFlow or another flow format from a hardware probe, or less often, a software probe.

5. Can Snort rebuild content from traffic?

In order to perform its detection functions, Snort rebuilds several types of content. For example, it's impossible to match the password "hackerpassword" sent over Telnet without letting Snort rebuild the traffic. However, Snort is not designed to watch traffic and rebuild everything it sees. A review of the README.Stream5 document shipped with Snort 2.8.0 shows that the new preprocessor offers a "show_rebuilt_packets" option that will "Print/display packet after rebuilt (for debugging)." This option is off by default, but even if enabled it's not the sort of capability I recommend activating in Snort.

People who wish to rebuild content typically want to parse Libpcap trace files to rebuild TCP sessions. One of the best tools for this job is Tcpflow. Tcpflow can be run against a dead trace or a live interface. If given no parameters, Tcpflow will rebuild all TCP sessions it sees, putting the content from client to server in one file and the content from server to client in another file. Tcpflow repeats this process for every single TCP session it finds.

If you run this sort of operation on a large Libpcap trace, you might learn what it means to run out of inodes on a Unix machine. If you do the same against a live interface, you'll probably start dropping many packets. Tcpflow is best pointed against a trace after being told exactly what to rebuild. For example, "Rebuild this FTP session involving this source IP and this source port."

Do you have other questions you would like answered? Email them to me at taosecurity at gmail.com.

About the author
Richard Bejtlich is founder of TaoSecurity, author of several books on network security monitoring, including Extrusion Detection: Security Monitoring for Internal Intrusions, and operator of the TaoSecurity blog.

TCP SYNCOOKIES - SYN FLOOD

Mail service for Panix, an ISP in New York, was shut down by a SYN flood starting on 6 September 1996. A week later the story was covered by the RISKS Digest, the Wall Street Journal, the Washington Post, and many other newspapers.

SYN flooding had been considered by security experts before. It was generally considered insoluble. See, for example, ``Practical UNIX and Internet Security,'' by Garfinkel and Spafford, page 778:

The recipient will be left with multiple half-open connections that are occupying limited resources. Usually, these connection requests have forged source addresses that specify nonexistent or unreachable hosts that cannot be contacted. Thus, there is also no way to trace the connections back. ... There is little you can do in these situations. ... any finite limit can be exceeded.

Large SYN queues and random early drops make SYN flooding more expensive but don't actually solve the problem.
SYN cookies are now a standard part of Linux and FreeBSD. To enable them, add

echo 1 > /proc/sys/net/ipv4/tcp_syncookies

to your boot scripts.

What are SYN cookies?
SYN cookies are particular choices of initial TCP sequence numbers by TCP servers. The difference between the server's initial sequence number and the client's initial sequence number is

* top 5 bits: t mod 32, where t is a 32-bit time counter that increases every 64 seconds;
* next 3 bits: an encoding of an MSS selected by the server in response to the client's MSS;
* bottom 24 bits: a server-selected secret function of the client IP address and port number, the server IP address and port number, and t.

This choice of sequence number complies with the basic TCP requirement that sequence numbers increase slowly; the server's initial sequence number increases slightly faster than the client's initial sequence number.

A server that uses SYN cookies doesn't have to drop connections when its SYN queue fills up. Instead it sends back a SYN+ACK, exactly as if the SYN queue had been larger. (Exceptions: the server must reject TCP options such as large windows, and it must use one of the eight MSS values that it can encode.) When the server receives an ACK, it checks that the secret function works for a recent value of t, and then rebuilds the SYN queue entry from the encoded MSS.
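
The layout and the check described above can be sketched as follows. Several things here are assumptions made for illustration: the secret function is HMAC-SHA256 truncated to 24 bits (the text only requires a server-selected secret function), the eight MSS values in `MSS_TABLE` are invented, and `max_age` is one reading of "a recent value of t". The sketch also works on the two initial sequence numbers directly and leaves out the +1 offsets of the real handshake.

```python
import hashlib
import hmac

# Hypothetical table: the text only says eight MSS values can be encoded.
MSS_TABLE = (216, 536, 1024, 1220, 1440, 1460, 4312, 8960)

def secret_mac24(secret: bytes, client, server, t: int) -> int:
    """24-bit keyed function of both endpoints and t (HMAC as a stand-in)."""
    msg = f"{client[0]}:{client[1]}|{server[0]}:{server[1]}|{t}".encode()
    return int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest()[:3], "big")

def make_cookie(secret, client_isn, client, server, mss_index, now):
    """Return the server's initial sequence number for this SYN."""
    if not 0 <= mss_index < len(MSS_TABLE):
        raise ValueError("mss_index must fit in 3 bits")
    t = int(now) // 64                                  # ticks every 64 seconds
    diff = ((t % 32) << 27) | (mss_index << 24) | secret_mac24(secret, client, server, t)
    return (client_isn + diff) & 0xFFFFFFFF

def check_cookie(secret, server_isn, client_isn, client, server, now, max_age=2):
    """Return the encoded MSS if the cookie is valid and recent, else None."""
    diff = (server_isn - client_isn) & 0xFFFFFFFF
    t_bits, mss_index, mac = diff >> 27, (diff >> 24) & 0x7, diff & 0xFFFFFF
    t_now = int(now) // 64
    for age in range(max_age):                          # try recent values of t
        t = t_now - age
        if t % 32 == t_bits and mac == secret_mac24(secret, client, server, t):
            return MSS_TABLE[mss_index]
    return None
```

The server stores nothing between the SYN and the ACK: everything needed to rebuild the queue entry is recovered from the sequence numbers in the ACK.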

A SYN flood is simply a series of SYN packets from forged IP addresses. The IP addresses are chosen randomly and don't provide any hint of where the attacker is. The SYN flood keeps the server's SYN queue full. Normally this would force the server to drop connections. A server that uses SYN cookies, however, will continue operating normally. The biggest effect of the SYN flood is to disable large windows.

Blind connection forgery
If an attacker guesses a valid sequence number sent to someone else's host then he can forge a connection from that host.

Attackers can try to cryptanalyze the server-selected secret function: inspect a series of valid cookies and then intelligently guess a new cookie. For a secure function, the attacker's chance of success is not noticeably better than the chance of success for a uniform random guess. Secret-key message authenticators are designed to provide exactly this type of security. The following function is extremely fast and appears to be secure: encode the input in 16 bytes; feed the result through Rijndael with a secret key; extract the first 24 bits of the result.

No matter what function is used, the attacker will succeed in a connection forgery after millions of random ACK packets. Servers can make this attack more expensive in two ways:

* Keep track of the most recent SYN queue overflow time (for each SYN queue, not in a global variable). Don't rebuild missing SYN entries if there hasn't been a recent overflow. This stops ACK forgeries from passing through SYN-blocking firewalls.
* Add another number to the cookie: a 32-bit server-selected secret function of the client address and server address (but not the current time). This forces the attacker to guess 32 bits instead of 24.

A new protocol with 128-bit sequence numbers would make blind connection forgeries practically impossible.

Wednesday, April 30, 2008

Snort - Performance Monitor Preprocessor

This preprocessor measures Snort’s real-time and theoretical maximum performance. Whenever this preprocessor is
turned on, it should have an output mode enabled, either “console” which prints statistics to the console window or
“file” with a file name, where statistics get printed to the specified file name. By default, Snort’s real-time statistics
are processed. This includes:
• Time Stamp
• Drop Rate
• Mbits/Sec (wire) [duplicated below for easy comparison with other rates]
• Alerts/Sec
• K-Pkts/Sec (wire) [duplicated below for easy comparison with other rates]
• Avg Bytes/Pkt (wire) [duplicated below for easy comparison with other rates]
• Pat-Matched [percent of data received that Snort processes in pattern matching]
• Syns/Sec
• SynAcks/Sec
• New Sessions Cached/Sec
• Sessions Del fr Cache/Sec
• Current Cached Sessions
• Max Cached Sessions
• Stream Flushes/Sec
• Stream Session Cache Faults
• Stream Session Cache Timeouts
• New Frag Trackers/Sec
• Frag-Completes/Sec
• Frag-Inserts/Sec
• Frag-Deletes/Sec
• Frag-Auto Deletes/Sec [memory DoS protection]
• Frag-Flushes/Sec
• Frag-Current [number of current Frag Trackers]
• Frag-Max [max number of Frag Trackers at any time]
• Frag-Timeouts
• Frag-Faults
• Number of CPUs [*** Only if compiled with LINUX SMP ***, the next three appear for each CPU]
• CPU usage (user)
• CPU usage (sys)
• CPU usage (Idle)
• Mbits/Sec (wire) [average mbits of total traffic]
• Mbits/Sec (ipfrag) [average mbits of IP fragmented traffic]
• Mbits/Sec (ipreass) [average mbits Snort injects after IP reassembly]
• Mbits/Sec (tcprebuilt) [average mbits Snort injects after stream4 reassembly]
• Mbits/Sec (applayer) [average mbits seen by rules and protocol decoders]
• Avg Bytes/Pkt (wire)
• Avg Bytes/Pkt (ipfrag)
• Avg Bytes/Pkt (ipreass)
• Avg Bytes/Pkt (tcprebuilt)
• Avg Bytes/Pkt (applayer)
• K-Pkts/Sec (wire)
• K-Pkts/Sec (ipfrag)
• K-Pkts/Sec (ipreass)
• K-Pkts/Sec (tcprebuilt)
• K-Pkts/Sec (applayer)
• Total Packets Received
• Total Packets Dropped (not processed)
• Total Packets Blocked (inline)
The following options can be used with the performance monitor:
• flow - Prints out statistics about the type of traffic and protocol distributions that Snort is seeing. This option
can produce large amounts of output.
• events - Turns on event reporting. This prints out statistics as to the number of signatures that were matched
by the setwise pattern matcher (non-qualified events) and the number of those matches that were verified with
the signature flags (qualified events). This shows the user if there is a problem with the rule set that they are
running.
• max - Turns on the theoretical maximum performance that Snort calculates given the processor speed and current
performance. This is only valid for uniprocessor machines, since many operating systems don’t keep accurate
kernel statistics for multiple CPUs.
• console - Prints statistics at the console. This is enabled by default.
• file - Prints statistics in a comma-delimited format to the file that is specified. Not all statistics are output to
this file. You may also use snortfile which will output into your defined Snort log directory. Both of these
directives can be overridden on the command line with the -Z or --perfmon-file options.
• pktcnt - Adjusts the number of packets to process before checking for the time sample. This boosts performance,
since checking the time sample reduces Snort’s performance. By default, this is 10000.

• time - Represents the number of seconds between intervals.
• accumulate or reset - Defines which type of drop statistics are kept by the operating system. By default,
accumulate is used.
• atexitonly - Dump stats for entire life of Snort.

Examples

preprocessor perfmonitor: time 30 events flow file stats.profile max \
console pktcnt 10000

preprocessor perfmonitor: time 300 file /var/tmp/snortstat pktcnt 10000

The log file output is in the form of numbers separated by commas.
Graphs can be generated by the perfmon-graph perl script, which is located at
http://www.mtsac.edu/~jgau/Download/src/. It requires rrdtool to be installed on the system; rrdtool can be downloaded from the same link.

Tuesday, April 22, 2008

PCAP - API for Packet Capturing

Getting Started: The format of a pcap application

The first thing to understand is the general layout of a pcap sniffer. The flow of code is as follows:

1. We begin by determining which interface we want to sniff on. In Linux this may be something like eth0, in BSD it may be xl1, etc. We can either define this device in a string, or we can ask pcap to provide us with the name of an interface that will do the job.
2. Initialize pcap. This is where we actually tell pcap what device we are sniffing on. We can, if we want to, sniff on multiple devices. How do we differentiate between them? Using file handles. Just like opening a file for reading or writing, we must name our sniffing "session" so we can tell it apart from other such sessions.
3. In the event that we only want to sniff specific traffic (e.g.: only TCP/IP packets, only packets going to port 23, etc) we must create a rule set, "compile" it, and apply it. This is a three phase process, all of which is closely related. The rule set is kept in a string, and is converted into a format that pcap can read (hence compiling it.) The compilation is actually just done by calling a function within our program; it does not involve the use of an external application. Then we tell pcap to apply it to whichever session we wish for it to filter.
4. Finally, we tell pcap to enter its primary execution loop. In this state, pcap waits until it has received however many packets we want it to. Every time it gets a new packet in, it calls another function that we have already defined. The function that it calls can do anything we want; it can dissect the packet and print it to the user, it can save it in a file, or it can do nothing at all.
5. After our sniffing needs are satisfied, we close our session and are complete.

This is actually a very simple process. Five steps total, one of which is optional (step 3, in case you were wondering.) Let's take a look at each of the steps and how to implement them.

Setting the device

This is terribly simple. There are two techniques for setting the device that we wish to sniff on.

The first is that we can simply have the user tell us. Consider the following program:

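#include <stdio.h>
#include <pcap.h>

int main(int argc, char *argv[])
{
    char *dev;

    /* The device name must be given as the first argument */
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <device>\n", argv[0]);
        return(2);
    }
    dev = argv[1];

    printf("Device: %s\n", dev);
    return(0);
}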

The user specifies the device by passing the name of it as the first argument to the program. Now the string "dev" holds the name of the interface that we will sniff on in a format that pcap can understand (assuming, of course, the user gave us a real interface).

The other technique is equally simple. Look at this program:

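#include <stdio.h>
#include <pcap.h>

int main(int argc, char *argv[])
{
    char *dev, errbuf[PCAP_ERRBUF_SIZE];

    /* Ask pcap for a default device (newer libpcap releases deprecate
       this call in favor of pcap_findalldevs()) */
    dev = pcap_lookupdev(errbuf);
    if (dev == NULL) {
        fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
        return(2);
    }
    printf("Device: %s\n", dev);
    return(0);
}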

In this case, pcap just sets the device on its own. "But wait, Tim," you say. "What is the deal with the errbuf string?" Most of the pcap commands allow us to pass them a string as an argument. The purpose of this string? In the event that the command fails, it will populate the string with a description of the error. In this case, if pcap_lookupdev() fails, it will store an error message in errbuf. Nifty, isn't it? And that's how we set our device.

Opening the device for sniffing

The task of creating a sniffing session is really quite simple. For this, we use pcap_open_live(). The prototype of this function (from the pcap man page) is as follows:

pcap_t *pcap_open_live(char *device, int snaplen, int promisc, int to_ms,
char *ebuf)

The first argument is the device that we specified in the previous section. snaplen is an integer which defines the maximum number of bytes to be captured by pcap. promisc, when set to true, brings the interface into promiscuous mode (however, even if it is set to false, it is possible under specific cases for the interface to be in promiscuous mode, anyway). to_ms is the read time out in milliseconds (a value of 0 means no time out; on at least some platforms, this means that you may wait until a sufficient number of packets arrive before seeing any packets, so you should use a non-zero timeout). Lastly, ebuf is a string we can store any error messages within (as we did above with errbuf). The function returns our session handler.

To demonstrate, consider this code snippet:

#include <pcap.h>
...
pcap_t *handle;

handle = pcap_open_live(somedev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
    fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
    return(2);
}

This code fragment opens the device stored in the string "somedev" and tells it to read however many bytes are specified in BUFSIZ (a constant from stdio.h, which pcap.h includes). We are telling it to put the device into promiscuous mode and to use a read timeout of 1000 milliseconds. If there is an error, pcap_open_live() stores a description of it in the string errbuf, and the fragment uses that string to print an error message.

A note about promiscuous vs. non-promiscuous sniffing: The two techniques are very different in style. In standard, non-promiscuous sniffing, a host is sniffing only traffic that is directly related to it. Only traffic to, from, or routed through the host will be picked up by the sniffer. Promiscuous mode, on the other hand, sniffs all traffic on the wire. In a non-switched environment, this could be all network traffic. The obvious advantage to this is that it provides more packets for sniffing, which may or may not be helpful depending on the reason you are sniffing the network. However, there are drawbacks. First, promiscuous mode sniffing is detectable; a host can test with strong reliability to determine if another host is doing promiscuous sniffing. Second, it only works in a non-switched environment (such as a hub, or a switch that is being ARP flooded). Third, on high traffic networks, the host can become quite taxed for system resources.

Filtering traffic

Often our sniffer is only interested in specific traffic. For instance, there may be times when all we want is to sniff on port 23 (telnet) in search of passwords. Or perhaps we want to hijack a file being sent over port 21 (FTP). Maybe we only want DNS traffic (port 53 UDP). Whatever the case, rarely do we just want to blindly sniff all network traffic. Enter pcap_compile() and pcap_setfilter().

The process is quite simple. After we have already called pcap_open_live() and have a working sniffing session, we can apply our filter. Why not just use our own if/else if statements? Two reasons. First, pcap's filter is far more efficient, because it does it directly with the BPF filter; we eliminate numerous steps by having the BPF driver do it directly. Second, this is a lot easier :)

Before applying our filter, we must "compile" it. The filter expression is kept in a regular string (char array). The syntax is documented quite well in the man page for tcpdump; I leave you to read it on your own. However, we will use simple test expressions, so perhaps you are sharp enough to figure it out from my examples.

To compile the program we call pcap_compile(). The prototype defines it as:

int pcap_compile(pcap_t *p, struct bpf_program *fp, char *str, int optimize,
bpf_u_int32 netmask)

The first argument is our session handle (pcap_t *handle in our previous example). Following that is a reference to the place we will store the compiled version of our filter. Then comes the expression itself, in regular string format. Next is an integer that decides if the expression should be "optimized" or not (0 is false, 1 is true. Standard stuff.) Finally, we must specify the net mask of the network the filter applies to. The function returns -1 on failure; all other values imply success.

After the expression has been compiled, it is time to apply it. Enter pcap_setfilter(). Following our format of explaining pcap, we shall look at the pcap_setfilter() prototype:

int pcap_setfilter(pcap_t *p, struct bpf_program *fp)

This is very straightforward. The first argument is our session handler, the second is a reference to the compiled version of the expression (presumably the same variable as the second argument to pcap_compile()).

Perhaps another code sample would help to better understand:

#include <pcap.h>
...
pcap_t *handle;                 /* Session handle */
char dev[] = "rl0";             /* Device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE];  /* Error string */
struct bpf_program fp;          /* The compiled filter expression */
char filter_exp[] = "port 23";  /* The filter expression */
bpf_u_int32 mask;               /* The netmask of our sniffing device */
bpf_u_int32 net;                /* The IP of our sniffing device */

if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
    fprintf(stderr, "Can't get netmask for device %s\n", dev);
    net = 0;
    mask = 0;
}
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
    fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
    return(2);
}
/* The last argument is the netmask of the network being sniffed */
if (pcap_compile(handle, &fp, filter_exp, 0, mask) == -1) {
    fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
    return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
    fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
    return(2);
}

This program preps the sniffer to sniff all traffic coming from or going to port 23, in promiscuous mode, on the device rl0.

You may notice that the previous example contains a function that we have not yet discussed. pcap_lookupnet() is a function that, given the name of a device, returns its IP and net mask. This was essential because we needed to know the net mask in order to apply the filter. This function is described in the Miscellaneous section at the end of the document.

It has been my experience that this filter does not work across all operating systems. In my test environment, I found that OpenBSD 2.9 with a default kernel does support this type of filter, but FreeBSD 4.3 with a default kernel does not. Your mileage may vary.

The actual sniffing

At this point we have learned how to define a device, prepare it for sniffing, and apply filters about what we should and should not sniff for. Now it is time to actually capture some packets.

There are two main techniques for capturing packets. We can either capture a single packet at a time, or we can enter a loop that waits for n number of packets to be sniffed before being done. We will begin by looking at how to capture a single packet, for which we use pcap_next(), and then look at methods of using loops.

The prototype for pcap_next() is fairly simple:

u_char *pcap_next(pcap_t *p, struct pcap_pkthdr *h)

The first argument is our session handler. The second argument is a pointer to a structure that holds general information about the packet, specifically the time at which it was sniffed, the length of this packet, and the length of this specific portion (in case it is fragmented, for example). pcap_next() returns a u_char pointer to the packet that is described by this structure. We'll discuss the technique for actually reading the packet itself later.

Here is a simple demonstration of using pcap_next() to sniff a packet.

#include <pcap.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    pcap_t *handle;                 /* Session handle */
    char *dev;                      /* The device to sniff on */
    char errbuf[PCAP_ERRBUF_SIZE];  /* Error string */
    struct bpf_program fp;          /* The compiled filter */
    char filter_exp[] = "port 23";  /* The filter expression */
    bpf_u_int32 mask;               /* Our netmask */
    bpf_u_int32 net;                /* Our IP */
    struct pcap_pkthdr header;      /* The header that pcap gives us */
    const u_char *packet;           /* The actual packet */

    /* Define the device */
    dev = pcap_lookupdev(errbuf);
    if (dev == NULL) {
        fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
        return(2);
    }
    /* Find the properties for the device */
    if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
        fprintf(stderr, "Couldn't get netmask for device %s: %s\n", dev, errbuf);
        net = 0;
        mask = 0;
    }
    /* Open the session in promiscuous mode */
    handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
        return(2);
    }
    /* Compile and apply the filter (the last argument is the netmask) */
    if (pcap_compile(handle, &fp, filter_exp, 0, mask) == -1) {
        fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
        return(2);
    }
    if (pcap_setfilter(handle, &fp) == -1) {
        fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
        return(2);
    }
    /* Grab a packet; pcap_next() returns NULL if none was read */
    packet = pcap_next(handle, &header);
    if (packet == NULL) {
        fprintf(stderr, "No packet was captured\n");
        pcap_close(handle);
        return(2);
    }
    /* Print its length */
    printf("Jacked a packet with length of [%u]\n", header.len);
    /* And close the session */
    pcap_close(handle);
    return(0);
}

This application sniffs on whatever device is returned by pcap_lookupdev() by putting it into promiscuous mode. It finds the first packet to come across port 23 (telnet) and tells the user the size of the packet (in bytes). Again, this program includes a new call, pcap_close(), which we will discuss later (although it really is quite self explanatory).

The other technique we can use is more complicated, and probably more useful. Few sniffers (if any) actually use pcap_next(). More often than not, they use pcap_loop() or pcap_dispatch() (which then themselves use pcap_loop()). To understand the use of these two functions, you must understand the idea of a callback function.

Callback functions are not anything new, and are very common in many APIs. The concept behind a callback function is fairly simple. Suppose I have a program that is waiting for an event of some sort. For the purpose of this example, let's pretend that my program wants a user to press a key on the keyboard. Every time they press a key, I want to call a function which then will determine what to do. The function I am utilizing is a callback function. Every time the user presses a key, my program will call the callback function. Callbacks are used in pcap, but instead of being called when a user presses a key, they are called when pcap sniffs a packet. The two functions that one can use to define a callback are pcap_loop() and pcap_dispatch(). pcap_loop() and pcap_dispatch() are very similar in their usage of callbacks. Both of them call a callback function every time a packet is sniffed that meets our filter requirements (if any filter exists, of course. If not, then all packets that are sniffed are sent to the callback.)

The prototype for pcap_loop() is below:

int pcap_loop(pcap_t *p, int cnt, pcap_handler callback, u_char *user)

The first argument is our session handle. Following that is an integer that tells pcap_loop() how many packets it should sniff for before returning (a negative value means it should sniff until an error occurs). The third argument is the name of the callback function (just its identifier, no parentheses). The last argument is useful in some applications, but many times is simply set as NULL. Suppose we have arguments of our own that we wish to send to our callback function, in addition to the arguments that pcap_loop() sends. This is where we do it. Obviously, you must typecast to a u_char pointer to ensure the results make it there correctly; as we will see later, pcap makes use of some very interesting means of passing information in the form of a u_char pointer. After we show an example of how pcap does it, it should be obvious how to do it here. If not, consult your local C reference text, as an explanation of pointers is beyond the scope of this document. pcap_dispatch() is almost identical in usage. The only difference between pcap_dispatch() and pcap_loop() is that pcap_dispatch() will only process the first batch of packets that it receives from the system, while pcap_loop() will continue processing packets or batches of packets until the count of packets runs out. For a more in depth discussion of their differences, see the pcap man page.
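Here is a small sketch of that last argument in action. To keep it self-contained it does not use libpcap at all: fake_loop() and the demo_handler type are hypothetical stand-ins that only mimic the shape of pcap_loop() and its callback (without the pcap header argument), and plain unsigned char is written out where pcap uses u_char. The point is just the round trip: cast a pointer to your own structure on the way in, cast it back inside the callback.

```c
#include <stddef.h>

/* Our own state that we want the callback to be able to update. */
struct capture_stats {
    unsigned long packets;
    unsigned long bytes;
};

/* Hypothetical callback type, shaped like a pcap callback. */
typedef void (*demo_handler)(unsigned char *user,
                             const unsigned char *packet, size_t len);

/* The callback: recover our structure from the opaque user pointer. */
void count_packet(unsigned char *user, const unsigned char *packet, size_t len)
{
    struct capture_stats *stats = (struct capture_stats *)user;

    (void)packet;               /* contents are not inspected here */
    stats->packets++;
    stats->bytes += len;
}

/* Stand-in for pcap_loop(): hands cnt fake 60-byte packets to the callback. */
int fake_loop(int cnt, demo_handler callback, unsigned char *user)
{
    static const unsigned char frame[60] = {0};

    for (int i = 0; i < cnt; i++)
        callback(user, frame, sizeof frame);
    return 0;
}
```

A caller would declare a struct capture_stats, pass (unsigned char *)&stats as the last argument of fake_loop(), and read the totals afterwards. With the real pcap_loop() the cast looks the same.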

Before we can provide an example of using pcap_loop(), we must examine the format of our callback function. We cannot arbitrarily define our callback's prototype; otherwise, pcap_loop() would not know how to use the function. So we use this format as the prototype for our callback function:

void got_packet(u_char *args, const struct pcap_pkthdr *header,
const u_char *packet);

Let's examine this in more detail. First, you'll notice that the function has a void return type. This is logical, because pcap_loop() wouldn't know how to handle a return value anyway. The first argument corresponds to the last argument of pcap_loop(). Whatever value is passed as the last argument to pcap_loop() is passed to the first argument of our callback function every time the function is called. The second argument is the pcap header, which contains information about when the packet was sniffed, how large it is, etc. The pcap_pkthdr structure is defined in pcap.h as:

struct pcap_pkthdr {
    struct timeval ts;    /* time stamp */
    bpf_u_int32 caplen;   /* length of portion present */
    bpf_u_int32 len;      /* length this packet (off wire) */
};

These values should be fairly self explanatory. The last argument is the most interesting of them all, and the most confusing to the average novice pcap programmer. It is another pointer to a u_char, and it points to the first byte of a chunk of data containing the entire packet, as sniffed by pcap_loop().
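One caveat worth spelling out about the two length fields: only caplen bytes are actually present behind the packet pointer, while len is the size the packet had on the wire. If the snapshot length is smaller than the packet, the two differ. The sketch below mirrors just those two fields in a hypothetical struct demo_lengths (so it builds without pcap.h) and shows the kind of check a callback might make before reading.

```c
#include <stdint.h>

/* Local mirror of the two length fields of pcap_pkthdr (hypothetical name). */
struct demo_lengths {
    uint32_t caplen;   /* bytes actually captured */
    uint32_t len;      /* bytes the packet had on the wire */
};

/* How many bytes were on the wire but cut off by the snapshot length. */
uint32_t bytes_truncated(const struct demo_lengths *h)
{
    return h->len > h->caplen ? h->len - h->caplen : 0;
}

/* Is it safe to read `need` bytes starting at `offset` in the captured data? */
int have_bytes(const struct demo_lengths *h, uint32_t offset, uint32_t need)
{
    return offset <= h->caplen && need <= h->caplen - offset;
}
```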

But how do you make use of this variable (named "packet" in our prototype)? A packet contains many attributes, so as you can imagine, it is not really a string, but actually a collection of structures (for instance, a TCP/IP packet would have an Ethernet header, an IP header, a TCP header, and lastly, the packet's payload). This u_char pointer points to the serialized version of these structures. To make any use of it, we must do some interesting typecasting.

First, we must have the actual structures defined before we can typecast to them. The following are the structure definitions that I use to describe a TCP/IP packet over Ethernet.

/* Ethernet addresses are 6 bytes */
#define ETHER_ADDR_LEN 6

/* Ethernet header */
struct sniff_ethernet {
    u_char ether_dhost[ETHER_ADDR_LEN]; /* Destination host address */
    u_char ether_shost[ETHER_ADDR_LEN]; /* Source host address */
    u_short ether_type;                 /* IP? ARP? RARP? etc */
};

/* IP header */
struct sniff_ip {
    u_char ip_vhl;          /* version << 4 | header length >> 2 */
    u_char ip_tos;          /* type of service */
    u_short ip_len;         /* total length */
    u_short ip_id;          /* identification */
    u_short ip_off;         /* fragment offset field */
#define IP_RF 0x8000        /* reserved fragment flag */
#define IP_DF 0x4000        /* dont fragment flag */
#define IP_MF 0x2000        /* more fragments flag */
#define IP_OFFMASK 0x1fff   /* mask for fragmenting bits */
    u_char ip_ttl;          /* time to live */
    u_char ip_p;            /* protocol */
    u_short ip_sum;         /* checksum */
    struct in_addr ip_src,ip_dst; /* source and dest address */
};
#define IP_HL(ip) (((ip)->ip_vhl) & 0x0f)
#define IP_V(ip) (((ip)->ip_vhl) >> 4)

/* TCP header */
typedef u_int tcp_seq;      /* 32-bit sequence number type used below */

struct sniff_tcp {
    u_short th_sport;       /* source port */
    u_short th_dport;       /* destination port */
    tcp_seq th_seq;         /* sequence number */
    tcp_seq th_ack;         /* acknowledgement number */

    u_char th_offx2;        /* data offset, rsvd */
#define TH_OFF(th) (((th)->th_offx2 & 0xf0) >> 4)
    u_char th_flags;
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_PUSH 0x08
#define TH_ACK 0x10
#define TH_URG 0x20
#define TH_ECE 0x40
#define TH_CWR 0x80
#define TH_FLAGS (TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG|TH_ECE|TH_CWR)
    u_short th_win;         /* window */
    u_short th_sum;         /* checksum */
    u_short th_urp;         /* urgent pointer */
};

Note: On my Slackware Linux 8 box (stock kernel 2.2.19) I found that code using the above structures would not compile. The problem, as it turns out, was in include/features.h, which implements a POSIX interface unless _BSD_SOURCE is defined. If it was not defined, then I had to use a different structure definition for the TCP header. The more universal solution, that does not prevent the code from working on FreeBSD or OpenBSD (where it had previously worked fine), is simply to do the following:
#define _BSD_SOURCE 1
prior to including any of your header files. This will ensure that a BSD style API is being used. Again, if you don't wish to do this, then you can simply use the alternative TCP header structure, which I've linked to here, along with some quick notes about using it.

So how does all of this relate to pcap and our mysterious u_char pointer? Well, those structures define the headers that appear in the data for the packet. So how can we break it apart? Be prepared to witness one of the most practical uses of pointers (for all of those new C programmers who insist that pointers are useless, I smite you).

Again, we're going to assume that we are dealing with a TCP/IP packet over Ethernet. This same technique applies to any packet; the only difference is the structure types that you actually use. So let's begin by defining the variables and compile-time definitions we will need to deconstruct the packet data.

/* ethernet headers are always exactly 14 bytes */
#define SIZE_ETHERNET 14

const struct sniff_ethernet *ethernet; /* The ethernet header */
const struct sniff_ip *ip; /* The IP header */
const struct sniff_tcp *tcp; /* The TCP header */
const char *payload; /* Packet payload */

u_int size_ip;
u_int size_tcp;

And now we do our magical typecasting:

ethernet = (const struct sniff_ethernet *)(packet);
ip = (const struct sniff_ip *)(packet + SIZE_ETHERNET);
size_ip = IP_HL(ip)*4;
if (size_ip < 20) {
    printf("   * Invalid IP header length: %u bytes\n", size_ip);
    return;
}
tcp = (const struct sniff_tcp *)(packet + SIZE_ETHERNET + size_ip);
size_tcp = TH_OFF(tcp)*4;
if (size_tcp < 20) {
    printf("   * Invalid TCP header length: %u bytes\n", size_tcp);
    return;
}
payload = (const char *)(packet + SIZE_ETHERNET + size_ip + size_tcp);

How does this work? Consider the layout of the packet data in memory. The u_char pointer is really just a variable containing an address in memory. That's what a pointer is; it points to a location in memory.

For the sake of simplicity, we'll say that the address this pointer is set to is the value X. Well, if our three structures are just sitting in line, the first of them (sniff_ethernet) being located in memory at the address X, then we can easily find the address of the structure after it; that address is X plus the length of the Ethernet header, which is 14, or SIZE_ETHERNET.

Similarly if we have the address of that header, the address of the structure after it is the address of that header plus the length of that header. The IP header, unlike the Ethernet header, does not have a fixed length; its length is given, as a count of 4-byte words, by the header length field of the IP header. As it's a count of 4-byte words, it must be multiplied by 4 to give the size in bytes. The minimum length of that header is 20 bytes.

The TCP header also has a variable length; its length is given, as a number of 4-byte words, by the "data offset" field of the TCP header, and its minimum length is also 20 bytes.
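To make the "count of 4-byte words" arithmetic concrete, here is a tiny sketch that works on the raw bytes rather than on the structures. The function names are made up for this example. A typical first IP byte of 0x45 means version 4 with a header length of 5 words, which is 5 * 4 = 20 bytes; a TCP data offset byte of 0x50 likewise means 5 words, or 20 bytes.

```c
#include <stdint.h>

/* IPv4: the high nibble of the first header byte is the version. */
unsigned ip_version(uint8_t vhl)
{
    return vhl >> 4;
}

/* IPv4: the low nibble is the header length in 4-byte words. */
unsigned ip_header_bytes(uint8_t vhl)
{
    return (vhl & 0x0fu) * 4u;
}

/* TCP: the high nibble of the data offset byte is the header length in words. */
unsigned tcp_header_bytes(uint8_t offx2)
{
    return ((offx2 & 0xf0u) >> 4) * 4u;
}
```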

So let's make a chart:

Variable          Location (in bytes)
sniff_ethernet    X
sniff_ip          X + SIZE_ETHERNET
sniff_tcp         X + SIZE_ETHERNET + {IP header length}
payload           X + SIZE_ETHERNET + {IP header length} + {TCP header length}

The sniff_ethernet structure, being the first in line, is simply at location X. sniff_ip, which follows directly after sniff_ethernet, is at the location X, plus however much space the Ethernet header consumes (14 bytes, or SIZE_ETHERNET). sniff_tcp is after both sniff_ip and sniff_ethernet, so it is located at X plus the sizes of the Ethernet and IP headers (14 bytes, and 4 times the IP header length, respectively). Lastly, the payload (which doesn't have a single structure corresponding to it, as its contents depend on the protocol being used atop TCP) is located after all of them.
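The chart translates almost directly into code. The following is a rough sketch rather than the sniffer itself: it works on a plain byte buffer, assumes the frame really is TCP over IPv4 over Ethernet (it does not look at the Ethernet type or the IP protocol field), and uses hypothetical names (locate_headers, struct demo_offsets, DEMO_SIZE_ETHERNET). It also checks each offset against the number of captured bytes, so a short capture is rejected instead of being read past its end.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_SIZE_ETHERNET 14   /* same value as SIZE_ETHERNET */

/* Offsets of each piece, measured from the start of the frame (X). */
struct demo_offsets {
    size_t ip;        /* X + SIZE_ETHERNET */
    size_t tcp;       /* X + SIZE_ETHERNET + IP header length */
    size_t payload;   /* ... + TCP header length */
};

/* Returns 0 and fills *out on success, -1 if the frame is too short or a
 * header length field is below the 20-byte minimum. */
int locate_headers(const uint8_t *frame, size_t caplen, struct demo_offsets *out)
{
    size_t ip_off = DEMO_SIZE_ETHERNET;
    size_t size_ip, tcp_off, size_tcp;

    if (caplen < ip_off + 20)
        return -1;
    size_ip = (size_t)(frame[ip_off] & 0x0f) * 4;           /* like IP_HL() * 4 */
    if (size_ip < 20 || caplen < ip_off + size_ip + 20)
        return -1;

    tcp_off = ip_off + size_ip;
    size_tcp = (size_t)((frame[tcp_off + 12] & 0xf0) >> 4) * 4; /* like TH_OFF() * 4 */
    if (size_tcp < 20 || caplen < tcp_off + size_tcp)
        return -1;

    out->ip = ip_off;
    out->tcp = tcp_off;
    out->payload = tcp_off + size_tcp;
    return 0;
}
```

For a frame with no IP or TCP options this gives offsets 14, 34 and 54, matching X + 14, X + 14 + 20 and X + 14 + 20 + 20 in the chart.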

So at this point, we know how to set our callback function, call it, and find out the attributes about the packet that has been sniffed. It's now the time you have been waiting for: writing a useful packet sniffer. Because of the length of the source code, it is not included in the body of this document.

Thursday, April 17, 2008

Named pipes

A very useful Linux feature is named pipes which enable different processes to communicate.
One of the fundamental features that makes Linux and other Unices useful is the “pipe”. Pipes allow separate processes to communicate without having been designed explicitly to work together. This allows tools quite narrow in their function to be combined in complex ways.

A simple example of using a pipe is the command:

ls | grep x

When bash examines the command line, it finds the vertical bar character | that separates the two commands. Bash and other shells run both commands, connecting the output of the first to the input of the second. The ls program produces a list of files in the current directory, while the grep program reads the output of ls and prints only those lines containing the letter x.

The above, familiar to most Unix users, is an example of an “unnamed pipe”. The pipe exists only inside the kernel and can be accessed only by the process that created it (in this case, the bash shell) and by the child processes it starts. For those who don't already know, a parent process is a process that creates separate child processes, which in turn execute the programs.
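Under the hood, an unnamed pipe is created with the pipe() system call, which hands back two file descriptors: one for reading and one for writing. A shell passes those ends to the two child processes it starts. The sketch below keeps both ends in a single process for brevity, and it assumes the message is short enough to fit in the kernel's pipe buffer so that the write does not block. The function name pipe_roundtrip is made up for this example.

```c
#include <string.h>
#include <unistd.h>

/* Push msg through an unnamed pipe and read it back into out.
 * Returns the number of bytes read, or -1 on failure. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    int fds[2];                 /* fds[0] is the read end, fds[1] the write end */
    size_t len = strlen(msg);
    ssize_t written, n = -1;

    if (pipe(fds) == -1)
        return -1;

    written = write(fds[1], msg, len);
    close(fds[1]);              /* after the data, the reader will see EOF */

    if (written == (ssize_t)len)
        n = read(fds[0], out, outsz);
    close(fds[0]);
    return n;
}
```

Because the pipe has no name in the file system, a process that did not inherit these descriptors has no way to open it.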

The other sort of pipe is a “named” pipe, which is sometimes called a FIFO. FIFO stands for “First In, First Out” and refers to the property that the order of bytes going in is the same coming out. The “name” of a named pipe is actually a file name within the file system. Pipes are shown by ls as any other file with a couple of differences:

% ls -l fifo1
prw-r--r-- 1 andy users 0 Jan 22 23:11 fifo1|

The p in the leftmost column indicates that fifo1 is a pipe. The rest of the permission bits control who can read or write to the pipe just like a regular file. On systems with a modern ls, the | character at the end of the file name is another clue, and on Linux systems with the color option enabled, fifo1| is printed in red by default.

On older Linux systems, named pipes are created by the mknod program, usually located in the /etc directory. On more modern systems, mkfifo is a standard utility. The mkfifo program takes one or more file names as arguments for this task and creates pipes with those names. For example, to create a named pipe with the name pipe1 give the command:

mkfifo pipe1
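
As a quick check, here is a minimal sketch that creates a FIFO in a scratch directory and confirms its type. The directory and the name demo_pipe are placeholders chosen for illustration; the sketch only assumes that mktemp, mkfifo, ls and cut are available.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
scratch=$(mktemp -d)
fifo="$scratch/demo_pipe"        # hypothetical name, for illustration only

mkfifo "$fifo"

# test -p succeeds only if the path exists and is a FIFO.
if [ -p "$fifo" ]; then is_fifo=yes; else is_fifo=no; fi

# The first character of the ls -l mode string is the file type.
type_char=$(ls -l "$fifo" | cut -c1)

echo "is_fifo=$is_fifo type_char=$type_char"
rm -r "$scratch"
```

The test -p check is usually the more convenient of the two inside scripts, since it does not depend on the layout of the ls output.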

The simplest way to show how named pipes work is with an example. Suppose we've created pipe1 as shown above. In one virtual console, type:

ls -l > pipe1

and in another type:

cat < pipe1

Voila! The output of the command run on the first console shows up on the second console. Note that the order in which you run the commands doesn't matter.

If you haven't used virtual consoles before, see the article “Keyboards, Consoles and VT Cruising” by John M. Fisk in the November 1996 Linux Journal.

If you watch closely, you'll notice that the first command you run appears to hang. This happens because the other end of the pipe is not yet connected, and so the kernel suspends the first process until the second process opens the pipe. In Unix jargon, the process is said to be “blocked”, since it is waiting for something to happen.
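
The same rendezvous can be reproduced in a single script by putting the writer in the background. This is only a sketch under the assumption that a scratch directory from mktemp is acceptable; the message text is made up for the example.

```shell
scratch=$(mktemp -d)
mkfifo "$scratch/pipe1"

# Writer: started first, in the background. Opening the FIFO for writing
# blocks until some process opens the same FIFO for reading.
printf 'hello through the pipe\n' > "$scratch/pipe1" &

# Reader: opening the FIFO for reading releases the blocked writer.
received=$(cat < "$scratch/pipe1")

wait                      # reap the background writer
echo "$received"
rm -r "$scratch"
```

Swapping the order (reader in the background, writer in the foreground) behaves the same way, which matches the observation that the order of the two commands doesn't matter.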

One very useful application of named pipes is to allow totally unrelated programs to communicate with each other. For example, a program that services requests of some sort (print files, access a database) could open the pipe for reading. Then, another process could make a request by opening the pipe and writing a command. That is, the “server” can perform a task on behalf of the “client”. Blocking can also happen if the client isn't writing, or the server isn't reading.
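
A toy version of that client/server arrangement might look like the sketch below. The function name serve_requests, the FIFO name requests and the request strings are all hypothetical; a real server would do actual work instead of echoing the request.

```shell
scratch=$(mktemp -d)
mkfifo "$scratch/requests"

# Hypothetical server: handles one request per line until every writer
# has closed the FIFO, at which point read sees end of file.
serve_requests() {
    while IFS= read -r request; do
        printf 'served: %s\n' "$request"
    done < "$scratch/requests"
}

serve_requests > "$scratch/log" &
server_pid=$!

# Client: opening the FIFO for writing blocks until the server has it open.
{
    echo "print report.txt"
    echo "query users"
} > "$scratch/requests"

wait "$server_pid"
log=$(cat "$scratch/log")
echo "$log"
rm -r "$scratch"
```

Note that this server exits after the first client closes the pipe. A long-running server would reopen the FIFO in a loop, or keep a write descriptor open itself so that it never sees end of file.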

Pipe Madness

Create two named pipes, pipe1 and pipe2. Run the commands:

echo -n x | cat - pipe1 > pipe2 &
cat < pipe2 > pipe1

On screen, it will not appear that anything is happening, but if you run top (a command similar to ps for showing process status), you'll see that both cat programs are running like crazy copying the letter x back and forth in an endless loop.

After you press ctrl-C to get out of the loop, you may receive the message “broken pipe”. This error occurs when a process writes to a pipe after the process reading the pipe has closed its end. Since the reader is gone, the data has no place to go. Normally, the writer will finish writing its data and close the pipe. At this point, the reader sees the EOF (end of file) and executes the request.

Whether or not the “broken pipe” message is issued depends on events at the exact instant the ctrl-C is pressed. If the second cat has just read the x, pressing ctrl-C stops the second cat, pipe1 is closed and the first cat stops quietly, i.e., without a message. On the other hand, if the second cat is waiting for the first to write the x, ctrl-C causes pipe2 to close before the first cat can write to it, and the error message is issued. This sort of random behavior is known as a “race condition”.
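
The “broken pipe” condition can also be provoked on purpose, without any race. In this sketch yes writes forever while head reads a single line and exits; the temporary status file is just a device for capturing how the writer ended, and is an assumption of the example rather than anything the commands require.

```shell
status_file=$(mktemp)

# yes writes "y" forever; head exits after one line, closing the read end.
# The subshell records how yes ended once its next write fails.
( yes 2>/dev/null; echo "$?" > "$status_file" ) | head -n 1 > /dev/null

writer_status=$(cat "$status_file")
rm -f "$status_file"

# With SIGPIPE at its default disposition the shell reports 128 plus the
# signal number (141 on Linux). If SIGPIPE is being ignored, yes instead
# sees a write error and exits with an ordinary nonzero status.
echo "writer exit status: $writer_status"
```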

Process Substitution

Bash uses named pipes in a really neat way. Recall that when you enclose a command in parentheses, the command is actually run in a “subshell”; that is, the shell clones itself and the clone interprets the command(s) within the parentheses. Since the outer shell is running only a single “command”, the output of a complete set of commands can be redirected as a unit. For example, the command:

(ls -l; ls -l) >ls.out

writes two copies of the current directory listing to the file ls.out.

Process substitution (sometimes loosely called command substitution, a term that properly refers to the $(...) form) occurs when you put a < or > in front of the left parenthesis. For instance, typing the command:

cat <(ls -l)

results in the command ls -l executing in a subshell as usual, but redirects the output to a temporary named pipe, which bash creates, names and later deletes. (On systems that provide /dev/fd, bash may instead hand cat a /dev/fd entry backed by an unnamed pipe; the effect is the same.) Therefore, cat has a valid file name to read from, and we see the output of ls -l, taking one more step than usual to do so. Similarly, giving >(commands) results in bash naming a temporary pipe, which the commands inside the parentheses read for input.

If you want to see whether two directories contain the same file names, run the single command:

cmp <(ls /dir1) <(ls /dir2)

The compare program cmp will see the names of two files which it will read and compare.
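
The <(...) form can be imitated by hand with mkfifo, which makes it easier to see what the shell is doing on your behalf. The sketch below is written for a plain POSIX shell; the helper compare_listings, the FIFO names left and right, and the three sample directories are all invented for the example.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/dir1" "$scratch/dir2" "$scratch/dir3"
touch "$scratch/dir1/a" "$scratch/dir1/b" \
      "$scratch/dir2/a" "$scratch/dir2/b" \
      "$scratch/dir3/a"

# compare_listings DIR_A DIR_B (hypothetical helper): succeeds if both
# directories contain the same file names.
compare_listings() {
    mkfifo "$scratch/left" "$scratch/right"
    ls "$1" > "$scratch/left" &      # each writer blocks until cmp opens its FIFO
    ls "$2" > "$scratch/right" &
    cmp -s "$scratch/left" "$scratch/right"
    result=$?
    wait                             # reap both background writers
    rm "$scratch/left" "$scratch/right"
    return "$result"
}

if compare_listings "$scratch/dir1" "$scratch/dir2"; then same_1_2=yes; else same_1_2=no; fi
if compare_listings "$scratch/dir1" "$scratch/dir3"; then same_1_3=yes; else same_1_3=no; fi

echo "dir1 vs dir2: $same_1_2, dir1 vs dir3: $same_1_3"
rm -r "$scratch"
```

Creating, naming and removing the two pipes is exactly the bookkeeping that bash hides when you write cmp <(ls /dir1) <(ls /dir2).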

Process substitution also makes the tee command (used to view and save the output of a command) much more useful in that you can cause a single stream of input to be read by multiple readers without resorting to temporary files; bash does all the work for you. The command:

ls | tee >(grep foo | wc >foo.count) \
>(grep bar | wc >bar.count) \
| grep baz | wc >baz.count

counts the number of occurrences of foo, bar and baz in the output of ls and writes this information to three separate files. Process substitutions can even be nested:

cat <(cat <(cat <(ls -l)))

works as a very roundabout way to list the current directory.

As you can see, while the unnamed pipes allow simple commands to be strung together, named pipes, with a little help from bash, allow whole trees of pipes to be created. The possibilities are limited only by your imagination.

Linux Signal Handling Model

Signals are used to notify a process or thread of a particular event. Many computer science researchers compare signals with hardware interrupts, which occur when a hardware subsystem, such as a disk I/O (input/output) interface, generates an interrupt to a processor when the I/O completes. This event in turn causes the processor to enter an interrupt handler, so subsequent processing can be done in the operating system based on the source and cause of the interrupt.

UNIX guru W. Richard Stevens aptly describes signals as software interrupts. When a signal is sent to a process or thread, a signal handler may be entered (depending on the current disposition of the signal), which is similar to the system entering an interrupt handler as the result of receiving an interrupt.

Operating system signals actually have quite a history of design changes in the signal code and various implementations of UNIX. This was due in part to some deficiencies in the early implementation of signals, as well as the parallel development work done on different versions of UNIX, primarily BSD UNIX and AT&T System V. James Cox, Berny Goodheart and W. Richard Stevens cover these details in their respective well-known books, so they don't need to be repeated here.

Implementation of correct and reliable signals has been in place for many years now, where an installed signal handler remains persistent and is not reset by the kernel. The POSIX standards provided a fairly well-defined set of interfaces for using signals in code, and today the Linux implementation of signals is fully POSIX-compliant. Note that reliable signals require the use of the newer sigaction interface, as opposed to the traditional signal call.

The occurrence of a signal may be synchronous or asynchronous to the process or thread, depending on the source of the signal and the underlying reason or cause. Synchronous signals occur as a direct result of the executing instruction stream, where an unrecoverable error (such as an illegal instruction or illegal address reference) requires an immediate termination of the process. Such signals are directed to the thread which caused the error with its execution stream. As an error of this type causes a trap into a kernel trap handler, synchronous signals are sometimes referred to as traps.

Asynchronous signals are external to (and in some cases, unrelated to) the current execution context. One obvious example would be the sending of a signal to a process from another process or thread via a kill(2), _lwp_kill(2) or sigsend(2) system call, or a thr_kill(3T), pthread_kill(3T) or sigqueue(3R) library invocation. Asynchronous signals are also aptly referred to as interrupts.

Every signal has a unique signal name, an abbreviation that begins with SIG (SIGINT for interrupt signal, for example) and a corresponding signal number. Additionally, for all possible signals, the system defines a default disposition or action to take when a signal occurs. There are four possible default dispositions:

* Exit: forces the process to exit.
* Core: forces the process to exit and create a core file.
* Stop: stops the process.
* Ignore: ignores the signal; no action taken.

A signal's disposition within a process's context defines what action the system will take on behalf of the process when a signal is delivered. All threads and LWPs (lightweight processes) within a process share the signal disposition, which is processwide and cannot be unique among threads within the same process.

The signal table in my last blog post provides a complete list of signals, along with a description and default action. The data structures in the kernel to support signals in Linux are to be found in the task structure. Here are the most common elements of said structure pertaining to signals:

* current->sig holds the signal handlers.
* sigmask_lock is a per-thread spinlock which protects the signal queue and atomicity of other signal operations.
* current->signal and current->blocked contain bitmasks (currently 64 bits long, but freely expandable) of pending and permanently blocked signals.
* sigqueue and sigqueue_tail form a doubly linked list of pending signals. Linux has RT signals which can be queued as well. “Traditional” signals are internally mapped to RT signals.

Signal Description and Default Action

The disposition of a signal can be changed from its default, and a process can arrange to catch a signal and invoke a signal-handling routine of its own or ignore a signal that may not have a default disposition of Ignore. The only exceptions are SIGKILL and SIGSTOP; their default dispositions cannot be changed. The interfaces for defining and changing signal disposition are the signal and sigset libraries and the sigaction system call. Signals can also be blocked, which means the process has temporarily prevented delivery of a signal. Generation of a signal that has been blocked will result in the signal remaining as pending to the process until it is explicitly unblocked or the disposition is changed to Ignore. The sigprocmask system call will set or get a process's signal mask, the bit array inspected by the kernel to determine if a signal is blocked or not. thr_setsigmask and pthread_sigmask are the equivalent interfaces for setting and retrieving the signal mask at the user-threads level.
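
From a shell script, the closest equivalent to these interfaces is the trap builtin, which can install a handler, set a disposition to Ignore, or restore the default. The following is a small sketch, not a substitute for sigaction in C; the variable names are made up, and the script sends the signals to itself with kill.

```shell
caught=no

# Catch: install a handler for SIGUSR1, then send the signal to this shell.
trap 'caught=yes' USR1
kill -USR1 $$

# Ignore: an empty action sets the disposition to Ignore, so the shell
# survives a SIGTERM that would otherwise end it.
trap '' TERM
kill -TERM $$
survived_term=yes

# Default: "-" restores the default disposition for both signals.
trap - USR1 TERM

echo "caught=$caught survived_term=$survived_term"
```

The shell has no portable way to block a signal in the sigprocmask sense, so that part of the discussion has no counterpart here.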

I mentioned earlier that a signal may originate from several different places for a variety of different reasons. The first three signals listed in the signal table (SIGHUP, SIGINT and SIGQUIT) are generated either by a keyboard entry from the controlling terminal (SIGINT and SIGQUIT) or when the controlling terminal becomes disconnected (SIGHUP). Use of the nohup command makes processes “immune” from hangups by setting the disposition of SIGHUP to Ignore.
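
The nohup trick works because a disposition of Ignore is inherited by child processes, even across exec. Here is a rough sketch of that mechanism using only the shell; the inner sh -c command is a stand-in for whatever program would be protected, and the variable names are invented for the example.

```shell
# Without protection: the child shell sends itself SIGHUP. Unless SIGHUP was
# already being ignored when this script started, the child dies before echo.
unprotected=$(sh -c 'kill -HUP $$; echo survived')
unprotected_status=$?

# With SIGHUP ignored in the parent: the child inherits Ignore, so the
# signal has no effect and the echo runs.
trap '' HUP
protected=$(sh -c 'kill -HUP $$; echo survived')
trap - HUP

echo "unprotected: '$unprotected' (status $unprotected_status)"
echo "protected:   '$protected'"
```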

Other terminal I/O-related signals include SIGSTOP, SIGTTIN, SIGTTOU and SIGTSTP. For signals originating from a keyboard command, the key sequence that generates each signal is defined within the parameters of the terminal session, typically via stty(1). For example, CTRL-C usually results in a SIGINT being sent to the foreground process, and SIGINT has a default disposition of Exit.

User tasks in Linux, created via explicit calls to either thr_create or pthread_create, all have their own signal masks. Linux threads call clone with CLONE_SIGHAND; this shares all signal handlers between threads via sharing the current->sig pointer. Delivered signals are unique to a thread.

In some operating systems, such as Solaris 7, signals generated as a result of a trap (SIGFPE, SIGILL, etc.) are sent to the thread that caused the trap. Asynchronous signals are delivered to the first thread found not blocking the signal. In Linux, it is almost exactly the same. Synchronous signals happening in the context of a given thread are delivered to that thread.

Asynchronous in-kernel signals (e.g., asynchronous network I/O) are delivered to the thread that generated the asynchronous I/O. Explicit user-generated signals get delivered to the right thread as well. However, if CLONE_PID is used, all places that use the PID to deliver a signal will behave in a “weird” way; the signal gets randomly delivered to the first thread in the pidhash. Linux threads don't use CLONE_PID, so there is no such problem if you are using the pthreads.h thread API.

When a signal is sent to a user task, for example, when a user-space program accesses an illegal page, the following happens:

* page_fault (entry.S) is the low-level page-fault handler.
* do_page_fault (fault.c) fetches i386-specific parameters of the fault and does basic validation of the memory range involved.
* handle_mm_fault (memory.c) is generic MM (memory management) code (i386-independent), which gets called only if the memory range (VMA) exists. The MM reads the page table entry and uses the VMA to find out whether the memory access is legal or not.


{
    int fault = handle_mm_fault(tsk, vma, address, write);
    if (fault < 0)
        goto out_of_memory;
    if (!fault)
        goto do_sigbus;
}
...
do_sigbus:
    up(&mm->mmap_sem);
    /*
     * Send a sigbus, regardless of whether we
     * were in kernel or user mode.
     */
    tsk->thread.cr2 = address;
    tsk->thread.error_code = error_code;
    tsk->thread.trap_no = 14;
    force_sig(SIGBUS, tsk);

Listing 1

The case we are interested in now is when the access was illegal (e.g., a write was attempted to a read-only mapping): handle_mm_fault returns 0 to do_page_fault in this case. As you can see from Listing 1, locking of the MM is very finely grained (and it better be this way); the mm->mmap_sem, per-MM semaphore, is used (which typically varies from process to process).

force_sig(SIGBUS,current) is used to “force” the SIGBUS signal on the faulting task. force_sig delivers the signal even if the process has attempted to ignore SIGBUS.

force_sig fills out the signal event structure and queues it into the process's signal queue (current->sigqueue and current->sigqueue_tail). The signal queue holds an indefinite number of queued signals. The semantics of “classic” signals are that follow-up signals are ignored—this is emulated in the signal code kernel/signal.c. “Generic” (or RT) signals can be queued arbitrarily; there are reasonable limits to the length of the signal queue.

The signal is queued, and current->signal is updated. Now comes the tricky part: the kernel returns to user space. Return to user space happens from do_page_fault=>page_fault (entry.S), then the low-level exit code in entry.S is executed in this order:

page_fault=>(called do_page_fault)=>error_code=>
ret_from_exception=>(checks if return to user space)=>
ret_with_reschedule=>(sees that current->signal is nonzero)
=>calls do_signal

Next, do_signal unqueues the signal to be executed. In this case, it's SIGBUS.

Then handle_signal is called with the “unqueued” signal (which can potentially hold extra event information in case of real-time signals/messages).

Next called is setup_frame, where all user-space registers are saved and the kernel stack frame return address is modified to point to the installed signal handler. A small “jumper” code sequence is put on the user stack (obviously, the code first makes sure the user stack is valid) which will return us to kernel space once the signal handler has finished. (See Listing 2.)

{
err |= __put_user(frame->retcode, &frame->pretcode);
/* This is popl %eax ; movl $,%eax ; int $0x80 */
err |= __put_user(0xb858, (short *)(frame->retcode+0));
err |= __put_user(__NR_sigreturn, (int *)(frame->retcode+2));
err |= __put_user(0x80cd, (short *)(frame->retcode+6));
}

Listing 2

Careful: this area is one of the least-understood pieces of the Linux kernel, and for good reason; it is really tough code to read and follow.

The popl %eax ; movl $,%eax ; int $0x80 x86 assembly sequence calls sys_sigreturn, which later on will restore the kernel stack frame return address to point to the original (faulting) user address.

What is all this magic good for? Well, first the kernel has to guarantee that signal handlers get called properly and the original state is restored. The kernel also has to deal with binary compatibility issues. Linux guarantees that on the IA-32 (Intel x86) architecture, we can run any iBC86-compliant binary code. Speed is also an issue.

restore_all:
RESTORE_ALL
#define RESTORE_ALL \
popl %ebx; \
popl %ecx; \
popl %edx; \
popl %esi; \
popl %edi; \
popl %ebp; \
popl %eax; \
1: popl %ds; \
2: popl %es; \
addl $4,%esp; \
3: iret;

Listing 3

Finally, we return to entry.S again, but current->signal is already cleared, so we do not execute do_signal but jump to restore_all as shown in Listing 3. restore_all executes the “iret” that brings us into user space. Suddenly, we are magically executing the signal handler.

Did you get lost yet? No? Here is some more magic. Once the signal handler finishes (it does an assembly “ret” like all well-behaving functions), it will execute the small jumper function we have set up on the user stack. Again we return to the kernel, but now we execute the sys_sigreturn system call, which lives in arch/i386/kernel/signal.c as well. It essentially executes the following code section:

if (restore_sigcontext(regs, &frame->sc, &eax))
goto badframe;
return eax;

The above code restores the exact user-register contents into the kernel stack frame (including the return address and flags register) and executes a normal ret_from_syscall, bringing us back to the original faulting code. Hopefully the SIGBUS handler has fixed the problem of why we were faulting.

Now, while reading the above description, you might think this is awfully complex and slow. It actually isn't; lmbench reveals that Linux has the fastest signal-handler installation and execution performance by far of any UNIX running:

moon:~/l> ./lat_sig install
Signal handler installation: 1.688 microseconds
moon:~/l> ./lat_sig catch
Signal handler overhead: 3.186 microseconds

Best of all, it scales linearly on SMP:

moon:~/l> ./lat_sig catch & ./lat_sig catch &
Signal handler overhead: 3.264 microseconds
Signal handler overhead: 3.248 microseconds
moon:~/l> ./lat_sig install & ./lat_sig install &
Signal handler installation: 1.721 microseconds
Signal handler installation: 1.689 microseconds

Signals and Interrupts, A Perfect Couple

Signals can be sent from system calls, interrupts and bottom-half handlers (see sidebar) alike; there is no difference. In other words, the Linux signal queue is interrupt-safe, as strange and recursive as that sounds, so it's fairly flexible.

An interesting signal-delivery case, however, is on SMP. Imagine a thread is executing on one processor, and it gets an asynchronous event (e.g., an asynchronous socket I/O signal) from an IRQ handler (or another process) on another CPU. In that case, we send a cross-CPU message to the running process, so there is no latency in signal delivery. (The speed of cross-CPU delivery is about five microseconds on a Pentium II 350MHz.)

Conclusions

Once again, we notice how Linux is actually the technology leader in important kernel aspects such as scheduling, interrupt handling and signal handling. This also proves the conjecture that the Linux developer community is collectively more capable and more resourceful than any private corporation's R&D department could ever be.

Tuesday, April 15, 2008

Basics of unix signals

SIGNALS

Signals offer another way to transition between Kernel and User Space. While system calls are synchronous calls originating from User Space, signals are asynchronous messages coming from Kernel space. Signals are always delivered by the Kernel but they can be initiated by:

* other processes on the system (using the kill command/system call)
* the process itself. This includes hardware exceptions triggered by a process: when a program executes an illegal instruction, such as dividing a number by zero or attempting to access a memory zone that has not been allocated yet, the hardware detects it and a signal is sent to the faulty program.
* the Kernel. The Kernel also uses signals to notify a process of some system events, such as the arrival of out-of-band data. In the same way, when a program sets a system alarm, the Kernel sends a signal to the process every time a timer expires (e.g. every 10 seconds).

So a signal is an asynchronous message, but what happens exactly when a process receives it? Well… it depends. For each signal, a process can instruct the Kernel to either:

* Ignore this signal: In which case this signal has absolutely no effect on the process. Ignoring the signal must be explicitly requested before the signal is delivered. Also, some signals cannot be ignored.

* Catch this signal: In which case the Kernel will call a custom routine, as defined by this process when delivering the signal. The process must explicitly register this custom routine before the signal is delivered. The signal-catching function is traditionally called a custom signal handler.

* Let the default action apply: For each signal the system defines a default action that will be called if the process did not explicitly request to ignore or catch this signal. The default signal handler typically but not always terminates the process (we will cover the default actions for all common UNIX signals later in this article). Letting the default action apply is the implicit system behavior, but it can also be requested explicitly by a process.

The overall dynamic for signal delivery is quite simple:

1. When a process receives a signal that is not ignored, the program immediately interrupts its current execution flow.
2. Control is then transferred to a dedicated signal handler, a custom one defined by the process or the system default.
3. Once the signal handler completes, the program resumes where it was originally interrupted.

In practice, though, the mechanics used by the Kernel to send a signal are more involved and consist of two distinct steps: generating and delivering the signal.

The Kernel generates a signal for a process simply by setting a flag that indicates the type of the signal that was received. More precisely, each process has a dedicated bitfield used to store pending signals; for the system, generating a signal is just a matter of updating the bit corresponding to the signal type in this bitfield structure. At this stage, the signal is said to be pending.

Before transferring control back to a process in user mode, the Kernel always checks the pending signals for this process. This check must happen in Kernel space because some signals can never be ignored by a process – namely SIGSTOP and SIGKILL (you trigger SIGKILL with the infamous kill -9 command).

When a pending signal is detected by the Kernel, the system will deliver the signal by performing one of the following actions:

* if the signal is SIGKILL the system does not switch back to user mode. It processes the signal in Kernel mode and terminates the process. This is why kill -9 is such a bulletproof way to terminate a misbehaving process.

* if the signal is SIGSTOP the system also stays in Kernel mode. It suspends the process and puts it to sleep.

* if the process did not register any custom handler for this signal, the default system action is taken. If the default action is to ignore the signal, no action is taken, and the system just switches back to user mode and transfers control to the process. If the default action is not to ignore the signal, the system remains in Kernel mode and the process will exit, dump core, or be suspended. For instance, the default behavior for the SIGSEGV signal is to dump a core file and terminate the process, so that one can analyze the bug that triggered the segmentation fault.

* if the process registered a custom handler for the signal, the Kernel transfers control to the process and the custom signal handler is executed in user mode. At this point, the program is the one responsible for handling the signal properly.
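
One visible consequence of the default action is the exit status a shell reports for a process that was terminated by a signal: by shell convention it is 128 plus the signal number. The sketch below assumes a sleep command as a harmless victim process; the variable names are made up.

```shell
# SIGTERM has the default action Terminate, and sleep installs no handler.
sleep 60 &
victim=$!
kill -TERM "$victim"
wait "$victim"
term_status=$?            # 143 = 128 + 15 (SIGTERM)

# SIGKILL cannot be caught or ignored, so it always takes this path.
sleep 60 &
victim=$!
kill -KILL "$victim"
wait "$victim"
kill_status=$?            # 137 = 128 + 9 (SIGKILL)

echo "SIGTERM: $term_status, SIGKILL: $kill_status"
```

Some shells also print a short “Terminated” or “Killed” notice on standard error when they reap such a job.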

A crucial point here is to realize that the Kernel triggers the signal handler when the signal is delivered, not when the signal is generated. As signal delivery only happens when the system schedules the target process as active in a multitasking system (just before switching back to User Mode) there can be a significant delay between signal generation and delivery.

Finally a process has one last option when it comes to signals. It can instruct the Kernel to block the delivery of a specific signal. If a signal is blocked, the system still generates it and the signal is considered pending. Nevertheless the Kernel will not deliver a blocked signal until the process unblocks it. Signal blocking is typically used in critical sections of code that must not be interrupted.


Where three numbers are listed, the value depends on the architecture: the first is used on Alpha and SPARC, the second on x86, ARM and most other architectures, and the third on MIPS. A dash means the signal is absent on that architecture.

Name       Number    Default action     Semantics
SIGHUP     1         Terminate          Hangup detected on controlling terminal or death of controlling process.
SIGINT     2         Terminate          Interrupt from keyboard. Usually terminates the process. Can be triggered by Ctrl-C.
SIGQUIT    3         Core dump          Quit from keyboard. Usually causes the process to terminate and dump core. Can be triggered by Ctrl-\.
SIGILL     4         Core dump          The process has executed an illegal hardware instruction.
SIGTRAP    5         Core dump          Trace/breakpoint trap. Hardware fault.
SIGABRT    6         Core dump          Abort signal from abort(3).
SIGFPE     8         Core dump          Floating point exception such as dividing by zero or a floating point overflow.
SIGKILL    9         Terminate          Sure way to terminate (kill) a process. Cannot be caught or ignored.
SIGSEGV    11        Core dump          The process attempted to access an invalid memory reference.
SIGPIPE    13        Terminate          Broken pipe: sent to a process writing to a pipe or a socket with no reader (most likely the reader has terminated).
SIGALRM    14        Terminate          Timer signal from alarm(2).
SIGTERM    15        Terminate          Termination signal. The kill command sends this signal by default, when no explicit signal type is provided.
SIGUSR1    30,10,16  Terminate          First user-defined signal, designed to be used by application programs, which can freely define its semantics.
SIGUSR2    31,12,17  Terminate          Second user-defined signal, designed to be used by application programs, which can freely define its semantics.
SIGCHLD    20,17,18  Ignore             Child stopped or terminated.
SIGCONT    19,18,25  Continue / Ignore  Continue if stopped.
SIGSTOP    17,19,23  Stop               Sure way to stop a process: cannot be caught or ignored. Used for non-interactive job control, while SIGTSTP is the interactive stop signal.
SIGTSTP    18,20,24  Stop               Interactive signal used to suspend process execution. Usually generated by typing Ctrl-Z in a terminal.
SIGTTIN    21,21,26  Stop               A background process attempted to read from its controlling terminal (tty input).
SIGTTOU    22,22,27  Stop               A background process attempted to write to its controlling terminal (tty output).
SIGIO      23,29,22  Terminate          I/O now possible (asynchronous I/O event).
SIGBUS     10,7,10   Core dump          Bus error (bad memory access).
SIGPOLL              Terminate          Signals an event on a pollable device.
SIGPROF    27,27,29  Terminate          Expiration of a profiling timer set with setitimer.
SIGSYS     12,-,12   Core dump          Invalid system call. The Kernel interpreted a processor instruction as a system call, but its argument is invalid.
SIGURG     16,23,21  Ignore             Urgent condition on socket (e.g. out-of-band data).
SIGVTALRM  26,26,28  Terminate          Expiration of a virtual interval timer set with setitimer.
SIGXCPU    24,24,30  Core dump          CPU soft time limit exceeded (resource limits).
SIGXFSZ    25,25,31  Core dump          File soft size limit exceeded (resource limits).
SIGWINCH   28,28,20  Ignore             Informs a process of a change in associated terminal window size.
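
Since several of these numbers vary between architectures, it can be worth checking them on the machine at hand. The kill command's -l option translates a signal number into its name; the short sketch below only looks up numbers that are the same everywhere, and the variable names are invented for the example.

```shell
# kill -l NUMBER prints the signal name without the SIG prefix.
for number in 1 2 9 15; do
    printf 'signal %s is SIG%s\n' "$number" "$(kill -l "$number")"
done

name_2=$(kill -l 2)
name_9=$(kill -l 9)
name_15=$(kill -l 15)
```

Running kill -l with no argument lists every signal name the system knows, which is the quickest way to see the local numbering for entries such as SIGUSR1 or SIGBUS.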