Networking Stack

Monday, September 8, 2008

Good Networking Q and A.

* Q: What should happen upon receipt of a RST packet containing data?
* A: The host should accept it. It was suggested that a RST could contain ASCII text explaining the error, but no standard was ever established.
* Credit: Kris Katterjohn

* Q: What is the longest acceptable delay for an ACK?
* A: 0.5 seconds.
* Credit: Kris Katterjohn

* Q: An ICMP error message must include what data from the original erring datagram?
* A: The IP header and at least 8 bytes of payload.
* Credit: Kris Katterjohn

* Q: What is the major difference between Windows tracert and Van Jacobson-style UNIX traceroute?
* A: Windows tracert uses ICMP where Van Jacobson-style traceroute uses UDP.
* Credit: Chris Royle

* Q: Name three modern OS's that still use the mbuf structure, described in detail in TCP/IP Illustrated.
* A: FreeBSD, OpenBSD, NetBSD
* Credit: Sam Elstob

* Q: What is cheapernet?
* A: Coaxial cable with a diameter no longer suited to hitting burglars with. The original Ethernet cables were 1/2" in diameter and quite rigid.
* Credit: Jörn Engel

* Q: How many colours are in an 8-wire Cat5 cable?
* A: Four. The other four wires are colour/white striped.
* Credit: Jörn Engel

* Q: What are the four colours in Cat5 cables?
* A: Orange, Blue, Brown, Green
* Credit: Jörn Engel

* Q: How does a Windows machine react if ports above 1080 are blocked by a router?
* A: A regular user can surf for roughly 2min after booting, then has to reboot.
* Credit: Jörn Engel

* Q: What happens if a network card's transmit interrupts are blocked in Linux?
* A: Usually nothing. But on packet loss, retransmits can get delayed indefinitely.
* Credit: Jörn Engel

* Q: Who wrote the original Ping program?
* A: Mike Muuss
* Credit: Kris Katterjohn

* Q: What's the base UDP port number used in the original Traceroute program?
* A: 33434 (32768 + 666)
* Credit: Kris Katterjohn

* Q: What's the minimum value allowed for the Header Length field in a valid IPv4 header?
* A: 5 (5 * 4 = 20)
* Credit: Kris Katterjohn

* Q: Which RFC introduced the TCP Timestamps option?
* A: RFC 1323
* Credit: Kris Katterjohn

* Q: While computing the checksum of a TCP packet, what happens to the checksum field?
* A: It's replaced (or filled) with zeros.
* Credit: Kris Katterjohn

* Q: What is the Ethernet type for the Banyan Vines protocol?
* A: 0x0BAD
* Credit: Marcel Kaszap

* Q: Name at least two colors of packet-dropping mechanism that are NOT an acronym or an abbreviation.
* A: BLUE, GREEN, PURPLE, WHITE (RED and BLACK are acronyms).
* Credit: Jonathan Day

* Q: What was PC-IP?
* A: The first implementation of TCP/IP on an IBM PC.
* Credit: Bhyrava Prasad

* Q: What semi-famous hacker tool uses port 31337?
* A: Back Orifice, from Cult of the Dead Cow.
* Credit: Ken D'Ambrosio

* Q: What two ICMP types should never be blocked?
* A: ICMP type 3, Destination Unreachable, especially code 4, "fragmentation needed but don't fragment bit set" (necessary for path MTU discovery) and ICMP type 11, time exceeded (so you can use traceroute from inside the network and get replies).
* Credit: Arjan van de Ven

* Q: What is the typical MTU for an RFC 1149 transmission?
* A: From RFC 1149, Carrier Pigeon Internet Protocol: "The MTU is variable, and paradoxically, generally increases with increased carrier age. A typical MTU is 256 milligrams."
* Credit: Joe Nygard

* Q: What data link layer algorithm is described by an "algorhyme" in the original paper? Extra credit for reciting the first two lines.
* A: The spanning tree algorithm by Dr. Radia Perlman
* A: The first two lines are: I think that I shall never see / A graph more lovely than a tree.
* Credit: Christina Zeeh

* Q: What is the minimum length of an Ethernet packet, and why is there a minimum length?
* A: 64 bytes. It must be this long so that a collision can be detected.
* Credit: Ted T'so

* Q: What is the 'Stretch ACK violation' documented in RFC 2525?
* A: When using delayed ACKs, the receiver acknowledges less often than once for every two full-sized (MSS) segments it receives, which can degrade performance.
* Credit: Rob Braun

* Q: Name at least three official DNS resource record types.
* A: Any three of A, CNAME, HINFO, MX, NS, PTR, SOA, TXT, WKS, RT, NULL, AXFR, MAILB, MAILA, KX, KEY, SIG, NXT, PX, NSAP, NSAP-PTR, RP, AFSDB, GPOS, DNAME, AAAA, SRV, LOC, EID, NIMLOC, ATMA, NAPTR, CERT, SINK, OPT, APL, TKEY, TSIG, IXFR. Deprecated: MB, MD, MF. Experimental: MINFO, MR, MG, X25.
* Credit: Ulrich Durholz, Rob Braun

* Q: What is the maximum amount of data in a UDP packet over IPv6?
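* A: 65527 bytes (65535 - 8 UDP header). The 16-bit IPv6 Payload Length field does not count the 40-byte fixed header, so unlike IPv4 the IP header is not subtracted. Jumbograms (RFC 2675) raise the limit further.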
* Credit: Rob Braun

* Q: What is the minimum IPv6 datagram size that a host must be able to receive?
* A: 1280 bytes.
* Credit: Rob Braun

* Q: What is the IANA reserved Ethernet MAC address range for IP Multicast?
* A: 01:00:5e.
* Credit: Rob Braun, Michael Dupuis

* Q: Name one of the Ethernet patent (#4,063,220) holders.
* A: Robert Metcalfe, David Boggs, Charles Thacker, or Butler Lampson. (Metcalfe and Boggs are generally credited for the invention.)
* Credit: Rob Braun

* Q: What is the MAC address prefix for DECnet addresses?
* A: AA:00:04:00
* Credit: Rob Braun

* Q: Who wrote the original traceroute program?
* A: Van Jacobson
* Credit: Warren Postma

* Q: What feature of IP is central to most traceroute implementations?
* A: The TTL (Time To Live) field. Most traceroutes send packets with artificially small TTLs and use the ICMP Time Exceeded responses from intermediate hosts to trace the route to a host.
* Credit: Warren Postma

* Q: Why was traceroute originally implemented using UDP packets rather than ICMP echo requests?
* A: In 1988, many TCP/IP stacks didn't return ICMP Time Exceeded responses to ICMP packets, but would for UDP packets.
* Credit: Warren Postma

* Q: What is RED, Random Early Detection?
* A: A router queue management algorithm used for congestion avoidance. Once it detects "incipient congestion," the router randomly discards packets with a probability based on the average queue size.
* Credit: Ivar Alm

* Q: What application uses TCP port 666?
* A: Doom.
* Credit: Jim Wilson

* Q: What is "ships in the night" routing?
* A: When you run two or more routing protocols on the same router.
* Credit: Jim Wilson

* Q: What does CRC stand for?
* A: Cyclic Redundancy Check.
* Credit: Jim Wilson

* Q: What IP network is reserved for internal testing?
* A: Anything with a netid (first octet) of 127.
* Credit: Jim Wilson

* Q: What are class D networks used for?
* A: Multicasting.
* Credit: Jim Wilson

* Q: What is bootp an abbreviation for?
* A: Bootstrap protocol.
* Credit: Jim Wilson

* Q: What is a runt packet?
* A: A packet that is shorter than the minimum packet length as defined by the protocol it is using.
* Credit: Jim Wilson

* Q: As of RFC 1349, how many values can the TOS field in an IPv4 header have?
* A: 5 (4 bit wide field, only one may be set at a time, 0 is valid).
* Credit: Jim Wilson

* Q: What is the H.323 protocol used for?
* A: Video or teleconferencing ("Packet-based multimedia communications systems").
* Credit: Jim Wilson

* Q: What OSI model layer does IP most closely resemble?
* A: The network layer, layer 3.
* Credit: Jim Wilson

* Q: Why do IP packets have a TTL (Time To Live) field?
* A: To prevent a packet being retransmitted forever in the case of a routing loop.
* Credit: Jim Wilson

* Q: What experimental protocol might be able to fulfill RFC 1122's requirement of "SHOULD: able to leap tall buildings at a single bound?"
* A: CPIP, Carrier Pigeon Internet Protocol.
* Credit: Helge Haftig

* Q: What are the Dave Clark Five?
* A: RFCs 813 through 817.
* Credit: Telsa Gwynne

* Q: What was the first remotely operated non-computer appliance to be connected to the Internet?
* A: A toaster (controlled using SNMP).
* Credit: Richard Lightman

* Q: What is CPIP?
* A: Carrier Pigeon Internet Protocol (see RFC 1149).
* Credit: SL Baur

* Q: What common multicast group uses the address 224.0.1.1?
* A: NTP (Network Time Protocol).
* Credit: Nivedita Singhvi

* Q: What is the only field that is different between a regular ARP packet and a gratuitous ARP packet?
* A: The target IP.
* Credit: Nivedita Singhvi

* Q: What error is returned if a UDP datagram is received and has a checksum error?
* A: None. It is silently discarded.
* Credit: Nivedita Singhvi

* Q: What is the minimum IP datagram size that a host must be able to receive?
* A: 576 bytes.
* Credit: Nivedita Singhvi

* Q: When is the transmitted UDP checksum 0?
* A: When the sender did not compute it.
* Credit: Nivedita Singhvi

* Q: Which is the only field used twice in the UDP checksum calculation?
* A: UDP length.
* Credit: Nivedita Singhvi

* Q: Why is a pad byte of 0 occasionally appended for the UDP checksum calculation?
* A: Because the checksum algorithm requires an even number of bytes.
* Credit: Nivedita Singhvi

* Q: What are the 5 fields of a UDP pseudoheader?
* A: Source IP, destination IP, zero, protocol, UDP length.
* Credit: Nivedita Singhvi

* Q: Which parts of the packet does the UDP checksum cover?
* A: UDP pseudoheader, UDP header, UDP data.
* Credit: Nivedita Singhvi

* Q: Which parts of the packet does the IP checksum cover?
* A: The IP header.
* Credit: Nivedita Singhvi

* Q: What is the maximum amount of data in a UDP packet over IPv4?
* A: 65507 bytes (65535 - 20 IP header - 8 UDP header).
* Credit: Nivedita Singhvi

* Q: Who was the first individual member of the Internet Society?
* A: Jon Postel, narrowly beating Steve Wolff.
* Credit: Matthew Wilcox

* Q: Why hasn't RFC 1149 been ratified?
* Hint: RFC 1149 specifies an unusual encapsulation of IP.
* A: The Avian Transmission Protocol has only been implemented once so far : http://www.blug.linux.no/rfc1149/
* Credit: Matthew Wilcox

* Q: How many identical acks need to be received for fast retransmit to occur?
* A: 4 (3 duplicate + original).
* Credit: Nivedita Singhvi

* Q: Under what circumstances is the TCP checksum incorrect, on a well-formed, in-flight packet?
* A: When the packet is using the IP source routing option (the destination IP changes along the route, which is used to calculate the TCP checksum).
* Credit: Rusty Russell

* Q: How many bytes total are in a standard sized ICMP echo request packet?
* A: 84 bytes (56 data, 8 ICMP header, 20 IP header).
* Credit: Linda J. Laubenheimer

* Q: What does "IETF" stand for?
* A: Internet Engineering Task Force.
* Credit: Linda J. Laubenheimer

* Q: What does SLIP stand for?
* A: Serial Line Internet Protocol.
* Credit: Ben Sittler

* Q: What is the TCP retransmission ambiguity problem?
* A: An ACK arrives after a retransmit - was it sent in response to the initial transmit or the retransmit?
* Credit: Craig Latta

* Q: Name one way to solve the TCP retransmission ambiguity problem.
* A: Use the Eifel detection algorithm.
* A: Enable timestamps (which is what Eifel does).
* Credit: Craig Latta

* Q: When is an IGMP report timer cancelled?
* A: When the host receives an IGMP report for the same group (with a matching destination IP).
* A: When more than one host is a member of the same group on the same network.
* Credit: Craig Latta

* Q: How many bits are in an "A" type DNS resource record?
* A: 112, plus the owner name.
* Credit: Craig Latta

* Q: What is archived at www.kohala.com?
* A: Richard Stevens' website.
* Credit: Telsa Gwynne

* Q: What does the tcp_close_wait_interval configuration option really do in Solaris?
* A: Sets the duration of the TIME_WAIT state.
* Credit: Laurel Fan

* Q: What is the range of class B IP addresses?
* A: 128.0.0.0 through 191.255.255.255.
* Credit: Annie White

* Q: What is the significance in networking of the amateur radio callsign KA9Q?
* A: It's the callsign of Phil Karn, who worked on SLIP, congestion control and TCP over amateur radio.
* Credit: Alan Cox

* Q: Sally Floyd was heavily involved in the design of which TCP enhancement?
* A: ECN, see RFC 3168.
* Credit: Alan Cox

* Q: Who said: "The IETF already has more than enough RFCs that codify the obvious, make stupidity illegal, support truth, justice, and the IETF way, and generally demonstrate the author is a brilliant and valuable Contributor to The Standards Process"?
* A: Vernon Schryver, on the mailing list for the tcp-impl IETF working group.
* Credit: Alan Cox

* Q: What is the minimum MTU that allows any IP datagram to pass?
* A: 68 bytes.
* Credit: Alan Cox

* Q: What is a syncookie and who invented it?
* A: Syn cookies help avert synflood attacks by forcing all of the TCP state into the client, invented by Dan Bernstein.
* Credit: Alan Cox

* Q: Van Jacobson claimed that the TCP receive packet processing fast-path could be done in how many instructions?
* A: 30. (33 on Sparc, due to "compiler brain damage.")
* Credit: Alan Cox

* Q: What would happen if you implemented the TCP URG pointer according to the standard?
* Hint: RFC 1122 cleared up the confusion generated by RFC 793.
* A: You would lose the last byte of urgent data because the other host implements BSD-style urgent pointers, which point to the byte following the last byte of urgent data.
* Credit: Alan Cox

* Q: What is an LFN (spelled L-F-N, pronounced "elephant")?
* A: Long Fat Network, defined in RFC 1072.
* Credit: Alan Cox

* Q: Where was John Nagle working when he invented the "Nagle algorithm?"
* A: Ford Motor Company.
* Credit: Alan Cox

* Q: Who invented Tinygram Avoidance?
* A: John Nagle. (Tinygram Avoidance is also known as the "Nagle algorithm.")
* Credit: Alan Cox

* Q: What is the sub-group FHE within the IETF?
* A: Facial hairius extremis, spotted at IETF conferences and noted in RFC 2323, "IETF Identification and Security Guidelines."
* Credit: Alan Cox

* Q: Under what circumstances should you return error number 418: "I'm a teapot"?
* A: Any attempt to brew coffee with a teapot according to RFC 2324, "Hyper Text Coffee Pot Control Protocol."
* Credit: Alan Cox

* Q: Who found additional problems beyond those in RFC 1337, "TIME-WAIT Assassination Hazards in TCP" which have yet (as of Feb 2002) to be fixed?
* Hint: He died after publishing the initial draft.
* A: Ian Heavens
* Credit: Alan Cox

* Q: Private networks came from RFC 1597. Which later RFC claims this is a bad idea?
* A: RFC 1627, "Network 10 Considered Harmful."
* Credit: Alan Cox

* Q: What makes it very difficult for any network stack to claim "strict compliance" to RFC 1122?
* A: Its requirement of "SHOULD: able to leap tall buildings at a single bound."
* Credit: Telsa Gwynne

* Q: Who said "If you know what you are doing, three layers is enough; if you don't, even seventeen levels won't help"?
* A: Mike Padlipsky (or MAP).
* Credit: Larry McVoy

* Q: Which OSI networking model layers do TCP and IP correspond to?
* A: They don't. (Any answer with any kind of equivocation should be accepted.)
* Credit: Danny Quist

* Q: Who invented NAT (Network Address Translation)?
* A: Paul Francis (but he credits Van Jacobson for the concept).
* Credit: Dimitar (Mitko) Haralanov

* Q: How many hosts should be on a network with a 255.255.255.192 subnet mask?
* A: 62 (64 - (broadcast address and network address))
* Credit: Ryan Snyder

* Q: How many bytes are in an IPv4 header without options?
* A: 20.
* Credit: Val Henson

* Q: Name one of the men described as "The Father of the Internet."
* A: Any of: Vinton Cerf (TCP/IP co-designer), Robert Kahn (TCP/IP co-designer), Jon Postel (started IANA), Al Gore (made encouraging noises)
* Credit: Val Henson

* Q: What does TCP/IP stand for?
* A: Transmission Control Protocol/Internet Protocol.
* Credit: Val Henson

* Q: How many layers are in the OSI networking model?
* A: 7.
* Credit: Val Henson

* Q: Name a network address designated for private network use.
* A: 10.0.0.0, 192.168.0.0, or 172.16.0.0
* Credit: Val Henson

* Q: Name two TCP header options.
* A: Any 2 of maximum segment size (MSS), window scale factor, timestamp, noop, SACK, and end of options list.
* Credit: Val Henson

* Q: What is the MSL as defined in RFC 793?
* A: 2 minutes (but is usually implemented as 30 seconds).
* Credit: Val Henson

* Q: Name two ways to exit the TIME_WAIT state.
* A: 2MSL timeout, TIME_WAIT assassination (receive a RST), or receive a SYN with greater sequence number. Note: TIME_WAIT assassination is not permitted by RFC 1337.
* Credit: Val Henson

* Q: What is the TCP state that can only be reached through a simultaneous close?
* A: CLOSING
* Credit: Val Henson

* Q: What year was the first IETF meeting held?
* A: 1986.
* Credit: Val Henson

* Q: Name all 7 layers of the OSI network model.
* A: Physical, Data Link, Network, Transport, Session, Presentation, and Application.
* Credit: Val Henson

* Q: Why are many network services assigned odd ports?
* A: The precursor to TCP and UDP was NCP, which was simplex and required 2 ports for one connection. When duplex protocols arrived, the even port numbers were abandoned.
* Credit: Val Henson

* Q: Which two of these three protocols are the most similar: IPv4, IPv6, or CLNP?
* A: IPv4 and CLNP. (CLNP stands for ConnectionLess Network Protocol, and is basically IPv4 with larger addresses.)
* Credit: Val Henson

* Q: In TCP, simultaneous open is a transition between which two TCP states?
* A: From SYN_SENT to SYN_RCVD.
* Credit: Val Henson

* Q: What is the ICMP type field for an Echo Request?
* A: 8.
* Credit: Val Henson

* Q: What is the ICMP type field for an Echo Reply?
* A: 0.
* Credit: Val Henson

* Q: What is the maximum number of IP addresses recordable by the IP Record Route option?
* A: 9.
* Credit: Val Henson

* Q: If one end of a TCP connection crashes, and the other end doesn't attempt to send any data, is the resultant TCP connection half-open or half-closed from the point of view of the host that's still up?
* A: Half-open.
* Credit: Val Henson

* Q: In a Christmas tree packet, which TCP flag bits are turned on?
* A: SYN, URG, PSH, and FIN (all of them).
* Credit: Val Henson

* Q: TCP was defined in which RFC?
* A: RFC 793, "Transmission Control Protocol."
* Credit: Val Henson

* Q: What is silly window syndrome?
* A: In TCP, when the receiving end continually advertises a tiny window, resulting in data being sent in very small packets.
* Credit: Val Henson

* Q: If the remote host of a TCP connection does not advertise an MSS, what is the assumed MSS?
* A: 536 bytes over IPv4, 1220 over IPv6, although most implementations default to 512 and 1024 respectively.
* Credit: Val Henson

* Q: How is the initial path MTU of a TCP connection calculated?
* A: min (outgoing interface MTU, remote advertised MSS)
* Credit: Val Henson

* Q: Why was the "AAAA" DNS resource record type created?
* A: For IPv6 addresses.
* Credit: Val Henson

* Q: What is the maximum amount of data allowed in an IPv4 packet?
* A: 65515 bytes (65535 - 20, max total length minus 20 bytes of header)
* Credit: Val Henson

* Q: What was the default send and receive buffer size in 4.3BSD?
* A: 2048 bytes.
* Credit: Val Henson

* Q: In TCP, when is the sender limited by the congestion window?
* A: When using the slow start algorithm, after packet loss has occurred.
* Credit: Val Henson

* Q: In most implementations of TCP, what byte does the urgent pointer point to?
* A: The byte following the last byte of urgent data.
* Credit: Val Henson

* Q: What service runs on port 6667?
* A: Internet Relay Chat (IRC)
* Credit: Val Henson

* Q: Name three methods or algorithms related to congestion control in TCP.
* A: Congestion window, Vegas, Reno, NewReno, SACK, DSACK, FACK, Eifel algorithm, ECN, RTO
* Credit: Val Henson

* Q: The ECN field uses which bits in the byte that contains the IPv4 TOS field?
* A: Bits 6 and 7.
* Credit: Val Henson

* Q: What's wrong with the standard way of estimating RTO (Retransmission TimeOut)?
* A: It places too much weight on the variance of round trip times.
* Credit: Val Henson

* Q: What is the minimum data payload in a ping of death?
* A: 65508 bytes (65536 - 20 IP header - 8 ICMP header)
* Credit: Val Henson

* Q: What's the MTU of HiPPI (High Performance Parallel Interface)?
* A: 65280 bytes
* Credit: Val Henson

* Q: How large is an entire AAL/5 encapsulated ATM cell?
* A: 53 bytes (48 data + 5 header)
* Credit: Val Henson

* Q: Distance vector and link state are two types of what kind of protocol?
* A: Routing protocols
* Credit: Val Henson

* Q: The IPv4 fields formerly known as TOS (Type Of Service) and precedence are now called what?
* A: The DS (Differentiated Services) and ECN (Explicit Congestion Notification) fields.
* Credit: Val Henson

* Q: Why do most traceroute implementations NOT use the IP Record Route option to find intermediate hosts?
* A: The IP Record Route option can only record 9 intermediate hosts, which is too few for many routes in the modern Internet.
* Credit: Val Henson

* Q: Which spelling is correct, Van Jacobson or Van Jacobsen?
* A: Van Jacobson.
* Credit: Val Henson

* Q: Who invented the spanning tree algorithm?
* A: Dr. Radia Perlman
* Credit: Val Henson

Tuesday, April 22, 2008

PCAP - API for Packet Capturing

Getting Started: The format of a pcap application

The first thing to understand is the general layout of a pcap sniffer. The flow of code is as follows:

1. We begin by determining which interface we want to sniff on. In Linux this may be something like eth0, in BSD it may be xl1, etc. We can either define this device in a string, or we can ask pcap to provide us with the name of an interface that will do the job.
2. Initialize pcap. This is where we actually tell pcap what device we are sniffing on. We can, if we want to, sniff on multiple devices. How do we differentiate between them? Using file handles. Just like opening a file for reading or writing, we must name our sniffing "session" so we can tell it apart from other such sessions.
3. In the event that we only want to sniff specific traffic (e.g., only TCP/IP packets, only packets going to port 23, etc.), we must create a rule set, "compile" it, and apply it. This is a three-phase process, and the phases are closely related. The rule set is kept in a string and is converted into a format that pcap can read (hence compiling it). The compilation is done by calling a function within our program; it does not involve the use of an external application. Then we tell pcap to apply it to whichever session we wish to filter.
4. Finally, we tell pcap to enter its primary execution loop. In this state, pcap waits until it has received however many packets we want it to. Every time it gets a new packet in, it calls another function that we have already defined. The function that it calls can do anything we want; it can dissect the packet and print it to the user, it can save it in a file, or it can do nothing at all.
5. After our sniffing needs are satisfied, we close our session and are complete.

This is actually a very simple process. Five steps total, one of which is optional (step 3, in case you were wondering). Let's take a look at each of the steps and how to implement them.

Setting the device

This is terribly simple. There are two techniques for setting the device that we wish to sniff on.

The first is that we can simply have the user tell us. Consider the following program:

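#include <stdio.h>
#include <pcap.h>

int main(int argc, char *argv[])
{
    char *dev;

    /* The interface name must be given as the first argument */
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <device>\n", argv[0]);
        return(2);
    }
    dev = argv[1];

    printf("Device: %s\n", dev);
    return(0);
}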

The user specifies the device by passing the name of it as the first argument to the program. Now the string "dev" holds the name of the interface that we will sniff on in a format that pcap can understand (assuming, of course, the user gave us a real interface).

The other technique is equally simple. Look at this program:

#include <stdio.h>
#include <pcap.h>

int main(int argc, char *argv[])
{
    char *dev, errbuf[PCAP_ERRBUF_SIZE];

    /* Ask pcap to pick a default capture device */
    dev = pcap_lookupdev(errbuf);
    if (dev == NULL) {
        fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
        return(2);
    }
    printf("Device: %s\n", dev);
    return(0);
}

In this case, pcap just sets the device on its own. "But wait, Tim," you say. "What is the deal with the errbuf string?" Most of the pcap commands allow us to pass them a string as an argument. The purpose of this string? In the event that the command fails, it will populate the string with a description of the error. In this case, if pcap_lookupdev() fails, it will store an error message in errbuf. Nifty, isn't it? And that's how we set our device. (Note that libpcap 1.9 and later deprecate pcap_lookupdev() in favor of pcap_findalldevs().)

Opening the device for sniffing

The task of creating a sniffing session is really quite simple. For this, we use pcap_open_live(). The prototype of this function (from the pcap man page) is as follows:

pcap_t *pcap_open_live(char *device, int snaplen, int promisc, int to_ms,
                       char *ebuf)

The first argument is the device that we specified in the previous section. snaplen is an integer which defines the maximum number of bytes to be captured by pcap. promisc, when set to true, brings the interface into promiscuous mode (however, even if it is set to false, it is possible under specific cases for the interface to be in promiscuous mode, anyway). to_ms is the read time out in milliseconds (a value of 0 means no time out; on at least some platforms, this means that you may wait until a sufficient number of packets arrive before seeing any packets, so you should use a non-zero timeout). Lastly, ebuf is a string we can store any error messages within (as we did above with errbuf). The function returns our session handler.

To demonstrate, consider this code snippet:

#include <pcap.h>
...
pcap_t *handle;

handle = pcap_open_live(somedev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
    fprintf(stderr, "Couldn't open device %s: %s\n", somedev, errbuf);
    return(2);
}

This code fragment opens the device stored in the string "somedev" and tells it to capture at most BUFSIZ bytes per packet (BUFSIZ is defined in stdio.h, which pcap.h includes). We are telling it to put the device into promiscuous mode and to use a read timeout of 1000 milliseconds. If the call fails, the error description is stored in the string errbuf, which the code then uses to print an error message.

A note about promiscuous vs. non-promiscuous sniffing: The two techniques are very different in style. In standard, non-promiscuous sniffing, a host is sniffing only traffic that is directly related to it. Only traffic to, from, or routed through the host will be picked up by the sniffer. Promiscuous mode, on the other hand, sniffs all traffic on the wire. In a non-switched environment, this could be all network traffic. The obvious advantage to this is that it provides more packets for sniffing, which may or may not be helpful depending on the reason you are sniffing the network. However, there are drawbacks. First, promiscuous mode sniffing is detectable; a host can test with strong reliability to determine if another host is doing promiscuous sniffing. Second, it only works in a non-switched environment (such as a hub, or a switch that is being ARP flooded). Third, on high traffic networks, the host can become quite taxed for system resources.

Filtering traffic

Often our sniffer may only be interested in specific traffic. For instance, there may be times when all we want is to sniff on port 23 (telnet) in search of passwords. Or perhaps we want to hijack a file being sent over port 21 (FTP). Maybe we only want DNS traffic (port 53 UDP). Whatever the case, rarely do we just want to blindly sniff all network traffic. Enter pcap_compile() and pcap_setfilter().

The process is quite simple. After we have already called pcap_open_live() and have a working sniffing session, we can apply our filter. Why not just use our own if/else if statements? Two reasons. First, pcap's filter is far more efficient, because it does it directly with the BPF filter; we eliminate numerous steps by having the BPF driver do it directly. Second, this is a lot easier :)

Before applying our filter, we must "compile" it. The filter expression is kept in a regular string (char array). The syntax is documented quite well in the man page for tcpdump; I leave you to read it on your own. However, we will use simple test expressions, so perhaps you are sharp enough to figure it out from my examples.

To compile the filter program, we call pcap_compile(). The prototype defines it as:

int pcap_compile(pcap_t *p, struct bpf_program *fp, char *str, int optimize,
                 bpf_u_int32 netmask)

The first argument is our session handle (pcap_t *handle in our previous example). Following that is a reference to the place we will store the compiled version of our filter. Then comes the expression itself, in regular string format. Next is an integer that decides if the expression should be "optimized" or not (0 is false, 1 is true. Standard stuff.) Finally, we must specify the net mask of the network the filter applies to. The function returns -1 on failure; all other values imply success.

After the expression has been compiled, it is time to apply it. Enter pcap_setfilter(). Following our format of explaining pcap, we shall look at the pcap_setfilter() prototype:

int pcap_setfilter(pcap_t *p, struct bpf_program *fp)

This is very straightforward. The first argument is our session handler, the second is a reference to the compiled version of the expression (presumably the same variable as the second argument to pcap_compile()).

Perhaps another code sample would help to better understand:

#include <pcap.h>
...
pcap_t *handle;                 /* Session handle */
char dev[] = "rl0";             /* Device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE];  /* Error string */
struct bpf_program fp;          /* The compiled filter expression */
char filter_exp[] = "port 23";  /* The filter expression */
bpf_u_int32 mask;               /* The netmask of our sniffing device */
bpf_u_int32 net;                /* The IP of our sniffing device */

if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
    fprintf(stderr, "Can't get netmask for device %s\n", dev);
    net = 0;
    mask = 0;
}
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
    fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
    return(2);
}
/* The last argument is the netmask, per the prototype above */
if (pcap_compile(handle, &fp, filter_exp, 0, mask) == -1) {
    fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
    return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
    fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
    return(2);
}

This program preps the sniffer to sniff all traffic coming from or going to port 23, in promiscuous mode, on the device rl0.

You may notice that the previous example contains a function that we have not yet discussed. pcap_lookupnet() is a function that, given the name of a device, returns its IPv4 network number and net mask. This was essential because we needed to know the net mask in order to apply the filter. This function is described in the Miscellaneous section at the end of the document.

It has been my experience that this filter does not work across all operating systems. In my test environment, I found that OpenBSD 2.9 with a default kernel does support this type of filter, but FreeBSD 4.3 with a default kernel does not. Your mileage may vary.

The actual sniffing

At this point we have learned how to define a device, prepare it for sniffing, and apply filters about what we should and should not sniff for. Now it is time to actually capture some packets.

There are two main techniques for capturing packets. We can either capture a single packet at a time, or we can enter a loop that waits for n packets to be sniffed before being done. We will begin by looking at how to capture a single packet with pcap_next(), then look at methods of using loops.

The prototype for pcap_next() is fairly simple:

u_char *pcap_next(pcap_t *p, struct pcap_pkthdr *h)

The first argument is our session handler. The second argument is a pointer to a structure that holds general information about the packet, specifically the time at which it was sniffed, the length of this packet, and the length of this specific portion (in case it is fragmented, for example). pcap_next() returns a u_char pointer to the packet that is described by this structure. We'll discuss the technique for actually reading the packet itself later.

Here is a simple demonstration of using pcap_next() to sniff a packet.

#include <pcap.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
pcap_t *handle; /* Session handle */
char *dev; /* The device to sniff on */
char errbuf[PCAP_ERRBUF_SIZE]; /* Error string */
struct bpf_program fp; /* The compiled filter */
char filter_exp[] = "port 23"; /* The filter expression */
bpf_u_int32 mask; /* Our netmask */
bpf_u_int32 net; /* Our IP */
struct pcap_pkthdr header; /* The header that pcap gives us */
const u_char *packet; /* The actual packet */

/* Define the device */
dev = pcap_lookupdev(errbuf);
if (dev == NULL) {
fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
return(2);
}
/* Find the properties for the device */
if (pcap_lookupnet(dev, &net, &mask, errbuf) == -1) {
fprintf(stderr, "Couldn't get netmask for device %s: %s\n", dev, errbuf);
net = 0;
mask = 0;
}
/* Open the session in promiscuous mode */
handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
return(2);
}
/* Compile and apply the filter */
if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {
fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
if (pcap_setfilter(handle, &fp) == -1) {
fprintf(stderr, "Couldn't install filter %s: %s\n", filter_exp, pcap_geterr(handle));
return(2);
}
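/* Grab a packet */
packet = pcap_next(handle, &header);
if (packet == NULL) {
/* No packet arrived before the read timeout expired, or an error occurred */
fprintf(stderr, "Didn't grab a packet\n");
pcap_close(handle);
return(2);
}
/* Print its length */
printf("Jacked a packet with length of [%u]\n", header.len);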
/* And close the session */
pcap_close(handle);
return(0);
}

This application sniffs on whatever device is returned by pcap_lookupdev() by putting it into promiscuous mode. It finds the first packet to come across port 23 (telnet) and tells the user the size of the packet (in bytes). Again, this program includes a new call, pcap_close(), which we will discuss later (although it really is quite self explanatory).

The other technique we can use is more complicated, and probably more useful. Few sniffers (if any) actually use pcap_next(). More often than not, they use pcap_loop() or pcap_dispatch() (which then themselves use pcap_loop()). To understand the use of these two functions, you must understand the idea of a callback function.

Callback functions are not anything new, and are very common in many APIs. The concept behind a callback function is fairly simple. Suppose I have a program that is waiting for an event of some sort. For the purpose of this example, let's pretend that my program wants a user to press a key on the keyboard. Every time they press a key, I want to call a function which will then determine what to do. The function I am utilizing is a callback function. Every time the user presses a key, my program will call the callback function. Callbacks are used in pcap, but instead of being called when a user presses a key, they are called when pcap sniffs a packet. The two functions that one can use to define a callback are pcap_loop() and pcap_dispatch(). pcap_loop() and pcap_dispatch() are very similar in their usage of callbacks. Both of them call a callback function every time a packet is sniffed that meets our filter requirements (if any filter exists, of course; if not, then all packets that are sniffed are sent to the callback).

The prototype for pcap_loop() is below:

int pcap_loop(pcap_t *p, int cnt, pcap_handler callback, u_char *user)

The first argument is our session handle. Following that is an integer that tells pcap_loop() how many packets it should sniff for before returning (a negative value means it should sniff until an error occurs). The third argument is the name of the callback function (just its identifier, no parentheses). The last argument is useful in some applications, but many times is simply set as NULL. Suppose we have arguments of our own that we wish to send to our callback function, in addition to the arguments that pcap_loop() sends. This is where we do it. Obviously, you must typecast to a u_char pointer to ensure the results make it there correctly; as we will see later, pcap makes use of some very interesting means of passing information in the form of a u_char pointer. After we show an example of how pcap does it, it should be obvious how to do it here. If not, consult your local C reference text, as an explanation of pointers is beyond the scope of this document. pcap_dispatch() is almost identical in usage. The only difference between pcap_dispatch() and pcap_loop() is that pcap_dispatch() will only process the first batch of packets that it receives from the system, while pcap_loop() will continue processing packets or batches of packets until the count of packets runs out. For a more in depth discussion of their differences, see the pcap man page.
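To make the callback and "user" argument idea concrete without needing a live capture, here is a minimal sketch in plain C. Note that mini_loop(), mini_handler, struct stats and count_packet() are all hypothetical stand-ins that I made up to mimic the shape of pcap_loop(); they are not part of pcap. The point is only to show how our own state travels through a u_char-style pointer and gets cast back inside the callback.

```c
#include <stddef.h>

/* Hypothetical stand-in for pcap_handler: user pointer first, then the data. */
typedef void (*mini_handler)(unsigned char *user, const unsigned char *data,
                             size_t len);

/* Hypothetical stand-in for pcap_loop(): calls cb once per record, up to cnt
 * records (a negative cnt means "all of them"). Returns the number delivered. */
int mini_loop(const unsigned char *const *records, const size_t *lens,
              size_t nrecords, int cnt, mini_handler cb, unsigned char *user)
{
    int delivered = 0;

    for (size_t i = 0; i < nrecords; i++) {
        if (cnt >= 0 && delivered >= cnt)
            break;
        cb(user, records[i], lens[i]);
        delivered++;
    }
    return delivered;
}

/* Our own state, passed through the u_char-style user pointer. */
struct stats {
    unsigned long packets;
    unsigned long bytes;
};

/* The callback: cast the user pointer back to the type we know it to be. */
void count_packet(unsigned char *user, const unsigned char *data, size_t len)
{
    struct stats *st = (struct stats *)user;

    (void)data;          /* contents are not inspected in this sketch */
    st->packets++;
    st->bytes += len;
}
```

A caller would declare a struct stats, pass (unsigned char *)&st as the last argument, and read the totals once the loop returns. With the real pcap_loop() the cast is to u_char *, but the idea is the same.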

Before we can provide an example of using pcap_loop(), we must examine the format of our callback function. We cannot arbitrarily define our callback's prototype; otherwise, pcap_loop() would not know how to use the function. So we use this format as the prototype for our callback function:

void got_packet(u_char *args, const struct pcap_pkthdr *header,
const u_char *packet);

Let's examine this in more detail. First, you'll notice that the function has a void return type. This is logical, because pcap_loop() wouldn't know how to handle a return value anyway. The first argument corresponds to the last argument of pcap_loop(). Whatever value is passed as the last argument to pcap_loop() is passed to the first argument of our callback function every time the function is called. The second argument is the pcap header, which contains information about when the packet was sniffed, how large it is, etc. The pcap_pkthdr structure is defined in pcap.h as:

struct pcap_pkthdr {
struct timeval ts; /* time stamp */
bpf_u_int32 caplen; /* length of portion present */
bpf_u_int32 len; /* length this packet (off wire) */
};

These values should be fairly self explanatory. The last argument is the most interesting of them all, and the most confusing to the average novice pcap programmer. It is another pointer to a u_char, and it points to the first byte of a chunk of data containing the entire packet, as sniffed by pcap_loop().

But how do you make use of this variable (named "packet" in our prototype)? A packet contains many attributes, so as you can imagine, it is not really a string, but actually a collection of structures (for instance, a TCP/IP packet would have an Ethernet header, an IP header, a TCP header, and lastly, the packet's payload). This u_char pointer points to the serialized version of these structures. To make any use of it, we must do some interesting typecasting.

First, we must have the actual structures defined before we can typecast to them. The following are the structure definitions that I use to describe a TCP/IP packet over Ethernet.

/* Ethernet addresses are 6 bytes */
#define ETHER_ADDR_LEN 6

/* Ethernet header */
struct sniff_ethernet {
u_char ether_dhost[ETHER_ADDR_LEN]; /* Destination host address */
u_char ether_shost[ETHER_ADDR_LEN]; /* Source host address */
u_short ether_type; /* IP? ARP? RARP? etc */
};

/* IP header */
struct sniff_ip {
u_char ip_vhl; /* version << 4 | header length >> 2 */
u_char ip_tos; /* type of service */
u_short ip_len; /* total length */
u_short ip_id; /* identification */
u_short ip_off; /* fragment offset field */
#define IP_RF 0x8000 /* reserved fragment flag */
#define IP_DF 0x4000 /* dont fragment flag */
#define IP_MF 0x2000 /* more fragments flag */
#define IP_OFFMASK 0x1fff /* mask for fragmenting bits */
u_char ip_ttl; /* time to live */
u_char ip_p; /* protocol */
u_short ip_sum; /* checksum */
struct in_addr ip_src,ip_dst; /* source and dest address */
};
#define IP_HL(ip) (((ip)->ip_vhl) & 0x0f)
#define IP_V(ip) (((ip)->ip_vhl) >> 4)

/* TCP header */
typedef u_int tcp_seq;

struct sniff_tcp {
u_short th_sport; /* source port */
u_short th_dport; /* destination port */
tcp_seq th_seq; /* sequence number */
tcp_seq th_ack; /* acknowledgement number */

u_char th_offx2; /* data offset, rsvd */
#define TH_OFF(th) (((th)->th_offx2 & 0xf0) >> 4)
u_char th_flags;
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_PUSH 0x08
#define TH_ACK 0x10
#define TH_URG 0x20
#define TH_ECE 0x40
#define TH_CWR 0x80
#define TH_FLAGS (TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG|TH_ECE|TH_CWR)
u_short th_win; /* window */
u_short th_sum; /* checksum */
u_short th_urp; /* urgent pointer */
};

Note: On my Slackware Linux 8 box (stock kernel 2.2.19) I found that code using the above structures would not compile. The problem, as it turns out, was in include/features.h, which implements a POSIX interface unless _BSD_SOURCE is defined. If it was not defined, then I had to use a different structure definition for the TCP header. The more universal solution, that does not prevent the code from working on FreeBSD or OpenBSD (where it had previously worked fine), is simply to do the following:
#define _BSD_SOURCE 1
prior to including any of your header files. This will ensure that a BSD style API is being used. Again, if you don't wish to do this, then you can simply use the alternative TCP header structure, which I've linked to here, along with some quick notes about using it.

So how does all of this relate to pcap and our mysterious u_char pointer? Well, those structures define the headers that appear in the data for the packet. So how can we break it apart? Be prepared to witness one of the most practical uses of pointers (for all of those new C programmers who insist that pointers are useless, I smite you).

Again, we're going to assume that we are dealing with a TCP/IP packet over Ethernet. This same technique applies to any packet; the only difference is the structure types that you actually use. So let's begin by defining the variables and compile-time definitions we will need to deconstruct the packet data.

/* ethernet headers are always exactly 14 bytes */
#define SIZE_ETHERNET 14

const struct sniff_ethernet *ethernet; /* The ethernet header */
const struct sniff_ip *ip; /* The IP header */
const struct sniff_tcp *tcp; /* The TCP header */
const char *payload; /* Packet payload */

u_int size_ip;
u_int size_tcp;

And now we do our magical typecasting:

ethernet = (struct sniff_ethernet*)(packet);
ip = (struct sniff_ip*)(packet + SIZE_ETHERNET);
size_ip = IP_HL(ip)*4;
if (size_ip < 20) {
printf(" * Invalid IP header length: %u bytes\n", size_ip);
return;
}
tcp = (struct sniff_tcp*)(packet + SIZE_ETHERNET + size_ip);
size_tcp = TH_OFF(tcp)*4;
if (size_tcp < 20) {
printf(" * Invalid TCP header length: %u bytes\n", size_tcp);
return;
}
payload = (const char *)(packet + SIZE_ETHERNET + size_ip + size_tcp);

How does this work? Consider the layout of the packet data in memory. The u_char pointer is really just a variable containing an address in memory. That's what a pointer is; it points to a location in memory.

For the sake of simplicity, we'll say that the address this pointer is set to is the value X. Well, if our three structures are just sitting in line, the first of them (sniff_ethernet) being located in memory at the address X, then we can easily find the address of the structure after it; that address is X plus the length of the Ethernet header, which is 14, or SIZE_ETHERNET.

Similarly if we have the address of that header, the address of the structure after it is the address of that header plus the length of that header. The IP header, unlike the Ethernet header, does not have a fixed length; its length is given, as a count of 4-byte words, by the header length field of the IP header. As it's a count of 4-byte words, it must be multiplied by 4 to give the size in bytes. The minimum length of that header is 20 bytes.

The TCP header also has a variable length; its length is given, as a number of 4-byte words, by the "data offset" field of the TCP header, and its minimum length is also 20 bytes.

So let's make a chart:

Variable Location (in bytes)
sniff_ethernet X
sniff_ip X + SIZE_ETHERNET
sniff_tcp X + SIZE_ETHERNET + {IP header length}
payload X + SIZE_ETHERNET + {IP header length} + {TCP header length}

The sniff_ethernet structure, being the first in line, is simply at location X. sniff_ip, which follows directly after sniff_ethernet, is at the location X plus however much space the Ethernet header consumes (14 bytes, or SIZE_ETHERNET). sniff_tcp is after both sniff_ip and sniff_ethernet, so it is located at X plus the sizes of the Ethernet and IP headers (14 bytes, and 4 times the IP header length, respectively). Lastly, the payload (which doesn't have a single structure corresponding to it, as its contents depend on the protocol being used atop TCP) is located after all of them.
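The chart can be checked against a hand-built frame. The following is a minimal sketch that computes the same three offsets by reading the two length nibbles straight out of the byte buffer rather than through struct casts. The names locate_layers() and struct layer_offsets are my own for illustration. It assumes the frame really is IPv4/TCP over Ethernet (it does not look at ether_type or the IP protocol field) and it bounds-checks against the captured length.

```c
#include <stddef.h>

#define SIZE_ETHERNET 14

/* Byte offsets of each layer inside a captured Ethernet/IPv4/TCP frame. */
struct layer_offsets {
    size_t ip;       /* X + SIZE_ETHERNET */
    size_t tcp;      /* X + SIZE_ETHERNET + {IP header length} */
    size_t payload;  /* X + SIZE_ETHERNET + {IP header length} + {TCP header length} */
};

/* Returns 0 on success, -1 if the frame is too short or a length field is
 * invalid. caplen is the number of bytes actually captured. */
int locate_layers(const unsigned char *packet, size_t caplen,
                  struct layer_offsets *out)
{
    size_t size_ip, size_tcp;

    if (caplen < SIZE_ETHERNET + 20)
        return -1;

    /* Low nibble of the first IP byte: header length in 4-byte words. */
    size_ip = (size_t)(packet[SIZE_ETHERNET] & 0x0f) * 4;
    if (size_ip < 20 || caplen < SIZE_ETHERNET + size_ip + 20)
        return -1;

    /* Byte 12 of the TCP header: data offset in the high nibble. */
    size_tcp = (size_t)((packet[SIZE_ETHERNET + size_ip + 12] & 0xf0) >> 4) * 4;
    if (size_tcp < 20 || caplen < SIZE_ETHERNET + size_ip + size_tcp)
        return -1;

    out->ip = SIZE_ETHERNET;
    out->tcp = SIZE_ETHERNET + size_ip;
    out->payload = SIZE_ETHERNET + size_ip + size_tcp;
    return 0;
}
```

For a frame whose IP and TCP headers carry no options, both lengths are 20 bytes, so the payload starts at offset 14 + 20 + 20 = 54.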

So at this point, we know how to set our callback function, call it, and find out the attributes about the packet that has been sniffed. It's now the time you have been waiting for: writing a useful packet sniffer. Because of the length of the source code, it is not included in the body of this document.

Thursday, August 16, 2007

Rough Notes on Linux Networking Stack

Table of Contents

1. Existing Optimizations
2. Packet Copies
3. ICMP Ping/Pong : Function Calls
4. Transmit Interrupts and Flow Control
5. NIC driver callbacks and ifconfig
6. Protocol Structures in the Kernel
7. skb_clone() vs. skb_copy()
8. NICs and Descriptor Rings
9. How much networking work does the ksoftirqd do?
10. Packet Requeues in Qdiscs
11. Links
12. Specific TODOs
References

1. Existing Optimizations

A great deal of thought has gone into Linux networking
implementation and many optimizations have made their way to the
kernel over the years. Some prime examples include:
* NAPI - Receive interrupts are coalesced to reduce chances of a
livelock. Thus, not every received packet generates an
interrupt. Required modifications to device driver interface.
Has been in the stable kernels since 2.4.20.
* Zero-Copy TCP - Avoids the overhead of kernel-to-userspace and
userspace-to-kernel packet copying.
http://builder.com.com/5100-6372-1044112.html describes this in
some detail.

2. Packet Copies

When a packet is received, the device uses DMA to put it in main
memory (let's ignore non-DMA or non-NAPI code and drivers). An skb
is constructed by the poll() function of the device driver. After
this point, the same skb is used throughout the networking stack,
i.e., the packet is almost never copied within the kernel (it is
copied when delivered to user-space).

This design is borrowed from BSD and UNIX SVR4 - the idea is to
allocate memory for the packet only once. The packet has 4 primary
pointers - head, end, data, tail into the packet data (character
buffer). head points to the beginning of the packet - where the link
layer header starts. end points to the end of the packet. data
points to the location the current networking layer can start
reading from (i.e., it changes as the packet moves up from the link
layer, to IP, to TCP). Finally, tail is where the current protocol
layer can begin writing data to (see alloc_skb(), which sets head,
data, tail to the beginning of allocated memory block and end to
data + size).

Other implementations refer to head, end, data, tail as base, limit,
read, write respectively.
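A small userspace model helps me keep the four pointers straight. This
is NOT kernel code: struct toy_skb and the toy_*() helpers are made-up
names, loosely modeled on what alloc_skb(), skb_put() and skb_pull()
do to the pointers, and the sketch ignores everything else an skb
carries.

```c
#include <stdlib.h>

/* Userspace model of the four sk_buff pointers; not kernel code. */
struct toy_skb {
    unsigned char *head;  /* start of the allocated block */
    unsigned char *data;  /* where the current layer starts reading */
    unsigned char *tail;  /* where the current layer may start writing */
    unsigned char *end;   /* end of the allocated block */
};

/* Like alloc_skb(): head, data and tail start together, end = head + size. */
int toy_alloc(struct toy_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return 0;
}

/* Like skb_put(): append len bytes at the tail; returns the old tail. */
unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old = skb->tail;

    if ((size_t)(skb->end - skb->tail) < len)
        return NULL;              /* not enough room left in the block */
    skb->tail += len;
    return old;
}

/* Like skb_pull(): strip len bytes of header as the packet moves up a layer. */
unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    if ((size_t)(skb->tail - skb->data) < len)
        return NULL;              /* fewer than len bytes of data present */
    skb->data += len;
    return skb->data;
}

void toy_free(struct toy_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
}
```

Allocating 1536 bytes, "receiving" 98 of them with toy_put() and then
pulling the 14-byte Ethernet header leaves data 14 bytes past head,
tail 98 bytes past head, and end still at head + 1536.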

There are some instances where a packet needs to be duplicated. For
example, when running tcpdump the packet needs to be sent to the
userspace process as well as to the normal IP handler. Actually, in
this case too, a copy can be avoided since the contents of the
packet are not being modified. So instead of duplicating the packet
contents, skb_clone() is used to increase the reference count of a
packet. skb_copy() on the other hand actually duplicates the
contents of the packet and creates a completely new skb.

See also: http://oss.sgi.com/archives/netdev/2005-02/msg00125.html

A related question: When a packet is received, are the tail and end
pointers equal? Answer: NO. This is because memory for packets
received is allocated before the packet is received, and the address
and size of this memory is communicated to the NIC using receive
descriptors - so that when it is actually received the NIC can use
DMA to transfer the packet to main memory. The size allocated for a
received packet is a function of the MTU of the device. The size of
an Ethernet frame actually received could be anything less than the
MTU. Thus, tail of a received packet will point to the end of the
received data while end will point to the end of the memory
allocated for the packet.

3. ICMP Ping/Pong : Function Calls

Code path (functions called) when an ICMP ping is received (and
corresponding pong goes out), for linux 2.6.9: First the packet is
received by the NIC and its interrupt handler will ultimately cause
net_rx_action() to be called (NAPI, [1]). This will call the device
driver's poll function which will submit packets (skb's) to the
networking stack via netif_receive_skb(). The rest is outlined below:
1. ip_rcv() --> ip_rcv_finish()
2. dst_input() --> skb->dst->input = ip_local_deliver()
3. ip_local_deliver() --> ip_local_deliver_finish()
4. ipprot->handler = icmp_rcv()
5. icmp_pointers[ICMP_ECHO].handler == icmp_echo() -- At this point
I guess you could say that the "receive" path is complete, the
packet has reached the top. Now the outbound (down the stack)
journey begins.
6. icmp_reply() -- Might want to look into the checks this function
does
7. icmp_push_reply()
8. ip_push_pending_frames()
9. dst_output() --> skb->dst->output = ip_output()
10. ip_output() --> ip_finish_output() --> ip_finish_output2()
11. dst->neighbour->output ==

4. Transmit Interrupts and Flow Control

Transmit interrupts are generated after every packet transmission
and this is key to flow control. However, this does have significant
performance implications under heavy transmit-related I/O (imagine a
packet forwarder where the number of transmitted packets is equal to
the number of received ones). Each device provides a means to slow
down transmit (Tx) interrupts. For example, Intel's e1000 driver
exposes "TxIntDelay" that allows transmit interrupts to be delayed
in units of 1.024 microseconds. The default value is 64, thus even
under heavy transmission, interrupts are spaced 65.536
microseconds apart. Imagine the number of transmissions that can
take place in this time.
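To put a rough number on "imagine", here is a back-of-the-envelope
sketch. The helper names are made up, and the inputs are my
assumptions: a 1 Gbit/s link, and 84 bytes on the wire for a
minimum-size Ethernet frame (64 bytes of frame plus 8 bytes of
preamble/SFD and a 12-byte inter-frame gap).

```c
/* TxIntDelay is expressed in units of 1.024 microseconds = 1024 ns. */
#define TX_DELAY_UNIT_NS 1024UL

/* Delay in nanoseconds for a given TxIntDelay setting. */
unsigned long tx_delay_ns(unsigned long units)
{
    return units * TX_DELAY_UNIT_NS;
}

/* Whole frames that fit on the wire during window_ns.
 * wire_bytes should include preamble and inter-frame gap;
 * link_mbps is the link speed in megabits per second. */
unsigned long frames_in_window(unsigned long window_ns,
                               unsigned long wire_bytes,
                               unsigned long link_mbps)
{
    unsigned long frame_ns;

    if (link_mbps == 0)
        return 0;
    /* bits * 1000 / Mbps gives the per-frame time in nanoseconds */
    frame_ns = wire_bytes * 8UL * 1000UL / link_mbps;
    return frame_ns ? window_ns / frame_ns : 0;
}
```

Under those assumptions tx_delay_ns(64) is 65536 ns and a minimum-size
frame takes 672 ns, so about 97 such frames could go out between two
transmit interrupts. For full-size frames (1538 bytes on the wire)
the figure drops to 5.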

5. NIC driver callbacks and ifconfig

Interfaces are configured using the ifconfig command. Many of these
commands will result in a function of the NIC driver being called.
For example, ifconfig eth0 up should result in the device driver's
open() function being called (open is a member of struct
net_device). ifconfig communicates with the kernel through ioctl()
on any socket. The requests are a struct ifreq (see
/usr/include/net/if.h and
http://linux.about.com/library/cmd/blcmdl7_netdevice.htm). Thus,
ifconfig eth0 up will result in the following:
1. A socket (of any kind) is opened using socket()
2. A struct ifreq is prepared with ifr_name set to "eth0"
3. An ioctl() with request SIOCGIFFLAGS is done to get the current
flags and then the IFF_UP and IFF_RUNNING flags are set with
another ioctl() (with request SIOCSIFFLAGS).
4. Now we're inside the kernel. sock_ioctl() is called, which in
turn calls dev_ioctl() (see net/socket.c and net/core/dev.c)
5. dev_ioctl() --> ... --> dev_open() --> driver's open()
implementation.

6. Protocol Structures in the Kernel

There are various structs in the kernel which consist of function
pointers for protocol handling. Different structures correspond to
different layers of protocols as well as whether the functions are
for synchronous handling (e.g., when recv(), send() etc. system
calls are made) or asynchronous handling (e.g., when a packet
arrives at the interface and it needs to be handled). Here is what I
have gathered about the various structures so far:
* struct packet_type - includes instantiations such as
ip_packet_type, ipv6_packet_type etc. These provide low-level,
asynchronous packet handling. When a packet arrives at the
interface, the driver ultimately submits it to the networking
stack by a call to netif_receive_skb(), which iterates over the
list of registered packet handlers and submits the skb to them.
For example, ip_packet_type.func = ip_rcv, so ip_rcv() is where
one can say the IP protocol first receives a packet that has
arrived at the interface. Packet-types are registered with the
networking stack by a call to dev_add_pack().
* struct net_proto_family - includes instantiations such as
inet_family_ops, packet_family_ops etc. Each net_proto_family
structure handles one type of address family (PF_INET etc.).
This structure is associated with a BSD socket (struct socket)
and not the networking layer representation of sockets (struct
sock). It essentially provides a create() function which is
called in response to the socket() system call. The
implementation of create() for each family typically allocates
the struct sock and also associates other synchronous operations
(see struct proto_ops below) with the socket. To cut a long
story short - net_proto_family provides the protocol-specific
part of the socket() system call. (NOTE: Not all BSD sockets
will have a networking socket associated with them. For example,
for unix sockets (the PF_UNIX address family),
unix_family_ops.create = unix_create does not allocate a struct
sock.) The net_proto_family structure is registered with the
networking stack by a call to sock_register().
* struct proto_ops - includes instantiations such as
inet_stream_ops, inet_dgram_ops, packet_ops etc. These provide
implementations of networking layer synchronous calls
(connect(), bind(), recvmsg(), ioctl() etc. system calls). The
ops member of the BSD socket structure (struct socket) points to
the proto_ops associated with the socket. Unlike the above two
structures, there is no function that explicitly registers a
struct proto_ops with the networking stack. Instead, the
create() implementation of struct net_proto_family just sets the
ops field of the BSD socket to the appropriate proto_ops
structure.
* struct proto - includes instantiations such as tcp_prot,
udp_prot, raw_prot. These provide protocol handlers inside a
network family. It seems that currently this means only over-IP
protocols as I could find only the above three instantiations.
These also provide implementations for synchronous calls. The
sk_prot field of the networking socket (struct sock) points to
such a structure. The sk_prot field would get set by the create
function in struct net_proto_family and the functions provided
will be called by the implementations of functions in the struct
proto_ops structure. For example, inet_family_ops.create =
inet_create allocates a struct sock and would set sk_prot =
udp_prot in response to a socket(PF_INET, SOCK_DGRAM, 0); system
call. A recvfrom() system call made on the socket would then
invoke inet_dgram_ops.recvmsg = sock_common_recvmsg, which calls
sk_prot->recvmsg = udp_recvmsg. Like proto_ops, struct protos
aren't explicitly "registered" with the networking stack using a
function, but are "registered" by the BSD socket create()
implementation in the struct net_proto_family.
* struct net_protocol - includes instantiations such as
tcp_protocol, udp_protocol, icmp_protocol etc. These provide
asynchronous packet receive routines for IP protocols. Thus,
this structure is specific to the inet-family of protocols.
Handlers are registered using inet_add_protocol(). This
structure is used by the IP-layer routines to hand off to a
layer 4 protocol. Specifically, the IP handler (ip_rcv()) will
invoke ip_local_deliver_finish() for packets that are to be
delivered to the local host. ip_local_deliver_finish() uses a
hash table (inet_protos) to decide which function to pass the
packet to based on the protocol field in the IP header. The hash
table is populated by the call to inet_add_protocol().
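The register-then-dispatch pattern that most of these structures
share can be modeled in a few lines of userspace C. This is only a
sketch of the idea behind struct net_protocol and
inet_add_protocol(): every toy_* name is made up, the table is a
plain array indexed by protocol number rather than the kernel's
actual table, and the handlers just return their protocol number.

```c
#include <stddef.h>

#define MAX_PROTOS 256           /* the IP protocol field is 8 bits wide */

struct toy_pkt_in {              /* stand-in for an skb */
    unsigned char protocol;      /* protocol field from the IP header */
};

/* Hypothetical analogue of struct net_protocol: one receive handler. */
struct toy_net_protocol {
    int (*handler)(struct toy_pkt_in *pkt);
};

static const struct toy_net_protocol *toy_protos[MAX_PROTOS];

/* Analogue of inet_add_protocol(): fails if the slot is already taken. */
int toy_add_protocol(const struct toy_net_protocol *prot, unsigned char num)
{
    if (toy_protos[num] != NULL)
        return -1;
    toy_protos[num] = prot;
    return 0;
}

/* Analogue of the lookup done when delivering a packet locally. */
int toy_local_deliver(struct toy_pkt_in *pkt)
{
    const struct toy_net_protocol *prot = toy_protos[pkt->protocol];

    if (prot == NULL)
        return -1;               /* no handler registered for this protocol */
    return prot->handler(pkt);
}

/* Two toy layer-4 receive routines. */
static int toy_udp_rcv(struct toy_pkt_in *pkt) { (void)pkt; return 17; }
static int toy_tcp_rcv(struct toy_pkt_in *pkt) { (void)pkt; return 6; }

const struct toy_net_protocol toy_udp_protocol = { toy_udp_rcv };
const struct toy_net_protocol toy_tcp_protocol = { toy_tcp_rcv };
```

After registering toy_udp_protocol under 17 and toy_tcp_protocol
under 6, a packet with protocol 17 lands in toy_udp_rcv(), while a
packet with an unregistered protocol number is rejected.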

7. skb_clone() vs. skb_copy()

When a packet needs to be delivered to two separate handlers (for
example, the IP layer and tcpdump), then it is "cloned" by
incrementing the reference count of the packet instead of being
"copied". Now, though the two handlers are not expected to modify
the packet contents, they can change the data pointer. So, how do we
ensure that processing by one of the handlers doesn't mess up the
data pointer for the other?

A. Umm... skb_clone means that there are separate head, tail, data,
end etc. pointers. The difference between skb_copy() and skb_clone()
is precisely this - the former copies the packet completely, while
the latter uses the same packet data but separate pointers into the
packet.
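A userspace model of the difference, as I understand it. None of this
is kernel code and all names are made up: struct toy_buf plays the
role of the shared packet data with its reference count, and struct
toy_view plays the role of the per-handler skb with its own data/tail
positions.

```c
#include <stdlib.h>
#include <string.h>

/* Shared, reference-counted packet bytes. */
struct toy_buf {
    int refcnt;
    size_t size;
    unsigned char bytes[];   /* flexible array member (C99) */
};

/* Per-holder view: its own data/tail offsets into the shared bytes. */
struct toy_view {
    struct toy_buf *buf;
    size_t data;             /* offset of the first unread byte */
    size_t tail;             /* offset one past the last valid byte */
};

int toy_view_new(struct toy_view *v, const void *src, size_t len)
{
    v->buf = malloc(sizeof *v->buf + len);
    if (v->buf == NULL)
        return -1;
    v->buf->refcnt = 1;
    v->buf->size = len;
    memcpy(v->buf->bytes, src, len);
    v->data = 0;
    v->tail = len;
    return 0;
}

/* "Clone": new view, same bytes, reference count bumped. */
void toy_view_clone(struct toy_view *dst, const struct toy_view *src)
{
    *dst = *src;
    dst->buf->refcnt++;
}

/* "Copy": new view and a private duplicate of the bytes. */
int toy_view_copy(struct toy_view *dst, const struct toy_view *src)
{
    if (toy_view_new(dst, src->buf->bytes, src->buf->size) != 0)
        return -1;
    dst->data = src->data;
    dst->tail = src->tail;
    return 0;
}

/* Drop one reference; the bytes go away with the last holder. */
void toy_view_free(struct toy_view *v)
{
    if (--v->buf->refcnt == 0)
        free(v->buf);
    v->buf = NULL;
}
```

Advancing data in a clone leaves the original view's data untouched
even though both still point at the same bytes, which is exactly why
two handlers can walk the same packet independently.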

8. NICs and Descriptor Rings

NOTE: Using the Intel e1000, driver source version 5.6.10.1, as an
example. Each transmission/reception has a descriptor - a "handle"
used to access buffer data somewhat like a file descriptor is a
handle to access file data. The descriptor format would be NIC
dependent as the hardware understands and reads/writes to the
descriptor. The NIC maintains a circular ring of descriptors, i.e.,
the number of descriptors for TX and RX is fixed (TxDescriptors,
RxDescriptors module parameters for the e1000 kernel module) and the
descriptors are used like a circular queue.

Thus, there are three structures:
* Descriptor Ring (struct e1000_desc_ring) - The list of
descriptors. So, ring[0], ring[1] etc. are individual
descriptors. The ring is typically allocated just once and thus
the DMA mapping of the ring is "consistent". Each descriptor in
the ring will thus have a fixed DMA and memory address. In the
e1000, the device registers TDBAL, TDBAH, TDLEN stand for
"Transmit Descriptors Base Address Low", "High" and "Length" (in
bytes of all descriptors). Similarly, there are RDBAL, RDBAH,
RDLEN
* Descriptors (struct e1000_rx_desc and struct e1000_tx_desc) -
Essentially, this stores the DMA address of the buffer which
contains actual packet data, plus some other accounting
information such as the status (transmission successful?
receive complete? etc.), errors etc.
* Buffers - Now actual data cannot have a "consistent" DMA
mapping, meaning we cannot ensure that all skbuffs for a
particular device always have some specific memory addresses
(those that have been setup for DMA). Instead, "streaming" DMA
mappings need to be used. Each descriptor thus contains the DMA
address of a buffer that has been setup for streaming mapping.
The hardware uses that DMA address to pickup a packet to be sent
or to place a received packet. Once the kernel's stack picks up
the buffer, it can allocate new resources (a new buffer) and
tell the NIC to use that buffer next time by setting up a new
streaming mapping and putting the new DMA handle in the
descriptor.
The e1000 uses a struct e1000_buffer as a wrapper around the
actual buffer. The DMA mapping however is setup only for
skb->data, i.e., where raw packet data is to be placed.
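The circular-queue behaviour of the ring can be sketched in userspace
as below. This is a toy model under my own assumptions, not the e1000
layout: struct toy_desc and struct toy_ring are made up, the ring has
8 slots, the "DMA address" is just a number, and the done flag stands
in for the status bit the hardware would write back.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8   /* illustrative; real rings are far larger */

/* Toy descriptor: where the buffer lives and whether the "NIC" is done. */
struct toy_desc {
    uint64_t buffer_addr;   /* stands in for the DMA address of the buffer */
    uint16_t length;
    uint8_t  done;          /* status bit the hardware would set */
};

struct toy_ring {
    struct toy_desc desc[RING_SIZE];
    unsigned int next_to_use;     /* next slot the driver will fill */
    unsigned int next_to_clean;   /* next slot the driver will reclaim */
};

/* Free slots; one slot is kept empty so full and empty are distinguishable. */
unsigned int ring_unused(const struct toy_ring *r)
{
    return (r->next_to_clean + RING_SIZE - r->next_to_use - 1) % RING_SIZE;
}

/* Driver side: hand a buffer to the hardware. Returns -1 if the ring is full. */
int ring_post(struct toy_ring *r, uint64_t addr, uint16_t len)
{
    struct toy_desc *d;

    if (ring_unused(r) == 0)
        return -1;
    d = &r->desc[r->next_to_use];
    d->buffer_addr = addr;
    d->length = len;
    d->done = 0;
    r->next_to_use = (r->next_to_use + 1) % RING_SIZE;
    return 0;
}

/* Driver side: reclaim, in order, descriptors the hardware has finished. */
unsigned int ring_clean(struct toy_ring *r)
{
    unsigned int cleaned = 0;

    while (r->next_to_clean != r->next_to_use &&
           r->desc[r->next_to_clean].done) {
        r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
        cleaned++;
    }
    return cleaned;
}
```

The descriptors themselves never move; only the two indices chase
each other around the ring, and each reclaimed slot can be pointed at
a fresh buffer on the next ring_post().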

9. How much networking work does the ksoftirqd do?

Consider what the NET_RX_SOFTIRQ does:
1. Each softirq invocation (do_softirq()) processes up to
net.core.netdev_max_backlog x MAX_SOFTIRQ_RESTART packets, if
available. The default values lead to 300 x 10 = 3000 pkts.
2. Every interrupt calls do_softirq() when exiting (irq_exit()) -
including the timer interrupt and NMIs too?
3. Default transmit/receive ring sizes on the NIC are less than
3000 (the e1000 for example defaults to 256 and can have at most
4096 descriptors on its ring)

Thus, the number of times ksoftirqd will be switched in/out depends
on how much processing is done by do_softirq() invocations on
irq_exit(). If the softirq handling on interrupt is able to clean up
the NIC ring faster than a new packet comes in, then ksoftirqd won't
be doing anything. Specifically, if the inter-packet gap is greater
than the time it takes to pick up and process a single packet from
the NIC (and the number of descriptors on the NIC is less than
3000), then ksoftirqd will not be scheduled.

Without going into details, some quick experimental verification:
Machine A continuously generates UDP packets for Machine B which is
running a "sink" application, i.e., it just loops on a recvfrom().
When the size of the packet sent from A was 60 bytes (and
inter-packet gap averaged 1.5µs), then the ksoftirqd thread on B
observed a total of 375 context switches (374 involuntary and 1
voluntary). When the packet size was 1280 bytes (and now
inter-packet gap increased almost 7 times to 10µs) then the
ksoftirqd thread was NEVER scheduled (0 context switches). The
single voluntary context switch in the former case probably happened
after all packets were processed (i.e., the sender stopped sending
and the receiver processed all that it got).

10. Packet Requeues in Qdiscs

The queueing discipline (struct Qdisc) provides a requeue().
Typically, packets are dequeued from the qdisc and submitted to the
device driver (the hard_start_xmit function in struct net_device).
However, at times it is possible that the device driver is "busy",
so the dequeued packet must be "requeued". "Busy" here means that
the xmit_lock of the device was held. It seems that this lock is
acquired at two places: (1) qdisc_restart() and (2) dev_watchdog().
The former handles packet dequeueing from the qdisc, acquiring the
xmit_lock and then submitting the packet to the device driver
(hard_start_xmit()) or alternatively requeuing the packet if the
xmit_lock was already held by someone else. The latter is invoked
asynchronously, periodically - it's part of the watchdog timer
mechanism.

My understanding is that two threads cannot be in qdisc_restart()
for the same qdisc at the same time, however the xmit_lock may have
been acquired by the watchdog timer function causing a requeue.
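The dequeue / try-lock / requeue dance can be modeled as follows.
Again this is a userspace sketch of my understanding, not the kernel
code: all toy_* names are made up, packets are plain ints, and the
xmit_lock is a simple flag that the caller sets to simulate the
watchdog holding it.

```c
#include <stddef.h>

#define QLEN 16

struct toy_qdisc {
    int pkts[QLEN];       /* packet ids, stand-ins for skbs */
    size_t head, count;
};

struct toy_dev {
    int xmit_locked;      /* 1 while someone else holds the transmit lock */
    int last_sent;        /* id of the last packet given to the "driver" */
};

int toy_enqueue(struct toy_qdisc *q, int pkt)
{
    if (q->count == QLEN)
        return -1;
    q->pkts[(q->head + q->count) % QLEN] = pkt;
    q->count++;
    return 0;
}

int toy_dequeue(struct toy_qdisc *q, int *pkt)
{
    if (q->count == 0)
        return -1;
    *pkt = q->pkts[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    return 0;
}

/* Put a just-dequeued packet back at the front so ordering is preserved. */
void toy_requeue(struct toy_qdisc *q, int pkt)
{
    q->head = (q->head + QLEN - 1) % QLEN;
    q->pkts[q->head] = pkt;
    q->count++;
}

/* Returns 1 if a packet was sent, 0 if it was requeued, -1 if queue empty. */
int toy_qdisc_restart(struct toy_qdisc *q, struct toy_dev *dev)
{
    int pkt;

    if (toy_dequeue(q, &pkt) != 0)
        return -1;
    if (dev->xmit_locked) {       /* the try-lock failed: device is "busy" */
        toy_requeue(q, pkt);
        return 0;
    }
    dev->xmit_locked = 1;
    dev->last_sent = pkt;         /* stand-in for hard_start_xmit() */
    dev->xmit_locked = 0;
    return 1;
}
```

While the flag is held, toy_qdisc_restart() leaves the queue exactly
as it found it; once released, packets go out in their original
order.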

11. Links

This is just a dump of links that might be useful.
* http://www.spec.org and SpecWeb http://www.spec.org/web99/
* linux-net and netdev mailing lists:
http://www.ussg.iu.edu/hypermail/linux/net/ and
http://oss.sgi.com/projects/netdev/archive/
* Linux Traffic Control HOWTO

12. Specific TODOs

* Study watchdog timer mechanism and figure out how flow control
is implemented in the receive and transmit side

References

[3] Beyond Softnet. Jamal Hadi Salim, Robert Olsson, and Alexey
Kuznetsov. Nov 2001. USENIX. 5. .

[5] A Map of the Networking Code in Linux Kernel 2.4.20. Miguel Rio,
Mathieu Goutelle, Tom Kelly, Richard Hugh-Jones, Jean-Phillippe
Martin-Flatin, and Yee-Ting Li. Mar 2004.

[4] Understanding the Linux Kernel. Daniel P. Bovet and Marco
Cesati. O'Reilly & Associates. 2nd Edition. 81-7366-589-3.