Thursday, August 30, 2007

GCSE

Optimisers:

* Global constant and copy propagation.
* Global common subexpression elimination.


EGCS has two implementations of global common subexpression elimination. The first is based on the classic algorithm found in most compiler texts and is generally referred to as GCSE or classic GCSE.

The second implementation is commonly known as partial redundancy elimination (PRE). PRE is a more powerful scheme for eliminating redundant computations throughout a function. Our PRE optimizer is currently based on Fred Chow's thesis.

The difference between GCSE and PRE is best illustrated with an example flow graph.





This example has a computation of "exp1" in blocks B2, B3 and B6. Assume that the remaining blocks do not affect the evaluation of "exp1".

Classic GCSE will only eliminate the computation of "exp1" found in block B3 since the evaluation in block B6 can be reached via a path which does not have an earlier computation of "exp1" (entry, B1, B7, B8, B5, B6).

PRE will eliminate the evaluations of "exp1" in blocks B3 and B6; to make the evaluation in B6 redundant, PRE will insert an evaluation of "exp1" in block B8.

While PRE is a more powerful optimization, it tends to increase code size to improve runtime performance. Therefore, when optimizing for code size, the compiler will run the classic GCSE optimizer instead of the PRE optimizer.

Global constant/copy propagation and global common subexpression elimination are enabled by default with the -O2 optimization flag. They can also be enabled with the -fgcse flag.

Thursday, August 16, 2007

sk_buff Structure

The sk_buff structure ('skb') is actually only used for storing the
metadata corresponding to a packet. The packet's data is not stored
inside the sk_buff structure itself, but in a separate buffer that is
pointed to by skb->head. The skb->end member points one byte past the
end of this data buffer.

An important design requirement for sk_buffs is being able to add data
at the end as well as at the front of the packet. As a packet travels
downwards through the network stack, each layer will usually want to
add its own header in front of the packet, and it would be nice if we
could avoid reallocating and/or copying the entire data portion of the
packet around to make more space at the front of the buffer every time
we want to do this.

To achieve this goal, the packet data is not necessarily stored at the
front of the data buffer, but some space between the front of the buffer
and the front of the packet is left unused. skb->data and skb->tail
are two extra pointers that point to the beginning and one byte past
the end of the currently used portion of the data buffer, respectively.
Both are guaranteed to point somewhere within the data buffer.
('skb->head <= skb->data <= skb->tail <= skb->end')

+----------+------------------+----------------+
| headroom |   packet data    |    tailroom    |
+----------+------------------+----------------+
^          ^                  ^                ^
skb->head  skb->data          skb->tail        skb->end

The function skb_headroom(skb) calculates 'skb->data - skb->head', and
indicates how many bytes we can add to the front of the packet without
having to reallocate the buffer. Similarly, skb_tailroom(skb) calculates
'skb->end - skb->tail' and indicates how many bytes we can add to the
end of the packet before having to reallocate.

Adding data to and removing data from the front of the buffer is done
with skb_push and skb_pull, respectively. These wrappers do some sanity
checks to make sure the relevant constraints on the four pointers are
maintained.

When an sk_buff is allocated by alloc_skb, skb->{head,data,tail} are all
initialised to point to the start of the data buffer. Depending on what
the skb will be used for, the caller will usually want to reserve some
headroom in anticipation of expansion of the data buffer towards the
front. This is done by calling skb_reserve().

Rough Notes on Linux Networking Stack

Table of Contents

1. Existing Optimizations
2. Packet Copies
3. ICMP Ping/Pong : Function Calls
4. Transmit Interrupts and Flow Control
5. NIC driver callbacks and ifconfig
6. Protocol Structures in the Kernel
7. skb_clone() vs. skb_copy()
8. NICs and Descriptor Rings
9. How much networking work does the ksoftirqd do?
10. Packet Requeues in Qdiscs
11. Links
12. Specific TODOs
References

1. Existing Optimizations

A great deal of thought has gone into the Linux networking
implementation and many optimizations have made their way into the
kernel over the years. Some prime examples include:
* NAPI - Receive interrupts are coalesced to reduce the chances of
a livelock, so each received packet no longer generates an
interrupt. This required modifications to the device driver
interface. It has been in the stable kernels since 2.4.20.
* Zero-Copy TCP - Avoids the overhead of kernel-to-userspace and
userspace-to-kernel packet copying.
http://builder.com.com/5100-6372-1044112.html describes this in
some detail.

2. Packet Copies

When a packet is received, the device uses DMA to put it in main
memory (let's ignore non-DMA or non-NAPI code and drivers). An skb
is constructed by the poll() function of the device driver. After
this point, the same skb is used throughout the networking stack,
i.e., the packet is almost never copied within the kernel (it is
copied when delivered to user-space).

This design is borrowed from BSD and UNIX SVR4 - the idea is to
allocate memory for the packet only once. The packet has 4 primary
pointers into the packet data (character buffer) - head, end, data,
tail. head points to the beginning of the packet - where the link
layer header starts. end points to the end of the packet. data
points to the location the current networking layer can start
reading from (i.e., it changes as the packet moves up from the link
layer, to IP, to TCP). Finally, tail is where the current protocol
layer can begin writing data to (see alloc_skb(), which sets head,
data, and tail to the beginning of the allocated memory block, and
end to data + size).

Other implementations refer to head, end, data, tail as base, limit,
read, write respectively.

There are some instances where a packet needs to be duplicated. For
example, when running tcpdump the packet needs to be sent to the
userspace process as well as to the normal IP handler. Actually, in
this case too, a copy can be avoided since the contents of the
packet are not being modified. So instead of duplicating the packet
contents, skb_clone() is used to increase the reference count of a
packet. skb_copy() on the other hand actually duplicates the
contents of the packet and creates a completely new skb.

See also: http://oss.sgi.com/archives/netdev/2005-02/msg00125.html

A related question: When a packet is received, are the tail and end
pointers equal? Answer: NO. This is because memory for packets
received is allocated before the packet is received, and the address
and size of this memory is communicated to the NIC using receive
descriptors - so that when it is actually received the NIC can use
DMA to transfer the packet to main memory. The size allocated for a
received packet is a function of the MTU of the device. The size of
an Ethernet frame actually received could be anything less than the
MTU. Thus, tail of a received packet will point to the end of the
received data while end will point to the end of the memory
allocated for the packet.

3. ICMP Ping/Pong : Function Calls

Code path (functions called) when an ICMP ping is received (and
corresponding pong goes out), for Linux 2.6.9: First the packet is
received by the NIC, and its interrupt handler will ultimately cause
net_rx_action() to be called (NAPI, [1]). This will call the device
driver's poll function which will submit packets (skb's) to the
networking stack via netif_receive_skb. The rest is outlined below:
1. ip_rcv() --> ip_rcv_finish()
2. dst_input() --> skb->dst->input = ip_local_deliver()
3. ip_local_deliver() --> ip_local_deliver_finish()
4. ipprot->handler = icmp_rcv()
5. icmp_pointers[ICMP_ECHO].handler == icmp_echo() -- At this point
I guess you could say that the "receive" path is complete, the
packet has reached the top. Now the outbound (down the stack)
journey begins.
6. icmp_reply() -- Might want to look into the checks this function
does
7. icmp_push_reply()
8. ip_push_pending_frames()
9. dst_output() --> skb->dst->output = ip_output()
10. ip_output() --> ip_finish_output() --> ip_finish_output2()
11. dst->neighbour->output ==

4. Transmit Interrupts and Flow Control

Transmit interrupts are generated after every packet transmission
and this is key to flow control. However, this does have significant
performance implications under heavy transmit-related I/O (imagine a
packet forwarder where the number of transmitted packets is equal to
the number of received ones). Each device provides a means to slow
down transmit (Tx) interrupts. For example, Intel's e1000 driver
exposes "TxIntDelay", which allows transmit interrupts to be delayed
in units of 1.024 microseconds. The default value is 64, so under
heavy transmission, interrupts are spaced 65.536 microseconds
apart. Imagine the number of transmissions that can take place in
this time.

5. NIC driver callbacks and ifconfig

Interfaces are configured using the ifconfig command. Many of these
commands will result in a function of the NIC driver being called.
For example, ifconfig eth0 up should result in the device driver's
open() function being called (open is a member of struct
net_device). ifconfig communicates with the kernel through ioctl()
on any socket. The requests are passed as a struct ifreq (see
/usr/include/net/if.h and
http://linux.about.com/library/cmd/blcmdl7_netdevice.htm). Thus,
ifconfig eth0 up will result in the following:
1. A socket (of any kind) is opened using socket()
2. A struct ifreq is prepared with ifr_ifname set to "eth0"
3. An ioctl() with request SIOCGIFFLAGS is done to get the current
flags and then the IFF_UP and IFF_RUNNING flags are set with
another ioctl() (with request SIOCSIFFLAGS).
4. Now we're inside the kernel. sock_ioctl() is called, which in
turn calls dev_ioctl() (see net/socket.c and net/core/dev.c)
5. dev_ioctl() --> ... --> dev_open() --> driver's open()
implementation.

6. Protocol Structures in the Kernel

There are various structs in the kernel which consist of function
pointers for protocol handling. Different structures correspond to
different layers of protocols as well as whether the functions are
for synchronous handling (e.g., when recv(), send() etc. system
calls are made) or asynchronous handling (e.g., when a packet
arrives at the interface and it needs to be handled). Here is what I
have gathered about the various structures so far:
* struct packet_type - includes instantiations such as
ip_packet_type, ipv6_packet_type etc. These provide low-level,
asynchronous packet handling. When a packet arrives at the
interface, the driver ultimately submits it to the networking
stack by a call to netif_receive_skb(), which iterates over the
list of registered packet handlers and submits the skb to them.
For example, ip_packet_type.func = ip_rcv, so ip_rcv() is where
one can say the IP protocol first receives a packet that has
arrived at the interface. Packet types are registered with the
networking stack by a call to dev_add_pack().
* struct net_proto_family - includes instantiations such as
inet_family_ops, packet_family_ops etc. Each net_proto_family
structure handles one type of address family (PF_INET etc.).
This structure is associated with a BSD socket (struct socket)
and not the networking layer representation of sockets (struct
sock). It essentially provides a create() function which is
called in response to the socket() system call. The
implementation of create() for each family typically allocates
the struct sock and also associates other synchronous operations
(see struct proto_ops below) with the socket. To cut a long
story short - net_proto_family provides the protocol-specific
part of the socket() system call. (NOTE: Not all BSD sockets
have a networking socket associated with them. For example, for
unix sockets (the PF_UNIX address family),
unix_family_ops.create = unix_create does not allocate a struct
sock.) The net_proto_family structure is registered with the
networking stack by a call to sock_register().
* struct proto_ops - includes instantiations such as
inet_stream_ops, inet_dgram_ops, packet_ops etc. These provide
implementations of networking layer synchronous calls
(connect(), bind(), recvmsg(), ioctl() etc. system calls). The
ops member of the BSD socket structure (struct socket) points to
the proto_ops associated with the socket. Unlike the above two
structures, there is no function that explicitly registers a
struct proto_ops with the networking stack. Instead, the
create() implementation of struct net_proto_family just sets the
ops field of the BSD socket to the appropriate proto_ops
structure.
* struct proto - includes instantiations such as tcp_prot,
udp_prot, raw_prot. These provide protocol handlers inside a
network family. It seems that currently this means only over-IP
protocols as I could find only the above three instantiations.
These also provide implementations for synchronous calls. The
sk_prot field of the networking socket (struct sock) points to
such a structure. The sk_prot field would get set by the create
function in struct net_proto_family and the functions provided
will be called by the implementations of functions in the struct
proto_ops structure. For example, inet_family_ops.create =
inet_create allocates a struct sock and would set sk_prot =
udp_prot in response to a socket(PF_INET, SOCK_DGRAM, 0) system
call. A recvfrom() system call made on the socket would then
invoke inet_dgram_ops.recvmsg = sock_common_recvmsg, which calls
sk_prot->recvmsg = udp_recvmsg. Like proto_ops, struct protos
aren't explicitly "registered" with the networking stack using a
function, but are "registered" by the BSD socket create()
implementation in the struct net_proto_family.
* struct net_protocol - includes instantiations such as
tcp_protocol, udp_protocol, icmp_protocol etc. These provide
asynchronous packet receive routines for IP protocols. Thus,
this structure is specific to the inet-family of protocols.
Handlers are registered using inet_add_protocol(). This
structure is used by the IP-layer routines to hand off to a
layer 4 protocol. Specifically, the IP handler (ip_rcv()) will
invoke ip_local_deliver_finish() for packets that are to be
delivered to the local host. ip_local_deliver_finish() uses a
hash table (inet_protos) to decide which function to pass the
packet to based on the protocol field in the IP header. The hash
table is populated by the call to inet_add_protocol().

7. skb_clone() vs. skb_copy()

When a packet needs to be delivered to two separate handlers (for
example, the IP layer and tcpdump), then it is "cloned" by
incrementing the reference count of the packet instead of being
"copied". Now, though the two handlers are not expected to modify
the packet contents, they can change the data pointer. So, how do we
ensure that processing by one of the handlers doesn't mess up the
data pointer for the other?

A. Umm... skb_clone means that there are separate head, tail, data,
end etc. pointers. The difference between skb_copy() and skb_clone()
is precisely this - the former copies the packet completely, while
the latter uses the same packet data but separate pointers into the
packet.

8. NICs and Descriptor Rings

NOTE: Using the Intel e1000, driver source version 5.6.10.1, as an
example. Each transmission/reception has a descriptor - a "handle"
used to access buffer data somewhat like a file descriptor is a
handle to access file data. The descriptor format would be NIC
dependent as the hardware understands and reads/writes to the
descriptor. The NIC maintains a circular ring of descriptors, i.e.,
the number of descriptors for TX and RX is fixed (TxDescriptors,
RxDescriptors module parameters for the e1000 kernel module) and the
descriptors are used like a circular queue.

Thus, there are three structures:
* Descriptor Ring (struct e1000_desc_ring) - The list of
descriptors. So, ring[0], ring[1] etc. are individual
descriptors. The ring is typically allocated just once and thus
the DMA mapping of the ring is "consistent". Each descriptor in
the ring will thus have a fixed DMA and memory address. In the
e1000, the device registers TDBAL, TDBAH, TDLEN stand for
"Transmit Descriptors Base Address Low", "High" and "Length" (in
bytes of all descriptors). Similarly, there are RDBAL, RDBAH, and
RDLEN.
* Descriptors (struct e1000_rx_desc and struct e1000_tx_desc) -
Essentially, this stores the DMA address of the buffer which
contains actual packet data, plus some other accounting
information such as the status (transmission successful?
receive complete? etc.), errors etc.
* Buffers - Now actual data cannot have a "consistent" DMA
mapping, meaning we cannot ensure that all skbuffs for a
particular device always have some specific memory addresses
(those that have been setup for DMA). Instead, "streaming" DMA
mappings need to be used. Each descriptor thus contains the DMA
address of a buffer that has been setup for streaming mapping.
The hardware uses that DMA address to pick up a packet to be sent
or to place a received packet. Once the kernel's stack picks up
the buffer, it can allocate new resources (a new buffer) and
tell the NIC to use that buffer next time by setting up a new
streaming mapping and putting the new DMA handle in the
descriptor.
The e1000 uses a struct e1000_buffer as a wrapper around the
actual buffer. The DMA mapping however is setup only for
skb->data, i.e., where raw packet data is to be placed.

9. How much networking work does the ksoftirqd do?

Consider what the NET_RX_SOFTIRQ does:
1. Each softirq invocation (do_softirq()) processes up to
net.core.netdev_max_backlog x MAX_SOFTIRQ_RESTART packets, if
available. The default values lead to 300 x 10 = 3000 pkts.
2. Every interrupt calls do_softirq() when exiting (irq_exit()) -
including the timer interrupt and NMIs too?
3. Default transmit/receive ring sizes on the NIC are less than
3000 (the e1000 for example defaults to 256 and can have at most
4096 descriptors on its ring)

Thus, the number of times ksoftirqd will be switched in/out depends
on how much processing is done by do_softirq() invocations on
irq_exit(). If the softirq handling on interrupt is able to clean up
the NIC ring faster than new packets come in, then ksoftirqd won't
have anything to do. Specifically, ksoftirqd will not be scheduled
if the inter-packet gap is greater than the time it takes to pick up
and process a single packet from the NIC (and if the number of
descriptors on the NIC is less than 3000).

Without going into details, some quick experimental verification:
Machine A continuously generates UDP packets for Machine B, which is
running a "sink" application, i.e., it just loops on a recvfrom().
When the size of the packet sent from A was 60 bytes (and the
inter-packet gap averaged 1.5µs), the ksoftirqd thread on B
observed a total of 375 context switches (374 involuntary and 1
voluntary). When the packet size was 1280 bytes (and the
inter-packet gap increased almost 7 times, to 10µs), the
ksoftirqd thread was NEVER scheduled (0 context switches). The
single voluntary context switch in the former case probably happened
after all packets were processed (i.e., the sender stopped sending
and the receiver processed all that it got).

10. Packet Requeues in Qdiscs

The queueing discipline (struct Qdisc) provides a requeue().
Typically, packets are dequeued from the qdisc and submitted to the
device driver (the hard_start_xmit function in struct net_device).
However, at times it is possible that the device driver is "busy",
so the dequeued packet must be "requeued". "Busy" here means that
the xmit_lock of the device was held. It seems that this lock is
acquired at two places: (1) qdisc_restart() and (2) dev_watchdog().
The former handles packet dequeueing from the qdisc, acquiring the
xmit_lock and then submitting the packet to the device driver
(hard_start_xmit()) or alternatively requeuing the packet if the
xmit_lock was already held by someone else. The latter is invoked
asynchronously and periodically - it's part of the watchdog timer
mechanism.

My understanding is that two threads cannot be in qdisc_restart()
for the same qdisc at the same time, however the xmit_lock may have
been acquired by the watchdog timer function causing a requeue.

11. Links

This is just a dump of links that might be useful.
* http://www.spec.org and SpecWeb http://www.spec.org/web99/
* linux-net and netdev mailing lists:
http://www.ussg.iu.edu/hypermail/linux/net/ and
http://oss.sgi.com/projects/netdev/archive/
* Linux Traffic Control HOWTO

12. Specific TODOs

* Study watchdog timer mechanism and figure out how flow control
is implemented in the receive and transmit side

References

[3] Beyond Softnet. Jamal Hadi Salim, Robert Olsson, and Alexey
Kuznetsov. USENIX, Nov 2001.

[4] Understanding the Linux Kernel. Daniel P. Bovet and Marco
Cesati. O'Reilly & Associates, 2nd Edition. 81-7366-589-3.

[5] A Map of the Networking Code in Linux Kernel 2.4.20. Miguel Rio,
Mathieu Goutelle, Tom Kelly, Richard Hughes-Jones, Jean-Philippe
Martin-Flatin, and Yee-Ting Li. Mar 2004.

Friday, August 10, 2007

volatile

The use of volatile is poorly understood by many programmers. This is not surprising, as most C texts dismiss it in a sentence or two.

Have you experienced any of the following in your C/C++ embedded code?

* Code that works fine until you turn optimization on
* Code that works fine as long as interrupts are disabled
* Flaky hardware drivers
* Tasks that work fine in isolation, yet crash when another task is enabled

If you answered yes to any of the above, it's likely that you didn't use the C keyword volatile. You aren't alone.

volatile is a qualifier that is applied to a variable when it is declared. It tells the compiler that the value of the variable may change at any time, without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. However, before we examine them, let's take a look at the syntax.

Syntax

To declare a variable volatile, include the keyword volatile before or after the data type in the variable definition. For instance both of these declarations will declare foo to be a volatile integer:

volatile int foo;
int volatile foo;

Now, it turns out that pointers to volatile variables are very common. Both of these declarations declare foo to be a pointer to a volatile integer:

volatile int * foo;
int volatile * foo;

Volatile pointers to non-volatile variables are very rare (I think I've used them once), but I'd better go ahead and give you the syntax:

int * volatile foo;

And just for completeness, if you really must have a volatile pointer to a volatile variable, then:

int volatile * volatile foo;

Incidentally, for a great explanation of why you have a choice of where to place volatile and why you should place it after the data type (for example, int volatile * foo), consult Dan Saks's column, "Top-Level cv-Qualifiers in Function Parameters" (February 2000, p. 63).

Finally, if you apply volatile to a struct or union, the entire contents of the struct/union are volatile. If you don't want this behavior, you can apply the volatile qualifier to the individual members of the struct/union.

Use

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

* Memory-mapped peripheral registers
* Global variables modified by an interrupt service routine
* Global variables within a multi-threaded application

Peripheral registers

Embedded systems contain real hardware, usually with sophisticated peripherals. These peripherals contain registers whose values may change asynchronously to the program flow. As a very simple example, consider an 8-bit status register at address 0x1234. It is required that you poll the status register until it becomes non-zero. The naive and incorrect implementation is as follows:

UINT1 * ptr = (UINT1 *) 0x1234;

// Wait for register to become non-zero.
while (*ptr == 0);
// Do something else.

This will almost certainly fail as soon as you turn the optimizer on, since the compiler will generate assembly language that looks something like this:

mov ptr, #0x1234
mov a, @ptr
loop bz loop

The rationale of the optimizer is quite simple: having already read the variable's value into the accumulator (on the second line), there is no need to reread it, since the value will always be the same. Thus, in the third line, we end up with an infinite loop. To force the compiler to do what we want, we modify the declaration to:

UINT1 volatile * ptr =
(UINT1 volatile *) 0x1234;

The assembly language now looks like this:

mov ptr, #0x1234
loop mov a, @ptr
bz loop

The desired behavior is achieved.

Subtler problems tend to arise with registers that have special properties. For instance, a lot of peripherals contain registers that are cleared simply by reading them. Extra (or fewer) reads than you are intending can cause quite unexpected results in these cases.

Interrupt service routines

Interrupt service routines often set variables that are tested in main line code. For example, a serial port interrupt may test each received character to see if it is an ETX character (presumably signifying the end of a message). If the character is an ETX, the ISR might set a global flag. An incorrect implementation of this might be:

int etx_rcvd = FALSE;

void main()
{
...
while (!etx_rcvd)
{
// Wait
}
...
}

interrupt void rx_isr(void)
{
...
if (ETX == rx_char)
{
etx_rcvd = TRUE;
}
...
}

With optimization turned off, this code might work. However, any half decent optimizer will "break" the code. The problem is that the compiler has no idea that etx_rcvd can be changed within an ISR. As far as the compiler is concerned, the expression !etx_rcvd is always true, and, therefore, you can never exit the while loop. Consequently, all the code after the while loop may simply be removed by the optimizer. If you are lucky, your compiler will warn you about this. If you are unlucky (or you haven't yet learned to take compiler warnings seriously), your code will fail miserably. Naturally, the blame will be placed on a "lousy optimizer."

The solution is to declare the variable etx_rcvd to be volatile. Then all of your problems (well, some of them anyway) will disappear.

Multi-threaded applications

Despite the presence of queues, pipes, and other scheduler-aware communications mechanisms in real-time operating systems, it is still fairly common for two tasks to exchange information via a shared memory location (that is, a global). When you add a pre-emptive scheduler to your code, your compiler still has no idea what a context switch is or when one might occur. Thus, another task modifying a shared global is conceptually identical to the problem of interrupt service routines discussed previously. So all shared global variables should be declared volatile. For example:

int cntr;

void task1(void)
{
cntr = 0;
while (cntr == 0)
{
sleep(1);
}
...
}

void task2(void)
{
...
cntr++;
sleep(10);
...
}

This code will likely fail once the compiler's optimizer is enabled. Declaring cntr to be volatile is the proper way to solve the problem.

Final thoughts

Some compilers allow you to implicitly declare all variables as volatile. Resist this temptation, since it is essentially a substitute for thought. It also leads to potentially less efficient code.

Also, resist the temptation to blame the optimizer or turn it off. Modern optimizers are so good that I cannot remember the last time I came across an optimization bug. In contrast, I come across failures to use volatile with depressing frequency.

If you are given a piece of flaky code to "fix," perform a grep for volatile. If grep comes up empty, the examples given here are probably good places to start looking for problems.