Tuesday, June 12, 2007

PMTU Discovery

Summary

Internet computers, mostly servers, sometimes send packets too large for part of a given path. Not handling this correctly can make the server unusable for some people.

Oversize packets are handled correctly by one of two means, chosen by the sending computer (the server):

* Permitting packet fragmentation - used mostly by older systems
* Path MTU discovery - asking for ICMP notification when fragmentation would be needed

By default, modern servers disable fragmentation and try to use path MTU discovery, but sometimes system administrators block the ICMP notifications. The net effect is that these servers ask to be notified when a packet is too large and then refuse to accept the very notifications they asked for.

Unable to learn that they are sending packets too big for some paths, the servers are unable to send data to some users. To overcome this they need to accept the appropriate ICMP types or allow fragmentation.

Path MTU Discovery

Every network link has a maximum packet size called the link's MTU (Maximum Transmission Unit). The full path from one computer to another may travel across many links with different MTUs. The smallest MTU for all the links in a path is the path MTU.

If a packet starts out on a network segment with a large MTU, it may arrive at a link with a smaller MTU and be too big to fit. Most servers are on segments with large MTUs, but it is increasingly common for internet users to be connected via links with reduced MTUs, so it is becoming common for some packets to be too big.

How the problem of oversize packets has been handled has evolved considerably over time. The original approach was to send only small packets corresponding to the TCP/IP default MTU (576 bytes). (To this day, a sending system needs permission from the receiving system to send larger packets, but that permission is given as a matter of routine.)

If the sender permits it, as older equipment typically does, an oversize packet can still be delivered by breaking it into fragments and sending the fragments as smaller packets. The fragments are reassembled downstream to reconstruct the original large packet, but this packet fragmentation has several problems involving both efficiency and security.

Newer servers try to optimize their transmissions by discovering the path MTU and sending packets of the maximum size when there's enough data to fill them. The procedure for doing this was standardized and published as RFC 1191 in 1990, but it did not become widely deployed until years later. By mid 2002, 80% to 90% of computers on the internet used path MTU discovery.

The basic procedure is simple - send the largest packet you can, and if it won't fit through some link, get back a notification saying what size will fit. The notifications arrive as ICMP (Internet Control Message Protocol) packets known as "fragmentation needed" ICMPs (ICMP type 3, code 4). The notifications are requested by setting the "do not fragment" (DF) bit in packets that are sent out.
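For illustration, here is a minimal userspace sketch of extracting the next-hop MTU from such a notification, assuming Linux/glibc headers; parse_frag_needed() is a hypothetical helper name, and the caller is assumed to have already located the start of the ICMP header in a received packet.

#include <netinet/ip_icmp.h>
#include <arpa/inet.h>

/* 'pkt' points at the ICMP header of a received packet.  Returns the
 * next-hop MTU advertised by the router, or 0 if this is not a
 * "fragmentation needed" notification (type 3, code 4). */
static unsigned int parse_frag_needed(const unsigned char *pkt)
{
    const struct icmphdr *icmp = (const struct icmphdr *)pkt;

    if (icmp->type != ICMP_DEST_UNREACH || icmp->code != ICMP_FRAG_NEEDED)
        return 0;

    /* RFC 1191 places the next-hop MTU in the low 16 bits of the
     * otherwise-unused second word of the ICMP header. */
    return ntohs(icmp->un.frag.mtu);
}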

Some network and system administrators view all ICMPs as risky and block them all, disabling path MTU discovery, usually without even realizing it. Of the several dozen ICMP types and subtypes, some do pose some risk, but the risk is mostly mild and is of the "denial of service" nature. That is, an attacker can use them to interfere with service on and from the network.

By blocking all ICMPs the administrator himself interferes with service on and from his own network. Unless he also turns off path MTU discovery on his network's servers, he makes his servers unusable by users with reduced-MTU links in their paths. Because service is affected only in relatively unusual cases, it can be difficult to convince the administrator that a problem exists. The prevalence of such "unusual" cases is growing rapidly though.

Administrators who want to block all ICMPs should disable path MTU discovery on their computers, especially on their servers. It makes no sense to ask for ICMP notifications and then refuse to accept them. In addition, doing so opens the server to a special type of distributed denial of service attack based on resource exhaustion from a large number of fully-open connections.

Cisco's website has summary instructions for disabling path MTU discovery for these systems:

* Windows 95/98/ME
* Windows NT 3.1/3.51 and 4.0
* Windows 2000/XP
* Solaris
* HP-UX

Elsewhere on Cisco's site they recommend using the Dr. TCP utility for Windows.

Disabling path MTU discovery on Windows systems requires modifying the Windows Registry. Microsoft's support website has information and cautions about the Registry.

Path MTU discovery can be turned off (a per-socket sketch follows the list):

* Under Linux, with the command
echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc
* Under FreeBSD, by using sysctl(8) to zero
net.inet.tcp.path_mtu_discovery
* On Cisco routers, with the configuration command
no ip tcp path-mtu-discovery
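Under Linux, path MTU discovery can also be controlled per socket through the IP_MTU_DISCOVER socket option; the sketch below disables it for a single socket, so the DF bit is not set and packets may be fragmented. sock is a hypothetical, already-created socket descriptor.

#include <sys/socket.h>
#include <netinet/in.h>

int disable_pmtu_discovery(int sock)
{
    /* IP_PMTUDISC_DONT: never set the DF bit on this socket's packets */
    int val = IP_PMTUDISC_DONT;

    return setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}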

Example Path MTU Discovery Failure Scenario

Joe User has a DSL line. Like many DSL lines, Joe's line actually comes from a large DSL wholesaler and uses PPPoE (PPP over Ethernet) to create a logical channel between Joe and his favorite ISP. The DSL line physically connects to a small PPPoE router with a built-in hub, which Joe uses as the basis for his home network.

The MTU on Joe's network is Ethernet's usual 1500 bytes. The MTU on his DSL line is also 1500 bytes, but PPPoE takes 8 of those bytes for its header, so the MTU of his PPPoE channel to his ISP is only 1492 bytes.

A computer on Joe's Ethernet doesn't know about the PPPoE channel. The largest packet it can receive is a 1500-byte Ethernet packet, so when Joe uses his computer to connect to a remote server, his computer tells the server it's ok to send packets up to 1500 bytes.[*]

Establishing connections uses only very small packets, so Joe can connect to the server without any MTU issues, but as soon as Joe asks for enough data to fill a 1500-byte packet, the server sends a 1500-byte packet. When that packet gets to Joe's ISP, it doesn't fit down the PPPoE channel. The packet has the DF bit set, telling Joe's ISP to drop the packet and send the server an ICMP saying what packet size will fit. The ISP sends an ICMP saying the largest size is 1492 bytes.

If the server gets that ICMP, it re-sends the first 1492 bytes as a new packet. This new packet fits down the PPPoE channel, so Joe's computer gets it and sends back an acknowledgement. The server continues sending 1492-byte packets until Joe has what he asked for. In most cases the server will remember the reduced path MTU, typically for ten minutes, and use it for future connections from Joe's computer.

However, if the server does not get that ICMP, things go downhill fast. The server is expecting an acknowledgement from Joe's computer, but Joe's computer never got the packet, so the acknowledgement never comes. After a while, the server gives up waiting and sends the same 1500-byte packet again. Joe's ISP sends back another ICMP. Again the server doesn't get it, so it tries sending the 1500-byte packet a third time, then a fourth, a fifth, and so on.

Meanwhile Joe's computer can't tell this is happening and is waiting for a response from the server. Eventually, it gives up and sends the server a connection reset. It reports a network failure to Joe, who is left wondering what happened. He may soon discover he can access nearly all other sites, just not the one he wanted.

If the server he's trying to access is a web server, he may get part of the website, such as getting the ads but not the main page content. This happens because the pages normally displayed for many websites are assembled from data obtained from multiple servers on a variety of networks. The networks dispensing the ads all support path MTU discovery.

If Joe complains to the server operator, they may tell him the problem must be his ISP's because other users can view their server. However, Joe's ISP is not doing anything wrong; the server operator is. PPPoE and other connections with reduced MTUs are perfectly legitimate and increasingly common. Having a server ask for ICMP notifications that it then refuses to accept is the broken behavior. Some later day, when the server site has gotten enough complaints from people like Joe to believe something is wrong, it may either stop asking for the ICMPs or start accepting the useful ones. Then Joe will be able to access it.

[*] What Joe's computer actually tells the remote server is that the largest payload it can send in any packet is 1460 bytes, which is the 1500-byte maximum Ethernet packet size minus 40 bytes of overhead. This maximum payload is called the MSS (Maximum Segment Size).

Wednesday, June 6, 2007

Wait Queues

When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.

The Linux implementation allows wake-one semantics via the TASK_EXCLUSIVE flag. With waitqueues, you can either use a well-known queue and simply call sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout, or you can define your own waitqueue and use add_wait_queue/remove_wait_queue to add and remove yourself from it and wake_up/wake_up_interruptible to wake up the sleepers when needed.
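Here is a minimal sketch of the first style (a well-known queue plus interruptible_sleep_on()), assuming a 2.4-era kernel; my_event_queue, my_event_ready, my_wait_for_event() and my_signal_event() are hypothetical names used only for illustration. Note that this check-then-sleep pattern has a well-known race window, which is one reason drivers such as the RTC example below use the more explicit add_wait_queue() style instead.

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_event_queue);
static int my_event_ready;

/* consumer: sleep (interruptibly) until the producer signals the event */
static void my_wait_for_event(void)
{
        while (!my_event_ready)
                interruptible_sleep_on(&my_event_queue);
}

/* producer (another process context or a bottom half): announce the event */
static void my_signal_event(void)
{
        my_event_ready = 1;
        wake_up_interruptible(&my_event_queue);
}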

An example of the first usage of waitqueues is the interaction between the page allocator (in mm/page_alloc.c:__alloc_pages()) and the kswapd kernel daemon (in mm/vmscan.c:kswapd()), by means of the wait queue kswapd_wait, declared in mm/vmscan.c; the kswapd daemon sleeps on this queue, and it is woken up whenever the page allocator needs some pages freed.

An example of autonomous waitqueue usage is the interaction between a user process requesting data via the read(2) system call and the kernel running in interrupt context to supply the data. An interrupt handler might look like this (simplified from drivers/char/rtc.c:rtc_interrupt()):

static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);

void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* read the data from the device under the lock shared with rtc_read() */
        spin_lock(&rtc_lock);
        rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
        spin_unlock(&rtc_lock);

        /* wake up whoever is sleeping on the rtc_wait wait queue */
        wake_up_interruptible(&rtc_wait);
}

So, the interrupt handler obtains the data by reading from some device-specific I/O port (the CMOS_READ() macro turns into a couple of outb/inb instructions) and then wakes up whoever is sleeping on the rtc_wait wait queue.

Now, the read(2) system call could be implemented as:

ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
        DECLARE_WAITQUEUE(wait, current);
        unsigned long data;
        ssize_t retval;

        add_wait_queue(&rtc_wait, &wait);
        current->state = TASK_INTERRUPTIBLE;
        do {
                /* atomically consume any data left by the interrupt handler */
                spin_lock_irq(&rtc_lock);
                data = rtc_irq_data;
                rtc_irq_data = 0;
                spin_unlock_irq(&rtc_lock);

                if (data != 0)
                        break;

                if (file->f_flags & O_NONBLOCK) {
                        retval = -EAGAIN;
                        goto out;
                }
                if (signal_pending(current)) {
                        retval = -ERESTARTSYS;
                        goto out;
                }
                /* sleep until rtc_interrupt() wakes us (or a signal arrives) */
                schedule();
        } while (1);
        retval = put_user(data, (unsigned long *)buf);
        if (!retval)
                retval = sizeof(unsigned long);

out:
        current->state = TASK_RUNNING;
        remove_wait_queue(&rtc_wait, &wait);
        return retval;
}

What happens in rtc_read() is this:

1. We declare a wait queue element pointing to the current process context.
2. We add this element to the rtc_wait waitqueue.
3. We mark the current context as TASK_INTERRUPTIBLE, which means it will not be rescheduled after the next time it schedules out, until it is explicitly woken up (or a signal arrives).
4. We check whether data is available; if it is, we break out of the loop, copy the data to the user buffer, mark ourselves as TASK_RUNNING, remove ourselves from the wait queue and return.
5. If there is no data yet, we check whether the user requested non-blocking I/O and, if so, fail with EAGAIN (which is the same as EWOULDBLOCK).
6. We also check whether a signal is pending and, if so, inform the "higher layers" to restart the system call if necessary. By "if necessary" we mean the details of signal disposition as specified by the sigaction(2) system call.
7. Then we "switch out", i.e. fall asleep, until woken up by the interrupt handler. If we didn't mark ourselves as TASK_INTERRUPTIBLE, the scheduler could reschedule us before the data is available, causing unneeded processing.

It is also worth pointing out that, using wait queues, it is rather easy to implement the poll(2) system call:

static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
        unsigned long l;

        poll_wait(file, &rtc_wait, wait);

        spin_lock_irq(&rtc_lock);
        l = rtc_irq_data;
        spin_unlock_irq(&rtc_lock);

        if (l != 0)
                return POLLIN | POLLRDNORM;
        return 0;
}

All the work is done by the device-independent function poll_wait() which does the necessary waitqueue manipulations; all we need to do is point it to the waitqueue which is woken up by our device-specific interrupt handler.

Tuesday, June 5, 2007

Semaphores and read/write Semaphores

Sometimes, while accessing a shared data structure, one must perform operations that can block, for example copying data to userspace. The locking primitive available for such scenarios under Linux is called a semaphore. There are two types of semaphores: basic and read-write semaphores. Depending on the initial value of the semaphore, they can be used either for mutual exclusion (initial value of 1) or to provide a more sophisticated type of access.

Read-write semaphores differ from basic semaphores in the same way as read-write spinlocks differ from basic spinlocks: one can have multiple readers at a time but only one writer and there can be no readers while there are writers - i.e. the writer blocks all readers and new readers block while a writer is waiting.

Also, basic semaphores can be interruptible - just use down_interruptible() instead of the plain down() and check its return value: it will be nonzero if the operation was interrupted.
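Here is a minimal sketch of the interruptible usage, assuming a 2.4-era kernel; my_sem and my_op() are hypothetical names used only for illustration.

#include <asm/semaphore.h>
#include <linux/errno.h>

static DECLARE_MUTEX(my_sem);           /* basic semaphore with initial value 1 */

int my_op(void)
{
        /* down_interruptible() returns nonzero if a signal interrupted the wait */
        if (down_interruptible(&my_sem))
                return -ERESTARTSYS;

        /* critical section - may block, e.g. while copying data to/from userspace */

        up(&my_sem);
        return 0;
}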

Using semaphores for mutual exclusion is ideal in situations where a critical code section may call, by reference, unknown functions registered by other subsystems/modules, i.e. the caller cannot know a priori whether the called function blocks or not.

A simple example of semaphore usage is in kernel/sys.c, the implementation of the gethostname(2)/sethostname(2) system calls.

asmlinkage long sys_sethostname(char *name, int len)
{
        int errno;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
        if (len < 0 || len > __NEW_UTS_LEN)
                return -EINVAL;
        down_write(&uts_sem);
        errno = -EFAULT;
        if (!copy_from_user(system_utsname.nodename, name, len)) {
                system_utsname.nodename[len] = 0;
                errno = 0;
        }
        up_write(&uts_sem);
        return errno;
}

asmlinkage long sys_gethostname(char *name, int len)
{
        int i, errno;

        if (len < 0)
                return -EINVAL;
        down_read(&uts_sem);
        i = 1 + strlen(system_utsname.nodename);
        if (i > len)
                i = len;
        errno = 0;
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        up_read(&uts_sem);
        return errno;
}

The points to note about this example are:

1. The functions may block while copying data from/to userspace in copy_from_user()/copy_to_user(). Therefore they could not use any form of spinlock here.
2. The semaphore type chosen is read-write as opposed to basic because there may be lots of concurrent gethostname(2) requests which need not be mutually exclusive.

Although the Linux implementation of semaphores and read-write semaphores is very sophisticated, there are conceivable scenarios that are not yet implemented; for example, there is no concept of interruptible read-write semaphores. This is presumably because there are no real-world situations that require these exotic flavours of the primitives.

Monday, June 4, 2007

Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks

Since the early days of Linux development (the early 1990s), developers have been faced with the classical problem of accessing data shared between different types of context (user process vs interrupt) and between different instances of the same context running on multiple CPUs.

SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch was made to 1.3.37 in October the same year).

If a critical region of code may be executed by both process context and interrupt context, then the way to protect it on UP using the cli/sti instructions is:

unsigned long flags;

save_flags(flags);
cli();
/* critical code */
restore_flags(flags);

While this is ok on UP, it is obviously of no use on SMP because the same code sequence may be executed simultaneously on another CPU, and while cli() provides protection against races with interrupt context on each CPU individually, it provides no protection at all against races between contexts running on different CPUs. This is where spinlocks come in.
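As a point of comparison, here is one SMP-safe equivalent of the sequence above, sketched for a 2.4-era kernel; my_lock and my_critical_section() are hypothetical names used only for illustration. spin_lock_irqsave() disables interrupts on the local CPU, much like cli(), and additionally spins until any other CPU holding the lock releases it.

#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

void my_critical_section(void)
{
        unsigned long flags;

        /* disable local interrupts and exclude other CPUs */
        spin_lock_irqsave(&my_lock, flags);
        /* critical code */
        spin_unlock_irqrestore(&my_lock, flags);
}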

There are three types of spinlocks: vanilla (basic), read-write and big-reader spinlocks. Read-write spinlocks should be used when there is a natural tendency towards 'many readers and few writers'. An example is access to the list of registered filesystems (see fs/super.c). The list is guarded by the file_systems_lock read-write spinlock because exclusive access is needed only when registering or unregistering a filesystem, while any process can read the file /proc/filesystems or use the sysfs(2) system call to force a read-only scan of the file_systems list. This makes it sensible to use a read-write spinlock. With read-write spinlocks, one can have multiple readers at a time but only one writer, and there can be no readers while there is a writer.

Btw, it would be nice if new readers would not get the lock while a writer is trying to get it, i.e. if Linux correctly dealt with the issue of potential writer starvation by multiple readers. This would mean that readers must block while a writer is attempting to get the lock. This is not currently the case, and it is not obvious whether it should be fixed; the argument to the contrary is that readers usually take the lock for a very short time, so should they really be starved while the writer takes the lock for potentially longer periods?
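For illustration, here is a minimal sketch of read-write spinlock usage, assuming a 2.4-era kernel; my_list_lock and the two functions are hypothetical names, not the actual fs/super.c code.

#include <linux/spinlock.h>

static rwlock_t my_list_lock = RW_LOCK_UNLOCKED;

void my_scan_list(void)
{
        read_lock(&my_list_lock);
        /* walk the shared list - any number of readers may be here concurrently */
        read_unlock(&my_list_lock);
}

void my_modify_list(void)
{
        write_lock(&my_list_lock);
        /* add or remove entries - the writer excludes all readers and other writers */
        write_unlock(&my_list_lock);
}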

Big-reader spinlocks are a form of read-write spinlocks heavily optimised for very light read access, with a penalty for writes. There is a limited number of big-reader spinlocks - currently only two exist, of which one is used only on sparc64 (global irq) and the other is used for networking. In all other cases, where the access pattern does not fit either of these two scenarios, one should use basic spinlocks. You cannot block while holding any kind of spinlock.

Spinlocks come in three flavours: plain, _irq() and _bh().

1. Plain spin_lock()/spin_unlock(): if you know that interrupts are always disabled or if you do not race with interrupt context (e.g. you are in an interrupt handler yourself), then you can use this one. It does not touch the interrupt state on the current CPU.
2. spin_lock_irq()/spin_unlock_irq(): if you know that interrupts are always enabled then you can use this version, which simply disables (on lock) and re-enables (on unlock) interrupts on the current CPU. For example, rtc_read() uses spin_lock_irq(&rtc_lock) (interrupts are always enabled inside read()) whilst rtc_interrupt() uses spin_lock(&rtc_lock) (interrupts are always disabled inside interrupt handler). Note that rtc_read() uses spin_lock_irq() and not the more generic spin_lock_irqsave() because on entry to any system call interrupts are always enabled.
3. spin_lock_irqsave()/spin_unlock_irqrestore(): the strongest form, to be used when the interrupt state is not known, but only if interrupts matter at all, i.e. there is no point in using it if our interrupt handlers don't execute any critical code.

The reason you cannot use plain spin_lock() if you race against interrupt handlers is because if you take it and then an interrupt comes in on the same CPU, it will busy wait for the lock forever: the lock holder, having been interrupted, will not continue until the interrupt handler returns.

The most common usage of a spinlock is to access a data structure shared between user process context and interrupt handlers:

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

my_ioctl()
{
        spin_lock_irq(&my_lock);
        /* critical section */
        spin_unlock_irq(&my_lock);
}

my_irq_handler()
{
        spin_lock(&my_lock);
        /* critical section */
        spin_unlock(&my_lock);
}

There are a couple of things to note about this example:

1. The process context, represented here as a typical driver method - ioctl() (arguments and return values omitted for clarity), must use spin_lock_irq() because it knows that interrupts are always enabled while executing the device ioctl() method.
2. Interrupt context, represented here by my_irq_handler() (again, arguments omitted for clarity), can use the plain spin_lock() form because interrupts are disabled inside an interrupt handler.