Wednesday, June 6, 2007

Wait Queues

When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.

Linux implementation allows wake-on semantics using TASK_EXCLUSIVE flag. With waitqueues, you can either use a well-known queue and then simply sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout, or you can define your own waitqueue and use add/remove_wait_queue to add and remove yourself from it and wake_up/wake_up_interruptible to wake up when needed.

An example of the first usage of waitqueues is interaction between the page allocator (in mm/page_alloc.c:__alloc_pages()) and the kswapd kernel daemon (in mm/vmscan.c:kswap()), by means of wait queue kswapd_wait, declared in mm/vmscan.c; the kswapd daemon sleeps on this queue, and it is woken up whenever the page allocator needs to free up some pages.

An example of autonomous waitqueue usage is interaction between user process requesting data via read(2) system call and kernel running in the interrupt context to supply the data. An interrupt handler might look like (simplified drivers/char/rtc_interrupt()):

static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);

void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
spin_lock(&rtc_lock);
rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
spin_unlock(&rtc_lock);
wake_up_interruptible(&rtc_wait);
}

So, the interrupt handler obtains the data by reading from some device-specific I/O port (CMOS_READ() macro turns into a couple outb/inb) and then wakes up whoever is sleeping on the rtc_wait wait queue.

Now, the read(2) system call could be implemented as:

ssize_t rtc_read(struct file file, char *buf, size_t count, loff_t *ppos)
{
DECLARE_WAITQUEUE(wait, current);
unsigned long data;
ssize_t retval;

add_wait_queue(&rtc_wait, &wait);
current->state = TASK_INTERRUPTIBLE;
do {
spin_lock_irq(&rtc_lock);
data = rtc_irq_data;
rtc_irq_data = 0;
spin_unlock_irq(&rtc_lock);

if (data != 0)
break;

if (file->f_flags & O_NONBLOCK) {
retval = -EAGAIN;
goto out;
}
if (signal_pending(current)) {
retval = -ERESTARTSYS;
goto out;
}
schedule();
} while(1);
retval = put_user(data, (unsigned long *)buf);
if (!retval)
retval = sizeof(unsigned long);

out:
current->state = TASK_RUNNING;
remove_wait_queue(&rtc_wait, &wait);
return retval;
}

What happens in rtc_read() is this:

1. We declare a wait queue element pointing to current process context.
2. We add this element to the rtc_wait waitqueue.
3. We mark current context as TASK_INTERRUPTIBLE which means it will not be rescheduled after the next time it sleeps.
4. We check if there is no data available; if there is we break out, copy data to user buffer, mark ourselves as TASK_RUNNING, remove ourselves from the wait queue and return
5. If there is no data yet, we check whether the user specified non-blocking I/O and if so we fail with EAGAIN (which is the same as EWOULDBLOCK)
6. We also check if a signal is pending and if so inform the "higher layers" to restart the system call if necessary. By "if necessary" I meant the details of signal disposition as specified in sigaction(2) system call.
7. Then we "switch out", i.e. fall asleep, until woken up by the interrupt handler. If we didn't mark ourselves as TASK_INTERRUPTIBLE then the scheduler could schedule us sooner than when the data is available, thus causing unneeded processing.

It is also worth pointing out that, using wait queues, it is rather easy to implement the poll(2) system call:

static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
unsigned long l;

poll_wait(file, &rtc_wait, wait);

spin_lock_irq(&rtc_lock);
l = rtc_irq_data;
spin_unlock_irq(&rtc_lock);

if (l != 0)
return POLLIN | POLLRDNORM;
return 0;
}

All the work is done by the device-independent function poll_wait() which does the necessary waitqueue manipulations; all we need to do is point it to the waitqueue which is woken up by our device-specific interrupt handler.

No comments: