Multiplexed I/O

----> UNIX Articles : Multiplexed I/O

----> Author : Paulus Gandung Prakosa <-> syn_attack (syn1988@sdf.lonestar.org)

----> Thanks to : mywisdom (devilzc0de.org), ketek (devilzc0de.org), schumbag (devilzc0de.org), chaer.newbie (devilzc0de.org), kiddies (devilzc0de.org), gunslinger_ (devilzc0de.org), ditatompel (devilzc0de.org)

----- Articles Begin -----

Applications often need to block on more than one file descriptor, juggling I/O between keyboard input (stdin), interprocess communication (IPC), and a handful of files. Modern event-driven graphical user interface (GUI) applications may contend with literally hundreds of pending events via their mainloops.

Without the aid of threads -- essentially servicing each file descriptor separately -- a single process cannot reasonably block on more than one file descriptor at the same time. Working with multiple file descriptors is fine, so long as they are always ready to be read from or written to. But as soon as one file descriptor that is not yet ready is encountered -- say, if a "read()" system call is issued, and there is not yet any data -- the process will block, no longer able to service the other file descriptors. It might block for just a few seconds, making the application inefficient and annoying the user. However, if no data becomes available on the file descriptor, it could block forever. Because file descriptors' I/O is often interrelated -- think pipes -- it is quite possible for one file descriptor not to become ready until another is serviced. Particularly with network applications, which may have many sockets open simultaneously, this is potentially quite a problem.

Imagine blocking on a file descriptor related to interprocess communication while "stdin" has data pending. The application won't know that keyboard input is pending until the blocked IPC file descriptor ultimately returns data -- but what if the blocked operation never returns?

Enter multiplexed I/O.

Multiplexed I/O allows an application to concurrently block on multiple file descriptors, and receive notification when any one of them becomes ready to read or write without blocking. Multiplexed I/O thus becomes the pivot point for the application, designed similarly to the following :

1. Multiplexed I/O : Tell me when any of these file descriptors are ready for I/O.
2. Sleep until one or more file descriptors are ready.
3. Woken up: What is ready?
4. Handle all file descriptors ready for I/O, without blocking.
5. Go back to step 1, and start over.
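
The loop above can be sketched in C. The following is only a minimal illustration, not a definitive implementation: it uses poll(), one of the interfaces described later in this article, and the helper name drain_all() and the MAX_FDS cap are assumptions made up for this example. It reads from a set of descriptors until each one reaches end-of-file, without ever blocking on a single descriptor :

```c
#include <poll.h>
#include <unistd.h>

#define MAX_FDS 16	/* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: read from all nfds descriptors until each one
 * reports end-of-file. Returns the total bytes read, or -1 on error
 * (including an interrupted poll(), which real code would retry).
 */
long drain_all(const int *fds, int nfds)
{
	struct pollfd pfds[MAX_FDS];
	char buf[4096];
	long total = 0;
	int open_fds = nfds;
	int i;

	if (nfds < 0 || nfds > MAX_FDS)
		return -1;

	/* Step 1: tell the kernel which descriptors we care about. */
	for (i = 0; i < nfds; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
	}

	while (open_fds > 0) {
		/* Step 2: sleep until one or more descriptors are ready. */
		if (poll(pfds, nfds, -1) == -1)
			return -1;

		/* Steps 3 and 4: find out what is ready, and handle it. */
		for (i = 0; i < nfds; i++) {
			ssize_t n;

			if (pfds[i].fd < 0 || pfds[i].revents == 0)
				continue;

			n = read(pfds[i].fd, buf, sizeof(buf));
			if (n > 0) {
				total += n;
			} else {
				/* EOF or error: poll() ignores negative fds. */
				pfds[i].fd = -1;
				open_fds--;
			}
		}
		/* Step 5: go back to step 1 and wait again. */
	}

	return total;
}
```

Called with the read ends of two pipes, for instance, drain_all() returns the combined number of bytes read from both, regardless of which pipe becomes ready first.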

Linux provides three multiplexed I/O solutions: the select, poll, and epoll interfaces.

select()

The select() system call provides a mechanism for implementing synchronous multiplexing I/O :

	#include <sys/time.h>
	#include <sys/types.h>
	#include <unistd.h>

	int select(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

	FD_CLR(int fd, fd_set *set);
	FD_ISSET(int fd, fd_set *set);
	FD_SET(int fd, fd_set *set);
	FD_ZERO(fd_set *set);
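
The four macros are how a program builds and inspects the sets passed to select(). As a small sketch (the function name fd_set_demo() is hypothetical, invented for this example), the following exercises each macro once and reports whether the set behaved as expected :

```c
#include <sys/select.h>

/*
 * Hypothetical demo: returns 1 if the macros behave as expected for fd,
 * or -1 if fd cannot legally be stored in an fd_set.
 */
int fd_set_demo(int fd)
{
	fd_set set;
	int before, during, after;

	/* Descriptors outside 0..FD_SETSIZE-1 must not be used with fd_set. */
	if (fd < 0 || fd >= FD_SETSIZE)
		return -1;

	FD_ZERO(&set);				/* remove every descriptor */
	before = FD_ISSET(fd, &set) != 0;	/* not a member yet */

	FD_SET(fd, &set);			/* add fd to the set */
	during = FD_ISSET(fd, &set) != 0;	/* now a member */

	FD_CLR(fd, &set);			/* remove fd again */
	after = FD_ISSET(fd, &set) != 0;	/* no longer a member */

	return !before && during && !after;
}
```

A set should always be emptied with FD_ZERO() before its first use, since an uninitialized fd_set contains arbitrary bits.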

The timeout parameter is a pointer to a timeval structure, which is defined as follows :

	#include <sys/time.h>

	struct timeval {
		long tv_sec;		/* seconds */
		long tv_usec;		/* microseconds */
	};

On success, select() returns the number of file descriptors ready for I/O, among all three sets. If a timeout was provided, the return value may be 0. On error, the call returns -1, and errno is set to one of the following values :

EBADF : An invalid file descriptor was provided in one of the sets.

EINTR : A signal was caught while waiting, and the call can be reissued.

EINVAL : The parameter n is negative, or the given timeout is invalid.

ENOMEM : Insufficient memory was available to complete the request.

Because select() has historically been more readily implemented on various UNIX systems than a mechanism for subsecond-resolution sleeping, it is often employed as a portable way to sleep by providing a non-NULL timeout but NULL for all three sets :

	struct timeval tv;

	tv.tv_sec = 0;
	tv.tv_usec = 500;

	/* sleep for 500 microseconds */
	select(0, NULL, NULL, NULL, &tv);

pselect()

The select() system call, first introduced in 4.2BSD, is popular, but POSIX defined its own solution, pselect(), in POSIX 1003.1g-2000 and later in POSIX 1003.1-2001 :

	#define _XOPEN_SOURCE	600
	#include <sys/select.h>

	int pselect(int n,
		    fd_set *readfds,
		    fd_set *writefds,
		    fd_set *exceptfds,
		    const struct timespec *timeout,
		    const sigset_t *sigmask);

	FD_CLR(int fd, fd_set *set);
	FD_ISSET(int fd, fd_set *set);
	FD_SET(int fd, fd_set *set);
	FD_ZERO(fd_set *set);

There are three differences between pselect() and select() :

pselect() uses the timespec structure, not the timeval structure, for its timeout parameter. The timespec structure uses seconds and nanoseconds, not seconds and microseconds, providing theoretically superior timeout resolution. In practice, however, neither call reliably provides even microsecond resolution.
A call to pselect() does not modify the timeout parameter. Consequently, this parameter does not need to be reinitialized on subsequent invocations.
The select() system call does not have the sigmask parameter. With respect to signals, when this parameter is set to NULL, pselect() behaves like select().

The timespec structure is defined as follows :

	#include <time.h>

	struct timespec {
		long tv_sec;		/* seconds */
		long tv_nsec;		/* nanoseconds */
	};
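
The differences listed above can be seen in a short sketch. The helper name wait_readable_retry() is hypothetical, invented for this example, and it assumes nsec is less than one second (1000000000). The timespec is filled in once and reused across calls, and a NULL sigmask is passed so that pselect() treats signals as select() does :

```c
#define _XOPEN_SOURCE	600
#include <sys/select.h>
#include <time.h>

/*
 * Hypothetical helper: check fd up to `attempts` times, waiting `nsec`
 * nanoseconds each time. Returns 1 once fd is readable, 0 if it never
 * became readable, -1 on error.
 */
int wait_readable_retry(int fd, long nsec, int attempts)
{
	struct timespec ts;
	fd_set readfds;
	int i, ret;

	if (fd < 0 || fd >= FD_SETSIZE)
		return -1;

	/* Initialized once: pselect() leaves the timeout untouched. */
	ts.tv_sec = 0;
	ts.tv_nsec = nsec;

	for (i = 0; i < attempts; i++) {
		/* The set is still overwritten on return, so rebuild it. */
		FD_ZERO(&readfds);
		FD_SET(fd, &readfds);

		/* NULL sigmask: the signal mask is left alone. */
		ret = pselect(fd + 1, &readfds, NULL, NULL, &ts, NULL);
		if (ret == -1)
			return -1;
		if (ret > 0)
			return 1;
	}

	return 0;
}
```

With select(), the same loop would have to refill the timeval structure on every pass in order to remain portable.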

poll()

The poll() system call is System V's multiplexed I/O solution. It solves several deficiencies in select(), although select() is still often used (most likely out of habit, or in the name of portability) :

	#include <sys/poll.h>

	int poll(struct pollfd *fds, unsigned int nfds, int timeout);

Unlike select(), with its inefficient three bitmask-based sets of file descriptors, poll() employs a single array of nfds pollfd structures, pointed to by fds. The structure is defined as follows :

	#include <sys/poll.h>

	struct pollfd {
		int fd;		/* file descriptor */
		short events;	/* requested events to watch */
		short revents;	/* returned events witnessed */
	};

Each pollfd structure specifies a single file descriptor to watch. Multiple structures may be passed, instructing poll() to watch multiple file descriptors. The events field of each structure is a bitmask of events to watch for on that file descriptor. The user sets this field. The revents field is a bitmask of events that were witnessed on the file descriptor. The kernel sets this field on return. All of the events requested in the events field may be returned in the revents field. Valid events are as follows :

POLLIN : There is data to read.

POLLRDNORM : There is normal data to read.

POLLRDBAND : There is priority data to read.

POLLPRI : There is urgent data to read.

POLLOUT : Writing will not block.

POLLWRNORM : Writing normal data will not block.

POLLWRBAND : Writing priority data will not block.

POLLMSG : A SIGPOLL message is available.

In addition, the following events may be returned in the revents field :

POLLERR : Error on the given file descriptor.

POLLHUP : Hung up event on the given file descriptor.

POLLNVAL : The given file descriptor is invalid.

On success, poll() returns the number of file descriptors whose structures have non-zero revents fields. It returns 0 if the timeout occurred before any events occurred. On failure, -1 is returned, and the global variable errno is set to one of the following :

EBADF : An invalid file descriptor was given in one or more of the structures.

EFAULT : The pointer to fds pointed outside of the process' address space.

EINTR : A signal occurred before any requested event. The call may be reissued.

EINVAL : The nfds parameter exceeded the RLIMIT_NOFILE value.

ENOMEM : Insufficient memory was available to complete the request.
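
As a minimal sketch of poll() in use, the following watches one descriptor for reading and another for writing in a single call. The helper name check_ready() and its output parameters are assumptions made up for this example; note that the timeout argument to poll() is given in milliseconds :

```c
#include <poll.h>

/*
 * Hypothetical helper: check whether read_fd can be read and write_fd
 * can be written without blocking, waiting at most timeout_ms.
 * Returns what poll() returned; the two flags are set only on success.
 */
int check_ready(int read_fd, int write_fd, int timeout_ms,
		int *can_read, int *can_write)
{
	struct pollfd fds[2];
	int ret;

	/* One structure per descriptor; events says what to watch for. */
	fds[0].fd = read_fd;
	fds[0].events = POLLIN;

	fds[1].fd = write_fd;
	fds[1].events = POLLOUT;

	ret = poll(fds, 2, timeout_ms);
	if (ret == -1)
		return -1;	/* errno describes the failure */

	/* The kernel reports what it witnessed in revents. */
	*can_read = (fds[0].revents & POLLIN) != 0;
	*can_write = (fds[1].revents & POLLOUT) != 0;

	return ret;
}
```

Passing STDIN_FILENO and STDOUT_FILENO with a timeout of a few seconds is a simple way to try it: on a terminal, standard output is normally writable at once, while standard input becomes readable only after a line is typed.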

ppoll()

Linux provides a ppoll() cousin to poll(), in the same vein as pselect(). Unlike pselect(), however, ppoll() is a Linux-specific interface :

	#define _GNU_SOURCE
	#include <sys/poll.h>

	int ppoll(struct pollfd *fds,
		  nfds_t nfds,
		  const struct timespec *timeout,
		  const sigset_t *sigmask);

Differences between the poll() and select() system calls

Although they perform the same basic job, the poll() system call is superior to select() for a handful of reasons :

poll() does not require that the user calculate and pass in as a parameter the value of the highest-numbered file descriptor plus one.
poll() is more efficient for large-valued file descriptors. Imagine watching a single file descriptor with the value 900 via select() -- the kernel would have to check each bit of each passed-in set, up to the 900th bit.
select()'s file descriptor sets are statically sized, introducing a tradeoff: either they are small, limiting the maximum file descriptor that select() can watch, or they are inefficient. Operations on large bitmasks are not efficient, especially if it is not known whether they are sparsely populated. With poll(), one can create an array of exactly the right size. Only watching one item? Just pass in a single structure.
With select(), the file descriptor sets are reconstructed on return, so each subsequent call must reinitialize them. The poll() system call separates the input (events field) from the output (revents field), allowing the array to be reused without change.
The timeout parameter to select() is undefined on return. Portable code needs to reinitialize it. This is not an issue with pselect(), however.

The select() system call does have a few things going for it, though :

select() is more portable, as some UNIX systems do not support poll().
select() provides better timeout resolution than poll(): down to the microsecond, versus milliseconds for poll(). Both ppoll() and pselect() theoretically provide nanosecond resolution, but in practice, none of these calls reliably provides even microsecond resolution.

----- End Articles -----

References :

Love, Robert : Linux System Programming. O'Reilly Book Publisher
Kernighan, Brian W.; Ritchie, Dennis M. : The C Programming Language. Prentice-Hall
Digital Equipment Corporation, IBM, UNIX, Silicon Graphics Inc. : C Language Reference Manual
http://people.redhat.com/drepper
http://grsecurity.net/~spender/research.pdf