It's time for web servers to handle ten thousand clients simultaneously,
don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 500MHz machine with 1 gigabyte of
RAM and six 100Mbit/sec Ethernet card for $3000 or so. Let's see - at 10000
clients, that's 50KHz, 100Kbytes, and 60Kbits/sec per client. It shouldn't
take any more horsepower than that to take four kilobytes from the disk and
send them to the network once a second for each of ten thousand clients.
(That works out to $0.30 per client, by the way. Those $100/client
licensing fees some operating systems charge are starting to look a little
heavy!) So hardware is no longer the bottleneck.
One of the busiest ftp sites, cdrom.com,
actually can
handle 10000 clients simultaneously through a Gigabit Ethernet pipe to
its ISP.(
Here's a somewhat dated page about their configuration.) Pipes this fast
aren't common yet, but technology is improving rapidly.
There's interest in benchmarking this kind of configuration; see
the discussion on June 28th 1999 on linux-kernel, in which people
propose setting up a testbed that can be used for ongoing benchmarks.
There appears to be interest in this direction from the NSF, too; see the
Web100 project.
With that in mind, here are a few notes on how to configure operating
systems and write code to support thousands of clients. The discussion
centers around Unix-like operating systems, for obvious reasons.
Contents
If you haven't read it already, go out and get a copy of
Unix Network Programming : Networking Apis: Sockets and Xti (Volume 1)
by the late W. Richard Stevens. It describes many of the I/O strategies and
pitfalls related to writing high-performance servers.
It even talks about the 'thundering herd' problem.
There seem to be four ways of writing a fast web server to handle many clients:
- serve many clients with each server thread, and use nonblocking I/O
- serve one client with each server thread
- Serve many clients with each server thread, and use asynchronous I/O
- Build the server code into the kernel
... set nonblocking mode on all network handles, and use select() or poll()
to tell which network handle has data waiting. This is the traditional
favorite. It's still being improved; see e.g.
Niels Provos' benchmarks with hinting poll and thttpd.
An important bottleneck in this method is that read() or sendfile() from
disk blocks if the page is not in core at the moment; setting nonblocking
mode on a disk file handle has no effect. Same thing goes for memory-mapped
disk files. The first time a server needs disk I/O, its process blocks, all
clients must wait, and that raw nonthreaded performance goes to waste.
Worker threads or processes that do the disk I/O can get around this
bottleneck. One approach is to use memory-mapped files,
and if mincore() indicates I/O is needed, ask a worker to do the I/O,
and continue handling network traffic. Jef Poskanzer mentions that
Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at
Usenix '99 on it.
It looks like mincore() is available in BSD-derived Unixes
like FreeBSD
and Solaris, but is not part
of the Single Unix Specification.
It's available as part of Linux as of kernel 2.3.51,
thanks to Chuck Lever.
There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:
- The traditional select(). Unfortunately,
select() is limited to FD_SETSIZE handles. This limit is compiled in to
the standard library and user programs.
- The traditional poll().
There is no hardcoded limit to the number of file descriptors poll() can handle,
but it does get slow about a few thousand, since most of the file descriptors
are idle at any one time, and scanning through thousands of file descriptors
takes time.
- /dev/poll
The idea behind /dev/poll is to take advantage of the fact that often
poll() is called many times with the same arguments.
With /dev/poll, you get an open handle to /dev/poll, and
tell the OS just once what files you're interested in by writing to that handle;
from then on, you just read the set of currently ready file descriptors from that handle.
It looks like this is lower overhead than even the realtime signal approach described below.
It appeared quietly in Solaris 7
(see patchid 106541)
but its first public appearance was in
Solaris 8;
according to Sun,
at 750 clients, this has 10% of the overhead of poll().
Linux is also starting to support /dev/poll, although it's not in the kernel as of 2.3.99. See
N. Provos, C. Lever,
"Scalable Network I/O in Linux,"
May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).]
A patch
implementing /dev/poll for the 2.2.14 Linux kernel is available
from www.citi.umic.edu.
Solaris's /dev/poll support may be the first practical example of
the kind of API proposed in
[Banga, Mogul, Drusha '99], although Winsock's WSAsyncSelect might qualify, too.
- FreeBSD Kernel Queues
FreeBSD 5.0 (and possibly 4.1) support a generalized alternative to poll() called
kqueue()/kevent(). (See also Jonathan Lemon's page.)
Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you
call kqueue() to allocate one. To change the events you are listening for, or to get the
list of current events, you call kevent() on the descriptor returned by kqueue().
It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.
There's even a python binding for it.
- Realtime Signals
The 2.3.21 linux kernel introduced a way of using realtime signals to replace poll.
fcntl(fd, F_SETSIG, signum) associates a signal with each file descriptor, and raises that signal
when a normal I/O function like read() or write() completes.
To use this, you choose a realtime signal number, mask that signal and SIGIO with sigprocmask(),
use fcntl(fd, F_SETSIG, signum) on each fd, write a normal poll() outer loop, and
inside it, after you've handled all the fd's noticed by poll(), you
loop calling
sigwaitinfo().
If sigwaitinfo returns your realtime signal, siginfo.si_fd and siginfo.si_band give the
same information as pollfd.fd and pollfd.revents would after a call to poll(),
so you handle the i/o, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed,
so you flush the signal queue by temporarily changing the signal handler to SIG_DFL,
and break back to the outer poll() loop.
You can support older kernels by surrounding
the sigwaitinfo code with #if defined(LINUX_VERSION_CODE) && (LINUX_VERSION_CODE >= KERNEL_VERSION(2.3.21))
See Zach Brown's phhttpd for example code that uses this feature, and deals with some of the gotchas,
e.g. events queued before an fd is dealt with or closed, but arriving afterwards.
... and let read() and write() block. Has the disadvantage of using a whole stack
frame for each client, which costs memory. Many OS's also have trouble handling more
than a few hundred threads.
This has not yet become popular,
probably because few operating systems support asynchronous I/O,
also possibly because it's hard to use.
Under Unix, this is done with the aio_ interface
(scroll down from that link to "Asynchronous input and output"),
which associates a signal and value with each I/O operation.
Signals and their values are queued and delivered efficiently to the user process.
This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification,
version 2, and in glibc 2.1. The generic glibc 2.1 implementation
may have been written for standards compliance rather than performance.
SGI has implemented high-speed AIO
with kernel support. As of version 1.1, it's said to work well with both
disk I/O and sockets. It seems to use kernel threads.
Various people appear to be working on a different implementation that does
not use kernel threads, and should scale better. It won't be available
until kernel 2.5, though, probably.
The O'Reilly book
POSIX.4: Programming for the Real World
is said to include a good introduction to aio.
Novell and Microsoft are both said to have done this at various times,
at least one NFS implementation does this,
khttpd does this for Linux
and static web pages, and
"TUX" (Threaded linUX webserver)
is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux.
Ingo's September 1, 2000 announcement
says an alpha version of TUX can be downloaded from
ftp://ftp.redhat.com/pub/redhat/tux,
and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this
approach, and the consensus seems to be instead of moving web servers
into the kernel, the kernel should have the smallest possible hooks added
to improve web server performance. That way, other kinds of servers
can benefit. See e.g.
Zach Brown's remarks
about userland vs. kernel http servers.
ACE, a C++ I/O framework,
contains object-oriented implementations of some of these I/O strategies.
In particular, his Reactor is an OO way of doing nonblocking I/O, and
Proactor is an OO way of doing asynchronous I/O, I think. (Caution: those who are
allergic to Design Patterns may need to don a garlic necklace before reading the
ACE docs.)
Matt Welsh wrote a paper
in April 2000 about how to balance the use of worker thread and
event-drivent techniques when building scalable servers.
Richard Gooch has written
a paper discussing I/O options.
The Apache mailing lists have some
interesting posts
about why they prefer not to use select() (basically, they think that makes plugins harder). Still, they're planning to use select()/poll()/sendfile() for
static requests in Apache 2.0.
Mark Russinovich wrote
an editorial and
an article
discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even
he seems misinformed on some points. In particular, he
seems to think that Linux 2.2's asynchronous I/O
(see F_SETSIG above) doesn't notify the user process when data is ready, only
when new connections arrive. This seems like a bizarre misunderstanding.
See also
comments on an earlier draft,
a rebuttal from Mingo,
Russinovich's comments of 2 May 1999,
a rebuttal from Alan Cox,
and various
posts to linux-kernel.
I suspect he was trying to say that Linux doesn't support asynchronous disk I/O,
which used to be true, but now that SGI has implemented KAIO,
it's not so true anymore.
See these pages at sysinternals.com and
MSDN for information
on "completion ports", which he said were unique to NT; in a nutshell,
win32's "overlapped I/O" turned out to be too low level to be convenient, and
a "completion port" is a wrapper that provides a queue of copletion events.
Compare this to Linux's F_SETSIG and queued signals feature.
There was an interesting discussion on linux-kernel in September 1999 titled
"> 15,000 Simultaneous Connections".
(second week,
third week) Highlights:
-
Ed Hall
posted a few notes on his experiences; he's achieved
>1000 connects/second on a UP P2/333 running Solaris. His code
used a small pool of threads (1 or 2 per CPU) each managing a large number
of clients using "an event-based model".
- Mike Jagdis posted an analysis of poll/select overhead, and said
"The current select/poll implementation can be improved significantly,
especially in the blocking case, but the overhead will still increase
with the number of descriptors because select/poll does not, and
cannot, remember what descriptors are interesting. This would be
easy to fix with a new API. Suggestions are welcome..."
- Mike posted about his work on improving select() and poll().
- Mike posted a bit about a possible API to replace poll()/select():
"How about a 'device like' API where you write 'pollfd like' structs,
the 'device' listens for events and delivers 'pollfd like' structs
representing them when you read it? ... "
- Rogier Wolff
suggested
using "the API that the digital guys suggested",
http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
- Joerg Pommnitz pointed out that any new API along these lines should
be able to wait for not just file descriptor events, but also signals and maybe
SYSV-IPC. Our synchronization primitives should certainly be able to
do what Win32's WaitForMultipleObjects can, at least.
- Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime
signals, and sigwaitinfo() was a superset of the API proposed in
http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that
you keep the signal blocked at all times if you're interested in performance;
instead of the signal being delivered asynchronously, the process grabs the
next one from the queue with sigwaitinfo().
- Jayson Nordwick compared
completion ports with the F_SETSIG synchronous event model,
and concluded they're pretty similar.
- Alan Cox noted that an older rev of SCT's SIGIO patch is included in
2.3.18ac.
- Jordan Mendelson posted some example code showing how to use F_SETSIG.
- Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG,
and noted: "With a signal dequeuing mechanism, your application is going to get
signals destined for various library components if libraries are using
the same mechanism," but the library can set up its own signal handler,
so this shouldn't affect the program (much).
- Doug Royer
noted that he'd gotten 100,000 connections on Solaris 2.6 while he
was working on the Sun calendar server.
Others chimed in with estimates of how much RAM that would require
on Linux, and what bottlenecks would be hit.
Interesting reading!
On any architecture, you may need to reduce the amount
of stack space allocated for each thread to avoid running
out of virtual memory. You can set this at runtime with
pthread_attr_init() if you're using pthreads.
Java's networking libraries mostly offer the one-thread-per-client model.
There is a way to do nonblocking reads, but no way to do nonblocking writes.
Sun's JDK 1.2 Production release for Solaris does contain an example of using JNI to access poll()
(or /dev/poll, if present). Here's the javadoc for the
interface.
Juergen Kreileder of the Blackdown team writes:
I once started to port the JNI-poll interface example from the Solaris
production release JDK to Linux but never had to time finish it. You
can find a more or less usable version at:
http://guiness.cs.uni-dortmund.de/~kreilede/poller/
.
The code requires a native threads VM, I recommend 1.2.2-RC4.
"Java Network Programming" 2nd ed. also touches on using JNI to access select().
HP's java now includes a Thread Polling API.
Matt Welsh has implemented nonblocking sockets for Java; his performance
benchmarks show that they have advantages over blocking sockets in servers
handling many (400) connections. His class library is called
java-nbio.
Check it out!
See also
Dean Gaudet's essay
on the subject of Java, network I/O, and threads, and the
paper by Matt Welsh
on events vs. worker threads.
There are several proposals for improving Java's networking APIs:
- Zero-Copy
Normally, data gets copied many times on its way from here to there.
Any scheme that eliminates these copies to the bare physical minimum is called "zero-copy".
- Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to
avoid sending partial frames, which helps a bit e.g. when there are
lots of little write() calls you can't bundle together for some reason.
Unsetting the option flushes the buffer. Better to use writev(), though...
- Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux
(and its friends in other operating systems)
lets you create a thread that has its own current working directory,
for instance, which can be very helpful when implementing an ftp server.
See Hoser FTPd for an example of the use of native threads rather than pthreads.
- Caching your own data can sometimes be a win.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai
(vivek@cs.rice.edu) on
new-httpd, May 9th, states:
"I've compared the raw performance of a select-based server with a
multiple-process server on both FreeBSD and Solaris/x86. On
microbenchmarks, there's only a marginal difference in performance
stemming from the software architecture. The big performance win for
select-based servers stems from doing application-level caching. While
multiple-process servers can do it at a higher cost, it's harder to
get the same benefits on real workloads (vs microbenchmarks).
I'll be presenting those measurements as part of a paper that'll
appear at the next Usenix conference. If you've got postscript,
the paper is available at
http://www.cs.rice.edu/~vivek/flash99/"
- Old system libraries might use 16 bit variables to hold
file handles, which causes trouble above 32767 handles. glibc2.1 should be ok.
- Many systems use 16 bit variables to hold process or thread id's.
It would be interesting to port the Volano scalability
benchmark to C, and see what the upper limit on number of threads is for the various operating systems.
- Too much thread-local memory is preallocated by some operating systems;
if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit
of 2000 threads.
- Look at the performance comparison graph at the bottom of
http://www.acme.com/software/thttpd/benchmarks.html.
Notice how various servers have trouble above 128 connections, even on Solaris 2.6?
Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms)
delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or
http daemon has a hard limit on the number of connections open,
you would expect exactly this behavior. There may be other causes.
For Linux, it looks like kernel bottlenecks are being fixed constantly.
See my Mindcraft Redux page,
Kernelnotes.org,
Kernel Traffic,
and the Linux-Kernel mailing list
(Example interesting posts by
a user asking how to tune, and
Dean Gaudet)
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux
at serving large numbers of http and smb clients, in which they
failed to see good results from Linux.
See also my article on Mindcraft's April 1999 Benchmarks
for more info.
See also The Linux Scalability Project.
They're doing interesting work, including
Niels Provos' hinting poll patch, and some work on
the thundering herd problem.
See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.
Mohit Aron (aron@cs.rice.edu)
writes that rate-based clocking in TCP can improve HTTP response time over 'slow' connections by 80%.
Two tests in particular are simple, interesting, and hard:
- raw connections per second (how many 512 byte files per second can you
serve?)
- total transfer rate on large files with many slow clients
(how many 28.8k modem clients can simultaneously download
from your server before performance goes to pot?)
Jef Poskanzer has published benchmarks comparing many web servers.
See http://www.acme.com/software/thttpd/benchmarks.html
for his results.
I also have
a few old notes about comparing thttpd to Apache that may be of interest
to beginners.
Chuck Lever keeps reminding us about
Banga and Druschel's paper on web server benchmarking. It's worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.
-
N. Provos, C. Lever,
"Scalable Network I/O in Linux,"
May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] Describes a
version of thttpd modified to support /dev/poll. Performance is compared
with phhttpd.
- Zach Brown's phhttpd - "a
quick web server that was written to showcase the sigio/siginfo event
model. consider this code highly
experimental and yourself highly mental if you try and use it in a production environment."
Uses the siginfo features of 2.3.21 or later, and includes the needed patches
for earlier kernels.
Rumored to be even faster than khttpd.
See his post of 31 May 1999
for some notes.
Ãâó : http://www.kegel.com/c10k.html
|