Linuxerfer.com

웹서버 동시접속 만명 프로젝트(The C10K Problem)

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 500MHz machine with 1 gigabyte of RAM and six 100Mbit/sec Ethernet card for $3000 or so. Let's see - at 10000 clients, that's 50KHz, 100Kbytes, and 60Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of ten thousand clients. (That works out to $0.30 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.

One of the busiest ftp sites, cdrom.com, actually can handle 10000 clients simultaneously through a Gigabit Ethernet pipe to its ISP.( Here's a somewhat dated page about their configuration.) Pipes this fast aren't common yet, but technology is improving rapidly.

There's interest in benchmarking this kind of configuration; see the discussion on June 28th 1999 on linux-kernel, in which people propose setting up a testbed that can be used for ongoing benchmarks.

There appears to be interest in this direction from the NSF, too; see the Web100 project.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, for obvious reasons.

The C10K problem
Book to Read First
I/O Strategies
Comments
Limits on open filehandles
Limits on threads
Java issues
Other tips
Other limits
Kernel Issues
Measuring Server Performance
Examples
- Interesting select()-based servers
- Interesting /dev/poll-based servers
- Interesting kqueue()-based servers
- Interesting realtime signal-based servers
- Interesting thread-based servers
- Interesting in-kernel servers
Other interesting links

Book to Read First

If you haven't read it already, go out and get a copy of Unix Network Programming : Networking Apis: Sockets and Xti (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem.

I/O Strategies

There seem to be four ways of writing a fast web server to handle many clients:

serve many clients with each server thread, and use nonblocking I/O
serve one client with each server thread
Serve many clients with each server thread, and use asynchronous I/O
Build the server code into the kernel

1. Serve many clients with each server thread, and use nonblocking I/O

... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. It's still being improved; see e.g. Niels Provos' benchmarks with hinting poll and thttpd.
An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
Worker threads or processes that do the disk I/O can get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at Usenix '99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.

There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:

The traditional select(). Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard library and user programs.
The traditional poll(). There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow about a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.
/dev/poll
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle. It looks like this is lower overhead than even the realtime signal approach described below.
It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().
Linux is also starting to support /dev/poll, although it's not in the kernel as of 2.3.99. See N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] A patch implementing /dev/poll for the 2.2.14 Linux kernel is available from www.citi.umic.edu.
Solaris's /dev/poll support may be the first practical example of the kind of API proposed in [Banga, Mogul, Drusha '99], although Winsock's WSAsyncSelect might qualify, too.
FreeBSD Kernel Queues
FreeBSD 5.0 (and possibly 4.1) support a generalized alternative to poll() called kqueue()/kevent(). (See also Jonathan Lemon's page.) Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion. There's even a python binding for it.
Realtime Signals
The 2.3.21 linux kernel introduced a way of using realtime signals to replace poll. fcntl(fd, F_SETSIG, signum) associates a signal with each file descriptor, and raises that signal when a normal I/O function like read() or write() completes.
To use this, you choose a realtime signal number, mask that signal and SIGIO with sigprocmask(), use fcntl(fd, F_SETSIG, signum) on each fd, write a normal poll() outer loop, and inside it, after you've handled all the fd's noticed by poll(), you loop calling sigwaitinfo().
If sigwaitinfo returns your realtime signal, siginfo.si_fd and siginfo.si_band give the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you handle the i/o, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed, so you flush the signal queue by temporarily changing the signal handler to SIG_DFL, and break back to the outer poll() loop.
You can support older kernels by surrounding the sigwaitinfo code with #if defined(LINUX_VERSION_CODE) && (LINUX_VERSION_CODE >= KERNEL_VERSION(2.3.21))
See Zach Brown's phhttpd for example code that uses this feature, and deals with some of the gotchas, e.g. events queued before an fd is dealt with or closed, but arriving afterwards.

2. Serve one client with each server thread

... and let read() and write() block. Has the disadvantage of using a whole stack frame for each client, which costs memory. Many OS's also have trouble handling more than a few hundred threads.

3. Serve many clients with each server thread, and use asynchronous I/O

This has not yet become popular, probably because few operating systems support asynchronous I/O, also possibly because it's hard to use. Under Unix, this is done with the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2, and in glibc 2.1. The generic glibc 2.1 implementation may have been written for standards compliance rather than performance.

SGI has implemented high-speed AIO with kernel support. As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads.

Various people appear to be working on a different implementation that does not use kernel threads, and should scale better. It won't be available until kernel 2.5, though, probably.

The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.

4. Build the server code into the kernel

Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown's remarks about userland vs. kernel http servers.

Comments

ACE, a C++ I/O framework, contains object-oriented implementations of some of these I/O strategies. In particular, his Reactor is an OO way of doing nonblocking I/O, and Proactor is an OO way of doing asynchronous I/O, I think. (Caution: those who are allergic to Design Patterns may need to don a garlic necklace before reading the ACE docs.)

Matt Welsh wrote a paper in April 2000 about how to balance the use of worker thread and event-drivent techniques when building scalable servers.

Richard Gooch has written a paper discussing I/O options.

The Apache mailing lists have some interesting posts about why they prefer not to use select() (basically, they think that makes plugins harder). Still, they're planning to use select()/poll()/sendfile() for static requests in Apache 2.0.

Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, a rebuttal from Mingo, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.

See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of copletion events. Compare this to Linux's F_SETSIG and queued signals feature.

There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections". (second week, third week) Highlights:

Ed Hall posted a few notes on his experiences; he's achieved >1000 connects/second on a UP P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU) each managing a large number of clients using "an event-based model".
Mike Jagdis posted an analysis of poll/select overhead, and said "The current select/poll implementation can be improved significantly, especially in the blocking case, but the overhead will still increase with the number of descriptors because select/poll does not, and cannot, remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions are welcome..."
Mike posted about his work on improving select() and poll().
Mike posted a bit about a possible API to replace poll()/select(): "How about a 'device like' API where you write 'pollfd like' structs, the 'device' listens for events and delivers 'pollfd like' structs representing them when you read it? ... "
Rogier Wolff suggested using "the API that the digital guys suggested", http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives should certainly be able to do what Win32's WaitForMultipleObjects can, at least.
Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and sigwaitinfo() was a superset of the API proposed in http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal blocked at all times if you're interested in performance; instead of the signal being delivered asynchronously, the process grabs the next one from the queue with sigwaitinfo().
Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and concluded they're pretty similar.
Alan Cox noted that an older rev of SCT's SIGIO patch is included in 2.3.18ac.
Jordan Mendelson posted some example code showing how to use F_SETSIG.
Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted: "With a signal dequeuing mechanism, your application is going to get signals destined for various library components if libraries are using the same mechanism," but the library can set up its own signal handler, so this shouldn't affect the program (much).
Doug Royer noted that he'd gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server. Others chimed in with estimates of how much RAM that would require on Linux, and what bottlenecks would be hit.

Interesting reading!

Limits on open filehandles

Any Unix: the limits set by ulimit or setrlimit.
Solaris: see the Solaris FAQ, question 3.45.
FreeBSD: use sysctl -w kern.maxfiles=nnnn to raise limit
Linux: See Bodo Bauer's /proc documentation. On current 2.2.x kernels,
```
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max
```
increases the system limit on open files, and
```
ulimit -n 32768
```
increases the current process' limit. I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with the above changes.
See also Oskar's 1998 post, which talks about the per-process and system-wide limits on file descriptors in the 2.0.36 kernel.

Limits on threads

On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. You can set this at runtime with pthread_attr_init() if you're using pthreads.

Solaris: it supports as many threads as will fit in memory, I hear.
FreeBSD: ?
Linux: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don't know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit. It appears to be integrated into 2.3.20.
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This stunning little document steps you through a lot of stuff that would be hard to figure out yourself. This is a must-read, even though some of its advice is already out of date.
Java: See Volano's detailed benchmark info, plus their info on how to tune various systems to handle lots of threads.

Java issues

Java's networking libraries mostly offer the one-thread-per-client model. There is a way to do nonblocking reads, but no way to do nonblocking writes.

Sun's JDK 1.2 Production release for Solaris does contain an example of using JNI to access poll() (or /dev/poll, if present). Here's the javadoc for the interface.
Juergen Kreileder of the Blackdown team writes:

I once started to port the JNI-poll interface example from the Solaris production release JDK to Linux but never had to time finish it. You can find a more or less usable version at: http://guiness.cs.uni-dortmund.de/~kreilede/poller/ . The code requires a native threads VM, I recommend 1.2.2-RC4.

"Java Network Programming" 2nd ed. also touches on using JNI to access select().

HP's java now includes a Thread Polling API.

Matt Welsh has implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (400) connections. His class library is called java-nbio. Check it out!

See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.

There are several proposals for improving Java's networking APIs:

Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory management changes to allow the use of asynchronous I/O with Java.
Interfacing Java to the Virtual Interface Architecture, by C-C. Chang and T. von Eicken, proposes memory management changes to allow the use of asynchronous I/O with Java.
Sun's JSR-51 ( New I/O APIs for the Java Platform) is a working group which hopes to come up with APIs which will give Java excellent I/O. Just getting started as of May 2000.

Other tips

Zero-Copy
Normally, data gets copied many times on its way from here to there. Any scheme that eliminates these copies to the bare physical minimum is called "zero-copy".
- IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
- Alan Cox noted that zero-copy is sometimes not worth the trouble back in 1999. (He did like sendfile(), though.)
- Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and says he'll make it available to userspace soon.
- Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the amount of data transferred is at least a page, *and* you don't immediately reuse the buffer, memory management tricks will be used to avoid copies. But see followups to this message on linux-kernel for people's misgivings about the speed of those memory management tricks.
- The sendfile() system call can implement zero-copy networking.
  The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file. This lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using nonblocking I/O. (In Linux, It's poorly documented at the moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this.) Rumor has it, ftp.cdrom.com benefitted noticably from sendfile().
  One developer using sendfile() with Freebsd reports that using POLLWRBAND instead of POLLOUT makes a big difference.
Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some reason. Unsetting the option flushes the buffer. Better to use writev(), though...
Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of native threads rather than pthreads.
Caching your own data can sometimes be a win.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai (vivek@cs.rice.edu) on new-httpd, May 9th, states:

"I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"

Other limits

Old system libraries might use 16 bit variables to hold file handles, which causes trouble above 32767 handles. glibc2.1 should be ok.
Many systems use 16 bit variables to hold process or thread id's. It would be interesting to port the Volano scalability benchmark to C, and see what the upper limit on number of threads is for the various operating systems.
Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
Look at the performance comparison graph at the bottom of http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open, you would expect exactly this behavior. There may be other causes.

Kernel Issues

For Linux, it looks like kernel bottlenecks are being fixed constantly. See my Mindcraft Redux page, Kernelnotes.org, Kernel Traffic, and the Linux-Kernel mailing list (Example interesting posts by a user asking how to tune, and Dean Gaudet)

In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.

See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem.

See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.

Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can improve HTTP response time over 'slow' connections by 80%.

Measuring Server Performance

Two tests in particular are simple, interesting, and hard:

raw connections per second (how many 512 byte files per second can you serve?)
total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)

Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.

I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.

Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.

IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.

Examples

Interesting select()-based servers

thttpd Very simple. Uses a single process. It has good performance, but doesn't scale with the number of CPU's.
mathopd. Similar to thttpd.
fhttpd
boa
Roxen
Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite - web server using IO-Lite.
Flash: An efficient and portable Web server -- uses select(), mmap(), mincore()
xitami - uses select() to implement its own thread abstraction for portability to systems without threads.
Medusa - a server-writing toolkit in Python that tries to deliver very high performance.

Interesting /dev/poll-based servers

N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to support /dev/poll. Performance is compared with phhttpd.

Interesting kqueue()-based servers

Benno has written a patch that adds kqueue support to Squid. Haven't tried it myself.

Interesting realtime signal-based servers

Zach Brown's phhttpd - "a quick web server that was written to showcase the sigio/siginfo event model. consider this code highly experimental and yourself highly mental if you try and use it in a production environment." Uses the siginfo features of 2.3.21 or later, and includes the needed patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for some notes.

Interesting thread-based servers

Hoser FTPD. See their benchmark page.
Peter Eriksson's phttpd and
pftpd
The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
Sun's Java Web Server (which has been reported to handle 500 simultaneous clients)

Interesting in-kernel servers

khttpd
"TUX" (Threaded linUX webserver) by Ingo Molnar et al. For 2.4 kernel.

Contents

Interesting /dev/poll-based servers