Scale Failure

George Neville-Neil

doi:10.1145/2133416.2147781

Abstract

Dear KV, I have been digging into a network-based logging system at work because, from time to time, the system jams up, even when there seems to be no good reason for it to do so. What I found would be funny, if only it weren’t my job to fix it: the central dispatcher for the entire logging system is a simple for loop around a pair of read and write calls; the for loop takes input from one of a set of file descriptors and sends output to one of another set of file descriptors. The system works fine so long as none of the remote readers or writers ever blocks, and normally that’s not a problem. The problem has come about because what was once handling fewer than 10 machines is now handling 40, some of which are remote across a wide area network. The obvious fix is to make the code nonblocking, but what I’m surprised about is that anyone would write code this way. It’s obvious from the first time you look at the code that it cannot scale.

Full Text