Posted Saturday, 13 January 2007, 7:22 pm
Much to my surprise, the article I published yesterday—regarding a nifty, simple trick for adding a measure of security to passwords that have been written down—was submitted to digg.com, and even more surprisingly, it turned out to be very popular. Or rather I should say, it was very popular on digg.com, but was a nightmare for me, scrambling to keep my server going under the crushing load.
The ‘fun’ began shortly after 7am, Pacific time. I got a text message from a script I run on an offsite system, probing my server every few minutes. After the second page, I dragged my ass out of bed and had a look. Yikes. I still had four SSH sessions open on my PC’s desktop from the night before, but none of them were responsive when I typed in them. Ouch.
I ran down to the garage (yep, the server’s in my garage) to check it out. The server was up, but the disks were going absolutely crazy. This past night, we had had record low temperatures for our area, so one thing that crossed my mind was that the disks had simply gotten too cold, and were unable to recalibrate to compensate. Maybe. Since I didn’t have a serial console handy and this is a headless server, I power-cycled her. I use a logging filesystem, so the risks of a power-cycle even while doing busy disk seeks are minimal—I’ve yanked the plug from the wall during tests, over and over, and never had a problem. No problem this time either—server came back up, I headed upstairs to have a look.
Yikes! Server load was off the charts. All httpd processes. I quickly took a look at the logs, and there I could see that Digg had descended upon my little server.
I’ve been running the anastrophe.com server for something like eight years now. It started out as a Sun SPARCstation 20, and now it’s a Sun Netra T1. 440MHz UltraSPARC IIi CPU, one gig of RAM, and a pair of disks, mirrored. I provide email and webhosting and shell and other services for a few dozen friends and family. It’s a light load, and the hardware has held up just dandy. Until today.
The problem, more than anything else, was lack of RAM. The Netra T1 can only hold 1GB, so there were no quick fixes there. I did the usual things to try to tame the tiger: reduced the maximum number of concurrent httpd processes, fine-tuned KeepAlive, etc. But the reality was, I was caught utterly unprepared. I think I can be forgiven for not having anticipated that an article about passwords of all things would be a hit on digg. It’s no excuse though. I really should have had at least a backup plan, just in case. Maybe set up a secondary server, mounting the Apache directories over NFS, so the load could be split. I’ll be looking into that later tonight. One thing I did at one point was to attach an external disk array, and move swap onto it. That helped some, but by that point the load on the server had already begun dropping, so it was too little, too late.
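For anyone wondering why capping httpd is the first thing to reach for on a RAM-starved box, here is a rough sketch of the arithmetic in Python. It estimates how many httpd processes fit in memory before the machine starts swapping, which is the kind of limit Apache's MaxClients directive sets. The function name is made up for illustration, and the reserved-memory and per-process figures are guesses, not measurements from this server; only the 1GB total comes from the Netra T1 itself.

```python
def max_httpd_processes(total_ram_mb: int, reserved_mb: int, per_process_mb: int) -> int:
    """Return how many httpd processes fit in RAM without pushing the box into swap.

    total_ram_mb:   physical memory in the machine
    reserved_mb:    memory set aside for the OS, mail, shells, etc. (assumed)
    per_process_mb: typical resident size of one httpd process (assumed)
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    # Never go negative if the reservation exceeds physical memory.
    available_mb = max(total_ram_mb - reserved_mb, 0)
    return available_mb // per_process_mb


if __name__ == "__main__":
    # 1024 MB is the Netra T1's ceiling; 256 and 20 are illustrative guesses.
    print(max_httpd_processes(total_ram_mb=1024, reserved_mb=256, per_process_mb=20))
```

With those assumed numbers the cap works out to 38 processes. Anything above that and each extra request pushes the server further into swap, which is roughly what all that disk thrashing was about.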
So what does the title of this article have to do with any of this? There’s an old skydiving rule of thumb:
When cars look as big as ants, it’s time to open the parachute.
When ants look as big as cars, you’ve waited too long.
I’ve been picking ants from my teeth all day long.