Monday, June 28, 2010

back in the old days

At Danger, we had a team of about thirteen people, each responsible for one or two different servers. All the servers running together are called a "service", which provides all the functionality for the Sidekick: data storage, instant messaging, web browsing, email, etc.

Keeping all of these things working in sync is, as we geeks say, "nontrivial." Bugs internal to each server--say, my data server crashes if it gets too much data--are compounded by bugs in how the servers interact: I fixed my data server problem, but now it outputs data slightly differently and the mail server can't handle it. It starts out looking like a mail server bug, but in reality it's a bug in the communication between the two servers. The sane way to manage this problem is to put your code through constant functionality and interoperability testing, releasing it as often as you can (iteration) and always putting all the parts together (continuous integration). So every day we "rolled" the previous day's code out to a developer service that all the engineers and most of the other tech staff used, called "Daily". Because we used it, we usually noticed immediately if it was broken, and we were responsible for fixing it.

One rainy day in my first year, I meandered into the office at my usual early hour, about 8 A.M., and ran into another Chris, the head of IT, who informed me that the network was down on the engineering floor, due to flooding in the network closet.

This was a three-story building. "We're having a flood on the second floor?"

"Yeah, leaking through the HVAC vents. Water was cascading down the equipment rack, it was great. There's a tarp over it now, we'll see if it dries out okay."

"A tarp."

Only TJ was in the office. TJ is, to put it mildly, a diligent worker, and yet there he was in the corridor, not working.

"Hey, TJ."

"Daily's down. I would have sent email, but, well, Daily's down."

We investigated. Daily shouldn't have been down, because (we thought) Daily's machines were in the nice, dry server room on the first floor. We'd worked really hard to make Daily reliable in that way. And it was mostly true.

The team had a double cubicle partitioned off, with a couch and a wheeled shelf/drawer converted to a liquor cabinet, and a coffee table of sorts, with legs made of two standard beige-box computers, x86-a and x86-b. In the old office, both had been damaged by hallway soccer years earlier, and we assumed they were both turned off, though we couldn't actually tell because the power lights were broken and we were too lazy to check the fans in back.

As it happened, x86-a had a shared drive, containing...the Oracle database libraries. Which were needed by the data server. Which was running in the first-floor server room. Which couldn't talk to x86-a, which was up on the second floor, with its flooded network closet.

TJ and I got out a deck of cards and played Hearts for an hour or so until things were fixed.

Computers are hard.
