February 15th, 2007 by Andrew

Lately the server that hosts this site has been having some issues. It all began in January, when I started receiving kernel errors in my daily logwatch emails from the server. All the errors pointed to a hard drive failure:

WARNING:  Kernel Errors Present
Buffer I/O error on device hda, l ...:  4 Time(s)
end_request: I/O error, dev hda, sector ...:  4 Time(s)
hda: dma_intr: error=0x10 { SectorIdN ...:  4 Time(s)
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } ...:  4 Time(s)
hda: task_in_intr: error=0x10 { SectorIdN ...:  32 Time(s)
hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error } ...:  32 Time(s)

Those are definitely hard drive problems, right? Well, not exactly. I replaced the hard drive with a working one by copying it bit for bit, using the Linux command ‘dd’. I swapped out the cables and booted up using the new drive. Everything was fine for a day or two, then I started getting those annoying kernel errors again. Wait a second… I had just put in a new hard drive; I couldn’t possibly be getting the exact same errors. Could it be that some of the OS files had been corrupted by the failing drive and weren’t copied over correctly?
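For anyone who hasn't done a bit-for-bit copy before, here is a rough sketch of what one looks like with ‘dd’. It is not the exact command I ran. The file names source.img and clone.img are made-up stand-ins so the example is safe to try; on a real server they would be device nodes such as /dev/hda and /dev/hdb (also assumptions here), and mixing up the two would overwrite the wrong disk.

```shell
#!/bin/sh
# Sketch of a bit-for-bit copy with dd, using ordinary files as stand-ins
# for disks. SRC and DST are hypothetical names; for real drives they
# would be device nodes (for example /dev/hda and /dev/hdb).
set -e

SRC=source.img
DST=clone.img

# Create a 64 KiB stand-in "disk" filled with random data.
head -c 65536 /dev/urandom > "$SRC"

# Copy block by block. conv=noerror keeps going after a read error, and
# conv=sync pads any short or unreadable block with zeros so the rest of
# the data stays at the right offset. Padding also means a source whose
# size is not a multiple of bs ends up with extra zeros at the end.
dd if="$SRC" of="$DST" bs=4096 conv=noerror,sync 2>/dev/null

# Compare the two byte for byte.
if cmp -s "$SRC" "$DST"; then
    echo "clone matches"
else
    echo "clone differs"
fi
```

A clean comparison only proves the copy is faithful. If the source already held corrupted files, the clone holds them too, which is exactly what I was worried about.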

Here’s where I took drastic measures. Rather than trying to keep the same server setup, I decided to install Ubuntu and copy just the old files. Previously I had been using Fedora Core, and I loved it, but Ubuntu seems to be my new favorite Linux OS. Anyway, since I still had the old “broken” drive lying around, I wiped the new drive, installed Ubuntu, mounted the old drive, and copied the necessary files. There’s nothing like a freshly installed Linux distribution.

I let the server run overnight with the new OS, just to make sure everything would be OK. Hoping for the best, I checked it the next morning. Well, everything wasn’t OK. I was getting hard drive errors again! Even though hard drives are the most likely piece of equipment to fail in my computer, I just couldn’t accept that I had two failing drives of completely different makes. I decided that something else had to be making my hard drive appear to fail.

Last summer I had some fun with my desktop computer’s motherboard and its failing capacitors, which I eventually replaced (all 20-something of them) myself. The strange power interruptions from the failing capacitors had made it appear that the memory was going bad. Since I knew that power supply problems could cause memory to appear bad, why couldn’t they also make hard drives appear bad?

Just to be sure that it wasn’t bad memory, I ran a memory check on the server for a whole day. The memory passed every single test (nine times through, I think), and there were absolutely no problems. A quick visual inspection of the server’s motherboard capacitors didn’t turn up anything either, which left only one option that could be the root of all my problems: the power supply.

I just happened to have a spare power supply sitting in an empty case across my room, so I quickly unplugged all the connectors, swapped it in, and left the server running overnight again. The next morning I found absolutely no error messages about anything failing. Finally, the server is back to normal. Well, as normal as it can be with a new power supply, a new hard drive, and a new operating system.


