RAID is amazing

In yesterday's post I explained some weird activity in my RAID backup server. Well, it turns out that was just the calm before the storm. Thankfully, the storm was nothing more than a little rain.

The backup server started working again, without issue, after I pushed the drives back into place. When I awoke today I noticed that the disk status light beside drive 1 was off. I tried reseating the drive but the light would not come back on. I figured that maybe something had happened to the circuitry in the unit and the LED was no longer receiving power.

As a last-ditch effort I opened the SysNavi software that is used to communicate with the RAID server. Right away it told me that the RAID volume was in Critical condition. Uh oh. So begins a long day, I thought. I poked around the software (the UI is terrible, but it gets the job done) and finally determined that drive 1's light was out because the drive had failed.

Every piece of electronic equipment has an MTBF, or mean time between failures, value. Whatever the actual number is, its very existence tells the owner that the equipment will fail at some point during its operation; it is only a matter of when. The drive that just failed on me did so after 4 years of continuous operation, and I do mean continuous. It was being used by the backup server every hour of every day by my main server's Time Machine application. So, I'm not too disappointed by it.
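
For the curious, here is a rough sketch of how an MTBF figure turns into a "chance of failure" over a few years. It assumes a constant failure rate (the simple exponential model), which real drives only loosely follow, and the MTBF value in it is a made-up number for illustration, not the rating of my drive.

```python
import math

HOURS_PER_YEAR = 24 * 365

def failure_probability(mtbf_hours, years):
    """Chance of at least one failure within `years`, assuming a constant failure rate."""
    return 1 - math.exp(-years * HOURS_PER_YEAR / mtbf_hours)

mtbf = 500_000  # hypothetical figure for illustration, not my drive's actual rating

for years in (1, 4, 10):
    print(f"{years:>2} years of 24/7 use: {failure_probability(mtbf, years):.1%}")
```

The point is less the exact percentages and more the shape: the longer a drive runs around the clock, the closer that number creeps toward certainty.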

Now comes the real question. What happened to all my data now that one of my hard drives has failed? The standard configuration of the RAID unit is RAID 5, which in this unit spreads every piece of data across all 4 hard drives. The data is striped (split into chunks, with each chunk written to a different drive) along with parity information (think of it as a checksum that can also be used to reconstruct a missing chunk). If one drive fails, the system keeps running on the other 3 drives and can recompute everything that was on the failed drive, without any loss of data. Pretty cool, eh?
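
To make the parity trick less magical, here is a minimal sketch of the idea using XOR, which is the textbook way RAID 5 parity works. I have no idea how my unit implements it internally, so treat the block contents, the tiny block size, and the `xor_blocks` helper as illustrative assumptions.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive array: three data blocks plus one parity block.
data = [b"back", b"up d", b"ata!"]
parity = xor_blocks(data)

# Pretend the drive holding the second data block has died.
survivors = [data[0], data[2], parity]

# XOR-ing everything that is left reproduces the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'up d'
```

This only works when exactly one block in the stripe is missing. Lose two and there is no longer enough information to work backwards.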

I happened to have a 1TB hard drive lying around, so I pulled the dead drive out of the unit [1] and popped the new drive in. Once the new drive was in, the system accessed it and started using the three remaining drives, with the backup data on them, to rebuild the data onto the new drive. The system software reported that the new drive was Rebuilding.

The rebuilding process took four hours in total (rebuilding about 2.7TB of striped data) and, all the while, my other computers were still backing up data. RAID is a magical thing, and can certainly save you from losing your data when one of your drives dies [2].
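
Some back-of-the-envelope arithmetic on those numbers, as a sketch. It assumes all four drives are 1TB in the marketing sense (10^12 bytes), and that the "2.7TB" the software shows is the usable capacity expressed in binary units; both are my guesses rather than anything the software told me.

```python
DRIVES = 4              # assumption: four drives in the unit
DRIVE_BYTES = 10**12    # assumption: each drive is a marketing "1TB"
REBUILD_HOURS = 4       # how long the rebuild took

# RAID 5 gives up one drive's worth of space to parity.
usable_bytes = (DRIVES - 1) * DRIVE_BYTES
usable_tib = usable_bytes / 2**40

# The rebuild has to fill the whole replacement drive.
rebuild_mb_per_s = DRIVE_BYTES / (REBUILD_HOURS * 3600) / 1e6

print(f"usable capacity: {usable_tib:.2f} TiB")
print(f"average rebuild write rate: {rebuild_mb_per_s:.0f} MB/s")
```

Under those assumptions the usable space comes out to roughly 2.73 TiB, which lines up with the 2.7TB figure, and the new drive was being filled at around 69 MB/s on average while backups carried on.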

Footnotes
  1. This was done in a hot swap mode, which means that the unit can be running (even backing up more data) while I remove the dead drive and insert a new one. The system happily hums along until it is able to access the new drive, then begins using it. 

  2. If two or more drives die at the same time, RAID 5 cannot rebuild the array and the backup data is lost. This is a hazard, but there's not much you can do about it. There's always a trade-off between ease and security.