RAID 5 rebuilding analysis

Last updated January 4, 2016

A brief review of RAID 5

In a simple configuration, a RAID 5 consists of three or more hard drives. The drives are logically divided into blocks of equal size. A stripe consists of one block from each drive at the same offset from the beginning of the drive. In the example below, a 3-drive RAID 5 is configured with a block size of 64 KB. Note that counting starts from 0, as is common practice in computer science.

Each stripe consists of three blocks. Two blocks store data and one block stores parity data. The parity block is computed by applying the exclusive-or (XOR) operator on the data blocks. In stripe #0 the value of byte #0 in the parity block (on Disk 2) is computed as follows:

Parity byte #0 = (Disk 0 byte #0) XOR (Disk 1 byte #0)

Note that from the properties of XOR:

Disk 0 byte #0 = (Parity byte #0) XOR (Disk 1 byte #0)
Disk 1 byte #0 = (Parity byte #0) XOR (Disk 0 byte #0)
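These parity relations can be sketched in Python by XORing blocks byte by byte (the block contents below are arbitrary illustration values, not real disk data):

```python
# Sketch of RAID 5 parity: XOR two data blocks to get the parity block,
# then recover either data block from the parity and the other block.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

disk0 = bytes([0x12, 0x34, 0x56])   # data block on Disk 0 (example values)
disk1 = bytes([0xAB, 0xCD, 0xEF])   # data block on Disk 1 (example values)

parity = xor_blocks(disk0, disk1)   # parity block on Disk 2

# If Disk 0 is lost, its block is recovered from the parity and Disk 1:
recovered = xor_blocks(parity, disk1)
assert recovered == disk0
```

Because XOR is its own inverse, the same function serves for both computing parity and recovering a lost block.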

If one drive fails, the lost data can be recomputed by applying the XOR operator to the corresponding blocks on the surviving drives.

RAID 5 degraded mode and rebuilding

When one drive fails, a RAID 5 continues to operate in degraded mode. Data is still written as stripes, but skipping the failed drive. When data needs to be read from a block on the failed drive, the RAID driver reads the remaining blocks in the stripe from the surviving drives and applies XOR to recompute the data of the missing block. In degraded mode, write performance is not affected, but read performance is slower due to this XOR reconstruction. More importantly, the data is no longer protected from another drive failure.

Under this condition an operator typically performs a RAID rebuild. The failed drive is physically removed and replaced by a new drive. The RAID driver will automatically recompute the data on the failed drive and write it to the new drive. This process may take hours depending on the size of the drives. When rebuilding is complete the RAID is restored to normal working status.
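The rebuild pass can be sketched as a loop over stripes: for each stripe, XOR the blocks on the surviving drives and write the result to the replacement drive. The drives below are modeled as small lists of byte-string blocks purely for illustration:

```python
# Sketch of a RAID 5 rebuild for a 3-drive array in which Disk 1 has
# failed. Each drive is a list of byte-string blocks; real drivers work
# on raw sectors, but the XOR logic is the same.
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild(surviving_drives, num_blocks):
    """Recompute every block of the failed drive from the survivors."""
    new_drive = []
    for i in range(num_blocks):
        stripe_blocks = [d[i] for d in surviving_drives]
        new_drive.append(reduce(xor_blocks, stripe_blocks))
    return new_drive

disk0 = [b'\x01\x02', b'\x03\x04']   # example blocks on Disk 0
disk2 = [b'\x05\x06', b'\x07\x08']   # example blocks on Disk 2
replacement = rebuild([disk0, disk2], num_blocks=2)
```

A real rebuild performs this for every stripe on the array, which is why the process takes hours on large drives.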

Rebuilding failure

During rebuilding the RAID driver reads every block on all the surviving drives. If it encounters any bit error, the rebuilding operation is typically aborted, leaving the RAID in limbo: it may stay in degraded mode or it may go into total failure mode. If the RAID consists of many large-capacity drives, the probability of rebuilding failure can be very high. For example, if the probability that a 2 TB hard drive has at least one bit error is 1%, then the probability of at least one error across twelve 2 TB drives is 1 − 0.99^12 ≈ 11%.
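The 11% figure follows from assuming independent 1% per-drive error probabilities; a quick check under that model:

```python
# Probability that at least one of n drives has a bit error,
# assuming an independent per-drive error probability p.
def at_least_one_error(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(round(at_least_one_error(0.01, 12), 3))  # 0.114, i.e. about 11%
```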

To rebuild or not to rebuild?

One may try to copy the data to new storage prior to rebuilding. Copying is subject to the same risk as rebuilding: if an error is encountered, the copying stops. However, copying only requires reading the used space, not every block on every drive. The risk of failed copying for the example above is tabulated as follows:

Used space    Failed copy probability
100%          11%
50%           6%
25%           3%
10%           1%
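The table can be reproduced under the same model by scaling the per-drive error probability by the fraction of space actually read:

```python
# Failed-copy probability for 12 drives, assuming the 1% per-drive error
# probability scales linearly with the fraction of space that is read.
def failed_copy_probability(used_fraction: float, p: float = 0.01, n: int = 12) -> float:
    return 1 - (1 - p * used_fraction) ** n

for used in (1.0, 0.5, 0.25, 0.1):
    print(f"{used:4.0%}  {failed_copy_probability(used):.0%}")
```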

The procedure we recommend is as follows:

  • Arrange the data in order of importance. One convenient criterion is the modified date: newer data is usually more valuable.
  • Copy the data in the order of importance to a new RAID 5.
  • If copying stops due to a bit error, determine the particular file triggering the bit error. (If files are copied in the order of newest to oldest date, the next file to be copied can be determined by sorting the source files by date.) Skip this file and resume copying.
  • If copying encounters an error that causes the RAID to fail completely, the data that has not been copied will be lost. You may need to use our RAID recovery service.
  • After copying all data, rebuild the RAID. If the RAID cannot be rebuilt, reinitialize it and copy the data back from the new RAID.
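The copy-by-importance steps above can be sketched in Python. The paths, the newest-first ordering, and the use of plain file copies are assumptions for illustration; a real recovery run would also log the skipped files so they can be retried or sent for recovery later:

```python
# Sketch: copy files newest-first from a degraded RAID mount to new
# storage, skipping any file whose read triggers an I/O error.
# src_root and dst_root are hypothetical mount points.
import os
import shutil

def copy_newest_first(src_root: str, dst_root: str) -> list:
    """Copy files in order of descending modification time.

    Returns the list of files skipped due to read errors.
    """
    skipped = []
    files = []
    for dirpath, _, names in os.walk(src_root):
        for name in names:
            path = os.path.join(dirpath, name)
            files.append((os.path.getmtime(path), path))
    # Newest (most recently modified) files are copied first.
    for _, path in sorted(files, reverse=True):
        dst = os.path.join(dst_root, os.path.relpath(path, src_root))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        try:
            shutil.copy2(path, dst)
        except OSError:
            skipped.append(path)   # likely a bit error: skip and keep going
    return skipped
```

Catching OSError and continuing mirrors the "skip this file and resume copying" step; whether a read error surfaces as a per-file error or a whole-array failure depends on the RAID controller.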

Rebuilding RAID 6

A RAID 6 has two sets of parity and can continue to operate with two failed drives. Rebuilding a RAID 6 differs from rebuilding a RAID 5 in the following ways:

  • During rebuilding if the RAID driver encounters a bit error on a surviving drive, it usually requests the operator to replace the drive. Then it performs a two-drive rebuilding. If another bit error is detected on another surviving drive, the rebuilding will usually be aborted. The situation will be similar to a failed RAID 5 rebuilding.
  • Using the same assumption of a 1% bit error probability on a 2 TB drive, the probability of failed rebuilding for a 12-drive RAID 6 is 1.3%.
  • The probability of failed copying for this 12-drive RAID 6 is approximately:

    1.3% × (used space percentage)  (e.g., 0.65% for 50% used space)
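The 1.3% figure is consistent with modeling a RAID 6 rebuild as two successive rebuild passes, each of which independently risks the 12-drive single-pass failure probability of about 11%. That model is our reading of the figures above, not a stated derivation; a sketch under that assumption:

```python
# Sketch, assuming a RAID 6 rebuild fails only if bit errors strike
# during both rebuild passes, each pass carrying the 12-drive
# single-pass failure probability (an assumed model, for illustration).
def single_pass_failure(p: float = 0.01, n: int = 12) -> float:
    return 1 - (1 - p) ** n

raid6_failure = single_pass_failure() ** 2
print(f"{raid6_failure:.1%}")  # 1.3%
```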