ZFS and RAID-Z recoverability and performance

Last updated April 24, 2015

What is ZFS?

ZFS is a modern open-source file system with advanced features to improve reliability. It is used primarily on FreeBSD and Linux (http://www.zfsonlinux.org) and by open-source NAS systems such as FreeNAS (http://www.freenas.org) and NAS4Free (http://www.nas4free.org).

ZFS reliability

ZFS's developers cite reliability as one of its main design criteria. To that end the following features were implemented:

  • Copy-on-write semantics.
  • Checksums for both metadata and data.
  • Redundant metadata.
  • Journaling.

Copy-on-write serves two main purposes. First, it allows quick creation of volume snapshots. Second, it guarantees file system consistency across system failures such as a power outage. (One may argue that consistency can also be achieved by a well-designed journal log.)

Checksums protect data integrity. Data corruption caused by such problems as bit rot can be detected, and in mirrored or RAID-Z configurations even corrected. ZFS provides end-to-end checksums for both data and metadata.
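One of the checksum algorithms ZFS uses for this purpose is Fletcher-4, which runs four 64-bit accumulators over the data's 32-bit words. As a rough illustration of how such a checksum detects corruption, here is a simplified sketch (our own minimal version, not the ZFS implementation):

```python
import struct

def fletcher4(data: bytes):
    """Simplified Fletcher-4: four 64-bit accumulators over 32-bit LE words."""
    MASK = (1 << 64) - 1                    # accumulators wrap at 2^64
    a = b = c = d = 0
    for (word,) in struct.iter_unpack('<I', data):
        a = (a + word) & MASK
        b = (b + a) & MASK
        c = (c + b) & MASK
        d = (d + c) & MASK
    return (a, b, c, d)

# A single corrupted byte changes the checksum, so the damage is detectable.
good = bytes(512)                           # one 512-byte sector of zeros
bad = bytes([1]) + bytes(511)               # same sector with one flipped byte
assert fletcher4(good) != fletcher4(bad)
```

When a checksum mismatch is found and redundancy is available (a mirror copy or RAID-Z parity), ZFS can rebuild the damaged block and rewrite it, which is what makes the corruption correctable rather than merely detectable.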

File data locations are stored in a metadata structure called the block pointer. All block pointers are redundantly stored.

The ZFS intent log (ZIL) provides journaling so that an update can be replayed after a system failure that interrupts it before it has been fully committed to storage.

Current state

ZFS development is currently driven by the open source group OpenZFS (http://www.open-zfs.org). Although ZFS was originally developed for Solaris, it is currently used primarily on FreeBSD and Linux (http://www.zfsonlinux.org).

FreeNAS (http://www.freenas.org) and NAS4Free (http://www.nas4free.org) are two popular lines of NAS products incorporating ZFS as the main file system.

What is RAID-Z?

RAID-Z protects data against physical drive failure by storing data redundantly across multiple drives. RAID-Z is similar to standard RAID but is integrated with ZFS. In standard RAID, the RAID layer is separate from and transparent to the file system layer. The file system is unaware of the underlying RAID storage scheme and reads from and writes to the RAID storage as if it were a single virtual drive. The RAID layer maps data blocks to physical hard drive blocks.

In RAID-Z the two layers become one. ZFS operates directly on physical blocks. Parity data is computed separately for each data extent following the RAID-Z scheme. The virtual address and length of an extent determine its physical location.

Single, double and triple parity

RAID-Z supports single (RAID-Z1), double (RAID-Z2) or triple (RAID-Z3) parity or no parity (RAID-Z0). Reed-Solomon is used for double and triple parity.

ZFS and RAID-Z recovery issues

QueTek® programmers have developed advanced techniques to recover ZFS and RAID-Z data. The difficulties we have encountered are discussed below.

Long chain of control blocks

ZFS contains cascaded chains of metadata objects that must be followed in order to get to the data. In a typical ZFS configuration a data recovery program has to do the following:

  • Read and parse the name-value pair list in the vdev label.
  • Choose the most current uberblock.
  • Read the Meta Object Set (MOS) using the main block pointer in the uberblock.
  • Read the Object Directory to determine the position in the MOS of the DSL directory.
  • Read the DSL directory to determine the position in the MOS of the DSL dataset.
  • Note that at this point if the dataset is nested, the previous two steps are repeated for each nesting level.
  • Read the DSL dataset to determine the location (not in the MOS) of the File System dataset.
  • Read the File System dataset to determine the location of the File System Objects (FSO) array.
  • Read the Master Node in the FSO array to determine the position in the FSO array of the root node.
  • From the root node traverse the file system tree.
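To make one of these steps concrete: choosing the most current uberblock amounts to scanning the uberblock array in each vdev label and keeping the candidate with the highest transaction group number whose checksum verifies. A hedged sketch, using a hypothetical parsed `Uberblock` record rather than the actual on-disk layout:

```python
from dataclasses import dataclass

@dataclass
class Uberblock:                  # hypothetical parsed record, not the on-disk format
    txg: int                      # transaction group number
    timestamp: int
    checksum_valid: bool          # result of verifying the embedded checksum

def most_current(uberblocks):
    """Pick the valid uberblock with the highest txg (ties broken by timestamp)."""
    valid = [ub for ub in uberblocks if ub.checksum_valid]
    if not valid:
        return None               # no usable uberblock; recovery must scan deeper
    return max(valid, key=lambda ub: (ub.txg, ub.timestamp))

candidates = [Uberblock(100, 5, True), Uberblock(102, 7, False), Uberblock(101, 6, True)]
assert most_current(candidates).txg == 101    # 102 is rejected: bad checksum
```

Every subsequent step in the chain depends on this one: a wrong or stale uberblock sends the walk down an outdated tree, which is why a recovery program must verify checksums rather than trust the highest txg blindly.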

Integration of RAID layer and file system storage layer

In conventional RAID, the RAID layer is separate from the file system layer. The latter maps a virtually contiguous space of a file to (potentially noncontiguous) extents in the volume space. Similarly the RAID layer maps the virtually contiguous space of a volume to physical hard drive blocks.

ZFS maps virtual file space directly to physical blocks.

When ZFS metadata is lost, a recovery program faces three difficult tasks:

  1. Distinguish data from parity data.
  2. Determine sector sequence for sequential scanning.
  3. Match parity data to data.

Distinguish data from parity data

In RAID 5, parity data is laid out in a fixed, repeating pattern. For example, on a member drive of a 4-drive RAID 5 with a 64 KB block size and backward symmetric rotation, a parity block follows every three data blocks. The mapping from physical to logical sector numbers is depicted below. The second to fifth columns contain the volume logical sector numbers:

Physical sector   Disk 0         Disk 1         Disk 2         Disk 3
0-127             0-127          128-255        256-383        Parity block
128-255           512-639        640-767        Parity block   384-511
256-383           1024-1151      Parity block   768-895        896-1023
384-511           Parity block   1152-1279      1280-1407      1408-1535

Figure 1

For example volume logical sector 640 is mapped to physical sector 128 on Disk 1.
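Because the rotation is fixed, the logical-to-physical mapping of figure 1 can be computed directly. A sketch for this particular geometry (4 drives, 128-sector blocks, backward/left-symmetric rotation; the formulation is ours):

```python
BLOCK = 128   # sectors per RAID block (64 KB)
N = 4         # number of member drives

def raid5_map(logical_sector):
    """Map a volume logical sector to (disk, physical sector) for left-symmetric RAID 5."""
    block = logical_sector // BLOCK
    row, idx = divmod(block, N - 1)       # stripe row and data-block index within the row
    parity_disk = (N - 1 - row) % N       # parity rotates backward one drive per row
    disk = (parity_disk + 1 + idx) % N    # data follows the parity block, wrapping around
    return disk, row * BLOCK + logical_sector % BLOCK

assert raid5_map(640) == (1, 128)    # the example above: logical 640 -> Disk 1, sector 128
assert raid5_map(0) == (0, 0)
assert raid5_map(384) == (3, 128)    # first data block of row 1 lands on Disk 3
```

This is exactly why RAID 5 recovery can classify any sector without file system help: the mapping depends only on the geometry, never on per-extent metadata.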

Because the pattern is fixed, a data recovery program can always determine whether a specific sector contains data or parity data.

In RAID-Z the size of an extent varies between 1 and 256 sectors (assuming 512 bytes per sector). The length and location of an extent are stored in the Data Virtual Address (DVA). For example a 4-drive, single-parity RAID-Z may store a 10-sector extent in the pattern below. The second to fifth columns contain the virtual sector numbers of the data extent:

Physical sector   Disk 0   Disk 1   Disk 2   Disk 3
1000                                Parity   0
1001              4        7        Parity   1
1002              5        8        Parity   2
1003              6        9        Parity   3

Figure 2

For example sector 7 of the extent is mapped to physical sector 1001 on Disk 1.

For another extent the mapping can be as follows:

Physical sector   Disk 0   Disk 1   Disk 2   Disk 3
1000                                         Parity
1001              0        4        7        Parity
1002              1        5        8        Parity
1003              2        6        9        Parity
1004              3

Figure 3

Or for another extent:

Physical sector   Disk 0   Disk 1   Disk 2   Disk 3
1000              Parity   0        4        7
1001              Parity   1        5        8
1002              Parity   2        6        9
1003              Parity   3

Figure 4

Sector 1001 on Disk 2 may be a data or parity sector, depending on the context.
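The three layouts in figures 2 through 4 all follow one rule: the extent's starting offset determines the first column, which holds parity; data columns follow to the right with wrap-around; columns that wrap past the last drive start one row lower; and leftover sectors go to the leading data columns. A sketch of that rule (a simplified model consistent with the figures, not the actual ZFS code):

```python
def raidz1_layout(start_row, first_disk, data_sectors, ndisks):
    """Return {(disk, physical sector): 'P' or data-sector index} for one extent."""
    q, r = divmod(data_sectors, ndisks - 1)   # full rows, plus r leftover sectors
    layout, idx = {}, 0
    for col in range(ndisks):                 # column 0 holds the parity sectors
        disk = (first_disk + col) % ndisks
        # columns that wrap past the last drive start one physical sector lower
        row = start_row + (1 if first_disk + col >= ndisks else 0)
        # the parity column and the first r data columns get one extra sector
        height = q + (1 if (r and col == 0) or (0 < col <= r) else 0)
        for k in range(height):
            layout[(disk, row + k)] = 'P' if col == 0 else idx
            if col:
                idx += 1
    return layout

fig2 = raidz1_layout(1000, 2, 10, 4)
assert fig2[(1, 1001)] == 7       # sector 7 of the extent -> Disk 1, physical 1001
assert fig2[(2, 1000)] == 'P'
fig3 = raidz1_layout(1000, 3, 10, 4)
assert fig3[(0, 1004)] == 3       # the first data column wraps and runs one row longer
```

The point for recovery is that `first_disk` and `data_sectors` come from the extent's DVA metadata. Without that metadata, the same physical sector is consistent with several different layouts, which is precisely the ambiguity described above.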

Determine sector sequence for sequential scanning

In figure 2 sector 1003 on Disk 0 is followed by sector 1001 on Disk 1. But in figure 3 it is followed by sector 1004 on Disk 0. Without the extent context, a program cannot determine the sector sequence.

Match parity data to data

In a RAID-Z stripe, the data blocks and corresponding parity block may not start on the same physical sector number. In figure 2, sector 1000 on Disk 2 contains the parity data for sector 1000 on Disk 3 and sector 1001 on Disk 0 and Disk 1. In figure 3, sector 1000 on Disk 3 contains the parity data for sector 1001 on Disk 0, Disk 1 and Disk 2. In figure 4, sectors 1000 to 1002 on Disk 0 contain the parity data for the same sector numbers on all the other drives. Sector 1003, however, contains the parity data only for sector 1003 on Disk 1.

Without ZFS extent metadata, it is extremely difficult to match a parity block to the corresponding data blocks.

Long scan

The File Scavenger® long scan is used when file system metadata is incomplete. Metadata is usually structured in a tree topology. If a tree is partially corrupted, many branches may not have a path to the root. The File Scavenger® quick scan starts from the root and traverses all branches and sub-branches until the entire tree is scanned. This scan will miss branches disconnected from the root. A long scan examines every sector to look for disconnected branches.

Performing a long scan on RAID-Z is extremely difficult because of the issues discussed in the previous section. For example:

  • When the program sees a sector containing metadata, it must determine if it is actual metadata or merely parity artifacts.
  • If a metadata object spans multiple sectors, the program cannot easily determine the correct sector sequence to read the entire object.

Missing drive

A missing drive in RAID 5 can be completely rebuilt using the remaining drives regardless of the file system status. However in RAID-Z each stripe is a data extent. The missing drive can be completely rebuilt only if all extent metadata is intact. For a corrupted ZFS volume, rebuilding a missing drive is very difficult.

File undeletion

Files are ultimately stored in the File System Objects (FSO) array of a dataset. Both the file and the parent folder are dnodes in the array. The parent dnode contains the filename. The file dnode contains the file extents, size, dates, etc.

When a file is deleted ZFS updates the FSO extent containing the file dnode and the parent dnode extent containing the filename. Thanks to copy-on-write the original extents still exist on the hard drive. A data recovery program must find the two extents and match the file dnode to the correct filename item in the parent folder extent. The program can reconstruct the complete path because the path to the parent folder is still valid if the parent folder has not been deleted.

Matching the file dnode to the correct filename is not an easy task. The filenames in the parent dnode are indexed by the position of the file dnode in the FSO array. The recovered file dnode does not contain its position in the array. Matching the folder dnode extent to the correct parent folder dnode is also difficult.

Presently File Scavenger® does not offer a general solution to ZFS undeletion due to the complexity of the tasks involved. Our staff can perform undeletion on a fee-based, case-by-case basis.

Raw recovery

Raw recovery is a method where files are recovered based on data patterns instead of file system metadata. It is used in the absence of metadata, and the results are usually unstructured files with generic filenames. We will discuss this in detail by contrasting ZFS with NTFS. In NTFS the metadata for a file is stored in a FILE record, which includes the filename, dates and the location of the data. A FILE record is stored separately from the actual file data.

If an NTFS volume is corrupted but the FILE records are still intact, a data recovery program can look for FILE records and use the metadata to recover the corresponding files.

When a FILE record is lost, the corresponding file may still be intact but its name and location are not known. This is when raw recovery comes into play. Many types of files contain identifiable header patterns. For example a bitmap file (extension .bmp) starts with a header that contains a special signature and the size of the file. Upon detecting the bitmap signature in a sector, a program knows the sector is the beginning of a bitmap file. With the file size in the header, the program knows where the file ends, assuming the file data is contiguous. (If the file data is fragmented, raw recovery is not possible and file carving techniques must be used. See http://www.quetek.com/data_recovery_techniques.htm#file_carving)
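The bitmap case can be sketched as follows. A BMP file begins with the two-byte signature `BM`, followed by the total file size as a 4-byte little-endian integer at offset 2. This sketch assumes the file data is contiguous, as the paragraph above does:

```python
import struct

SECTOR = 512

def carve_bmp(image: bytes, sector: int):
    """If the sector starts a BMP header, return the (assumed contiguous) file bytes."""
    off = sector * SECTOR
    header = image[off:off + 6]
    if header[:2] != b'BM':
        return None                                   # no bitmap signature here
    size = struct.unpack_from('<I', header, 2)[0]     # total file size from the header
    return image[off:off + size]

# Hypothetical disk image: one garbage sector, then a tiny fabricated BMP.
bmp = b'BM' + struct.pack('<I', 70) + bytes(64)       # a 70-byte "file"
image = bytes(SECTOR) + bmp + bytes(SECTOR - len(bmp))
assert carve_bmp(image, 0) is None
assert len(carve_bmp(image, 1)) == 70
```

On a plain volume a program can run this test on every sector. On RAID-Z, as the next paragraph explains, even this simple scan breaks down because the sector after the header may live on a different drive.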

RAID-Z raw recovery is very difficult. Upon detecting a file header, a data recovery program must determine the correct sector sequence to end of file. As discussed in previous sections, this is very difficult when file system metadata is incomplete.

Single parity RAID-Z versus RAID 5 performance comparison

We will compare the performance of single parity RAID-Z to RAID 5. The latter is by far the most popular RAID configuration.

No "write hole"

RAID-Z is immune to the RAID 5 "write hole", where a stripe may become inconsistent if data is not completely written to one or more drives due to an interruption such as a loss of power. In RAID 5 a write operation involves writing the data and the corresponding parity data, which requires writing to at least two drives. If power is lost before one of the drives is written to, the data and parity data fall out of sync.

RAID-Z protects data with copy-on-write. The new data is first written to a new location. Then the reference to the data is updated in an atomic operation (i.e., an operation that is either performed completely or not performed at all).

One may wonder if ZFS on RAID 5 is vulnerable to the write hole. The following events occur when data at location A is modified:

  • ZFS reads the data at location A. (This data is referenced by the metadata at another location Z.)
  • The data is modified in memory.
  • ZFS allocates location B and writes the modified data to location B.
  • ZFS updates the metadata at location Z to reference the new location B.

In RAID 5 data is updated on a per stripe basis because each stripe contains a parity block that must stay in sync with the data. In the example above both A and B may be in the same stripe. An incomplete write may affect both A and B; therefore the vulnerability still exists. Location Z may also be in the same stripe.

In RAID-Z the parity data is maintained per data extent. In the example above A and B are two different extents. An incomplete write only affects B.

Therefore ZFS on RAID 5 storage is still vulnerable to the write hole. RAID-Z wins.

Read-modify-write cycle

In RAID 5 data is stored as blocks striped across all drives. One block per stripe holds parity data. When even a small change is made to one block, the RAID controller must update the corresponding parity block. In order to compute the new parity block, the RAID controller must read all data blocks in the stripe. An example of a stripe on a 4-drive RAID 5 is depicted below:

Disk 0    Disk 1    Disk 2    Disk 3
Block 0   Block 1   Block 2   Parity block

Assuming block 0 is modified, the sequence of operations is as follows:

  • Read block 0.
  • Modify block 0 in memory.
  • Read block 1 and block 2.
  • Compute a new parity block.
  • Write the updated block 0 to Disk 0 and the new parity block to Disk 3.
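With XOR parity there is also a well-known shortcut: instead of re-reading blocks 1 and 2, the controller can read the old block 0 and the old parity block, and compute new parity = old parity XOR old block 0 XOR new block 0. Either route yields the same parity, since XORing a value twice cancels it out; a sketch:

```python
def xor_blocks(*blocks):
    """XOR byte blocks of equal length, as RAID 5 does to compute parity."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

b0, b1, b2 = b'\x0f' * 4, b'\x33' * 4, b'\x55' * 4
parity = xor_blocks(b0, b1, b2)        # parity for the original stripe

new_b0 = b'\xaa' * 4
# Full recompute (read b1 and b2) vs. read-modify-write shortcut (read b0 and parity):
assert xor_blocks(new_b0, b1, b2) == xor_blocks(parity, b0, new_b0)
```

Either way, a small write still costs extra reads and a parity write, which is the read-modify-write penalty this section describes.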

In RAID-Z the parity of each data extent is independently computed. It is not necessary to read other extents. RAID-Z also wins here.

Reading small data extents

In RAID 5 data extents up to the RAID block size may be stored entirely on one drive. Reading a small extent may require only one read.

RAID-Z stripes data on as many drives as possible with the smallest stripe size being a sector. Only a one-sector extent is stored entirely on one drive. A two-sector extent is striped across two drives. Larger extents are striped across more drives up to n-1 (n is the total number of drives). For example in a 6-drive RAID-Z configuration, a 5-sector extent may be striped across 5 drives as shown below:

Disk 0          Disk 1     Disk 2     Disk 3     Disk 4     Disk 5
Parity sector   Sector 0   Sector 1   Sector 2   Sector 3   Sector 4

Reading small extents requires significantly more reads in RAID-Z, especially with a large number of drives. RAID 5 wins hands down.

Parity overhead

RAID 5 uses the equivalent of one drive for parity data. For example a 4-drive RAID 5 uses one-fourth of its total capacity for parity data, i.e., one parity block for every three data blocks (33% overhead relative to the data).

RAID-Z overhead is equal to RAID 5 in the best case scenario when the size of a data extent is a multiple of the number of drives less one. In the example above an extent of 3 (or 6, 9, etc.) sectors has 33% overhead. Other extent sizes incur more overhead. A 10-sector extent requires 4 parity sectors, or 40% overhead. In the worst case a one-sector extent requires one sector for parity data, or 100% overhead. The typical ZFS extent size is 128 sectors. In a 4-drive RAID-Z configuration that is 34% overhead.
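These figures follow directly from the parity-sector count: a single-parity RAID-Z extent of s data sectors on n drives needs one parity sector per stripe row, i.e. ceil(s / (n - 1)) parity sectors. A quick check of the numbers above:

```python
import math

def parity_sectors(data_sectors, ndrives):
    """Parity sectors a single-parity RAID-Z extent needs: one per stripe row."""
    return math.ceil(data_sectors / (ndrives - 1))

def overhead(data_sectors, ndrives):
    """Parity overhead relative to the data, as a fraction."""
    return parity_sectors(data_sectors, ndrives) / data_sectors

assert parity_sectors(3, 4) == 1            # 33% overhead, the best case
assert parity_sectors(10, 4) == 4           # 40% overhead
assert parity_sectors(1, 4) == 1            # 100% overhead, the worst case
assert round(overhead(128, 4), 2) == 0.34   # typical 128-sector extent: ~34%
```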

RAID-Z also incurs overhead from its inability to use an isolated free sector: each extent requires at least two sectors, one for data and one for parity.

Therefore a typical single-parity RAID-Z incurs about one percentage point of additional overhead compared to RAID 5.

128 KB extent size

The maximum ZFS extent size (or block size in ZFS terminology) is 128 KB. With copy-on-write a data extent is written to a new location when it is modified even by the smallest amount. A smaller extent size reduces the time taken to write it to a new location. However this increases fragmentation. Naturally contiguous data (such as a file being copied in whole from another volume) may become fragmented.

At first glance 128 KB seems to be a bottleneck for reading large files. In practice ZFS can store a large file in contiguous extents so that reading a large chunk of data requires only one read per drive. However, if the file is modified, copy-on-write will relocate any modified extents, thus causing fragmentation.

Fragmentation is an issue of copy-on-write rather than RAID-Z. ZFS on a single drive faces the same problem.