Copyright © 2015 by QueTek Consulting Corporation. All rights reserved.
What is Btrfs?
Btrfs is a copy-on-write (COW) file system for Linux. It is expected to replace ext4 as the default file system in Linux. It uses the B-tree as its main on-disk data structure. Its advanced features include:
- Multiple physical devices.
- Snapshots, sub-volumes and clones.
- Checksums for both metadata and data.
- Transparent compression (native encryption is not implemented).
- Efficient storage for small files.
Btrfs improves file system reliability over ext4 thanks to these new features:
- Copy-on-write (COW) semantics.
- Checksums (CRC32C) for both metadata and user data.
- Redundant metadata.
COW provides quick creation of subvolumes and snapshots. More importantly, it ensures system consistency and easy roll-back in case of corruption.
Checksums protect data integrity. Combined with redundant metadata, they provide an efficient way to detect corruption and correct it on the fly from an intact copy.
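The following is a minimal sketch of how a checksum and a redundant copy can work together. It implements the standard CRC32C (Castagnoli) checksum in pure Python and uses it to choose an intact copy among several. The helper name read_with_repair and the idea of passing the copies in as byte strings are assumptions made for illustration; how Btrfs seeds and stores the checksum on disk is not covered here.

```python
def _make_table():
    """Build the 256-entry lookup table for CRC32C (reflected polynomial 0x82F63B78)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Return the standard CRC32C of data (initial value and final XOR of 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def read_with_repair(copies, expected_crc):
    """Hypothetical helper: return (good_data, indexes_of_bad_copies).

    copies is a list of byte strings read from the redundant locations.
    The first copy whose checksum matches is returned; copies that fail
    are reported so that a caller could rewrite them from the good one.
    """
    good = None
    bad = []
    for index, data in enumerate(copies):
        if crc32c(data) == expected_crc:
            if good is None:
                good = data
        else:
            bad.append(index)
    if good is None:
        raise IOError("all copies fail the checksum")
    return good, bad
```

With two copies of a block, one of them damaged, read_with_repair returns the intact data together with the index of the damaged copy.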
The on-disk format of the current version of Btrfs is stable. The on-disk structures are not expected to change, and even if they do, a legacy volume is expected to remain mountable. However, the code is still under heavy and fast development, so one should always run the latest available kernel.
Data storage schemes
Btrfs has its own RAID-like mechanism. It currently supports RAID 0, RAID 1 and RAID 10. RAID 5 and RAID 6 are experimental and not recommended for production data. RAID 2, RAID 3 and RAID 4 are not supported.
Btrfs divides its virtual space into chunks of 256 MB or more. Each chunk has its own mapping style (single drive, duplicate or stripe) and is managed by a CHUNK_ITEM node in the chunk tree. This implementation enables Btrfs to support devices of unequal sizes and even different RAID levels within a volume.
Btrfs treats metadata differently than user data. Metadata does not share chunks with user data. Chunks for metadata are usually stored in duplicate form (RAID 1 mapping style). When there is only one drive, metadata chunks are stored redundantly at two physical locations.
In the event of hardware failure, Btrfs allows rebalancing the chunk array in different ways. If the failed drive is replaced, it will be rebuilt similarly to a hardware-based RAID. Otherwise, Btrfs can rebalance the array to a smaller one.
The data storage schemes in the current version of Btrfs have some drawbacks:
- Error handling does not work well, especially when dealing with out-of-date or physically corrupt hardware.
- The rebuilding procedure is neither straightforward nor easy to use.
Btrfs also supports live storage reconfiguration, a feature that is unavailable in other file systems. It can convert a volume from one RAID level to another while the system is running. Of course, one cannot utilize the features of the new RAID level until the conversion is done.
Subvolume and Snapshot
Subvolumes and snapshots are new features of next-generation file systems. The user can create multiple file systems on a single virtual device, which could be one drive or an array of drives.
When a Btrfs volume is formatted, one root volume (with subvolume ID 5) is created. Subvolumes and snapshots are created thereafter and appear as directories in the root volume. However, they are treated differently than a directory. For example, creating a link to a subvolume causes an "Invalid cross-device link" error. To browse the contents of a subvolume, one must mount it.
A subvolume can be accessed in two ways:
- From the parent subvolume in which case it is treated like a directory.
- From a separate mounting point.
A subvolume is treated as its own file system. Like a partition in other file systems, each object in a subvolume is indexed by its own inode number, starting from 257. The inode number of the root directory (always 256) is recorded in the ROOT_ITEM structure. Unlike a partition, a Btrfs subvolume can grow dynamically as needed and does not need pre-allocated disk space.
Each subvolume is managed by an FS tree. Each FS tree has its own number. The top-level (root) volume is numbered 5. The next subvolumes (and snapshots) will be 256, 257 and so forth.
A snapshot is a subvolume whose content is a copy of the current state of some other subvolume (or even the root volume). Thanks to COW, after a snapshot is taken any change to the snapshot does not affect the original subvolume and vice versa.
Unlike the Windows Volume Shadow Copy Service (VSS) used with NTFS, a snapshot in Btrfs can be created instantly. The newly created snapshot simply has a reference to the current root of the subvolume that it is based on. A snapshot can be mounted, browsed and modified as an ordinary subvolume.
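The idea that a snapshot is just another reference to an existing tree root can be illustrated with a toy copy-on-write tree. The sketch below uses a plain binary search tree with immutable nodes rather than the real Btrfs B-tree, and the names Node, insert and lookup are invented for this example. An update copies only the nodes on the path from the root to the changed item; everything else stays shared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable tree node: once written, it is never modified in place."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int, value: str) -> Node:
    """Return a new root; only the nodes on the path to key are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)


def lookup(root: Optional[Node], key: int) -> Optional[str]:
    """Return the value stored under key, or None if it is absent."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# Build a small "subvolume" tree.
subvol = None
for k, v in [(50, "a.txt v1"), (30, "b.txt v1"), (70, "c.txt v1")]:
    subvol = insert(subvol, k, v)

# Taking a snapshot copies one reference, regardless of the tree size.
snapshot = subvol

# A later change to the subvolume creates a new root and leaves the snapshot alone.
subvol = insert(subvol, 30, "b.txt v2")
```

After the last line, the snapshot still sees "b.txt v1" while the subvolume sees "b.txt v2", and the untouched subtree under key 70 is the same object in both trees.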
However, snapshots have limitations:
- A snapshot is not a backup. If data becomes corrupt due to, for example, bad sectors, both the snapshot and the original subvolume will be damaged.
- Only a subvolume or snapshot can be used as source to create a snapshot. A directory cannot be.
- A snapshot is not recursive, i.e. it will not contain any enclosed subvolumes.
The chunk tree is one of the most important metadata objects needed for Btrfs data recovery. The device tree and the chunk tree (whose tree ID is 3) are the main mapping tables between logical and physical addresses. In order to locate the chunk tree, duplicate copies of system chunks are stored in the superblocks. This helps to load the root of the chunk tree at boot time.
There are two types of keys in the chunk tree:
- DEV_ITEM: stores information about all underlying devices in the file system.
- CHUNK_ITEM: maps a range of logical addresses to physical addresses on one or more devices.
If the chunk tree is unavailable, only raw recovery based on file signatures is possible.
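To make the role of the chunk mapping concrete, here is a minimal sketch of translating a logical address into device offsets. The Chunk and Stripe classes, the profile strings and the 64 KiB stripe length are assumptions made for this example; they model the kind of information a CHUNK_ITEM carries and are not the on-disk layout. Only the single, duplicate, RAID 1 and RAID 0 styles are handled.

```python
from dataclasses import dataclass
from typing import List, Tuple

STRIPE_LEN = 64 * 1024  # assumed stripe length for striped chunks


@dataclass
class Stripe:
    devid: int   # device that holds this stripe
    offset: int  # physical byte offset of the stripe on that device


@dataclass
class Chunk:
    logical: int           # first logical address covered by the chunk
    length: int            # number of logical bytes covered
    profile: str           # "single", "dup", "raid1" or "raid0"
    stripes: List[Stripe]


def map_logical(chunks: List[Chunk], logical: int) -> List[Tuple[int, int]]:
    """Return every (devid, physical_offset) that holds the given logical byte."""
    for chunk in chunks:
        if chunk.logical <= logical < chunk.logical + chunk.length:
            off = logical - chunk.logical
            if chunk.profile in ("single", "dup", "raid1"):
                # Every stripe holds a full copy at the same relative offset.
                return [(s.devid, s.offset + off) for s in chunk.stripes]
            if chunk.profile == "raid0":
                # Data is dealt out to the stripes in STRIPE_LEN units.
                stripe_nr, within = divmod(off, STRIPE_LEN)
                row, index = divmod(stripe_nr, len(chunk.stripes))
                s = chunk.stripes[index]
                return [(s.devid, s.offset + row * STRIPE_LEN + within)]
            raise ValueError("unsupported profile: " + chunk.profile)
    raise LookupError("no chunk covers logical address %#x" % logical)
```

A mirrored chunk returns one location per copy, which is what lets a reader fall back to the second copy when the first one is damaged. A striped chunk returns exactly one location.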
Deleted data is highly recoverable thanks to COW. When a file is deleted, the file system will update the associated nodes. New metadata will be created with the updated information. The superblock will be updated last in an atomic operation to point to the newly created root. The old chain of nodes remains intact, so it is possible to restore the deleted data with its full path, provided that neither the metadata nor the user data has been overwritten.
However, a difficulty when restoring data from a COW file system is in handling multiple copies of metadata. All but one copy are obsolete. For example, a Word document may have been modified a few times before it was accidentally deleted. To restore the current version of the file, one must pick the latest metadata copy in a set of tree nodes.
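One way to pick the latest copy is sketched below. It assumes that every metadata copy found by a scan carries a generation (transaction) number, as Btrfs tree block headers do, and that a higher generation means a newer copy. The FoundItem record and its fields are hypothetical and only stand in for whatever a scanner extracts from the tree nodes.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class FoundItem(NamedTuple):
    """Hypothetical record for one metadata copy found while scanning."""
    generation: int  # generation of the tree node that held this copy
    tree_id: int     # FS tree (subvolume) the copy belongs to
    inode: int       # inode number inside that subvolume
    name: str
    size: int


def latest_versions(items: Iterable[FoundItem]) -> Dict[Tuple[int, int], FoundItem]:
    """For every (tree_id, inode) keep only the copy with the highest generation."""
    best: Dict[Tuple[int, int], FoundItem] = {}
    for item in items:
        key = (item.tree_id, item.inode)
        if key not in best or item.generation > best[key].generation:
            best[key] = item
    return best
```

For a deleted file, the newest tree nodes no longer contain the file at all, so the copy selected this way is the last one written before the deletion.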
Raw recovery is a method where files are recovered based on their data patterns. Raw recovery is used when metadata is lost. Let us discuss this in detail by contrasting Btrfs with NTFS.
In an NTFS file system the metadata of a file is stored in a FILE record. The metadata includes the filename, dates, where its extents are located on disk, etc. If the NTFS volume is reformatted, the FILE record is usually still intact. Software can look for FILE records and use the metadata to recover the data extents and other attributes such as filename, dates, etc.
When the FILE record has been overwritten by new data, recovery is still possible for certain types of files that contain identifiable data patterns. An example is a bitmap file (extension .bmp). It starts with a bitmap header that contains a special signature and the size of the file. Upon detecting the signature, a program knows it is the beginning of a bitmap file. Having the size of the file, the program knows where the file ends, assuming the file is stored entirely in one disk extent. (If the file is stored in two disjoint extents, raw recovery is not possible and file carving techniques must be used. See http://www.quetek.com/data_recovery_techniques.htm#file_carving)
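The bitmap example can be sketched in a few lines. A BMP file begins with the two bytes "BM", followed by the file size as a 32-bit little-endian integer, two reserved 16-bit fields and the 32-bit offset of the pixel data. The function name recover_bmps, the max_size limit and the sanity checks are choices made for this sketch, and the code assumes, as the text does, that each file occupies one contiguous extent.

```python
import struct

HEADER_LEN = 14  # "BM" + file size (4) + two reserved fields (2 + 2) + data offset (4)


def recover_bmps(image: bytes, max_size: int = 64 * 1024 * 1024):
    """Return a list of (offset, data) for each plausible contiguous BMP in image."""
    results = []
    pos = 0
    while True:
        pos = image.find(b"BM", pos)
        if pos < 0 or pos + HEADER_LEN > len(image):
            break  # no further signature with a complete header
        size, _res1, _res2, data_off = struct.unpack_from("<IHHI", image, pos + 2)
        plausible = HEADER_LEN <= data_off <= size <= max_size
        if plausible and pos + size <= len(image):
            results.append((pos, image[pos:pos + size]))
            pos += size  # continue scanning after the recovered file
        else:
            pos += 1     # false hit: "BM" also occurs in ordinary data
    return results
```

Because the two-byte signature is short, false hits are common in real data, which is why the header fields are checked before a candidate is accepted. A production tool would validate much more of the header.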
In Btrfs, scanning for the FS tree nodes is possible. However, like other COW file systems, one must handle multiple nodes with slightly different contents. It is crucial that the most up-to-date nodes be selected.
If the metadata is lost, raw recovery based on file signatures is possible for contiguous files. If a deleted file is fragmented and its metadata is lost, raw recovery is not possible and file carving techniques must be used. See http://www.quetek.com/data_recovery_techniques.htm#file_carving