Waiting for a fsck to complete on a server system can tax your patience more than it should. Fortunately, a new breed of filesystem is coming to your Linux machine soon. Journaling filesystems maintain a special file called a log (or journal), the contents of which are not cached. Whenever the filesystem is updated, a record describing the transaction is added to the log. An idle thread processes these transactions, writes data to the filesystem, and flags each processed transaction as completed. If the machine crashes, the background process is run on reboot and simply finishes copying updates from the journal to the filesystem. Incomplete transactions in the journal file are discarded, so the filesystem's internal consistency is guaranteed.
This cuts the complexity of a filesystem check by a couple of orders of magnitude. A full-blown consistency check is never necessary (in contrast to ext2fs and similar filesystems) and restoring a filesystem after a reboot is a matter of seconds at most.
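The log-and-replay idea described above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: the `Journal` class, the record layout, and the dictionary standing in for the disk are all hypothetical and do not correspond to any real filesystem's on-disk format.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Toy in-memory journal (hypothetical; not a real filesystem API)."""
    records: list = field(default_factory=list)

    def begin(self, txid):
        self.records.append(("begin", txid, None, None))

    def write(self, txid, key, value):
        self.records.append(("write", txid, key, value))

    def commit(self, txid):
        self.records.append(("commit", txid, None, None))


def replay(journal, disk):
    """Apply writes of committed transactions to `disk`; drop the rest."""
    committed = {txid for kind, txid, _, _ in journal.records if kind == "commit"}
    for kind, txid, key, value in journal.records:
        if kind == "write" and txid in committed:
            disk[key] = value
    journal.records.clear()  # the journal is empty once replay has finished
    return disk


journal = Journal()
disk = {"inode 7": "old"}

# Transaction 1 is fully logged and committed.
journal.begin(1)
journal.write(1, "inode 7", "new")
journal.commit(1)

# Transaction 2 is cut short by a simulated crash: no commit record.
journal.begin(2)
journal.write(2, "inode 9", "half-written")

recovered = replay(journal, dict(disk))
print(recovered)
```

In this model, recovery cost depends on the length of the journal and not on the size of the volume, which is the reason a replay is so much cheaper than a full consistency check.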
Today, at least four major players exist in the Linux journaling filesystem arena. They are in various stages of completion, with some of them becoming ready for use in production systems. They are ReiserFS, XFS, JFS, and ext3fs.
Each offers distinct advantages. A detailed technical comparison is available from issue 55 of Linux Gazette.
Most of the available options provide support for dynamically extending the filesystems using a logical volume manager (such as LVM), which makes them perfect for large server installations.
ReiserFS is a radical departure from the traditional Unix filesystems, which are block-structured. It will be available in the upcoming Red Hat 7.1 distribution and is already available in SuSE Linux 7.0.
Hans Reiser writes about the filesystem he designed: "In my approach, I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment and eliminating the use of a fixed space allocation for inodes." The effect is that a wide array of common operations, such as filename resolution and file accesses, are optimized when compared to traditional filesystems such as ext2fs. Furthermore, optimizations for small files are well developed, reducing storage overheads due to fragmentation.
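To get a feeling for why relaxing block alignment helps with small files, consider the following back-of-the-envelope sketch. The file sizes and the 4096-byte block size are assumptions chosen for illustration, and the calculation ignores all metadata and tree overhead; it is not a model of the actual ReiserFS layout.

```python
def block_aligned_bytes(sizes, block_size=4096):
    """Space used when every file occupies a whole number of blocks."""
    return sum(-(-size // block_size) * block_size for size in sizes)


def packed_bytes(sizes, block_size=4096):
    """Space used when file contents are packed back to back."""
    total = sum(sizes)
    return -(-total // block_size) * block_size


# Hypothetical file sizes in bytes: three small files and one larger file.
sizes = [100, 700, 1500, 5000]

aligned = block_aligned_bytes(sizes)
packed = packed_bytes(sizes)
print(aligned, packed)  # 20480 8192
```

With these made-up numbers, the block-aligned layout needs five blocks while the packed layout needs two. The gap grows with the number of files that are much smaller than a block.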
ReiserFS is not yet a true journaling filesystem (although full journaling support is currently under development). Instead, buffering and preserve lists are used to track all tree modifications, which achieves a very similar effect. This reduces the risk of filesystem inconsistencies in the event of a crash and thus provides rapid recovery on restart.
Besides offering rapid restart capability after a crash and efficient storage of large numbers of small files, it is the developers' intention to offer facilities to store objects much smaller than those that are normally saved as separate files. Future design plans include adding set-theoretic semantics, making it possible to retrieve files by specifying their attributes instead of an explicit pathname.
ReiserFS was the first of this new breed that managed to be included in the standard Linux kernel distribution, giving it a head start in building a user community.
When SGI needed a high-performance, scalable filesystem to replace EFS in 1990, it developed XFS to handle the demands of increased disk capacity and bandwidth, and of the parallelism required by new applications such as film, video, and large databases. These demands included extremely fast crash recovery, support for large filesystems, directories with large numbers of files, and fair performance with small and large files. Now SGI is contributing this technology to the Open Source community and is in the process of finalizing its port to Linux.
Technically, XFS is based on the use of B+ trees (similar to the use of balanced trees in ReiserFS) to replace the conventional linear file system structure. B+ trees provide an efficient way to index directory entries and manage file extents, free space, and filesystem metadata. This guarantees quick directory listing and file accesses. The allocation of disk blocks to inodes is done dynamically, which means that you no longer need to create a filesystem with smaller block sizes for your mail server; your filesystem will handle this automatically for you. XFS is also a 64-bit filesystem, which theoretically allows the creation of files that are a few million terabytes in size, which compares favorably to the limitations of 32-bit filesystems. The ability to attach free-form metadata tags to files on an XFS volume is yet another useful feature of this filesystem.
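The "few million terabytes" figure can be checked with a little arithmetic. The sketch below assumes that the file size limit is set by a signed byte offset of the given width; that assumption, and the helper name `max_file_bytes`, are for illustration only, and real limits also depend on the kernel and the tools in use.

```python
TIB = 2 ** 40  # bytes in one terabyte (binary)
GIB = 2 ** 30  # bytes in one gigabyte (binary)


def max_file_bytes(offset_bits, signed=True):
    """Largest addressable file size for a byte offset of the given width."""
    usable_bits = offset_bits - 1 if signed else offset_bits
    return 2 ** usable_bits


limit64 = max_file_bytes(64)
limit32 = max_file_bytes(32)

print(limit64 // TIB)  # 8388608 terabytes
print(limit32 // GIB)  # 2 gigabytes
```

Under this assumption, a 64-bit offset reaches about 8 million terabytes, whereas a signed 32-bit offset stops at 2 gigabytes.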
XFS also contains good support for multiprocessor machines. This is visible in the implementation of the page buffer subsystem, which uses an AVL tree that is kept separate from the objects to avoid locking problems and cache thrashing on larger SMP systems. Multithreaded operation has been a declared design goal of this filesystem and has been well tested in large multiprocessor IRIX systems worldwide.
The Linux port is still undergoing development and some features are still to be finalized. For example, loop-mounting a file containing an XFS volume does not yet work reliably. The X/Open data management API provided on IRIX is still incomplete in the Linux port, and guaranteed rate I/O remains an IRIX exclusive so far. Even now, XFS is more than just a viable alternative on Linux. I've personally used it for a few months on my own systems and have been very happy with its performance, which is at least on a par with ext2fs. Now that an installable CD image (based on the first CD of the Red Hat 7.0 distribution) is available for download, it will be even easier to enjoy the benefits of this filesystem. The user-level tools for filesystem creation, maintenance, and resizing are more functional and easier to use than their ReiserFS counterparts, mostly because they have been around for far longer.
So why should one switch to XFS/Linux if ReiserFS will be readily available in Red Hat 7.1 and SuSE 7.0 (even though it will be a while until it is equally well integrated into and supported by the major distributions)? The main factors are trust, robustness, and maturity. XFS has been deployed on IRIX systems since 1994 and has been used in a wide array of mission-critical applications. It's a proven technology, while ReiserFS and ext3fs are relatively new without offering too much new functionality.
IBM's JFS is a journaling filesystem used in its enterprise servers. It was designed for "high-throughput server environments, key to running intranet and other high-performance e-business file servers" according to IBM's Web site. Judging from the documentation available and the source drops, it will still be a while before the Linux port is completed and included in the standard kernel distribution.
JFS offers a sound design foundation and a proven track record on IBM servers. It uses an interesting approach to organizing free blocks by structuring them in a tree and using a special technique to collect and group contiguous runs of free logical blocks. Although it uses extents for a file's block addressing, extents are not used to maintain the free space. Small directories are supported in an optimized fashion (i.e., stored directly within an inode), although with different limitations than those of XFS. However, small files cannot be stored directly within an inode.
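The general idea of collecting neighboring free blocks into groups can be sketched as follows. This is a deliberately simplified illustration, not the tree-based technique JFS actually uses; the function name `group_free_runs` and the block numbers are hypothetical.

```python
def group_free_runs(free_blocks):
    """Group free block numbers into (start, length) runs of neighbors."""
    runs = []
    for block in sorted(set(free_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # The block directly follows the last run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Otherwise the block starts a new run of length one.
            runs.append((block, 1))
    return runs


runs = group_free_runs([12, 3, 4, 5, 13, 20])
print(runs)  # [(3, 3), (12, 2), (20, 1)]
```

Knowing where the long runs are lets an allocator hand out contiguous space for a new file without scanning every free block one by one.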
The port of JFS is an interesting project and will benefit the Linux community. However, it seems to be farther from being usable for production systems than its competitors.
ext3fs is an alternative for all those who do not want to switch their filesystem, but require journaling capabilities. It is distributed in the form of a kernel patch and provides full backward compatibility. It also allows the conversion of an ext2fs partition without reformatting and a reverse conversion to ext2fs, if desired.
However, using such an add-on to ext2fs has the drawback that none of the advanced optimization techniques employed in the other journaling filesystems is available: no balanced trees, no extents for free space, etc.
My personal opinion on ext3fs is that it is about to meet its fate with the availability of more powerful journaling filesystems. A handful of successful sites, such as RPMFind, use this filesystem, but it lacks the momentum that the others have.
With the increasing size of hard disks, journaling filesystems are becoming important to an ever-increasing number of users. If you have ever waited for a filesystem check on a machine with an 80GB hard disk, you know what I'm talking about. Even if you do not plan to reboot your system often, they can save you a lot of time and trouble if you experience a power failure or a hardware glitch. With the large number of contenders striving to become the de facto standard in the journaling filesystem space on Linux, we can look forward to interesting months as these filesystems' code bases mature, are integrated into the standard kernel, and are supported in upcoming releases of the major Linux distributions.
However, keep in mind that migrating to another filesystem is not a trivial task. It usually requires backing up your data, reformatting, and restoring the data onto the newly created volume. You should thoroughly evaluate your options before making the switch.