Journaling File System
A journaling file system writes its intended changes to a journal before applying them. The technique is borrowed from databases (write-ahead logging) and produces two huge benefits: crash recovery in seconds instead of hours, and dramatically lower probability of corruption when something goes wrong. Modern file systems including NTFS, ext3/ext4, XFS, JFS, ReFS, and HFS+ all use journaling; ZFS, Btrfs, and APFS use copy-on-write to achieve similar safety through different means.
A journaling file system maintains a special on-disk area called a journal where it records intended changes before applying them to the main file system structures. The technique is borrowed from database write-ahead logging: write to the log first, then to the actual location. If a crash occurs partway through an operation, the journal lets the file system replay committed transactions and discard incomplete ones. The result is fast crash recovery (seconds, regardless of file system size) and dramatically lower corruption risk than non-journaled file systems where post-crash recovery requires full file system traversal via fsck.
What a Journaling File System Is
The Wikipedia journaling file system article captures the core mechanism: “A journaled file system allocates a special area, the journal, in which it records the changes it will make ahead of time.”1 The principle is borrowed directly from database systems: before making any change to the actual file system structures, write a description of the intended change to a separate, contiguous on-disk area; only after that journal entry is safely on disk does the file system proceed to apply the change to its target location.
The atomic transactions guarantee
Journaling produces a guarantee that the Wikipedia documentation states clearly: “The changes are thus said to be atomic (not divisible) in that they either succeed (succeeded originally or are replayed completely during recovery), or are not replayed at all (are skipped because they had not yet been completely written to the journal before the crash occurred).” This is the entire point of journaling: in any crash scenario, every operation either completes or doesn’t, with no partial states left over to confuse the file system.
The database connection
The technique is fundamentally borrowed from database systems, where it’s called write-ahead logging (WAL). The Harvard CS161 lecture notes capture the inheritance: “A transaction is a sequence of operations that should be treated as a logical whole. In the database world, transactions are described using A.C.I.D.” File system journaling implements a subset of database ACID properties (atomicity and consistency, with weaker durability guarantees than full database WAL) sufficient to handle the file system’s specific needs.
Where journals live
Different file systems implement the journal in different ways:
- NTFS: the journal is $LogFile (file 2 in the MFT), located on the same volume as the rest of NTFS data.
- ext3/ext4: the journal is typically inode 8 by default, stored as an internal hidden file. It can also be placed on a separate device for performance.
- XFS: uses a separate journal area that can be internal or on a separate device.
- JFS (IBM): uses a separate journal log area.
- HFS+ (macOS pre-APFS): uses a separate journal file.
- ReFS: Microsoft’s resilient file system uses a more sophisticated allocation-on-write scheme combined with limited journaling.
Journal size considerations
The CodeLucky journaling guide describes the typical configuration: ext4’s default journal size is 128 MB, i.e. a journal length of 32,768 blocks at the common 4 KB block size. The journal can be tuned; larger journals hold more transaction history (helpful for recovery) but consume more disk space and take longer to replay after a crash. The Wikipedia article notes the flexibility: “Some file systems allow the journal to grow, shrink and be re-allocated just as a regular file, while others put the journal in a contiguous area or a hidden file that is guaranteed not to move or change size while the file system is mounted.”
External journals
The Wikipedia article describes a performance-oriented variant: “Some file systems may also allow external journals on a separate device, such as a solid-state drive or battery-backed non-volatile RAM.” External journals allow journal writes to be substantially faster than the main file system, reducing the journaling overhead. Battery-backed NVRAM journals were a popular configuration in enterprise storage arrays before SSDs became cheap; they provided guaranteed-durable writes at memory speeds. Modern systems use SSDs for external journals when journal performance is critical.
Why Journaling Exists: The Pre-Journaling Problem
Journaling exists to solve a specific, severe problem with non-journaled file systems: the slow and unreliable post-crash recovery process via fsck. The problem was acute enough on early Unix systems that journaling’s invention represented a substantial reliability improvement.2
The fsck problem
The Wikipedia documentation describes the pre-journaling reality: “Detecting and recovering from such inconsistencies normally requires a complete walk of its data structures, for example by a tool such as fsck (the file system checker). This must typically be done before the file system is next mounted for read-write access. If the file system is large and if there is relatively little I/O bandwidth, this can take a long time and result in longer downtimes if it blocks the rest of the system from coming back online.”
The “hours or days” downtime
The O’Reilly Managing RAID on Linux documentation captures the practical impact: “Journaling is especially helpful when working with RAID because arrays tend to be larger than single disks, which already take a long time to fsck. Imagine waiting for fsck to complete on a terabyte RAID partition that is using ext2. The downtime could be hours, or even days!”3 This isn’t hyperbole: ext2 fsck on a multi-terabyte volume genuinely could take 12-72 hours on early-2000s hardware, during which the system was unavailable. For production servers, that downtime was unacceptable.
The race-condition examples
The Deep Notes journaling documentation describes specific failure scenarios for the classic three-step Unix deletion (step 1: remove the directory entry; step 2: release the inode to the free-inode pool; step 3: return the file’s data blocks to the free-block pool):
- The “blocks reused” race: “If step 3 preceded step 1, a crash between them could allow the file’s blocks to be reused for a new file, meaning the partially deleted file would contain part of the contents of another file.”
- The “inaccessible file” race: “If step 2 preceded step 1, a crash between them would cause the file to be inaccessible, despite appearing to exist.”
Both scenarios produce file system inconsistencies that fsck must heuristically repair, often losing data in the process. Heuristic repair on inconsistent metadata is a fundamentally hard problem; fsck makes its best guess but can produce surprising outcomes including silent data loss.
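To make the race concrete, here is a tiny Python model of a deletion performed in the wrong order. The dictionary layout, the file name, and the inode and block numbers are invented for illustration and do not correspond to any real on-disk format.

```python
fs = {
    "dir":         {"report.txt": 8211},   # directory entry -> inode number
    "inode_table": {8211: [90210]},        # inode -> data block numbers
    "free_blocks": set(),                  # stand-in for the allocation bitmap
}

def unlink_wrong_order(fs, name, crash_after_step3=False):
    inode = fs["dir"][name]
    # Step 3 performed first: the file's blocks are returned to the free pool.
    fs["free_blocks"].update(fs["inode_table"][inode])
    if crash_after_step3:
        return                              # simulated crash before steps 1 and 2
    fs["dir"].pop(name)                     # step 1: remove the directory entry
    fs["inode_table"].pop(inode)            # step 2: release the inode

unlink_wrong_order(fs, "report.txt", crash_after_step3=True)
# Block 90210 is now "free" (a new file may claim it) while report.txt still
# points at it -- the "blocks reused" race that fsck can only repair heuristically.
assert "report.txt" in fs["dir"] and 90210 in fs["free_blocks"]
```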
The Harvard fsck critique
The Harvard CS161 journaling lecture notes provide a concise summary of fsck’s limitations: “fsck makes a series of passes through the file system to ensure that metadata is consistent. fsck may result in lost data, but metadata will always be consistent. fsck works, but has several unattractive features: fsck requires detailed knowledge of file system, making fsck difficult to write and maintain. fsck is extremely slow, because it requires multiple traversals through the entire file system. Ideally, recovery time would be proportional to the number of recent writes that may or may not have made it to disk.”4
What journaling solves
Journaling addresses each fsck weakness directly:
- Recovery time: proportional to journal size (typically 64-256 MB), not file system size. Recovery in seconds vs hours.
- Recovery correctness: deterministic replay of committed transactions, not heuristic guesswork. No surprise data loss.
- Implementation simplicity: recovery code only needs to handle journal entries, not arbitrary inconsistent file system states.
- Predictable behavior: file system is either fully consistent or has uncommitted journal entries. No undefined intermediate states.
The cost is the additional journal write per metadata change, which adds modest overhead for typical workloads but can be significant for write-heavy server applications.
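On the recovery-time point above specifically, a back-of-the-envelope comparison shows why the gap is so large. The throughput figures below are assumptions chosen only to illustrate orders of magnitude, not measurements.

```python
journal_size_mb = 128            # typical ext4 journal (see above)
seq_read_mb_s = 150              # assumed HDD sequential throughput
replay_seconds = journal_size_mb / seq_read_mb_s              # well under a second

metadata_gb = 40                 # assumed metadata footprint of a large, full volume
effective_fsck_mb_s = 5          # assumed throughput of fsck's mostly random reads
fsck_hours = metadata_gb * 1024 / effective_fsck_mb_s / 3600  # a couple of hours

print(f"journal replay ~ {replay_seconds:.1f} s, full fsck ~ {fsck_hours:.1f} h")
```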
How Journaling Works
The standard journaling protocol involves four phases per transaction. The CodeLucky journaling guide breaks down the ext4 implementation specifically.5
The four-phase transaction protocol
The CodeLucky guide describes the protocol explicitly: “Each transaction in the journal follows a strict protocol: Transaction Start: Allocate transaction ID and journal space. Write Phase: Log all changes to descriptor and data blocks. Commit Phase: Write commit block to make transaction durable. Checkpoint Phase: Apply changes to main file system.”
In practice, the lifecycle of a typical metadata change looks like:
- Transaction Start: the file system creates a new transaction record with a unique ID and reserves space in the journal for the upcoming entries.
- Descriptor block: a header block is written to the journal describing what blocks are being modified, where their data is in the journal, and where they need to be applied in the main file system.
- Data blocks: the actual content of the modified blocks is written to the journal.
- Commit block: a commit block is written marking the transaction as complete and durable. Once this block is on disk, the transaction is guaranteed to be replayed if a crash occurs before checkpoint.
- Checkpoint: at some later time (often when the journal is getting full), the file system applies the journal’s pending changes to their final locations on disk.
- Discard: after checkpoint, the journal entries can be reclaimed for new transactions.
The critical ordering rule
The Caltech CS124 journaling lecture captures the fundamental constraint: “The filesystem must follow a rule: No changes may be made to the filesystem metadata itself until the journal on disk reflects all changes being made in the transaction. The filesystem itself cannot be updated until the corresponding transaction enters the ‘commit’ state. Otherwise, the filesystem itself will include changes from an incomplete transaction. The transaction could be aborted by a system crash. In that case, those changes would need to be rolled back somehow.”6
This ordering is what makes journaling work. Without strict ordering, a crash could leave the file system in a state that doesn’t match either the pre-transaction or post-transaction expected state; with ordering, the file system is always in a consistent state matching either the pre-transaction (incomplete journal entry) or post-transaction (committed journal entry) world.
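A minimal in-memory sketch of the protocol, written here in Python with invented names, shows how the ordering rule falls out of the structure: the main file system dictionary is only ever touched during checkpoint, after the transaction has been committed to the journal. This is an illustration of the idea, not how JBD2 or any real journal is implemented.

```python
class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.blocks = {}       # final location -> new contents (descriptor + data)
        self.committed = False

class Journal:
    """Append-only journal area, modeled as a Python list."""
    def __init__(self):
        self.log = []
        self.next_tid = 0

    def start(self):
        # Transaction Start: allocate a transaction ID and reserve journal space.
        tx = Transaction(self.next_tid)
        self.next_tid += 1
        return tx

    def log_change(self, tx, location, data):
        # Write Phase: record where the block belongs and what it will contain.
        tx.blocks[location] = data

    def commit(self, tx):
        # Commit Phase: only after the whole entry is in the log does the
        # commit flag make the transaction durable and replayable.
        self.log.append(tx)
        tx.committed = True

    def checkpoint(self, main_fs):
        # Checkpoint Phase: apply committed transactions to the main structures,
        # then reclaim the journal space. The main file system is never touched
        # before commit -- the ordering rule quoted above.
        for tx in self.log:
            if tx.committed:
                for location, data in tx.blocks.items():
                    main_fs[location] = data
        self.log.clear()

main_fs = {}                       # stand-in for the on-disk structures
journal = Journal()
tx = journal.start()
journal.log_change(tx, "inode 8211", "new inode contents")
journal.log_change(tx, "bitmap block 12", "updated allocation bitmap")
journal.commit(tx)
journal.checkpoint(main_fs)        # only now does main_fs change
```

The key design point is that checkpoint is the only place the main structures are written, so a crash at any earlier point leaves them exactly as they were.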
Crash recovery via journal replay
The recovery process after a crash:
- System boots and the file system module initializes.
- The journal is examined for any transactions that don’t have a corresponding commit block; these are discarded as incomplete.
- For each transaction with a valid commit block, the journal entry is read and the changes are applied to the main file system (this is the “replay”).
- After all committed transactions have been replayed, the journal is cleared and normal operation resumes.
The Harvard CS161 documentation captures the recovery semantics: “During crash recovery, we’ll see a valid TxStart, but no valid TxEnd for the associated tid. If the data block made it to the journal, we’ll have to discard it, but the file system will be consistent.”
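A matching recovery sketch, again with an invented journal-record layout, shows why replay is deterministic: the presence or absence of the commit record decides everything, and replaying an already-applied block is harmless because the journal stores complete new block contents.

```python
def replay_journal(journal_log, main_fs):
    """Replay committed transactions after a crash; discard incomplete ones.

    journal_log is a list of dicts such as
    {"tid": 42, "blocks": {...}, "commit": True} -- an invented layout.
    """
    for tx in journal_log:
        if not tx.get("commit"):
            # No commit block reached the disk: the transaction never became
            # durable, and the main file system was never modified, so it is
            # simply skipped. The file system stays consistent.
            continue
        # Commit block present: re-apply every logged block to its final
        # location. Replay is idempotent because each record holds the
        # complete new contents of the block.
        for location, data in tx["blocks"].items():
            main_fs[location] = data
    journal_log.clear()   # recovery done; normal operation resumes
```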
The reorder-writes problem
The Harvard CS161 documentation captures a subtle problem: “The disk can reorder writes, which may cause havoc if a crash happens during journal writes.” Modern disks (and especially SSDs) can cache and reorder writes for performance, which can violate the journaling protocol’s ordering requirements. File systems work around this with explicit flush and barrier mechanisms (disk cache-flush commands, FUA writes, and the sync/fsync calls that trigger them) that force earlier writes onto stable media before later, dependent writes are issued; without these barriers, journaling’s safety guarantees can be compromised.
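From user space, the closest approximation to such a barrier is an explicit fsync between the journal records and the commit record. The sketch below only illustrates the ordering idea: the path and record format are made up, and a real file system enforces ordering at the block layer with cache-flush or FUA commands rather than by calling fsync on a journal file.

```python
import os

# Journal records first, then a flush, then the commit record, then another
# flush -- only after that may the final locations be updated.
fd = os.open("/var/tmp/demo-journal", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)

os.write(fd, b"TXSTART 42\n")
os.write(fd, b"DATA inode=8211 ...\n")
os.fsync(fd)                      # barrier 1: the journal entry must be durable...

os.write(fd, b"TXCOMMIT 42\n")
os.fsync(fd)                      # barrier 2: ...before the commit record is, and
                                  # both before any in-place (checkpoint) writes.
os.close(fd)
```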
The USPTO parallel journaling research
The USPTO patent 11436200 on parallel journaling describes the safety/performance tradeoff: “Journaling file systems use sequential, rather than parallel, writing of the data (first) and the journal (afterward). This is slow, but is a safety measure because, in the event of a system crash, if data and journal writes had been in parallel, it is possible that the journal write was completed but the data write was not. In such a scenario, a recovery operation using the journal may lead to data corruption.” Modern research is exploring fault-tolerant parallel journaling to capture both safety and performance benefits.
Three Levels of Journal Protection
ext3 and ext4 support three journaling modes that trade off safety against performance. Understanding the modes helps with decisions about how to configure file systems for specific workloads and recovery requirements.
Journal mode (data=journal)
Both metadata and file data are written to the journal before being applied to their final locations. This means every byte the application writes is written to disk twice: once to the journal, once to its final location. Journal mode provides the strongest crash protection: any write that reached the journal survives a crash intact, nothing is ever half-applied, and the file system is guaranteed consistent at all times. The cost is substantial; write performance is roughly halved compared to non-journaled writing, and double the bytes are physically written, accelerating SSD wear.
Ordered mode (data=ordered): the default
Only metadata is journaled, but the file system enforces ordering so that file data is written to its final location before the metadata journal entry is committed. This is the default mode for ext3 and ext4 because it provides good crash protection at reasonable performance:
- Crash before data write completes: data is partially written but metadata still references the old (intact) data because the metadata journal entry hasn’t been committed yet.
- Crash after data write but before metadata commit: data is fully written but metadata still references the old location; on replay, the metadata journal entry will be discarded as incomplete, leaving the data orphaned (lost) but the file system consistent.
- Crash after metadata commit: metadata journal entry is replayed, pointing to the fully-written data.
The mode produces consistent file system state at the cost of potentially losing the most recent file content writes. This is the right default for most workloads; preserving file system integrity is more important than the last few seconds of writes.
Writeback mode (data=writeback)
Only metadata is journaled, with no ordering constraint between data and metadata writes. The file system can write data and metadata in any order, which is fastest but allows for data anomalies after a crash:
- Crash with data not yet written: metadata may have been committed pointing to blocks that contain whatever was there before (potentially garbage or another file’s data).
- The Deep Notes documentation captures the risk: “A file system with a logical journal still recovers quickly after a crash, but may allow unjournaled file data and journaled metadata to fall out of sync with each other, causing data corruption.”
Writeback mode is rarely the right choice; the modest performance gain over ordered mode rarely justifies the corruption risk. Some database workloads where the database engine has its own data integrity protection might use writeback mode for the underlying file system.
Mode comparison
| Mode | What’s journaled | Data ordering | Performance | Crash safety |
|---|---|---|---|---|
| journal | Metadata + data | N/A (all in journal) | Slowest | Highest |
| ordered (default) | Metadata only | Data before metadata commit | Moderate | Good |
| writeback | Metadata only | None | Fastest | Lowest (can corrupt data) |
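On a running Linux system, the mode of a mounted ext3/ext4 volume can be read from its mount options. The helper below is a small sketch that parses /proc/mounts; it assumes (as is common on many distributions) that when no data= option is listed, the mount is using the default ordered mode.

```python
def ext_data_mode(mount_point="/"):
    """Return the data= journaling mode of an ext3/ext4 mount, if visible."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mnt, fstype, options = line.split()[:4]
            if mnt == mount_point and fstype in ("ext3", "ext4"):
                for opt in options.split(","):
                    if opt.startswith("data="):
                        return opt.split("=", 1)[1]   # "journal", "ordered", "writeback"
                return "ordered"   # assumption: no data= shown means the default
    return None   # mount point not found or not ext3/ext4

print(ext_data_mode("/"))
```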
Other file systems’ approaches
Different file systems make different choices on this spectrum:
- NTFS: uses an approach roughly equivalent to ordered mode for metadata; data writes are not journaled directly.
- XFS: uses metadata-only journaling similar to writeback mode; relies on application-level data integrity.
- HFS+: uses metadata-only journaling.
- ReFS: uses copy-on-write for metadata combined with limited journaling; doesn’t fit cleanly into the three-mode taxonomy.
- ZFS / Btrfs / APFS: use copy-on-write entirely; achieve crash consistency without separate journals.
Journaling and Forensic Recovery
Beyond crash recovery, journals serve a useful forensic function: they preserve a transaction-level history of file system changes that recovery tools can sometimes exploit to recover deleted files even when the current file system state has lost the relevant metadata.
The journal as time machine
The journal holds recent copies of modified metadata blocks, so transactions recorded before a deletion still contain the inode (or MFT record) as it looked while the file existed, including the block pointers to where the data lives. Recovery tools can read the journal to find that pre-deletion metadata and use it to recover deleted files even after the current inode has been zeroed out. The window for this recovery is bounded by the journal size: typical journals (64-256 MB) hold a few thousand transactions, so older deletions may already have been overwritten.
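The idea can be illustrated with a toy model. The record layout below is invented; real tools such as extundelete parse the on-disk JBD2 journal format instead, but the logic is the same: find an older journaled copy of the inode that still carries block pointers.

```python
journal = [
    # Older transaction: a journaled copy of the inode while the file existed.
    {"tid": 41, "commit": True,
     "blocks": {"inode 8211": {"size": 4096, "block_ptrs": [90210]}}},
    # Newer transaction: the deletion wrote back a zeroed inode.
    {"tid": 42, "commit": True,
     "blocks": {"inode 8211": {"size": 0, "block_ptrs": []}}},
]

def pre_deletion_copy(journal, inode_key):
    """Newest journaled copy of the inode that still carries block pointers."""
    live = [tx["blocks"][inode_key] for tx in journal
            if tx["commit"] and inode_key in tx["blocks"]
            and tx["blocks"][inode_key]["block_ptrs"]]
    return live[-1] if live else None

print(pre_deletion_copy(journal, "inode 8211"))
# {'size': 4096, 'block_ptrs': [90210]} -> those blocks can then be read back,
# provided they have not been reallocated to another file in the meantime.
```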
extundelete and ext4magic
The two specialized ext recovery tools both leverage the journal:
- extundelete: reads the file system journal to find inode states before deletions, then uses those to recover file data.
- ext4magic: ext4-specific tool that handles extent trees better than generic debugfs; can recover from extent-tree-cleared inodes by reconstructing extent state from journal data.
Both tools operate at a higher level of automation than debugfs, providing recovery workflows that depend on the journal state being intact. If the journal has been overwritten with newer transactions, neither tool can recover from it.
NTFS $LogFile recovery
NTFS’s $LogFile plays the same forensic role as the ext journal. Recovery tools that understand $LogFile can extract pre-deletion MFT records, even when the current MFT state has marked the records as deleted. The $LogFile is part of the “NTFS Triforce” methodology (MFT + $LogFile + $UsnJrnl) that comprehensive Windows forensics uses for thorough investigations.
Journal limitations for recovery
Journals have several limitations as recovery tools:
- Limited size: typical 64-256 MB journals hold only recent history; older deletions are overwritten.
- Metadata-only: ordered and writeback modes don’t journal file data; the journal won’t help recover lost file contents in those modes.
- Wraparound: journals are typically circular; once full, oldest entries are overwritten.
- File system version dependence: recovery tools must understand the specific journal format used by the file system version.
- Race with ongoing activity: continued use of the file system after deletion accelerates journal overwrite, reducing recovery prospects.
When the journal helps and when it doesn’t
The journal is most useful for:
- Very recent deletions: within the journal’s window, full pre-deletion metadata can typically be recovered.
- Forensic timeline reconstruction: the journal documents what changed when, providing chronological context.
- Crash recovery investigations: the journal shows what operations were in flight at the crash, helping diagnose root cause.
The journal is not useful for:
- Old deletions: once journal has wrapped, the relevant entries are gone.
- Recovery of file content that was never journaled: in ordered/writeback modes, file data isn’t in the journal.
- Recovery from journal corruption: if the journal itself is damaged, its forensic value is lost.
Journaling is one of the most consequential file system architectural choices, transforming the post-crash recovery experience from a multi-hour ordeal of fsck-and-pray into a near-instant journal replay that’s both faster and more reliable. Every modern production file system uses either journaling or copy-on-write to achieve crash consistency; the days of waiting hours for fsck on a multi-terabyte volume are largely gone except in legacy contexts. The same mechanism that makes file systems crash-safe also provides forensic context that recovery tools exploit; the journal is genuinely a multi-purpose architectural element.7
For users wondering about journaling configuration, the practical guidance is consistent: leave the defaults alone unless you have specific reasons to change them. ext4’s ordered mode is the right default for nearly all workloads; switching to journal mode only makes sense if you genuinely cannot tolerate any data loss (rare); switching to writeback mode rarely makes sense given modern hardware speeds. The journal size default (128 MB on ext4) is also typically appropriate; doubling it to 256 MB modestly extends the recovery window for forensic purposes if disk space permits. External journals on SSDs are sometimes worth configuring for write-heavy server workloads where journal write latency matters.
For users facing potential data loss on journaled file systems, the journal-related guidance reinforces the standard advice: stop using the file system immediately. Continued operation generates new journal entries that overwrite the old ones containing pre-deletion state; recovery software specifically designed to read journals (extundelete for ext, NTFS-aware tools for Windows) can exploit pre-deletion journal state, but only if the journal hasn’t been overwritten. The journal is a complement to, not a replacement for, comprehensive backups; it extends the recovery window for very recent deletions but doesn’t help with older data loss. Cleanroom recovery is unrelated to journaling and applies regardless of file system; the physics of platter and NAND don’t care about file system journals. The combination of journaling, copy-on-write, comprehensive backups, and (when needed) professional recovery is the layered defense that makes modern data infrastructure broadly reliable.
Journaling File System FAQ
What is a journaling file system?
A journaling file system is one that maintains a special on-disk area called a journal where it records intended changes before applying them to the main file system structures. The technique is borrowed from database systems and is sometimes called write-ahead logging. When a file operation requires multiple disk writes, the file system writes a transaction record describing all intended changes to the journal first; only after the journal entry is safely committed does the file system apply the actual changes. If the system crashes partway through, the journal lets the file system replay committed transactions and discard incomplete ones. The result is fast crash recovery (seconds, regardless of file system size) and dramatically lower corruption risk than non-journaled file systems.
Why is journaling better than running fsck after a crash?
Without journaling, post-crash recovery requires running fsck (file system check), which traverses the entire file system looking for inconsistencies (orphaned inodes, dangling pointers, allocation bitmap mismatches, etc.). For a multi-terabyte volume on a hard drive, fsck can take hours or days. Journaling reduces recovery to seconds because the file system only needs to examine the journal (typically 64-256 MB) rather than the entire volume. Journaling also reduces the probability of corruption: with non-journaled file systems, a crash partway through a multi-step operation could leave the file system in an inconsistent state that fsck would have to guess how to fix; journaling makes operations atomic, so a crashed operation either completes via journal replay or is fully rolled back.
How does the journaling protocol work?
The standard ext4-style journaling protocol has four phases. Transaction Start: the file system allocates a transaction ID and reserves journal space. Write Phase: it writes descriptor blocks (describing the intended changes) and data blocks (the actual change content) to the journal. Commit Phase: it writes a commit block to mark the transaction as durable; once this commit block is on disk, the transaction is guaranteed to be replayed if a crash occurs. Checkpoint Phase: at some later time, the file system applies the changes to their final locations on disk and marks the journal entry as no longer needed. The critical ordering rule is that the file system metadata cannot be modified before the journal entry has been committed; this ensures that a crash during the operation can be recovered by replaying the journal.
What are the ext3/ext4 journaling modes?
ext3 and ext4 support three journaling modes that trade off safety against performance. Journal mode (data=journal): both metadata and file data are written to the journal before being applied to their final locations. This provides the strongest crash protection (journaled writes cannot be lost or garbled) but is slowest because every byte is written twice. Ordered mode (data=ordered): only metadata is journaled, but the file system enforces ordering so that file data is written to its final location before the metadata journal entry is committed. This is the default for ext3 and ext4 and provides good crash protection with reasonable performance. Writeback mode (data=writeback): only metadata is journaled, with no ordering constraint between data and metadata writes. This is fastest but allows file data to appear randomly garbled after a crash even though metadata is consistent.
How does journaling differ from copy-on-write?
Journaling and copy-on-write (COW) are two different approaches to crash consistency that achieve similar safety guarantees through different mechanisms. Journaling file systems (ext3/4, NTFS, XFS, JFS, HFS+) write intended changes to a journal first, then apply them to their final locations; the journal provides a crash-replay mechanism. Copy-on-write file systems (ZFS, Btrfs, APFS) never overwrite existing data; when changes are needed, they write new copies to free space and update the metadata pointers atomically. COW eliminates the need for a separate journal because the metadata update itself is atomic: either the new pointers are written or they aren’t. Both approaches achieve crash consistency, but COW typically offers additional benefits like efficient snapshots and built-in checksumming, at the cost of higher metadata complexity.
Can the journal help recover deleted files?
Yes, journals contain transaction-level history that recovery tools can sometimes exploit to recover deleted file metadata. On ext file systems, tools like extundelete and ext4magic read the journal to find inode states from before recent deletions; if the file system journal contains pre-deletion versions of inodes, the tools can use that information to reconstruct deleted files even when the current inode has been zeroed. The journal’s typical size (64-256 MB by default) limits how much history is preserved; older deletions are discarded as the journal wraps. NTFS’s $LogFile is similarly used by forensic tools to recover transaction-level changes that may not yet be reflected in the MFT. The journal is most useful for recovering files deleted very recently, before the journal has been overwritten with new transactions.
Related glossary entries
- $MFT: NTFS’s central data structure; works with $LogFile journal for recovery.
- Inode: Unix per-file metadata structure; ext journal contains pre-deletion inode states.
- NTFS: uses $LogFile as its journal for crash recovery and forensic context.
- ext4: the most common Linux file system; uses JBD2 journal in three modes.
- ZFS: uses copy-on-write instead of journaling; alternative crash consistency approach.
- File Carving: fallback recovery technique when journals can’t help.
- Forensic Recovery: journals provide transaction-level history for forensic investigations.
Sources
- Wikipedia: Journaling file system (accessed May 2026)
- Deep Notes: Journaling File System
- O’Reilly / Managing RAID on Linux: Journaling Filesystems
- Harvard CS161: Lecture 14: Journaling
- CodeLucky: File System Journaling: Complete Guide to Crash Recovery
- Caltech CS124: Journaling File Systems Lecture
- USPTO patent 11436200: Fault tolerant parallel journaling for file systems
