.. SPDX-License-Identifier: GPL-2.0
.. _atomic_writes:

Atomic Block Writes
-------------------------

Introduction
~~~~~~~~~~~~

Atomic (untorn) block writes ensure that either the entire write is committed
to disk or none of it is. This prevents "torn writes" during power loss or
system crashes. The ext4 filesystem supports atomic writes (only with Direct
I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:

1. **Single-fsblock Atomic Writes**:
   EXT4 supports atomic write operations with a single filesystem block since
   v6.13. In this mode, the atomic write unit minimum and maximum sizes are
   both set to the filesystem block size. For example, a 16KB atomic write is
   possible with a 16KB filesystem block size on a system with a 64KB page
   size.

2. **Multi-fsblock Atomic Writes with Bigalloc**:
   EXT4 now also supports atomic writes spanning multiple filesystem blocks
   using a feature known as bigalloc. The atomic write unit's minimum and
   maximum sizes are determined by the filesystem block size and cluster size,
   based on the underlying device's supported atomic write unit limits.

Requirements
~~~~~~~~~~~~

Basic requirements for atomic writes in ext4:

 1. The extents feature must be enabled (default for ext4)
 2. The underlying block device must support atomic writes
 3. For single-fsblock atomic writes:

    1. A filesystem with an appropriate block size (up to the page size)

 4. For multi-fsblock atomic writes:

    1. The bigalloc feature must be enabled
    2. The cluster size must be appropriately configured

NOTE: EXT4 does not support software or COW based atomic writes, which means
atomic writes on ext4 are only supported if the underlying storage device
supports them.
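
The way these requirements bound the atomic write unit in each mode can be
sketched in a few lines of userspace C. This is only an illustration, not the
ext4 implementation: ``struct awu_limits``, ``awu_compute()`` and
``awu_request_ok()`` are hypothetical names, treating a non-overlapping range
as "unsupported" is an assumption, and the power-of-two and natural-alignment
checks are modelled on the generic rules for ``RWF_ATOMIC`` writes.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical container for the limits; not a kernel structure. */
    struct awu_limits {
        uint32_t min;   /* smallest atomic write in bytes, 0 = unsupported */
        uint32_t max;   /* largest atomic write in bytes, 0 = unsupported */
    };

    static bool is_pow2(uint64_t v)
    {
        return v != 0 && (v & (v - 1)) == 0;
    }

    /*
     * blocksize:     filesystem block size in bytes
     * clustersize:   bigalloc cluster size, or blocksize without bigalloc
     * hw_min/hw_max: device atomic write unit limits, 0 = no hardware support
     */
    static struct awu_limits awu_compute(uint32_t blocksize, uint32_t clustersize,
                                         uint32_t hw_min, uint32_t hw_max)
    {
        struct awu_limits l = { 0, 0 };

        if (hw_min == 0 || hw_max == 0)
            return l;   /* no hardware support means no atomic writes */

        /* min: larger of fs block size and hardware minimum */
        l.min = blocksize > hw_min ? blocksize : hw_min;
        /* max: smaller of cluster size and hardware maximum */
        l.max = clustersize < hw_max ? clustersize : hw_max;

        /* Assumption: an empty range is reported as "unsupported". */
        if (l.min > l.max) {
            l.min = 0;
            l.max = 0;
        }
        return l;
    }

    /* Check whether a write of len bytes at offset fits within the limits. */
    static bool awu_request_ok(struct awu_limits l, uint64_t offset, uint64_t len)
    {
        if (l.min == 0 || len < l.min || len > l.max)
            return false;
        if (!is_pow2(len))
            return false;               /* length must be a power of two */
        return (offset % len) == 0;     /* offset must be naturally aligned */
    }

With a 4KB block size, a 64KB cluster size and a device reporting 4KB to 32KB,
this sketch yields a range of 4KB to 32KB. Without bigalloc (cluster size equal
to block size), both limits collapse to the block size.
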

Multi-fsblock Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters. With bigalloc, each bit in the block bitmap
represents a cluster (a power-of-2 number of blocks) rather than an individual
filesystem block.
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the fs
block size and the minimum hardware atomic write unit, and the maximum atomic
write size is the smaller of the bigalloc cluster size and the maximum hardware
atomic write unit. Bigalloc ensures that all allocations are aligned to the
cluster size, which satisfies the LBA alignment requirements of the hardware
device if the start of the partition/logical volume is itself aligned correctly.

Here is the block allocation strategy in bigalloc for atomic writes:

 * For regions with fully mapped extents, no additional work is needed
 * For append writes, a new mapped extent is allocated
 * For regions that are entirely holes, an unwritten extent is created
 * For large unwritten extents, the extent gets split into two unwritten
   extents of the appropriate requested size
 * For mixed mapping regions (combinations of holes, unwritten extents, or
   mapped extents), ``ext4_map_blocks()`` is called in a loop with the
   ``EXT4_GET_BLOCKS_ZERO`` flag to convert the region into a single contiguous
   mapped extent by writing zeroes to it and converting any unwritten extents
   found within the range to written.

Note: Writing on a single contiguous underlying extent, whether mapped or
unwritten, is not inherently problematic. However, writing to a mixed mapping
region (i.e. one containing a combination of mapped and unwritten extents)
must be avoided when performing atomic writes.

The reason is that an atomic write issued via ``pwritev2()`` with the
``RWF_ATOMIC`` flag requires that either all data is written or none at all.
In the event of a system crash or unexpected power loss during the write
operation, the affected region (when later read) must reflect either the
complete old data or the complete new data, but never a mix of both.

To enforce this guarantee, we ensure that the write target is backed by
a single, contiguous extent before any data is written. This is critical because
ext4 defers the conversion of unwritten extents to written extents until the I/O
completion path (typically in ``->end_io()``). If a write is allowed to proceed
over a mixed mapping region (with mapped and unwritten extents) and a failure
occurs mid-write, the system could observe partially updated regions after
reboot, i.e. new data over mapped areas, and stale (old) data over unwritten
extents that were never marked written. This violates the atomicity (torn write
prevention) guarantee.

To prevent such torn writes, ext4 proactively allocates a single contiguous
extent for the entire requested region in ``ext4_iomap_alloc()`` via
``ext4_map_blocks_atomic()``. EXT4 also forces a commit of the current journal
transaction if the allocation is done over a mixed mapping. This ensures that
any pending metadata updates (like unwritten to written extent conversion) in
this range are consistent with the file data blocks before the actual write
I/O is performed. If the commit fails, the whole I/O must be aborted to prevent
any possible torn writes.
Only after this step is the actual data write operation performed by iomap.

Handling Split Extents Across Leaf Blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There can be a special edge case where we have logically and physically
contiguous extents stored in separate leaf nodes of the on-disk extent tree.
This occurs because on-disk extent tree merges only happen within a leaf
block, except for the case of a 2-level tree, which can get merged and
collapsed entirely into the inode.
If such a layout exists and, in the worst case, the extent status cache entries
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
a single contiguous extent for these split leaf extents.

To address this edge case, a new get block flag,
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
``ext4_map_query_blocks()`` lookup behavior.

This new get block flag allows ``ext4_map_blocks()`` to first check if there is
an entry in the extent status cache for the full range.
If not present, it consults the on-disk extent tree using
``ext4_map_query_blocks()``.
If the located extent is at the end of a leaf node, it probes the next logical
block (lblk) to detect a contiguous extent in the adjacent leaf.

For now only one additional leaf block is queried to maintain efficiency, as
atomic writes are typically constrained to small sizes
(e.g. [blocksize, clustersize]).


Handling Journal transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support multi-fsblock atomic writes, we ensure enough journal credits are
reserved during:

 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
    could be a mixed mapping for the underlying requested range. If yes, then we
    reserve credits of up to ``m_len``, assuming every alternate block can be
    an unwritten extent followed by a hole.

 2. The ``->end_io()`` call, where we make sure a single transaction is started
    for doing the unwritten-to-written conversion. The conversion loop is
    mainly required only to handle a split extent across leaf blocks.
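
As an illustration of the leaf-boundary lookup described under "Handling Split
Extents Across Leaf Blocks", the following toy userspace model shows how
probing one extra logical block can recover a contiguous range that is split
across two leaves. It is a sketch under simplifying assumptions, not kernel
code: ``struct toy_extent``, ``struct toy_leaf``, ``toy_lookup()`` and
``toy_map_len()`` are hypothetical, the extent status cache is ignored, and the
``probe_next_leaf`` argument merely stands in for the new get block flag.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Toy extent: maps logical [lblk, lblk + len) to physical [pblk, ...). */
    struct toy_extent {
        uint32_t lblk;
        uint32_t len;
        uint64_t pblk;
    };

    /* Toy leaf node: a sorted array of extents. */
    struct toy_leaf {
        const struct toy_extent *ext;
        size_t nr;
    };

    /* Find the extent covering lblk and report if it is the last in its leaf. */
    static const struct toy_extent *
    toy_lookup(const struct toy_leaf *leaves, size_t nr_leaves, uint32_t lblk,
               bool *last_in_leaf)
    {
        for (size_t i = 0; i < nr_leaves; i++) {
            for (size_t j = 0; j < leaves[i].nr; j++) {
                const struct toy_extent *e = &leaves[i].ext[j];

                if (lblk >= e->lblk && lblk - e->lblk < e->len) {
                    *last_in_leaf = (j == leaves[i].nr - 1);
                    return e;
                }
            }
        }
        return NULL;
    }

    /*
     * Return how many blocks starting at lblk are mapped contiguously,
     * capped at want. With probe_next_leaf set, one extra lookup is done
     * when the first extent sits at the end of its leaf.
     */
    static uint32_t toy_map_len(const struct toy_leaf *leaves, size_t nr_leaves,
                                uint32_t lblk, uint32_t want, bool probe_next_leaf)
    {
        bool last, unused;
        const struct toy_extent *e = toy_lookup(leaves, nr_leaves, lblk, &last);

        if (!e)
            return 0;                   /* hole */

        uint32_t got = e->len - (lblk - e->lblk);

        if (got >= want)
            return want;
        if (!probe_next_leaf || !last)
            return got;

        /* Probe the next logical block once, i.e. at most one more leaf. */
        const struct toy_extent *n =
            toy_lookup(leaves, nr_leaves, lblk + got, &unused);

        /* Accept it only if it continues both logically and physically. */
        if (n && n->lblk == lblk + got && n->pblk == e->pblk + e->len) {
            uint32_t total = got + n->len;

            return total < want ? total : want;
        }
        return got;
    }

In this model, a 16-block request that straddles two leaves is reported as 8
blocks without the probe and as 16 blocks with it, provided the second extent
is physically adjacent to the first.
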

How to
~~~~~~

Creating Filesystems with Atomic Write Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First check the atomic write units supported by the block device.
See :ref:`atomic_write_bdev_support` for more details.

For single-fsblock atomic writes with a larger block size
(on systems with block size < page size):

.. code-block:: bash

    # Create an ext4 filesystem with a 16KB block size
    # (requires page size >= 16KB)
    mkfs.ext4 -b 16384 /dev/device

For multi-fsblock atomic writes with bigalloc:

.. code-block:: bash

    # Create an ext4 filesystem with bigalloc and 64KB cluster size
    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device

Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
and ``-O bigalloc`` enables the bigalloc feature.

Application Interface
^^^^^^^^^^^^^^^^^^^^^

Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
to perform atomic writes:

.. code-block:: c

    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.

The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
following details:

 * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
 * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
 * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
   separate memory buffers that can be gathered into a write operation
   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.

The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic
writes are supported.

.. _atomic_write_bdev_support:

Hardware Support
~~~~~~~~~~~~~~~~

The underlying storage device must support atomic write operations.
Modern NVMe and SCSI devices often provide this capability.
The Linux kernel exposes this information through sysfs:

* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size

Nonzero values for these attributes indicate that the device supports
atomic writes.

See Also
~~~~~~~~

* :doc:`bigalloc` - Documentation on the bigalloc feature
* :doc:`allocators` - Documentation on block allocation in ext4
* Support for atomic block writes in 6.13:
  https://lwn.net/Articles/1009298/