.. SPDX-License-Identifier: GPL-2.0
.. _atomic_writes:

Atomic Block Writes
-------------------------

Introduction
~~~~~~~~~~~~

Atomic (untorn) block writes ensure that either the entire write is committed
to disk or none of it is. This prevents "torn writes" during power loss or
system crashes. The ext4 filesystem supports atomic writes (only with Direct
I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:

1. **Single-fsblock Atomic Writes**:
   EXT4 has supported atomic write operations on a single filesystem block
   since v6.13. In this mode, the atomic write unit minimum and maximum sizes
   are both set to the filesystem block size.
   For example, an atomic write of 16KB with a 16KB filesystem block size on a
   system with a 64KB page size is possible.

2. **Multi-fsblock Atomic Writes with Bigalloc**:
   EXT4 also supports atomic writes spanning multiple filesystem blocks
   using a feature known as bigalloc. The atomic write unit's minimum and
   maximum sizes are determined by the filesystem block size and cluster size,
   based on the underlying device’s supported atomic write unit limits.

Requirements
~~~~~~~~~~~~

Basic requirements for atomic writes in ext4:

 1. The extents feature must be enabled (default for ext4)
 2. The underlying block device must support atomic writes
 3. For single-fsblock atomic writes:

    1. A filesystem with appropriate block size (up to the page size)
 4. For multi-fsblock atomic writes:

    1. The bigalloc feature must be enabled
    2. The cluster size must be appropriately configured

NOTE: EXT4 does not support software or COW based atomic writes, which means
atomic writes on ext4 are only supported if the underlying storage device
supports them.

Multi-fsblock Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters.
With bigalloc, each bit within the block bitmap
represents a cluster (a power-of-2 number of blocks) rather than an individual
filesystem block.
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the fs
block size and the minimum hardware atomic write unit; and the maximum atomic
write size is the smaller of the bigalloc cluster size and the maximum hardware
atomic write unit. Bigalloc ensures that all allocations are aligned to the
cluster size, which satisfies the LBA alignment requirements of the hardware
device if the start of the partition/logical volume is itself aligned correctly.

Here is the block allocation strategy in bigalloc for atomic writes:

 * For regions with fully mapped extents, no additional work is needed
 * For append writes, a new mapped extent is allocated
 * For regions that are entirely holes, an unwritten extent is created
 * For large unwritten extents, the extent gets split into two unwritten
   extents of appropriate requested size
 * For mixed mapping regions (combinations of holes, unwritten extents, or
   mapped extents), ext4_map_blocks() is called in a loop with the
   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
   mapped extent by writing zeroes to it and converting any unwritten extents to
   written, if found within the range.

Note: Writing on a single contiguous underlying extent, whether mapped or
unwritten, is not inherently problematic. However, writing to a mixed mapping
region (i.e. one containing a combination of mapped and unwritten extents)
must be avoided when performing atomic writes.

The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
flag, require that either all data is written or none at all. In the event of
a system crash or unexpected power loss during the write operation, the affected
region (when later read) must reflect either the complete old data or the
complete new data, but never a mix of both.

To enforce this guarantee, we ensure that the write target is backed by
a single, contiguous extent before any data is written. This is critical because
ext4 defers the conversion of unwritten extents to written extents until the I/O
completion path (typically in ->end_io()). If a write is allowed to proceed over
a mixed mapping region (with mapped and unwritten extents) and a failure occurs
mid-write, the system could observe partially updated regions after reboot, i.e.
new data over mapped areas, and stale (old) data over unwritten extents that
were never marked written.
This violates the atomicity and/or torn write
prevention guarantee.

To prevent such torn writes, ext4 proactively allocates a single contiguous
extent for the entire requested region in ``ext4_iomap_alloc()`` via
``ext4_map_blocks_atomic()``. EXT4 also force-commits the current journalling
transaction if the allocation is done over a mixed mapping. This ensures any
pending metadata updates (like unwritten to written extents conversion) in this
range are in a consistent state with the file data blocks, before performing the
actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
any possible torn writes.
Only after this step is the actual data write operation performed by iomap.

Handling Split Extents Across Leaf Blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There can be a special edge case where we have logically and physically
contiguous extents stored in separate leaf nodes of the on-disk extent tree.
This occurs because on-disk extent tree merges only happen within the leaf
blocks, except for the case of a 2-level tree, which can get merged and
collapsed entirely into the inode.
If such a layout exists and, in the worst case, the extent status cache entries
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
a single contiguous extent for these split leaf extents.

To address this edge case, a new get block flag,
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
``ext4_map_query_blocks()`` lookup behavior.

This new get block flag allows ``ext4_map_blocks()`` to first check if there is
an entry in the extent status cache for the full range.
If not present, it consults the on-disk extent tree using
``ext4_map_query_blocks()``.
If the located extent is at the end of a leaf node, it probes the next logical
block (lblk) to detect a contiguous extent in the adjacent leaf.

For now only one additional leaf block is queried to maintain efficiency, as
atomic writes are typically constrained to small sizes
(e.g. [blocksize, clustersize]).


Handling Journal transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support multi-fsblock atomic writes, we ensure enough journal credits are
reserved during:

 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
    could be a mixed mapping for the underlying requested range. If yes, then we
    reserve credits of up to ``m_len``, assuming every alternate block can be
    an unwritten extent followed by a hole.

 2. During the ``->end_io()`` call, we make sure a single transaction is started
    for doing unwritten-to-written conversion. The loop for conversion is mainly
    only required to handle a split extent across leaf blocks.

How to
------

Creating Filesystems with Atomic Write Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First check the atomic write units supported by the block device.
See :ref:`atomic_write_bdev_support` for more details.

For single-fsblock atomic writes with a larger block size
(on systems with block size < page size):

.. code-block:: bash

    # Create an ext4 filesystem with a 16KB block size
    # (requires page size >= 16KB)
    mkfs.ext4 -b 16384 /dev/device

For multi-fsblock atomic writes with bigalloc:

.. code-block:: bash

    # Create an ext4 filesystem with bigalloc and 64KB cluster size
    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device

Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
and ``-O bigalloc`` enables the bigalloc feature.

Application Interface
~~~~~~~~~~~~~~~~~~~~~

Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
to perform atomic writes:

.. code-block:: c

    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.

The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
following details:

 * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
 * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
 * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
   separate memory buffers that can be gathered into a write operation
   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.

The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic
writes are supported.

.. _atomic_write_bdev_support:

Hardware Support
----------------

The underlying storage device must support atomic write operations.
Modern NVMe and SCSI devices often provide this capability.
The Linux kernel exposes this information through sysfs:

* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size

Nonzero values for these attributes indicate that the device supports
atomic writes.

See Also
--------

* :doc:`bigalloc` - Documentation on the bigalloc feature
* :doc:`allocators` - Documentation on block allocation in ext4
* Support for atomic block writes in 6.13:
  https://lwn.net/Articles/1009298/
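
Worked Example
--------------

As a rough illustration of the limits described under "Multi-fsblock
Implementation Details", the sketch below derives the atomic write unit window
from the filesystem block size, the bigalloc cluster size and the device
limits. It is not ext4 code: ``struct atomic_limits``,
``atomic_limits_sketch()`` and the sample values are hypothetical names and
numbers chosen for illustration. Returning zeroes when the computed minimum
exceeds the maximum is an assumption of this sketch.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical container for the resulting limits (not an ext4 type). */
    struct atomic_limits {
        uint32_t unit_min;  /* 0 means atomic writes are unavailable */
        uint32_t unit_max;
    };

    /*
     * Hypothetical helper, for illustration only:
     *   min = larger of the fs block size and the hardware minimum unit
     *   max = smaller of the cluster size and the hardware maximum unit
     * Without bigalloc, pass cluster == fs_block (single-fsblock case).
     */
    static struct atomic_limits atomic_limits_sketch(uint32_t fs_block,
                                                     uint32_t cluster,
                                                     uint32_t hw_min,
                                                     uint32_t hw_max)
    {
        struct atomic_limits l = { 0, 0 };

        /* Zero device limits mean no hardware atomic write support. */
        if (hw_min == 0 || hw_max == 0)
            return l;

        uint32_t lo = fs_block > hw_min ? fs_block : hw_min;
        uint32_t hi = cluster < hw_max ? cluster : hw_max;

        /* Assumption of this sketch: an empty window means "unsupported". */
        if (lo > hi)
            return l;

        l.unit_min = lo;
        l.unit_max = hi;
        return l;
    }

    int main(void)
    {
        /* Assumed sample: 4KB blocks, 64KB clusters, device 512B..32KB. */
        struct atomic_limits l = atomic_limits_sketch(4096, 65536, 512, 32768);

        assert(l.unit_min <= l.unit_max);
        /* prints "unit_min=4096 unit_max=32768" */
        printf("unit_min=%u unit_max=%u\n",
               (unsigned)l.unit_min, (unsigned)l.unit_max);
        return 0;
    }

On a real system, the authoritative values are the ones reported by ``statx()``
with ``STATX_WRITE_ATOMIC``; this calculation only shows how the documented
rule combines its inputs.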