.. SPDX-License-Identifier: GPL-2.0
.. _atomic_writes:

Atomic Block Writes
-------------------------

Introduction
~~~~~~~~~~~~

Atomic (untorn) block writes ensure that either the entire write is committed
to disk or none of it is. This prevents "torn writes" during power loss or
system crashes. The ext4 filesystem supports atomic writes (only with Direct
I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:

1. **Single-fsblock Atomic Writes**:
   EXT4 has supported atomic write operations on a single filesystem block
   since v6.13. In this mode the atomic write unit minimum and maximum sizes
   are both set to the filesystem block size.
   For example, an atomic write of 16KB is possible with a 16KB filesystem
   block size on a system with a 64KB page size.

2. **Multi-fsblock Atomic Writes with Bigalloc**:
   EXT4 also supports atomic writes spanning multiple filesystem blocks
   using a feature known as bigalloc. The atomic write unit's minimum and
   maximum sizes are determined by the filesystem block size and cluster size,
   based on the underlying device’s supported atomic write unit limits.

Requirements
~~~~~~~~~~~~

Basic requirements for atomic writes in ext4:

 1. The extents feature must be enabled (default for ext4)
 2. The underlying block device must support atomic writes
 3. For single-fsblock atomic writes:

    1. A filesystem with appropriate block size (up to the page size)
 4. For multi-fsblock atomic writes:

    1. The bigalloc feature must be enabled
    2. The cluster size must be appropriately configured

NOTE: EXT4 does not support software or COW based atomic writes, which means
atomic writes on ext4 are only supported if the underlying storage device
supports them.

Multi-fsblock Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters. With bigalloc, each bit in the block bitmap
represents a cluster (a power-of-2 number of blocks) rather than an individual
filesystem block.
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the fs
block size and the minimum hardware atomic write unit, and the maximum atomic
write size is the smaller of the bigalloc cluster size and the maximum hardware
atomic write unit. Bigalloc ensures that all allocations are aligned to the
cluster size, which satisfies the LBA alignment requirements of the hardware
device if the start of the partition/logical volume is itself aligned correctly.

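As an illustration of the constraints above, the following is a minimal
userspace sketch of the min/max computation. It is not the kernel code: the
``awu_limits`` structure and the ``awu_compute()`` helper are hypothetical
names, and treating an empty window (minimum larger than maximum) as
"unsupported" is an assumption made for this sketch.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical container for the computed limits; not a kernel type. */
    struct awu_limits {
        uint32_t unit_min;  /* 0 means atomic writes are not possible */
        uint32_t unit_max;
    };

    static bool is_pow2(uint32_t v)
    {
        return v != 0 && (v & (v - 1)) == 0;
    }

    /*
     * unit_min = larger of the fs block size and the hardware minimum,
     * unit_max = smaller of the bigalloc cluster size and the hardware maximum.
     * All sizes are in bytes.
     */
    static struct awu_limits awu_compute(uint32_t fs_block, uint32_t cluster,
                                         uint32_t hw_min, uint32_t hw_max)
    {
        struct awu_limits lim = { 0, 0 };
        uint32_t lo, hi;

        /* Block and cluster sizes are powers of two, cluster >= block. */
        assert(is_pow2(fs_block) && is_pow2(cluster) && cluster >= fs_block);

        /* A device reporting zero does not support atomic writes. */
        if (hw_min == 0 || hw_max == 0)
            return lim;

        lo = fs_block > hw_min ? fs_block : hw_min;
        hi = cluster < hw_max ? cluster : hw_max;

        /* Assumption: an empty window means no atomic write support. */
        if (lo > hi)
            return lim;

        lim.unit_min = lo;
        lim.unit_max = hi;
        return lim;
    }

For example, with a 4KB block size, a 64KB cluster size and a device that
reports a 512-byte minimum and a 16KB maximum, this sketch yields a minimum of
4KB and a maximum of 16KB.
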
Here is the block allocation strategy in bigalloc for atomic writes:

 * For regions with fully mapped extents, no additional work is needed
 * For append writes, a new mapped extent is allocated
 * For regions that are entirely holes, an unwritten extent is created
 * For large unwritten extents, the extent gets split into two unwritten
   extents of the appropriate requested size
 * For mixed mapping regions (combinations of holes, unwritten extents, or
   mapped extents), ext4_map_blocks() is called in a loop with the
   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
   mapped extent by writing zeroes to it and converting any unwritten extents
   found within the range to written.

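The decision in the list above can be pictured with a toy model. The sketch
below is only an illustration and does not mirror the kernel implementation:
the ``blk_state`` and ``alloc_action`` enums and the ``classify_range()``
helper are hypothetical, the range is modelled as one state per filesystem
block, append writes are not modelled, and physical contiguity is ignored.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical per-block mapping states for the toy model. */
    enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

    /* Hypothetical action names, one per non-append case in the list. */
    enum alloc_action {
        ACT_NONE,             /* fully mapped: nothing to do */
        ACT_ALLOC_UNWRITTEN,  /* entirely a hole: create an unwritten extent */
        ACT_SPLIT_UNWRITTEN,  /* within an unwritten extent: split if larger */
        ACT_ZERO_AND_CONVERT  /* mixed: zero out, convert to one mapped extent */
    };

    static enum alloc_action classify_range(const enum blk_state *blk, size_t n)
    {
        size_t i;

        assert(blk != NULL || n == 0);
        if (n == 0)
            return ACT_NONE;

        /* Any change of state inside the range makes it a mixed mapping. */
        for (i = 1; i < n; i++)
            if (blk[i] != blk[0])
                return ACT_ZERO_AND_CONVERT;

        switch (blk[0]) {
        case BLK_MAPPED:
            return ACT_NONE;
        case BLK_HOLE:
            return ACT_ALLOC_UNWRITTEN;
        default:
            return ACT_SPLIT_UNWRITTEN;
        }
    }

In this model only the mixed case requires zeroing, which matches the last
item of the list; the other cases leave the range backed by a single kind of
extent.
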
Note: Writing on a single contiguous underlying extent, whether mapped or
unwritten, is not inherently problematic. However, writing to a mixed mapping
region (i.e. one containing a combination of mapped and unwritten extents)
must be avoided when performing atomic writes.

The reason is that an atomic write issued via pwritev2() with the RWF_ATOMIC
flag requires that either all of the data is written or none of it is. In the
event of a system crash or unexpected power loss during the write operation,
the affected region (when later read) must reflect either the complete old data
or the complete new data, but never a mix of both.

To enforce this guarantee, we ensure that the write target is backed by
a single, contiguous extent before any data is written. This is critical because
ext4 defers the conversion of unwritten extents to written extents until the I/O
completion path (typically in ->end_io()). If a write is allowed to proceed over
a mixed mapping region (with mapped and unwritten extents) and a failure occurs
mid-write, the system could observe partially updated regions after reboot, i.e.
new data over mapped areas, and stale (old) data over unwritten extents that
were never marked written. This violates the atomicity (torn-write prevention)
guarantee.

To prevent such torn writes, ext4 proactively allocates a single contiguous
extent for the entire requested region in ``ext4_iomap_alloc()`` via
``ext4_map_blocks_atomic()``. EXT4 also forces a commit of the current journal
transaction if the allocation is done over a mixed mapping. This ensures that
any pending metadata updates (like unwritten to written extent conversion) in
this range are in a consistent state with the file data blocks before the
actual write I/O is performed. If the commit fails, the whole I/O must be
aborted to prevent any possible torn writes.
Only after this step is the actual data write operation performed by iomap.

Handling Split Extents Across Leaf Blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There can be a special edge case where logically and physically contiguous
extents are stored in separate leaf nodes of the on-disk extent tree.
This occurs because on-disk extent tree merges only happen within a leaf
block, except for the case of a 2-level tree, which can get merged and
collapsed entirely into the inode.
If such a layout exists and, in the worst case, the extent status cache entries
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
a single contiguous extent for these split leaf extents.

To address this edge case, a new get block flag,
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
``ext4_map_query_blocks()`` lookup behavior.

This new get block flag allows ``ext4_map_blocks()`` to first check if there is
an entry in the extent status cache for the full range.
If not present, it consults the on-disk extent tree using
``ext4_map_query_blocks()``.
If the located extent is at the end of a leaf node, it probes the next logical
block (lblk) to detect a contiguous extent in the adjacent leaf.

For now only one additional leaf block is queried to maintain efficiency, as
atomic writes are typically constrained to small sizes
(e.g. [blocksize, clustersize]).

Handling Journal Transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support multi-fsblock atomic writes, we ensure enough journal credits are
reserved during:

 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
    could be a mixed mapping for the underlying requested range. If yes, then we
    reserve credits of up to ``m_len``, assuming every alternate block can be
    an unwritten extent followed by a hole.

 2. The ``->end_io()`` call, where we make sure a single transaction is started
    for doing the unwritten-to-written conversion. The conversion loop is
    mainly required to handle a split extent across leaf blocks.

How to
------

Creating Filesystems with Atomic Write Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First check the atomic write units supported by the block device.
See :ref:`atomic_write_bdev_support` for more details.

For single-fsblock atomic writes with a larger block size
(on systems with block size < page size):

.. code-block:: bash

    # Create an ext4 filesystem with a 16KB block size
    # (requires page size >= 16KB)
    mkfs.ext4 -b 16384 /dev/device

For multi-fsblock atomic writes with bigalloc:

.. code-block:: bash

    # Create an ext4 filesystem with bigalloc and 64KB cluster size
    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device

Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
and ``-O bigalloc`` enables the bigalloc feature.

Application Interface
~~~~~~~~~~~~~~~~~~~~~

Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
to perform atomic writes:

.. code-block:: c

    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.

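The following is a minimal userspace sketch of issuing such a write. The
helpers ``atomic_write_args_ok()`` and ``atomic_pwrite()`` are hypothetical
names, not part of any library. The pre-check approximates the rules that, to
our understanding, ``generic_atomic_write_valid()`` applies at the time of
writing (power-of-two length, offset aligned to that length); the kernel
remains the authority. The sketch assumes ``fd`` was opened with ``O_DIRECT``,
that ``buf`` is suitably aligned for direct I/O, and that ``unit_min`` and
``unit_max`` were obtained from ``statx()``.

.. code-block:: c

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE             /* for pwritev2() */
    #endif
    #include <assert.h>
    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040   /* fallback for older userspace headers */
    #endif

    /* Userspace approximation of the argument checks; hypothetical helper. */
    static bool atomic_write_args_ok(off_t offset, size_t len,
                                     uint32_t unit_min, uint32_t unit_max)
    {
        /* Length must be a nonzero power of two. */
        if (len == 0 || (len & (len - 1)) != 0)
            return false;
        /* Length must lie within the advertised atomic write unit range. */
        if (len < unit_min || len > unit_max)
            return false;
        /* Offset must be naturally aligned to the length. */
        if (offset < 0 || ((uint64_t)offset & (len - 1)) != 0)
            return false;
        return true;
    }

    /* Issue one atomic write from a single buffer; returns -1 on error. */
    static ssize_t atomic_pwrite(int fd, void *buf, size_t len, off_t offset,
                                 uint32_t unit_min, uint32_t unit_max)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        assert(buf != NULL);
        if (!atomic_write_args_ok(offset, len, unit_min, unit_max)) {
            errno = EINVAL;
            return -1;
        }
        /* A single iovec is used, matching a segments limit of one. */
        return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
    }

A caller would typically allocate ``buf`` with ``posix_memalign()`` and treat a
short or failed write as "nothing written", retrying or reporting the error as
appropriate.
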
The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` mask provides the
following details:

 * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
 * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
 * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
   separate memory buffers that can be gathered into a write operation
   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.

The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``stx_attributes`` is set if atomic
writes are supported.

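A minimal sketch of querying these fields from userspace is shown below. The
``aw_info`` structure and the ``query_atomic_write()`` helper are hypothetical
names used only for illustration. The sketch assumes a C library that provides
the ``statx()`` wrapper, and it compiles to a stub that fails with ``ENOSYS``
when the installed headers do not define ``STATX_WRITE_ATOMIC``.

.. code-block:: c

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE             /* for statx() */
    #endif
    #include <assert.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Hypothetical result structure for this sketch. */
    struct aw_info {
        int supported;              /* STATX_ATTR_WRITE_ATOMIC was set */
        uint32_t unit_min;
        uint32_t unit_max;
        uint32_t segments_max;
    };

    /* Returns 0 on success, -1 on error (errno set). */
    static int query_atomic_write(const char *path, struct aw_info *out)
    {
        assert(path != NULL && out != NULL);
        memset(out, 0, sizeof(*out));

    #if defined(STATX_WRITE_ATOMIC) && defined(STATX_ATTR_WRITE_ATOMIC)
        struct statx stx;

        if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0)
            return -1;

        /* Only trust the fields if the kernel reports them as filled in. */
        if (stx.stx_mask & STATX_WRITE_ATOMIC) {
            out->unit_min = stx.stx_atomic_write_unit_min;
            out->unit_max = stx.stx_atomic_write_unit_max;
            out->segments_max = stx.stx_atomic_write_segments_max;
        }
        out->supported = !!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC);
        return 0;
    #else
        /* Headers too old to know about atomic write fields. */
        errno = ENOSYS;
        return -1;
    #endif
    }

The returned ``unit_min`` and ``unit_max`` values can then be used to size and
align requests before calling ``pwritev2()`` with ``RWF_ATOMIC``.
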
.. _atomic_write_bdev_support:

Hardware Support
----------------

The underlying storage device must support atomic write operations.
Modern NVMe and SCSI devices often provide this capability.
The Linux kernel exposes this information through sysfs:

* ``/sys/block/<device>/queue/atomic_write_unit_min_bytes`` - Minimum atomic write size
* ``/sys/block/<device>/queue/atomic_write_unit_max_bytes`` - Maximum atomic write size

Nonzero values for these attributes indicate that the device supports
atomic writes.

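The check can also be done programmatically. The sketch below reads one queue
attribute from sysfs; ``read_queue_attr()`` is a hypothetical helper, and the
attribute name is passed in by the caller because its exact spelling should be
confirmed against ``Documentation/ABI/stable/sysfs-block`` for the running
kernel.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /*
     * Read /sys/block/<dev>/queue/<attr> as an unsigned integer.
     * Returns 0 on success, -1 if the attribute is missing or unparsable.
     */
    static int read_queue_attr(const char *dev, const char *attr,
                               unsigned long long *val)
    {
        char path[256];
        FILE *f;
        int n, ok;

        assert(dev != NULL && attr != NULL && val != NULL);

        n = snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        if (n < 0 || (size_t)n >= sizeof(path))
            return -1;

        f = fopen(path, "r");
        if (f == NULL)
            return -1;

        ok = (fscanf(f, "%llu", val) == 1);
        fclose(f);
        return ok ? 0 : -1;
    }

A caller would read both the minimum and the maximum attribute and treat the
device as capable of atomic writes only when both values are nonzero.
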
See Also
--------

* :doc:`bigalloc` - Documentation on the bigalloc feature
* :doc:`allocators` - Documentation on block allocation in ext4
* Support for atomic block writes in 6.13:
  https://lwn.net/Articles/1009298/