.. SPDX-License-Identifier: GPL-2.0
.. _atomic_writes:

Atomic Block Writes
-------------------------

Introduction
~~~~~~~~~~~~

Atomic (untorn) block writes ensure that either the entire write is committed
to disk or none of it is. This prevents "torn writes" during power loss or
system crashes. The ext4 filesystem supports atomic writes (only with Direct
I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:

1. **Single-fsblock Atomic Writes**:
   EXT4 has supported atomic write operations on a single filesystem block
   since v6.13. In this mode the atomic write unit minimum and maximum sizes
   are both set to the filesystem block size. For example, a 16KB atomic write
   is possible with a 16KB filesystem block size on a system with a 64KB page
   size.

2. **Multi-fsblock Atomic Writes with Bigalloc**:
   EXT4 also supports atomic writes spanning multiple filesystem blocks
   using a feature known as bigalloc. The atomic write unit's minimum and
   maximum sizes are determined by the filesystem block size and cluster size,
   based on the underlying device’s supported atomic write unit limits.

Requirements
~~~~~~~~~~~~

Basic requirements for atomic writes in ext4:

 1. The extents feature must be enabled (default for ext4)
 2. The underlying block device must support atomic writes
 3. For single-fsblock atomic writes:

    1. A filesystem with appropriate block size (up to the page size)
 4. For multi-fsblock atomic writes:

    1. The bigalloc feature must be enabled
    2. The cluster size must be appropriately configured

NOTE: EXT4 does not support software or COW based atomic writes, which means
atomic writes on ext4 are only supported if the underlying storage device
supports them.

Multi-fsblock Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters. With bigalloc, each bit within the block bitmap
represents a cluster (a power-of-2 number of blocks) rather than an individual
filesystem block.
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the fs
block size and the minimum hardware atomic write unit; the maximum atomic
write size is the smaller of the bigalloc cluster size and the maximum hardware
atomic write unit. Bigalloc ensures that all allocations are aligned to the
cluster size, which satisfies the LBA alignment requirements of the hardware
device if the start of the partition/logical volume is itself aligned correctly.
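As a rough illustration of the constraints above, the following user-space
sketch derives the atomic write unit range from the filesystem block size, the
bigalloc cluster size and the hardware limits. The ``awu_limits`` structure and
the ``sketch_awu_limits()`` helper are hypothetical names used only for this
example and are not kernel functions. Treating an empty range as "unsupported"
is an assumption of the sketch.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical container for the resulting atomic write unit range. */
    struct awu_limits {
        uint32_t min;   /* smallest atomic write size, in bytes */
        uint32_t max;   /* largest atomic write size, in bytes */
    };

    /*
     * Illustrative only: derive the atomic write unit range as described
     * in the text. Returns false when the hardware reports no support or
     * when the resulting range is empty (an assumption of this sketch).
     */
    static bool sketch_awu_limits(uint32_t blocksize, uint32_t clustersize,
                                  uint32_t hw_min, uint32_t hw_max,
                                  struct awu_limits *out)
    {
        if (hw_min == 0 || hw_max == 0)
            return false;   /* device has no atomic write support */

        /* Minimum: larger of fs block size and hardware minimum unit. */
        out->min = blocksize > hw_min ? blocksize : hw_min;
        /* Maximum: smaller of cluster size and hardware maximum unit. */
        out->max = clustersize < hw_max ? clustersize : hw_max;

        return out->min <= out->max;
    }

For example, with a 4KB block size, a 64KB cluster size and a device that
advertises units from 512 bytes to 16KB, this sketch yields a range of 4KB to
16KB.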

Here is the block allocation strategy in bigalloc for atomic writes:

 * For regions with fully mapped extents, no additional work is needed
 * For append writes, a new mapped extent is allocated
 * For regions that are entirely holes, an unwritten extent is created
 * For large unwritten extents, the extent gets split into two unwritten
   extents of appropriate requested size
 * For mixed mapping regions (combinations of holes, unwritten extents, or
   mapped extents), ext4_map_blocks() is called in a loop with the
   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
   mapped extent by writing zeroes to it and converting any unwritten extents
   found within the range to written.

Note: Writing on a single contiguous underlying extent, whether mapped or
unwritten, is not inherently problematic. However, writing to a mixed mapping
region (i.e. one containing a combination of mapped and unwritten extents)
must be avoided when performing atomic writes.
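The distinction between a uniform and a mixed mapping can be illustrated with a
small user-space model. The ``blk_state`` enum and the ``range_is_mixed()``
helper below are hypothetical and operate on a plain array of per-block states;
the kernel works on extents rather than on such an array, and this toy model
ignores physical contiguity.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-block mapping state for a toy model of a file range. */
    enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

    /*
     * Illustrative only: a range is "mixed" when its blocks do not all
     * share the same mapping state. An empty or single-block range is
     * never mixed.
     */
    static bool range_is_mixed(const enum blk_state *blks, size_t nr)
    {
        for (size_t i = 1; i < nr; i++)
            if (blks[i] != blks[0])
                return true;
        return false;
    }

In this model, a range of {mapped, unwritten, mapped} blocks is mixed and would
have to be converted into a single mapped extent before an atomic write, while
a range consisting only of unwritten blocks is not.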

The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
flag, require that either all data is written or none at all. In the event of
a system crash or unexpected power loss during the write operation, the affected
region (when later read) must reflect either the complete old data or the
complete new data, but never a mix of both.

To enforce this guarantee, we ensure that the write target is backed by
a single, contiguous extent before any data is written. This is critical because
ext4 defers the conversion of unwritten extents to written extents until the I/O
completion path (typically in ->end_io()). If a write is allowed to proceed over
a mixed mapping region (with mapped and unwritten extents) and a failure occurs
mid-write, the system could observe partially updated regions after reboot, i.e.
new data over mapped areas, and stale (old) data over unwritten extents that
were never marked written. This violates the atomicity and/or torn write
prevention guarantee.

To prevent such torn writes, ext4 proactively allocates a single contiguous
extent for the entire requested region in ``ext4_iomap_alloc`` via
``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
transaction if the allocation is done over a mixed mapping. This ensures any
pending metadata updates (like unwritten to written extents conversion) in this
range are in a consistent state with the file data blocks before performing the
actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
any possible torn writes.
Only after this step is the actual data write operation performed by iomap.

Handling Split Extents Across Leaf Blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There can be a special edge case where we have logically and physically
contiguous extents stored in separate leaf nodes of the on-disk extent tree.
This occurs because on-disk extent tree merges only happen within a leaf
block, except for the case where a 2-level tree can get merged and
collapsed entirely into the inode.
If such a layout exists and, in the worst case, the extent status cache entries
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
a single contiguous extent for these split leaf extents.

To address this edge case, a new get block flag,
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
``ext4_map_query_blocks()`` lookup behavior.

This new get block flag allows ``ext4_map_blocks()`` to first check if there is
an entry in the extent status cache for the full range.
If not present, it consults the on-disk extent tree using
``ext4_map_query_blocks()``.
If the located extent is at the end of a leaf node, it probes the next logical
block (lblk) to detect a contiguous extent in the adjacent leaf.
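What "logically and physically contiguous" means for two extents in
neighbouring leaves can be sketched as follows. The ``toy_extent`` structure
and the ``extents_contiguous()`` helper are hypothetical simplifications for
illustration and do not reflect the on-disk extent format. Requiring the same
written/unwritten state is an assumption of this sketch.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical, simplified view of an extent (not the on-disk format). */
    struct toy_extent {
        uint32_t lblk;      /* first logical block */
        uint32_t len;       /* length in blocks */
        uint64_t pblk;      /* first physical block */
        bool unwritten;     /* true if the extent is unwritten */
    };

    /*
     * Illustrative only: "next" continues "cur" if it starts right after
     * it both logically and physically and has the same written state.
     */
    static bool extents_contiguous(const struct toy_extent *cur,
                                   const struct toy_extent *next)
    {
        return (uint64_t)cur->lblk + cur->len == next->lblk &&
               cur->pblk + cur->len == next->pblk &&
               cur->unwritten == next->unwritten;
    }

In this model, an extent covering logical blocks 0-7 at physical block 1000 is
continued by one covering logical blocks 8-15 at physical block 1008, even if
the two are stored in different leaf blocks.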

For now only one additional leaf block is queried to maintain efficiency, as
atomic writes are typically constrained to small sizes
(e.g. [blocksize, clustersize]).


Handling Journal Transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support multi-fsblock atomic writes, we ensure enough journal credits are
reserved during:

 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
    could be a mixed mapping for the underlying requested range. If yes, then we
    reserve credits of up to ``m_len``, assuming every alternate block can be
    an unwritten extent followed by a hole.

 2. The ``->end_io()`` call. We make sure a single transaction is started for
    doing the unwritten-to-written conversion. The conversion loop is mainly
    required only to handle a split extent across leaf blocks.

How to
------

Creating Filesystems with Atomic Write Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First check the atomic write units supported by the block device.
See :ref:`atomic_write_bdev_support` for more details.
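One way to perform this check programmatically is to read the sysfs attributes
listed there. The sketch below is a minimal user-space helper;
``read_queue_limit()`` is a hypothetical name, and the path passed to it is
supplied by the caller.

.. code-block:: c

    #include <stdio.h>

    /*
     * Illustrative only: read a single unsigned decimal value from a sysfs
     * attribute such as /sys/block/<device>/queue/atomic_write_unit_max.
     * Returns 0 on success and -1 if the file cannot be opened or parsed.
     */
    static int read_queue_limit(const char *path, unsigned long *val)
    {
        FILE *f = fopen(path, "r");
        int ok;

        if (!f)
            return -1;
        ok = fscanf(f, "%lu", val) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }

A caller would pass the ``atomic_write_unit_min`` and ``atomic_write_unit_max``
paths of the target device and treat a value of zero as lack of atomic write
support.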

For single-fsblock atomic writes with a larger block size
(on systems with block size < page size):

.. code-block:: bash

    # Create an ext4 filesystem with a 16KB block size
    # (requires page size >= 16KB)
    mkfs.ext4 -b 16384 /dev/device

For multi-fsblock atomic writes with bigalloc:

.. code-block:: bash

    # Create an ext4 filesystem with bigalloc and 64KB cluster size
    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device

Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
and ``-O bigalloc`` enables the bigalloc feature.

Application Interface
~~~~~~~~~~~~~~~~~~~~~

Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
to perform atomic writes:

.. code-block:: c

    /* fd must refer to a regular file opened with O_DIRECT */
    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.
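Before issuing the write, an application may want to sanity check its
arguments against the atomic write unit limits of the file. The helper below is
a hypothetical user-space sketch, not the kernel's
``generic_atomic_write_valid()``. It assumes, in line with the ``RWF_ATOMIC``
description in the ``pwritev2()`` manual page, that the total length must be a
power of two within the supported unit range and that the offset must be
naturally aligned to that length.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustrative only: check the offset and length of a planned atomic
     * write against the minimum and maximum atomic write unit sizes.
     */
    static bool atomic_write_args_ok(uint64_t offset, uint64_t len,
                                     uint32_t unit_min, uint32_t unit_max)
    {
        if (len == 0 || (len & (len - 1)) != 0)
            return false;   /* length must be a power of two */
        if (len < unit_min || len > unit_max)
            return false;   /* length outside the supported range */
        return (offset & (len - 1)) == 0;   /* offset aligned to length */
    }

Such a check only catches obviously invalid requests early; the kernel remains
the authority and may still reject the write.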

The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
following details:

 * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
 * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
 * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
   separate memory buffers that can be gathered into a write operation
   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.

The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic
writes are supported.

.. _atomic_write_bdev_support:

Hardware Support
----------------

The underlying storage device must support atomic write operations.
Modern NVMe and SCSI devices often provide this capability.
The Linux kernel exposes this information through sysfs:

* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size

Nonzero values for these attributes indicate that the device supports
atomic writes.

See Also
--------

* :doc:`bigalloc` - Documentation on the bigalloc feature
* :doc:`allocators` - Documentation on block allocation in ext4
* Support for atomic block writes in 6.13:
  https://lwn.net/Articles/1009298/