.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling common file operations.
The library has two layers:

 1. A lower layer that provides an iterator over ranges of file offsets.
    This layer tries to obtain mappings of each file range to storage
    from the filesystem, but the storage information is not necessarily
    required.

 2. An upper layer that acts upon the space mappings provided by the
    lower layer iterator.

The iteration can involve mappings of a file's logical offset ranges to
physical extents, but the storage layer information is not necessarily
required, e.g. for walking cached file information.
The library exports various APIs for implementing file operations such
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origin of this library is the file I/O path that XFS once used; it
has now been extended to cover several other operations.

Who Should Read This?
=====================

The target audience for this document is filesystem, storage, and
pagecache programmers and code reviewers.

If you are working on PCI, machine architectures, or device drivers, you
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model which breaks file I/O into small
units (generally memory pages or blocks) and looks up space mappings on
the basis of that unit, the iomap model asks the filesystem for the
largest space mappings that it can create for a given file operation and
initiates operations on that basis.
This strategy improves the filesystem's visibility into the size of the
operation being performed, which enables it to combat fragmentation with
larger space allocations when possible.
Larger space mappings improve runtime performance by amortizing the cost
of mapping function calls into the filesystem across a larger amount of
data.

At a high level, an iomap operation `looks like this
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to (1) above, if necessary.
         So far only the pagecache operations need to do this.

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``, if necessary

Each iomap operation will be covered in more detail below.
This library was covered previously by an `LWN article
<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
<https://kernelnewbies.org/KernelProjects/iomap>`_.

The goal of this document is to provide a brief discussion of the
design and capabilities of iomap, followed by a more detailed catalog
of the interfaces presented by iomap.
If you change iomap, please update this design document.

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the old buffer cache.

 * ``fsblock``: The block size of a file, also known as ``i_blocksize``.

 * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
   Processes hold this in shared mode to read file state and contents.
   Some filesystems may allow shared mode for writes.
   Processes often hold this in exclusive mode to change file state and
   contents.

 * ``invalidate_lock``: The pagecache ``struct address_space``
   rwsemaphore that protects against folio insertion and removal for
   filesystems that support punching out folios below EOF.
   Processes wishing to insert folios must hold this lock in shared
   mode to prevent removal, though concurrent insertion is allowed.
   Processes wishing to remove folios must hold this lock in exclusive
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
   device pre-shutdown hook from returning before other threads have
   released resources.

 * **filesystem mapping lock**: This synchronization primitive is
   internal to the filesystem and must protect the file mapping data
   from updates while a mapping is being sampled.
   The filesystem author must determine how this coordination should
   happen; it does not need to be an actual lock.

 * **iomap internal operation lock**: This is a general term for
   synchronization primitives that iomap functions take while holding a
   mapping.
   A specific example would be taking the folio lock while reading or
   writing the pagecache.

 * **pure overwrite**: A write operation that does not require any
   metadata or zeroing operations to perform during either submission
   or completion.
   This implies that the filesystem must have already allocated space
   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
   constraints on IO alignment or size.
   The only constraints on I/O alignment are device level (minimum I/O
   size and alignment, typically sector size).

``struct iomap``
----------------

The filesystem communicates to the iomap iterator the mapping of
byte ranges of a file to byte ranges of a storage device with the
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the range of file offsets, in
   bytes, covered by this mapping.
   These fields must always be set by the filesystem.

 * ``type`` describes the type of the space mapping:

   * **IOMAP_HOLE**: No storage has been allocated.
     This type must never be returned in response to an ``IOMAP_WRITE``
     operation because writes must allocate and map space, and return
     the mapping.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
     iomap does not support writing (whether via pagecache or direct
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
     ("delayed allocation").
     If the filesystem returns IOMAP_F_NEW here and the write fails, the
     ``->iomap_end`` function must delete the reservation.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

   * **IOMAP_MAPPED**: The file range maps to specific space on the
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.

   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
     storage device, but the space has not yet been initialized.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.
     Reads from this type of mapping will return zeroes to the caller.
     For a write or writeback operation, the ioend should update the
     mapping to MAPPED.
     Refer to the sections about ioends for more details.

   * **IOMAP_INLINE**: The file range maps to the memory buffer
     specified by ``inline_data``.
     For a write operation, the ``->iomap_end`` function presumably
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

 * ``flags`` describe the status of the space mapping.
   These flags should be set by the filesystem in ``->iomap_begin``:

   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
     Areas that will not be written to must be zeroed.
     If a write fails and the mapping is a space reservation, the
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
     to access any data written.
     fdatasync is required to commit these changes to persistent
     storage.
     This needs to take into account metadata changes that *may* be made
     at I/O completion, such as file size updates from direct I/O.

   * **IOMAP_F_SHARED**: The space under the mapping is shared.
     Copy on write is necessary to avoid corrupting other file data.

   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must not be
     merged with any other I/O or completion. Filesystems must use this when
     submitting I/O to devices that cannot handle I/O crossing certain LBAs
     (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all
     other functions ignore it.

   * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.

   * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
     block assigned to it yet and the file system will do that in the bio
     submission handler, splitting the I/O as needed.

   * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted with the
     ``REQ_ATOMIC`` flag set in the bio. Filesystems need to set this flag to
     inform iomap that the write I/O operation requires torn-write protection
     based on a HW-offload mechanism. They must also ensure that mapping updates
     upon the completion of the I/O are performed in a single metadata
     update.

   These flags can be set by iomap itself during file operations.
   The filesystem should supply an ``->iomap_end`` function if it needs
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found to be stale.
     iomap will call ``->iomap_end`` on this mapping and then
     ``->iomap_begin`` to obtain a new mapping.

   Currently, these flags are only set by pagecache operations.

 * ``addr`` describes the device address, in bytes.

 * ``bdev`` describes the block device for this mapping.
   This only needs to be set for mapped or unwritten operations.

 * ``dax_dev`` describes the DAX device for this mapping.
   This only needs to be set for mapped or unwritten operations, and
   only for a fsdax operation.

 * ``inline_data`` points to a memory buffer for I/O involving
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping types.

 * ``private`` is a pointer to `filesystem-private information
   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
   This value will be passed unchanged to ``->iomap_end``.

 * ``validity_cookie`` is a magic freshness value set by the filesystem
   that should be used to detect stale mappings.
   For pagecache operations this is critical for correct operation
   because page faults can occur, which implies that filesystem locks
   should not be held between ``->iomap_begin`` and ``->iomap_end``.
   Filesystems with completely static mappings need not set this value.
   Only pagecache operations revalidate mappings; see the section about
   ``iomap_valid`` for details.

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem to pass an operations
structure to obtain a mapping and (optionally) to release the mapping:

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obtain one file mapping for
the range of bytes specified by ``pos`` and ``length`` for the file
``inode``.
This mapping should be returned through the ``iomap`` pointer.
The mapping must cover at least the first byte of the supplied file
range, but it does not need to cover the entire requested range.

Each iomap operation describes the requested operation through the
``flags`` argument.
The exact value of ``flags`` will be documented in the
operation-specific sections below.
These flags can, at least in principle, apply generally to iomap
operations:

 * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
   effort attempt to avoid any operation that would result in blocking
   the submitting task.
   This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
   intended for asynchronous applications to keep doing other work
   instead of waiting for the specific unavailable filesystem resource
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
   trylock algorithms.
   They need to be able to satisfy the entire I/O request range with a
   single iomap mapping.
   They need to avoid reading or writing metadata synchronously.
   They need to avoid blocking memory allocations.
   They need to avoid waiting on transaction reservations to allow
   modifications to take place.
   They probably should not be allocating new space.
   And so on.
   If there is any doubt in the filesystem developer's mind as to
   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
   then they should return ``-EAGAIN`` as early as possible rather than
   start the operation and force the submitting task to block.
   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
   ``RWF_NOWAIT``.

 * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform a
   buffered file I/O and would like the kernel to drop the pagecache
   after the I/O completes, if it isn't already being used by another
   thread.

If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
information via ``srcmap``.
Only pagecache and fsdax operations support reading from one mapping and
writing to another.

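
The ``->iomap_begin`` contract described above can be modeled in
userspace.
The following is a minimal sketch, not kernel code.
``struct toy_iomap``, ``struct toy_extent``, the extent table,
``toy_iomap_begin``, and ``toy_count_mapped`` are all hypothetical
names invented for illustration, and the ``inode``, ``flags``, and
``srcmap`` arguments are omitted.
It assumes a sorted, non-overlapping extent table, returns one mapping
that covers at least the byte at ``pos``, and shows a caller that
walks a file range one mapping at a time.

.. code-block:: c

 #include <assert.h>
 #include <errno.h>
 #include <stddef.h>
 #include <stdint.h>

 /* Userspace stand-ins for kernel definitions; the values are made up. */
 enum { TOY_IOMAP_HOLE, TOY_IOMAP_MAPPED };
 #define TOY_NULL_ADDR UINT64_MAX

 struct toy_iomap {
     uint64_t addr;      /* device address, in bytes */
     int64_t  offset;    /* file offset, in bytes */
     uint64_t length;    /* length of the mapping, in bytes */
     uint16_t type;
 };

 /* Hypothetical per-file extent table: sorted and non-overlapping. */
 struct toy_extent {
     int64_t  file_off;
     uint64_t len;
     uint64_t dev_addr;
 };

 static const struct toy_extent toy_extents[] = {
     { 0,    4096, 1048576 },
     { 8192, 8192, 2097152 },
 };
 #define TOY_NR_EXTENTS (sizeof(toy_extents) / sizeof(toy_extents[0]))

 /* Return one mapping that covers at least the byte at pos. */
 static int toy_iomap_begin(int64_t pos, int64_t length,
                            struct toy_iomap *iomap)
 {
     int64_t end = pos + length;
     size_t i;

     assert(iomap != NULL);
     if (pos < 0 || length <= 0)
         return -EINVAL;

     for (i = 0; i < TOY_NR_EXTENTS; i++) {
         const struct toy_extent *ex = &toy_extents[i];
         int64_t ex_end = ex->file_off + (int64_t)ex->len;

         if (pos >= ex_end)
             continue;           /* extent lies entirely before pos */
         if (pos >= ex->file_off) {
             /* pos is inside this extent: report the whole extent */
             iomap->type = TOY_IOMAP_MAPPED;
             iomap->offset = ex->file_off;
             iomap->length = ex->len;
             iomap->addr = ex->dev_addr;
             return 0;
         }
         /* pos is in a hole that ends where this extent begins */
         if (ex->file_off < end)
             end = ex->file_off;
         break;
     }

     iomap->type = TOY_IOMAP_HOLE;
     iomap->offset = pos;
     iomap->length = (uint64_t)(end - pos);
     iomap->addr = TOY_NULL_ADDR;
     return 0;
 }

 /* Walk [pos, pos + length) one mapping at a time; count mapped bytes. */
 static int64_t toy_count_mapped(int64_t pos, int64_t length)
 {
     int64_t end = pos + length, mapped = 0;

     while (pos < end) {
         struct toy_iomap iomap;
         int64_t map_end, chunk;
         int ret = toy_iomap_begin(pos, end - pos, &iomap);

         if (ret)
             return ret;
         /* The mapping may extend past the range; trim it here. */
         map_end = iomap.offset + (int64_t)iomap.length;
         chunk = (map_end < end ? map_end : end) - pos;
         if (iomap.type == TOY_IOMAP_MAPPED)
             mapped += chunk;
         pos += chunk;
     }
     return mapped;
 }

With the table above, ``toy_count_mapped(0, 16384)`` visits a mapped
extent, a hole, and a second mapped extent, and returns 12288.
In this sketch a mapped result may start before ``pos`` and end after
the requested range, so the caller trims each mapping to the range it
is working on.
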

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end`` function, if present,
is called to signal that iomap is finished with a mapping.
Typically, implementations will use this function to tear down any
context that was set up in ``->iomap_begin``.
For example, a write might wish to commit the reservations for the bytes
that were operated upon and unreserve any space that was not operated
upon.
``written`` might be zero if no bytes were touched.
``flags`` will contain the same value passed to ``->iomap_begin``.
iomap ops for reads are not likely to need to supply this function.

Both functions should return a negative errno code on error, or zero on
success.

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to check input parameters
and file state before initiating an I/O operation.
It does not handle obtaining filesystem freeze protection, updating of
timestamps, stripping privileges, or access control.

Locking Hierarchy
=================

iomap requires that filesystems supply their own locking model.
There are three categories of synchronization primitives, as far as
iomap is concerned:

 * The **upper** level primitive is provided by the filesystem to
   coordinate access to different iomap operations.
   The exact primitive is specific to the filesystem and operation,
   but is often a VFS inode, pagecache invalidation, or folio lock.
   For example, a filesystem might take ``i_rwsem`` before calling
   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
   these two file operations from clobbering each other.
   Pagecache writeback may lock a folio to prevent other threads from
   accessing the folio until writeback is underway.

 * The **lower** level primitive is taken by the filesystem in the
   ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
   access to the file space mapping information.
   The fields of the iomap object should be filled out while holding
   this primitive.
   The upper level synchronization primitive, if any, remains held
   while acquiring the lower level synchronization primitive.
   For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
   while sampling mappings.
   Filesystems with immutable mapping information may not require
   synchronization here.

 * The **operation** primitive is taken by an iomap operation to
   coordinate access to its own internal data structures.
   The upper level synchronization primitive, if any, remains held
   while acquiring this primitive.
   The lower level primitive is not held while acquiring this
   primitive.
   For example, pagecache write operations will obtain a file mapping,
   then grab and lock a folio to copy new contents.
   It may also lock an internal folio state object to update metadata.

The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themselves.

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that IO should work the way it does on XFS.
 * Does iomap *actually* work for non-regular file data?

Patches welcome!