.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling common file operations.
The library has two layers:

 1. A lower layer that provides an iterator over ranges of file offsets.
    This layer tries to obtain mappings of each file range to storage
    from the filesystem, but the storage information is not necessarily
    required.

 2. An upper layer that acts upon the space mappings provided by the
    lower layer iterator.

The iteration can involve mappings of a file's logical offset ranges to
physical extents, but the storage layer information is not necessarily
required, e.g. for walking cached file information.
The library exports various APIs for implementing file operations such
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origin of this library is the file I/O path that XFS once used; it
has now been extended to cover several other operations.

Who Should Read This?
=====================

The target audience for this document is filesystem, storage, and
pagecache programmers and code reviewers.

If you are working on PCI, machine architectures, or device drivers, you
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model which breaks file I/O into small
units (generally memory pages or blocks) and looks up space mappings on
the basis of that unit, the iomap model asks the filesystem for the
largest space mappings that it can create for a given file operation and
initiates operations on that basis.
This strategy improves the filesystem's visibility into the size of the
operation being performed, which enables it to combat fragmentation with
larger space allocations when possible.
Larger space mappings improve runtime performance by amortizing the cost
of mapping function calls into the filesystem across a larger amount of
data.

At a high level, an iomap operation `looks like this
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to (1) above, if necessary.
         So far only the pagecache operations need to do this.

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``, if necessary

Each iomap operation will be covered in more detail below.
This library was covered previously by an `LWN article
<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
<https://kernelnewbies.org/KernelProjects/iomap>`_.

The goal of this document is to provide a brief discussion of the
design and capabilities of iomap, followed by a more detailed catalog
of the interfaces presented by iomap.
If you change iomap, please update this design document.

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the old buffer cache.

 * ``fsblock``: The block size of a file, also known as ``i_blocksize``.

 * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
   Processes hold this in shared mode to read file state and contents.
   Some filesystems may allow shared mode for writes.
   Processes often hold this in exclusive mode to change file state and
   contents.

 * ``invalidate_lock``: The pagecache ``struct address_space``
   rwsemaphore that protects against folio insertion and removal for
   filesystems that support punching out folios below EOF.
   Processes wishing to insert folios must hold this lock in shared
   mode to prevent removal, though concurrent insertion is allowed.
   Processes wishing to remove folios must hold this lock in exclusive
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
   device pre-shutdown hook from returning before other threads have
   released resources.

 * **filesystem mapping lock**: This synchronization primitive is
   internal to the filesystem and must protect the file mapping data
   from updates while a mapping is being sampled.
   The filesystem author must determine how this coordination should
   happen; it does not need to be an actual lock.

 * **iomap internal operation lock**: This is a general term for
   synchronization primitives that iomap functions take while holding a
   mapping.
   A specific example would be taking the folio lock while reading or
   writing the pagecache.

 * **pure overwrite**: A write operation that does not require any
   metadata or zeroing operations to perform during either submission
   or completion.
   This implies that the filesystem must have already allocated space
   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
   constraints on IO alignment or size.
   The only constraints on I/O alignment are device level (minimum I/O
   size and alignment, typically sector size).

``struct iomap``
----------------

The filesystem communicates to the iomap iterator the mapping of
byte ranges of a file to byte ranges of a storage device with the
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the range of file offsets, in
   bytes, covered by this mapping.
   These fields must always be set by the filesystem.

 * ``type`` describes the type of the space mapping:

   * **IOMAP_HOLE**: No storage has been allocated.
     This type must never be returned in response to an ``IOMAP_WRITE``
     operation because writes must allocate and map space, and return
     the mapping.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
     iomap does not support writing (whether via pagecache or direct
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
     ("delayed allocation").
     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails,
     the ``->iomap_end`` function must delete the reservation.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

   * **IOMAP_MAPPED**: The file range maps to specific space on the
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.

   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
     storage device, but the space has not yet been initialized.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.
     Reads from this type of mapping will return zeroes to the caller.
     For a write or writeback operation, the ioend should update the
     mapping to ``IOMAP_MAPPED``.
     Refer to the sections about ioends for more details.

   * **IOMAP_INLINE**: The file range maps to the memory buffer
     specified by ``inline_data``.
     For a write operation, the ``->iomap_end`` function presumably
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

 * ``flags`` describe the status of the space mapping.
   These flags should be set by the filesystem in ``->iomap_begin``:

   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
     Areas that will not be written to must be zeroed.
     If a write fails and the mapping is a space reservation, the
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
     to access any data written.
     fdatasync is required to commit these changes to persistent
     storage.
     This needs to take into account metadata changes that *may* be made
     at I/O completion, such as file size updates from direct I/O.

   * **IOMAP_F_SHARED**: The space under the mapping is shared.
     Copy on write is necessary to avoid corrupting other file data.

   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must
     not be merged with any other I/O or completion.
     Filesystems must use this when submitting I/O to devices that
     cannot handle I/O crossing certain LBAs (e.g. ZNS devices).
     This flag applies only to buffered I/O writeback; all other
     functions ignore it.

   * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.

   * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a
     target block assigned to it yet and the file system will do that in
     the bio submission handler, splitting the I/O as needed.

   * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted
     with the ``REQ_ATOMIC`` flag set in the bio.
     Filesystems need to set this flag to inform iomap that the write
     I/O operation requires torn-write protection based on a HW-offload
     mechanism.
     They must also ensure that mapping updates upon the completion of
     the I/O are performed in a single metadata update.

   These flags can be set by iomap itself during file operations.
   The filesystem should supply an ``->iomap_end`` function if it needs
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found to be stale.
     iomap will call ``->iomap_end`` on this mapping and then
     ``->iomap_begin`` to obtain a new mapping.

   Currently, these flags are only set by pagecache operations.

 * ``addr`` describes the device address, in bytes.

 * ``bdev`` describes the block device for this mapping.
   This only needs to be set for mapped or unwritten operations.

 * ``dax_dev`` describes the DAX device for this mapping.
   This only needs to be set for mapped or unwritten operations, and
   only for a fsdax operation.

 * ``inline_data`` points to a memory buffer for I/O involving
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping types.

 * ``private`` is a pointer to `filesystem-private information
   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
   This value will be passed unchanged to ``->iomap_end``.

 * ``folio_ops`` will be covered in the section on pagecache operations.

 * ``validity_cookie`` is a magic freshness value set by the filesystem
   that should be used to detect stale mappings.
   For pagecache operations this is critical for correct operation
   because page faults can occur, which implies that filesystem locks
   should not be held between ``->iomap_begin`` and ``->iomap_end``.
   Filesystems with completely static mappings need not set this value.
   Only pagecache operations revalidate mappings; see the section about
   ``iomap_valid`` for details.

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem to pass an operations
structure to obtain a mapping and (optionally) to release the mapping:

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obtain one file mapping for
the range of bytes specified by ``pos`` and ``length`` for the file
``inode``.
This mapping should be returned through the ``iomap`` pointer.
The mapping must cover at least the first byte of the supplied file
range, but it does not need to cover the entire requested range.

Each iomap operation describes the requested operation through the
``flags`` argument.
The exact value of ``flags`` will be documented in the
operation-specific sections below.
These flags can, at least in principle, apply generally to iomap
operations:

 * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
   effort attempt to avoid any operation that would result in blocking
   the submitting task.
   This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
   intended for asynchronous applications to keep doing other work
   instead of waiting for the specific unavailable filesystem resource
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
   trylock algorithms.
   They need to be able to satisfy the entire I/O request range with a
   single iomap mapping.
   They need to avoid reading or writing metadata synchronously.
   They need to avoid blocking memory allocations.
   They need to avoid waiting on transaction reservations to allow
   modifications to take place.
   They probably should not be allocating new space.
   And so on.
   If there is any doubt in the filesystem developer's mind as to
   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
   then they should return ``-EAGAIN`` as early as possible rather than
   start the operation and force the submitting task to block.
   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
   ``RWF_NOWAIT``.

 * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform
   buffered file I/O and would like the kernel to drop the pagecache
   after the I/O completes, if it isn't already being used by another
   thread.

If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
information via ``srcmap``.
Only pagecache and fsdax operations support reading from one mapping and
writing to another.
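
The ``->iomap_begin`` contract described above can be illustrated with a
small userspace sketch.
This is not kernel code: ``struct toy_iomap``, ``toy_iomap_begin``,
``toy_walk``, and the extent table are hypothetical stand-ins invented
for this example, and the ``inode``, ``flags``, and ``srcmap`` arguments
are omitted.
It shows a mapping function that returns the single extent or hole
containing ``pos``, and a caller that keeps asking for mappings until
the requested range is consumed:

.. code-block:: c

 #include <assert.h>
 #include <errno.h>
 #include <stddef.h>
 #include <stdint.h>

 /* Userspace stand-ins for kernel definitions; the values are arbitrary. */
 enum { TOY_IOMAP_HOLE, TOY_IOMAP_MAPPED };
 #define TOY_IOMAP_NULL_ADDR UINT64_MAX

 /* Trimmed-down model of struct iomap with only the fields used here. */
 struct toy_iomap {
     uint64_t addr;      /* device address, in bytes */
     int64_t  offset;    /* file offset, in bytes */
     uint64_t length;    /* number of bytes covered */
     uint16_t type;
 };

 /* Hypothetical extent record: file range -> device address. */
 struct toy_extent {
     int64_t  file_off;
     uint64_t len;
     uint64_t disk_addr;
 };

 /* Sorted, non-overlapping extents with a hole at [4096, 8192). */
 static const struct toy_extent toy_extents[] = {
     { 0,    4096, 1048576 },
     { 8192, 8192, 2097152 },
 };
 #define TOY_NR_EXTENTS (sizeof(toy_extents) / sizeof(toy_extents[0]))

 static void toy_fill_hole(struct toy_iomap *iomap, int64_t pos, uint64_t len)
 {
     iomap->type = TOY_IOMAP_HOLE;
     iomap->addr = TOY_IOMAP_NULL_ADDR;
     iomap->offset = pos;
     iomap->length = len;
 }

 /* Return one mapping that covers at least the byte at pos. */
 static int toy_iomap_begin(int64_t pos, int64_t length, struct toy_iomap *iomap)
 {
     size_t i;

     if (pos < 0 || length <= 0)
         return -EINVAL;

     for (i = 0; i < TOY_NR_EXTENTS; i++) {
         const struct toy_extent *e = &toy_extents[i];

         if (pos < e->file_off) {
             /* pos is in a hole that ends where this extent begins. */
             toy_fill_hole(iomap, pos, (uint64_t)(e->file_off - pos));
             return 0;
         }
         if (pos < e->file_off + (int64_t)e->len) {
             /* Report the whole extent; the caller trims it. */
             iomap->type = TOY_IOMAP_MAPPED;
             iomap->addr = e->disk_addr;
             iomap->offset = e->file_off;
             iomap->length = e->len;
             return 0;
         }
     }
     /* Past the last extent: a hole covering the rest of the request. */
     toy_fill_hole(iomap, pos, (uint64_t)length);
     return 0;
 }

 /*
  * Walk [pos, pos + length) one mapping at a time.
  * Returns the number of mappings used, or a negative errno.
  */
 static int toy_walk(int64_t pos, int64_t length, uint64_t *mapped_bytes)
 {
     int nr_mappings = 0;

     *mapped_bytes = 0;
     while (length > 0) {
         struct toy_iomap iomap;
         int64_t chunk;
         int ret = toy_iomap_begin(pos, length, &iomap);

         if (ret)
             return ret;

         /* The mapping must cover the first byte of the range. */
         assert(iomap.offset <= pos);
         assert(pos < iomap.offset + (int64_t)iomap.length);

         /* Use only the part of the mapping inside the request. */
         chunk = iomap.offset + (int64_t)iomap.length - pos;
         if (chunk > length)
             chunk = length;
         if (iomap.type == TOY_IOMAP_MAPPED)
             *mapped_bytes += (uint64_t)chunk;

         pos += chunk;
         length -= chunk;
         nr_mappings++;
     }
     return nr_mappings;
 }

With the table above, walking the range [0, 16384) takes three calls to
``toy_iomap_begin``: a mapped extent, a hole, and a second mapped
extent, for 12288 mapped bytes in total.
Returning the largest mapping available, rather than one block at a
time, is what keeps the number of calls into the filesystem small.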

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end`` function, if present,
is called to signal that iomap is finished with a mapping.
Typically, implementations will use this function to tear down any
context that was set up in ``->iomap_begin``.
For example, a write might wish to commit the reservations for the bytes
that were operated upon and unreserve any space that was not operated
upon.
``written`` might be zero if no bytes were touched.
``flags`` will contain the same value passed to ``->iomap_begin``.
iomap ops for reads are not likely to need to supply this function.

Both functions should return a negative errno code on error, or zero on
success.

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to check input parameters
and file state before initiating an I/O operation.
It does not handle obtaining filesystem freeze protection, updating of
timestamps, stripping privileges, or access control.

Locking Hierarchy
=================

iomap requires that filesystems supply their own locking model.
There are three categories of synchronization primitives, as far as
iomap is concerned:

 * The **upper** level primitive is provided by the filesystem to
   coordinate access to different iomap operations.
   The exact primitive is specific to the filesystem and operation,
   but is often a VFS inode, pagecache invalidation, or folio lock.
   For example, a filesystem might take ``i_rwsem`` before calling
   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
   these two file operations from clobbering each other.
   Pagecache writeback may lock a folio to prevent other threads from
   accessing the folio until writeback is underway.

 * The **lower** level primitive is taken by the filesystem in the
   ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
   access to the file space mapping information.
   The fields of the iomap object should be filled out while holding
   this primitive.
   The upper level synchronization primitive, if any, remains held
   while acquiring the lower level synchronization primitive.
   For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
   while sampling mappings.
   Filesystems with immutable mapping information may not require
   synchronization here.

 * The **operation** primitive is taken by an iomap operation to
   coordinate access to its own internal data structures.
   The upper level synchronization primitive, if any, remains held
   while acquiring this primitive.
   The lower level primitive is not held while acquiring this
   primitive.
   For example, pagecache write operations will obtain a file mapping,
   then grab and lock a folio to copy new contents.
   It may also lock an internal folio state object to update metadata.

The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themself.

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that IO should work the way it does on XFS.
 * Does iomap *actually* work for non-regular file data?

Patches welcome!