.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling common file operations.
The library has two layers:

 1. A lower layer that provides an iterator over ranges of file offsets.
    This layer tries to obtain mappings of each file range to storage
    from the filesystem, but the storage information is not necessarily
    required.

 2. An upper layer that acts upon the space mappings provided by the
    lower layer iterator.

The iteration can involve mappings of a file's logical offset ranges to
physical extents, but the storage layer information is not necessarily
required, e.g. for walking cached file information.
The library exports various APIs for implementing file operations such
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origin of this library is the file I/O path that XFS once used; it
has now been extended to cover several other operations.

Who Should Read This?
=====================

The target audience for this document is filesystem, storage, and
pagecache programmers and code reviewers.

If you are working on PCI, machine architectures, or device drivers, you
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model which breaks file I/O into small
units (generally memory pages or blocks) and looks up space mappings on
the basis of that unit, the iomap model asks the filesystem for the
largest space mappings that it can create for a given file operation and
initiates operations on that basis.
This strategy improves the filesystem's visibility into the size of the
operation being performed, which enables it to combat fragmentation with
larger space allocations when possible.
Larger space mappings improve runtime performance by amortizing the cost
of mapping function calls into the filesystem across a larger amount of
data.

At a high level, an iomap operation `looks like this
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to (1) above, if necessary.
         So far only the pagecache operations need to do this.

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``, if necessary

Each iomap operation will be covered in more detail below.
This library was covered previously by an `LWN article
<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
<https://kernelnewbies.org/KernelProjects/iomap>`_.

The goal of this document is to provide a brief discussion of the
design and capabilities of iomap, followed by a more detailed catalog
of the interfaces presented by iomap.
If you change iomap, please update this design document.

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the old buffer cache.

 * ``fsblock``: The block size of a file, also known as ``i_blocksize``.

 * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
   Processes hold this in shared mode to read file state and contents.
   Some filesystems may allow shared mode for writes.
   Processes often hold this in exclusive mode to change file state and
   contents.

 * ``invalidate_lock``: The pagecache ``struct address_space``
   rwsemaphore that protects against folio insertion and removal for
   filesystems that support punching out folios below EOF.
   Processes wishing to insert folios must hold this lock in shared
   mode to prevent removal, though concurrent insertion is allowed.
   Processes wishing to remove folios must hold this lock in exclusive
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
   device pre-shutdown hook from returning before other threads have
   released resources.

 * **filesystem mapping lock**: This synchronization primitive is
   internal to the filesystem and must protect the file mapping data
   from updates while a mapping is being sampled.
   The filesystem author must determine how this coordination should
   happen; it does not need to be an actual lock.

 * **iomap internal operation lock**: This is a general term for
   synchronization primitives that iomap functions take while holding a
   mapping.
   A specific example would be taking the folio lock while reading or
   writing the pagecache.

 * **pure overwrite**: A write operation that does not require any
   metadata or zeroing operations to perform during either submission
   or completion.
   This implies that the filesystem must have already allocated space
   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
   constraints on I/O alignment or size.
   The only constraints on I/O alignment are device level (minimum I/O
   size and alignment, typically sector size).
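
As a rough illustration of the **pure overwrite** definition above, the
check can be modeled as a small userspace C predicate.
This is a sketch under stated assumptions, not kernel code.
The ``model_`` names, the ``MODEL_`` constants and their values, and the
choice of which flags disqualify a mapping are hypothetical and exist
only for this example.
Only the mapping type and flag concepts come from this document.

.. code-block:: c

 #include <assert.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>

 /* Illustrative mapping types; the values are arbitrary in this model. */
 enum model_type {
     MODEL_HOLE,
     MODEL_DELALLOC,
     MODEL_MAPPED,
     MODEL_UNWRITTEN,
     MODEL_INLINE,
 };

 /* Illustrative flag bits; the values are arbitrary in this model. */
 #define MODEL_F_NEW    (1u << 0) /* newly allocated, may need zeroing */
 #define MODEL_F_DIRTY  (1u << 1) /* uncommitted metadata is pending */
 #define MODEL_F_SHARED (1u << 2) /* copy on write would be needed */

 struct model_mapping {
     uint64_t offset; /* file offset covered by the mapping, in bytes */
     uint64_t length; /* length of the mapping, in bytes */
     uint16_t type;
     uint16_t flags;
 };

 /*
  * Return true if writing [pos, pos + len) through mapping m looks like
  * a pure overwrite in this model: space is already mapped, no zeroing,
  * metadata update, or copy on write is implied by the flags, and the
  * only alignment rule applied is the device-level one (dev_align).
  */
 bool model_is_pure_overwrite(const struct model_mapping *m,
                              uint64_t pos, uint64_t len,
                              uint32_t dev_align)
 {
     assert(m != NULL);

     if (len == 0 || dev_align == 0)
         return false;
     if (m->type != MODEL_MAPPED)
         return false;
     if (m->flags & (MODEL_F_NEW | MODEL_F_DIRTY | MODEL_F_SHARED))
         return false;
     /* The write must fall entirely inside the mapping. */
     if (pos < m->offset || pos - m->offset >= m->length)
         return false;
     if (len > m->length - (pos - m->offset))
         return false;
     /* Device-level alignment of both the start and the length. */
     if (pos % dev_align != 0 || len % dev_align != 0)
         return false;
     return true;
 }

For example, with a mapped 8192-byte range at offset 0 and a 512-byte
device alignment, a 4096-byte write at offset 4096 passes this check,
while the same write through an unwritten or shared mapping does not.
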

``struct iomap``
----------------

The filesystem communicates to the iomap iterator the mapping of
byte ranges of a file to byte ranges of a storage device with the
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the range of file offsets, in
   bytes, covered by this mapping.
   These fields must always be set by the filesystem.

 * ``type`` describes the type of the space mapping:

   * **IOMAP_HOLE**: No storage has been allocated.
     This type must never be returned in response to an ``IOMAP_WRITE``
     operation because writes must allocate and map space, and return
     the mapping.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
     iomap does not support writing (whether via pagecache or direct
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
     ("delayed allocation").
     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails,
     the ``->iomap_end`` function must delete the reservation.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

   * **IOMAP_MAPPED**: The file range maps to specific space on the
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.

   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
     storage device, but the space has not yet been initialized.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.
     Reads from this type of mapping will return zeroes to the caller.
     For a write or writeback operation, the ioend should update the
     mapping to ``IOMAP_MAPPED``.
     Refer to the sections about ioends for more details.

   * **IOMAP_INLINE**: The file range maps to the memory buffer
     specified by ``inline_data``.
     For write operations, the ``->iomap_end`` function presumably
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

 * ``flags`` describe the status of the space mapping.
   These flags should be set by the filesystem in ``->iomap_begin``:

   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
     Areas that will not be written to must be zeroed.
     If a write fails and the mapping is a space reservation, the
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
     to access any data written.
     fdatasync is required to commit these changes to persistent
     storage.
     This needs to take into account metadata changes that *may* be made
     at I/O completion, such as file size updates from direct I/O.

   * **IOMAP_F_SHARED**: The space under the mapping is shared.
     Copy on write is necessary to avoid corrupting other file data.

   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
     be set by the filesystem for its own purposes.

   These flags can be set by iomap itself during file operations.
   The filesystem should supply an ``->iomap_end`` function if it needs
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found to be stale.
     iomap will call ``->iomap_end`` on this mapping and then
     ``->iomap_begin`` to obtain a new mapping.

   Currently, these flags are only set by pagecache operations.

 * ``addr`` describes the device address, in bytes.

 * ``bdev`` describes the block device for this mapping.
   This only needs to be set for mapped or unwritten operations.

 * ``dax_dev`` describes the DAX device for this mapping.
   This only needs to be set for mapped or unwritten operations, and
   only for a fsdax operation.

 * ``inline_data`` points to a memory buffer for I/O involving
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping types.

 * ``private`` is a pointer to `filesystem-private information
   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
   This value will be passed unchanged to ``->iomap_end``.

 * ``folio_ops`` will be covered in the section on pagecache operations.

 * ``validity_cookie`` is a magic freshness value set by the filesystem
   that should be used to detect stale mappings.
   For pagecache operations this is critical for correct operation
   because page faults can occur, which implies that filesystem locks
   should not be held between ``->iomap_begin`` and ``->iomap_end``.
   Filesystems with completely static mappings need not set this value.
   Only pagecache operations revalidate mappings; see the section about
   ``iomap_valid`` for details.

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem to pass an operations
structure to obtain a mapping and (optionally) to release the mapping:

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obtain one file mapping for
the range of bytes specified by ``pos`` and ``length`` for the file
``inode``.
This mapping should be returned through the ``iomap`` pointer.
The mapping must cover at least the first byte of the supplied file
range, but it does not need to cover the entire requested range.

Each iomap operation describes the requested operation through the
``flags`` argument.
The exact value of ``flags`` will be documented in the
operation-specific sections below.
These flags can, at least in principle, apply generally to iomap
operations:

 * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
   effort attempt to avoid any operation that would result in blocking
   the submitting task.
   This is similar in intent to ``O_NONBLOCK`` for network APIs: it is
   intended for asynchronous applications to keep doing other work
   instead of waiting for the specific unavailable filesystem resource
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
   trylock algorithms.
   They need to be able to satisfy the entire I/O request range with a
   single iomap mapping.
   They need to avoid reading or writing metadata synchronously.
   They need to avoid blocking memory allocations.
   They need to avoid waiting on transaction reservations to allow
   modifications to take place.
   They probably should not be allocating new space.
   And so on.
   If there is any doubt in the filesystem developer's mind as to
   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
   then they should return ``-EAGAIN`` as early as possible rather than
   start the operation and force the submitting task to block.
   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
   ``RWF_NOWAIT``.

If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
information via ``srcmap``.
Only pagecache and fsdax operations support reading from one mapping and
writing to another.

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end`` function, if present,
is called to signal that iomap is finished with a mapping.
Typically, implementations will use this function to tear down any
context that was set up in ``->iomap_begin``.
For example, a write might wish to commit the reservations for the bytes
that were operated upon and unreserve any space that was not operated
upon.
``written`` might be zero if no bytes were touched.
``flags`` will contain the same value passed to ``->iomap_begin``.
iomap ops for reads are not likely to need to supply this function.

Both functions should return a negative errno code on error, or zero on
success.

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to check input parameters
and file state before initiating an I/O operation.
iomap does not handle obtaining filesystem freeze protection, updating
of timestamps, stripping privileges, or access control.
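
To make the ``->iomap_begin`` and ``->iomap_end`` contract described
earlier more concrete, the following userspace C model walks a byte
range one mapping at a time.
It is a sketch, not the kernel implementation.
Every ``model_`` name is hypothetical, the fixed-size extent layout is
an assumption chosen to keep the example short, and the "work" step is
left as a comment.
What it does show is that a mapping only has to cover the first byte of
the request, so the caller must loop and advance its cursor by the
amount actually processed.

.. code-block:: c

 #include <assert.h>
 #include <errno.h>
 #include <stddef.h>
 #include <stdint.h>

 struct model_mapping {
     uint64_t offset; /* file offset of the mapping, in bytes */
     uint64_t length; /* length of the mapping, in bytes */
 };

 /* Hypothetical stand-in for the begin/end pair of callbacks. */
 struct model_ops {
     int (*begin)(void *fs, uint64_t pos, uint64_t length,
                  struct model_mapping *map);
     int (*end)(void *fs, uint64_t pos, uint64_t length,
                int64_t written, struct model_mapping *map);
 };

 /* Hypothetical filesystem: the file is a run of fixed-size extents. */
 struct model_fs {
     uint64_t extent_size;
     unsigned int begin_calls;
     unsigned int end_calls;
 };

 int model_begin(void *fs, uint64_t pos, uint64_t length,
                 struct model_mapping *map)
 {
     struct model_fs *mfs = fs;

     (void)length; /* this model never maps more than one extent */
     if (mfs->extent_size == 0)
         return -EINVAL;
     /* Return the extent containing pos, which may end early. */
     map->offset = pos - (pos % mfs->extent_size);
     map->length = mfs->extent_size;
     mfs->begin_calls++;
     return 0;
 }

 int model_end(void *fs, uint64_t pos, uint64_t length,
               int64_t written, struct model_mapping *map)
 {
     struct model_fs *mfs = fs;

     (void)pos; (void)length; (void)written; (void)map;
     mfs->end_calls++; /* a real filesystem would tear down context */
     return 0;
 }

 /* Returns the number of bytes processed, or a negative errno. */
 int64_t model_apply(const struct model_ops *ops, void *fs,
                     uint64_t pos, uint64_t length)
 {
     uint64_t done = 0;

     assert(ops != NULL && ops->begin != NULL);
     while (done < length) {
         struct model_mapping map = { 0, 0 };
         uint64_t cur = pos + done;
         uint64_t remaining = length - done;
         uint64_t avail, chunk;
         int ret;

         ret = ops->begin(fs, cur, remaining, &map);
         if (ret)
             return ret;
         /* The mapping must cover at least the first byte. */
         if (map.offset > cur || cur - map.offset >= map.length)
             return -EIO;
         avail = map.length - (cur - map.offset);
         chunk = remaining < avail ? remaining : avail;
         /* The actual work on [cur, cur + chunk) would go here. */
         done += chunk;
         if (ops->end) {
             ret = ops->end(fs, cur, remaining, (int64_t)chunk, &map);
             if (ret)
                 return ret;
         }
     }
     return (int64_t)done;
 }

With 4096-byte extents, a request for 10000 bytes at offset 1000 is
split into three mappings in this model, so ``begin`` and ``end`` are
each called three times.
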

Locking Hierarchy
=================

iomap requires that filesystems supply their own locking model.
There are three categories of synchronization primitives, as far as
iomap is concerned:

 * The **upper** level primitive is provided by the filesystem to
   coordinate access to different iomap operations.
   The exact primitive is specific to the filesystem and operation,
   but is often a VFS inode, pagecache invalidation, or folio lock.
   For example, a filesystem might take ``i_rwsem`` before calling
   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
   these two file operations from clobbering each other.
   Pagecache writeback may lock a folio to prevent other threads from
   accessing the folio until writeback is underway.

 * The **lower** level primitive is taken by the filesystem in the
   ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
   access to the file space mapping information.
   The fields of the iomap object should be filled out while holding
   this primitive.
   The upper level synchronization primitive, if any, remains held
   while acquiring the lower level synchronization primitive.
   For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
   while sampling mappings.
   Filesystems with immutable mapping information may not require
   synchronization here.

 * The **operation** primitive is taken by an iomap operation to
   coordinate access to its own internal data structures.
   The upper level synchronization primitive, if any, remains held
   while acquiring this primitive.
   The lower level primitive is not held while acquiring this
   primitive.
   For example, pagecache write operations will obtain a file mapping,
   then grab and lock a folio to copy new contents.
   It may also lock an internal folio state object to update metadata.
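
The ordering rules in the list above can be expressed as a small
userspace C checker.
This is only a sketch: the ``model_`` names are hypothetical, no real
locks are taken, and reading "remains held while acquiring" as "the
upper level primitive is taken first" is an interpretation made for
this example.
The checker rejects taking the operation primitive while the lower
level primitive is held, and rejects taking the upper level primitive
after either of the others.

.. code-block:: c

 #include <assert.h>
 #include <stdbool.h>
 #include <stddef.h>

 enum model_level {
     MODEL_UPPER,     /* e.g. an inode-wide lock taken by the caller */
     MODEL_LOWER,     /* the filesystem mapping lock */
     MODEL_OPERATION, /* e.g. a folio lock taken by the operation */
     MODEL_NR_LEVELS,
 };

 struct model_lock_state {
     bool held[MODEL_NR_LEVELS];
 };

 /*
  * Record an acquisition if it is consistent with the ordering rules.
  * Returns false, and changes nothing, if the rules would be violated.
  */
 bool model_acquire(struct model_lock_state *s, enum model_level level)
 {
     assert(s != NULL && level < MODEL_NR_LEVELS);

     if (s->held[level])
         return false; /* no recursive acquisition in this model */

     switch (level) {
     case MODEL_UPPER:
         /* Assumed: the upper primitive is always taken first. */
         if (s->held[MODEL_LOWER] || s->held[MODEL_OPERATION])
             return false;
         break;
     case MODEL_OPERATION:
         /* The lower primitive must not be held at this point. */
         if (s->held[MODEL_LOWER])
             return false;
         break;
     case MODEL_LOWER:
     default:
         /* The text places no further constraint here. */
         break;
     }
     s->held[level] = true;
     return true;
 }

 void model_release(struct model_lock_state *s, enum model_level level)
 {
     assert(s != NULL && level < MODEL_NR_LEVELS);
     assert(s->held[level]);
     s->held[level] = false;
 }

A pagecache write in this model would take the upper primitive, take
and drop the lower primitive while sampling the mapping, and only then
take the operation primitive.
The upper primitive is optional here, matching the "if any" wording in
the list.
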

The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themself.

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that I/O should work the way it does on XFS.
 * Does iomap *actually* work for non-regular file data?

Patches welcome!