.. SPDX-License-Identifier: GPL-2.0
.. _iomap_operations:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

=========================
Supported File Operations
=========================

.. contents:: Table of Contents
   :local:

Below is a discussion of the high level file operations that iomap
implements.

Buffered I/O
============

Buffered I/O is the default file I/O path in Linux.
File contents are cached in memory ("pagecache") to satisfy reads and
writes.
Dirty cache will be written back to disk at some point that can be
forced via ``fsync`` and variants.

iomap implements nearly all the folio and pagecache management that
filesystems have to implement themselves under the legacy I/O model.
This means that the filesystem need not know the details of allocating,
mapping, managing uptodate and dirty state, or writeback of pagecache
folios.
Under the legacy I/O model, this was managed very inefficiently with
linked lists of buffer heads instead of the per-folio bitmaps that iomap
uses.
Unless the filesystem explicitly opts in to buffer heads, they will not
be used, which makes buffered I/O much more efficient, and the pagecache
maintainer much happier.

``struct address_space_operations``
-----------------------------------

The following iomap functions can be referenced directly from the
address space operations structure:

 * ``iomap_dirty_folio``
 * ``iomap_release_folio``
 * ``iomap_invalidate_folio``
 * ``iomap_is_partially_uptodate``

The following address space operations can be wrapped easily; a wiring
sketch follows the list:

 * ``read_folio``
 * ``readahead``
 * ``writepages``
 * ``bmap``
 * ``swap_activate``

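As a rough sketch of how the pieces fit together, a filesystem converted
to iomap might populate its operations table as follows.
The ``myfs_`` names are hypothetical wrappers that are assumed to call
the corresponding iomap entry points (``iomap_read_folio``,
``iomap_readahead``, ``iomap_writepages``, ``iomap_bmap``, and
``iomap_swapfile_activate``) with that filesystem's iomap ops; the exact
set of members varies by filesystem and kernel version.

.. code-block:: c

 /* Hypothetical wiring of iomap helpers into the aops table. */
 static const struct address_space_operations myfs_aops = {
     .read_folio             = myfs_read_folio,
     .readahead              = myfs_readahead,
     .writepages             = myfs_writepages,
     .dirty_folio            = iomap_dirty_folio,
     .release_folio          = iomap_release_folio,
     .invalidate_folio       = iomap_invalidate_folio,
     .is_partially_uptodate  = iomap_is_partially_uptodate,
     .bmap                   = myfs_bmap,
     .swap_activate          = myfs_swapfile_activate,
 };
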
``struct iomap_write_ops``
--------------------------

.. code-block:: c

 struct iomap_write_ops {
     struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
                                unsigned len);
     void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
                       struct folio *folio);
     bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
     int (*read_folio_range)(const struct iomap_iter *iter,
                             struct folio *folio, loff_t pos, size_t len);
 };

iomap calls these functions:

  - ``get_folio``: Called to allocate and return an active reference to
    a locked folio prior to starting a write.
    If this function is not provided, iomap will call
    ``iomap_get_folio``.
    This could be used to `set up per-folio filesystem state
    <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
    for a write.

  - ``put_folio``: Called to unlock and put a folio after a pagecache
    operation completes.
    If this function is not provided, iomap will ``folio_unlock`` and
    ``folio_put`` on its own.
    This could be used to `commit per-folio filesystem state
    <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
    that was set up by ``->get_folio``.

  - ``iomap_valid``: The filesystem may not hold locks between
    ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
    can take folio locks, fault on userspace pages, initiate writeback
    for memory reclamation, or engage in other time-consuming actions.
    If a file's space mapping data are mutable, it is possible that the
    mapping for a particular pagecache folio can `change in the time it
    takes
    <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
    to allocate, install, and lock that folio.

    For the pagecache, races can happen if writeback doesn't take
    ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
    Races can also happen if the filesystem allows concurrent writes.
    For such files, the mapping *must* be revalidated after the folio
    lock has been taken so that iomap can manage the folio correctly.

    fsdax does not need this revalidation because there's no writeback
    and no support for unwritten extents.

    Filesystems subject to this kind of race must provide a
    ``->iomap_valid`` function to decide if the mapping is still valid.
    If the mapping is not valid, the mapping will be sampled again.

    To support making the validity decision, the filesystem's
    ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
    at the same time that it populates the other iomap fields.
    A simple validation cookie implementation is a sequence counter.
    If the filesystem bumps the sequence counter every time it modifies
    the inode's extent map, it can be placed in the ``struct
    iomap::validity_cookie`` during ``->iomap_begin``.
    If the value in the cookie differs from the value the filesystem
    holds when the mapping is passed back to ``->iomap_valid``, then the
    iomap should be considered stale and the validation failed.
    A sequence counter sketch follows this list.

  - ``read_folio_range``: Called to synchronously read in the range that will
    be written to. If this function is not provided, iomap will default to
    submitting a bio read request.

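The following is a minimal sketch of the sequence counter scheme
described above.
It is not taken from any particular filesystem; ``myfs_inode``,
``MYFS_I()``, and ``extent_seq`` are hypothetical, and only the
``->iomap_valid`` wiring is shown.

.. code-block:: c

 /* Sketch of a sequence-counter validity cookie.  The filesystem is
  * assumed to increment extent_seq every time it changes the inode's
  * extent map.
  */
 static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
                             unsigned flags, struct iomap *iomap,
                             struct iomap *srcmap)
 {
     struct myfs_inode *mi = MYFS_I(inode);

     /* sample the extent map generation along with the mapping */
     iomap->validity_cookie = READ_ONCE(mi->extent_seq);
     /* ... fill in iomap->type, ->addr, ->offset, and ->length ... */
     return 0;
 }

 static bool myfs_iomap_valid(struct inode *inode, const struct iomap *iomap)
 {
     /* stale if the extent map changed after ->iomap_begin sampled it */
     return iomap->validity_cookie == READ_ONCE(MYFS_I(inode)->extent_seq);
 }

 static const struct iomap_write_ops myfs_write_ops = {
     .iomap_valid = myfs_iomap_valid,
 };
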
These ``struct kiocb`` flags are significant for buffered I/O with iomap:

 * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.

 * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.

``struct iomap_read_ops``
-------------------------

.. code-block:: c

 struct iomap_read_ops {
     int (*read_folio_range)(const struct iomap_iter *iter,
                             struct iomap_read_folio_ctx *ctx, size_t len);
     void (*submit_read)(struct iomap_read_folio_ctx *ctx);
 };

iomap calls these functions:

  - ``read_folio_range``: Called to read in the range. This must be provided
    by the caller.
    If this function returns success, ``iomap_finish_folio_read()`` must be
    called after the range has been read in, regardless of whether the read
    itself succeeded or failed.

  - ``submit_read``: Submit any pending read requests. This function is
    optional.

Internal per-Folio State
------------------------

If the fsblock size matches the size of a pagecache folio, it is assumed
that all disk I/O operations will operate on the entire folio.
The uptodate (memory contents are at least as new as what's on disk) and
dirty (memory contents are newer than what's on disk) status of the
folio are all that's needed for this case.

If the fsblock size is less than the size of a pagecache folio, iomap
tracks the per-fsblock uptodate and dirty state itself.
This enables iomap to handle both "bs < ps" `filesystems
<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
and large folios in the pagecache.

iomap internally tracks two state bits per fsblock:

 * ``uptodate``: iomap will try to keep folios fully up to date.
   If there are read(ahead) errors, those fsblocks will not be marked
   uptodate.
   The folio itself will be marked uptodate when all fsblocks within the
   folio are uptodate.

 * ``dirty``: iomap will set the per-block dirty state when programs
   write to the file.
   The folio itself will be marked dirty when any fsblock within the
   folio is dirty.

iomap also tracks the number of read and write disk I/Os that are in
flight.
This structure is much lighter weight than ``struct buffer_head``
because there is only one per folio, and the per-fsblock overhead is two
bits vs. 104 bytes.

Filesystems wishing to turn on large folios in the pagecache should call
``mapping_set_large_folios`` when initializing the incore inode.

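For example, an inode setup helper might enable large folios like this
(``myfs_setup_inode`` is hypothetical):

.. code-block:: c

 /* Hypothetical inode initialization helper. */
 static void myfs_setup_inode(struct inode *inode)
 {
     /* allow the pagecache to use large folios for this file */
     mapping_set_large_folios(inode->i_mapping);
 }
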
Buffered Readahead and Reads
----------------------------

The ``iomap_readahead`` function initiates readahead to the pagecache.
The ``iomap_read_folio`` function reads one folio's worth of data into
the pagecache.
The ``flags`` argument to ``->iomap_begin`` will be set to zero.
The pagecache takes whatever locks it needs before calling the
filesystem.

Both ``iomap_readahead`` and ``iomap_read_folio`` pass in a ``struct
iomap_read_folio_ctx``:

.. code-block:: c

 struct iomap_read_folio_ctx {
    const struct iomap_read_ops *ops;
    struct folio *cur_folio;
    struct readahead_control *rac;
    void *read_ctx;
 };

``iomap_readahead`` must set:

 * ``ops->read_folio_range()`` and ``rac``

``iomap_read_folio`` must set:

 * ``ops->read_folio_range()`` and ``cur_folio``

``ops->submit_read()`` and ``read_ctx`` are optional. ``read_ctx`` is used to
pass in any custom data the caller needs accessible in the ops callbacks for
fulfilling reads.

Buffered Writes
---------------

The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
pagecache.
``IOMAP_WRITE`` or ``IOMAP_WRITE`` | ``IOMAP_NOWAIT`` will be passed as
the ``flags`` argument to ``->iomap_begin``.
Callers commonly take ``i_rwsem`` in either shared or exclusive mode
before calling this function.

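A buffered ``->write_iter`` implementation might look roughly like the
sketch below.
All ``myfs_`` names are hypothetical, and the exact parameter list of
``iomap_file_buffered_write`` has changed between kernel releases; this
sketch assumes it takes the kiocb, the source ``iov_iter``, the
filesystem's ``struct iomap_ops``, the ``struct iomap_write_ops``
described earlier, and a private pointer.
Check include/linux/iomap.h for the prototype in your kernel.

.. code-block:: c

 /* Sketch only: the iomap_file_buffered_write() argument list is an
  * assumption; see the lead-in above.
  */
 static ssize_t myfs_buffered_write_iter(struct kiocb *iocb,
                                         struct iov_iter *from)
 {
     struct inode *inode = file_inode(iocb->ki_filp);
     ssize_t ret;

     inode_lock(inode);
     ret = generic_write_checks(iocb, from);
     if (ret > 0)
         ret = iomap_file_buffered_write(iocb, from, &myfs_iomap_ops,
                                         &myfs_write_ops, NULL);
     inode_unlock(inode);
     if (ret > 0)
         ret = generic_write_sync(iocb, ret);
     return ret;
 }
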
mmap Write Faults
~~~~~~~~~~~~~~~~~

The ``iomap_page_mkwrite`` function handles a write fault to a folio in
the pagecache.
``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
to ``->iomap_begin``.
Callers commonly take the mmap ``invalidate_lock`` in shared or
exclusive mode before calling this function.

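A write fault handler might follow the pattern below.
``myfs_iomap_ops`` is hypothetical, and the exact ``iomap_page_mkwrite``
prototype may differ between kernel releases; treat the call shown here
as an assumption and check the header.

.. code-block:: c

 /* Sketch: handle a write fault by dirtying the folio through iomap.
  * The iomap_page_mkwrite() argument list is an assumption; recent
  * kernels may take additional arguments.
  */
 static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
 {
     struct inode *inode = file_inode(vmf->vma->vm_file);
     vm_fault_t ret;

     sb_start_pagefault(inode->i_sb);
     file_update_time(vmf->vma->vm_file);
     filemap_invalidate_lock_shared(inode->i_mapping);
     ret = iomap_page_mkwrite(vmf, &myfs_iomap_ops);
     filemap_invalidate_unlock_shared(inode->i_mapping);
     sb_end_pagefault(inode->i_sb);
     return ret;
 }
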
Buffered Write Failures
~~~~~~~~~~~~~~~~~~~~~~~

After a short write to the pagecache, the areas not written will not
become marked dirty.
The filesystem must arrange to `cancel
<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
such `reservations
<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
because writeback will not consume the reservation.
The ``iomap_write_delalloc_release`` function can be called from a
``->iomap_end`` function to find all the clean areas of the folios
caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
It takes the ``invalidate_lock``.

The filesystem must supply a function ``punch`` to be called for
each file range in this state.
This function must *only* remove delayed allocation reservations, in
case another thread racing with the current thread writes successfully
to the same region and triggers writeback to flush the dirty data out to
disk.

Zeroing for File Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystems can call ``iomap_zero_range`` to perform zeroing of the
pagecache for non-truncation file operations that are not aligned to
the fsblock size.
``IOMAP_ZERO`` will be passed as the ``flags`` argument to
``->iomap_begin``.
Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
mode before calling this function.

Unsharing Reflinked File Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystems can call ``iomap_file_unshare`` to force a file that shares
storage with another file to preemptively copy the shared data to newly
allocated storage.
``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
to ``->iomap_begin``.
Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
mode before calling this function.

Truncation
----------

Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
pagecache from EOF to the end of the fsblock during a file truncation
operation.
``truncate_setsize`` or ``truncate_pagecache`` will take care of
everything after the EOF block.
``IOMAP_ZERO`` will be passed as the ``flags`` argument to
``->iomap_begin``.
Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
mode before calling this function.

Pagecache Writeback
-------------------

Filesystems can call ``iomap_writepages`` to respond to a request to
write dirty pagecache folios to disk.
The ``mapping`` and ``wbc`` parameters should be passed unchanged.
The ``wpc`` pointer should be allocated by the filesystem and must
be initialized to zero.

The pagecache will lock each folio before trying to schedule it for
writeback.
It does not lock ``i_rwsem`` or ``invalidate_lock``.

The dirty bit will be cleared for all folios run through the
``->writeback_range`` machinery described below even if the writeback fails.
This is to prevent dirty folio clots when storage devices fail; an
``-EIO`` is recorded for userspace to collect via ``fsync``.

The ``ops`` structure must be specified and is as follows:

``struct iomap_writeback_ops``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

 struct iomap_writeback_ops {
    int (*writeback_range)(struct iomap_writepage_ctx *wpc,
        struct folio *folio, u64 pos, unsigned int len, u64 end_pos);
    int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error);
 };

The fields are as follows:

  - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file
    range (in bytes) given by ``pos`` and ``len``.
    iomap calls this function for each dirty fsblock in each dirty folio,
    though it will `reuse mappings
    <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
    for runs of contiguous dirty fsblocks within a folio.
    Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
    function must deal with persisting written data.
    Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
    requires mapping to allocated space.
    Filesystems can skip a potentially expensive mapping lookup if the
    mappings have not changed.
    This revalidation must be open-coded by the filesystem; it is
    unclear if ``iomap::validity_cookie`` can be reused for this
    purpose.

    If this method fails to schedule I/O for any part of a dirty folio, it
    should throw away any reservations that may have been made for the write.
    The folio will be marked clean and an ``-EIO`` recorded in the
    pagecache.
    Filesystems can use this callback to `remove
    <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
    delalloc reservations to avoid having delalloc reservations for
    clean pagecache.
    This function must be supplied by the filesystem.
    If this succeeds, ``iomap_finish_folio_write()`` must be called once
    writeback completes for the range, regardless of whether the writeback
    succeeded or failed.

  - ``writeback_submit``: Submit the previously built writeback context.
    Block-based filesystems should use the ``iomap_ioend_writeback_submit``
    helper; other filesystems can implement their own.
    Filesystems can optionally hook into writeback bio submission.
    This might include pre-write space accounting updates, or installing
    a custom ``->bi_end_io`` function for internal purposes, such as
    deferring the ioend completion to a workqueue to run metadata update
    transactions from process context before submitting the bio.
    This function must be supplied by the filesystem.
    A sketch of a complete ops structure follows this list.

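Putting the two callbacks together, a block-based filesystem that has no
special submission requirements might declare its ops like this;
``myfs_writeback_range`` is a hypothetical implementation of the mapping
callback described above.

.. code-block:: c

 /* myfs_writeback_range is hypothetical; it would map each dirty range
  * and add it to the writeback context.  The submission side reuses the
  * helper recommended above for block-based filesystems.
  */
 static const struct iomap_writeback_ops myfs_writeback_ops = {
     .writeback_range  = myfs_writeback_range,
     .writeback_submit = iomap_ioend_writeback_submit,
 };
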
Pagecache Writeback Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To handle the bookkeeping that must happen after disk I/O for writeback
completes, iomap creates chains of ``struct iomap_ioend`` objects that
wrap the ``bio`` that is used to write pagecache data to disk.
By default, iomap finishes writeback ioends by clearing the writeback
bit on the folios attached to the ``ioend``.
If the write failed, it will also set the error bits on the folios and
the address space.
This can happen in interrupt or process context, depending on the
storage device.
Filesystems that need to update internal bookkeeping (e.g. unwritten
extent conversions) should set their own ``->bi_end_io`` function on the
bios submitted by ``->writeback_submit``.
This function should call ``iomap_finish_ioends`` after finishing its
own work (e.g. unwritten extent conversion).

Some filesystems may wish to `amortize the cost of running metadata
transactions
<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
for post-writeback updates by batching them.
They may also require transactions to run from process context, which
implies punting batches to a workqueue.
iomap ioends contain a ``list_head`` to enable batching.

Given a batch of ioends, iomap has a few helpers to assist with
amortization; a completion worker sketch follows the list:

 * ``iomap_sort_ioends``: Sort all the ioends in the list by file
   offset.

 * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
   a separate list of sorted ioends, merge as many of the ioends from
   the head of the list into the given ioend.
   ioends can only be merged if the file range and storage addresses are
   contiguous; the unwritten and shared status are the same; and the
   write I/O outcome is the same.
   The merged ioends become their own list.

 * ``iomap_finish_ioends``: Finish an ioend that possibly has other
   ioends linked to it.

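The sketch below shows how a completion worker might drain a batch of
ioends using these helpers, loosely modeled on how XFS processes its
per-inode ioend list.
``myfs_inode``, ``end_io_work``, ``completion_list``,
``completion_lock``, and ``myfs_end_ioend`` (which would do the
fs-specific work and then call ``iomap_finish_ioends``) are all
hypothetical.

.. code-block:: c

 /* Sketch of a workqueue function that sorts, merges, and finishes a
  * batch of ioends.  The myfs_* names and fields are hypothetical.
  */
 static void myfs_end_io_work(struct work_struct *work)
 {
     struct myfs_inode *mi = container_of(work, struct myfs_inode,
                                          end_io_work);
     struct iomap_ioend *ioend;
     LIST_HEAD(completed);

     spin_lock_irq(&mi->completion_lock);
     list_splice_init(&mi->completion_list, &completed);
     spin_unlock_irq(&mi->completion_lock);

     iomap_sort_ioends(&completed);
     while ((ioend = list_first_entry_or_null(&completed,
                                              struct iomap_ioend,
                                              io_list))) {
         list_del_init(&ioend->io_list);
         iomap_ioend_try_merge(ioend, &completed);
         myfs_end_ioend(ioend);
     }
 }
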
Direct I/O
==========

In Linux, direct I/O is defined as file I/O that is issued directly to
storage, bypassing the pagecache.
The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
writes for files.

.. code-block:: c

 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                      const struct iomap_ops *ops,
                      const struct iomap_dio_ops *dops,
                      unsigned int dio_flags, void *private,
                      size_t done_before);

The filesystem can provide the ``dops`` parameter if it needs to perform
extra work before or after the I/O is issued to storage.
The ``done_before`` parameter tells iomap how much of the request has
already been transferred.
It is used to continue a request asynchronously when `part of the
request
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
has already been completed synchronously.

The ``done_before`` parameter should be set if writes for the ``iocb``
have been initiated prior to the call.
The direction of the I/O is determined from the ``iocb`` passed in.

The ``dio_flags`` argument can be set to any combination of the
following values:

 * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
   kiocb is not synchronous.

 * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
   or fail with ``-EAGAIN``.
   This can be used by filesystems with complex unaligned I/O
   write paths to provide an optimised fast path for unaligned writes.
   If a pure overwrite can be performed, then serialisation against
   other I/Os to the same filesystem block(s) is unnecessary as there is
   no risk of stale data exposure or data loss.
   If a pure overwrite cannot be performed, then the filesystem can
   perform the serialisation steps needed to provide exclusive access
   to the unaligned I/O range so that it can perform allocation and
   sub-block zeroing safely.
   Filesystems can use this flag to try to reduce locking contention,
   but a lot of `detailed checking
   <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
   is required to do it `correctly
   <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.

 * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
   progress has already been made.
   The caller may deal with the page fault and retry the operation.
   If the caller decides to retry the operation, it should pass the
   accumulated return values of all previous calls as the
   ``done_before`` parameter to the next call.

These ``struct kiocb`` flags are significant for direct I/O with iomap:

 * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.

 * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
   before completing the call.
   In the case of pure overwrites, the I/O may be issued with FUA
   enabled.

 * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
   interrupt.
   Only meaningful for asynchronous I/O, and only if the entire I/O can
   be issued as a single ``struct bio``.

Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
function for the file.
They should not set ``->direct_IO``, which is deprecated.

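For example, a direct read path might look like the following sketch,
which uses the ``iomap_dio_rw`` prototype shown above; ``myfs_iomap_ops``
is hypothetical and no ``dops`` are supplied.

.. code-block:: c

 /* Sketch: O_DIRECT reads via iomap_dio_rw(); dops, dio_flags, private,
  * and done_before are left at their defaults.
  */
 static ssize_t myfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
     struct inode *inode = file_inode(iocb->ki_filp);
     ssize_t ret;

     if (!(iocb->ki_flags & IOCB_DIRECT))
         return generic_file_read_iter(iocb, to);

     if (iocb->ki_flags & IOCB_NOWAIT) {
         if (!inode_trylock_shared(inode))
             return -EAGAIN;
     } else {
         inode_lock_shared(inode);
     }
     ret = iomap_dio_rw(iocb, to, &myfs_iomap_ops, NULL, 0, NULL, 0);
     inode_unlock_shared(inode);
     return ret;
 }

 static int myfs_open(struct inode *inode, struct file *file)
 {
     /* advertise O_DIRECT support instead of setting ->direct_IO */
     file->f_mode |= FMODE_CAN_ODIRECT;
     return generic_file_open(inode, file);
 }
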
If a filesystem wishes to perform its own work before direct I/O
completion, it should call ``__iomap_dio_rw``.
If its return value is not an error pointer or a NULL pointer, the
filesystem should pass the return value to ``iomap_dio_complete`` after
finishing its internal work.

Return Values
-------------

``iomap_dio_rw`` can return one of the following:

 * A non-negative number of bytes transferred.

 * ``-ENOTBLK``: Fall back to buffered I/O.
   iomap itself will return this value if it cannot invalidate the page
   cache before issuing the I/O to storage.
   The ``->iomap_begin`` or ``->iomap_end`` functions may also return
   this value.

 * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
   queued and will be completed separately.

 * Any of the other negative error codes.

Direct Reads
------------

A direct I/O read initiates a read I/O from the storage device to the
caller's buffer.
Dirty parts of the pagecache are flushed to storage before initiating
the read I/O.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
any combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

Direct Writes
-------------

A direct I/O write initiates a write I/O to the storage device from the
caller's buffer.
Dirty parts of the pagecache are flushed to storage before initiating
the write I/O.
The pagecache is invalidated both before and after the write I/O.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
IOMAP_WRITE`` with any combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

 * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
   blocks is not allowed.
   The entire file range must map to a single written or unwritten
   extent.
   The file I/O range must be aligned to the filesystem block size
   if the mapping is unwritten and the filesystem cannot handle zeroing
   the unaligned regions without exposing stale contents.

 * ``IOMAP_ATOMIC``: This write is being issued with torn-write
   protection.
   Torn-write protection may be provided based on HW-offload or by a
   software mechanism provided by the filesystem.

   For HW-offload based support, only a single bio can be created for the
   write, and the write must not be split into multiple I/O requests, i.e.
   the ``REQ_ATOMIC`` flag must be set.
   The file range to write must be aligned to satisfy the requirements
   of both the filesystem and the underlying block device's atomic
   commit capabilities.
   If filesystem metadata updates are required (e.g. unwritten extent
   conversion or copy-on-write), all updates for the entire file range
   must be committed atomically as well.
   Untorn-writes may be longer than a single file block. In all cases,
   the mapping start disk block must have at least the same alignment as
   the write offset.
   The filesystem must set ``IOMAP_F_ATOMIC_BIO`` to inform the iomap
   core of an untorn-write based on HW-offload.

   For untorn-writes based on a software mechanism provided by the
   filesystem, the disk block alignment and single bio restrictions that
   apply to HW-offload based untorn-writes do not apply.
   The mechanism would typically be used as a fallback for when
   HW-offload based untorn-writes may not be issued, e.g. the range of the
   write covers multiple extents, meaning that it is not possible to issue
   a single bio.
   All filesystem metadata updates for the entire file range must be
   committed atomically as well.

Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
calling this function.

``struct iomap_dio_ops``
------------------------

.. code-block:: c

 struct iomap_dio_ops {
     void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
                       loff_t file_offset);
     int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
                   unsigned flags);
     struct bio_set *bio_set;
 };

The fields of this structure are as follows:

  - ``submit_io``: iomap calls this function when it has constructed a
    ``struct bio`` object for the I/O requested, and wishes to submit it
    to the block device.
    If no function is provided, ``submit_bio`` will be called directly.
    Filesystems that would like to perform additional work before
    submission (e.g. data replication for btrfs) should implement this
    function.

  - ``end_io``: This is called after the ``struct bio`` completes.
    This function should perform post-write conversions of unwritten
    extent mappings, handle write failures, etc.
    A sketch of such a function follows this list.
    The ``flags`` argument may be set to a combination of the following:

    * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
      should mark the extent as written.

    * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
      copy on write operation, so the ioend should switch mappings.

  - ``bio_set``: This allows the filesystem to provide a custom bio_set
    for allocating direct I/O bios.
    This enables filesystems to `stash additional per-bio information
    <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
    for private use.
    If this field is NULL, generic ``struct bio`` objects will be used.

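As an example of the ``->end_io`` hook, a filesystem that needs to
convert unwritten extents after a successful write might use a sketch
like this; ``myfs_convert_unwritten`` is hypothetical.

.. code-block:: c

 /* Sketch: post-write conversion from a direct I/O completion.
  * myfs_convert_unwritten() is a hypothetical helper that marks the
  * written range as no longer unwritten.  iocb->ki_pos is still the
  * start of the write when ->end_io runs; it is advanced afterwards.
  */
 static int myfs_dio_write_end_io(struct kiocb *iocb, ssize_t size,
                                  int error, unsigned flags)
 {
     struct inode *inode = file_inode(iocb->ki_filp);

     if (error)
         return error;
     if (flags & IOMAP_DIO_UNWRITTEN)
         return myfs_convert_unwritten(inode, iocb->ki_pos, size);
     return 0;
 }

 static const struct iomap_dio_ops myfs_dio_write_ops = {
     .end_io = myfs_dio_write_end_io,
 };
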
Filesystems that want to perform extra work after an I/O completion
should set a custom ``->bi_end_io`` function via ``->submit_io``.
Afterwards, the custom endio function must call
``iomap_dio_bio_end_io`` to finish the direct I/O.

DAX I/O
=======

Some storage devices can be directly mapped as memory.
These devices support a new access mode known as "fsdax" that allows
loads and stores through the CPU and memory controller.

fsdax Reads
-----------

A fsdax read performs a memcpy from storage device to the caller's
buffer.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

fsdax Writes
------------

A fsdax write initiates a memcpy to the storage device from the caller's
buffer.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
IOMAP_WRITE`` with any combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

 * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
   performed from this mapping.
   This requires the filesystem extent mapping to already exist as an
   ``IOMAP_MAPPED`` type and span the entire range of the write I/O
   request.
   If the filesystem cannot map this request in a way that allows the
   iomap infrastructure to perform a pure overwrite, it must fail the
   mapping operation with ``-EAGAIN``.

Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
function.

fsdax mmap Faults
~~~~~~~~~~~~~~~~~

The ``dax_iomap_fault`` function handles read and write faults to fsdax
storage.
For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
``flags`` argument to ``->iomap_begin``.
For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
passed as the ``flags`` argument to ``->iomap_begin``.

Callers commonly hold the same locks as they do to call their iomap
pagecache counterparts.

fsdax Truncation, fallocate, and Unsharing
------------------------------------------

For fsdax files, the following functions are provided to replace their
iomap pagecache I/O counterparts.
The ``flags`` argument to ``->iomap_begin`` is the same as for the
pagecache counterparts, with ``IOMAP_DAX`` added.

 * ``dax_file_unshare``
 * ``dax_zero_range``
 * ``dax_truncate_page``

Callers commonly hold the same locks as they do to call their iomap
pagecache counterparts.

fsdax Deduplication
-------------------

Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
``dax_remap_file_range_prep`` function with their own iomap read ops.

Seeking Files
=============

iomap implements the two iterating whence modes of the ``llseek`` system
call.

SEEK_DATA
---------

The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
for llseek.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.

For unwritten mappings, the pagecache will be searched.
Regions of the pagecache with a folio mapped and uptodate fsblocks
within those folios will be reported as data areas.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

SEEK_HOLE
---------

The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
for llseek.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.

For unwritten mappings, the pagecache will be searched.
Regions of the pagecache with no folio mapped, or a !uptodate fsblock
within a folio will be reported as sparse hole areas.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

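Both helpers are usually called from one ``->llseek`` method, as in this
sketch; ``myfs_iomap_ops`` is hypothetical.

.. code-block:: c

 /* Sketch: route SEEK_DATA and SEEK_HOLE through iomap and everything
  * else through the generic helper.
  */
 static loff_t myfs_llseek(struct file *file, loff_t offset, int whence)
 {
     struct inode *inode = file_inode(file);

     switch (whence) {
     case SEEK_DATA:
         return iomap_seek_data(inode, offset, &myfs_iomap_ops);
     case SEEK_HOLE:
         return iomap_seek_hole(inode, offset, &myfs_iomap_ops);
     default:
         return generic_file_llseek(file, offset, whence);
     }
 }
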
Swap File Activation
====================

The ``iomap_swapfile_activate`` function finds all the base-page aligned
regions in a file and sets them up as swap space.
The file will be ``fsync()``'d before activation.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.
All mappings must be mapped or unwritten; they cannot be dirty or
shared, and cannot span multiple block devices.
Callers must hold ``i_rwsem`` in exclusive mode; this is already
provided by ``swapon``.

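A typical ``->swap_activate`` implementation is a thin wrapper, as in
this sketch; ``myfs_iomap_ops`` is hypothetical.

.. code-block:: c

 /* Sketch: hand the swapfile straight to iomap.  The function matches
  * the ->swap_activate prototype in struct address_space_operations.
  */
 static int myfs_swap_activate(struct swap_info_struct *sis,
                               struct file *swap_file, sector_t *span)
 {
     return iomap_swapfile_activate(sis, swap_file, span, &myfs_iomap_ops);
 }
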
File Space Mapping Reporting
============================

iomap implements two of the file space mapping system calls.

FS_IOC_FIEMAP
-------------

The ``iomap_fiemap`` function exports file extent mappings to userspace
in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.
Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

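A ``->fiemap`` inode operation that defers to iomap might look like this
sketch; ``myfs_iomap_ops`` is hypothetical.

.. code-block:: c

 /* Sketch: report extents through iomap while holding i_rwsem shared. */
 static int myfs_fiemap(struct inode *inode,
                        struct fiemap_extent_info *fieinfo,
                        u64 start, u64 len)
 {
     int error;

     inode_lock_shared(inode);
     error = iomap_fiemap(inode, fieinfo, start, len, &myfs_iomap_ops);
     inode_unlock_shared(inode);
     return error;
 }
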
FIBMAP (deprecated)
-------------------

``iomap_bmap`` implements FIBMAP.
The calling conventions are the same as for FIEMAP.
This function is only provided to maintain compatibility for filesystems
that implemented FIBMAP prior to conversion.
This ioctl is deprecated; do **not** add a FIBMAP implementation to
filesystems that do not have it.
Callers should probably hold ``i_rwsem`` in shared mode before calling
this function, but this is unclear.