.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that
is, threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks the entire virtual address space at a coarse granularity.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** also hold an mmap write lock.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or
  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
  as 'the rmap locks'.

The first thing **any** of these locks achieve is to **stabilise** the VMA,
that is, to guarantee that it can neither be freed nor modified out from under
you while the lock is held.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock`
  (or a suitable wrapper or variant of this).
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`
  (see the sketch below).
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.
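
As an illustration, a minimal sketch of the fall-back pattern (error handling
and the actual fault logic are omitted; re-looking the VMA up via
:c:func:`!find_vma` under the mmap lock is one possible fall-back, shown for
illustration only):

.. code-block:: c

	struct vm_area_struct *vma;

	/* Optimistic, RCU-based lookup and per-VMA read lock. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		/* ... read VMA metadata ... */
		vma_end_read(vma);
		return;
	}

	/* Lock contended or a write in progress - fall back to the mmap lock. */
	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	if (vma) {
		/* ... read VMA metadata, stabilised by the mmap read lock ... */
	}
	mmap_read_unlock(mm);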

If you want to **write** VMA metadata fields, then things vary depending on the
fields you wish to modify:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock`
  (or a suitable wrapper or variant of this).
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish
  to modify.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire and
release an RCU lock to look up the VMA for you).
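
For example, a sketch of the required lock ordering when modifying a VMA:

.. code-block:: c

	mmap_write_lock(mm);	/* must come first */
	vma_start_write(vma);	/* VMA write lock - there is no separate unlock */

	/* ... modify VMA metadata fields ... */

	/* Releasing the mmap write lock releases all VMA write locks too. */
	mmap_write_unlock(mm);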

.. note:: The major user of VMA read locks is the page fault handler, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

========= ======== ========= ======= ===== =========== ==========
mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
========= ======== ========= ======= ===== =========== ==========
\-        \-       \-        N       N     N           N
\-        R        \-        Y       Y     N           N
\-        \-       R/W       Y       Y     N           N
R/W       \-/R     \-/R/W    Y       Y     N           N
W         W        \-/R      Y       Y     Y           N
W         W        W         Y       Y     Y           Y
========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read
             lock, attempting to do the reverse is invalid as it can result in
             deadlock - if another task already holds an mmap write lock and
             attempts to acquire a VMA write lock, that will deadlock on the VMA
             read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with
          other readers and write locks exclusive against all others holding
          the semaphore.
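
To illustrate, using the generic kernel read/write semaphore API (nothing
VMA-specific here):

.. code-block:: c

	static DECLARE_RWSEM(sem);

	/* Readers may hold the semaphore concurrently with one another. */
	down_read(&sem);
	/* ... read shared state ... */
	up_read(&sem);

	/* A writer excludes all readers and all other writers. */
	down_write(&sem);
	/* ... modify shared state ... */
	up_write(&sem);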

VMA fields
----------

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose,
which makes it easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         mremap()).
   ===================== ======================================== ===========

These fields describe the size and placement of the VMA within the virtual
address space. Since these fields
are used to locate VMAs within the reverse mapping interval trees, they cannot
be modified without also hiding the VMA from the reverse mapping.

.. table:: Core fields

   ============================ ========================================= =========================
   Field                        Description                               Write lock
   ============================ ========================================= =========================
   :c:member:`!vm_mm`           Containing mm_struct.                     None - written once on
                                                                          initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table          mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing  N/A
                                attributes of the VMA, in union with
                                private writable :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags     mmap write, VMA write.
                                field, updated by :c:func:`!vm_flags_*`
                                functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a    None - written once on
                                struct file object describing the         initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either    None - Written once on
                                the driver or file-system provides a      initial map by
                                :c:struct:`!struct vm_operations_struct`  :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for           Handled by driver.
                                driver-specific metadata.
   ============================ ========================================= =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ==============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ==============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object.
   ================================= ===================== ======================================== ==============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= =============================
   Field                               Description                               Write lock
   =================================== ========================================= =============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW'd     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`!NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`!NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`!NULL` and
                                       is set as soon as any page is faulted in. modifying: mmap write, VMA
                                                                                 write, anon_vma write.
   =================================== ========================================= =============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this
VMA should reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:member:`!i_mmap` and anon_vma trees at
          the same time, so all of these fields might be utilised at once.

Page tables
-----------

Broadly speaking, the kernel maps
virtual addresses to physical ones through a series of page tables, each of
which contains entries pointing either to the next level of page table or, at
the leaf level, to the underlying physical pages or special entries such as
swap entries.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page table levels
          than five, the kernel 'folds' page table levels, that is it stubs
          out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          levels.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This requires that the VMA is kept stable via an mmap, VMA
   or rmap lock.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries.

When traversing page tables (whether in high or low level code) we rely on the
locks described in the terminology section above - that is the mmap lock, the
VMA lock and rmap locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
as far as **reading** page tables is concerned (though note that all page table
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).
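
By way of illustration, a simplified traversal down to the PMD level under the
mmap read lock might look like the sketch below (it ignores huge and invalid
entries, and assumes :c:macro:`!addr` lies within a VMA):

.. code-block:: c

	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	mmap_read_lock(mm);	/* keep the VMA stable */

	pgd = pgd_offset(mm, addr);
	if (pgd_none(pgdp_get(pgd)))
		goto out;
	p4d = p4d_offset(pgd, addr);
	if (p4d_none(p4dp_get(p4d)))
		goto out;
	pud = pud_offset(p4d, addr);
	if (pud_none(pudp_get(pud)))
		goto out;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(pmdp_get(pmd)))
		goto out;
	/* PTE level requires pte_offset_map_lock() - see below. */
out:
	mmap_read_unlock(mm);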

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable.

.. warning:: Page tables are normally only traversed in regions covered by VMAs.
             If you want to traverse page tables in areas that might not be
             covered by a VMA, heavier locking is required.

When **freeing** page tables, it must not be possible for VMAs
containing the ranges those page tables map to be accessible via
the reverse mapping. :c:func:`!free_pgtables` removes the relevant VMAs
from the reverse mappings, but no other VMAs can be permitted to be
accessible while this operation proceeds.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
          but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to acquire
          lock B, while thread 2 holds lock B and tries to acquire lock A.

          Both threads are now deadlocked on each other. However, had they
          attempted to acquire locks in the same order, one would have waited
          for the other to complete its work and no deadlock would have
          occurred.

The opening comment of :c:macro:`!mm/rmap.c` documents the correct lock
ordering:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                        i_pages lock (widely used)
                          lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                  in arch-dependent flush_dcache_mmap_lock,
                                  within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                (truncate_pagecache)
    ->private_lock              (__free_pte->block_dirty_folio)
      ->swap_lock               (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock           (acquired by fs in truncate path)
      ->i_mmap_rwsem            (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock (various, mainly in memory.c)
        ->i_pages lock          (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock           (filemap_fault)
      ->lock_page               (filemap_fault, access_process_vm)

  ->i_rwsem                     (generic_perform_write)
    ->mmap_lock                 (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                     (fs/fs-writeback.c)
    ->i_pages lock              (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock             (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                 (try_to_unmap_one)
    ->private_lock              (try_to_unmap_one)
    ->i_pages lock              (try_to_unmap_one)
    ->lruvec->lru_lock          (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock          (check_pte_range->folio_isolate_lru)
    ->private_lock              (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock              (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock           (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock           (zap_pte_range->set_page_dirty)
    ->inode->i_lock             (zap_pte_range->set_page_dirty)
    ->private_lock              (zap_pte_range->block_dirty_folio)

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.
* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks,
  either kept within the folios describing the page tables or allocated
  separately and pointed at by those folios if split page table locks are
  enabled. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, and the following rules apply:

* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Page tables may only be freed with the mmap lock
  held (read or write); doing so with only rmap locks would be dangerous (see
  the page table freeing section below).

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.

So accessing PTE-level page tables requires at least holding an RCU read lock;
but that only suffices for readers that can tolerate racing with concurrent
page table updates such that an empty PTE is observed (in a page table that
has actually already been detached and marked for RCU freeing) while another
new page table has been installed in the same location and filled with
entries. Writers normally need to take the PTE lock and revalidate that the
PMD entry still refers to the same PTE-level page table.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements. These
map the page table into kernel memory if required, take the RCU lock, and
depending on the variant, may also look up or acquire the PTE lock.
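
For example, a sketch of locked PTE access, handling the case in which the
PTE-level page table vanishes from under us (the surrounding fault or zap logic
is omitted):

.. code-block:: c

	pte_t *pte;
	spinlock_t *ptl;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte) {
		/*
		 * The PTE-level page table was freed or replaced under us,
		 * e.g. by a THP collapse - retry or bail out.
		 */
		return;
	}

	/* PTE lock held - the entry may now be safely read and modified. */
	if (pte_none(ptep_get(pte))) {
		/* ... */
	}

	pte_unmap_unlock(pte, ptl);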

Atomicity
~~~~~~~~~

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits, page table traversal operations may run in parallel (though
holding the VMA stable), and
functionality like GUP-fast locklessly traverses (that is reads) page tables,
without any page table locks.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on installation of a page table entry say), special care must always be taken.
In these cases we can never assume that page table locks give us entirely
exclusive access, and must retrieve page table entries once and only once.

If we are only reading page table entries, we need only ensure that the compiler
does not rearrange our loads. This is achieved via the :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.
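
For instance (illustrative only):

.. code-block:: c

	/* Take a single READ_ONCE() snapshot of the PMD entry... */
	pmd_t pmdval = pmdp_get(pmd);

	/*
	 * ...and operate only on that snapshot - re-reading *pmd might
	 * observe a concurrently modified entry.
	 */
	if (pmd_none(pmdval) || !pmd_present(pmdval))
		return;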

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must do so using an appropriately atomic
read/modify/write operation such as :c:func:`!ptep_get_and_clear`. Lockless
traversal such as that performed by
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`) relies upon these guarantees.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, we must ensure that we clear page table entries atomically,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.
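
For instance, a hedged sketch of installing then atomically clearing a PTE
entry (assuming the PTE lock is held as described above, and that
:c:macro:`!entry` has already been constructed):

.. code-block:: c

	pte_t old;

	/* Install a new entry - set_pte_at() wraps the arch-level setter. */
	set_pte_at(mm, addr, pte, entry);

	/* ... */

	/* Atomically clear, retrieving the old entry to e.g. check dirty state. */
	old = ptep_get_and_clear(mm, addr, pte);
	if (pte_dirty(old)) {
		/* ... propagate dirty state to the folio ... */
	}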

Page table installation
~~~~~~~~~~~~~~~~~~~~~~~

When installing page table entries, the VMA must be kept stable by holding an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
          :c:func:`!pud_lockptr` in turn, however at the time of writing this
          ultimately references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
split PMD locks are enabled, a lock embedded in the page table's
:c:struct:`!struct ptdesc`, acquired via :c:func:`!pmd_lock`.

Finally, we must ensure exclusive
access to entries contained within a PTE, especially when we wish to modify
them.
This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE table has not changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and we often do not want to track such allocations beyond this point.

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty and, if so, only then acquire the page table lock, checking
again to see if it was allocated underneath us.
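
A simplified sketch of this pattern as it might apply to PMD allocation (this
is illustrative, not the kernel's exact implementation):

.. code-block:: c

	if (pud_none(pudp_get(pud))) {
		pmd_t *new = pmd_alloc_one(mm, addr);

		if (!new)
			return -ENOMEM;

		spin_lock(&mm->page_table_lock);
		if (pud_none(pudp_get(pud)))	/* recheck under the lock */
			pud_populate(mm, pud, new);
		else				/* allocated underneath us */
			pmd_free(mm, new);
		spin_unlock(&mm->page_table_lock);
	}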

At the PTE level we cannot rely upon this pattern alone, as the PMD and PTE have
separate locks. Helpers therefore acquire the
PTE-specific lock, and then *again* check that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
have been acquired, so we can be sure such a collapse cannot race with the
installation.

Installing entries this way ensures mutual exclusion on write.

Freeing page tables
~~~~~~~~~~~~~~~~~~~

Before the kernel frees the page tables of a range, it must ensure that no
entity can possibly access that range any longer.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, ensuring
that no new ones overlap these or any route remain to permit access to addresses
within the range whose page tables are being torn down. By this point, care has
been taken to ensure that no further page table entries can be installed between
the point at which the VMA is made unreachable and the tear-down of its page
tables.

.. note:: It is possible for leaf page tables to be torn down independent of
          the page tables above them, as is done by
          :c:func:`!retract_page_tables` when handling
          :c:macro:`!MADV_COLLAPSE`.

VMA lock internals
------------------

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

The function :c:func:`!lock_vma_under_rcu` first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
via :c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock; releasing or downgrading the mmap write lock also releases the VMA write
lock.

Note that the semaphore write lock is not held for the duration of the VMA write
lock. Rather, a
sequence number is used for serialisation, and the write semaphore is only
acquired briefly at the point of write-locking in order to update this sequence
number.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
read/write semaphore and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`; however, the write lock is released by the termination
or downgrade of the mmap write lock, so no :c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.

Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked, which
also increments :c:member:`!mm->mm_lock_seq` via
an internal helper.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.
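
Conceptually, and simplifying over the real implementation's exact types and
memory ordering, the write-lock check reduces to a sequence count comparison:

.. code-block:: c

	/* Illustrative only - not the kernel's actual helper. */
	static bool vma_is_write_locked(struct vm_area_struct *vma)
	{
		/* Equal mm and VMA sequence counts indicate a VMA write lock. */
		return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq;
	}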

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we acquire a read lock on the
:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
the sequence count of the VMA does not match that of the mm.

If it does, the read lock fails; if it does not, we hold the lock, excluding
writers but permitting other readers, who will also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
read/write semaphore, before setting the VMA's sequence number under this lock,
also simultaneously holding the mmap write lock.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the semaphore is released, avoiding
complexity with a long-term held write lock.

This clever combination of a read/write semaphore and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere too).

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

Downgrading is exclusive against other writers - you
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released.
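
A sketch of typical usage:

.. code-block:: c

	mmap_write_lock(mm);
	/* ... modify VMAs, write-locking them via vma_start_write() ... */

	mmap_write_downgrade(mm);	/* now a read lock; VMA write locks end */
	/* ... read-only work, with concurrent readers now permitted ... */

	mmap_read_unlock(mm);		/* release the downgraded lock */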

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another as follows:

.. list-table:: Lock exclusivity
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Stack expansion
---------------

When a stack VMA is expanded to accommodate an access beyond it, there are known
to be racing page faults; as a result we invoke :c:func:`!vma_start_write` to
write-lock the VMA before expanding it.