.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` or :c:func:`!vma_start_write_killable`
  (all VMA write locks are unlocked automatically when the mmap write lock is
  released). To take a VMA write lock you **must** have already acquired an
  :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be stabilised via
  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
  locks as the reverse mapping locks, or 'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieve is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.
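
As an illustration of the rmap side, below is a minimal sketch, not taken from
the kernel sources and with the helper name invented for illustration, which
stabilises file-backed VMAs via the i_mmap rmap lock in order to visit every VMA
mapping a given page offset range of a file:

.. code-block:: c

   #include <linux/fs.h>
   #include <linux/mm.h>
   #include <linux/printk.h>

   /* Hypothetical helper - visit every VMA mapping [first, last] of a file. */
   static void visit_vmas_mapping_range(struct address_space *mapping,
                                        pgoff_t first, pgoff_t last)
   {
           struct vm_area_struct *vma;

           /* rmap (file) lock, read mode - stabilises each VMA visited. */
           i_mmap_lock_read(mapping);
           vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                   /* The VMA is stable here - its metadata may be read. */
                   pr_info("vma %lx-%lx\n", vma->vm_start, vma->vm_end);
           }
           i_mmap_unlock_read(mapping);
   }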

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`
  (see the sketch after this section), *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to lookup the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock that will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with other
          readers and write locks exclusive against all others holding the semaphore.
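
To make the read-side fall-back logic described above concrete, here is a
minimal sketch - the helper and its policy are invented for illustration and are
not kernel API:

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/mmap_lock.h>

   /*
    * Hypothetical helper: opportunistically take a VMA read lock, falling
    * back to the mmap read lock. On success the caller must drop the lock
    * with vma_end_read() if *vma_locked is true, mmap_read_unlock() if not,
    * and must check that the returned VMA actually covers addr.
    */
   static struct vm_area_struct *get_stable_vma(struct mm_struct *mm,
                                                unsigned long addr,
                                                bool *vma_locked)
   {
           struct vm_area_struct *vma;

           vma = lock_vma_under_rcu(mm, addr);     /* VMA read lock, may fail */
           if (vma) {
                   *vma_locked = true;
                   return vma;
           }

           mmap_read_lock(mm);                     /* fall back to the mmap lock */
           vma = find_vma(mm, addr);
           if (!vma) {
                   mmap_read_unlock(mm);
                   return NULL;
           }
           *vma_locked = false;
           return vma;
   }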

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.
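
Tying the field tables back to the lock usage rules, below is a minimal sketch
of the common write path - the helper is invented for illustration and simply
stands in for whichever writable fields you wish to modify:

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/mmap_lock.h>

   /* Hypothetical helper - write-lock and modify the VMA covering addr. */
   static int update_vma_at(struct mm_struct *mm, unsigned long addr)
   {
           struct vm_area_struct *vma;

           mmap_write_lock(mm);                    /* mmap write lock comes first */
           vma = find_vma(mm, addr);
           if (!vma || vma->vm_start > addr) {
                   mmap_write_unlock(mm);
                   return -EFAULT;
           }

           vma_start_write(vma);                   /* VMA write lock */
           /* ... modify fields writable under mmap + VMA write locks ... */

           mmap_write_unlock(mm);                  /* also drops the VMA write lock */
           return 0;
   }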

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page table levels
          than five the kernel cleverly 'folds' page table levels, that is
          stubbing out functions related to the skipped levels. This allows us
          to conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
   also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Traversing non-VMA page tables
------------------------------

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them, and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
down all that often, this usually suffices; however any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

We also permit a truly unusual case - the traversal of non-VMA ranges in
**userland** address space - as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
even if they are highly unusual (possibly architecture-specific) and are not
backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with an in-progress :c:func:`!munmap`,
and thus an mmap **write** lock is required.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
          but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to acquire
          lock B, while thread 2 holds lock B and tries to acquire lock A.

          Both threads are now deadlocked on each other. However, had they
          attempted to acquire locks in the same order, one would have waited for
          the other to complete its work and no deadlock would have occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

   inode->i_rwsem (while writing or truncating, not reading or faulting)
     mm->mmap_lock
       mapping->invalidate_lock (in filemap_fault)
         folio_lock
           hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
             vma_start_write
               mapping->i_mmap_rwsem
                 anon_vma->rwsem
                   mm->page_table_lock or pte_lock
                     swap_lock (in swap_duplicate, swap_info_get)
                       mmlist_lock (in mmput, drain_mmlist and others)
                     mapping->private_lock (in block_dirty_folio)
                       i_pages lock (widely used)
                         lruvec->lru_lock (in folio_lruvec_lock_irq)
                     inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                     bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                       sb_lock (within inode_lock in fs/fs-writeback.c)
                       i_pages lock (widely used, in set_page_dirty,
                                 in arch-dependent flush_dcache_mmap_lock,
                                 within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

   ->i_mmap_rwsem               (truncate_pagecache)
     ->private_lock             (__free_pte->block_dirty_folio)
       ->swap_lock              (exclusive_swap_page, others)
         ->i_pages lock

   ->i_rwsem
     ->invalidate_lock          (acquired by fs in truncate path)
       ->i_mmap_rwsem           (truncate->unmap_mapping_range)

   ->mmap_lock
     ->i_mmap_rwsem
       ->page_table_lock or pte_lock (various, mainly in memory.c)
         ->i_pages lock         (arch-dependent flush_dcache_mmap_lock)

   ->mmap_lock
     ->invalidate_lock          (filemap_fault)
       ->lock_page              (filemap_fault, access_process_vm)

   ->i_rwsem                    (generic_perform_write)
     ->mmap_lock                (fault_in_readable->do_page_fault)

   bdi->wb.list_lock
     sb_lock                    (fs/fs-writeback.c)
     ->i_pages lock             (__sync_single_inode)

   ->i_mmap_rwsem
     ->anon_vma.lock            (vma_merge)

   ->anon_vma.lock
     ->page_table_lock or pte_lock (anon_vma_prepare and various)

   ->page_table_lock or pte_lock
     ->swap_lock                (try_to_unmap_one)
     ->private_lock             (try_to_unmap_one)
     ->i_pages lock             (try_to_unmap_one)
     ->lruvec->lru_lock         (follow_page_mask->mark_page_accessed)
     ->lruvec->lru_lock         (check_pte_range->folio_isolate_lru)
     ->private_lock             (folio_remove_rmap_pte->set_page_dirty)
     ->i_pages lock             (folio_remove_rmap_pte->set_page_dirty)
     bdi.wb->list_lock          (folio_remove_rmap_pte->set_page_dirty)
       ->inode->i_lock          (folio_remove_rmap_pte->set_page_dirty)
     bdi.wb->list_lock          (zap_pte_range->set_page_dirty)
       ->inode->i_lock          (zap_pte_range->set_page_dirty)
     ->private_lock             (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

.. note:: This section explores page table locking requirements for page tables
          encompassed by a VMA. See the above section on non-VMA page table
          traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write), doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).

* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of the PMD entry still
  meet the requirements. In particular, this also happens in
  :c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.

Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal may be performed in parallel (while holding the VMA stable), and
functionality like GUP-fast locklessly traverses (that is, reads) page tables
without keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.
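
As an illustration of the read helpers, below is a minimal sketch which walks to
the PTE level using the :c:func:`!pXXp_get` accessors, assuming the caller
already holds a lock which keeps the VMA stable - the helper itself is invented
for illustration:

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/pgtable.h>

   /* Hypothetical helper - is addr mapped by a present PTE in mm? */
   static bool pte_present_at(struct mm_struct *mm, unsigned long addr)
   {
           pgd_t *pgd = pgd_offset(mm, addr);
           p4d_t *p4d;
           pud_t *pud;
           pmd_t *pmd;
           pte_t *pte;
           spinlock_t *ptl;
           bool present;

           if (pgd_none_or_clear_bad(pgd))
                   return false;
           p4d = p4d_offset(pgd, addr);
           if (p4d_none_or_clear_bad(p4d))
                   return false;
           pud = pud_offset(p4d, addr);
           if (pud_none_or_clear_bad(pud))
                   return false;
           pmd = pmd_offset(pud, addr);
           if (pmd_none(pmdp_get(pmd)))            /* entry read exactly once */
                   return false;

           /* Map (if in highmem) and lock the PTE-level page table. */
           pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
           if (!pte)
                   return false;                   /* PTE table changed under us */
           present = pte_present(ptep_get(pte));   /* READ_ONCE() internally */
           pte_unmap_unlock(pte, ptl);
           return present;
   }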

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, functions which clear page table entries must be appropriately atomic,
as in the :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
          :c:func:`!pud_lockptr` in turn, however at the time of writing it
          ultimately references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
          :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable
          but for brevity we do not explore this. See the comment for
          :c:func:`!__pte_offset_map_lock` for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty; only if so is the page table lock acquired, and the entry checked
again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.

At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings; however, it's important
that no new VMAs are permitted to overlap these, and that no route remains to
permit access to addresses within the range whose page tables are being torn
down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independently of
          the page tables above them, as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases when the user already holds the mmap read lock, :c:func:`!vma_start_read_locked`
and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
fail due to lock contention, but the caller should still check their return values
in case they fail for other reasons.

VMA read locks increment the :c:member:`!vma.vm_refcnt` reference counter for their
duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, and releasing or downgrading the mmap write lock also releases the VMA write
lock, so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
modified so that readers can detect the presence of a writer. The reference counter is
restored once the vma sequence number used for serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`, however the write lock is released by the termination or
downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.

Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we increment the :c:member:`!vma.vm_refcnt`
reference counter and check that the sequence count of the VMA does not match
that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation the mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, the VMA's sequence number
will indicate the VMA's write-locked state until the mmap write lock is dropped
or downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults, so we invoke :c:func:`!vma_start_write` to prevent
this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.

------------------------
Functions and structures
------------------------

.. kernel-doc:: include/linux/mmap_lock.h