.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity and can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieves is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so it might fail, in which case fall-back logic
  (as sketched below) is required to instead obtain an mmap read lock if this
  returns :c:macro:`!NULL`, *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

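For example, the optimistic fall-back pattern used by page fault handling looks
roughly like the below sketch (hypothetical code, not a verbatim kernel excerpt
- :c:member:`!mm` and :c:member:`!addr` are assumed to be the relevant
:c:struct:`!struct mm_struct` and a userland address respectively):

.. code-block:: c

  struct vm_area_struct *vma;

  /* Optimistically try to obtain a VMA read lock under RCU. */
  vma = lock_vma_under_rcu(mm, addr);
  if (vma) {
          /* ... read VMA metadata fields ... */
          vma_end_read(vma);
  } else {
          /* Contended, or a writer raced us - fall back to the mmap lock. */
          mmap_read_lock(mm);
          vma = find_vma(mm, addr);
          if (vma) {
                  /* ... read VMA metadata fields ... */
          }
          mmap_read_unlock(mm);
  }
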
If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

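As a sketch (again assuming :c:member:`!mm` and :c:member:`!addr` as above),
writing to most VMA metadata fields might look like:

.. code-block:: c

  struct vm_area_struct *vma;

  mmap_write_lock(mm);
  vma = find_vma(mm, addr);
  if (vma) {
          /* Take the VMA write lock - required before modifying fields. */
          vma_start_write(vma);
          /* ... modify VMA metadata fields, e.g. via vm_flags_set() ... */
  }
  /* Releases every VMA write lock taken since mmap_write_lock(). */
  mmap_write_unlock(mm);

Note that helpers such as :c:func:`!vm_flags_set` invoke
:c:func:`!vma_start_write` themselves, so flag updates via the
:c:func:`!vm_flags_*` functions take the VMA write lock automatically.
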
VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run
          concurrently with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock, that will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However, a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with other
          readers and write locks exclusive against all others holding the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page table levels
          than five, the kernel cleverly 'folds' page table levels, that is,
          stubbing out functions related to the skipped levels. This allows us
          to conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

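To illustrate the five levels, a stripped-down walk from the PGD down to the
PMD entry mapping a given address might look like the below sketch (a
hypothetical helper - huge page handling and locking are omitted, and the
:c:func:`!pXXp_get` accessors discussed in the atomicity section below are
used for the reads):

.. code-block:: c

  /* Sketch: return the PMD entry mapping addr, or NULL if none present. */
  static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr)
  {
          pgd_t *pgd = pgd_offset(mm, addr);
          p4d_t *p4d;
          pud_t *pud;
          pmd_t *pmd;

          if (pgd_none(pgdp_get(pgd)) || pgd_bad(pgdp_get(pgd)))
                  return NULL;
          p4d = p4d_offset(pgd, addr);
          if (p4d_none(p4dp_get(p4d)) || p4d_bad(p4dp_get(p4d)))
                  return NULL;
          pud = pud_offset(p4d, addr);
          if (pud_none(pudp_get(pud)) || pud_bad(pudp_get(pud)))
                  return NULL;
          pmd = pmd_offset(pud, addr);
          if (pmd_none(pmdp_get(pmd)))
                  return NULL;
          /* The PTE level requires additional care - see below. */
          return pmd;
  }
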
There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`).
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

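As an example of zapping, a filesystem truncating or hole-punching a file will
typically zap all user mappings of the affected range via
:c:func:`!unmap_mapping_range` (a sketch - :c:member:`!inode`,
:c:member:`!holebegin` and :c:member:`!holelen` are assumed to describe the
affected, page-aligned range of the file in bytes):

.. code-block:: c

  /*
   * Clear leaf entries mapping the range, leaving the page tables
   * themselves in place. Passing even_cows == 1 additionally zaps
   * private CoW'd copies of the affected file pages.
   */
  unmap_mapping_range(inode->i_mapping, holebegin, holelen, 1);
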
When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

.. warning:: Page tables are normally only traversed in regions covered by VMAs.
             If you want to traverse page tables in areas that might not be
             covered by VMAs, heavier locking is required.
             See :c:func:`!walk_page_range_novma` for details.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
   but in doing so inadvertently cause a mutual deadlock.

   For example, consider thread 1 which holds lock A and tries to acquire lock B,
   while thread 2 holds lock B and tries to acquire lock A.

   Both threads are now deadlocked on each other. However, had they attempted to
   acquire locks in the same order, one would have waited for the other to
   complete its work and no deadlock would have occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                          i_pages lock (widely used)
                            lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                  in arch-dependent flush_dcache_mmap_lock,
                                  within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                        (truncate_pagecache)
    ->private_lock                      (__free_pte->block_dirty_folio)
      ->swap_lock                       (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock                   (acquired by fs in truncate path)
      ->i_mmap_rwsem                    (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock     (various, mainly in memory.c)
        ->i_pages lock                  (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock                   (filemap_fault)
      ->lock_page                       (filemap_fault, access_process_vm)

  ->i_rwsem                             (generic_perform_write)
    ->mmap_lock                         (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                             (fs/fs-writeback.c)
    ->i_pages lock                      (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock                     (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock       (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                         (try_to_unmap_one)
    ->private_lock                      (try_to_unmap_one)
    ->i_pages lock                      (try_to_unmap_one)
    ->lruvec->lru_lock                  (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock                  (check_pte_range->folio_isolate_lru)
    ->private_lock                      (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock                      (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock                   (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock                     (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock                   (zap_pte_range->set_page_dirty)
    ->inode->i_lock                     (zap_pte_range->set_page_dirty)
    ->private_lock                      (zap_pte_range->block_dirty_folio)

Please check the current state of these comments, which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD, each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write); doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.

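A sketch of typical usage, given a stable VMA and :c:member:`!pmd` pointing at
the PMD entry mapping :c:member:`!addr` (a hypothetical helper - real callers
must decide how to rewalk and retry):

.. code-block:: c

  /* Sketch: operate on the PTE mapping addr; caller rewalks on -EAGAIN. */
  static int operate_on_pte(struct mm_struct *mm, pmd_t *pmd,
                            unsigned long addr)
  {
          spinlock_t *ptl;
          pte_t *pte;

          pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
          if (!pte)
                  return -EAGAIN; /* PTE table vanished - rewalk to retry. */

          /* PTE lock held - the entry is stable for read and write. */
          if (pte_none(ptep_get(pte))) {
                  /* ... e.g. decide whether to install an entry ... */
          }

          pte_unmap_unlock(pte, ptl);
          return 0;
  }
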
Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations may run in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

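For example, when zapping an entry whilst holding the PTE lock, the entry must
be read and cleared in a single atomic operation, as the hardware may set the
accessed/dirty bits at any point (a sketch - propagation of the dirty state is
only hinted at):

.. code-block:: c

  pte_t pteval = ptep_get_and_clear(mm, addr, pte);

  /*
   * The hardware may have marked the entry dirty concurrently - the
   * atomic read-and-clear ensures we cannot miss this.
   */
  if (pte_dirty(pteval)) {
          /* ... propagate dirty state to the underlying folio ... */
  }
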
Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and its
equivalents for higher page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

588^^^^^^^^^^^^^^^^^^^^^^^
589
590Page table installation is performed with the VMA held stable explicitly by an
591mmap or VMA lock in read or write mode (see the warning in the locking rules
592section for details as to why).
593
594When allocating a P4D, PUD or PMD and setting the relevant entry in the above
595PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
596acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
597:c:func:`!__pmd_alloc` respectively.
598
599.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
600   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
601   references the :c:member:`!mm->page_table_lock`.
602
603Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
604:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
605physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
606:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
607:c:func:`!__pte_alloc`.
608
609Finally, modifying the contents of the PTE requires special treatment, as the
610PTE page table lock must be acquired whenever we want stable and exclusive
611access to entries contained within a PTE, especially when we wish to modify
612them.
613
614This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
615ensure that the PTE hasn't changed from under us, ultimately invoking
616:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
617the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
618must be released via :c:func:`!pte_unmap_unlock`.
619
620.. note:: There are some variants on this, such as
621   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
622   for brevity we do not explore this.  See the comment for
623   :c:func:`!__pte_offset_map_lock` for more details.
624
625When modifying data in ranges we typically only wish to allocate higher page
626tables as necessary, using these locks to avoid races or overwriting anything,
627and set/clear data at the PTE level as required (for instance when page faulting
628or zapping).
629
630A typical pattern taken when traversing page table entries to install a new
631mapping is to optimistically determine whether the page table entry in the table
632above is empty, if so, only then acquiring the page table lock and checking
633again to see if it was allocated underneath us.
634
635This allows for a traversal with page table locks only being taken when
636required. An example of this is :c:func:`!__pud_alloc`.
637
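A simplified sketch of this pattern, modelled on :c:func:`!__pud_alloc` (the
actual implementation additionally handles accounting and memory barriers):

.. code-block:: c

  /* Sketch modelled on __pud_alloc() - ensure a PUD table exists. */
  static int pud_alloc_sketch(struct mm_struct *mm, p4d_t *p4d,
                              unsigned long addr)
  {
          if (p4d_none(p4dp_get(p4d))) {
                  pud_t *new = pud_alloc_one(mm, addr);

                  if (!new)
                          return -ENOMEM;

                  spin_lock(&mm->page_table_lock);
                  if (p4d_present(p4dp_get(p4d)))
                          pud_free(mm, new); /* Raced - already populated. */
                  else
                          p4d_populate(mm, p4d, new);
                  spin_unlock(&mm->page_table_lock);
          }
          return 0;
  }
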
At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks, and a THP collapse, for instance, might
have eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, however it's important
that no other VMAs overlap the range, and that no route remains by which
addresses within the range whose page tables are being torn down can be
accessed.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independent of
          the page tables above it as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
via :c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock; releasing or downgrading the mmap write lock also releases the VMA write
lock, so there is no :c:func:`!vma_end_write` function.

Note that a semaphore write lock is not held across a VMA lock. Rather, a
sequence number is used for serialisation, and the write semaphore is only
acquired at the point of write lock to update this.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
read/write semaphore and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap lock to be write-locked and the VMA lock to be
acquired via :c:func:`!vma_start_write`; however, the write lock is released by
the termination or downgrade of the mmap write lock, so no
:c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
sequence count, :c:member:`!vma->vm_lock_seq`, then the VMA is write-locked. If
they differ, then it is not.

Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked, which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we acquire a read lock on the
:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
the sequence count of the VMA does not match that of the mm.

If it does, the read lock fails. If it does not, we hold the lock, excluding
writers, but permitting other readers, who will also obtain this lock under RCU.

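Expressed as pseudocode, the read lock acquisition is roughly equivalent to the
below sketch (illustrative only - the precise fields, ordering and memory
barriers differ in the actual implementation and may change between kernel
versions):

.. code-block:: c

  /* Sketch: returns true if the VMA read lock was obtained. */
  static bool vma_start_read_sketch(struct vm_area_struct *vma)
  {
          /* Never wait - contention means optimistic failure. */
          if (!down_read_trylock(&vma->vm_lock->lock))
                  return false;

          /* Sequence counts match - the VMA is write-locked, back off. */
          if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq) {
                  up_read(&vma->vm_lock->lock);
                  return false;
          }

          return true;
  }
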
Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
read/write semaphore, before setting the VMA's sequence number under this lock,
also simultaneously holding the mmap write lock.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the lock is released, avoiding
complexity with a long-term held write lock.

This clever combination of a read/write semaphore and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

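A sketch of how a downgrade is typically used - mutate the address space layout
under the write lock, then downgrade to perform expensive work which requires
only that the layout remain stable:

.. code-block:: c

  mmap_write_lock(mm);
  /* ... detach VMAs, adjust the address space layout ... */
  mmap_write_downgrade(mm); /* Also releases all VMA write locks. */
  /* ... e.g. zap page tables - the layout cannot change under us ... */
  mmap_read_unlock(mm);
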
Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults; as a result we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
856