.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes. Userland access outside of VMAs is invalid except in the case where
an adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` or
  :c:func:`!vma_start_write_killable` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be stabilised via
  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
  locks as the reverse mapping locks, or 'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieve is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`
  (see the sketch following this list), *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

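As an illustration, here is a minimal sketch of the optimistic per-VMA lookup
with the mmap read lock fall-back (assuming :c:macro:`!mm` and :c:macro:`!addr`
are in scope, and eliding error handling):

.. code-block:: c

   struct vm_area_struct *vma;

   /* Optimistically try the per-VMA lock; this can fail under contention. */
   vma = lock_vma_under_rcu(mm, addr);
   if (vma) {
           /* ... read vma->vm_start, vma->vm_flags, etc. ... */
           vma_end_read(vma);
   } else {
           /* Fall back to stabilising the entire address space. */
           mmap_read_lock(mm);
           vma = vma_lookup(mm, addr);
           /* ... read VMA metadata ... */
           mmap_read_unlock(mm);
   }
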
If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called (a sketch of this pattern follows this list).
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

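A minimal sketch of the write-side pattern for the majority of fields (again
assuming :c:macro:`!mm` and :c:macro:`!addr` are in scope):

.. code-block:: c

   struct vm_area_struct *vma;

   mmap_write_lock(mm);

   vma = vma_lookup(mm, addr);
   if (vma) {
           /* Exclude racing readers holding only the VMA read lock. */
           vma_start_write(vma);
           /* ... modify fields, e.g. flags via vm_flags_set(vma, ...) ... */
   }

   /* Releasing the mmap write lock releases all VMA write locks too. */
   mmap_write_unlock(mm);
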
VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to lookup the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock, that attempt will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with other
          readers and write locks exclusive against all others holding the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW'd     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contain entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page table levels
          than five, the kernel cleverly 'folds' page table levels, that is,
          stubbing out functions related to the skipped levels. This allows us
          to conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
   also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
   (see the sketch following this list). The VMA need only be kept stable for
   this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

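As an illustration of zapping, a hedged sketch of how a filesystem might zap
leaf entries when truncating a file (the :c:func:`!example_truncate_zap` helper
is hypothetical):

.. code-block:: c

   /*
    * Remove leaf entries mapping the truncated range from every VMA
    * mapping the file. Only leaf entries are cleared; the page tables
    * themselves remain allocated.
    */
   static void example_truncate_zap(struct inode *inode, loff_t newsize)
   {
           loff_t holebegin = round_up(newsize, PAGE_SIZE);

           /* holelen == 0 means 'to the end of the file';
            * even_cows == 1 also zaps private CoW copies of the pages. */
           unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
   }
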
.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Traversing non-VMA page tables
------------------------------

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them, and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot-plug, kernel page tables are not torn
down all that often, this usually suffices; however, any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

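The same :c:struct:`!struct mm_walk_ops` callback machinery drives both VMA and
kernel page table walks. As a hedged sketch, a walk counting populated PTEs in
a user range (the :c:func:`!count_present_ptes` helper is hypothetical):

.. code-block:: c

   #include <linux/pagewalk.h>

   /* Invoked for each PTE visited by the walk. */
   static int count_pte(pte_t *pte, unsigned long addr, unsigned long next,
                        struct mm_walk *walk)
   {
           unsigned long *count = walk->private;

           if (!pte_none(ptep_get(pte)))
                   (*count)++;
           return 0;
   }

   static const struct mm_walk_ops count_ops = {
           .pte_entry = count_pte,
   };

   static unsigned long count_present_ptes(struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
   {
           unsigned long count = 0;

           /* The walk asserts that the mmap lock is held. */
           mmap_read_lock(mm);
           walk_page_range(mm, start, end, &count_ops, &count);
           mmap_read_unlock(mm);

           return count;
   }
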
We also permit a truly unusual case: the traversal of non-VMA ranges in
**userland**, as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
even if they are highly unusual (possibly architecture-specific) and are not
backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with :c:func:`!munmap`, and thus an mmap
**write** lock is required.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
   but in doing so inadvertently cause a mutual deadlock.

   For example, consider thread 1 which holds lock A and tries to acquire lock B,
   while thread 2 holds lock B and tries to acquire lock A.

   Both threads are now deadlocked on each other. However, had they attempted to
   acquire locks in the same order, one would have waited for the other to
   complete its work and no deadlock would have occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                          i_pages lock (widely used)
                            lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                  in arch-dependent flush_dcache_mmap_lock,
                                  within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                        (truncate_pagecache)
    ->private_lock                      (__free_pte->block_dirty_folio)
      ->swap_lock                       (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock                   (acquired by fs in truncate path)
      ->i_mmap_rwsem                    (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock     (various, mainly in memory.c)
        ->i_pages lock                  (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock                   (filemap_fault)
      ->lock_page                       (filemap_fault, access_process_vm)

  ->i_rwsem                             (generic_perform_write)
    ->mmap_lock                         (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                             (fs/fs-writeback.c)
    ->i_pages lock                      (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock                     (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock       (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                         (try_to_unmap_one)
    ->private_lock                      (try_to_unmap_one)
    ->i_pages lock                      (try_to_unmap_one)
    ->lruvec->lru_lock                  (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock                  (check_pte_range->folio_isolate_lru)
    ->private_lock                      (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock                      (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock                   (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock                     (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock                   (zap_pte_range->set_page_dirty)
    ->inode->i_lock                     (zap_pte_range->set_page_dirty)
    ->private_lock                      (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

.. note:: This section explores page table locking requirements for page tables
          encompassed by a VMA. See the above section on non-VMA page table
          traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into high memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

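A minimal sketch of acquiring these locks (assuming :c:macro:`!mm` and
:c:macro:`!pmd` are in scope; the PTE case is shown later with
:c:func:`!pte_offset_map_lock`):

.. code-block:: c

   spinlock_t *ptl;

   /* Higher levels: a single per-address-space lock. */
   spin_lock(&mm->page_table_lock);
   /* ... modify PGD/P4D/PUD entries ... */
   spin_unlock(&mm->page_table_lock);

   /* PMD level: a fine-grained lock found via the PMD's struct ptdesc. */
   ptl = pmd_lock(mm, pmd);
   /* ... modify the PMD entry ... */
   spin_unlock(ptl);
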
Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write), doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of pmd entry still meet
  the requirements. In particular, this also happens in :c:func:`!retract_page_tables`
  when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.

Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations run in parallel (though holding the VMA stable), and
functionality like GUP-fast locklessly traverses (that is, reads) page tables
without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

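For example, a hedged sketch of the snapshot idiom (assuming :c:macro:`!pmdp`
points at the PMD entry of interest):

.. code-block:: c

   pmd_t pmdval = pmdp_get(pmdp);  /* a READ_ONCE() of the entry */

   /*
    * Test only the local snapshot, never *pmdp directly - a second
    * dereference could observe a concurrently-modified, different value.
    */
   if (pmd_none(pmdval) || !pmd_present(pmdval))
           return 0;
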
However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
   references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

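A minimal sketch of this pattern (assuming :c:macro:`!mm`, :c:macro:`!pmd`,
:c:macro:`!addr` and the :c:macro:`!new_pte` entry to install are in scope):

.. code-block:: c

   pte_t *pte;
   spinlock_t *ptl;

   pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
   if (!pte)
           return -EAGAIN; /* The PMD changed under us, retry. */

   /* The PTE table is mapped (kmap on 32-bit) and exclusively locked. */
   if (pte_none(ptep_get(pte)))
           set_pte_at(mm, addr, pte, new_pte);

   pte_unmap_unlock(pte, ptl);
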
.. note:: There are some variants on this, such as
   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
   for brevity we do not explore this. See the comment for
   :c:func:`!__pte_offset_map_lock` for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically check whether the page table entry in the table
above is empty and, only if so, to acquire the page table lock and check again
in case it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`, sketched below.

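A simplified sketch of this optimistic-check pattern, loosely modelled on
:c:func:`!__pud_alloc` (the real function also updates counters and issues
memory barriers):

.. code-block:: c

   if (p4d_none(*p4d)) {                    /* unlocked, optimistic check */
           pud_t *new = pud_alloc_one(mm, addr);

           if (!new)
                   return -ENOMEM;

           spin_lock(&mm->page_table_lock);
           if (p4d_none(*p4d))              /* recheck under the lock */
                   p4d_populate(mm, p4d, new);
           else                             /* lost the race, free ours */
                   pud_free(mm, new);
           spin_unlock(&mm->page_table_lock);
   }
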
At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, however it's important
that no new ones overlap these or any route remain to permit access to addresses
within the range whose page tables are being torn down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independent of
          the page tables above it as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases when the user already holds the mmap read lock, :c:func:`!vma_start_read_locked`
and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
fail due to lock contention but the caller should still check their return values
in case they fail for other reasons.

VMA read locks increment the :c:member:`!vma.vm_refcnt` reference counter for their
duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, and releasing or downgrading the mmap write lock also releases the VMA write
lock, so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
modified so that readers can detect the presence of a writer. The reference counter is
restored once the vma sequence number used for serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`, however the write lock is released by the termination or
downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.

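A hedged sketch of this check (the :c:func:`!vma_is_write_locked` helper is
hypothetical, and the field layout is as at the time of writing):

.. code-block:: c

   /* Simplified: the real check also considers the vm_refcnt state. */
   static bool vma_is_write_locked(struct vm_area_struct *vma)
   {
           return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq.sequence;
   }
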
Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we increment the :c:member:`!vma.vm_refcnt`
reference counter and check that the sequence count of the VMA does not match
that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation the mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, the VMA's sequence number will
indicate the VMA's write-locked state until the mmap write lock is dropped or
downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

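A minimal sketch of the downgrade pattern, as used by :c:func:`!munmap`-style
teardown (assuming :c:macro:`!mm` is in scope):

.. code-block:: c

   mmap_write_lock(mm);
   /* ... detach VMAs, which requires exclusive access ... */

   /* Keep the address space stable, but let readers back in. */
   mmap_write_downgrade(mm);   /* also ends all VMA write locks */

   /* ... tear down page tables under the downgraded read lock ... */
   mmap_read_unlock(mm);
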
An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults; as a result we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.

------------------------
Functions and structures
------------------------

.. kernel-doc:: include/linux/mmap_lock.h
917