.. SPDX-License-Identifier: GPL-2.0

===========
Page Tables
===========

Paged virtual memory was invented along with virtual memory as a concept in
1962 on the Ferranti Atlas Computer which was the first computer with paged
virtual memory. The feature migrated to newer computers and became a de facto
feature of all Unix-like systems as time went by. In 1985 the feature was
included in the Intel 80386, which was the CPU Linux 1.0 was developed on.

Page tables map virtual addresses as seen by the CPU into physical addresses
as seen on the external memory bus.

Linux defines page tables as a hierarchy which is currently five levels in
height. The architecture code for each supported architecture will then
map this to the restrictions of the hardware.

The physical address corresponding to the virtual address is often referenced
by the underlying physical page frame. The **page frame number** or **pfn**
is the physical address of the page (as seen on the external memory bus)
divided by `PAGE_SIZE`.

Physical memory address 0 will be *pfn 0* and the highest pfn will be
the last page of physical memory the external address bus of the CPU can
address.

With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are
at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.

As you can see, with 4KB pages the page base address uses bits 12-31 of the
address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
`PAGE_SIZE` is usually defined in terms of the page shift as
`(1 << PAGE_SHIFT)`.

Over time a deeper hierarchy has been developed in response to increasing memory
sizes.
When Linux was created, 4KB pages and a single page table called
`swapper_pg_dir` with 1024 entries were used, covering 4MB which coincided with
the fact that Torvalds's first computer had 4MB of physical memory. Entries in
this single table were referred to as *PTE*:s - page table entries.

The software page table hierarchy reflects the fact that page table hardware has
become hierarchical and that in turn is done to save page table memory and
speed up mapping.

One could of course imagine a single, linear page table with enormous amounts
of entries, breaking down the whole memory into single pages. Such a page table
would be very sparse, because large portions of the virtual memory usually
remain unused. By using hierarchical page tables large holes in the virtual
address space do not waste valuable page table memory, because it will suffice
to mark large areas as unmapped at a higher level in the page table hierarchy.

Additionally, on modern CPUs, a higher level page table entry can point directly
to a physical memory range, which allows mapping a contiguous range of several
megabytes or even gigabytes in a single high-level page table entry, taking
shortcuts in mapping virtual memory to physical memory: there is no need to
traverse deeper in the hierarchy when you find a large mapped range like this.

The page table hierarchy has now developed into this::

  +-----+
  | PGD |
  +-----+
     |
     |   +-----+
     +-->| P4D |
         +-----+
            |
            |   +-----+
            +-->| PUD |
                +-----+
                   |
                   |   +-----+
                   +-->| PMD |
                       +-----+
                          |
                          |   +-----+
                          +-->| PTE |
                              +-----+


Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
  mapping a single page of virtual memory to a single page of physical memory.
  The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being some
  architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
  above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
  the other levels to handle 4-level page tables. It is potentially unused,
  or *folded* as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. Now it was clear
  that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure indicating
  the directory level and that we cannot go on with ad hoc names any more. This
  is only used on systems which actually have 5 levels of page tables, otherwise
  it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
  main page table handling the PGD for the kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct` which
  in turn is referenced in each `struct task_struct`.
  So tasks have memory context in the form of a `struct mm_struct` and this
  in turn has a `pgd_t *pgd` pointer to the corresponding page global
  directory.

To repeat: each level in the page table hierarchy is an *array of pointers*, so
the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
pointers on each level is architecture-defined::

        PMD
  --> +-----+           PTE
      | ptr |-------> +-----+
      | ptr |-        | ptr |-------> PAGE
      | ptr | \       | ptr |
      | ptr |  \        ...
      | ... |   \
      | ptr |    \         PTE
      +-----+     +----> +-----+
                         | ptr |-------> PAGE
                         | ptr |
                           ...


Page Table Folding
==================

If the architecture does not use all the page table levels, they can be *folded*
which means skipped, and all operations performed on page tables will be
compile-time augmented to just skip a level when accessing the next lower
level.

Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.


MMU, TLB, and Page Faults
=========================

The `Memory Management Unit (MMU)` is a hardware component that handles virtual
to physical address translations. It may use relatively small caches in hardware
called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
these translations.

When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether a translation already exists in the TLB or in the Page
Walk Caches (on architectures that support them). If no translation is found,
the MMU walks the page tables to determine the physical address and create the
mapping.
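
The pfn arithmetic and the way a page walk selects one entry per level can be
sketched in plain user-space C. This is only an illustration under stated
assumptions: 4KB pages and 9 index bits per level (the layout used by x86-64
with 5-level paging). All `EX_*` and `ex_*` names are hypothetical and are not
kernel symbols; the real shifts and table sizes are architecture-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; real ones are architecture-defined. */
#define EX_PAGE_SHIFT     12
#define EX_PAGE_SIZE      (UINT64_C(1) << EX_PAGE_SHIFT)
#define EX_BITS_PER_LEVEL 9
#define EX_PTRS_PER_LEVEL (1u << EX_BITS_PER_LEVEL)

/* Levels counted from the bottom of the hierarchy, as in the list above. */
enum ex_level { EX_PTE = 0, EX_PMD, EX_PUD, EX_P4D, EX_PGD };

/* The pfn is the physical address divided by the page size. */
static inline uint64_t ex_phys_to_pfn(uint64_t phys)
{
	return phys >> EX_PAGE_SHIFT;
}

/* Base physical address of the page frame with the given pfn. */
static inline uint64_t ex_pfn_to_phys(uint64_t pfn)
{
	return pfn << EX_PAGE_SHIFT;
}

/* Index of the entry that a virtual address selects at a given level. */
static inline unsigned int ex_index(uint64_t vaddr, enum ex_level level)
{
	unsigned int shift;

	assert(level <= EX_PGD);
	shift = EX_PAGE_SHIFT + EX_BITS_PER_LEVEL * (unsigned int)level;
	return (unsigned int)((vaddr >> shift) & (EX_PTRS_PER_LEVEL - 1));
}

/* Byte offset of the virtual address inside its page. */
static inline uint64_t ex_page_offset(uint64_t vaddr)
{
	return vaddr & (EX_PAGE_SIZE - 1);
}
```

Under these assumptions the virtual address 0x00401234 selects entry 2 at the
PMD level and entry 1 at the PTE level, with byte offset 0x234 inside the page.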

Each page of memory has associated permission and dirty bits. The dirty bit
for a page is set (i.e., turned on) when the page is written to, and it
indicates that the page has been modified since it was loaded into memory.

If nothing prevents it, eventually the physical memory can be accessed and the
requested operation on the physical frame is performed.

There are several reasons why the MMU can't find certain translations. It could
happen because the CPU is trying to access memory that the current task is not
permitted to, or because the data is not present in physical memory.

When these conditions happen, the MMU triggers page faults, which are types of
exceptions that signal the CPU to pause the current execution and run a special
function to handle the mentioned exceptions.

There are common and expected causes of page faults. These are triggered by
process management optimization techniques called "Lazy Allocation" and
"Copy-on-Write". Page faults may also happen when frames have been swapped out
to persistent storage (swap partition or file) and evicted from their physical
locations.

These techniques improve memory efficiency, reduce latency, and minimize space
occupation. This document won't go deeper into the details of "Lazy Allocation"
and "Copy-on-Write" because these subjects are out of scope as they belong to
Process Address Management.

Swapping differs from the other mentioned techniques in that it is undesirable:
it is performed as a means of reclaiming memory under heavy pressure.

Swapping can't work for memory mapped by kernel logical addresses. These are a
subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is determined
with simple arithmetic on an offset.
Accesses to logical addresses are fast because they avoid the need for complex
page table lookups, at the expense of the frames not being evictable or
pageable.

If the kernel fails to make room for the data that must be present in the
physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
by terminating lower priority processes until pressure drops below a safe
threshold.

Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong to
its own address space, or could try to execute an instruction that attempts to
write to a read-only location.

If the above-mentioned conditions happen in user-space, the kernel sends a
`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
causes the termination of the thread and of the process it belongs to.

This document is going to simplify and show a high-altitude view of how the
Linux kernel handles these page faults, creates tables and tables' entries,
checks whether memory is present and, if not, requests to load data from
persistent storage or from other devices, and updates the MMU and its caches.

The first steps are architecture dependent. Most architectures jump to
`do_page_fault()`, whereas the x86 interrupt handler is defined by the
`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.

Whatever the route, all architectures end up invoking `handle_mm_fault()`
which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out the
actual work of allocating the page tables.
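
The idea of allocating page tables on demand while walking down the hierarchy
can be pictured with a toy user-space model. This is a rough sketch and not the
kernel implementation: every `toy_*` and `TOY_*` name is hypothetical, tables
are plain heap arrays, five levels of 512 slots each are assumed, a lowest-level
slot stores `pfn + 1` so that 0 can mean "not present", and a 64-bit host is
assumed so that a pfn fits in a slot. Freeing the tables is left out for
brevity.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12
#define TOY_BITS       9
#define TOY_PTRS       (1u << TOY_BITS)
#define TOY_LEVELS     5	/* pgd, p4d, pud, pmd, pte */

/* One table at any level: upper-level slots point to the next lower table. */
struct toy_table {
	uintptr_t slot[TOY_PTRS];
};

/* Level 0 is the lowest table (pte), TOY_LEVELS - 1 is the top (pgd). */
static unsigned int toy_index(uint64_t vaddr, int level)
{
	return (unsigned int)((vaddr >> (TOY_PAGE_SHIFT + TOY_BITS * level)) &
			      (TOY_PTRS - 1));
}

/*
 * Walk from the top table to the lowest one. Missing tables are allocated
 * only when 'alloc' is non-zero. Returns the lowest-level slot for vaddr,
 * or NULL if a table is missing or cannot be allocated.
 */
static uintptr_t *toy_walk(struct toy_table *top, uint64_t vaddr, int alloc)
{
	struct toy_table *table = top;

	for (int level = TOY_LEVELS - 1; level > 0; level--) {
		uintptr_t *slot = &table->slot[toy_index(vaddr, level)];

		if (*slot == 0) {
			struct toy_table *next;

			if (!alloc)
				return NULL;
			next = calloc(1, sizeof(*next));
			if (next == NULL)
				return NULL;
			*slot = (uintptr_t)next;
		}
		table = (struct toy_table *)*slot;
	}
	return &table->slot[toy_index(vaddr, 0)];
}

/* Install a mapping from the page containing vaddr to the given pfn. */
static int toy_map(struct toy_table *top, uint64_t vaddr, uint64_t pfn)
{
	uintptr_t *pte = toy_walk(top, vaddr, 1);

	if (pte == NULL)
		return -1;
	*pte = (uintptr_t)(pfn + 1);
	return 0;
}

/* Translate vaddr; -1 stands for the case that would raise a page fault. */
static int toy_translate(struct toy_table *top, uint64_t vaddr, uint64_t *phys)
{
	uintptr_t *pte = toy_walk(top, vaddr, 0);

	if (pte == NULL || *pte == 0)
		return -1;
	*phys = ((uint64_t)(*pte - 1) << TOY_PAGE_SHIFT) |
		(vaddr & ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1));
	return 0;
}
```

Starting from a zeroed top-level table, a translation fails until `toy_map()`
has been called for that page. The call allocates one table per missing level,
which mirrors how sparse address spaces only pay for the tables they use.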

The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address is pointing to areas of physical memory which are not
permitted to be accessed (at least from the current context). This
condition resolves to the kernel sending the above-mentioned SIGSEGV signal
to the process and leads to the consequences already explained.

`__handle_mm_fault()` carries out its work by calling several functions to
find the entry offsets of the upper layers of the page tables and allocate
the tables that it may need.

The functions that look for the offset have names like `*_offset()`, where the
"*" is for pgd, p4d, pud, pmd, pte; the functions that allocate the
corresponding tables, layer by layer, are instead called `*_alloc`, using the
above-mentioned convention to name them after the corresponding types of tables
in the hierarchy.

The page table walk may end at one of the middle or upper layers (PMD, PUD).

Linux supports larger page sizes than the usual 4KB (i.e., the so called
`huge pages`). When using these kinds of larger pages, higher level page table
entries can directly map them, with no need to use lower level page entries
(PTE). Huge pages contain large contiguous physical regions that usually span
from 2MB to 1GB. They are respectively mapped by the PMD and PUD page entries.

Huge pages bring with them several benefits like reduced TLB pressure,
reduced page table overhead, memory allocation efficiency, and performance
improvement for certain workloads. However, these benefits come with
trade-offs, like wasted memory and allocation challenges.

At the very end of the walk with allocations, if it didn't return errors,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
"read", "cow", "shared" give hints about the reasons and the kind of fault it's
handling.

The actual implementation of the workflow is very complex. Its design allows
Linux to handle page faults in a way that is tailored to the specific
characteristics of each architecture, while still sharing a common overall
structure.

To conclude this high-altitude view of how Linux handles page faults, let's
add that the page fault handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.

Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.