.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications. When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy. When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT). There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
time. Once enabled at compile-time, it can be disabled at boot with
the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI. This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level. This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.
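
The effect of the NX bit can be illustrated with a small user-space
model. This is only a sketch and not the kernel's code: ``struct
pgd_pair``, ``model_set_user_pgd()`` and ``model_can_execute()`` are
hypothetical names made up for this example, and the model assumes
4-level paging, where the lower half of the 512 top-level entries maps
userspace. Only the bit positions (present = bit 0, user = bit 2,
NX = bit 63) are the architectural x86-64 values.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86-64 page-table entry bits. */
#define PTE_PRESENT (UINT64_C(1) << 0)
#define PTE_USER    (UINT64_C(1) << 2)
#define PTE_NX      (UINT64_C(1) << 63)

#define PGD_ENTRIES      512
/* Assumption: 4-level paging, lower half of the PGD maps userspace. */
#define PGD_USER_ENTRIES (PGD_ENTRIES / 2)

/* Hypothetical model of the two top-level tables of one process. */
struct pgd_pair {
	uint64_t kernel[PGD_ENTRIES]; /* in use while running in the kernel */
	uint64_t user[PGD_ENTRIES];   /* in use while running in userspace */
};

/*
 * Install a top-level entry that maps userspace.  The user copy gets
 * the entry unchanged.  The kernel copy gets the same entry with NX
 * set, so the mapping stays readable and writable from the kernel but
 * user code cannot be executed through the kernel page tables.
 */
static inline void model_set_user_pgd(struct pgd_pair *p, unsigned int idx,
				      uint64_t entry)
{
	assert(idx < PGD_USER_ENTRIES);

	p->user[idx] = entry;
	if ((entry & (PTE_PRESENT | PTE_USER)) == (PTE_PRESENT | PTE_USER))
		entry |= PTE_NX;
	p->kernel[idx] = entry;
}

/* Would an instruction fetch through this top-level entry be allowed? */
static inline int model_can_execute(uint64_t entry)
{
	return (entry & PTE_PRESENT) && !(entry & PTE_NX);
}
```

In this model, if the CPU returned to user mode while still using the
kernel copy, the first user instruction fetch would go through an entry
for which ``model_can_execute()`` returns 0. That corresponds to the
immediate crash described above, rather than userspace silently
running on the full kernel page tables.
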

The userspace page tables map only the kernel data needed to enter
and exit the kernel. This data is entirely contained in the 'struct
cpu_entry_area' structure which is placed in the fixmap which gives
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal. The only difference is when the kernel
makes entries in the top (PGD) level. In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables. This leaves a single, shared set of
userspace page tables to manage. One PTE to lock, one set of
accessed bits, dirty bits, etc.

Overhead
========

Protection against side-channel attacks is important. But,
this protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD instead of order-0.
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry. This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost

  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (it can be skipped when the kernel is interrupted,
     though.) Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. The per-CPU TSS is mapped into the user page tables to allow the
     SYSCALL64 path to work under PTI. This doesn't have a direct
     runtime cost but it can be argued it opens certain timing attack
     scenarios.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.
     This feature of the MMU allows different processes to share TLB
     entries mapping the kernel. Losing the feature means more
     TLB misses after a context switch. The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed. This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper. But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB. The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process. Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process. But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures. At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace. This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB. That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs. Some systems support
     PCIDs, but do not support INVPCID. On these systems, addresses
     can only be flushed from the TLB for the current PCID.
     When flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.

Possible Future Work
====================

1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes. These tests frequently uncover corner cases in the
   kernel entry code. In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts). This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs. Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

     while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code. Usually a bug in one of the
   more obscure corners of entry_64.S
 * Crashes in early boot, especially around CPU bringup.
   Bugs in the mappings cause these.
 * Crashes at the first interrupt. Caused by bugs in entry_64.S,
   like screwing up a page table switch. Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI. The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts. Also caused by incorrectly mapping NMI
   code. NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace. entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at first interrupt that interrupts userspace. The paths
   in entry_64.S that return to userspace are sometimes separate
   from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults. Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs. These have
   tended to be TLB invalidation issues. Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf