.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
time.  Once enabled at compile-time, it can be disabled at boot with
the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although _complete_, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure which is placed in the fixmap which gives
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

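
The PGD-level mirroring described above can be modelled in a few lines
of user-space C.  This is only an illustrative sketch, not the kernel's
implementation: the structure, the function names and the constant
names are all hypothetical, and it assumes the x86-64 layout in which
the lower 256 of 512 PGD slots map userspace and the user copy sits
directly after the kernel copy in an order-1 allocation.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical; this is a user-space model only. */
#define PTRS_PER_PGD   512          /* entries in one 4k top-level table */
#define USER_PGD_SLOTS 256          /* assumed: lower half maps userspace */
#define PAGE_PRESENT   (1ULL << 0)  /* x86 "present" bit */
#define PAGE_USER      (1ULL << 2)  /* x86 "user accessible" bit */
#define PAGE_NX        (1ULL << 63) /* x86 "no execute" bit */

/* Order-1 PGD: the kernel copy followed by the userspace copy. */
struct pti_pgd {
	uint64_t kernel[PTRS_PER_PGD];
	uint64_t user[PTRS_PER_PGD];
};

static bool slot_maps_userspace(unsigned int slot)
{
	return slot < USER_PGD_SLOTS;
}

/*
 * Install a top-level entry.  Entries that map userspace are mirrored
 * into the user copy, so both copies point at the same lower-level
 * tables.  The kernel copy of such an entry additionally gets NX set,
 * so userspace cannot execute if a switch to the user CR3 is missed.
 */
static void model_set_pgd(struct pti_pgd *pgd, unsigned int slot,
			  uint64_t entry)
{
	if (slot_maps_userspace(slot)) {
		pgd->user[slot] = entry;
		if ((entry & (PAGE_PRESENT | PAGE_USER)) ==
		    (PAGE_PRESENT | PAGE_USER))
			entry |= PAGE_NX;
	}
	pgd->kernel[slot] = entry;
}
```

In this model, installing a present user entry in slot 3 leaves the
same value in both copies except for the NX bit in the kernel copy,
while an entry in the kernel half (for example slot 300) never reaches
the user copy.
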

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage.  One PTE to lock, one set of
accessed bits, dirty bits, etc...

Overhead
========

Protection against side-channel attacks is important.  But,
this protection comes at a cost:

1. Increased Memory Use

   a. Each process now needs an order-1 PGD instead of order-0.
      (Consumes an additional 4k per process).
   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
      aligned so that it can be mapped by setting a single PMD
      entry.  This consumes nearly 2MB of RAM once the kernel
      is decompressed, but no space in the kernel image itself.

2. Runtime Cost

   a. CR3 manipulation to switch between the page table copies
      must be done at interrupt, syscall, and exception entry
      and exit (it can be skipped when the kernel is interrupted,
      though.)  Moves to CR3 are on the order of a hundred
      cycles, and are required at every entry and exit.
   b. Percpu TSS is mapped into the user page tables to allow SYSCALL64 path
      to work under PTI. This doesn't have a direct runtime cost but it can
      be argued it opens certain timing attack scenarios.
   c. Global pages are disabled for all kernel structures not
      mapped into both kernel and userspace page tables.  This
      feature of the MMU allows different processes to share TLB
      entries mapping the kernel.  Losing the feature means more
      TLB misses after a context switch.  The actual loss of
      performance is very small, however, never exceeding 1%.
   d. Process Context IDentifiers (PCID) is a CPU feature that
      allows us to skip flushing the entire TLB when switching page
      tables by setting a special bit in CR3 when the page tables
      are changed.  This makes switching the page tables (at context
      switch, or kernel entry/exit) cheaper.  But, on systems with
      PCID support, the context switch code must flush both the user
      and kernel entries out of the TLB.  The user PCID TLB flush is
      deferred until the exit to userspace, minimizing the cost.
      See intel.com/sdm for the gory PCID/INVPCID details.
   e. The userspace page tables must be populated for each new
      process.  Even without PTI, the shared kernel mappings
      are created by copying top-level (PGD) entries into each
      new process.  But, with PTI, there are now *two* kernel
      mappings: one in the kernel page tables that maps everything
      and one for the entry/exit structures.  At fork(), we need to
      copy both.
   f. In addition to the fork()-time copying, there must also
      be an update to the userspace PGD any time a set_pgd() is done
      on a PGD used to map userspace.  This ensures that the kernel
      and userspace copies always map the same userspace
      memory.
   g. On systems without PCID support, each CR3 write flushes
      the entire TLB.  That means that each syscall, interrupt
      or exception flushes the TLB.
   h. INVPCID is a TLB-flushing instruction which allows flushing
      of TLB entries for non-current PCIDs.  Some systems support
      PCIDs, but do not support INVPCID.  On these systems, addresses
      can only be flushed from the TLB for the current PCID.  When
      flushing a kernel address, we need to flush all PCIDs, so a
      single kernel address flush will require a TLB-flushing CR3
      write upon the next use of every PCID.

Possible Future Work
====================

1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
========

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

     while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at first interrupt that interrupts userspace.  The paths
   in entry_64.S that return to userspace are sometimes separate
   from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf