.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (PTI, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel, such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  A few things that are not strictly necessary are also mapped,
such as the first C function reached when entering an interrupt (see
the comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
time.  Once enabled at compile time, it can be disabled at boot with
the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
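
Whether PTI ended up active on a running system can be checked from
userspace.  The following is a minimal Python sketch, not part of the
kernel tree.  It assumes that the kernel exposes the Meltdown status in
/sys/devices/system/cpu/vulnerabilities/meltdown and reports
"Mitigation: PTI" when page table isolation is in use; the helper names
``pti_active()`` and ``read_status()`` are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the Meltdown status string.
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def pti_active(status: str) -> bool:
    """Return True if a status string reports PTI as the active mitigation."""
    return status.strip() == "Mitigation: PTI"


def read_status(path: Path = MELTDOWN_STATUS) -> str:
    """Read the status string, or return "unknown" if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"


if __name__ == "__main__":
    status = read_status()
    print(f"{status} (PTI active: {pti_active(status)})")
```

On CPUs that are not affected by Meltdown, or after booting with
'nopti', the status string is expected to differ and ``pti_active()``
returns False.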

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.
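
The effect of the NX bit can be illustrated with a small model.  This is
a Python sketch rather than kernel code.  It assumes the x86-64 entry
layout in which bit 0 is the present bit and bit 63 is the no-execute
bit; the helper names are hypothetical.

```python
PAGE_PRESENT = 1 << 0   # x86-64: entry is valid
PAGE_NX = 1 << 63       # x86-64: no-execute


def poison_user_entry(pgd_entry: int) -> int:
    """Return the entry as kept in the kernel copy: same target, NX set."""
    if pgd_entry & PAGE_PRESENT:
        return pgd_entry | PAGE_NX
    return pgd_entry  # non-present entries are left alone


def can_execute(pgd_entry: int) -> bool:
    """Instruction fetch through this top-level entry is allowed."""
    return bool(pgd_entry & PAGE_PRESENT) and not (pgd_entry & PAGE_NX)
```

Because NX at the top level applies to everything mapped beneath it,
userspace that is accidentally resumed on the kernel page tables faults
on its first instruction fetch, while data accesses made by the kernel
through the same entries keep working.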

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap.  The fixmap
gives each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables as usual.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage: one PTE to lock, one set of
accessed bits, dirty bits, and so on.
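
The mirroring described above can be modelled in a few lines.  This is a
simplified Python sketch, not the kernel implementation.  It assumes
4-level paging with 512 entries per PGD, of which the lower 256 cover
userspace, and it leaves out the NX handling; the class name ``PgdPair``
is hypothetical.  Lower-level tables are represented by plain
dictionaries.

```python
ENTRIES_PER_PGD = 512                 # assumption: 4 KiB page, 8-byte entries
USER_ENTRIES = ENTRIES_PER_PGD // 2   # assumption: lower half maps userspace


class PgdPair:
    """Kernel PGD plus the userspace PGD that shadows its user half."""

    def __init__(self):
        self.kernel = [None] * ENTRIES_PER_PGD
        self.user = [None] * ENTRIES_PER_PGD

    def set_pgd(self, index, lower_table):
        """Install a top-level entry, mirroring it if it maps userspace."""
        self.kernel[index] = lower_table
        if index < USER_ENTRIES:
            # Same lower-level table, so everything below it is shared.
            self.user[index] = lower_table


pair = PgdPair()
pair.set_pgd(3, {})                   # a userspace mapping
pair.kernel[3]["pte"] = "dirty"       # update made via the kernel copy
print(pair.user[3])                   # the user copy sees the same table
```

Only the top-level entry is duplicated.  A kernel-only entry, for
example at index 300 in this model, never appears in the user copy.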

Overhead
========

Protection against side-channel attacks is important, but
this protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD instead of order-0.
     (This consumes an additional 4k per process.)
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry.  This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost

  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (although it can be skipped when the kernel itself
     is interrupted).  Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. The percpu TSS is mapped into the user page tables to allow the
     SYSCALL64 path to work under PTI.  This doesn't have a direct
     runtime cost, but it can be argued that it opens certain timing
     attack scenarios.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.  This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel.  Losing the feature means more
     TLB misses after a context switch.  The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed.  This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper.  But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB.  The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process.  Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process.  But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures.  At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace.  This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB.  That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs.  Some systems support
     PCIDs, but do not support INVPCID.  On these systems, addresses
     can only be flushed from the TLB for the current PCID.  When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.
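
Items 2a, 2d and 2g all come down to what gets written to CR3.  The
Python sketch below models how the kernel and user CR3 values might be
composed.  It relies on the x86-64 layout in which, with PCID enabled,
CR3 bits 0-11 hold the PCID and bit 63 requests that the write does not
flush the TLB.  It further assumes that the user PGD is the second 4k
page of the order-1 PGD allocation and that the user PCID is the kernel
PCID with bit 11 set.  The function names are hypothetical.

```python
CR3_NOFLUSH = 1 << 63        # with PCID enabled: do not flush on this write
PCID_MASK = 0xFFF            # CR3 bits 0-11 hold the PCID
USER_PGTABLE_BIT = 1 << 12   # assumption: user PGD follows the kernel PGD
USER_PCID_BIT = 1 << 11      # assumption: user PCID = kernel PCID | bit 11


def kernel_cr3(pgd_phys, pcid, noflush=True):
    """CR3 value selecting the kernel page tables of one address space."""
    assert pgd_phys & 0x1FFF == 0      # order-1 PGD is 8 KiB aligned
    assert 0 <= pcid < (1 << 11)       # bit 11 is reserved for the user half
    value = pgd_phys | pcid
    return value | CR3_NOFLUSH if noflush else value


def user_cr3(kcr3, flush_pending):
    """CR3 value for the exit to userspace, derived from the kernel value.

    A flush of the user PCID that was deferred earlier is carried out
    here by leaving the no-flush bit clear.
    """
    value = (kcr3 & ~CR3_NOFLUSH) | USER_PGTABLE_BIT | USER_PCID_BIT
    return value if flush_pending else value | CR3_NOFLUSH


k = kernel_cr3(0x12346000, pcid=5)
print(hex(user_cr3(k, flush_pending=False)))
```

In this model the kernel-to-user switch is a matter of setting two bits,
which is why it can be done cheaply in the entry code.  Without PCID
support there is no no-flush bit to set, and every such write flushes
the TLB as described in item 2g.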

Possible Future Work
====================

1. We can be more careful to avoid writing to CR3 unless its
   value actually changes.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test the stability of PTI, the following procedure is recommended,
ideally with all of the steps run in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.
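
For step 3, it helps to confirm that NMIs are actually arriving while
perf runs.  The following Python sketch sums the per-CPU counters on the
"NMI" line of /proc/interrupts.  It assumes the usual layout of that
file, a label ending in a colon followed by one counter per CPU and a
textual description; the function name ``nmi_count()`` is hypothetical.

```python
import time


def nmi_count(interrupts_text: str) -> int:
    """Sum the per-CPU NMI counters found in /proc/interrupts content."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() != "NMI":
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break          # reached the textual description
            total += int(field)
        return total
    return 0                   # no NMI line found


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        before = nmi_count(f.read())
    time.sleep(1)
    with open("/proc/interrupts") as f:
        after = nmi_count(f.read())
    print(f"NMIs in the last second: {after - before}")
```

A rate that stays near zero while the perf loop is running suggests that
the chosen events are not being delivered as NMIs on that system.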

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S.
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at the first interrupt that interrupts userspace.  The
   paths in entry_64.S that return to userspace are sometimes
   separate from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf