xref: /linux/Documentation/arch/x86/tdx.rst (revision 6c7353836a91b1479e6b81791cdc163fb04b4834)
.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

TDX Host Kernel Support
=======================

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
CPU-attested software module called 'the TDX module' runs inside the new
isolated range to provide the functionalities to manage and run protected
VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within the SEAM
mode.  The BIOS is responsible for partitioning legacy MKTME KeyIDs and
TDX KeyIDs.

Before the TDX module can be used to create and run protected VMs, it
must be loaded into the isolated range and properly initialized.  The TDX
architecture doesn't require the BIOS to load the TDX module, but the
kernel assumes it is loaded by the BIOS.

TDX boot-time detection
-----------------------

The kernel detects TDX by detecting TDX private KeyIDs during kernel
boot.  The dmesg below shows when TDX is enabled by the BIOS::

  [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
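The KeyID range in that line is half-open: [16, 64) means KeyIDs 16 through 63, i.e. 48 TDX private KeyIDs.  As a minimal illustration (the helper below is hypothetical, not kernel code), the count can be recovered from the message like this:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only: parse the half-open
 * KeyID range "[start, end)" out of the dmesg line above.
 * Returns 0 on success, -1 if the line doesn't contain a range.
 */
static int parse_keyid_range(const char *msg, unsigned int *start,
			     unsigned int *end)
{
	const char *p = strchr(msg, '[');

	if (!p || sscanf(p, "[%u, %u)", start, end) != 2)
		return -1;
	return 0;
}
```

Parsing the example line yields start 16 and end 64, so end - start = 48 private KeyIDs are available for TDX guests.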

TDX module initialization
-------------------------

The kernel talks to the TDX module via the new SEAMCALL instruction.  The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.

If the TDX module isn't loaded, the SEAMCALL instruction fails with a
special error.  In this case the kernel fails the module initialization
and reports the module isn't loaded::

  [..] virt/tdx: module not loaded

Initializing the TDX module consumes roughly 1/256th of system RAM to
use as 'metadata' for the TDX memory.  It also takes additional CPU time
to initialize that metadata along with the TDX module itself.  Neither is
trivial.  The kernel initializes the TDX module at runtime on demand.
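For a sense of scale, the 1/256th overhead can be sketched as below.  This is only a back-of-the-envelope estimate; the exact metadata sizes are reported by the TDX module itself (see the PAMT allocation line in dmesg further down):

```c
#include <stdint.h>

/*
 * Rough estimate only: the metadata overhead is about 1/256th of
 * the memory it covers.  The precise size depends on the TDX module.
 */
static uint64_t tdx_metadata_estimate(uint64_t ram_bytes)
{
	return ram_bytes / 256;
}
```

For example, a machine with 64 GiB of TDX-usable memory would give up roughly 256 MiB of it as metadata.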

Besides initializing the TDX module, a per-cpu initialization SEAMCALL
must be done on one cpu before any other SEAMCALLs can be made on that
cpu.

The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
allow the user of TDX to enable the TDX module and enable TDX on the
local cpu, respectively.

Making a SEAMCALL requires VMXON to have been done on that CPU.
Currently only KVM implements VMXON.  For now neither tdx_enable() nor
tdx_cpu_enable() does VMXON internally (it is not trivial); both depend
on the caller to guarantee that.

To enable TDX, the caller of TDX should: 1) temporarily disable CPU
hotplug; 2) do VMXON and tdx_cpu_enable() on all online cpus; 3) call
tdx_enable().  For example::

        cpus_read_lock();
        on_each_cpu(vmxon_and_tdx_cpu_enable());
        ret = tdx_enable();
        cpus_read_unlock();
        if (ret)
                goto no_tdx;
        // TDX is ready to use

And the caller of TDX must guarantee that tdx_cpu_enable() has been
successfully done on a cpu before it runs any other SEAMCALL on that cpu.
A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU
hotplug online callback, and refuse to online the cpu if tdx_cpu_enable()
fails.
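That hotplug pattern could be sketched as follows.  Only tdx_cpu_enable() is a real kernel interface here; the vmxon_stub() helper, the injected result variable, and the callback name are stand-ins so the sketch is self-contained:

```c
/* Stand-ins for illustration; not kernel code. */
static int vmxon_stub(void) { return 0; }
static int tdx_cpu_enable_ret;			/* injected result for the sketch */
static int tdx_cpu_enable(void) { return tdx_cpu_enable_ret; }

/*
 * CPU hotplug online callback sketch: do VMXON and tdx_cpu_enable(),
 * and refuse to online the cpu (return an error) if either fails, so
 * no SEAMCALL can ever run on a cpu that isn't ready for it.
 */
static int tdx_online_cpu(unsigned int cpu)
{
	int ret = vmxon_stub();

	(void)cpu;
	if (ret)
		return ret;
	return tdx_cpu_enable();
}
```

Returning an error from the online callback is what "refuse to online" means in practice: the CPU stays offline rather than becoming a cpu on which SEAMCALLs would fail.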

The user can consult dmesg to see whether the TDX module has been
initialized.

If the TDX module is initialized successfully, dmesg shows something
like below::

  [..] virt/tdx: 262668 KBs allocated for PAMT
  [..] virt/tdx: module initialized

If the TDX module failed to initialize, dmesg also shows it failed to
initialize::

  [..] virt/tdx: module initialization failed ...

TDX Interaction with Other Kernel Components
--------------------------------------------

TDX Memory Policy
~~~~~~~~~~~~~~~~~

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass those
regions to the TDX module.  Once this is done, those "TDX-usable" memory
regions are fixed during the module's lifetime.

To keep things simple, currently the kernel simply guarantees all pages
in the page allocator are TDX memory.  Specifically, the kernel uses all
system memory in the core-mm "at the time of TDX module initialization"
as TDX memory, and in the meantime, refuses to online any non-TDX memory
via memory hotplug.

Physical Memory Hotplug
~~~~~~~~~~~~~~~~~~~~~~~

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support hot-removal
of any convertible memory.  This implementation doesn't handle ACPI
memory removal but depends on the BIOS to behave correctly.

CPU Hotplug
~~~~~~~~~~~

The TDX module requires that the per-cpu initialization SEAMCALL be done
on one cpu before any other SEAMCALLs can be made on that cpu.  The
kernel provides tdx_cpu_enable() to let the user of TDX do it when it
wants to use a new cpu for a TDX task.

TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
TDX verifies all boot-time present logical CPUs are TDX compatible before
enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
physical CPUs.  Currently the kernel doesn't handle physical CPU hotplug,
but depends on the BIOS to behave correctly.

Note TDX works with CPU logical online/offline, thus the kernel still
allows offlining a logical CPU and onlining it again.

Kexec()
~~~~~~~

TDX host support currently lacks the ability to handle kexec.  For
simplicity only one of the two (TDX host support or kexec) can be
enabled in the Kconfig.  This will be fixed in the future.

Erratum
~~~~~~~

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.

A partial write is a memory write where a write transaction of less than
a cacheline lands at the memory controller.  The CPU does these via
non-temporal write instructions (like MOVNTI), or through UC/WC memory
mappings.  Devices can also do partial writes via DMA.

Theoretically, a kernel bug could do a partial write to TDX private
memory and trigger an unexpected machine check.  What's more, the machine
check code will present these as "Hardware error" when they were, in
fact, a software-triggered issue.  But in the end, this issue is hard to
trigger.

If the platform has this erratum, the kernel prints an additional message
in the machine check handler to tell the user the machine check may be
caused by a kernel bug on TDX private memory.

Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDX cannot survive S3 and deeper states.  The hardware resets and
disables TDX completely when the platform goes to S3 and deeper.  Both
TDX guests and the TDX module get destroyed permanently.

The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
hibernation.  Currently, for simplicity, the kernel chooses to make TDX
mutually exclusive with S3 and hibernation.

The kernel disables TDX during early boot when hibernation support is
available::

  [..] virt/tdx: initialization failed: Hibernation support is enabled

Add 'nohibernate' to the kernel command line to disable hibernation in
order to use TDX.

ACPI S3 is disabled during kernel early boot if TDX is enabled.  The user
needs to turn off TDX in the BIOS in order to use S3.

TDX Guest Support
=================

Since the host cannot directly access guest registers or memory, much of
the normal functionality of a hypervisor must be moved into the guest.
This is implemented using a Virtualization Exception (#VE) that is
handled by the guest kernel.  Some #VEs are handled entirely inside the
guest kernel, but others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions.  The
details for these instructions are discussed below.

Instruction-based #VE
~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests.  Their use likely
indicates a bug in the guest.  The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor.  Guests can
make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling.  They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module.  Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
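The three categories suggest a dispatch pattern like the sketch below.  Everything here is illustrative: the stubbed native access and hypercall are stand-ins, and which MSRs fall into which category is defined by the TDX module, not by the guest:

```c
#include <stdint.h>

enum msr_access_result { MSR_OK, MSR_VE, MSR_GP };	/* illustrative */

static enum msr_access_result native_result;	/* injected for the sketch */

/* Stand-in for the native RDMSR attempt. */
static enum msr_access_result native_rdmsr(uint32_t msr, uint64_t *val)
{
	(void)msr;
	*val = 0;
	return native_result;
}

/* Stand-in for the hypercall asking the hypervisor for the value. */
static int hcall_rdmsr(uint32_t msr, uint64_t *val)
{
	(void)msr;
	*val = 0;
	return 0;
}

/*
 * Sketch of the three outcomes: "just works" MSRs need nothing,
 * #VE MSRs are forwarded to the hypervisor, and #GP MSRs simply
 * fail because their use likely indicates a guest bug.
 */
static int guest_rdmsr(uint32_t msr, uint64_t *val)
{
	switch (native_rdmsr(msr, val)) {
	case MSR_OK:
		return 0;
	case MSR_VE:
		return hcall_rdmsr(msr, val);
	default:
		return -1;	/* #GP: likely a guest bug */
	}
}
```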

CPUID Behavior
~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor.  For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0.  For these bit
  fields, the hypervisor can mask off the native values, but it can not
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle.  The guest kernel may ask the hypervisor for the
value with a hypercall.

#VE on Memory Accesses
----------------------

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections.  Its content is protected
against access from the hypervisor.  Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared.  It selects the behavior with a bit in its page table
entries.  This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
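The shared/private selection is carried in the guest physical address itself.  A small illustration, assuming a 52-bit GPA width so the shared bit is bit 51 (the kernel derives the actual mask at boot and applies it through helpers such as cc_mkenc()/cc_mkdec()):

```c
#include <stdint.h>

/*
 * Assumption for this sketch: 52-bit guest physical address width,
 * so the shared bit is bit 51.  The real mask is platform-dependent
 * and computed by the kernel at boot.
 */
#define TDX_SHARED_MASK	(1ULL << 51)

static uint64_t mkdec(uint64_t paddr)	/* mark decrypted, i.e. shared */
{
	return paddr | TDX_SHARED_MASK;
}

static uint64_t mkenc(uint64_t paddr)	/* mark encrypted, i.e. private */
{
	return paddr & ~TDX_SHARED_MASK;
}
```

Setting the bit in a page table entry makes accesses through that mapping shared; clearing it makes them private.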

#VE on Shared Memory
~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE.  The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must
be careful to only reference shared pages for which it can safely handle
a #VE.  For instance, the guest should be careful not to access shared
memory in the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor.  The
guest should only use shared mappings for communicating with the
hypervisor.  Shared mappings must never be used for sensitive memory
content like kernel stacks.  A good rule of thumb is that
hypervisor-shared memory should be treated the same as memory mapped to
userspace.  Both the hypervisor and userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory.  The guest must
be careful not to access device MMIO regions unless it is also prepared
to handle a #VE.

#VE on Private Pages
~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE.  Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses.  This is not feasible,
so TDX guests ensure that all guest memory has been "accepted" before
memory is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the
firmware before the kernel runs to ensure that the kernel can start up
without being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state.  However, if it does this, page access will not generate
a #VE.  It will, instead, cause a "TD Exit" where the hypervisor is
required to handle the exception.

Linux #VE handler
-----------------

Just like page faults or #GPs, #VE exceptions can be either handled or be
fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business.  A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues.  The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked.  The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.

MMIO handling
-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access.  That is not possible in TDX guests because a VMEXIT
will expose the register state to the host.  TDX guests don't trust the
host and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses.  They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited.  It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.
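This limitation is why MMIO should go through the io.h-style accessors.  As a userspace illustration (hypothetical names, with an ordinary variable standing in for a device register), accessors like these compile down to single, simple MOV instructions of the kind the #VE handler's decoder is designed to recognize:

```c
#include <stdint.h>

/*
 * readl()/writel()-style accessors: each is one simple MOV, which
 * the kernel's #VE instruction decoder can handle.  Structure
 * overlays or memcpy() over MMIO may generate instructions the
 * decoder does not understand.
 */
static inline uint32_t my_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

static inline void my_writel(uint32_t val, volatile void *addr)
{
	*(volatile uint32_t *)addr = val;
}

static volatile uint32_t fake_mmio_reg;	/* stand-in for a device register */
```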

Shared Memory Conversions
-------------------------

All TDX guest memory starts out as private at boot.  This memory can not
be accessed by the hypervisor.  However, some kernel users like device
drivers might have a need to share data with the hypervisor.  To do this,
memory must be converted between shared and private.  This can be
accomplished using some existing memory encryption helpers:

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

Device drivers are the primary user of shared memory, but there's no need
to touch every driver.  DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations.  The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted on
allocation.  Check force_dma_unencrypted() for details.
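A sketch of the conversion pattern a driver would follow, with stubbed helpers so it stands alone (only set_memory_decrypted()/set_memory_encrypted() are real kernel interfaces; the allocation and the injected failure are stand-ins).  One detail worth noting: if a conversion fails, the page's state is unknown, so the safe response is typically to leak the buffer rather than return it to the allocator:

```c
#include <stdlib.h>

/* Stubs standing in for the kernel interfaces. */
static int decrypt_ret;	/* injected result for the sketch */
static int set_memory_decrypted(unsigned long addr, int numpages)
{
	(void)addr; (void)numpages;
	return decrypt_ret;
}
static int set_memory_encrypted(unsigned long addr, int numpages)
{
	(void)addr; (void)numpages;
	return 0;
}

/*
 * Allocate a buffer and convert it to shared.  If the conversion
 * fails, the page state is unknown, so leak it rather than free it.
 */
static void *alloc_shared_buffer(size_t size, int numpages)
{
	void *buf = malloc(size);	/* stand-in for page allocation */

	if (!buf)
		return NULL;
	if (set_memory_decrypted((unsigned long)buf, numpages))
		return NULL;		/* intentionally leaked */
	return buf;
}

static void free_shared_buffer(void *buf, int numpages)
{
	if (set_memory_encrypted((unsigned long)buf, numpages))
		return;			/* conversion back failed: leak */
	free(buf);
}
```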

Attestation
===========

Attestation is used to verify the TDX guest trustworthiness to other
entities before provisioning secrets to the guest.  For example, a key
server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build-time measurement register (MRTD)
and runtime measurement registers (RTMR).  Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register.  Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
RTMR registers.  For more details, as an example, please refer to the TDX
Virtual Firmware design specification, section titled "TD Measurement".
At TDX guest runtime, the attestation process is used to attest to these
measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

The TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
(TDREPORT_STRUCT) from the TDX module.  TDREPORT is a fixed-size data
structure generated by the TDX module which contains guest-specific
information (such as build and boot measurements), platform security
version, and the MAC to protect the integrity of the TDREPORT.  A
user-provided 64-byte REPORTDATA is used as input and included in the
TDREPORT.  Typically it is a nonce provided by the attestation service
so the TDREPORT can be verified uniquely.  More details about the
TDREPORT can be found in the Intel TDX Module specification, section
titled "TDG.MR.REPORT Leaf".

After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote.  The
TDREPORT by design can only be verified on the local platform as the MAC
key is bound to the platform.  To support remote verification of the
TDREPORT, TDX leverages the Intel SGX Quoting Enclave to verify the
TDREPORT locally and convert it to a remotely verifiable Quote.  The
method of sending the TDREPORT to the QE is implementation-specific.
Attestation software can choose whatever communication channel is
available (e.g. vsock or TCP/IP) to send the TDREPORT to the QE and
receive the Quote.
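For step one, recent kernels expose the TDREPORT to userspace through the /dev/tdx_guest character device and the TDX_CMD_GET_REPORT0 ioctl declared in <linux/tdx-guest.h>.  The sketch below re-declares the request structure so it stands alone (assuming that uAPI layout); on a non-TDX machine the open() simply fails:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors struct tdx_report_req from <linux/tdx-guest.h> (assumed). */
struct tdx_report_req {
	unsigned char reportdata[64];	/* e.g. a nonce from the verifier */
	unsigned char tdreport[1024];	/* filled in by the TDX module */
};
#define TDX_CMD_GET_REPORT0	_IOWR('T', 1, struct tdx_report_req)

/* Fetch a TDREPORT bound to the given 64-byte nonce; 0 on success. */
static int get_tdreport(const unsigned char *nonce, unsigned char *tdreport)
{
	struct tdx_report_req req;
	int fd, ret;

	memset(&req, 0, sizeof(req));
	memcpy(req.reportdata, nonce, sizeof(req.reportdata));

	fd = open("/dev/tdx_guest", O_RDWR);
	if (fd < 0)
		return -1;	/* not a TDX guest, or driver missing */

	ret = ioctl(fd, TDX_CMD_GET_REPORT0, &req);
	close(fd);
	if (ret)
		return -1;

	memcpy(tdreport, req.tdreport, sizeof(req.tdreport));
	return 0;
}
```

The resulting 1024-byte TDREPORT is then handed to the attestation software for the QE conversion step described above.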

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html