.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

TDX Host Kernel Support
=======================

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
CPU-attested software module called 'the TDX module' runs inside the new
isolated range to provide the functionalities to manage and run protected
VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within SEAM mode.
The BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.

Before the TDX module can be used to create and run protected VMs, it
must be loaded into the isolated range and properly initialized.  The TDX
architecture doesn't require the BIOS to load the TDX module, but the
kernel assumes it is loaded by the BIOS.

TDX boot-time detection
-----------------------

The kernel detects TDX by detecting TDX private KeyIDs during kernel
boot.  The following dmesg line shows that TDX is enabled by the BIOS::

  [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
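
A monitoring script could extract the KeyID range from that message.  The
sketch below is illustrative only: the helper names are hypothetical, and it
assumes the message text keeps exactly the form quoted above.

.. code-block:: python

   import re

   # Matches the message text quoted above; the "[..]" prefix varies.
   _KEYID_RE = re.compile(
       r"virt/tdx: BIOS enabled: private KeyID range: \[(\d+), (\d+)\)"
   )


   def parse_private_keyid_range(dmesg_text):
       """Return (first, end) of the half-open KeyID range, or None."""
       for line in dmesg_text.splitlines():
           match = _KEYID_RE.search(line)
           if match:
               return int(match.group(1)), int(match.group(2))
       return None


   def count_private_keyids(keyid_range):
       """Number of KeyIDs in a half-open [first, end) range."""
       first, end = keyid_range
       return end - first

For the sample line above, ``parse_private_keyid_range()`` returns
``(16, 64)``, which corresponds to 48 private KeyIDs.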

TDX module initialization
-------------------------

The kernel talks to the TDX module via the new SEAMCALL instruction.  The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.

If the TDX module isn't loaded, the SEAMCALL instruction fails with a
special error.  In this case the kernel fails the module initialization
and reports the module isn't loaded::

  [..] virt/tdx: module not loaded

Initializing the TDX module consumes roughly 1/256th of system RAM to
hold 'metadata' for the TDX memory.  It also takes additional CPU time to
initialize that metadata along with the TDX module itself.  Neither cost
is trivial, so the kernel initializes the TDX module at runtime, on
demand.
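
As a rough worked example of the memory cost, the 1/256th rule of thumb can
be turned into a one-line estimate.  This is only arithmetic on the ratio
stated above; the function name is hypothetical and the real allocation
depends on the platform's memory layout.

.. code-block:: python

   KIB_PER_GIB = 1024 * 1024


   def estimate_tdx_metadata_kib(system_ram_kib):
       """Rough metadata size: about 1/256th of system RAM, in KiB."""
       return system_ram_kib // 256

With these assumptions, a machine with 64 GiB of RAM would set aside roughly
``estimate_tdx_metadata_kib(64 * KIB_PER_GIB)``, or 262144 KiB (256 MiB).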

Besides initializing the TDX module, a per-CPU initialization SEAMCALL
must be done on each CPU before any other SEAMCALLs can be made on that
CPU.

Users can consult dmesg to see whether the TDX module has been initialized.

If the TDX module is initialized successfully, dmesg shows something
like below::

  [..] virt/tdx: 262668 KBs allocated for PAMT
  [..] virt/tdx: TDX-Module initialized

If the TDX module fails to initialize, dmesg reports the failure::

  [..] virt/tdx: TDX-Module initialization failed ...
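
The messages quoted in this section are enough to classify the module state
from a dmesg capture.  The following is a minimal sketch with a hypothetical
function name; it matches only the message fragments shown above and treats
anything else as unknown.

.. code-block:: python

   def tdx_module_state(dmesg_text):
       """Classify TDX module state from the virt/tdx messages above.

       Later messages override earlier ones, so the result reflects the
       most recent matching line in the capture.
       """
       state = "unknown"
       for line in dmesg_text.splitlines():
           if "virt/tdx:" not in line:
               continue
           if "TDX-Module initialization failed" in line:
               state = "failed"
           elif "TDX-Module initialized" in line:
               state = "initialized"
           elif "module not loaded" in line:
               state = "not loaded"
       return state

Because the kernel initializes the module on demand, an "unknown" result may
simply mean that initialization has not been requested yet.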

TDX module Runtime Update
-------------------------

Similar to microcode, the BIOS generally has a copy of the TDX module
in flash. It loads this module image into RAM at boot. However, just
like microcode, the BIOS-loaded TDX module might be out of date either
because the BIOS is old or because the system has been up for a long
time. The kernel can replace the BIOS version in RAM and load a different
TDX module. Kernel-loaded TDX modules do not affect the BIOS flash and do
not survive reboots.

The TDX module is normally the only piece of software running in SEAM
mode with which the kernel interacts. But there is a second piece of
software which is used to load or update the TDX module: a persistent
SEAM loader (P-SEAMLDR). It runs in SEAM mode separately from the TDX
module. The kernel communicates with the P-SEAMLDR to perform TDX
module runtime updates.

How to update the TDX module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Updating the TDX module is a complex process. Much of the logic and
policy is left to userspace. End users should use existing update
infrastructure provided by their distro. The Intel TDX Module Binaries
repository has a reference implementation of this logic:

   https://github.com/intel/confidential-computing.tdx.tdx-module.binaries/blob/main/version_select_and_load.py

This section will now lay out roughly what is needed to implement a
userspace-driven TDX module update. Detailed documentation on the
tdx_host ABIs is available here::

     Documentation/ABI/testing/sysfs-devices-faux-tdx-host

and is not duplicated in this document.

1. Check whether runtime update is supported at all

   Verify that the TDX module firmware upload interface is available::

     /sys/class/firmware/tdx_module

   Note that this is the generic kernel firmware update ABI. It is
   separate from the "tdx_host" device ABI itself.

2. Check whether additional updates are possible. Verify that::

     /sys/devices/faux/tdx_host/num_remaining_updates

   has a value greater than 0. If it is 0, the TDX update log might be
   full. Reboot to reset this to a nonzero value.

3. Choose a compatible TDX module image

   Choosing a compatible TDX module image is not trivial. There are both
   hard compatibility requirements and policy choices to make.

   Hard compatibility requirements:

   - The update must be compatible with the kernel.

     The update must not change any TDX ABIs in any non-backward-compatible
     way. It can introduce new features but must not require that the kernel
     use new ABIs for existing features. It must ensure that the rest of the
     system is not affected in any way. Software on the system must never
     notice any behavioral changes. Attestation results should be identical
     except for version changes.

   - The update must be compatible with the CPU.

     The set of supported CPU FMS values (family, model, stepping) is
     encoded in the module image itself. In practice, module version series
     are platform-specific. For example, the 1.5.x series runs on Sapphire
     Rapids but not Granite Rapids, which needs 2.0.x.

   - The update must be compatible with the P-SEAMLDR.

     This information is provided in a metadata file, typically
     mapping_file.json, released with the module image. Each module image
     specifies the minimum required P-SEAMLDR version, and the update is
     compatible only if the running P-SEAMLDR meets that requirement.

     The current version of the P-SEAMLDR can be read here::

       /sys/devices/faux/tdx_host/seamldr_version

   - The update must be compatible with the running TDX module.

     Like P-SEAMLDR, each module image also specifies a minimum required
     TDX module version. The running module must satisfy that requirement.

     The update software can read the current TDX module version here::

       /sys/devices/faux/tdx_host/version

   Policy choices:

   - The update software chooses how to optimize its update. For instance,
     it can optimize for fewer updates or for smaller version steps,
     for example, 1.2.3 => 1.2.5 versus 1.2.3 => 1.2.4 => 1.2.5.

4. Perform the update

   Run::

     echo 1 > /sys/class/firmware/tdx_module/loading
     cat <path_to_module_image> > /sys/class/firmware/tdx_module/data
     echo 0 > /sys/class/firmware/tdx_module/loading

   The files /sys/class/firmware/tdx_module/status and
   /sys/class/firmware/tdx_module/error report update progress and error
   information.

   After the update completes, the new module version is visible in
   /sys/devices/faux/tdx_host/version.
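
The sysfs checks and the write sequence from the steps above can be sketched
in Python, the language of the reference implementation.  This is not that
implementation.  The function names are hypothetical, the dotted numeric
version format is an assumption (consult the ABI documentation for the actual
format), and the minimum versions are expected to come from the image
metadata, whose parsing is left to the caller.  Kernel and CPU compatibility
are not checked here.

.. code-block:: python

   from pathlib import Path

   # Paths quoted in the steps above.
   FIRMWARE_DIR = Path("/sys/class/firmware/tdx_module")
   TDX_HOST_DIR = Path("/sys/devices/faux/tdx_host")


   def parse_version(text):
       """Parse a dotted version such as "1.5.6" (format is assumed)."""
       return tuple(int(part) for part in text.strip().split("."))


   def update_blockers(min_seamldr, min_module,
                       firmware_dir=FIRMWARE_DIR, host_dir=TDX_HOST_DIR):
       """Return a list of reasons why an update cannot proceed.

       min_seamldr and min_module are version tuples that the caller
       obtained from the module image's metadata.
       """
       firmware_dir, host_dir = Path(firmware_dir), Path(host_dir)
       # Step 1: the firmware upload interface must exist.
       if not firmware_dir.is_dir():
           return ["firmware upload interface not available"]
       blockers = []
       # Step 2: at least one more update must be allowed.
       remaining = int((host_dir / "num_remaining_updates").read_text())
       if remaining <= 0:
           blockers.append("no remaining updates; reboot to reset")
       # Step 3: minimum P-SEAMLDR and TDX module versions.
       seamldr = parse_version((host_dir / "seamldr_version").read_text())
       if seamldr < min_seamldr:
           blockers.append("P-SEAMLDR older than the image requires")
       module = parse_version((host_dir / "version").read_text())
       if module < min_module:
           blockers.append("running TDX module older than the image requires")
       return blockers


   def load_module_image(image_path, firmware_dir=FIRMWARE_DIR):
       """Step 4: mirror the echo/cat/echo sequence shown above."""
       firmware_dir = Path(firmware_dir)
       (firmware_dir / "loading").write_text("1\n")
       (firmware_dir / "data").write_bytes(Path(image_path).read_bytes())
       (firmware_dir / "loading").write_text("0\n")

A caller would run ``load_module_image()`` only when ``update_blockers()``
returns an empty list, and would then consult the ``status`` and ``error``
files and re-read ``version`` to confirm the result.  Both directories are
parameters so that the logic can be exercised against a scratch directory
instead of the real sysfs tree.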

Impact on running TDs
~~~~~~~~~~~~~~~~~~~~~

TDX module runtime updates must have virtually no visible impact on running
TDs. Any TD-visible impact is a TDX module bug.

The main exception is the TEE_TCB_SVN_2 field in TD quotes, which
reflects the TCB of the currently running TDX module and therefore
changes after an update. By contrast, TEE_TCB_SVN reflects the TCB at TD
launch time and is not affected.

TDX Interaction with Other Kernel Components
--------------------------------------------

TDX Memory Policy
~~~~~~~~~~~~~~~~~

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass those
regions to the TDX module.  Once this is done, those "TDX-usable" memory
regions are fixed for the module's lifetime.

To keep things simple, currently the kernel simply guarantees all pages
in the page allocator are TDX memory.  Specifically, the kernel uses all
system memory in the core-mm "at the time of TDX module initialization"
as TDX memory, and from then on refuses to online any non-TDX memory
through memory hotplug.
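
The policy can be pictured as a snapshot of memory ranges taken at module
initialization, against which later hotplug requests are checked.  The model
below is a simplified sketch, not the kernel's implementation: the class and
method names are hypothetical, and ranges are half-open ``(start, end)``
physical address pairs.

.. code-block:: python

   class TdxMemoryPolicy:
       """Toy model of the 'all page allocator memory is TDX memory' rule."""

       def __init__(self, memory_at_init):
           # Snapshot taken at TDX module initialization; fixed afterwards.
           self._regions = tuple(sorted(memory_at_init))

       def is_tdx_memory(self, start, end):
           """True if [start, end) lies entirely inside one TDX region."""
           return any(r_start <= start and end <= r_end
                      for r_start, r_end in self._regions)

       def may_online(self, start, end):
           """Memory hotplug may online only TDX memory."""
           return self.is_tdx_memory(start, end)

In this model, a range that was present at initialization and is later taken
offline can come back online, while memory that was never part of the
snapshot is refused.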

Physical Memory Hotplug
~~~~~~~~~~~~~~~~~~~~~~~

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support hot-removal
of any convertible memory.  This implementation doesn't handle ACPI memory
removal but depends on the BIOS to behave correctly.

CPU Hotplug
~~~~~~~~~~~

The TDX module requires that the per-CPU initialization SEAMCALL be done
on a CPU before any other SEAMCALLs can be made on that CPU.  The kernel,
via the CPU hotplug framework, performs the necessary initialization when
a CPU is first brought online.

TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
TDX verifies all boot-time present logical CPUs are TDX compatible before
enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
physical CPUs.  Currently the kernel doesn't handle physical CPU hotplug,
but depends on the BIOS to behave correctly.

Note TDX works with CPU logical online/offline, thus the kernel still
allows a logical CPU to be taken offline and brought online again.

Erratum
~~~~~~~

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.

A partial write is a memory write where a write transaction of less than
a cacheline lands at the memory controller.  The CPU does these via
non-temporal write instructions (like MOVNTI), or through UC/WC memory
mappings.  Devices can also do partial writes via DMA.

Theoretically, a kernel bug could do a partial write to TDX private memory
and trigger an unexpected machine check.  What's more, the machine check
code will present these as "Hardware error" when they were, in fact, a
software-triggered issue.  But in the end, this issue is hard to trigger.

If the platform has this erratum, the kernel prints an additional message
in the machine check handler to tell the user that the machine check may
be caused by a kernel bug on TDX private memory.

Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDX cannot survive S3 and deeper states.  The hardware resets and
disables TDX completely when the platform goes to S3 and deeper.  Both TDX
guests and the TDX module get destroyed permanently.

The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
hibernation.  Currently, for simplicity, the kernel chooses to make TDX
mutually exclusive with S3 and hibernation.

The kernel disables TDX during early boot when hibernation support is
available::

  [..] virt/tdx: initialization failed: Hibernation support is enabled

Add 'nohibernate' to the kernel command line to disable hibernation in
order to use TDX.

ACPI S3 is disabled during kernel early boot if TDX is enabled.  The user
needs to turn off TDX in the BIOS in order to use S3.
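
A setup script could detect the hibernation conflict described above and
suggest the fix.  The helpers below are a hypothetical sketch: they look only
for the 'nohibernate' token and for the exact message quoted above.

.. code-block:: python

   HIBERNATION_MSG = (
       "virt/tdx: initialization failed: Hibernation support is enabled"
   )


   def has_nohibernate(cmdline):
       """True if 'nohibernate' is its own kernel command line token."""
       return "nohibernate" in cmdline.split()


   def tdx_blocked_by_hibernation(dmesg_text):
       """True if dmesg contains the hibernation failure message."""
       return any(HIBERNATION_MSG in line
                  for line in dmesg_text.splitlines())


   def needs_nohibernate(cmdline, dmesg_text):
       """True if TDX was disabled and 'nohibernate' is not yet set."""
       return (tdx_blocked_by_hibernation(dmesg_text)
               and not has_nohibernate(cmdline))

The command line of the running kernel can be read from ``/proc/cmdline``
and passed to these helpers together with a dmesg capture.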

TDX Guest Support
=================

Since the host cannot directly access guest registers or memory, much of
the normal functionality of a hypervisor must be moved into the guest. This
is implemented using a Virtualization Exception (#VE) that is handled by
the guest kernel. Some #VEs are handled entirely inside the guest kernel,
but others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions.  The
details for these instructions are discussed below.

Instruction-based #VE
~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests.  Their use likely
indicates a bug in the guest.  The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor.  Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling.  They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module.  Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

CPUID Behavior
~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0.  For these bit
  fields, the hypervisor cannot turn *on* values; it can only mask off the
  native values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
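
The difference between the two virtualization types is easiest to see as bit
arithmetic on a single 32-bit register.  The function below is a toy model
under stated assumptions: its name and mask parameters are hypothetical, and
bits that belong to neither type are simply reported as native, which
simplifies what the TDX module actually does.

.. code-block:: python

   def virtualize_cpuid_register(native, configured,
                                 controlled_mask, maskable_mask):
       """Compose the 32-bit register value a guest TD would see.

       controlled_mask: bits whose value the hypervisor chooses outright.
       maskable_mask:   bits that are either native or forced to 0.
       configured:      the hypervisor's requested value for those bits.
       """
       controlled = configured & controlled_mask
       # A maskable bit is visible only if it is set natively AND the
       # hypervisor left it enabled; it can never be turned on.
       maskable = native & configured & maskable_mask
       untouched = native & ~(controlled_mask | maskable_mask) & 0xFFFFFFFF
       return controlled | maskable | untouched

For instance, with a maskable bit that is 0 natively, requesting 1 for it in
``configured`` still leaves it 0 in the result, whereas a controlled bit
takes whatever value the hypervisor requested.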

#VE on Memory Accesses
----------------------

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections.  Its content is protected
against access from the hypervisor.  Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared.  It selects the behavior with a bit in its page table
entries.  This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.

#VE on Shared Memory
~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE.  The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages for which it can safely handle a
#VE.  For instance, the guest should be careful not to access shared memory
in the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks.  A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace.  Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory.  The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE.  Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses.  This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
-----------------

Just like page faults or #GPs, #VE exceptions can either be handled or be
fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business.  A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues.  The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked.  The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
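
The blocking rule can be summarized as a two-state machine.  The model below
is a toy illustration of the behavior described above, not guest kernel code;
the class and method names are hypothetical.

.. code-block:: python

   class DoubleFault(Exception):
       """Stands in for the unrecoverable #DF described above."""


   class VeDeliveryModel:
       """Toy model of #VE delivery as seen by one guest vCPU."""

       def __init__(self):
           # True between #VE delivery and the VEINFO.GET TDCALL.
           self.blocked = False

       def deliver_ve(self):
           """A #VE arrives; a second one while blocked becomes #DF."""
           if self.blocked:
               raise DoubleFault("#VE raised before TDG.VP.VEINFO.GET")
           self.blocked = True

       def veinfo_get(self):
           """Model the TDG.VP.VEINFO.GET TDCALL, which lifts the block."""
           self.blocked = False

       def interrupts_deliverable(self):
           """Interrupts (including NMIs) are held while blocked."""
           return not self.blocked

In this model a handler that calls ``veinfo_get()`` before doing anything
that might raise another #VE never reaches the ``DoubleFault`` path.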

MMIO handling
-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access.  That is not possible in TDX guests because VMEXIT
will expose the register state to the host. TDX guests don't trust the host
and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.

Shared Memory Conversions
-------------------------

All TDX guest memory starts out as private at boot.  This memory cannot
be accessed by the hypervisor.  However, some kernel users like device
drivers might have a need to share data with the hypervisor.  To do this,
memory must be converted between shared and private.  This can be
accomplished using some existing memory encryption helpers:

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

Device drivers are the primary user of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted on the
allocation. Check force_dma_unencrypted() for details.

Attestation
===========

Attestation is used to verify the trustworthiness of the TDX guest to other
entities before provisioning secrets to the guest. For example, a key
server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build time measurement register (MRTD)
and runtime measurement registers (RTMR). Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register. Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
RTMR registers. For more details, as an example, please refer to the TDX
Virtual Firmware design specification, section titled "TD Measurement".
At TDX guest runtime, the attestation process is used to attest to these
measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

The TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
(TDREPORT_STRUCT) from the TDX module. TDREPORT is a fixed-size data
structure generated by the TDX module which contains guest-specific
information (such as build and boot measurements), platform security
version, and the MAC to protect the integrity of the TDREPORT. A
user-provided 64-byte REPORTDATA is used as input and included in the
TDREPORT. Typically it can be some nonce provided by the attestation
service so the TDREPORT can be verified uniquely. More details about the
TDREPORT can be found in the Intel TDX Module specification, section
titled "TDG.MR.REPORT Leaf".
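
Because REPORTDATA is exactly 64 bytes, attestation software needs some way
to fit a nonce of arbitrary length into that field.  The sketch below hashes
the nonce with SHA-512, whose digest happens to be 64 bytes long.  This
convention is an assumption chosen for illustration and is not mandated by
TDX; the function name is hypothetical.

.. code-block:: python

   import hashlib

   REPORTDATA_LEN = 64  # REPORTDATA size stated above


   def make_reportdata(nonce):
       """Derive a 64-byte REPORTDATA value from a verifier's nonce.

       SHA-512 maps input of any length to exactly 64 bytes, so the
       result always fits the field without padding or truncation.
       """
       digest = hashlib.sha512(nonce).digest()
       assert len(digest) == REPORTDATA_LEN
       return digest

The verifier, which knows the nonce it issued, can recompute the same value
and compare it with the REPORTDATA carried in the report.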

After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT
by design can only be verified on the local platform as the MAC key is
bound to the platform. To support remote verification of the TDREPORT,
TDX leverages Intel SGX Quoting Enclave to verify the TDREPORT locally
and convert it to a remotely verifiable Quote. The method of sending the
TDREPORT to the QE is implementation specific. Attestation software can
choose whatever communication channel is available (e.g. vsock or TCP/IP)
to send the TDREPORT to the QE and receive the Quote.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html