.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

TDX Host Kernel Support
=======================

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
CPU-attested software module called 'the TDX module' runs inside the new
isolated range to provide the functionalities to manage and run protected
VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.
The BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX
KeyIDs.

Before the TDX module can be used to create and run protected VMs, it
must be loaded into the isolated range and properly initialized.  The TDX
architecture doesn't require the BIOS to load the TDX module, but the
kernel assumes it is loaded by the BIOS.

TDX boot-time detection
-----------------------

The kernel detects TDX by detecting TDX private KeyIDs during kernel
boot.  The following dmesg line is printed when TDX is enabled by the
BIOS::

  [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
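
The range in that message is half-open, so ``[16, 64)`` covers KeyIDs 16
through 63.  As a minimal userspace sketch (the struct and function names
are hypothetical and are not kernel interfaces), the range can be parsed
and counted like this:

.. code-block:: c

  #include <assert.h>
  #include <stdio.h>

  /* Half-open KeyID range [start, end), as printed in the dmesg line. */
  struct keyid_range {
      unsigned int start;  /* first private KeyID */
      unsigned int end;    /* one past the last private KeyID */
  };

  /*
   * Parse "[start, end)" from a log fragment.
   * Returns 0 on success, -1 if the fragment does not match or is empty.
   */
  static int parse_keyid_range(const char *s, struct keyid_range *r)
  {
      if (sscanf(s, "[%u, %u)", &r->start, &r->end) != 2)
          return -1;
      return r->end > r->start ? 0 : -1;
  }

  /* Number of private KeyIDs in the range. */
  static unsigned int nr_private_keyids(const struct keyid_range *r)
  {
      assert(r->end > r->start);
      return r->end - r->start;
  }

For the sample line above this yields 48 private KeyIDs.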

TDX module initialization
-------------------------

The kernel talks to the TDX module via the new SEAMCALL instruction.  The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.

If the TDX module isn't loaded, the SEAMCALL instruction fails with a
special error.  In this case the kernel fails the module initialization
and reports the module isn't loaded::

  [..] virt/tdx: module not loaded

Initializing the TDX module consumes roughly 1/256th of system RAM, which
is used as 'metadata' for the TDX memory.  It also takes additional CPU
time to initialize that metadata along with the TDX module itself.
Neither cost is trivial, so the kernel initializes the TDX module at
runtime, on demand.
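
To get a feel for the memory cost, the 1/256 ratio can be applied
directly.  The helper below is only an illustrative estimate with a
hypothetical name; the real size depends on the TDX module and on how the
kernel lays out the metadata:

.. code-block:: c

  #include <stdint.h>

  /* Rough metadata estimate: about 1/256th of the RAM given to TDX. */
  static uint64_t pamt_estimate_kb(uint64_t ram_kb)
  {
      return ram_kb / 256;
  }

For 64 GiB of RAM (67108864 KB) the estimate is 262144 KB, that is,
256 MiB of memory that is no longer available for general use.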

Besides initializing the TDX module, a per-cpu initialization SEAMCALL
must be done on a cpu before any other SEAMCALLs can be made on that
cpu.

Users can consult dmesg to see whether the TDX module has been
initialized.

If the TDX module is initialized successfully, dmesg shows something
like the following::

  [..] virt/tdx: 262668 KBs allocated for PAMT
  [..] virt/tdx: TDX-Module initialized

If the TDX module fails to initialize, dmesg reports the failure::

  [..] virt/tdx: TDX-Module initialization failed ...

TDX Interaction to Other Kernel Components
------------------------------------------

TDX Memory Policy
~~~~~~~~~~~~~~~~~

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass those
regions to the TDX module.  Once this is done, those "TDX-usable" memory
regions are fixed for the module's lifetime.

To keep things simple, currently the kernel simply guarantees all pages
in the page allocator are TDX memory.  Specifically, the kernel uses all
system memory in the core-mm "at the time of TDX module initialization"
as TDX memory, and from then on refuses to online any non-TDX memory via
memory hotplug.
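
The hotplug policy amounts to a containment check against the region list
that was fixed at module initialization.  The following userspace sketch
models that idea; the struct, the function name and the "fully inside one
region" rule are assumptions made for illustration, not a copy of the
kernel code:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* One "TDX-usable" region as a half-open page frame range. */
  struct tdx_region {
      uint64_t start_pfn;
      uint64_t end_pfn;  /* exclusive */
  };

  /*
   * Return true if [start_pfn, end_pfn) lies entirely inside one region
   * of the list.  Memory that fails this check would not be onlined.
   */
  static bool range_is_tdx_memory(const struct tdx_region *regions,
                                  size_t nr, uint64_t start_pfn,
                                  uint64_t end_pfn)
  {
      assert(start_pfn < end_pfn);

      for (size_t i = 0; i < nr; i++) {
          if (start_pfn >= regions[i].start_pfn &&
              end_pfn <= regions[i].end_pfn)
              return true;
      }
      return false;
  }

A range that straddles the gap between two regions, or that lies outside
every region, is rejected in this model.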

Physical Memory Hotplug
~~~~~~~~~~~~~~~~~~~~~~~

Note that TDX assumes convertible memory is always physically present
during the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't handle
ACPI memory removal but depends on the BIOS to behave correctly.

CPU Hotplug
~~~~~~~~~~~

The TDX module requires that the per-cpu initialization SEAMCALL be done
on a cpu before any other SEAMCALLs can be made on that cpu.  The kernel,
via the CPU hotplug framework, performs the necessary initialization when
a CPU is first brought online.

TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
TDX verifies all boot-time present logical CPUs are TDX compatible before
enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
physical CPUs.  Currently the kernel doesn't handle physical CPU hotplug,
but depends on the BIOS to behave correctly.

Note that TDX works with logical CPU online/offline, so the kernel still
allows a logical CPU to be taken offline and brought online again.

Erratum
~~~~~~~

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.

A partial write is a memory write where a write transaction of less than
a cacheline lands at the memory controller.  The CPU does these via
non-temporal write instructions (like MOVNTI), or through UC/WC memory
mappings.  Devices can also do partial writes via DMA.

Theoretically, a kernel bug could do a partial write to TDX private
memory and trigger an unexpected machine check.  What's more, the machine
check code will present these as "Hardware error" when they were, in
fact, a software-triggered issue.  But in the end, this issue is hard to
trigger.

If the platform has this erratum, the kernel prints an additional message
in the machine check handler to tell the user that the machine check may
be caused by a kernel bug on TDX private memory.

Kexec
~~~~~

Currently kexec doesn't work on TDX platforms with the aforementioned
erratum.  It fails when loading the kexec kernel image.  Otherwise it
works normally.

Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDX cannot survive S3 and deeper states.  The hardware resets and
disables TDX completely when the platform goes to S3 and deeper.  Both
TDX guests and the TDX module get destroyed permanently.

The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
hibernation.  Currently, for simplicity, the kernel chooses to make TDX
mutually exclusive with S3 and hibernation.

The kernel disables TDX during early boot when hibernation support is
available::

  [..] virt/tdx: initialization failed: Hibernation support is enabled

Add 'nohibernate' to the kernel command line to disable hibernation in
order to use TDX.

ACPI S3 is disabled during early kernel boot if TDX is enabled.  The user
needs to turn off TDX in the BIOS in order to use S3.

TDX Guest Support
=================

Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
guest kernel. Some #VEs are handled entirely inside the guest kernel, but
some require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions.  The
details for these instructions are discussed below.

Instruction-based #VE
~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*,WRMSR*
- CPUID*

Instruction-based #GP
~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*,WRMSR*

RDMSR/WRMSR Behavior
~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests.  Their use likely
indicates a bug in the guest.  The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs are typically able to be handled by the hypervisor.  Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling.  They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module.  Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

CPUID Behavior
~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0.  For these bit
  fields, the hypervisor can mask off the native values, but it cannot
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
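
The two configurable virtualization types can be modeled with two bit
masks per output register.  This is a simplified sketch: the struct layout
and names are hypothetical, and bits outside both masks are simply passed
through from the native value, which is an assumption of the model rather
than a statement about the TDX module:

.. code-block:: c

  #include <assert.h>
  #include <stdint.h>

  /* How one CPUID output register is virtualized (hypothetical). */
  struct cpuid_reg_policy {
      uint32_t hv_value_mask;        /* value chosen by the hypervisor */
      uint32_t native_or_zero_mask;  /* native value or forced to 0 */
  };

  static uint32_t guest_cpuid_reg(const struct cpuid_reg_policy *p,
                                  uint32_t native, uint32_t hv_config)
  {
      /* A bit belongs to at most one of the two classes. */
      assert((p->hv_value_mask & p->native_or_zero_mask) == 0);

      uint32_t other = native &
                       ~(p->hv_value_mask | p->native_or_zero_mask);
      uint32_t chosen = hv_config & p->hv_value_mask;
      /* AND with native: a bit can be cleared but never turned on. */
      uint32_t gated = native & hv_config & p->native_or_zero_mask;

      return other | chosen | gated;
  }

With ``native_or_zero_mask = 0xF0``, a native value of ``0xA5`` and a
hypervisor configuration of ``0xFC``, the guest sees ``0xA`` in the upper
nibble: the configuration requested all four bits, but only the natively
set ones survive.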

#VE on Memory Accesses
----------------------

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections.  Its content is protected
against access from the hypervisor.  Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared.  It selects the behavior with a bit in its page table
entries.  This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
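
Selecting private or shared with a page table bit can be sketched as
plain bit manipulation.  The helpers below are hypothetical and take the
bit position as a parameter, because this document does not specify it;
the sketch also assumes that a set bit means "shared":

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Mask for the shared bit; the position is supplied by the caller. */
  static uint64_t shared_mask(unsigned int shared_bit)
  {
      assert(shared_bit < 64);
      return UINT64_C(1) << shared_bit;
  }

  /* Mark a page table entry as shared with the hypervisor. */
  static uint64_t pte_mk_shared(uint64_t pte, unsigned int shared_bit)
  {
      return pte | shared_mask(shared_bit);
  }

  /* Mark a page table entry as private to the guest. */
  static uint64_t pte_mk_private(uint64_t pte, unsigned int shared_bit)
  {
      return pte & ~shared_mask(shared_bit);
  }

  static bool pte_is_shared(uint64_t pte, unsigned int shared_bit)
  {
      return (pte & shared_mask(shared_bit)) != 0;
  }

Because the guest owns its page tables, the decision is made by the guest
alone; in this model an entry stays private unless the guest sets the bit.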

#VE on Shared Memory
~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE.  The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages where it can safely handle a #VE.
For instance, the guest should be careful not to access shared memory in
the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks.  A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace.  Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory.  The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE.  Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses.  This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
-----------------

Just like page faults or #GPs, #VE exceptions can either be handled or be
fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business.  A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues.  The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked.  The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
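
The blocking rule behaves like a one-bit state machine.  The toy model
below uses hypothetical names and runs in userspace; it only mirrors the
rules stated above and makes no claim about how the TDX module tracks
this state internally:

.. code-block:: c

  #include <stdbool.h>

  enum ve_outcome { VE_DELIVERED, VE_DOUBLE_FAULT };

  /* Toy per-vCPU state for #VE delivery. */
  struct ve_state {
      bool block_active;  /* set from #VE delivery until VEINFO.GET */
  };

  /* A #VE-triggering action occurs. */
  static enum ve_outcome raise_ve(struct ve_state *s)
  {
      if (s->block_active)
          return VE_DOUBLE_FAULT;  /* nested #VE is elevated to #DF */
      s->block_active = true;      /* interrupts and NMIs now blocked */
      return VE_DELIVERED;
  }

  /* Models the guest issuing the TDG.VP.VEINFO.GET TDCALL. */
  static void veinfo_get(struct ve_state *s)
  {
      s->block_active = false;
  }

  static bool interrupts_deliverable(const struct ve_state *s)
  {
      return !s->block_active;
  }

In this model a handler that touches anything #VE-prone before calling
``veinfo_get()`` ends in ``VE_DOUBLE_FAULT``, which is why reading the
#VE info is the first thing a handler should do.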

MMIO handling
-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access.  That is not possible in TDX guests because VMEXIT
will expose the register state to the host. TDX guests don't trust the host
and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.
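
The point of emulating in the guest is that only a small, explicit
request reaches the host.  The sketch below shows what such a request
might carry for a decoded write; the struct, the function and the set of
supported widths are hypothetical and do not describe the kernel's
decoder or the TDCALL ABI:

.. code-block:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* The only facts about an MMIO access that the guest reveals. */
  struct mmio_request {
      uint64_t gpa;    /* guest physical address of the access */
      uint64_t value;  /* data for writes */
      uint8_t size;    /* access width in bytes: 1, 2, 4 or 8 */
      bool is_write;
  };

  /*
   * Build a request for a decoded MMIO write.  Returns false for widths
   * this sketch does not support; a real handler would then treat the
   * #VE as unhandled.
   */
  static bool mmio_write_request(uint64_t gpa, uint64_t reg_value,
                                 uint8_t size, struct mmio_request *req)
  {
      if (size != 1 && size != 2 && size != 4 && size != 8)
          return false;

      req->gpa = gpa;
      req->size = size;
      req->is_write = true;
      /* Truncate to the access width so no extra register bits leak. */
      req->value = size == 8 ? reg_value
                 : reg_value & ((UINT64_C(1) << (size * 8)) - 1);
      return true;
  }

A 2-byte write from a register holding ``0x1122334455667788`` produces a
request value of ``0x7788``; the remaining register contents never leave
the guest.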

Shared Memory Conversions
-------------------------

All TDX guest memory starts out as private at boot.  This memory cannot
be accessed by the hypervisor.  However, some kernel users like device
drivers might have a need to share data with the hypervisor.  To do this,
memory must be converted between shared and private.  This can be
accomplished using some existing memory encryption helpers:

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

Device drivers are the primary users of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted at allocation
time. Check force_dma_unencrypted() for details.

Attestation
===========

Attestation is used to verify the TDX guest trustworthiness to other
entities before provisioning secrets to the guest. For example, a key
server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build time measurement register (MRTD)
and runtime measurement registers (RTMR). Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register. Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
RTMR registers. For more details, as an example, please refer to the TDX
Virtual Firmware design specification, section titled "TD Measurement".
At TDX guest runtime, the attestation process is used to attest to these
measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

A TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
(TDREPORT_STRUCT) from the TDX module. TDREPORT is a fixed-size data
structure generated by the TDX module which contains guest-specific
information (such as build and boot measurements), platform security
version, and the MAC to protect the integrity of the TDREPORT. A
user-provided 64-byte REPORTDATA is used as input and included in the
TDREPORT. Typically it can be some nonce provided by the attestation
service so the TDREPORT can be verified uniquely. More details about the
TDREPORT can be found in the Intel TDX Module specification, section
titled "TDG.MR.REPORT Leaf".
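
Preparing the REPORTDATA input is a matter of filling a 64-byte buffer.
The helper below is a minimal sketch with a hypothetical name; placing
the nonce at the start and zero-padding the rest is an assumption, since
the layout inside REPORTDATA is up to the attestation software:

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  #define REPORTDATA_LEN 64  /* size stated in the text above */

  /*
   * Place a nonce at the start of REPORTDATA and zero the remainder.
   * Returns 0 on success, -1 if the nonce does not fit.
   */
  static int fill_reportdata(uint8_t reportdata[REPORTDATA_LEN],
                             const uint8_t *nonce, size_t nonce_len)
  {
      if (nonce_len > REPORTDATA_LEN)
          return -1;

      memset(reportdata, 0, REPORTDATA_LEN);
      memcpy(reportdata, nonce, nonce_len);
      return 0;
  }

A verifier that later finds the same nonce in the report can conclude that
the report was generated in response to its own request.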

After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT
by design can only be verified on the local platform as the MAC key is
bound to the platform. To support remote verification of the TDREPORT,
TDX leverages the Intel SGX Quoting Enclave to verify the TDREPORT locally
and convert it to a remotely verifiable Quote. The method of sending the
TDREPORT to the QE is implementation specific. Attestation software can
choose whatever communication channel is available (e.g. vsock or TCP/IP)
to send the TDREPORT to the QE and receive the Quote.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html