.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

Since the host cannot directly access guest registers or memory, much of
the normal functionality of a hypervisor must be moved into the guest.
This is implemented using a Virtualization Exception (#VE) that is handled
by the guest kernel. Some #VE exceptions are handled entirely inside the
guest kernel, but others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
==================

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.

Instruction-based #VE
---------------------

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
---------------------

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
--------------------

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor. Guests can make
a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling. They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

CPUID Behavior
--------------

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0. For these bit
  fields, the hypervisor can mask off the native values, but it cannot
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
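To illustrate the pattern, here is a minimal sketch of forwarding such a
CPUID leaf to the hypervisor from a #VE handler. The tdvmcall() helper and
struct tdvmcall_args are hypothetical placeholders for the guest's
hypercall plumbing, not the kernel's actual interface::

  /*
   * Sketch: ask the untrusted hypervisor for CPUID values that the
   * TDX module declined to virtualize.  All names are illustrative.
   */
  struct tdvmcall_args {
          u64 fn;         /* hypercall function, e.g. EXIT_REASON_CPUID */
          u64 in[2];      /* inputs:  CPUID leaf (EAX), sub-leaf (ECX)  */
          u64 out[4];     /* outputs: EAX, EBX, ECX, EDX                */
  };

  static int ve_handle_cpuid(struct pt_regs *regs)
  {
          struct tdvmcall_args args = {
                  .fn    = EXIT_REASON_CPUID,
                  .in[0] = regs->ax,
                  .in[1] = regs->cx,
          };

          if (tdvmcall(&args))    /* hypothetical TDVMCALL wrapper */
                  return -EIO;

          /* The results come from the hypervisor and are untrusted. */
          regs->ax = args.out[0];
          regs->bx = args.out[1];
          regs->cx = args.out[2];
          regs->dx = args.out[3];
          return 0;
  }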
#VE on Memory Accesses
======================

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
against access from the hypervisor. Shared memory is expected to be
shared between the guest and the hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.

#VE on Shared Memory
--------------------

Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages for which it can safely handle a
#VE. For instance, the guest should be careful not to access shared memory
in the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks. A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace. Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory. The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
--------------------

An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, a page access will not generate
a #VE. It will, instead, cause a "TD Exit" where the hypervisor is
required to handle the exception.

Linux #VE handler
=================

Just like page faults or #GPs, #VE exceptions can either be handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. An
unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business. A #VE
could be interrupted by an NMI which triggers another #VE, and hilarity
ensues. The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked. The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place:
any #VE that occurs during the block is elevated to a double fault (#DF),
which is not recoverable.
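To make that ordering concrete, here is a simplified sketch of a #VE
handler. The ve_info layout follows the TDG.VP.VEINFO.GET output, but the
helper names are illustrative assumptions rather than the kernel's exact
code::

  struct ve_info {
          u64 exit_reason;
          u64 exit_qual;
          u64 gla;                /* guest linear address   */
          u64 gpa;                /* guest physical address */
          u32 instr_len;
          u32 instr_info;
  };

  static void handle_ve(struct pt_regs *regs)
  {
          struct ve_info ve;

          /*
           * Retrieve the #VE info first.  Until this TDCALL is made,
           * the TDX module keeps NMIs blocked and a nested #VE would
           * be elevated to an unrecoverable #DF, so nothing before
           * this point may touch #VE-triggering memory.
           */
          tdcall_get_veinfo(&ve);         /* TDG.VP.VEINFO.GET wrapper */

          switch (ve.exit_reason) {
          case EXIT_REASON_CPUID:
          case EXIT_REASON_MSR_READ:
          case EXIT_REASON_MSR_WRITE:
                  /* Emulate or forward to the hypervisor. */
                  break;
          case EXIT_REASON_EPT_VIOLATION:
                  /* MMIO to a shared mapping: decode and emulate. */
                  break;
          default:
                  die("unexpected #VE", regs, ve.exit_reason);
          }

          /* On success, skip past the faulting instruction. */
          regs->ip += ve.instr_len;
  }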
MMIO handling
=============

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access. That is not possible in TDX guests because a VMEXIT
would expose the register state to the host. TDX guests don't trust the
host and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest. The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel's instruction decoding is limited: it is only designed
to decode instructions like those generated by the io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.

Shared Memory Conversions
=========================

All TDX guest memory starts out as private at boot. This memory cannot
be accessed by the hypervisor. However, some kernel users like device
drivers might have a need to share data with the hypervisor. To do this,
memory must be converted between shared and private. This can be
accomplished using some existing memory encryption helpers:

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

Device drivers are the primary user of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted at allocation
time. See force_dma_unencrypted() for details.

Attestation
===========

Attestation is used to verify the TDX guest's trustworthiness to other
entities before provisioning secrets to the guest. For example, a key
server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build-time measurement register (MRTD)
and runtime measurement registers (RTMR). Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register. Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
RTMR registers. For more details, see, for example, the TDX Virtual
Firmware design specification, section titled "TD Measurement". At TDX
guest runtime, the attestation process is used to attest to these
measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

The TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
(TDREPORT_STRUCT) from the TDX module. TDREPORT is a fixed-size data
structure generated by the TDX module which contains guest-specific
information (such as build and boot measurements), platform security
version, and the MAC to protect the integrity of the TDREPORT. A
user-provided 64-byte REPORTDATA is used as input and included in the
TDREPORT. Typically, it is a nonce provided by the attestation service so
that the TDREPORT can be uniquely verified. More details about the
TDREPORT can be found in the Intel TDX Module specification, section
titled "TDG.MR.REPORT Leaf".
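As a userspace illustration, the sketch below fetches a TDREPORT through
the TDX guest driver's ioctl interface, assuming the /dev/tdx_guest device
and the TDX_CMD_GET_REPORT0 ioctl exported in <linux/tdx-guest.h>::

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/tdx-guest.h>

  /*
   * Fetch a TDREPORT from the TDX module via the TDX guest driver,
   * passing a caller-chosen 64-byte REPORTDATA (typically a nonce
   * supplied by the attestation service).
   */
  static int get_tdreport(const unsigned char nonce[TDX_REPORTDATA_LEN],
                          unsigned char tdreport[TDX_REPORT_LEN])
  {
          struct tdx_report_req req;
          int fd, ret;

          memset(&req, 0, sizeof(req));
          memcpy(req.reportdata, nonce, sizeof(req.reportdata));

          fd = open("/dev/tdx_guest", O_RDWR);
          if (fd < 0)
                  return -1;

          ret = ioctl(fd, TDX_CMD_GET_REPORT0, &req);
          close(fd);
          if (ret)
                  return -1;

          /* req.tdreport now holds the MAC-protected TDREPORT. */
          memcpy(tdreport, req.tdreport, TDX_REPORT_LEN);
          return 0;
  }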
After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote. By
design, a TDREPORT can only be verified on the local platform, as the MAC
key is bound to the platform. To support remote verification of the
TDREPORT, TDX leverages the Intel SGX Quoting Enclave to verify the
TDREPORT locally and convert it to a remotely verifiable Quote. The
method of sending the TDREPORT to the QE is implementation specific.
Attestation software can choose whatever communication channel is
available (e.g., vsock or TCP/IP) to send the TDREPORT to the QE and
receive the Quote.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html