| e414b100 | 09-Sep-2025 |
Kai Huang <kai.huang@intel.com> |
x86/virt/tdx: Use precalculated TDVPR page physical address
All of the x86 KVM guest types (VMX, SEV and TDX) do some special context tracking when entering guests. This means that the actual guest
x86/virt/tdx: Use precalculated TDVPR page physical address
All of the x86 KVM guest types (VMX, SEV and TDX) do some special context tracking when entering guests. This means that the actual guest entry sequence must be noinstr.
Part of entering a TDX guest is passing a physical address to the TDX module. Right now, that physical address is stored as a 'struct page' and converted to a physical address at guest entry. That page=>phys conversion can be complicated, can vary greatly based on kernel config, and it is definitely _not_ a noinstr path today.
There have been a number of tinkering approaches to try and fix this up, but they all fall down due to some part of the page=>phys conversion infrastructure not being noinstr friendly.
Precalculate the page=>phys conversion and store it in the existing 'tdx_vp' structure. Use the new field at every site that needs a tdvpr physical address. Remove the now redundant tdx_tdvpr_pa(). Remove the __flatten remnant from the tinkering.
Note that only one user of the new field is actually noinstr. All others can use page_to_phys(). But, they might as well save the effort since there is a pre-calculated value sitting there for them.
[ dhansen: rewrite all the text ]
Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Tested-by: Farrah Chen <farrah.chen@intel.com>
show more ...
|
| 61221d07 | 01-Sep-2025 |
Kai Huang <kai.huang@intel.com> |
KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
On TDX platforms, during kexec, the kernel needs to make sure there are no dirty cachelines of TDX private memory before booting to the new k
KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
On TDX platforms, during kexec, the kernel needs to make sure there are no dirty cachelines of TDX private memory before booting to the new kernel to avoid silent memory corruption to the new kernel.
To do this, the kernel has a percpu boolean to indicate whether the cache of a CPU may be in incoherent state. During kexec, namely in stop_this_cpu(), the kernel does WBINVD if that percpu boolean is true. TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL, Thus making sure the cache will be flushed during kexec.
However, kexec has a race condition that, while remaining extremely rare, would be more likely in the presence of a relatively long operation such as WBINVD.
In particular, the kexec-ing CPU invokes native_stop_other_cpus() to stop all remote CPUs before booting to the new kernel. native_stop_other_cpus() then sends a REBOOT vector IPI to remote CPUs and waits for them to stop; if that times out, it also sends NMIs to the still-alive CPUs and waits again for them to stop. If the race happens, kexec proceeds before all CPUs have processed the NMI and stopped[1], and the system hangs.
But after tdx_disable_virtualization_cpu(), no more TDX activity can happen on this cpu. When kexec is enabled, flush the cache explicitly at that point; this moves the WBINVD to an earlier stage than stop_this_cpus(), avoiding a possibly lengthy operation at a time where it could cause this race.
[1] https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/
[Make the new function a stub for !CONFIG_KEXEC_CORE. - Paolo] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/all/20250901160930.1785244-8-pbonzini%40redhat.com
show more ...
|
| 10df8607 | 01-Sep-2025 |
Kai Huang <kai.huang@intel.com> |
x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
On TDX platforms, dirty cacheline aliases with and without encryption bits can coexist, and the cpu can flush them back to memor
x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
On TDX platforms, dirty cacheline aliases with and without encryption bits can coexist, and the cpu can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to different encryption property.
A percpu boolean is used to mark whether the cache of a given CPU may be in an incoherent state, and the kexec performs WBINVD on the CPUs with that boolean turned on.
For TDX, only the TDX module or the TDX guests can generate dirty cachelines of TDX private memory, i.e., they are only generated when the kernel does a SEAMCALL.
Set that boolean when the kernel does SEAMCALL so that kexec can flush the cache correctly.
The kernel provides both the __seamcall*() assembly functions and the seamcall*() wrapper ones which additionally handle running out of entropy error in a loop. Most of the SEAMCALLs are called using the seamcall*(), except TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD which are called using __seamcall*() variant directly.
To cover the two special cases, add a new __seamcall_dirty_cache() helper which only sets the percpu boolean and calls the __seamcall*(), and change the special cases to use the new helper. To cover all other SEAMCALLs, change seamcall*() to call the new helper.
For the SEAMCALLs invoked via seamcall*(), they can be made from both task context and IRQ disabled context. Given SEAMCALL is just a lengthy instruction (e.g., thousands of cycles) from kernel's point of view and preempt_{disable|enable}() is cheap compared to it, just unconditionally disable preemption during setting the boolean and making SEAMCALL.
Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/all/20250901160930.1785244-4-pbonzini%40redhat.com
show more ...
|
| 01fb93a3 | 19-Aug-2025 |
Adrian Hunter <adrian.hunter@intel.com> |
x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
Avoid clearing reclaimed TDX private pages unless the platform is affected by the X86_BUG_TDX_PW_MCE erratum. This signifi
x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
Avoid clearing reclaimed TDX private pages unless the platform is affected by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown time on unaffected systems.
Background
KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
- Clears the TD Owner bit (which identifies TDX private memory) and integrity metadata without triggering integrity violations. - Clears poison from cache lines without consuming it, avoiding MCEs on access (refer TDX Module Base spec. 1348549-006US section 6.5. Handling Machine Check Events during Guest TD Operation).
The TDX module also uses MOVDIR64B to initialize private pages before use. If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC. However, KVM currently flushes unconditionally, refer commit 94c477a751c7b ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
In contrast, when private pages are reclaimed, the TDX Module handles flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
Problem
Clearing all private pages during VM shutdown is costly. For guests with a large amount of memory it can take minutes.
Solution
TDX Module Base Architecture spec. documents that private pages reclaimed from a TD should be initialized using MOVDIR64B, in order to avoid integrity violation or TD bit mismatch detection when later being read using a shared HKID, refer April 2025 spec. "Page Initialization" in section "8.6.2. Platforms not Using ACT: Required Cache Flush and Initialization by the Host VMM"
That is an overstatement and will be clarified in coming versions of the spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li Mode" in the same spec, there is no issue accessing such reclaimed pages using a shared key that does not have integrity enabled. Linux always uses KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID which disallows integrity, refer "TME Policy/Encryption Algorithm" bit description in "Intel Architecture Memory Encryption Technologies" spec version 1.6 April 2025. So there is no need to clear pages to avoid integrity violations.
There remains a risk of poison consumption. However, in the context of TDX, it is expected that there would be a machine check associated with the original poisoning. On some platforms that results in a panic. However platforms may support "SEAM_NR" Machine Check capability, in which case Linux machine check handler marks the page as poisoned, which prevents it from being allocated anymore, refer commit 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
Improvement
By skipping the clearing step on unaffected platforms, shutdown time can improve by up to 40%.
On platforms with the X86_BUG_TDX_PW_MCE erratum (SPR and EMR), continue clearing because these platforms may trigger poison on partial writes to previously-private pages, even with KeyID 0, refer commit 1e536e1068970 ("x86/cpu: Detect TDX partial write machine check erratum")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kas@kernel.org> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Acked-by: Vishal Annapurve <vannapurve@google.com> Link: https://lore.kernel.org/all/20250819155811.136099-4-adrian.hunter%40intel.com
show more ...
|
| a27b008a | 19-Aug-2025 |
Adrian Hunter <adrian.hunter@intel.com> |
x86/tdx: Tidy reset_pamt functions
tdx_quirk_reset_paddr() was renamed to reflect that, in fact, the clearing is necessary only for hardware with a certain quirk. That is dealt with in a subsequent
x86/tdx: Tidy reset_pamt functions
tdx_quirk_reset_paddr() was renamed to reflect that, in fact, the clearing is necessary only for hardware with a certain quirk. That is dealt with in a subsequent patch.
Rename reset_pamt functions to contain "quirk" to reflect the new functionality, and remove the now misleading comment.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Acked-by: Vishal Annapurve <vannapurve@google.com> Link: https://lore.kernel.org/all/20250819155811.136099-3-adrian.hunter%40intel.com
show more ...
|
| 69e23faf | 21-Nov-2024 |
Kai Huang <kai.huang@intel.com> |
x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
Intel TDX protects guest VM's from malicious host and certain physical attacks. TDX introduces a new operation mode, Secure Arbitration Mo
x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
Intel TDX protects guest VM's from malicious host and certain physical attacks. TDX introduces a new operation mode, Secure Arbitration Mode (SEAM) to isolate and protect guest VM's. A TDX guest VM runs in SEAM and, unlike VMX, direct control and interaction with the guest by the host VMM is not possible. Instead, Intel TDX Module, which also runs in SEAM, provides a SEAMCALL API.
The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER. The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX VMLAUNCH/VMRESUME instructions. When a guest VM-exit requires host VMM interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).
Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER; tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr as well, which is for two reasons: * marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and guest_state_exit_irqoff() * IRET must be avoided between VM-exit and NMI handling, in order to avoid prematurely releasing the NMI inhibit.
TDH.VP.ENTER is different from other SEAMCALLs in several ways: it uses more arguments, and after it returns some host state may need to be restored. Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of __seamcall_ret(); since it is the only caller of __seamcall_saved_ret(), it can be made noinstr also.
TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs). For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere. Additionally, XMM registers are not required for the existing Guest Hypervisor Communication Interface and are handled by existing KVM code should they be modified by the guest.
There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments. Input #1 : Initial entry or following a previous async. TD Exit Input #2 : Following a previous TDCALL(TDG.VP.VMCALL) Output #1 : On Error (No TD Entry) Output #2 : Async. Exits with a VMX Architectural Exit Reason Output #3 : Async. Exits with a non-VMX TD Exit Status Output #4 : Async. Exits with Cross-TD Exit Details Output #5 : On TDCALL(TDG.VP.VMCALL)
Currently, to keep things simple, the wrapper function does not attempt to support different formats, and just passes all the GPRs that could be used. The GPR values are held by KVM in the area set aside for guest GPRs. KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for or process results of tdh_vp_enter().
Therefore changing tdh_vp_enter() to use more complex argument formats would also alter the way KVM code interacts with tdh_vp_enter().
Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20241121201448.36170-2-adrian.hunter@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 099d7e9b | 12-Nov-2024 |
Isaku Yamahata <isaku.yamahata@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents
The TDX module measures the TD during the build process and saves the measurement in TDCS.MRTD to facilitate TD attestation
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents
The TDX module measures the TD during the build process and saves the measurement in TDCS.MRTD to facilitate TD attestation of the initial contents of the TD. Wrap the SEAMCALL TDH.MR.EXTEND with tdh_mr_extend() and TDH.MR.FINALIZE with tdh_mr_finalize() to enable the host kernel to assist the TDX module in performing the measurement.
The measurement in TDCS.MRTD is a SHA-384 digest of the build process. SEAMCALLs TDH.MNG.INIT and TDH.MEM.PAGE.ADD initialize and contribute to the MRTD digest calculation.
The caller of tdh_mr_extend() should break the TD private page into chunks of size TDX_EXTENDMR_CHUNKSIZE and invoke tdh_mr_extend() to add the page content into the digest calculation. Failures are possible with TDH.MR.EXTEND (e.g., due to SEPT walking). The caller of tdh_mr_extend() can check the function return value and retrieve extended error information from the function output parameters.
Calling tdh_mr_finalize() completes the measurement. The TDX module then turns the TD into the runnable state. Further TDH.MEM.PAGE.ADD and TDH.MR.EXTEND calls will fail.
TDH.MR.FINALIZE may fail due to errors such as the TD having no vCPUs or contentions. Check function return value when calling tdh_mr_finalize() to determine the exact reason for failure. Take proper locks on the caller's side to avoid contention failures, or handle the BUSY error in specific ways (e.g., retry). Return the SEAMCALL error code directly to the caller. Do not attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export] [Yan: Re-wrote the changelog] Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073709.22171-1-yan.y.zhao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 206e7860 | 12-Nov-2024 |
Isaku Yamahata <isaku.yamahata@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
TDX architecture introduces the concept of private GPA vs shared GPA, depending on the GPA.SHARED bit. The TDX module maintains a sing
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
TDX architecture introduces the concept of private GPA vs shared GPA, depending on the GPA.SHARED bit. The TDX module maintains a single Secure EPT (S-EPT or SEPT) tree per TD to translate TD's private memory accessed using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.REMOVE with tdh_mem_page_remove() and TDH_PHYMEM_PAGE_WBINVD with tdh_phymem_page_wbinvd_hkid() to unmap a TD private page from the SEPT, remove the TD private page from the TDX module and flush cache lines to memory after removal of the private page.
Callers should specify "GPA" and "level" when calling tdh_mem_page_remove() to indicate to the TDX module which TD private page to unmap and remove.
TDH.MEM.PAGE.REMOVE may fail, and the caller of tdh_mem_page_remove() can check the function return value and retrieve extended error information from the function output parameters. Follow the TLB tracking protocol before calling tdh_mem_page_remove() to remove a TD private page to avoid SEAMCALL failure.
After removing a TD's private page, the TDX module does not write back and invalidate cache lines associated with the page and the page's keyID (i.e., the TD's guest keyID). Therefore, provide tdh_phymem_page_wbinvd_hkid() to allow the caller to pass in the TD's guest keyID and invoke TDH_PHYMEM_PAGE_WBINVD to perform this action.
Before reusing the page, the host kernel needs to map the page with keyID 0 and invoke movdir64b() to convert the TD private page to a normal shared page.
TDH.MEM.PAGE.REMOVE and TDH_PHYMEM_PAGE_WBINVD may meet contentions inside the TDX module for TDX's internal resources. To avoid staying in SEAM mode for too long, TDX module will return a BUSY error code to the kernel instead of spinning on the locks. The caller may need to handle this error in specific ways (e.g., retry). The wrappers return the SEAMCALL error code directly to the caller. Don't attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export] [Yan: Re-wrote the changelog] Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073658.22157-1-yan.y.zhao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| ee4884eb | 14-Jan-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
TDX module defines a TLB tracking protocol to make sure that no logical processor holds any stale Secure EPT (S-EPT or SEPT) TLB transl
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
TDX module defines a TLB tracking protocol to make sure that no logical processor holds any stale Secure EPT (S-EPT or SEPT) TLB translations for a given TD private GPA range. After a successful TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, and kicking off all vCPUs, TDX module ensures that the subsequent TDH.VP.ENTER on each vCPU will flush all stale TLB entries for the specified GPA ranges in TDH.MEM.RANGE.BLOCK. Wrap the TDH.MEM.RANGE.BLOCK with tdh_mem_range_block() and TDH.MEM.TRACK with tdh_mem_track() to enable the kernel to assist the TDX module in TLB tracking management.
The caller of tdh_mem_range_block() needs to specify "GPA" and "level" to request the TDX module to block the subsequent creation of TLB translation for a GPA range. This GPA range can correspond to a SEPT page or a TD private page at any level.
Contentions and errors are possible with the SEAMCALL TDH.MEM.RANGE.BLOCK. Therefore, the caller of tdh_mem_range_block() needs to check the function return value and retrieve extended error info from the function output params.
Upon TDH.MEM.RANGE.BLOCK success, no new TLB entries will be created for the specified private GPA range, though the existing TLB translations may still persist. TDH.MEM.TRACK will then advance the TD's epoch counter to ensure TDX module will flush TLBs in all vCPUs once the vCPUs re-enter the TD. TDH.MEM.TRACK will fail to advance TD's epoch counter if there are vCPUs still running in non-root mode at the previous TD epoch counter. So to ensure private GPA translations are flushed, callers must first call tdh_mem_range_block(), then tdh_mem_track(), and lastly send IPIs to kick all the vCPUs and force them to re-enter, thus triggering the TLB flush.
Don't export a single operation and instead export functions that just expose the block and track operations; this is for a couple reasons:
1. The vCPU kick should use KVM's functionality for doing this, which can better target sending IPIs to only the minimum required pCPUs.
2. tdh_mem_track() doesn't need to be executed if a vCPU has not entered a TD, which is information only KVM knows.
3. Leaving the operations separate will allow for batching many tdh_mem_range_block() calls before a tdh_mem_track(). While this batching will not be done initially by KVM, it demonstrates that keeping mem block and track as separate operations is a generally good design.
Contentions are also possible in TDH.MEM.TRACK. For example, TDH.MEM.TRACK may contend with TDH.VP.ENTER when advancing the TD epoch counter. tdh_mem_track() does not provide the retries for the caller. Callers can choose to avoid contentions or retry on their own.
[Kai: Switched from generic seamcall export] [Yan: Re-wrote the changelog] Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073648.22143-1-yan.y.zhao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 94c477a7 | 12-Nov-2024 |
Isaku Yamahata <isaku.yamahata@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
TDX architecture introduces the concept of private GPA vs shared GPA, depending on the GPA.SHARED bit. The TDX module maintains a Secure E
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
TDX architecture introduces the concept of private GPA vs shared GPA, depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT (S-EPT or SEPT) tree per TD to translate TD's private memory accessed using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.ADD with tdh_mem_page_add() and TDH.MEM.PAGE.AUG with tdh_mem_page_aug() to add TD private pages and map them to the TD's private GPAs in the SEPT.
Callers of tdh_mem_page_add() and tdh_mem_page_aug() allocate and provide normal pages to the wrappers, who further pass those pages to the TDX module. Before passing the pages to the TDX module, tdh_mem_page_add() and tdh_mem_page_aug() perform a CLFLUSH on the page mapped with keyID 0 to ensure that any dirty cache lines don't write back later and clobber TD memory or control structures. Don't worry about the other MK-TME keyIDs because the kernel doesn't use them. The TDX docs specify that this flush is not needed unless the TDX module exposes the CLFLUSH_BEFORE_ALLOC feature bit. Do the CLFLUSH unconditionally for two reasons: make the solution simpler by having a single path that can handle both !CLFLUSH_BEFORE_ALLOC and CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any correctness uncertainty by going with a conservative solution to start.
Call tdh_mem_page_add() to add a private page to a TD during the TD's build time (i.e., before TDH.MR.FINALIZE). Specify which GPA the 4K private page will map to. No need to specify level info since TDH.MEM.PAGE.ADD only adds pages at 4K level. To provide initial contents to TD, provide an additional source page residing in memory managed by the host kernel itself (encrypted with a shared keyID). The TDX module will copy the initial contents from the source page in shared memory into the private page after mapping the page in the SEPT to the specified private GPA. The TDX module allows the source page to be the same page as the private page to be added. In that case, the TDX module converts and encrypts the source page as a TD private page.
Call tdh_mem_page_aug() to add a private page to a TD during the TD's runtime (i.e., after TDH.MR.FINALIZE). TDH.MEM.PAGE.AUG supports adding huge pages. Specify which GPA the private page will map to, along with level info embedded in the lower bits of the GPA. The TDX module will recognize the added page as the TD's private page after the TD's acceptance with TDCALL TDG.MEM.PAGE.ACCEPT.
tdh_mem_page_add() and tdh_mem_page_aug() may fail. Callers can check function return value and retrieve extended error info from the function output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for too long, SEAMCALLs returns a BUSY error code to the kernel instead of spinning on the locks. Depending on the specific SEAMCALL, the caller may need to handle this error in specific ways (e.g., retry). Therefore, return the SEAMCALL error code directly to the caller. Don't attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export] [Yan: Re-wrote the changelog] Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073636.22129-1-yan.y.zhao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 385ba3fd | 12-Nov-2024 |
Isaku Yamahata <isaku.yamahata@intel.com> |
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT pages
TDX architecture introduces the concept of private GPA vs shared GPA, depending on the GPA.SHARED bit. The TDX module maintain
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT pages
TDX architecture introduces the concept of private GPA vs shared GPA, depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT (S-EPT or SEPT) tree per TD for private GPA to HPA translation. Wrap the TDH.MEM.SEPT.ADD SEAMCALL with tdh_mem_sept_add() to provide pages to the TDX module for building a TD's SEPT tree. (Refer to these pages as SEPT pages).
Callers need to allocate and provide a normal page to tdh_mem_sept_add(), which then passes the page to the TDX module via the SEAMCALL TDH.MEM.SEPT.ADD. The TDX module then installs the page into SEPT tree and encrypts this SEPT page with the TD's guest keyID. The kernel cannot use the SEPT page until after reclaiming it via TDH.MEM.SEPT.REMOVE or TDH.PHYMEM.PAGE.RECLAIM.
Before passing the page to the TDX module, tdh_mem_sept_add() performs a CLFLUSH on the page mapped with keyID 0 to ensure that any dirty cache lines don't write back later and clobber TD memory or control structures. Don't worry about the other MK-TME keyIDs because the kernel doesn't use them. The TDX docs specify that this flush is not needed unless the TDX module exposes the CLFLUSH_BEFORE_ALLOC feature bit. Do the CLFLUSH unconditionally for two reasons: make the solution simpler by having a single path that can handle both !CLFLUSH_BEFORE_ALLOC and CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any correctness uncertainty by going with a conservative solution to start.
Callers should specify "GPA" and "level" for the TDX module to install the SEPT page at the specified position in the SEPT. Do not include the root page level in "level" since TDH.MEM.SEPT.ADD can only add non-root pages to the SEPT. Ensure "level" is between 1 and 3 for a 4-level SEPT or between 1 and 4 for a 5-level SEPT.
Call tdh_mem_sept_add() during the TD's build time or during the TD's runtime. Check for errors from the function return value and retrieve extended error info from the function output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for too long, SEAMCALLs returns a BUSY error code to the kernel instead of spinning on the locks. Depending on the specific SEAMCALL, the caller may need to handle this error in specific ways (e.g., retry). Therefore, return the SEAMCALL error code directly to the caller. Don't attempt to handle it in the core kernel.
TDH.MEM.SEPT.ADD effectively manages two internal resources of the TDX module: it installs page table pages in the SEPT tree and also updates the TDX module's page metadata (PAMT). Don't add a wrapper for the matching SEAMCALL for removing a SEPT page (TDH.MEM.SEPT.REMOVE) because KVM, as the only in-kernel user, will only tear down the SEPT tree when the TD is being torn down. When this happens it can just do other operations that reclaim the SEPT pages for the host kernels to use, update the PAMT and let the SEPT get trashed.
[Kai: Switched from generic seamcall export] [Yan: Re-wrote the changelog] Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073624.22114-1-yan.y.zhao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| fcdbdf63 | 15-Nov-2024 |
Kai Huang <kai.huang@intel.com> |
KVM: VMX: Initialize TDX during KVM module load
Before KVM can use TDX to create and run TDX guests, TDX needs to be initialized from two perspectives: 1) TDX module must be initialized properly to
KVM: VMX: Initialize TDX during KVM module load
Before KVM can use TDX to create and run TDX guests, TDX needs to be initialized from two perspectives: 1) TDX module must be initialized properly to a working state; 2) A per-cpu TDX initialization, a.k.a the TDH.SYS.LP.INIT SEAMCALL must be done on any logical cpu before it can run any other TDX SEAMCALLs.
The TDX host core-kernel provides two functions to do the above two respectively: tdx_enable() and tdx_cpu_enable().
There are two options in terms of when to initialize TDX: initialize TDX at KVM module loading time, or when creating the first TDX guest.
Choose to initialize TDX during KVM module loading time:
Initializing TDX module is both memory and CPU time consuming: 1) the kernel needs to allocate a non-trivial size(~1/256) of system memory as metadata used by TDX module to track each TDX-usable memory page's status; 2) the TDX module needs to initialize this metadata, one entry for each TDX-usable memory page.
Also, the kernel uses alloc_contig_pages() to allocate those metadata chunks, because they are large and need to be physically contiguous. alloc_contig_pages() can fail. If initializing TDX when creating the first TDX guest, then there's chance that KVM won't be able to run any TDX guests albeit KVM _declares_ to be able to support TDX.
This isn't good for the user.
On the other hand, initializing TDX at KVM module loading time can make sure KVM is providing a consistent view of whether KVM can support TDX to the user.
Always only try to initialize TDX after VMX has been initialized. TDX is based on VMX, and if VMX fails to initialize then TDX is likely to be broken anyway. Also, in practice, supporting TDX will require part of VMX and common x86 infrastructure in working order, so TDX cannot be enabled alone w/o VMX support.
There are two cases that can result in failure to initialize TDX: 1) TDX cannot be supported (e.g., because of TDX is not supported or enabled by hardware, or module is not loaded, or missing some dependency in KVM's configuration); 2) Any unexpected error during TDX bring-up. For the first case only mark TDX is disabled but still allow KVM module to be loaded. For the second case just fail to load the KVM module so that the user can be aware.
Because TDX costs additional memory, don't enable TDX by default. Add a new module parameter 'enable_tdx' to allow the user to opt-in.
Note, the name tdx_init() has already been taken by the early boot code. Use tdx_bringup() for initializing TDX (and tdx_cleanup() since KVM doesn't actually teardown TDX). They don't match vt_init()/vt_exit(), vmx_init()/vmx_exit() etc but it's not end of the world.
Also, once initialized, the TDX module cannot be disabled and enabled again w/o the TDX module runtime update, which isn't supported by the kernel. After TDX is enabled, nothing needs to be done when KVM disables hardware virtualization, e.g., when offlining CPU, or during suspend/resume. TDX host core-kernel code internally tracks TDX status and can handle "multiple enabling" scenario.
Similar to KVM_AMD_SEV, add a new KVM_INTEL_TDX Kconfig to guide KVM TDX code. Make it depend on INTEL_TDX_HOST but not replace INTEL_TDX_HOST because in the longer term there's a use case that requires making SEAMCALLs w/o KVM as mentioned by Dan [1].
Link: https://lore.kernel.org/6723fc2070a96_60c3294dc@dwillia2-mobl3.amr.corp.intel.com.notmuch/ [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <162f9dee05c729203b9ad6688db1ca2960b4b502.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| aed4dde2 | 30-Oct-2024 |
Isaku Yamahata <isaku.yamahata@intel.com> |
x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyID
Intel TDX protects guest VMs from malicious host and certain physical attacks. Pre-TDX Intel hardware has support for
x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyID
Intel TDX protects guest VMs from malicious host and certain physical attacks. Pre-TDX Intel hardware has support for a memory encryption architecture called MK-TME, which repurposes several high bits of physical address as "KeyID". The BIOS reserves a sub-range of MK-TME KeyIDs as "TDX private KeyIDs".
Each TDX guest must be assigned with a unique TDX KeyID when it is created. The kernel reserves the first TDX private KeyID for crypto-protection of specific TDX module data which has a lifecycle that exceeds the KeyID reserved for the TD's use. The rest of the KeyIDs are left for TDX guests to use.
Create a small KeyID allocator. Export tdx_guest_keyid_alloc()/tdx_guest_keyid_free() to allocate and free TDX guest KeyID for KVM to use.
Don't provide the stub functions when CONFIG_INTEL_TDX_HOST=n since they are not supposed to be called in this case.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20241030190039.77971-5-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 4caf32da | 24-Dec-2024 |
Kai Huang <kai.huang@intel.com> |
x86/virt/tdx: Read essential global metadata for KVM
KVM needs two classes of global metadata to create and run TDX guests:
- "TD Control Structures" - "TD Configurability"
The first class conta
x86/virt/tdx: Read essential global metadata for KVM
KVM needs two classes of global metadata to create and run TDX guests:
- "TD Control Structures" - "TD Configurability"
The first class contains the sizes of TDX guest per-VM and per-vCPU control structures. KVM will need to use them to allocate enough space for those control structures.
The second class contains info which reports things like which features are configurable to TDX guest etc. KVM will need to use them to properly configure TDX guests.
Read them for KVM TDX to use.
The code change is auto-generated by re-running the script in [1] after uncommenting the "td_conf" and "td_ctrl" part to regenerate the tdx_global_metadata.{hc} and update them to the existing ones in the kernel.
#python tdx.py global_metadata.json tdx_global_metadata.h \ tdx_global_metadata.c
The 'global_metadata.json' can be fetched from [2].
Note that as of this writing, the JSON file only allows a maximum of 32 CPUID entries. While this is enough for current contents of the CPUID leaves, there were plans to change the JSON per TDX module release which would change the ABI and potentially prevent future versions of the TDX module from working with older kernels.
While discussions are ongoing with the TDX module team on what exactly constitutes an ABI breakage, in the meantime the TDX module team has agreed to not increase the number of CPUID entries beyond 128 without an opt in. Therefore the file was tweaked by hand to change the maximum number of CPUID_CONFIGs.
Link: https://lore.kernel.org/kvm/0853b155ec9aac09c594caa60914ed6ea4dc0a71.camel@intel.com/ [1] Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20241030190039.77971-4-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 7ba2fd80 | 22-Jan-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
x86/virt/tdx: allocate tdx_sys_info in static memory
Adding all the information that KVM needs increases the size of struct tdx_sys_info, to the point that you can get warnings about the stack size
x86/virt/tdx: allocate tdx_sys_info in static memory
Adding all the information that KVM needs increases the size of struct tdx_sys_info, to the point that you can get warnings about the stack size of init_tdx_module(). Since KVM also needs to read the TDX metadata after init_tdx_module() returns, make the variable a global.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| e465cc63 | 03-Dec-2024 |
Rick Edgecombe <rick.p.edgecombe@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module has the concept of flushing vCPUs. These fl
x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module has the concept of flushing vCPUs. These flushes include both a flush of the translation caches and also any other state internal to the TDX module. Before freeing a KeyID, this flush operation needs to be done. KVM will need to perform the flush on each pCPU associated with the TD, and also perform a TD scoped operation that checks if the flush has been done on all vCPU's associated with the TD.
Add a tdh_vp_flush() function to be used to call TDH.VP.FLUSH on each pCPU associated with the TD during TD teardown. It will also be called when disabling TDX and during vCPU migration between pCPUs.
Add tdh_mng_vpflushdone() to be used by KVM to call TDH.MNG.VPFLUSHDONE. KVM will use this during TD teardown to verify that TDH.VP.FLUSH has been called sufficiently, and advance the state machine that will allow for reclaiming the TD's KeyID.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-7-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 5e5151c5 | 03-Dec-2024 |
Rick Edgecombe <rick.p.edgecombe@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module has TD scoped and vCPU scoped "metadata
x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module has TD scoped and vCPU scoped "metadata fields". These fields are a bit like VMCS fields, and stored in data structures maintained by the TDX module. Export 3 SEAMCALLs for use in reading and writing these fields:
Make tdh_mng_rd() use MNG.VP.RD to read the TD scoped metadata.
Make tdh_vp_rd()/tdh_vp_wr() use TDH.VP.RD/WR to read/write the vCPU scoped metadata.
KVM will use these by creating inline helpers that target various metadata sizes. Export the raw SEAMCALL leaf, to avoid exporting the large number of various sized helpers.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Message-ID: <20241203010317.827803-6-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
| 541b3e9e | 03-Dec-2024 |
Rick Edgecombe <rick.p.edgecombe@intel.com> |
x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module uses pages provided by the host for bo
x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module uses pages provided by the host for both control structures and for TD guest pages. These pages are encrypted using the MK-TME encryption engine, with its special requirements around cache invalidation. For its own security, the TDX module ensures pages are flushed properly and track which usage they are currently assigned. For creating and tearing down TD VMs and vCPUs KVM will need to use the TDH.PHYMEM.PAGE.RECLAIM, TDH.PHYMEM.CACHE.WB, and TDH.PHYMEM.PAGE.WBINVD SEAMCALLs.
Add tdh_phymem_page_reclaim() to enable KVM to call TDH.PHYMEM.PAGE.RECLAIM to reclaim the page for use by the host kernel. This effectively resets its state in the TDX module's page tracking (PAMT), if the page is available to be reclaimed. This will be used by KVM to reclaim the various types of pages owned by the TDX module. It will have a small wrapper in KVM that retries in the case of a relevant error code. Don't implement this wrapper in arch/x86 because KVM's solution around retrying SEAMCALLs will be better located in a single place.
Add tdh_phymem_cache_wb() to enable KVM to call TDH.PHYMEM.CACHE.WB to do a cache write back in a way that the TDX module can verify, before it allows a KeyID to be freed. The KVM code will use this to have a small wrapper that handles retries. Since the TDH.PHYMEM.CACHE.WB operation is interruptible, have tdh_phymem_cache_wb() take a resume argument to pass this info to the TDX module for restarts. It is worth noting that this SEAMCALL uses a SEAM specific MSR to do the write back in sections. In this way it does export some new functionality that affects CPU state.
Add tdh_phymem_page_wbinvd_tdr() to enable KVM to call TDH.PHYMEM.PAGE.WBINVD to do a cache write back and invalidate of a TDR, using the global KeyID. The underlying TDH.PHYMEM.PAGE.WBINVD SEAMCALL requires the related KeyID to be encoded into the SEAMCALL args. Since the global KeyID is not exposed to KVM, a dedicated wrapper is needed for TDR focused TDH.PHYMEM.PAGE.WBINVD operations.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-5-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|