#
7f9039c5 |
| 02-Jun-2025 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini: Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock() f
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini: Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock() family of functions, and move duplicated code to virt/kvm/. kernel/ patches acked by Peter Zijlstra
- Add MGLRU support to the access tracking perf test
ARM fixes:
- Make the irqbypass hooks resilient to changes in the GSI<->MSI routing, avoiding behind stale vLPI mappings being left behind. The fix is to resolve the VGIC IRQ using the host IRQ (which is stable) and nuking the vLPI mapping upon a routing change
- Close another VGIC race where vCPU creation races with VGIC creation, leading to in-flight vCPUs entering the kernel w/o private IRQs allocated
- Fix a build issue triggered by the recently added workaround for Ampere's AC04_CPU_23 erratum
- Correctly sign-extend the VA when emulating a TLBI instruction potentially targeting a VNCR mapping
- Avoid dereferencing a NULL pointer in the VGIC debug code, which can happen if the device doesn't have any mapping yet
s390:
- Fix interaction between some filesystems and Secure Execution
- Some cleanups and refactorings, preparing for an upcoming big series
x86:
- Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM
- Refine and harden handling of spurious faults
- Add support for ALLOWED_SEV_FEATURES
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits
- Don't account temporary allocations in sev_send_update_data()
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX
- Advertise support to userspace for WRMSRNS and PREFETCHI
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing
- Add a module param to control and enumerate support for device posted interrupts
- Fix a potential overflow with nested virt on Intel systems running 32-bit kernels
- Flush shadow VMCSes on emergency reboot
- Add support for SNP to the various SEV selftests
- Add a selftest to verify fastops instructions via forced emulation
- Refine and optimize KVM's software processing of the posted interrupt bitmap, and share the harvesting code between KVM and the kernel's Posted MSI handler"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits) rtmutex_api: provide correct extern functions KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race KVM: arm64: Unmap vLPIs affected by changes to GSI routing information KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding() KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock KVM: arm64: Use lock guard in vgic_v4_set_forwarding() KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue KVM: s390: Simplify and move pv code KVM: s390: Refactor and split some gmap helpers KVM: s390: Remove unneeded srcu lock s390: Remove unneeded includes s390/uv: Improve splitting of large folios that cannot be split while dirty s390/uv: Always return 0 from s390_wiggle_split_folio() if successful s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful rust: add helper for mutex_trylock RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation ...
show more ...
|
#
cd1be30b |
| 27-May-2025 |
Edward Adam Davis <eadavis@qq.com> |
KVM: VMX: use __always_inline for is_td_vcpu and is_td
is_td() and is_td_vcpu() are used in no-instrumentation sections; use __always_inline instead of inline.
vmlinux.o: error: objtool: vmx_handle
KVM: VMX: use __always_inline for is_td_vcpu and is_td
is_td() and is_td_vcpu() are used in no-instrumentation sections; use __always_inline instead of inline.
vmlinux.o: error: objtool: vmx_handle_nmi+0x47: call to is_td_vcpu.isra.0() leaves .noinstr.text section
Fixes: 7172c753c26a ("KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct") Signed-off-by: Edward Adam Davis <eadavis@qq.com> Message-ID: <tencent_1A767567C83C1137829622362E4A72756F09@qq.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
#
43db1111 |
| 29-May-2025 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini: "As far as x86 goes this pull request "only" includes TDX host support.
Quotes are appropr
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini: "As far as x86 goes this pull request "only" includes TDX host support.
Quotes are appropriate because (at 6k lines and 100+ commits) it is much bigger than the rest, which will come later this week and consists mostly of bugfixes and selftests. s390 changes will also come in the second batch.
ARM:
- Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance.
- Enable nested virtualisation support on systems that support it, though it is disabled by default.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes.
- Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture.
- Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a guest.
- Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion.
- and the usual random cleanups.
LoongArch:
- Don't flush tlb if the host supports hardware page table walks.
- Add KVM selftests support.
RISC-V:
- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
x86:
- Initial support for TDX in KVM.
This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module.
This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit 7bcf7246c42a ('Merge branch 'kvm-tdx-finish-initial' into HEAD')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits) x86/tdx: mark tdh_vp_enter() as __flatten Documentation: virt/kvm: remove unreferenced footnote RISC-V: KVM: lock the correct mp_state during reset KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: np-guest CMOs with PMD_SIZE fixmap KVM: arm64: Stage-2 huge mappings for np-guests KVM: arm64: Add a range to pkvm_mappings KVM: arm64: Convert pkvm_mappings to interval tree KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest() KVM: arm64: Add a range to __pkvm_host_wrprotect_guest() KVM: arm64: Add a range to __pkvm_host_unshare_guest() KVM: arm64: Add a range to __pkvm_host_share_guest() KVM: arm64: Introduce for_each_hyp_page KVM: arm64: Handle huge mappings for np-guest CMOs KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET RISC-V: KVM: Remove scounteren initialization KVM: RISC-V: remove unnecessary SBI reset state ...
show more ...
|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
#
fd02aa45 |
| 19-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a soft
Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
- Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
Revision tags: v6.14-rc7 |
|
#
9913212b |
| 12-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-interrupts' into HEAD
Introduces support for interrupt handling for TDX guests, including virtual interrupt injection and VM-Exits caused by vectored events.
Injection =======
Merge branch 'kvm-tdx-interrupts' into HEAD
Introduces support for interrupt handling for TDX guests, including virtual interrupt injection and VM-Exits caused by vectored events.
Injection =========
TDX supports non-NMI interrupt injection only by posted interrupt. Posted interrupt descriptors (PIDs) are allocated in shared memory, KVM can update them directly. To post pending interrupts in the PID, KVM can generate a self-IPI with notification vector prior to TD entry. TDX guest status is protected, KVM can't get the interrupt status of TDX guest. For now, assume the interrupt is always allowed. A later patch set will let TDX guests to call TDVMCALL with HLT, which passes the interrupt block flag, so that whether interrupt is allowed in HLT will checked against the interrupt block flag.
For NMIs, KVM can request the TDX module to inject a NMI into a TDX vCPU by setting the PEND_NMI TDVPS field to 1. Following that, KVM can call TDH.VP.ENTER to run the vCPU and the TDX module will attempt to inject the NMI as soon as possible. PEND_NMI TDVPS field is a 1-bit filed, i.e. KVM can only pend one NMI in the TDX module. Also, TDX doesn't allow KVM to request NMI-window exit directly. When there is already one NMI pending in the TDX module, i.e. it has not been delivered to TDX guest yet, if there is NMI pending in KVM, collapse the pending NMI in KVM into the one pending in the TDX module. Such collapse is OK considering on X86 bare metal, multiple NMIs could collapse into one NMI, e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all NMI sources in the NMI handler to avoid missing handling of some NMI events. More details can be found in the changelog of the patch "KVM: TDX: Implement methods to inject NMI".
TDX doesn't support system-management mode (SMM) and system-management interrupt (SMI) in guest TDs because TDX module doesn't provide a way for VMM to inject SMI into guest TD or switch guest vCPU mode into SMM. SMI requests return -ENOTTY similar to CONFIG_KVM_SMM=n. Likewise, INIT and SIPI events are not used and are blocked for TDX guests; TDX defines its own vCPU creation and initialization sequence, which is done on the host via SEAMCALLs at TD build time.
VM-exit for external events ===========================
Similar to the VMX case, external interrupts are with interrupts off: in the .handle_exit_irqoff() callback for external interrupts and in the noinstr region for NMIs. Just like VMX, NMI remains blocked after exiting from TDX guest for NMI-induced exits.
Machine check, which is handled in the .handle_exit_irqoff() callback, is the only exception type KVM handles for TDX guests. For other exceptions, because TDX guest state is protected, exceptions in TDX guests can't be intercepted. TDX VMM isn't supposed to handle these exceptions. Exit to userspace with KVM_EXIT_EXCEPTION If unexpected exception occurs.
Host SMIs also cause an exit to KVM. This is needed because in SEAM root mode (TDX module) all interrupts are blocked. An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI because I/O instructions inside TDX guest trigger #VE and TDX guest needs to use TDVMCALL to request VMM to do I/O emulation. The only case of interest for "other SMI" is an #MC occurring in the guest when MCE-SMI morphing is enabled in the host firmware. Such "MSMI" is marked by having bit 0 set in the exit qualification; MSMI exits are fatal for the TD and are eventually handled by the kernel machine check handler (7911f14 x86/mce: Implement recovery for errors in TDX/SEAM non-root mode), which marks the page as poisoned. It is not possible right now to pass machine check exceptions to the guest.
SMIs other than machine check SMIs are handled just by leaving SEAM root mode and KVM doesn't need to do anything.
show more ...
|
#
77ab80c6 |
| 12-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-enter-exit' into HEAD
This series introduces callbacks to facilitate the entry of a TD VCPU and the corresponding save/restore of host state.
A TD VCPU is entered via the SEAM
Merge branch 'kvm-tdx-enter-exit' into HEAD
This series introduces callbacks to facilitate the entry of a TD VCPU and the corresponding save/restore of host state.
A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages the save/restore of guest state and, in conjunction with the SEAMCALL interface, handles certain aspects of host state. However, there are specific elements of the host state that require additional attention, as detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.
TDX is quite different from VMX in this regard. For VMX, the host VMM is heavily involved in restoring, managing and saving guest CPU state, whereas for TDX this is handled by the TDX Module. In that way, the TDX Module can protect the confidentiality and integrity of TD CPU state.
The TDX Module does not save/restore all host CPU state because the host VMM can do it more efficiently and selectively. CPU state referred to below is host CPU state. Often values are already held in memory so no explicit save is needed, and restoration may not be needed if the kernel is not using a feature.
TDX does not support PAUSE-loop exiting. According to the TDX Module Base arch. spec., hypercalls are expected to be used instead. Note that the Linux TDX guest supports existing hypercalls via TDG.VP.VMCALL.
This series requires TDX module 1.5.06.00.0744, or later, due to removal of the workarounds for the lack of the NO_RBP_MOD feature required by the kernel. NO_RBP_MOD is now required.
show more ...
|
Revision tags: v6.14-rc6 |
|
#
fcbe3482 |
| 06-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-mmu' into HEAD
This series picks up from commit 86eb1aef7279 ("Merge branch 'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on changes to the generic x86 parts
Merge branch 'kvm-tdx-mmu' into HEAD
This series picks up from commit 86eb1aef7279 ("Merge branch 'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on changes to the generic x86 parts of the KVM MMU code, and adds support for TDX's secure page tables to the Intel side of KVM.
Confidential computing solutions have concepts of private and shared memory. Often the guest accesses either private or shared memory via a bit in the guest PTE. Solutions like SEV treat this bit more like a permission bit, where solutions like TDX and ARM CCA treat it more like a GPA bit. In the latter case, the host maps private memory in one half of the address space and shared in another. For TDX these two halves are mapped by different EPT roots. The private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module. The shared half is managed by the untrusted part of the VMM (KVM).
In addition to the separate roots for private and shared, there are limitations on what operations can be done on the private side. Like SNP, TDX wants to protect against protected memory being reset or otherwise scrambled by the host. In order to prevent this, the guest has to take specific action to “accept” memory after changes are made by the VMM to the private EPT. This prevents the VMM from performing many of the usual memory management operations that involve zapping and refaulting memory. The private memory also is always RWX and cannot have VMM specified cache attribute attributes applied.
TDX memory implementation =========================
Creating shared EPT ------------------- Shared EPT handling is relatively simple compared to private memory. It is managed from within KVM. The main differences between shared EPT and EPT in a normal VM are that the root is set with a TDVMCS field (via SEAMCALL), and that the GFN specified in the memslot perspective needs to be mapped at an offset in the EPT. For the former, this series plumbs in the load_mmu_pgd() operation to the correct field for the shared EPT. For the latter, previous patches have laid the groundwork for mapping so called “direct roots” roots at an offset specified in kvm->arch.gfn_direct_bits.
Creating private EPT -------------------- In previous patches, the concept of “mirrored roots” were introduced. Such roots maintain a KVM side “mirror” of the “external” EPT by keeping an unmapped EPT tree within the KVM MMU code. When changing these mirror EPTs, the KVM MMU code calls out via x86_ops to update the external EPT. This series adds implementations for these “external” ops for TDX to create and manage “private” memory via TDX module APIs.
Managing S-EPT with the TDX Module ---------------------------------- The TDX module allows the TD’s private memory to be managed via SEAMCALLs. This management consists of operating on two internal elements:
1. The private EPT, which the TDX module calls the S-EPT. It maps the actual mapped, private half of the GPA space using an EPT tree.
2. The HKID, which represents private encryption keys used for encrypting TD memory. The CPU doesn’t guarantee cache coherency between these encryption keys, so memory that is encrypted with one of these keys needs to be reclaimed for use on the host in special ways.
This series will primarily focus on the SEAMCALLs for managing the private EPT. Consideration of the HKID is needed for when the TD is torn down.
Populating TDX Private memory ----------------------------- TDX allows the EPT mapping the TD's private memory to be modified in limited ways. There are SEAMCALLs for building and tearing down the EPT tree, as well as mapping pages into the private EPT.
As for building and tearing down the EPT page tables, it is relatively simple. There are SEAMCALLs for installing and removing them. However, the current implementation only supports adding private EPT page tables, and leaves them installed for the lifetime of the TD. For teardown, the details are discussed in a later section.
As for populating and zapping private SPTE, there are SEAMCALLs for this as well. The zapping case will be described in detail later. As for the populating case, there are two categories: before TD is finalized and after TD is finalized. Both of these scenarios go through the TDP MMU map path. The changes done previously to introduce “mirror” and “external” page tables handle directing SPTE installation operations through the set_external_spte() op.
In the “after” case, the TDX set_external_spte() handler simply calls a SEAMCALL (TDX.MEM.PAGE.AUG).
For the before case, it is a bit more complicated as it requires both setting the private SPTE *and* copying in the initial contents of the page at the same time. For TDX this is done via the KVM_TDX_INIT_MEM_REGION ioctl, which is effectively the kvm_gmem_populate() operation.
For SNP, the private memory can be pre-populated first, and faulted in later like normal. But for TDX these need to both happen both at the same time and the setting of the private SPTE needs to happen in a different way than the “after” case described above. It needs to use the TDH.MEM.SEPT.ADD SEAMCALL which does both the copying in of the data and setting the SPTE.
Without extensive modification to the fault path, it’s not possible utilize this callback from the set_external_spte() handler because it the source page for the data to be copied in is not known deep down in this callchain. So instead the post-populate callback does a three step process.
1. Pre-fault the memory into the mirror EPT, but have the set_external_spte() not make any SEAMCALLs.
2. Check that the page is still faulted into the mirror EPT under read mmu_lock that is held over this and the following step.
3. Call TDH.MEM.SEPT.ADD with the HPA of the page to copy data from, and the private page installed in the mirror EPT to use for the private mapping.
The scheme involves some assumptions about the operations that might operate on the mirrored EPT before the VM is finalized. It assumes that no other memory will be faulted into the mirror EPT, that is not also added via TDH.MEM.SEPT.ADD). If this is violated the KVM MMU may not see private memory faulted in there later and so not make the proper external spte callbacks. To check this, KVM enforces that the number of pre-faulted pages is the same as the number of pages added via KVM_TDX_INIT_MEM_REGION.
TDX TLB flushing ---------------- For TDX, TLB flushing needs to happen in different ways depending on whether private and/or shared EPT needs to be flushed. Shared EPT can be flushed like normal EPT with INVEPT. To avoid reading TD's EPTP out from TDX module, this series flushes shared EPT with type 2 INVEPT. Private TLB entries can be flushed this way too (via type 2). However, since the TDX module needs to enforce some guarantees around which private memory is mapped in the TD, it requires these operations to be done in special ways for private memory.
For flushing private memory, two methods are possible. The simple one is the TDH.VP.FLUSH SEAMCALL; this flush is of the INVEPT type 1 variety (i.e. mappings associated with the TD).
The second method is part of a sequence of SEAMCALLs for removing a guest page. The sequence looks like:
1. TDH.MEM.RANGE.BLOCK - Remove RWX bits from entry (similar to KVM’s zap).
2. TDH.MEM.TRACK - Increment the TD TLB epoch, which is a per-TD counter
3. Kick off all vCPUs - In order to force them to have to re-enter.
4. TDH.MEM.PAGE.REMOVE - Actually remove the page and make it available for other use.
5. TDH.VP.ENTER - On re-entering TDX module will see the epoch is incremented and flush the TLB.
On top of this, during TDX module init TDH.SYS.LP.INIT (which is used to online a CPU for TDX usage) invokes INVEPT to flush all mappings in the TLB.
During runtime, for normal (TDP MMU, non-nested) guests, KVM will do a TLB flushes in 4 scenarios:
(1) kvm_mmu_load()
After EPT is loaded, call kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded EPT on current pCPU.
(2) Loading vCPU to a new pCPU
Send request KVM_REQ_TLB_FLUSH to current vCPU, the request handler will call kvm_x86_flush_tlb_all() to flush all EPTs assocated with the new pCPU.
(3) When EPT mapping has changed (after removing or permission reduction) (e.g. in kvm_flush_remote_tlbs())
Send request KVM_REQ_TLB_FLUSH to all vCPUs by kicking all them off, the request handler on each vCPU will call kvm_x86_flush_tlb_all() to invalidate TLBs for all EPTs associated with the pCPU.
(4) When EPT changes only affects current vCPU, e.g. virtual apic mode changed.
Send request KVM_REQ_TLB_FLUSH_CURRENT, the request handler will call kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded EPT on current pCPU.
Only the first 3 are relevant to TDX. They are implemented as follows.
(1) kvm_mmu_load()
Only the shared EPT root is loaded in this path. The TDX module does not require any assurances about the operation, so the flush_tlb_current()->ept_sync_global() can be called as normal.
(2) vCPU load
When a vCPU migrates to a new logical processor, it has to be flushed on the *old* pCPU, unlike normal VMs where the INVEPT is executed on the new pCPU to remove stale mappings from previous usage of the same EPTP on the new pCPU. The TDX behavior comes from a requirement that a vCPU can only be associated with one pCPU at at time. This flush happens via an IPI that invokes TDH.VP.FLUSH SEAMCALL, during the vcpu_load callback.
(3) Removing a private SPTE
This is the more complicated flow. It is done in a simple way for now and is especially inefficient during VM teardown. The plan is to get a basic functional version working and optimize some of these flows later.
When a private page mapping is removed, the core MMU code calls the newly remove_external_spte() op, and flushes the TLB on all vCPUs. But TDX can’t rely on doing that for private memory, so it has it’s own process for making sure the private page is removed. This flow (TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, TDH.MEM.PAGE.REMOVE) is done withing the remove_external_spte() implementation as described in the “TDX TLB flushing” section above.
After that, back in the core MMU code, KVM will call kvm_flush_remote_tlbs*() resulting in an INVEPT. Despite that, when the vCPUs re-enter (TDH.VP.ENTER) the TD, the TDX module will do another INVEPT for its own reassurance.
Private memory teardown ----------------------- Tearing down private memory involves reclaiming three types of resources from the TDX module:
1. TD’s HKID
To reclaim the TD’s HKID, no mappings may be mapped with it.
2. Private guest pages (mapped with HKID) 3. Private page tables that map private pages (mapped with HKID)
From the TDX module’s perspective, to reclaim guest private pages they need to be prevented from be accessed via the HKID (unmapped and TLB flushed), their HKID associated cachelines need to be flushed, and they need to be marked as no longer use by the TD in the TDX modules internal tracking (PAMT)
During runtime private PTEs can be zapped as part of memslot deletion or when memory coverts from shared to private, but private page tables and HKIDs are not torn down until the TD is being destructed. The means the operation to zap private guest mapped pages needs to do the required cache writeback under the assumption that other vCPU’s may be active, but the PTs do not.
TD teardown resource reclamation -------------------------------- The code that does the TD teardown is organized such that when an HKID is reclaimed: 1. vCPUs will no longer enter the TD 2. The TLB is flushed on all CPUs 3. The HKID associated cachelines have been flushed.
So at that point most of the steps needed to reclaim TD private pages and page tables have already been done and the reclaim operation only needs to update the TDX module’s tracking of page ownership. For simplicity each operation only supports one scenario: before or after HKID reclaim. Since zapping and reclaiming private pages has to function during runtime for memslot deletion and converting from shared to private, the TD teardown is arranged so this happens before HKID reclaim. Since private page tables are never torn down during TD runtime, they can happen in a simpler and more efficient way after HKID reclaim. The private page reclaim is initiated from the kvm fd release. The callchain looks like this:
do_exit |->exit_mm --> tdx_mmu_release_hkid() was called here previously in v19 |->exit_files |->1.release vcpu fd |->2.kvm_gmem_release | |->kvm_gmem_invalidate_begin --> unmap all leaf entries, causing | zapping of private guest pages |->3.release kvmfd |->kvm_destroy_vm |->kvm_arch_pre_destroy_vm | | kvm_x86_call(vm_pre_destroy)(kvm) -->tdx_mmu_release_hkid() |->kvm_arch_destroy_vm |->kvm_unload_vcpu_mmus | kvm_destroy_vcpus(kvm) | |->kvm_arch_vcpu_destroy | |->kvm_x86_call(vcpu_free)(vcpu) | | kvm_mmu_destroy(vcpu) -->unref mirror root | kvm_mmu_uninit_vm(kvm) --> mirror root ref is 1 here, | zap private page tables | static_call_cond(kvm_x86_vm_destroy)(kvm);
show more ...
|
Revision tags: v6.14-rc5, v6.14-rc4 |
|
#
7e548b0d |
| 22-Feb-2025 |
Sean Christopherson <sean.j.christopherson@intel.com> |
KVM: VMX: Add a helper for NMI handling
Add a helper to handles NMI exit.
TDX handles the NMI exit the same as VMX case. Add a helper to share the code with TDX, expose the helper in common.h.
No
KVM: VMX: Add a helper for NMI handling
Add a helper to handles NMI exit.
TDX handles the NMI exit the same as VMX case. Add a helper to share the code with TDX, expose the helper in common.h.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-15-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
#
d5bc91e8 |
| 22-Feb-2025 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: VMX: Move emulation_required to struct vcpu_vt
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that vmx_handle_exit_irqoff() can be reused by TDX code.
No functional change i
KVM: VMX: Move emulation_required to struct vcpu_vt
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that vmx_handle_exit_irqoff() can be reused by TDX code.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-14-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
#
254e5dcd |
| 22-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: VMX: Move posted interrupt delivery code to common header
Move posted interrupt delivery code to common header so that TDX can leverage it.
No functional change intended.
Signed-off-by: Isaku
KVM: VMX: Move posted interrupt delivery code to common header
Move posted interrupt delivery code to common header so that TDX can leverage it.
No functional change intended.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: split into new patch] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Message-ID: <20250222014757.897978-4-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
#
7172c753 |
| 14-Mar-2025 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct vcpu_vt, to share the code between VMX/TDX as much as possible a
KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct vcpu_vt, to share the code between VMX/TDX as much as possible and to make TDX exit handling more VMX like.
No functional change intended.
[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]
Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-5-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12 |
|
#
3b725e97 |
| 12-Nov-2024 |
Rick Edgecombe <rick.p.edgecombe@intel.com> |
KVM: VMX: Teach EPT violation helper about private mem
Teach EPT violation helper to check shared mask of a GPA to find out whether the GPA is for private memory.
When EPT violation is triggered af
KVM: VMX: Teach EPT violation helper about private mem
Teach EPT violation helper to check shared mask of a GPA to find out whether the GPA is for private memory.
When EPT violation is triggered after TD accessing a private GPA, KVM will exit to user space if the corresponding GFN's attribute is not private. User space will then update GFN's attribute during its memory conversion process. After that, TD will re-access the private GPA and trigger EPT violation again. Only with GFN's attribute matches to private, KVM will fault in private page, map it in mirrored TDP root, and propagate changes to private EPT to resolve the EPT violation.
Relying on GFN's attribute tracking xarray to determine if a GFN is private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT violations.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073539.22056-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
#
c8563d1b |
| 12-Nov-2024 |
Sean Christopherson <sean.j.christopherson@intel.com> |
KVM: VMX: Split out guts of EPT violation to common/exposed function
The difference of TDX EPT violation is how to retrieve information, GPA, and exit qualification. To share the code to handle EPT
KVM: VMX: Split out guts of EPT violation to common/exposed function
The difference of TDX EPT violation is how to retrieve information, GPA, and exit qualification. To share the code to handle EPT violation, split out the guts of EPT violation handler so that VMX/TDX exit handler can call it after retrieving GPA and exit qualification.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20241112073528.22042-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|