#
73d7cf07 |
| 10-Jul-2025 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini: "Many patches, pretty much all of them small, that accumulated while I was on vacation.
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't GICv3
- Also gracefully fail initialising pKVM if the carveout allocation fails
- Fix the computing of the minimum MMIO range required for the host on stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are actively being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue
- Allow out-of-range/invalid Xen event channel ports when configuring IRQ routing, to avoid dictating a specific ioctl() ordering to userspace
- Conditionally reschedule when setting memory attributes to avoid soft lockups when userspace converts huge swaths of memory to/from private
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest
- Add a missing field in struct sev_data_snp_launch_start that resulted in the guest-visible workarounds field being filled at the wrong offset
- Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid VM-Fail on INVVPID
- Advertise supported TDX TDVMCALLs to userspace
- Pass SetupEventNotifyInterrupt arguments to userspace
- Fix TSC frequency underflow"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: avoid underflow when scaling TSC frequency
  KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
  KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
  KVM: arm64: Don't free hyp pages with pKVM on GICv2
  KVM: arm64: Fix error path in init_hyp_mode()
  KVM: arm64: Adjust range correctly during host stage-2 faults
  KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
  KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
  KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
  Documentation: KVM: Fix unexpected unindent warnings
  KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
  KVM: Allow CPU to reschedule while setting per-page memory attributes
  KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
  KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
  KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
  KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
  KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
  KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
|
#
5383fc05 |
| 08-Jul-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge tag 'kvm-x86-fixes-6.16-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.16-rcN
- Reject SEV{-ES} intra-host migration if one or more vCPUs are actively being created so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM.
- Use a pre-allocated, per-vCPU buffer for handling de-sparsified vCPU masks when emulating Hyper-V hypercalls to fix a "stack frame too large" issue.
- Allow out-of-range/invalid Xen event channel ports when configuring IRQ routing to avoid dictating a specific ioctl() ordering to userspace.
- Conditionally reschedule when setting memory attributes to avoid soft lockups when userspace converts huge swaths of memory to/from private.
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest.
- Add a missing field in struct sev_data_snp_launch_start that resulted in the guest-visible workarounds field being filled at the wrong offset.
- Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid VM-Fail on INVVPID.
- Advertise supported TDX TDVMCALLs to userspace.
|
Revision tags: v6.16-rc5, v6.16-rc4, v6.16-rc3 |
|
#
28224ef0 |
| 20-Jun-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
Allow userspace to advertise TDG.VP.VMCALL subfunctions that the kernel also supports. For each output register of GetTdVmCallInfo's leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported TDVMCALLs (userspace can set those blindly) and one for user-supported TDVMCALLs (userspace can set those if it knows how to handle them).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
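A minimal sketch of how a VMM might combine the two capability fields described in this commit. The struct, field, and function names are hypothetical and do not reflect the actual KVM_TDX_CAPABILITIES layout; only the rule itself (kernel-supported bits can be set blindly, user-supported bits only when the VMM knows how to handle them) comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for one GetTdVmCallInfo leaf-1 output register. */
struct tdvmcall_caps {
	uint64_t kernel_supported; /* TDVMCALLs handled entirely by KVM */
	uint64_t user_supported;   /* TDVMCALLs that need help from the VMM */
};

/*
 * Bits the VMM may advertise to the guest: every kernel-supported bit, plus
 * the user-supported bits that this particular VMM actually implements.
 */
static uint64_t tdvmcall_advertised(const struct tdvmcall_caps *caps,
				    uint64_t vmm_implemented)
{
	return caps->kernel_supported |
	       (caps->user_supported & vmm_implemented);
}
```

Bits that the VMM implements but that KVM does not list as user-supported are dropped, so the guest never sees a subfunction that neither side can service.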
|
#
4580dbef |
| 20-Jun-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
346bd8a9 |
| 26-Jun-2025 |
Takashi Iwai <tiwai@suse.de> |
Merge tag 'asoc-fix-v6.16-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.16
A small collection of fixes, the main one being a fix for resume from hibernation on AMD systems, plus a few new quirk entries for AMD systems.
|
#
e669e322 |
| 22-Jun-2025 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini: "ARM:
- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some missing synchronisation
- A small fix for the irqbypass hook fixes, tightening the check and ensuring that we only deal with MSI for both the old and the new route entry
- Rework the way the shadow LRs are addressed in a nesting configuration, plugging an embarrassing bug as well as simplifying the whole process
- Add yet another fix for the dreaded arch_timer_edge_cases selftest
RISC-V:
- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs
x86 TDX:
- Complete API for handling complex TDVMCALLs in userspace.
This was delayed because the spec lacked a way for userspace to deny supporting these calls; the new exit code is now approved"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: TDX: Exit to userspace for GetTdVmCallInfo
  KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
  KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
  KVM: arm64: VHE: Centralize ISBs when returning to host
  KVM: arm64: Remove cpacr_clear_set()
  KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()
  KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()
  KVM: arm64: Reorganise CPTR trap manipulation
  KVM: arm64: VHE: Synchronize CPTR trap deactivation
  KVM: arm64: VHE: Synchronize restore of host debug registers
  KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases
  KVM: arm64: Explicitly treat routing entry type changes as changes
  KVM: arm64: nv: Fix tracking of shadow list registers
  RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs
  RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
|
Revision tags: v6.16-rc2 |
|
#
25e8b1dd |
| 10-Jun-2025 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: TDX: Exit to userspace for GetTdVmCallInfo
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX, to allow userspace to provide information about the support of TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API.
GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>, <ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>, <Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and <Instruction.WRMSR>. They must be supported by VMM to support TDX guests.
For GetTdVmCallInfo:
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0, successful execution indicates all GHCI base TDVMCALLs listed above are supported. Update the KVM TDX document with the set of the GHCI base APIs.
- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it indicates the TDX guest is querying the supported TDVMCALLs beyond the GHCI base TDVMCALLs. Exit to userspace to let userspace set the TDVMCALL sub-function bit(s) accordingly in the leaf outputs. For TDVMCALLs that don't need support from userspace, KVM could set the bit(s) it supports by itself after returning from userspace and before entering the guest. Currently, no such TDVMCALLs are implemented, so KVM just sets the values returned from userspace.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
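The leaf handling described in this commit can be sketched as a small dispatch on r12. This is an illustration only: the enum and function names are hypothetical, and treating any leaf other than 0 or 1 as invalid is an assumption that the commit message does not spell out.

```c
#include <stdint.h>

/* Hypothetical outcomes; these are not KVM's real return values. */
enum leaf_action {
	LEAF_HANDLED_IN_KERNEL, /* leaf 0: success means the base set is supported */
	LEAF_EXIT_TO_USERSPACE, /* leaf 1: the VMM fills in the optional bits */
	LEAF_INVALID,           /* assumption: other leaves are rejected */
};

/* Decide what to do with TDG.VP.VMCALL<GetTdVmCallInfo> based on the leaf. */
static enum leaf_action get_td_vm_call_info_action(uint64_t r12)
{
	switch (r12) {
	case 0:
		return LEAF_HANDLED_IN_KERNEL;
	case 1:
		return LEAF_EXIT_TO_USERSPACE;
	default:
		return LEAF_INVALID;
	}
}
```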
|
#
cf207eac |
| 10-Jun-2025 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
Handle TDVMCALL for GetQuote to generate a TD-Quote.
GetQuote is a doorbell-like interface used by TDX guests to request VMM to generate a TD-Quote signed by a service hosting TD-Quoting Enclave operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in a shared-memory area as parameter. Host VMM can access it and queue the operation for a service hosting TD-Quoting enclave. When completed, the Quote is returned via the same shared-memory area.
KVM only checks that the GPA from the TDX guest has the shared bit set, and drops the shared bit before exiting to userspace to avoid bleeding it into KVM's exit ABI. KVM forwards the request to the userspace VMM (e.g. QEMU), which queues the operation asynchronously. KVM sets the return code according to the 'ret' field set by userspace to notify the TDX guest whether the request has been queued successfully. When the request has been queued successfully, the TDX guest can poll the status field in the shared-memory area to check whether Quote generation has completed. When completed, the generated Quote is returned via the same buffer.
Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is required to handle the KVM exit reason as the initial support for TDX, by reentering KVM to ensure that the TDVMCALL is complete. While at it, add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
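The shared-bit check that this commit describes for GetQuote might look roughly like the sketch below. The function name is hypothetical, and the position of the shared bit is passed in as a parameter because it depends on the guest's configured GPA width; this is not KVM's actual helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check that a guest-supplied GPA has the shared bit set and, if so, return
 * the GPA with that bit cleared so it does not leak into the exit ABI.
 * 'shared_bit' must be less than 64.
 */
static bool gpa_strip_shared(uint64_t gpa, unsigned int shared_bit,
			     uint64_t *out)
{
	uint64_t mask = 1ULL << shared_bit;

	if (!(gpa & mask))
		return false; /* private GPA: reject the request */

	*out = gpa & ~mask;
	return true;
}
```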
|
#
b5aafcb4 |
| 10-Jun-2025 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and return it for unimplemented TDVMCALL subfunctions.
Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not implemented is vague, because TDX guests can't tell whether the error is due to the subfunction not being supported or to an invalid input to the subfunction. The new GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.
Before the change, for common guest implementations, a TDX guest that receives TDVMCALL_STATUS_INVALID_OPERAND faces two cases:
1. Some operand is invalid. It could change the operand to another value and retry.
2. The subfunction is not supported.
For case 1, an invalid operand usually indicates a guest implementation bug. Since the TDX guest can't tell which case applies, the best practice for handling TDVMCALL_STATUS_INVALID_OPERAND is to stop calling that leaf, treating the failure as fatal if the TDVMCALL is essential or ignoring it if the TDVMCALL is optional.
With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to old TDX guests that do not know about it, but the guest is expected to take the same action as for TDVMCALL_STATUS_INVALID_OPERAND. Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND specifically; for example, Linux just checks for success.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
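The compatibility argument in this commit can be illustrated with a guest that only checks for success. Everything below is a hypothetical model: the names are invented, and the numeric status values are placeholders chosen for the example rather than encodings taken from the GHCI spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration; see the GHCI spec for the real ones. */
#define STATUS_SUCCESS             0x0ULL
#define STATUS_INVALID_OPERAND     0x8000000000000000ULL
#define STATUS_SUBFUNC_UNSUPPORTED 0x8000000000000003ULL

enum guest_reaction { GUEST_CONTINUE, GUEST_IGNORE_FAILURE, GUEST_FATAL };

/*
 * An older guest only distinguishes success from failure, so the new status
 * code leads to the same reaction as the old one: stop using the leaf, and
 * treat the failure as fatal only if the TDVMCALL is essential.
 */
static enum guest_reaction old_guest_react(uint64_t status, bool essential)
{
	if (status == STATUS_SUCCESS)
		return GUEST_CONTINUE;

	return essential ? GUEST_FATAL : GUEST_IGNORE_FAILURE;
}
```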
|
Revision tags: v6.16-rc1 |
|
#
43db1111 |
| 29-May-2025 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini: "As far as x86 goes this pull request "only" includes TDX host support.
Quotes are appropriate because (at 6k lines and 100+ commits) it is much bigger than the rest, which will come later this week and consists mostly of bugfixes and selftests. s390 changes will also come in the second batch.
ARM:
- Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance.
- Enable nested virtualisation support on systems that support it, though it is disabled by default.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes.
- Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture.
- Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a guest.
- Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion.
- and the usual random cleanups.
LoongArch:
- Don't flush tlb if the host supports hardware page table walks.
- Add KVM selftests support.
RISC-V:
- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
x86:
- Initial support for TDX in KVM.
This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module.
This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit 7bcf7246c42a ('Merge branch 'kvm-tdx-finish-initial' into HEAD')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
  x86/tdx: mark tdh_vp_enter() as __flatten
  Documentation: virt/kvm: remove unreferenced footnote
  RISC-V: KVM: lock the correct mp_state during reset
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: Remove scounteren initialization
  KVM: RISC-V: remove unnecessary SBI reset state
  ...
|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
#
fd02aa45 |
| 19-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
- Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
Revision tags: v6.14-rc7 |
|
#
7bcf7246 |
| 12-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-finish-initial' into HEAD
This patch ties the remaining loose ends and finally enables TDX guests to run inside KVM. It implements handling of EPT violation/misconfig and of several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); it also adds a bunch of wrappers in vmx/main.c to ignore operations not supported by TDX guests(*)
Finally, it introduces documentation for the new APIs that have been added along the way.
(*) access to CPU state, VMX preemption timer, accesses to TSC offset or multiplier, LMCE enable/disable, hypercall patching.
|
#
9913212b |
| 12-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-interrupts' into HEAD
Introduces support for interrupt handling for TDX guests, including virtual interrupt injection and VM-Exits caused by vectored events.
Injection
=========
TDX supports non-NMI interrupt injection only by posted interrupt. Posted interrupt descriptors (PIDs) are allocated in shared memory, so KVM can update them directly. To post pending interrupts in the PID, KVM can generate a self-IPI with the notification vector prior to TD entry. TDX guest status is protected, so KVM can't get the interrupt status of a TDX guest. For now, assume the interrupt is always allowed. A later patch set will let TDX guests call TDVMCALL with HLT, which passes the interrupt block flag, so that whether an interrupt is allowed in HLT will be checked against the interrupt block flag.
For NMIs, KVM can request the TDX module to inject an NMI into a TDX vCPU by setting the PEND_NMI TDVPS field to 1. Following that, KVM can call TDH.VP.ENTER to run the vCPU and the TDX module will attempt to inject the NMI as soon as possible. The PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in the TDX module. Also, TDX doesn't allow KVM to request an NMI-window exit directly. When there is already one NMI pending in the TDX module, i.e. it has not been delivered to the TDX guest yet, and there is an NMI pending in KVM, collapse the pending NMI in KVM into the one pending in the TDX module. Such a collapse is OK considering that on x86 bare metal multiple NMIs could collapse into one NMI, e.g. when NMI is blocked by SMI. It is the OS's responsibility to poll all NMI sources in the NMI handler to avoid missing some NMI events. More details can be found in the changelog of the patch "KVM: TDX: Implement methods to inject NMI".
TDX doesn't support system-management mode (SMM) and system-management interrupt (SMI) in guest TDs because TDX module doesn't provide a way for VMM to inject SMI into guest TD or switch guest vCPU mode into SMM. SMI requests return -ENOTTY similar to CONFIG_KVM_SMM=n. Likewise, INIT and SIPI events are not used and are blocked for TDX guests; TDX defines its own vCPU creation and initialization sequence, which is done on the host via SEAMCALLs at TD build time.
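The NMI collapsing rule described above can be modelled with a few lines of C. This is a simplified sketch, not the code from the series: the struct and function names are hypothetical, and collapsing every NMI queued in KVM into the single PEND_NMI bit is an assumption made to keep the model small.

```c
#include <stdbool.h>

/* Hypothetical model of the NMI state that KVM tracks for a TDX vCPU. */
struct nmi_state {
	bool pend_nmi;            /* models the 1-bit PEND_NMI TDVPS field */
	unsigned int kvm_pending; /* NMIs still queued on the KVM side */
};

/*
 * Before TD entry, move NMIs queued in KVM into the TDX module. The module
 * can hold only one pending NMI, so anything queued while PEND_NMI is
 * already set collapses into that single pending NMI.
 */
static void nmi_sync_to_module(struct nmi_state *s)
{
	if (s->kvm_pending == 0)
		return;

	s->pend_nmi = true; /* either newly set or already set */
	s->kvm_pending = 0; /* collapsed into the one pending NMI */
}
```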
VM-exit for external events
===========================
Similar to the VMX case, external interrupts and NMIs are handled with interrupts off: in the .handle_exit_irqoff() callback for external interrupts and in the noinstr region for NMIs. Just like VMX, NMI remains blocked after exiting from the TDX guest for NMI-induced exits.
Machine check, which is handled in the .handle_exit_irqoff() callback, is the only exception type KVM handles for TDX guests. Other exceptions in TDX guests can't be intercepted because TDX guest state is protected, and the TDX VMM isn't supposed to handle them. Exit to userspace with KVM_EXIT_EXCEPTION if an unexpected exception occurs.
Host SMIs also cause an exit to KVM. This is needed because in SEAM root mode (TDX module) all interrupts are blocked. An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI because I/O instructions inside TDX guest trigger #VE and TDX guest needs to use TDVMCALL to request VMM to do I/O emulation. The only case of interest for "other SMI" is an #MC occurring in the guest when MCE-SMI morphing is enabled in the host firmware. Such "MSMI" is marked by having bit 0 set in the exit qualification; MSMI exits are fatal for the TD and are eventually handled by the kernel machine check handler (7911f14 x86/mce: Implement recovery for errors in TDX/SEAM non-root mode), which marks the page as poisoned. It is not possible right now to pass machine check exceptions to the guest.
SMIs other than machine check SMIs are handled just by leaving SEAM root mode and KVM doesn't need to do anything.
|
#
4d2dc9a2 |
| 12-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-userspace-exit' into HEAD
Introduces support for VM exits that are forwarded to the host VMM in userspace. These are initiated from the TDCALL exit code; although these userspace exits have the same TDX exit code, they result in several different types of exits to userspace.
When a guest TD issues a TDVMCALL, it exits to VMM with a new exit reason. The arguments from the guest TD and return values from the VMM are passed through the guest registers. The ABI details for the guest TD hypercalls are specified in the TDX GHCI specification.
There are two types of hypercalls defined in the GHCI specification:
- Standard TDVMCALLs: When input of R10 from guest TD is set to 0, it indicates that the TDVMCALL sub-function used in R11 is defined in GHCI specification.
- Vendor-Specific TDVMCALLs: When input of R10 from guest TD is non-zero, it indicates a vendor-specific TDVMCALL. KVM hypercalls from the guest follow this interface, using R10 as the KVM hypercall number and R11-R14 as the 4 arguments. The error code is returned in R10.
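The split between the two hypercall types can be sketched as a check on R10. The struct and enum below are hypothetical and only model the register convention described in the two bullets above; they are not KVM's internal types.

```c
#include <stdint.h>

enum tdvmcall_kind { TDVMCALL_STANDARD, TDVMCALL_VENDOR_SPECIFIC };

/* Guest registers that carry TDVMCALL arguments in this model. */
struct tdvmcall_regs {
	uint64_t r10, r11, r12, r13, r14;
};

/*
 * R10 == 0: standard TDVMCALL, with the GHCI sub-function number in R11.
 * R10 != 0: vendor-specific TDVMCALL; for KVM hypercalls, R10 is the
 * hypercall number and R11-R14 are its four arguments.
 */
static enum tdvmcall_kind tdvmcall_classify(const struct tdvmcall_regs *regs)
{
	return regs->r10 == 0 ? TDVMCALL_STANDARD : TDVMCALL_VENDOR_SPECIFIC;
}
```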
This series includes basic standard TDVMCALLs that map to existing exit reasons:
- TDG.VP.VMCALL<MapGPA> reuses exit reason KVM_EXIT_HYPERCALL with the hypercall number KVM_HC_MAP_GPA_RANGE.
- TDG.VP.VMCALL<ReportFatalError> reuses exit reason KVM_EXIT_SYSTEM_EVENT with a new event type KVM_SYSTEM_EVENT_TDX_FATAL.
- TDG.VP.VMCALL<Instruction.IO> reuses exit reason KVM_EXIT_IO.
- TDG.VP.VMCALL<#VE.RequestMMIO> reuses exit reason KVM_EXIT_MMIO.
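The mapping listed above can be written down as a small lookup table. The sketch uses plain strings so that it builds without kernel headers; the table and function names are hypothetical, and a real implementation would switch on the sub-function number from R11 rather than compare names.

```c
#include <stddef.h>
#include <string.h>

/* One row per standard TDVMCALL that reuses an existing KVM exit reason. */
struct tdvmcall_route {
	const char *subfunction;
	const char *exit_reason;
};

static const struct tdvmcall_route tdvmcall_routes[] = {
	{ "MapGPA",           "KVM_EXIT_HYPERCALL" },    /* KVM_HC_MAP_GPA_RANGE */
	{ "ReportFatalError", "KVM_EXIT_SYSTEM_EVENT" }, /* KVM_SYSTEM_EVENT_TDX_FATAL */
	{ "Instruction.IO",   "KVM_EXIT_IO" },
	{ "#VE.RequestMMIO",  "KVM_EXIT_MMIO" },
};

/* Return the reused exit reason, or NULL if the series does not route it. */
static const char *tdvmcall_exit_reason(const char *subfunction)
{
	size_t i;

	for (i = 0; i < sizeof(tdvmcall_routes) / sizeof(tdvmcall_routes[0]); i++) {
		if (strcmp(tdvmcall_routes[i].subfunction, subfunction) == 0)
			return tdvmcall_routes[i].exit_reason;
	}
	return NULL;
}
```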
Notably, handling for TDG.VP.VMCALL<SetupEventNotifyInterrupt> and TDG.VP.VMCALL<GetQuote> is not included yet.
|
#
77ab80c6 |
| 12-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-enter-exit' into HEAD
This series introduces callbacks to facilitate the entry of a TD VCPU and the corresponding save/restore of host state.
A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages the save/restore of guest state and, in conjunction with the SEAMCALL interface, handles certain aspects of host state. However, there are specific elements of the host state that require additional attention, as detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.
TDX is quite different from VMX in this regard. For VMX, the host VMM is heavily involved in restoring, managing and saving guest CPU state, whereas for TDX this is handled by the TDX Module. In that way, the TDX Module can protect the confidentiality and integrity of TD CPU state.
The TDX Module does not save/restore all host CPU state because the host VMM can do it more efficiently and selectively. CPU state referred to below is host CPU state. Often values are already held in memory so no explicit save is needed, and restoration may not be needed if the kernel is not using a feature.
TDX does not support PAUSE-loop exiting. According to the TDX Module Base arch. spec., hypercalls are expected to be used instead. Note that the Linux TDX guest supports existing hypercalls via TDG.VP.VMCALL.
This series requires TDX module 1.5.06.00.0744, or later, due to removal of the workarounds for the lack of the NO_RBP_MOD feature required by the kernel. NO_RBP_MOD is now required.
|
Revision tags: v6.14-rc6 |
|
#
fcbe3482 |
| 06-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-mmu' into HEAD
This series picks up from commit 86eb1aef7279 ("Merge branch 'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on changes to the generic x86 parts of the KVM MMU code, and adds support for TDX's secure page tables to the Intel side of KVM.
Confidential computing solutions have concepts of private and shared memory. Often the guest accesses either private or shared memory via a bit in the guest PTE. Solutions like SEV treat this bit more like a permission bit, whereas solutions like TDX and ARM CCA treat it more like a GPA bit. In the latter case, the host maps private memory in one half of the address space and shared memory in the other. For TDX these two halves are mapped by different EPT roots. The private half (also called Secure EPT in Intel documentation) is managed by the privileged TDX Module. The shared half is managed by the untrusted part of the VMM (KVM).
In addition to the separate roots for private and shared, there are limitations on what operations can be done on the private side. Like SNP, TDX wants to protect against protected memory being reset or otherwise scrambled by the host. In order to prevent this, the guest has to take specific action to “accept” memory after changes are made by the VMM to the private EPT. This prevents the VMM from performing many of the usual memory management operations that involve zapping and refaulting memory. The private memory is also always RWX and cannot have VMM-specified cache attributes applied.
TDX memory implementation
=========================
Creating shared EPT
-------------------
Shared EPT handling is relatively simple compared to private memory. It is managed from within KVM. The main differences between shared EPT and EPT in a normal VM are that the root is set with a TDVMCS field (via SEAMCALL), and that the GFN specified in the memslot perspective needs to be mapped at an offset in the EPT. For the former, this series plumbs in the load_mmu_pgd() operation to the correct field for the shared EPT. For the latter, previous patches have laid the groundwork for mapping so-called “direct roots” at an offset specified in kvm->arch.gfn_direct_bits.
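The offset mapping for the shared EPT can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes that the value in kvm->arch.gfn_direct_bits is a high bit (or mask) that never overlaps the GFNs used by memslots, so OR-ing it in is the same as adding the offset.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

/*
 * Translate a GFN as seen by memslots into the GFN at which it is mapped in
 * the shared ("direct") EPT root.
 */
static gfn_t shared_ept_gfn(gfn_t memslot_gfn, gfn_t gfn_direct_bits)
{
	return memslot_gfn | gfn_direct_bits;
}
```

With gfn_direct_bits equal to zero, as for a normal VM in this model, the function is the identity.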
Creating private EPT -------------------- In previous patches, the concept of “mirrored roots” were introduced. Such roots maintain a KVM side “mirror” of the “external” EPT by keeping an unmapped EPT tree within the KVM MMU code. When changing these mirror EPTs, the KVM MMU code calls out via x86_ops to update the external EPT. This series adds implementations for these “external” ops for TDX to create and manage “private” memory via TDX module APIs.
Managing S-EPT with the TDX Module
----------------------------------
The TDX module allows the TD’s private memory to be managed via SEAMCALLs. This management consists of operating on two internal elements:
1. The private EPT, which the TDX module calls the S-EPT. It maps the actual mapped, private half of the GPA space using an EPT tree.
2. The HKID, which represents private encryption keys used for encrypting TD memory. The CPU doesn’t guarantee cache coherency between these encryption keys, so memory that is encrypted with one of these keys needs to be reclaimed for use on the host in special ways.
This series will primarily focus on the SEAMCALLs for managing the private EPT. Consideration of the HKID is needed for when the TD is torn down.
Populating TDX Private memory
-----------------------------
TDX allows the EPT mapping the TD's private memory to be modified in limited ways. There are SEAMCALLs for building and tearing down the EPT tree, as well as mapping pages into the private EPT.
As for building and tearing down the EPT page tables, it is relatively simple. There are SEAMCALLs for installing and removing them. However, the current implementation only supports adding private EPT page tables, and leaves them installed for the lifetime of the TD. For teardown, the details are discussed in a later section.
As for populating and zapping private SPTE, there are SEAMCALLs for this as well. The zapping case will be described in detail later. As for the populating case, there are two categories: before TD is finalized and after TD is finalized. Both of these scenarios go through the TDP MMU map path. The changes done previously to introduce “mirror” and “external” page tables handle directing SPTE installation operations through the set_external_spte() op.
In the “after” case, the TDX set_external_spte() handler simply calls a SEAMCALL (TDH.MEM.PAGE.AUG).
For the before case, it is a bit more complicated as it requires both setting the private SPTE *and* copying in the initial contents of the page at the same time. For TDX this is done via the KVM_TDX_INIT_MEM_REGION ioctl, which is effectively the kvm_gmem_populate() operation.
For SNP, the private memory can be pre-populated first, and faulted in later like normal. But for TDX these both need to happen at the same time, and the setting of the private SPTE needs to happen in a different way than the “after” case described above. It needs to use the TDH.MEM.PAGE.ADD SEAMCALL, which does both the copying in of the data and the setting of the SPTE.
Without extensive modification to the fault path, it’s not possible to use this SEAMCALL from the set_external_spte() handler because the source page for the data to be copied in is not known deep down in this callchain. So instead the post-populate callback does a three step process.
1. Pre-fault the memory into the mirror EPT, but have the set_external_spte() not make any SEAMCALLs.
2. Check that the page is still faulted into the mirror EPT under read mmu_lock that is held over this and the following step.
3. Call TDH.MEM.PAGE.ADD with the HPA of the page to copy data from, and the private page installed in the mirror EPT to use for the private mapping.
The scheme involves some assumptions about the operations that might operate on the mirrored EPT before the VM is finalized. It assumes that no other memory will be faulted into the mirror EPT that is not also added via TDH.MEM.PAGE.ADD. If this is violated, the KVM MMU may not see private memory faulted in there later and so not make the proper external spte callbacks. To check this, KVM enforces that the number of pre-faulted pages is the same as the number of pages added via KVM_TDX_INIT_MEM_REGION.
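The three steps and the final count check can be sketched as a toy model. This is only an illustration under stated assumptions: the class, its method names and the nr_premapped counter are hypothetical stand-ins and do not claim to match KVM's data structures, and add_page() stands in for the SEAMCALL that copies the data and installs the mapping.

```python
class TdBuildModel:
    """Toy model of pre-finalization population; not KVM code."""

    def __init__(self):
        self.mirror = set()       # GFNs present in the mirror EPT
        self.contents = {}        # GFN -> initial contents copied into the TD
        self.nr_premapped = 0     # pre-faulted pages not yet added
        self.finalized = False

    def prefault(self, gfn):
        # Step 1: fault into the mirror EPT only; no SEAMCALL is made here.
        if gfn not in self.mirror:
            self.mirror.add(gfn)
            self.nr_premapped += 1

    def add_page(self, gfn, src):
        # Step 2: the page must still be present in the mirror EPT.
        if gfn not in self.mirror:
            raise LookupError("page is not faulted into the mirror EPT")
        if gfn in self.contents:
            raise ValueError("page was already added")
        # Step 3: copy the source data and install the private mapping.
        self.contents[gfn] = bytes(src)
        self.nr_premapped -= 1

    def finalize(self):
        # Every pre-faulted page must also have been added.
        if self.nr_premapped != 0:
            return False
        self.finalized = True
        return True
```

In this model, finalize() fails as long as some page was pre-faulted without being added, which mirrors the consistency check described above.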
TDX TLB flushing
----------------
For TDX, TLB flushing needs to happen in different ways depending on whether private and/or shared EPT needs to be flushed. Shared EPT can be flushed like normal EPT with INVEPT. To avoid reading TD's EPTP out from TDX module, this series flushes shared EPT with type 2 INVEPT. Private TLB entries can be flushed this way too (via type 2). However, since the TDX module needs to enforce some guarantees around which private memory is mapped in the TD, it requires these operations to be done in special ways for private memory.
For flushing private memory, two methods are possible. The simple one is the TDH.VP.FLUSH SEAMCALL; this flush is of the INVEPT type 1 variety (i.e. mappings associated with the TD).
The second method is part of a sequence of SEAMCALLs for removing a guest page. The sequence looks like:
1. TDH.MEM.RANGE.BLOCK - Remove RWX bits from entry (similar to KVM’s zap).
2. TDH.MEM.TRACK - Increment the TD TLB epoch, which is a per-TD counter
3. Kick off all vCPUs - In order to force them to have to re-enter.
4. TDH.MEM.PAGE.REMOVE - Actually remove the page and make it available for other use.
5. TDH.VP.ENTER - On re-entering TDX module will see the epoch is incremented and flush the TLB.
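The five-step sequence above can be modelled roughly as follows. This is a hedged sketch, not the TDX module's implementation: the classes, the per-vCPU seen_epoch field and the checks in remove() are assumptions chosen to make the ordering visible.

```python
class Vcpu:
    def __init__(self):
        self.in_guest = False
        self.seen_epoch = 0
        self.tlb = set()          # GFNs with cached translations


class TdTlbModel:
    """Toy model of block/track/kick/remove/enter; not the TDX module."""

    def __init__(self, nr_vcpus):
        self.epoch = 0            # per-TD TLB epoch counter
        self.mapped = set()
        self.blocked = set()
        self.vcpus = [Vcpu() for _ in range(nr_vcpus)]

    def enter(self, vcpu):
        # Step 5: on entry, a stale epoch triggers a TLB flush.
        if vcpu.seen_epoch != self.epoch:
            vcpu.tlb.clear()
            vcpu.seen_epoch = self.epoch
        vcpu.in_guest = True

    def access(self, vcpu, gfn):
        if not vcpu.in_guest:
            raise RuntimeError("vCPU is not running")
        if gfn in vcpu.tlb:
            return True           # stale TLB entries still translate
        if gfn in self.mapped and gfn not in self.blocked:
            vcpu.tlb.add(gfn)
            return True
        return False

    def block(self, gfn):         # step 1
        self.blocked.add(gfn)

    def track(self):              # step 2
        self.epoch += 1

    def kick_all(self):           # step 3
        for vcpu in self.vcpus:
            vcpu.in_guest = False

    def remove(self, gfn):        # step 4
        if gfn not in self.blocked:
            raise RuntimeError("page must be blocked first")
        if any(v.in_guest and v.seen_epoch != self.epoch for v in self.vcpus):
            raise RuntimeError("a vCPU may still hold a stale translation")
        self.mapped.discard(gfn)
        self.blocked.discard(gfn)
```

In the model, removing a page without kicking the vCPUs is rejected, and a vCPU that re-enters after the epoch bump starts with an empty TLB.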
On top of this, during TDX module init TDH.SYS.LP.INIT (which is used to online a CPU for TDX usage) invokes INVEPT to flush all mappings in the TLB.
During runtime, for normal (TDP MMU, non-nested) guests, KVM will do TLB flushes in four scenarios:
(1) kvm_mmu_load()
After EPT is loaded, call kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded EPT on current pCPU.
(2) Loading vCPU to a new pCPU
Send request KVM_REQ_TLB_FLUSH to current vCPU, the request handler will call kvm_x86_flush_tlb_all() to flush all EPTs associated with the new pCPU.
(3) When EPT mapping has changed (after removing or permission reduction) (e.g. in kvm_flush_remote_tlbs())
Send request KVM_REQ_TLB_FLUSH to all vCPUs by kicking them all off, the request handler on each vCPU will call kvm_x86_flush_tlb_all() to invalidate TLBs for all EPTs associated with the pCPU.
(4) When an EPT change only affects the current vCPU, e.g. virtual apic mode changed.
Send request KVM_REQ_TLB_FLUSH_CURRENT, the request handler will call kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded EPT on current pCPU.
Only the first 3 are relevant to TDX. They are implemented as follows.
(1) kvm_mmu_load()
Only the shared EPT root is loaded in this path. The TDX module does not require any assurances about the operation, so the flush_tlb_current()->ept_sync_global() can be called as normal.
(2) vCPU load
When a vCPU migrates to a new logical processor, it has to be flushed on the *old* pCPU, unlike normal VMs where the INVEPT is executed on the new pCPU to remove stale mappings from previous usage of the same EPTP on the new pCPU. The TDX behavior comes from a requirement that a vCPU can only be associated with one pCPU at a time. This flush happens via an IPI that invokes TDH.VP.FLUSH SEAMCALL, during the vcpu_load callback.
(3) Removing a private SPTE
This is the more complicated flow. It is done in a simple way for now and is especially inefficient during VM teardown. The plan is to get a basic functional version working and optimize some of these flows later.
When a private page mapping is removed, the core MMU code calls the newly added remove_external_spte() op, and flushes the TLB on all vCPUs. But TDX can’t rely on doing that for private memory, so it has its own process for making sure the private page is removed. This flow (TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, TDH.MEM.PAGE.REMOVE) is done within the remove_external_spte() implementation as described in the “TDX TLB flushing” section above.
After that, back in the core MMU code, KVM will call kvm_flush_remote_tlbs*() resulting in an INVEPT. Despite that, when the vCPUs re-enter (TDH.VP.ENTER) the TD, the TDX module will do another INVEPT for its own reassurance.
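The vCPU load case (2) above, where the flush runs on the old pCPU because a vCPU may be associated with only one pCPU at a time, can be sketched as below. The class and its fields are hypothetical; flush_on() merely stands in for the IPI that invokes TDH.VP.FLUSH.

```python
class TdVcpuModel:
    """Toy model of the one-pCPU-at-a-time association; not KVM code."""

    def __init__(self):
        self.assoc_pcpu = None    # pCPU this vCPU is currently associated with
        self.flushed_on = []      # pCPUs on which a flush was performed

    def flush_on(self, pcpu):
        # Stands in for the IPI that runs TDH.VP.FLUSH on `pcpu`.
        self.flushed_on.append(pcpu)
        self.assoc_pcpu = None

    def load(self, pcpu):
        # Migration: break the old association on the *old* pCPU first.
        if self.assoc_pcpu is not None and self.assoc_pcpu != pcpu:
            self.flush_on(self.assoc_pcpu)
        self.assoc_pcpu = pcpu
```

Reloading on the same pCPU performs no flush in this model; only a change of pCPU does.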
Private memory teardown
-----------------------
Tearing down private memory involves reclaiming three types of resources from the TDX module:
1. TD’s HKID
To reclaim the TD’s HKID, no mappings may be mapped with it.
2. Private guest pages (mapped with HKID)
3. Private page tables that map private pages (mapped with HKID)
From the TDX module’s perspective, to reclaim guest private pages they need to be prevented from being accessed via the HKID (unmapped and TLB flushed), their HKID associated cachelines need to be flushed, and they need to be marked as no longer in use by the TD in the TDX module’s internal tracking (PAMT).
During runtime private PTEs can be zapped as part of memslot deletion or when memory converts from shared to private, but private page tables and HKIDs are not torn down until the TD is being destructed. This means the operation to zap private guest mapped pages needs to do the required cache writeback under the assumption that other vCPUs may be active, but the operation for page tables does not.
TD teardown resource reclamation
--------------------------------
The code that does the TD teardown is organized such that when an HKID is reclaimed:
1. vCPUs will no longer enter the TD
2. The TLB is flushed on all CPUs
3. The HKID associated cachelines have been flushed.
So at that point most of the steps needed to reclaim TD private pages and page tables have already been done and the reclaim operation only needs to update the TDX module’s tracking of page ownership. For simplicity each operation only supports one scenario: before or after HKID reclaim. Since zapping and reclaiming private pages has to function during runtime for memslot deletion and converting from shared to private, the TD teardown is arranged so this happens before HKID reclaim. Since private page tables are never torn down during TD runtime, they can happen in a simpler and more efficient way after HKID reclaim. The private page reclaim is initiated from the kvm fd release. The callchain looks like this:
do_exit
  |->exit_mm --> tdx_mmu_release_hkid() was called here previously in v19
  |->exit_files
      |->1.release vcpu fd
      |->2.kvm_gmem_release
      |     |->kvm_gmem_invalidate_begin --> unmap all leaf entries, causing
      |                                      zapping of private guest pages
      |->3.release kvmfd
            |->kvm_destroy_vm
                |->kvm_arch_pre_destroy_vm
                |  |  kvm_x86_call(vm_pre_destroy)(kvm) -->tdx_mmu_release_hkid()
                |->kvm_arch_destroy_vm
                    |->kvm_unload_vcpu_mmus
                    |  kvm_destroy_vcpus(kvm)
                    |   |->kvm_arch_vcpu_destroy
                    |   |->kvm_x86_call(vcpu_free)(vcpu)
                    |   |  kvm_mmu_destroy(vcpu) -->unref mirror root
                    |  kvm_mmu_uninit_vm(kvm) --> mirror root ref is 1 here,
                    |                             zap private page tables
                    | static_call_cond(kvm_x86_vm_destroy)(kvm);
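The ordering constraints described in this section (private pages are zapped before HKID reclaim, private page tables are reclaimed after it) can be expressed as a small state model. This is a sketch under assumptions: the class and method names are hypothetical, and the check in release_hkid() is a modelling choice based on the statement that no mappings may remain mapped with the HKID.

```python
class TdTeardownModel:
    """Toy model of teardown ordering; not KVM or TDX module code."""

    def __init__(self, private_pages, page_tables):
        self.hkid_assigned = True
        self.private_pages = set(private_pages)
        self.page_tables = set(page_tables)

    def zap_private_page(self, gfn):
        # Runtime-capable path: only supported before HKID reclaim.
        if not self.hkid_assigned:
            raise RuntimeError("private pages are zapped before HKID reclaim")
        self.private_pages.discard(gfn)

    def release_hkid(self):
        # Modelling assumption: no guest page may still be mapped.
        if self.private_pages:
            raise RuntimeError("guest pages are still mapped with the HKID")
        self.hkid_assigned = False

    def reclaim_page_table(self, table):
        # Simpler path: only supported after HKID reclaim.
        if self.hkid_assigned:
            raise RuntimeError("page tables are reclaimed after HKID reclaim")
        self.page_tables.discard(table)
```

Following the call chain above, the model accepts zap, then HKID release, then page table reclaim, and rejects any other order.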
|
#
0d20742b |
| 06-Mar-2025 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-tdx-initialization' into HEAD
This series kicks off the actual interaction of KVM with the TDX module. This series encompasses the basic setup for using the TDX module from KVM, and the creation of TD VMs and vCPUs.
The TDX Module is a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). Loading it is mostly handled outside of KVM by the core kernel. Once it’s loaded KVM can interact with the TDX Module via a new instruction called SEAMCALL to virtualize TD guests. This instruction can be used to make various types of seamcalls, with names organized into a hierarchy. The format is TDH.[AREA].[ACTION], where “TDH” stands for “Trust Domain Host”, and differentiates from another set of calls that can be done by the guest (“TDG”). The KVM relevant areas of SEAMCALLs are:
SYS – TDX module management, static metadata reading.
MNG – TD management. VM scoped things that operate on a TDX module controlled structure called the TDCS.
VP – vCPU management. vCPU scoped things that operate on TDX module controlled structures called the TDVPS.
PHYMEM – Operations related to physical memory management (page reclaiming, cache operations, etc).
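To make the naming hierarchy concrete, here is a small sketch that splits a call name into side, area and action. The function name and the returned tuple layout are illustrative assumptions; only the TDH/TDG prefixes and the [AREA].[ACTION] structure come from the text.

```python
SIDES = {"TDH": "host", "TDG": "guest"}


def parse_tdx_call(name: str):
    """Split e.g. 'TDH.MEM.PAGE.AUG' into ('host', 'MEM', 'PAGE.AUG')."""
    parts = name.split(".")
    if len(parts) < 3 or parts[0] not in SIDES:
        raise ValueError(f"not a TDH/TDG call name: {name!r}")
    side, area = SIDES[parts[0]], parts[1]
    action = ".".join(parts[2:])   # actions may themselves contain dots
    return side, area, action
```

For example, parse_tdx_call("TDH.VP.FLUSH") yields the host side, the VP area and the FLUSH action.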
This series introduces some TDX specific KVM APIs and stops short of fully “finalizing” the creation of a TD VM. The part of initializing a guest where initial private memory is loaded is left to a separate MMU related series.
|
Revision tags: v6.14-rc5 |
|
#
90fe64a9 |
| 24-Feb-2025 |
Yan Zhao <yan.y.zhao@intel.com> |
KVM: TDX: Always honor guest PAT on TDX enabled guests
Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by making self-snoop feature a hard dependency for TDX and making quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled.
The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is always forced to WB now.
Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices, which would cause mirrored EPTs for TDs to be zapped, leading to the TDX-module-managed private EPT being incorrectly zapped.
As a new feature, TDX always comes with support for self-snoop, and does not have to worry about unmodifiable but buggy guests. So, simply ignore KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already does.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224071039.31511-1-yan.y.zhao@intel.com>
[Only apply to TDX guests. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
26eab9ae |
| 27-Feb-2025 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: TDX: Enable guest access to MTRR MSRs
Allow TDX guests to access MTRR MSRs as KVM does for normal VMs, i.e., KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory types.
TDX module exposes MTRR feature to TDX guests unconditionally. KVM needs to support MTRR MSRs accesses for TDX guests to match the architectural behavior.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-19-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
04733836 |
| 27-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
Implement TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value is zero, return success code and zero in output registers.
TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to enumerate which TDG.VP.VMCALL sub leaves are supported. This hypercall is for future enhancement of the Guest-Host-Communication Interface (GHCI) specification. The GHCI version of 344426-001US defines it to require input R12 to be zero and to return zero in output registers, R11, R12, R13, and R14 so that guest TD enumerates no enhancement.
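The behaviour described above can be sketched as a pure function over the input register. Several details are assumptions made for illustration and are not stated in the commit text: that the status code is reported in R10, the numeric value of the invalid-operand status, and that a non-zero R12 is rejected.

```python
TDVMCALL_STATUS_SUCCESS = 0x0
TDVMCALL_STATUS_INVALID_OPERAND = 0x8000000000000000   # assumed value


def get_td_vm_call_info(r12: int) -> dict:
    """Toy handler for GetTdVmCallInfo; returns the output registers."""
    if r12 != 0:
        # Assumption: the text only defines the zero input, so reject others.
        return {"r10": TDVMCALL_STATUS_INVALID_OPERAND}
    # Zero input: success, and zeros so the guest enumerates no enhancement.
    return {
        "r10": TDVMCALL_STATUS_SUCCESS,
        "r11": 0,
        "r12": 0,
        "r13": 0,
        "r14": 0,
    }
```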
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
9fc3402a |
| 27-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: TDX: Enable guest access to LMCE related MSRs
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL.
MCE and MCA are advertised via cpuid based on the TDX module spec. Guest kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the platform or not. If LMCE is opted-in by the platform, guest kernel can access IA32_MCG_EXT_CTL to enable/disable LMCE.
Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE. E.g., Linux guest will treat the failure as a #GP(0).
Userspace VMM may not opt in to LMCE by default, e.g., QEMU disables it by default; "-cpu lmce=on" is needed on the QEMU command line to opt in.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
081385db |
| 27-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and wire up KVM backend functions.
For complete_emulated_msr() callback, instead of injecting #GP on error, implement tdx_complete_emulated_msr() to set return code on error. Also set return value on MSR read according to the values from kvm x86 registers.
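The completion behaviour described above might look roughly like the sketch below. It is not the kernel function: the register assignments (status in R10, read value in R11), the status constants and the function signature are assumptions made for illustration. The read value is composed from EDX:EAX, matching "the values from kvm x86 registers".

```python
TDVMCALL_STATUS_SUCCESS = 0x0
TDVMCALL_STATUS_INVALID_OPERAND = 0x8000000000000000   # assumed value


def complete_emulated_msr(is_read: bool, err: int, eax: int, edx: int) -> dict:
    """Toy completion of a PV RDMSR/WRMSR; returns the output registers."""
    if err:
        # Report the failure through the return code instead of injecting #GP.
        return {"r10": TDVMCALL_STATUS_INVALID_OPERAND}
    out = {"r10": TDVMCALL_STATUS_SUCCESS}
    if is_read:
        # The 64-bit MSR value is EDX:EAX.
        out["r11"] = ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
    return out
```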
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
dd50294f |
| 27-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: TDX: Implement callbacks for MSR operations
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(), and .has_emulated_msr(), for preparation of handling hypercalls from TDX guest for PV RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX.
There are three classes of MSR virtualization for TDX.
- Non-configurable: TDX module directly virtualizes it. VMM can't configure it, the value set by KVM_SET_MSRS is ignored.
- Configurable: TDX module directly virtualizes it. VMM can configure it at VM creation time. The value set by KVM_SET_MSRS is used.
- #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.
For the MSRs belonging to the #VE case, the TDX module injects #VE to the TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined in TDX Module ABI Spec.
Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>, which are defined in GHCI (Guest-Host Communication Interface) so that the host VMM (e.g. KVM) can virtualize the MSRs.
TDX doesn't allow VMM to configure interception of MSR accesses. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any MSR filters, it will be applied when handling TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
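The three classes can be summarised in a small lookup sketch. The enum and helper names are hypothetical, and no real MSR is assigned to a class here, since the exact lists are defined in the TDX Module ABI Spec rather than in this text.

```python
from enum import Enum


class MsrClass(Enum):
    NON_CONFIGURABLE = "non-configurable"
    CONFIGURABLE = "configurable"
    VE = "#VE"


def set_msrs_value_used(cls: MsrClass) -> bool:
    """Whether a value written with KVM_SET_MSRS has any effect."""
    return cls is not MsrClass.NON_CONFIGURABLE


def virtualized_by(cls: MsrClass) -> str:
    """Who services a guest RDMSR/WRMSR for this class."""
    # In the #VE case the guest gets #VE and may then issue a TDG.VP.VMCALL.
    return "VMM" if cls is MsrClass.VE else "TDX module"
```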
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
5cf7239b |
| 27-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: TDX: Handle TDX PV HLT hypercall
Handle TDX PV HLT hypercall and the interrupt status due to it.
TDX guest state is protected, so KVM can't get the interrupt status of a TDX guest; it assumes interrupts are always allowed unless the TDX guest calls TDVMCALL with HLT, which passes the interrupt blocked flag.
If the guest halted with interrupt enabled, also query pending RVI by checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall.
Update vt_interrupt_allowed() for TDX based on interrupt blocked flag passed by HLT TDVMCALL. Do not wake up TD vCPU if interrupt is blocked for VT-d PI.
For NMIs, KVM cannot determine the NMI blocking status for TDX guests, so KVM always assumes NMIs are not blocked. In the unlikely scenario where a guest invokes the PV HLT hypercall within an NMI handler, this could result in a spurious wakeup. The guest should implement the PV HLT hypercall within a loop if it truly requires no interruptions, since NMI could be unblocked by an IRET due to an exception occurring before the PV HLT is executed in the NMI handler.
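The decision logic described in this commit message can be sketched as below. The function and parameter names are hypothetical and the model ignores everything except the two inputs named in the text: whether the last exit was the PV HLT hypercall, and the interrupt blocked flag it passed.

```python
def interrupt_allowed_model(last_exit_was_pv_hlt: bool,
                            hlt_interrupt_blocked: bool) -> bool:
    """Interrupts are assumed allowed unless PV HLT reported them blocked."""
    if not last_exit_was_pv_hlt:
        return True
    return not hlt_interrupt_blocked


def nmi_allowed_model() -> bool:
    """NMI blocking is not visible to KVM, so NMIs are assumed unblocked."""
    return True


def should_wake_halted_vcpu(interrupt_pending: bool,
                            hlt_interrupt_blocked: bool) -> bool:
    """A halted TD vCPU is woken only for an interrupt it can take."""
    return interrupt_pending and interrupt_allowed_model(True, hlt_interrupt_blocked)
```

Because nmi_allowed_model() is unconditional, a guest that halts inside an NMI handler can see a spurious wakeup, which is why the text recommends issuing the PV HLT hypercall in a loop.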
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
3bf31b57 |
| 27-Feb-2025 |
Isaku Yamahata <isaku.yamahata@intel.com> |
KVM: TDX: Handle TDX PV CPUID hypercall
Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM according to TDX Guest Host Communication Interface (GHCI).
For TDX, most CPUID leaf/sub-leaf combinations are virtualized by the TDX module while some trigger #VE. On #VE, TDX guest can issue TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID) to request VMM to emulate CPUID operation.
Morph PV CPUID hypercall to EXIT_REASON_CPUID and wire up to the KVM backend function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|