History log of /linux/arch/x86/kvm/vmx/tdx.h (Results 1 – 25 of 28)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 43db1111 29-May-2025 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
"As far as x86 goes this pull request "only" includes TDX host support.

Quotes are appropr

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
"As far as x86 goes this pull request "only" includes TDX host support.

Quotes are appropriate because (at 6k lines and 100+ commits) it is
much bigger than the rest, which will come later this week and
consists mostly of bugfixes and selftests. s390 changes will also come
in the second batch.

ARM:

- Add large stage-2 mapping (THP) support for non-protected guests
when pKVM is enabled, clawing back some performance.

- Enable nested virtualisation support on systems that support it,
though it is disabled by default.

- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
and protected modes.

- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. While this has no functional
impact, it ensures correctness of emulation (the data is
automatically extracted from the published JSON files), and helps
dealing with the evolution of the architecture.

- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.

- New selftest checking the pKVM ownership transition rules

- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.

- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.

- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.

- Add a new selftest for the SVE host state being corrupted by a
guest.

- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.

- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.

- and the usual random cleanups.

LoongArch:

- Don't flush tlb if the host supports hardware page table walks.

- Add KVM selftests support.

RISC-V:

- Add vector registers to get-reg-list selftest

- VCPU reset related improvements

- Remove scounteren initialization from VCPU reset

- Support VCPU reset from userspace using set_mpstate() ioctl

x86:

- Initial support for TDX in KVM.

This finally makes it possible to use the TDX module to run
confidential guests on Intel processors. This is quite a large
series, including support for private page tables (managed by the
TDX module and mirrored in KVM for efficiency), forwarding some
TDVMCALLs to userspace, and handling several special VM exits from
the TDX module.

This has been in the works for literally years and it's not really
possible to describe everything here, so I'll defer to the various
merge commits up to and including commit 7bcf7246c42a ('Merge
branch 'kvm-tdx-finish-initial' into HEAD')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
x86/tdx: mark tdh_vp_enter() as __flatten
Documentation: virt/kvm: remove unreferenced footnote
RISC-V: KVM: lock the correct mp_state during reset
KVM: arm64: Fix documentation for vgic_its_iter_next()
KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
KVM: arm64: Stage-2 huge mappings for np-guests
KVM: arm64: Add a range to pkvm_mappings
KVM: arm64: Convert pkvm_mappings to interval tree
KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
KVM: arm64: Add a range to __pkvm_host_unshare_guest()
KVM: arm64: Add a range to __pkvm_host_share_guest()
KVM: arm64: Introduce for_each_hyp_page
KVM: arm64: Handle huge mappings for np-guest CMOs
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
RISC-V: KVM: Remove scounteren initialization
KVM: RISC-V: remove unnecessary SBI reset state
...

show more ...


Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# fd02aa45 19-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-initial' into HEAD

This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a soft

Merge branch 'kvm-tdx-initial' into HEAD

This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).

The series is in turn split into multiple sub-series, each with a separate
merge commit:

- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.

- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.

- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.

- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.

- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit 7911f145de5f,
"x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")

- Loose ends: handling of the remaining exits from the TDX module, including
EPT violation/misconfig and several TDVMCALL leaves that are handled in
the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
an error or ignoring operations that are not supported by TDX guests

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


Revision tags: v6.14-rc7
# 7bcf7246 12-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-finish-initial' into HEAD

This patch ties the remaining loose ends and finally enables TDX guests to
run inside KVM. It implements handling of EPT violation/misconfig and of
s

Merge branch 'kvm-tdx-finish-initial' into HEAD

This patch ties the remaining loose ends and finally enables TDX guests to
run inside KVM. It implements handling of EPT violation/misconfig and of
several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR,
GetTdVmCallInfo); it also adds a bunch of wrappers in vmx/main.c to
ignore operations not supported by TDX guests(*)

Finally, it introduces documentation for the new APIs that have been
added along the way.

(*) access to CPU state, VMX preemption timer, accesses to TSC offset or
multiplier, LMCE enable/disable, hypercall patching.

show more ...


# 9913212b 12-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-interrupts' into HEAD

Introduces support for interrupt handling for TDX guests, including
virtual interrupt injection and VM-Exits caused by vectored events.

Injection
=======

Merge branch 'kvm-tdx-interrupts' into HEAD

Introduces support for interrupt handling for TDX guests, including
virtual interrupt injection and VM-Exits caused by vectored events.

Injection
=========

TDX supports non-NMI interrupt injection only by posted interrupt. Posted
interrupt descriptors (PIDs) are allocated in shared memory, KVM
can update them directly. To post pending interrupts in the PID, KVM
can generate a self-IPI with notification vector prior to TD entry.
TDX guest status is protected, KVM can't get the interrupt status of
TDX guest. For now, assume the interrupt is always allowed. A later
patch set will let TDX guests to call TDVMCALL with HLT, which passes
the interrupt block flag, so that whether interrupt is allowed in HLT
will checked against the interrupt block flag.

For NMIs, KVM can request the TDX module to inject a NMI into a TDX vCPU
by setting the PEND_NMI TDVPS field to 1. Following that, KVM can call
TDH.VP.ENTER to run the vCPU and the TDX module will attempt to inject
the NMI as soon as possible. PEND_NMI TDVPS field is a 1-bit filed,
i.e. KVM can only pend one NMI in the TDX module. Also, TDX doesn't
allow KVM to request NMI-window exit directly. When there is already
one NMI pending in the TDX module, i.e. it has not been delivered to
TDX guest yet, if there is NMI pending in KVM, collapse the pending
NMI in KVM into the one pending in the TDX module. Such collapse is OK
considering on X86 bare metal, multiple NMIs could collapse into one NMI,
e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all
NMI sources in the NMI handler to avoid missing handling of some NMI
events. More details can be found in the changelog of the patch "KVM:
TDX: Implement methods to inject NMI".

TDX doesn't support system-management mode (SMM) and system-management
interrupt (SMI) in guest TDs because TDX module doesn't provide a way for
VMM to inject SMI into guest TD or switch guest vCPU mode into SMM.
SMI requests return -ENOTTY similar to CONFIG_KVM_SMM=n. Likewise,
INIT and SIPI events are not used and are blocked for TDX guests;
TDX defines its own vCPU creation and initialization sequence, which
is done on the host via SEAMCALLs at TD build time.

VM-exit for external events
===========================

Similar to the VMX case, external interrupts are with interrupts off:
in the .handle_exit_irqoff() callback for external interrupts and in
the noinstr region for NMIs. Just like VMX, NMI remains blocked after
exiting from TDX guest for NMI-induced exits.

Machine check, which is handled in the .handle_exit_irqoff() callback, is
the only exception type KVM handles for TDX guests. For other exceptions,
because TDX guest state is protected, exceptions in TDX guests can't be
intercepted. TDX VMM isn't supposed to handle these exceptions. Exit to
userspace with KVM_EXIT_EXCEPTION If unexpected exception occurs.

Host SMIs also cause an exit to KVM. This is needed because in SEAM
root mode (TDX module) all interrupts are blocked. An SMI can be "I/O
SMI" or "other SMI". For TDX, there will be no I/O SMI because I/O
instructions inside TDX guest trigger #VE and TDX guest needs to use
TDVMCALL to request VMM to do I/O emulation. The only case of interest
for "other SMI" is an #MC occurring in the guest when MCE-SMI morphing
is enabled in the host firmware. Such "MSMI" is marked by having bit 0
set in the exit qualification; MSMI exits are fatal for the TD and
are eventually handled by the kernel machine check handler (7911f14
x86/mce: Implement recovery for errors in TDX/SEAM non-root mode),
which marks the page as poisoned. It is not possible right now to
pass machine check exceptions to the guest.

SMIs other than machine check SMIs are handled just by leaving SEAM
root mode and KVM doesn't need to do anything.

show more ...


# 4d2dc9a2 12-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-userspace-exit' into HEAD

Introduces support for VM exits that are forwarded the host VMM in
userspace. These are initiated from the TDCALL exit code; although
these userspace

Merge branch 'kvm-tdx-userspace-exit' into HEAD

Introduces support for VM exits that are forwarded the host VMM in
userspace. These are initiated from the TDCALL exit code; although
these userspace exits have the same TDX exit code, they result in several
different types of exits to userspace.

When a guest TD issues a TDVMCALL, it
exits to VMM with a new exit reason. The arguments from the guest TD and
return values from the VMM are passed through the guest registers. The
ABI details for the guest TD hypercalls are specified in the TDX GHCI
specification.

There are two types of hypercalls defined in the GHCI specification:

- Standard TDVMCALLs: When input of R10 from guest TD is set to 0, it
indicates that the TDVMCALL sub-function used in R11 is defined in GHCI
specification.

- Vendor-Specific TDVMCALLs: When input of R10 from guest TD is non-zero,
it indicates a vendor-specific TDVMCALL. KVM hypercalls from the guest
follow this interface, using R10 as KVM hypercall number and R11-R14 as
4 arguments. The error code returned in R10.

This series includes basic standard TDVMCALLs that map to existing eixt
reasons:

- TDG.VP.VMCALL<MapGPA> reuses exit reason KVM_EXIT_HYPERCALL with the
hypercall number KVM_HC_MAP_GPA_RANGE.

- TDG.VP.VMCALL<ReportFatalError> reuses exit reason KVM_EXIT_SYSTEM_EVENT
with a new event type KVM_SYSTEM_EVENT_TDX_FATAL.

- TDG.VP.VMCALL<Instruction.IO> reuses exit reason KVM_EXIT_IO.

- TDG.VP.VMCALL<#VE.RequestMMIO> reuses exit reason KVM_EXIT_MMIO.

Notably, handling for TDG.VP.VMCALL<SetupEventNotifyInterrupt> and
TDG.VP.VMCALL<GetQuote> is not included yet.

show more ...


# 77ab80c6 12-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-enter-exit' into HEAD

This series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.

A TD VCPU is entered via the SEAM

Merge branch 'kvm-tdx-enter-exit' into HEAD

This series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.

A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages
the save/restore of guest state and, in conjunction with the SEAMCALL
interface, handles certain aspects of host state. However, there are
specific elements of the host state that require additional attention, as
detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.

TDX is quite different from VMX in this regard. For VMX, the host VMM is
heavily involved in restoring, managing and saving guest CPU state, whereas
for TDX this is handled by the TDX Module. In that way, the TDX Module can
protect the confidentiality and integrity of TD CPU state.

The TDX Module does not save/restore all host CPU state because the host
VMM can do it more efficiently and selectively. CPU state referred to
below is host CPU state. Often values are already held in memory so no
explicit save is needed, and restoration may not be needed if the kernel
is not using a feature.

TDX does not support PAUSE-loop exiting. According to the TDX Module
Base arch. spec., hypercalls are expected to be used instead. Note that
the Linux TDX guest supports existing hypercalls via TDG.VP.VMCALL.

This series requires TDX module 1.5.06.00.0744, or later, due to removal
of the workarounds for the lack of the NO_RBP_MOD feature required by the
kernel. NO_RBP_MOD is now required.

show more ...


Revision tags: v6.14-rc6
# fcbe3482 06-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-mmu' into HEAD

This series picks up from commit 86eb1aef7279 ("Merge branch
'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on
changes to the generic x86 parts

Merge branch 'kvm-tdx-mmu' into HEAD

This series picks up from commit 86eb1aef7279 ("Merge branch
'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on
changes to the generic x86 parts of the KVM MMU code, and adds support
for TDX's secure page tables to the Intel side of KVM.

Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a bit
in the guest PTE. Solutions like SEV treat this bit more like a permission
bit, where solutions like TDX and ARM CCA treat it more like a GPA bit. In
the latter case, the host maps private memory in one half of the address
space and shared in another. For TDX these two halves are mapped by
different EPT roots. The private half (also called Secure EPT in Intel
documentation) gets managed by the privileged TDX Module. The shared half
is managed by the untrusted part of the VMM (KVM).

In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. Like SNP,
TDX wants to protect against protected memory being reset or otherwise
scrambled by the host. In order to prevent this, the guest has to take
specific action to “accept” memory after changes are made by the VMM to
the private EPT. This prevents the VMM from performing many of the usual
memory management operations that involve zapping and refaulting memory.
The private memory also is always RWX and cannot have VMM specified cache
attribute attributes applied.

TDX memory implementation
=========================

Creating shared EPT
-------------------
Shared EPT handling is relatively simple compared to private memory. It is
managed from within KVM. The main differences between shared EPT and EPT
in a normal VM are that the root is set with a TDVMCS field (via SEAMCALL),
and that the GFN specified in the memslot perspective needs to be mapped
at an offset in the EPT. For the former, this series plumbs in the
load_mmu_pgd() operation to the correct field for the shared EPT. For the
latter, previous patches have laid the groundwork for mapping so called
“direct roots” roots at an offset specified in kvm->arch.gfn_direct_bits.

Creating private EPT
--------------------
In previous patches, the concept of “mirrored roots” were introduced. Such
roots maintain a KVM side “mirror” of the “external” EPT by keeping an
unmapped EPT tree within the KVM MMU code. When changing these mirror
EPTs, the KVM MMU code calls out via x86_ops to update the external EPT.
This series adds implementations for these “external” ops for TDX to
create and manage “private” memory via TDX module APIs.

Managing S-EPT with the TDX Module
----------------------------------
The TDX module allows the TD’s private memory to be managed via SEAMCALLs.
This management consists of operating on two internal elements:

1. The private EPT, which the TDX module calls the S-EPT. It maps the
actual mapped, private half of the GPA space using an EPT tree.

2. The HKID, which represents private encryption keys used for encrypting
TD memory. The CPU doesn’t guarantee cache coherency between these
encryption keys, so memory that is encrypted with one of these keys
needs to be reclaimed for use on the host in special ways.

This series will primarily focus on the SEAMCALLs for managing the private
EPT. Consideration of the HKID is needed for when the TD is torn down.

Populating TDX Private memory
-----------------------------
TDX allows the EPT mapping the TD's private memory to be modified in
limited ways. There are SEAMCALLs for building and tearing down the EPT
tree, as well as mapping pages into the private EPT.

As for building and tearing down the EPT page tables, it is relatively
simple. There are SEAMCALLs for installing and removing them. However, the
current implementation only supports adding private EPT page tables, and
leaves them installed for the lifetime of the TD. For teardown, the
details are discussed in a later section.

As for populating and zapping private SPTE, there are SEAMCALLs for this
as well. The zapping case will be described in detail later. As for the
populating case, there are two categories: before TD is finalized and
after TD is finalized. Both of these scenarios go through the TDP MMU map
path. The changes done previously to introduce “mirror” and “external”
page tables handle directing SPTE installation operations through the
set_external_spte() op.

In the “after” case, the TDX set_external_spte() handler simply calls a
SEAMCALL (TDX.MEM.PAGE.AUG).

For the before case, it is a bit more complicated as it requires both
setting the private SPTE *and* copying in the initial contents of the page
at the same time. For TDX this is done via the KVM_TDX_INIT_MEM_REGION
ioctl, which is effectively the kvm_gmem_populate() operation.

For SNP, the private memory can be pre-populated first, and faulted in
later like normal. But for TDX these need to both happen both at the same
time and the setting of the private SPTE needs to happen in a different
way than the “after” case described above. It needs to use the
TDH.MEM.SEPT.ADD SEAMCALL which does both the copying in of the data and
setting the SPTE.

Without extensive modification to the fault path, it’s not possible
utilize this callback from the set_external_spte() handler because it the
source page for the data to be copied in is not known deep down in this
callchain. So instead the post-populate callback does a three step
process.

1. Pre-fault the memory into the mirror EPT, but have the
set_external_spte() not make any SEAMCALLs.

2. Check that the page is still faulted into the mirror EPT under read
mmu_lock that is held over this and the following step.

3. Call TDH.MEM.SEPT.ADD with the HPA of the page to copy data from, and
the private page installed in the mirror EPT to use for the private
mapping.

The scheme involves some assumptions about the operations that might
operate on the mirrored EPT before the VM is finalized. It assumes that no
other memory will be faulted into the mirror EPT, that is not also added
via TDH.MEM.SEPT.ADD). If this is violated the KVM MMU may not see private
memory faulted in there later and so not make the proper external spte
callbacks. To check this, KVM enforces that the number of
pre-faulted pages is the same as the number of pages added via
KVM_TDX_INIT_MEM_REGION.

TDX TLB flushing
----------------
For TDX, TLB flushing needs to happen in different ways depending on
whether private and/or shared EPT needs to be flushed. Shared EPT can be
flushed like normal EPT with INVEPT. To avoid reading TD's EPTP out from
TDX module, this series flushes shared EPT with type 2 INVEPT. Private TLB
entries can be flushed this way too (via type 2). However, since the TDX
module needs to enforce some guarantees around which private memory is
mapped in the TD, it requires these operations to be done in special ways
for private memory.

For flushing private memory, two methods are possible. The simple one
is the TDH.VP.FLUSH SEAMCALL; this flush is of the INVEPT type 1 variety
(i.e. mappings associated with the TD).

The second method is part of a sequence of SEAMCALLs for removing a guest
page. The sequence looks like:

1. TDH.MEM.RANGE.BLOCK - Remove RWX bits from entry (similar to KVM’s zap).

2. TDH.MEM.TRACK - Increment the TD TLB epoch, which is a per-TD counter

3. Kick off all vCPUs - In order to force them to have to re-enter.

4. TDH.MEM.PAGE.REMOVE - Actually remove the page and make it available for
other use.

5. TDH.VP.ENTER - On re-entering TDX module will see the epoch is
incremented and flush the TLB.

On top of this, during TDX module init TDH.SYS.LP.INIT (which is used
to online a CPU for TDX usage) invokes INVEPT to flush all mappings in
the TLB.

During runtime, for normal (TDP MMU, non-nested) guests, KVM will do a TLB
flushes in 4 scenarios:

(1) kvm_mmu_load()

After EPT is loaded, call kvm_x86_flush_tlb_current() to invalidate
TLBs for current vCPU loaded EPT on current pCPU.

(2) Loading vCPU to a new pCPU

Send request KVM_REQ_TLB_FLUSH to current vCPU, the request handler
will call kvm_x86_flush_tlb_all() to flush all EPTs assocated with the
new pCPU.

(3) When EPT mapping has changed (after removing or permission reduction)
(e.g. in kvm_flush_remote_tlbs())

Send request KVM_REQ_TLB_FLUSH to all vCPUs by kicking all them off,
the request handler on each vCPU will call kvm_x86_flush_tlb_all() to
invalidate TLBs for all EPTs associated with the pCPU.

(4) When EPT changes only affects current vCPU, e.g. virtual apic mode
changed.

Send request KVM_REQ_TLB_FLUSH_CURRENT, the request handler will call
kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded
EPT on current pCPU.

Only the first 3 are relevant to TDX. They are implemented as follows.

(1) kvm_mmu_load()

Only the shared EPT root is loaded in this path. The TDX module does
not require any assurances about the operation, so the
flush_tlb_current()->ept_sync_global() can be called as normal.

(2) vCPU load

When a vCPU migrates to a new logical processor, it has to be flushed
on the *old* pCPU, unlike normal VMs where the INVEPT is executed on
the new pCPU to remove stale mappings from previous usage of the same
EPTP on the new pCPU. The TDX behavior comes from a requirement
that a vCPU can only be associated with one pCPU at at time. This
flush happens via an IPI that invokes TDH.VP.FLUSH SEAMCALL, during
the vcpu_load callback.

(3) Removing a private SPTE

This is the more complicated flow. It is done in a simple way for now
and is especially inefficient during VM teardown. The plan is to get a
basic functional version working and optimize some of these flows
later.

When a private page mapping is removed, the core MMU code calls the
newly remove_external_spte() op, and flushes the TLB on all vCPUs. But
TDX can’t rely on doing that for private memory, so it has it’s own
process for making sure the private page is removed. This flow
(TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, TDH.MEM.PAGE.REMOVE) is done
withing the remove_external_spte() implementation as described in the
“TDX TLB flushing” section above.

After that, back in the core MMU code, KVM will call
kvm_flush_remote_tlbs*() resulting in an INVEPT. Despite that, when
the vCPUs re-enter (TDH.VP.ENTER) the TD, the TDX module will do
another INVEPT for its own reassurance.

Private memory teardown
-----------------------
Tearing down private memory involves reclaiming three types of resources
from the TDX module:

1. TD’s HKID

To reclaim the TD’s HKID, no mappings may be mapped with it.

2. Private guest pages (mapped with HKID)
3. Private page tables that map private pages (mapped with HKID)

From the TDX module’s perspective, to reclaim guest private pages they
need to be prevented from be accessed via the HKID (unmapped and TLB
flushed), their HKID associated cachelines need to be flushed, and
they need to be marked as no longer use by the TD in the TDX modules
internal tracking (PAMT)

During runtime private PTEs can be zapped as part of memslot deletion or
when memory coverts from shared to private, but private page tables and
HKIDs are not torn down until the TD is being destructed. The means the
operation to zap private guest mapped pages needs to do the required cache
writeback under the assumption that other vCPU’s may be active, but the
PTs do not.

TD teardown resource reclamation
--------------------------------
The code that does the TD teardown is organized such that when an HKID is
reclaimed:
1. vCPUs will no longer enter the TD
2. The TLB is flushed on all CPUs
3. The HKID associated cachelines have been flushed.

So at that point most of the steps needed to reclaim TD private pages and
page tables have already been done and the reclaim operation only needs to
update the TDX module’s tracking of page ownership. For simplicity each
operation only supports one scenario: before or after HKID reclaim. Since
zapping and reclaiming private pages has to function during runtime for
memslot deletion and converting from shared to private, the TD teardown is
arranged so this happens before HKID reclaim. Since private page tables
are never torn down during TD runtime, they can happen in a simpler and
more efficient way after HKID reclaim. The private page reclaim is
initiated from the kvm fd release. The callchain looks like this:

do_exit
|->exit_mm --> tdx_mmu_release_hkid() was called here previously in v19
|->exit_files
|->1.release vcpu fd
|->2.kvm_gmem_release
| |->kvm_gmem_invalidate_begin --> unmap all leaf entries, causing
| zapping of private guest pages
|->3.release kvmfd
|->kvm_destroy_vm
|->kvm_arch_pre_destroy_vm
| | kvm_x86_call(vm_pre_destroy)(kvm) -->tdx_mmu_release_hkid()
|->kvm_arch_destroy_vm
|->kvm_unload_vcpu_mmus
| kvm_destroy_vcpus(kvm)
| |->kvm_arch_vcpu_destroy
| |->kvm_x86_call(vcpu_free)(vcpu)
| | kvm_mmu_destroy(vcpu) -->unref mirror root
| kvm_mmu_uninit_vm(kvm) --> mirror root ref is 1 here,
| zap private page tables
| static_call_cond(kvm_x86_vm_destroy)(kvm);

show more ...


# 0d20742b 06-Mar-2025 Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'kvm-tdx-initialization' into HEAD

This series kicks off the actual interaction of KVM with the TDX module.
This series encompasses the basic setup for using the TDX module from KVM,
an

Merge branch 'kvm-tdx-initialization' into HEAD

This series kicks off the actual interaction of KVM with the TDX module.
This series encompasses the basic setup for using the TDX module from KVM,
and the creation of TD VMs and vCPUs.

The TDX Module is a software component that runs in a special CPU mode
called SEAM (Secure Arbitration Mode). Loading it is mostly handled
outside of KVM by the core kernel. Once it’s loaded KVM can interact with
the TDX Module via a new instruction called SEAMCALL to virtualize a TD
guests. This instruction can be used to make various types of seamcalls,
with names organized into a hierarchy. The format is TDH.[AREA].[ACTION],
where “TDH” stands for “Trust Domain Host”, and differentiates from
another set of calls that can be done by the guest “TDG”. The KVM relevant
areas of SEAMCALLs are:
SYS – TDX module management, static metadata reading.
MNG – TD management. VM scoped things that operate on a TDX module
controlled structure called the TDCS.
VP – vCPU management. vCPU scoped things that operate on TDX module
controlled structures called the TDVPS.
PHYMEM - Operations related to physical memory management (page
reclaiming, cache operations, etc).

This series introduces some TDX specific KVM APIs and stops short of
fully “finalizing” the creation of a TD VM. The part of initializing
a guest where initial private memory is loaded is left to a separate
MMU related series.

show more ...


Revision tags: v6.14-rc5
# 081385db 27-Feb-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall

Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.

For complete_emulated_msr() callback, instead of inje

KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall

Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.

For complete_emulated_msr() callback, instead of injecting #GP on error,
implement tdx_complete_emulated_msr() to set return code on error. Also
set return value on MSR read according to the values from kvm x86
registers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 5cf7239b 27-Feb-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Handle TDX PV HLT hypercall

Handle TDX PV HLT hypercall and the interrupt status due to it.

TDX guest status is protected, KVM can't get the interrupt status
of TDX guest and it assumes i

KVM: TDX: Handle TDX PV HLT hypercall

Handle TDX PV HLT hypercall and the interrupt status due to it.

TDX guest status is protected, KVM can't get the interrupt status
of TDX guest and it assumes interrupt is always allowed unless TDX
guest calls TDVMCALL with HLT, which passes the interrupt blocked flag.

If the guest halted with interrupt enabled, also query pending RVI by
checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall.

Update vt_interrupt_allowed() for TDX based on interrupt blocked flag
passed by HLT TDVMCALL. Do not wakeup TD vCPU if interrupt is blocked
for VT-d PI.

For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked. In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup. The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 4b2abc49 27-Feb-2025 Yan Zhao <yan.y.zhao@intel.com>

KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal

Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove(

KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal

Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect error and is less
sensitive to the performance penalty caused by kicking off vCPUs.

Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
tdh_vp_enter() and TDCALLs.

Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
TDCS epoch tdh_vp_enter tdh_mem_track
------------------------------------------------------------
SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
tdh_mem_range_block
------------------------------------------------------------
SEPT entry tdh_mem_range_block (Host lock)
tdh_mem_page_remove (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)

Use a TDX specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
TD, thereby avoiding the potential contention. Apply the kick-off and no
vCPU entering only after each SEAMCALL busy error to minimize the window of
no TD entry, as the contention due to 0-step mitigation or TDCALLs is
expected to be rare.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


Revision tags: v6.14-rc4
# acc64eb4 22-Feb-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Implement methods to inject NMI

Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make
the NMI pending in the TDX module. If there is a further pending NMI in
KVM, co

KVM: TDX: Implement methods to inject NMI

Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make
the NMI pending in the TDX module. If there is a further pending NMI in
KVM, collapse it to the one pending in the TDX module.

VMM can request the TDX module to inject a NMI into a TDX vCPU by setting
the PEND_NMI TDVPS field to 1. Following that, VMM can call TDH.VP.ENTER
to run the vCPU and the TDX module will attempt to inject the NMI as soon
as possible.

KVM has the following 3 cases to inject two NMIs when handling simultaneous
NMIs and they need to be injected in a back-to-back way. Otherwise, OS
kernel may fire a warning about the unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI pending in KVM.
KVM requests NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
KVM injects the first NMI and requests NMI window exit to inject the
second NMI.
K3. A previous NMI needs to be rejected and one NMI pending in KVM.
KVM first requests force immediate exit followed by a VM entry to
complete the NMI rejection. Then, during the force immediate exit, KVM
requests NMI window exit to inject the pending NMI.

For TDX, PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one
NMI in the TDX module. Also, the vCPU state is protected, KVM doesn't know
the NMI blocking states of TDX vCPU, KVM has to assume NMI is always
unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD exit, it
means the previous NMI needs to be re-injected.

Based on KVM's NMI handling flow, there are following 6 cases:
In NMI handler TDX module KVM
T1. No PEND_NMI=0 1 pending NMI
T2. No PEND_NMI=0 2 pending NMIs
T3. No PEND_NMI=1 1 pending NMI
T4. Yes PEND_NMI=0 1 pending NMI
T5. Yes PEND_NMI=0 2 pending NMIs
T6. Yes PEND_NMI=1 1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: KVM doesn't know whether NMI is blocked by a NMI or not, case T5 and
T6 can happen.

When handling pending NMI in KVM for TDX guest, what KVM can do is to add a
pending NMI in TDX module when PEND_NMI is 0. T1 and T4 can be handled by
this way. However, TDX doesn't allow KVM to request NMI window exit
directly, if PEND_NMI is already set and there is still pending NMI in KVM,
the only way KVM could try is to request a force immediate exit. But for
case T5 and T6, force immediate exit will result in infinite loop because
force immediate exit makes it no progress in the NMI handler, so that the
pending NMI in the TDX module can never be injected.

Considering on X86 bare metal, multiple NMIs could collapse into one NMI,
e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all NMI
sources in the NMI handler to avoid missing handling of some NMI events.

Based on that, for the above 3 cases (K1-K3), only case K1 must inject the
second NMI because the guest NMI handler may have already polled some of
the NMI sources, which could include the source of the pending NMI, the
pending NMI must be injected to avoid the lost of NMI. For case K2 and K3,
the guest OS will poll all NMI sources (including the sources caused by the
second NMI and further NMI collapsed) when the delivery of the first NMI,
KVM doesn't have the necessity to inject the second NMI.

To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify the KVM's NMI handling common code, to collapse the
second pending NMI for K2 and K3.
- Option 2: Do it in TDX specific way. When the previous NMI is still
pending in the TDX module, i.e. it has not been delivered to TDX guest
yet, collapse the pending NMI in KVM into the previous one.

This patch goes with option 2 because it is simple and doesn't impact other
VM types. Option 1 may need more discussions.

This is the first need to access vCPU scope metadata in the "management"
class. Make needed accessors available.

[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 2c304880 22-Feb-2025 Binbin Wu <binbin.wu@linux.intel.com>

KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>

Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.

MapGPA is used by TDX guest to request

KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>

Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.

MapGPA is used by TDX guest to request to map a GPA range as private
or shared memory. It needs to exit to userspace for handling. KVM has
already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will
exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks,
convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
to userspace.

To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks and check interrupt pending between chunks.
This allows for timely injection of interrupts and prevents issues with
guest lockup detection. TDX guest should retry the operation for the
GPA starting at the address specified in R11 when the TDVMCALL return
TDVMCALL_RETRY as status code.

Note userspace needs to enable KVM_CAP_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE bit set for TD VM.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 095b71a0 06-Mar-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Add a place holder to handle TDX VM exit

Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally,

KVM: TDX: Add a place holder to handle TDX VM exit

Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally, add error handling during the TDX VM exit flow, and add a
place holder to handle various exit reasons.

Store VMX exit reason and exit qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. Store extended exit qualification and exit GPA info in
struct vcpu_tdx because they are used by TDX code only.

Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.*
operations due to secure EPT or TD EPOCH. If the contention occurs,
the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to
attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED,
not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during
IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of
KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly.

Error Handling:
- TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the
CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX
isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM
firmware is not loaded or disabled.
- TDX_ERROR: This indicates some check failed in the TDX module, preventing
the vCPU from running.
- Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it
separately before handling TDX_NON_RECOVERABLE because when off-TD debug
is not enabled, TDX_NON_RECOVERABLE is set.
- TDX_NON_RECOVERABLE: Set by the TDX module when the error is
non-recoverable, indicating that the TDX guest is dead or the vCPU is
disabled.
A special case is triple fault, which also sets TDX_NON_RECOVERABLE but
exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case.
- Any unhandled VM exit reason will also return to userspace with
KVM_EXIT_INTERNAL_ERROR.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1
# e0b4f31a 29-Jan-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: restore user ret MSRs

Several MSRs are clobbered on TD exit that are not used by Linux while
in ring 0. Ensure the cached value of the MSR is updated on vcpu_put,
and the MSRs themselves

KVM: TDX: restore user ret MSRs

Several MSRs are clobbered on TD exit that are not used by Linux while
in ring 0. Ensure the cached value of the MSR is updated on vcpu_put,
and the MSRs themselves before returning to ring 3.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-10-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 81bf912b 29-Jan-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Implement TDX vcpu enter/exit path

Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.

Do not pass arguments from/to

KVM: TDX: Implement TDX vcpu enter/exit path

Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.

Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead,
marshall state to/from the appropriate x86 registers only when needed,
i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage
the existing code.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-6-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 7172c753 14-Mar-2025 Binbin Wu <binbin.wu@linux.intel.com>

KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct

Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible a

KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct

Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible and to make
TDX exit handling more VMX like.

No functional change intended.

[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-5-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12
# d789fa6e 12-Nov-2024 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Handle vCPU dissociation

Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
the address translation caches and cached TD VMCS of a TD vCPU in its
associated pCPU.

KVM: TDX: Handle vCPU dissociation

Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
the address translation caches and cached TD VMCS of a TD vCPU in its
associated pCPU.

In TDX, a vCPUs can only be associated with one pCPU at a time, which is
done by invoking SEAMCALL TDH.VP.ENTER. For a successful association, the
vCPU must be dissociated from its previous associated pCPU.

To facilitate vCPU dissociation, introduce a per-pCPU list
associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
pCPU).

vCPU dissociations can happen under below conditions:
- On the op hardware_disable is called.
This op is called when virtualization is disabled on a given pCPU, e.g.
when hot-unplug a pCPU or machine shutdown/suspend.
In this case, dissociate all vCPUs from the pCPU by iterating its
per-pCPU list associated_tdvcpus.

- On vCPU migration to a new pCPU.
Before adding a vCPU into associated_tdvcpus list of the new pCPU,
dissociation from its old pCPU is required, which is performed by issuing
an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
On a successful dissociation, the vCPU will be removed from the
associated_tdvcpus list of its previously associated pCPU.

- On tdx_mmu_release_hkid() is called.
TDX mandates that all vCPUs must be disassociated prior to the release of
an hkid. Therefore, dissociation of all vCPUs is a must before executing
the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the hkid.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073858.22312-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7
# 012426d6 04-Sep-2024 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Finalize VM initialization

Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.

Documentation for the API is added in a

KVM: TDX: Finalize VM initialization

Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.

Documentation for the API is added in another patch:
"Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"

For the purpose of attestation, a measurement must be made of the TDX VM
initial state. This is referred to as TD Measurement Finalization, and
uses SEAMCALL TDH.MR.FINALIZE, after which:
1. The VMM adding TD private pages with arbitrary content is no longer
allowed
2. The TDX VM is runnable

Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240904030751.117579-21-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# fe1e6d48 12-Nov-2024 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Add accessors VMX VMCS helpers

TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS. Introduce helper accessors to hide its SEAMCALL ABI details.

Sign

KVM: TDX: Add accessors VMX VMCS helpers

TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS. Introduce helper accessors to hide its SEAMCALL ABI details.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073551.22070-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 7c035bea 19-Feb-2025 Zhiming Hu <zhiming.hu@intel.com>

KVM: TDX: Register TDX host key IDs to cgroup misc controller

TDX host key IDs (HKID) are limit resources in a machine, and the misc
cgroup lets the machine owner track their usage and limits the po

KVM: TDX: Register TDX host key IDs to cgroup misc controller

TDX host key IDs (HKID) are limit resources in a machine, and the misc
cgroup lets the machine owner track their usage and limits the possibility
of abusing them outside the owner's control.

The cgroup v2 miscellaneous subsystem was introduced to control the
resource of AMD SEV & SEV-ES ASIDs. Likewise introduce HKIDs as a misc
resource.

Signed-off-by: Zhiming Hu <zhiming.hu@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# a50f673f 30-Oct-2024 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: Do TDX specific vcpu initialization

TD guest vcpu needs TDX specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU,

KVM: TDX: Do TDX specific vcpu initialization

TD guest vcpu needs TDX specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/
(Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 0186dd29 14-Jan-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: add ioctl to initialize VM with TDX specific parameters

After the crypto-protection key has been configured, TDX requires a
VM-scope initialization as a step of creating the TDX guest. Th

KVM: TDX: add ioctl to initialize VM with TDX specific parameters

After the crypto-protection key has been configured, TDX requires a
VM-scope initialization as a step of creating the TDX guest. This
"per-VM" TDX initialization does the global configurations/features that
the TDX guest can support, such as guest's CPUIDs (emulated by the TDX
module), the maximum number of vcpus etc.

Because there is no room in KVM_CREATE_VM to pass all the required
parameters, introduce a new ioctl KVM_TDX_INIT_VM and mark the VM as
TD_STATE_UNINITIALIZED until it is invoked.

This "per-VM" TDX initialization must be done before any "vcpu-scope" TDX
initialization; KVM_TDX_INIT_VM IOCTL must be invoked before the creation
of vCPUs.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 8d032b68 25-Feb-2025 Isaku Yamahata <isaku.yamahata@intel.com>

KVM: TDX: create/destroy VM structure

Implement managing the TDX private KeyID to implement, create, destroy
and free for a TDX guest.

When creating at TDX guest, assign a TDX private KeyID for the

KVM: TDX: create/destroy VM structure

Implement managing the TDX private KeyID to implement, create, destroy
and free for a TDX guest.

When creating at TDX guest, assign a TDX private KeyID for the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).

On destruction, free the allocated pages, and the KeyID.

Before tearing down the private page tables, TDX requires the guest TD to
be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops
hook. The TDR control structures can be freed in the vm_destroy() hook,
which runs last.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix build issue in kvm-coco-queue
- Init ret earlier to fix __tdx_td_init() error handling. (Chao)
- Standardize -EAGAIN for __tdx_td_init() retry errors (Rick)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


# 1001d988 30-Oct-2024 Sean Christopherson <sean.j.christopherson@intel.com>

KVM: TDX: Add TDX "architectural" error codes

Add error codes for the TDX SEAMCALLs both for TDX VMM side for TDH
SEAMCALL and TDX guest side for TDG.VP.VMCALL. KVM issues the TDX
SEAMCALLs and che

KVM: TDX: Add TDX "architectural" error codes

Add error codes for the TDX SEAMCALLs both for TDX VMM side for TDH
SEAMCALL and TDX guest side for TDG.VP.VMCALL. KVM issues the TDX
SEAMCALLs and checks its error code. KVM handles hypercall from the TDX
guest and may return an error. So error code for the TDX guest is also
needed.

TDX SEAMCALL uses bits 31:0 to return more information, so these error
codes will only exactly match RAX[63:32]. Error codes for TDG.VP.VMCALL is
defined by TDX Guest-Host-Communication interface spec.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241030190039.77971-14-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

show more ...


12