.. SPDX-License-Identifier: GPL-2.0

Confidential Computing VMs
==========================
Hyper-V can create and run Linux guests that are Confidential Computing
(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
the confidentiality and integrity of data in the VM's memory, even in the
face of a hypervisor/VMM that has been compromised and may behave maliciously.
CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
"isolation VMs".

A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
following:

* Physical hardware with a processor that supports CoCo VMs

* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs

* The VM runs a version of Linux that supports being a CoCo VM

The physical hardware requirements are as follows:

* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
  SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
  VM on Hyper-V.

* Intel processor with TDX

To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
or vice versa, after it is created.

Operational Modes
-----------------
Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
created and cannot be changed during the life of the VM.

* Fully-enlightened mode. In this mode, the guest operating system is
  enlightened to understand and manage all aspects of running as a CoCo VM.

* Paravisor mode. In this mode, a paravisor layer between the guest and the
  host provides some operations needed to run as a CoCo VM. The guest operating
  system can have fewer CoCo enlightenments than are required in the
  fully-enlightened case.

Conceptually, fully-enlightened mode and paravisor mode may be treated as
points on a spectrum spanning the degree of guest enlightenment needed to run
as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
implementation of paravisor mode is the other end of the spectrum, where all
aspects of running as a CoCo VM are handled by the paravisor, and a normal
guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
can run successfully. However, the Hyper-V implementation of paravisor mode
does not go this far, and is somewhere in the middle of the spectrum. Some
aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
must be enlightened for other aspects. Unfortunately, there is no
standardized enumeration of the features/functions that might be provided in
the paravisor, and there is no standardized mechanism for a guest OS to query
the paravisor for the features/functions it provides. The understanding of
what the paravisor provides is hard-coded in the guest OS.

Paravisor mode has similarities to the `Coconut project`_, which aims to
provide a limited paravisor that offers services, such as a virtual TPM, to
the guest.
However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
than is currently envisioned for Coconut, and so is further toward the "no
guest enlightenments required" end of the spectrum.

.. _Coconut project: https://github.com/coconut-svsm/svsm

In the CoCo VM threat model, the paravisor is in the guest security domain
and must be trusted by the guest OS. By implication, the hypervisor/VMM must
protect itself against a potentially malicious paravisor just like it
protects against a potentially malicious guest.

The hardware architectural approach to fully-enlightened vs. paravisor mode
varies depending on the underlying processor.

* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
  VMPL 0 and has full control of the guest context. In paravisor mode, the
  guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
  running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
  Certain operations require the guest to invoke the paravisor. Furthermore, in
  paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
  as defined by the SEV-SNP architecture. This mode simplifies guest management
  of memory encryption when a paravisor is used.

* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an
  L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
  L1 VM, and the guest OS runs in a nested L2 VM.

Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
whether a paravisor is being used. It is straightforward to build a single
kernel image that can boot and run properly on either architecture, and in
either mode.
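
Because the mode is fixed for the life of the VM, kernel code typically
checks it once and branches. Below is a minimal, illustrative sketch only,
assuming the existing helpers hv_is_isolation_supported(),
hv_isolation_type_snp(), hv_isolation_type_tdx(), and the
ms_hyperv.paravisor_present flag behave as described above::

  #include <linux/printk.h>
  #include <asm/mshyperv.h>

  static void report_coco_mode(void)
  {
          if (!hv_is_isolation_supported()) {
                  pr_info("Normal (non-CoCo) Hyper-V guest\n");
                  return;
          }

          if (hv_isolation_type_snp())
                  pr_info("AMD SEV-SNP CoCo VM\n");
          else if (hv_isolation_type_tdx())
                  pr_info("Intel TDX CoCo VM\n");

          if (ms_hyperv.paravisor_present)
                  pr_info("Running in paravisor mode\n");
          else
                  pr_info("Running in fully-enlightened mode\n");
  }
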

Paravisor Effects
-----------------
Running in paravisor mode affects the following areas of generic Linux kernel
CoCo VM functionality:

* Initial guest memory setup. When a new VM is created in paravisor mode, the
  paravisor runs first and sets up the guest physical memory as encrypted. The
  guest Linux does normal memory initialization, except for explicitly marking
  appropriate ranges as decrypted (shared). In paravisor mode, Linux does not
  perform the early boot memory setup steps that are particularly tricky with
  AMD SEV-SNP in fully-enlightened mode.

* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
  CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
  respectively, and not the guest Linux. Consequently, these exception handlers
  do not run in the guest Linux and are not a required enlightenment for a
  Linux guest in paravisor mode.

* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
  guest indicating that the VM is operating with the respective hardware
  support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
  the paravisor filters out these flags and the guest Linux does not see them.
  Throughout the Linux kernel, explicitly testing these flags has mostly been
  eliminated in favor of the cc_platform_has() function, with the goal of
  abstracting the differences between SEV-SNP and TDX. (See the first sketch
  following this list.) But the cc_platform_has() abstraction also allows the
  Hyper-V paravisor configuration to selectively enable aspects of CoCo VM
  functionality even when the CPUID flags are not set. The exception is early
  boot memory setup on SEV-SNP, which tests the CPUID SEV-SNP flag. But not
  having the flag in a Hyper-V paravisor mode VM achieves the desired effect
  of not running SEV-SNP specific early boot memory setup.

* Device emulation. In paravisor mode, the Hyper-V paravisor provides
  emulation of devices such as the IO-APIC and TPM. Because the emulation
  happens in the paravisor in the guest context (instead of the hypervisor/VMM
  context), MMIO accesses to these devices must be encrypted references instead
  of the decrypted references that would be used in a fully-enlightened CoCo
  VM. The __ioremap_caller() function has been enhanced to make a callback to
  check whether a particular address range should be treated as encrypted
  (private). See the "is_private_mmio" callback.

* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
  memory between encrypted and decrypted requires coordinating with the
  hypervisor/VMM. This is done via callbacks invoked from
  __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
  TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
  specific set of callbacks is used. These callbacks invoke the paravisor so
  that the paravisor can coordinate the transitions and inform the hypervisor
  as necessary. See hv_vtom_init() where these callbacks are set up, and the
  second sketch following this list.

* Interrupt injection. In fully-enlightened mode, a malicious hypervisor
  could inject interrupts into the guest OS at times that violate x86/x64
  architectural rules. For full protection, the guest OS should include
  enlightenments that use the interrupt injection management features provided
  by CoCo-capable processors. In paravisor mode, the paravisor mediates
  interrupt injection into the guest OS, and ensures that the guest OS only
  sees interrupts that are "legal". The paravisor uses the interrupt injection
  management features provided by the CoCo-capable physical processor, thereby
  masking these complexities from the guest OS.
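
The first sketch below illustrates the cc_platform_has() usage mentioned
above: a single test covers SEV-SNP, TDX, and Hyper-V paravisor (vTOM) guests
without checking processor-specific CPUID flags. It is illustrative only::

  #include <linux/cc_platform.h>

  static bool need_shared_buffers(void)
  {
          /*
           * True for SEV-SNP, TDX, and Hyper-V vTOM guests alike, even
           * when the paravisor hides the SEV-SNP/TDX CPUID flags.
           */
          return cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT);
  }

The second sketch shows, in abridged form, the kind of callback registration
that hv_vtom_init() performs so that the generic memory-transition and MMIO
paths call into the Hyper-V specific code. It is a simplified illustration of
the pattern, not a copy of the real function::

  /* Abridged sketch of hv_vtom_init()-style callback registration. */
  x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
  x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
  x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
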

Hyper-V Hypercalls
------------------
When in fully-enlightened mode, hypercalls made by the Linux guest are routed
directly to the hypervisor, just as in a non-CoCo VM. In paravisor mode,
however, normal hypercalls trap to the paravisor first, which may in turn
invoke the hypervisor. But the paravisor is idiosyncratic in this regard, and
a few hypercalls made by the Linux guest must always be routed directly to
the hypervisor. These hypercall sites test for a paravisor being present, and
use a special invocation sequence. See hv_post_message(), for example.
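
A condensed sketch of that routing pattern is shown below, loosely modeled on
hv_post_message(). The helpers named here (hv_do_hypercall(),
hv_ghcb_hypercall(), hv_tdx_hypercall()) exist in the kernel, but the
function shape and argument details are simplified for illustration::

  #include <asm/mshyperv.h>

  static u64 post_message_example(void *msg, u32 msg_size, u64 msg_pa)
  {
          /*
           * Illustrative only: when a paravisor is present, this hypercall
           * must bypass it and reach the hypervisor directly via the GHCB
           * (SEV-SNP) or a TDCALL (TDX). Otherwise the normal path is used.
           */
          if (ms_hyperv.paravisor_present && hv_isolation_type_tdx())
                  return hv_tdx_hypercall(HVCALL_POST_MESSAGE, msg_pa, 0);

          if (ms_hyperv.paravisor_present && hv_isolation_type_snp())
                  return hv_ghcb_hypercall(HVCALL_POST_MESSAGE, msg,
                                           NULL, msg_size);

          return hv_do_hypercall(HVCALL_POST_MESSAGE, msg, NULL);
  }
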

Guest communication with Hyper-V
--------------------------------
Separate from the generic Linux kernel handling of memory encryption in Linux
CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
shared between the Linux guest and the host. This shared memory must be
marked decrypted to enable communication. Furthermore, since the threat model
includes a compromised and potentially malicious host, the guest must guard
against leaking any unintended data to the host through this shared memory.

These Hyper-V and VMBus memory pages are marked as decrypted:

* VMBus monitor pages

* Synthetic interrupt controller (SynIC) related pages (unless supplied by
  the paravisor)

* Per-cpu hypercall input and output pages (unless running with a paravisor)

* VMBus ring buffers. The direct mapping is marked decrypted in
  __vmbus_establish_gpadl(). The secondary mapping created in
  hv_ringbuffer_init() must also include the "decrypted" attribute.
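
The pattern for such pages is to mark them decrypted when they are set up and
to mark them encrypted again before they are freed. A minimal sketch of this
lifecycle, using the existing set_memory_decrypted()/set_memory_encrypted()
APIs (cf. __vmbus_establish_gpadl()), is::

  #include <linux/set_memory.h>
  #include <linux/string.h>
  #include <linux/printk.h>

  static int share_pages_with_host(void *buf, int numpages)
  {
          int ret = set_memory_decrypted((unsigned long)buf, numpages);

          if (ret)
                  return ret;
          /* Never expose stale kernel data through the shared pages. */
          memset(buf, 0, numpages * PAGE_SIZE);
          return 0;
  }

  static void unshare_pages_from_host(void *buf, int numpages)
  {
          /* If re-encryption fails, the pages must not be reused or freed. */
          if (set_memory_encrypted((unsigned long)buf, numpages))
                  pr_err("failed to re-encrypt pages; leaking them\n");
  }
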

When the guest writes data to memory that is shared with the host, it must
ensure that only the intended data is written. Padding or unused fields must
be initialized to zeros before copying into the shared memory so that random
kernel data is not inadvertently given to the host.
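
For example, a request structure should be fully zeroed before its fields are
filled in, so that compiler-inserted padding and unused members never carry
stale kernel data into host-visible memory. The sketch below is illustrative
only; "struct example_request" and its fields are hypothetical::

  #include <linux/string.h>
  #include <linux/types.h>

  struct example_request {
          u32 op;
          u32 len;
          u8  reserved[8];        /* unused; must not leak kernel data */
  };

  static void fill_request(void *shared_page, u32 len)
  {
          struct example_request req;

          memset(&req, 0, sizeof(req));   /* zero padding/unused fields */
          req.op = 1;
          req.len = len;
          memcpy(shared_page, &req, sizeof(req));
  }
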

Similarly, when the guest reads memory that is shared with the host, it must
validate the data before acting on it so that a malicious host cannot induce
the guest to expose unintended data. Doing such validation can be tricky
because the host can modify the shared memory areas even while or after
validation is performed. For messages passed from the host to the guest in a
VMBus ring buffer, the length of the message is validated, and the message is
copied into a temporary (encrypted) buffer for further validation and
processing. The copying adds a small amount of overhead, but is the only way
to protect against a malicious host. See hv_pkt_iter_first().
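
The essential idea is to snapshot host-written data into private (encrypted)
memory first and validate only the private copy, so the host cannot change it
after the checks pass. A simplified sketch, assuming a hypothetical maximum
length and a hypothetical process_packet() consumer, is::

  #include <linux/errno.h>
  #include <linux/hyperv.h>
  #include <linux/string.h>

  /* desc points into the decrypted ring buffer shared with the host. */
  static int handle_packet(const struct vmpacket_descriptor *desc, u32 max_len)
  {
          struct vmpacket_descriptor local;

          memcpy(&local, desc, sizeof(local));    /* snapshot first */

          /* Validate only the private copy; lengths are in 8-byte units. */
          if (local.len8 * 8 > max_len || local.offset8 > local.len8)
                  return -EINVAL;

          return process_packet(&local);          /* hypothetical consumer */
  }
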

Many drivers for VMBus devices have been "hardened" by adding code to fully
validate messages received over VMBus, instead of assuming that Hyper-V is
acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
CoCo VM have not been hardened, and they are not allowed to load in a CoCo
VM. See vmbus_is_valid_offer() where such devices are excluded.

Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
memory is done implicitly. netvsc has two modes for data transfers. The first
mode goes through send and receive buffer space that is explicitly allocated
by the netvsc driver, and is used for most smaller packets. These send and
receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
the netvsc driver explicitly copies packets to/from these buffers, the
equivalent of bounce buffering between encrypted and decrypted memory is
already part of the data path. The second mode uses the normal Linux kernel
DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
storvsc.

Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
Linux PCI device drivers access PCI config space using standard APIs provided
by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
encryption prevents Hyper-V from reading the guest instruction stream to
emulate the access. So in a CoCo VM, these functions must make a hypercall
with arguments explicitly describing the access. See
_hv_pcifront_read_config() and _hv_pcifront_write_config() and the
"use_calls" flag that indicates hypercalls should be used.
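
The shape of that logic is sketched below. This is illustrative only: the
"use_calls" test mirrors the flag mentioned above, but "struct example_bus"
and hv_cfg_read_hypercall() are hypothetical stand-ins for the real code in
_hv_pcifront_read_config()::

  /*
   * Read a PCI config space value for a vPCI device.
   *
   * In a CoCo VM (use_calls == true) the access is described explicitly
   * in a hypercall, because a trapped MMIO access cannot be emulated by
   * a host that is unable to read the encrypted guest instruction stream.
   */
  static u32 read_cfg_example(struct example_bus *bus, void __iomem *cfg,
                              u32 where)
  {
          if (bus->use_calls)     /* CoCo VM: hypercall describes the access */
                  return hv_cfg_read_hypercall(bus, where);   /* hypothetical */

          return readl(cfg + where);      /* normal trapped MMIO access */
  }
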

Confidential VMBus
------------------
Confidential VMBus enables the confidential guest to avoid interacting with
the untrusted host partition and the untrusted hypervisor. Instead, the guest
relies on the trusted paravisor to communicate with the devices processing
sensitive data. The hardware (SNP or TDX) encrypts the guest memory and the
register state, and the paravisor image is measured using the platform
security processor, to provide trusted and confidential computing.

Confidential VMBus provides a secure communication channel between the guest
and the paravisor, ensuring that sensitive data is protected from hypervisor-
level access through memory encryption and register state isolation.

Confidential VMBus is an extension of Confidential Computing (CoCo) VMs
(a.k.a. "Isolated" VMs in Hyper-V terminology). Without Confidential VMBus,
guest VMBus device drivers (the "VSC"s in VMBus terminology) communicate
with VMBus servers (the VSPs) running on the Hyper-V host. The
communication must be through memory that has been decrypted so the
host can access it. With Confidential VMBus, one or more of the VSPs reside
in the trusted paravisor layer in the guest VM. Since the paravisor layer also
operates in encrypted memory, the memory used for communication with
such VSPs does not need to be decrypted and thereby exposed to the
Hyper-V host. The paravisor is responsible for communicating securely
with the Hyper-V host as necessary.

The data is transferred directly between the VM and a vPCI device (a.k.a.
a PCI pass-thru device, see :doc:`vpci`) that is directly assigned to VTL2
and that supports encrypted memory. In such a case, neither the host partition
nor the hypervisor has any access to the data. The guest needs to establish
a VMBus connection only with the paravisor for the channels that process
sensitive data, and the paravisor abstracts away the details of communicating
with the specific devices, providing the guest with the well-established
VSP (Virtualization Service Provider) interface that has had support in the
Hyper-V drivers for a decade.

If the device does not support encrypted memory, the paravisor provides
bounce buffering, and although the data is not encrypted, the backing
pages aren't mapped into the host partition through SLAT. While not impossible,
it becomes much more difficult for the host partition to exfiltrate the data
than it would be with a conventional VMBus connection where the host partition
has direct access to the memory used for communication.

Here is the data flow for a conventional VMBus connection (`C` stands for the
client or VSC, `S` for the server or VSP; the `DEVICE` is a physical one,
possibly with multiple virtual functions)::

  +---- GUEST ----+       +----- DEVICE ----+        +----- HOST -----+
  |               |       |                 |        |                |
  |               |       |                 |        |                |
  |               |       |                 ==========                |
  |               |       |                 |        |                |
  |               |       |                 |        |                |
  |               |       |                 |        |                |
  +----- C -------+       +-----------------+        +------- S ------+
         ||                                                   ||
         ||                                                   ||
  +------||------------------ VMBus --------------------------||------+
  |                     Interrupts, MMIO                              |
  +-------------------------------------------------------------------+

and the Confidential VMBus connection::

  +---- GUEST --------------- VTL0 ------+               +-- DEVICE --+
  |                                      |               |            |
  | +- PARAVISOR --------- VTL2 -----+   |               |            |
  | |     +-- VMBus Relay ------+    ====+================            |
  | |     |   Interrupts, MMIO  |    |   |               |            |
  | |     +-------- S ----------+    |   |               +------------+
  | |               ||               |   |
  | +---------+     ||               |   |
  | |  Linux  |     ||    OpenHCL    |   |
  | |  kernel |     ||               |   |
  | +---- C --+-----||---------------+   |
  |       ||        ||                   |
  +-------++------- C -------------------+               +------------+
          ||                                             |    HOST    |
          ||                                             +---- S -----+
  +-------||----------------- VMBus ---------------------------||-----+
  |                     Interrupts, MMIO                              |
  +-------------------------------------------------------------------+

An implementation of the VMBus relay that offers the Confidential VMBus
channels is available in the OpenVMM project as a part of the OpenHCL
paravisor. Please refer to

  * https://openvmm.dev/, and
  * https://github.com/microsoft/openvmm

for more information about the OpenHCL paravisor.

A guest running with a paravisor must determine at runtime whether
Confidential VMBus is supported by the current paravisor. The x86_64-specific
approach relies on the CPUID Virtualization Stack leaf; the ARM64
implementation is expected to support Confidential VMBus unconditionally when
running ARM CCA guests.

Confidential VMBus is a characteristic of the VMBus connection as a whole,
and of each VMBus channel that is created. When a Confidential VMBus
connection is established, the paravisor provides the guest with the
message-passing path that is used for VMBus device creation and deletion, and
it provides a per-CPU synthetic interrupt controller (SynIC) just like the
SynIC that is offered by the Hyper-V host. Each VMBus device that is offered
to the guest indicates the degree to which it participates in Confidential
VMBus. The offer indicates if the device uses encrypted ring buffers, and if
the device uses encrypted memory for DMA that is done outside the ring
buffer. These settings may be different for different devices using the same
Confidential VMBus connection.

Although these settings are separate, in practice a device uses either an
encrypted ring buffer only, or an encrypted ring buffer along with encrypted
external data. If a channel is offered by the paravisor with Confidential
VMBus, the ring buffer can always be encrypted since it is strictly for
communication between the VTL2 paravisor and the VTL0 guest. However, other
memory regions are often used for purposes such as DMA, so they need to be
accessible by the underlying hardware, and must be unencrypted (unless the
device supports encrypted memory). Currently, there are not any VSPs in
OpenHCL that support encrypted external memory, but future versions are
expected to enable this capability.

Because some devices on a Confidential VMBus may require decrypted ring buffers
and DMA transfers, the guest must interact with two SynICs -- the one provided
by the paravisor and the one provided by the Hyper-V host when Confidential
VMBus is not offered. Interrupts are always signaled by the paravisor SynIC,
but the guest must check for messages and for channel interrupts on both SynICs.

With Confidential VMBus, regular SynIC access by the guest is
intercepted by the paravisor (this includes various MSRs such as the SIMP and
SIEFP, as well as hypercalls like HvPostMessage and HvSignalEvent). If the
guest actually wants to communicate with the hypervisor, it has to use special
mechanisms (the GHCB page on SNP, or TDCALL on TDX). Messages can be of either
kind: with Confidential VMBus, messages use the paravisor SynIC, and if the
guest chooses to communicate directly with the hypervisor, they use the
hypervisor SynIC. For interrupt signaling, some channels may be running on the
host (non-confidential, using the VMBus relay) and use the hypervisor SynIC,
and some on the paravisor and use its SynIC. The RelIDs are coordinated by the
OpenHCL VMBus server and are guaranteed to be unique regardless of whether
the channel originated on the host or the paravisor.

load_unaligned_zeropad()
------------------------
When transitioning memory between encrypted and decrypted, the caller of
set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
the memory isn't in use and isn't referenced while the transition is in
progress. The transition has multiple steps, and includes interaction with
the Hyper-V host. The memory is in an inconsistent state until all steps are
complete. A reference while the state is inconsistent could result in an
exception that can't be cleanly fixed up.

However, the kernel load_unaligned_zeropad() mechanism may make stray
references that can't be prevented by the caller of set_memory_encrypted() or
set_memory_decrypted(), so there's specific code in the #VC or #VE exception
handler to fix up this case. But a CoCo VM running on Hyper-V may be
configured to run with a paravisor, with the #VC or #VE exception routed to
the paravisor. There's no architectural way to forward the exceptions back to
the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
in the #VC/#VE handlers doesn't run.

To avoid this problem, the Hyper-V specific functions for notifying the
hypervisor of the transition mark pages as "not present" while a transition
is in progress. If load_unaligned_zeropad() causes a stray reference, a
normal page fault is generated instead of #VC or #VE, and the page-fault-based
handlers for load_unaligned_zeropad() fix up the reference. When the
encrypted/decrypted transition is complete, the pages are marked as "present"
again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
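
An abridged sketch of the approach follows. The callback shapes mirror the
x86_platform.guest enc_status_change hooks; notify_host_of_visibility() is a
hypothetical stand-in for the real hypercall sequence in
hv_vtom_set_host_visibility()::

  #include <linux/set_memory.h>

  /*
   * Make the pages "not present" before the transition starts, so any
   * stray load_unaligned_zeropad() reference takes a normal page fault.
   */
  static int example_status_change_prepare(unsigned long kbuffer,
                                           int pagecount, bool enc)
  {
          return set_memory_np(kbuffer, pagecount);
  }

  /*
   * Tell the paravisor/hypervisor about the new encrypted/decrypted
   * state, then make the pages "present" again.
   */
  static int example_status_change_finish(unsigned long kbuffer,
                                          int pagecount, bool enc)
  {
          /* notify_host_of_visibility() is hypothetical. */
          int ret = notify_host_of_visibility(kbuffer, pagecount, enc);

          if (ret)
                  return ret;
          return set_memory_p(kbuffer, pagecount);
  }
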