xref: /linux/Documentation/virt/hyperv/vpci.rst (revision 24168c5e6dfbdd5b414f048f47f75d64533296ca)
.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=====================
In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor.  This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor.  The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA).  Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs.  A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware.  See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux.  But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms.  Consequently, vPCI devices on Hyper-V
have a dual identity.  They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices.  The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a
bare-metal system.  Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal.  Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables.  vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time.  In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host.  That
connection has a single VMBus channel.  The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux.  Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed
from the VM while the VM is running.  The ongoing operation of
the device happens directly between the Linux device driver
for the device and the hardware, with VMBus and the VMBus
channel playing no role.

PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows.  Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge.  The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device.  The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions.  The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.
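
The scheme can be sketched in user-space C as follows.  This
is a minimal illustration only: the function names, the bitmap,
and the linear-probe collision strategy are assumptions made
for the example, and the authoritative logic is in
hv_pci_probe() in pci-hyperv.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch: derive a candidate PCI domain ID from bytes
 * 4 and 5 of a VMBus instance GUID, then probe forward past domain
 * IDs that are already in use.  All names here are invented.
 */
#define HVPCI_DOM_MAP_SIZE 0x10000

static bool hvpci_dom_in_use[HVPCI_DOM_MAP_SIZE];

static uint16_t hvpci_candidate_domain(const uint8_t guid[16])
{
	/* Bytes 4 and 5 of the instance GUID form the candidate. */
	return (uint16_t)(guid[5] << 8 | guid[4]);
}

static int hvpci_claim_domain(uint16_t candidate)
{
	/* Probe forward (wrapping at 64K) until a free ID is found. */
	for (unsigned int i = 0; i < HVPCI_DOM_MAP_SIZE; i++) {
		uint16_t dom = (uint16_t)(candidate + i);

		if (!hvpci_dom_in_use[dom]) {
			hvpci_dom_in_use[dom] = true;
			return dom;
		}
	}
	return -1;	/* all 64K domain IDs exhausted */
}
```

Because the candidate is a pure function of the instance GUID,
a device that keeps its GUID across reboots tends to land on
the same domain ID, which is the stability property the text
describes.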

hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device.  This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter d0.  See
hv_pci_enter_d0().  When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs.  That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus.  At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM.  The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device.  Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device.  When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed.  At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device.  Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel.  The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest.  If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference.  Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.
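
The removal handshake can be modeled as a small state machine.
The state and event names below are invented for this
illustration and do not correspond to identifiers in
pci-hyperv.c; the real handling is spread across the eject and
rescind paths of that driver.

```c
/*
 * Toy model of the vPCI removal handshake.  All names are invented
 * for illustration.
 */
enum hvpci_dev_state {
	HVPCI_RUNNING,		/* device operating normally */
	HVPCI_EJECT_PENDING,	/* host sent unsolicited "Eject" */
	HVPCI_EJECT_COMPLETE,	/* guest tore down the device, replied */
	HVPCI_RESCINDED,	/* host rescinded the VMBus identity */
};

enum hvpci_event {
	HVPCI_EVT_EJECT,	 /* "Eject" message from the host */
	HVPCI_EVT_TEARDOWN_DONE, /* PCI subsystem removal finished */
	HVPCI_EVT_RESCIND,	 /* VMBus rescind message from the host */
};

static enum hvpci_dev_state
hvpci_next_state(enum hvpci_dev_state s, enum hvpci_event e)
{
	switch (s) {
	case HVPCI_RUNNING:
		if (e == HVPCI_EVT_EJECT)
			return HVPCI_EJECT_PENDING;
		break;
	case HVPCI_EJECT_PENDING:
		if (e == HVPCI_EVT_TEARDOWN_DONE)
			return HVPCI_EJECT_COMPLETE; /* reply sent to host */
		if (e == HVPCI_EVT_RESCIND)
			return HVPCI_RESCINDED;	/* forced rescind by host */
		break;
	case HVPCI_EJECT_COMPLETE:
		if (e == HVPCI_EVT_RESCIND)
			return HVPCI_RESCINDED;	/* VMBus identity removed */
		break;
	case HVPCI_RESCINDED:
		break;
	}
	return s;	/* ignore events that don't apply in this state */
}
```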

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message.  If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky.  Ejection has been
observed even before a newly offered vPCI device has been
fully set up.  The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times.  Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X.  Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces.  For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU.  Finally, hv_irq_unmask() is called (on x86) or
the GICD registers are set (on arm64) to specify the real
vCPU again.  Each of these three calls interacts with
Hyper-V, which must decide which physical CPU should
receive the interrupt before it is forwarded to the guest VM.
Unfortunately, the Hyper-V decision-making process is a bit
limited, and can result in concentrating the physical
interrupts on a single CPU, causing a performance bottleneck.
See details about how this is resolved in the extensive
comment above the function hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message.  Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt.  Instead hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message.  As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well.  See comments in the code regarding this
very tricky area.
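
The send-then-poll pattern can be sketched as below.  This is a
simplified model, not the driver's code: the struct, flags, and
bounded spin count are assumptions standing in for the real
VMBus channel machinery in hv_compose_msi_msg().

```c
#include <stdbool.h>

/*
 * Illustrative model: because irq_chip.irq_compose_msi_msg can be
 * called with IRQ locks held, the reply cannot be awaited by
 * sleeping; the driver must poll.  All names here are invented.
 */
struct hvpci_comp {
	volatile bool completed;	/* reply message has arrived */
	volatile bool rescinded;	/* device ejected mid-request */
	long spins_left;		/* bounded wait for this model */
};

/* Returns 0 when the reply arrives, -1 on rescind or timeout. */
static int hvpci_poll_for_reply(struct hvpci_comp *comp)
{
	while (!comp->completed) {
		if (comp->rescinded)
			return -1;	/* device vanished while polling */
		if (comp->spins_left-- <= 0)
			return -1;	/* give up in this toy model */
		/* The real code also services the channel callback here. */
	}
	return 0;
}
```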

Most of the code in the Hyper-V virtual PCI driver
(pci-hyperv.c) applies to Hyper-V and Linux guests running on
x86 and on arm64 architectures.  But there are differences in
how interrupt assignments are managed.  On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt.  This
hypercall is made by hv_arch_irq_unmask().  On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt.  The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86.  Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs.  If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly.  However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break.  Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen.  Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory.  Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers.  The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host.  From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA.  When running on x86, this behavior is
required by the architecture.  When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT.  But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a
VMBus device and as a PCI device).  See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest.  Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used.  The version is negotiated when communication over
the VMBus channel is first established.  See
hv_pci_protocol_negotiation().  Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.
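
A newest-first negotiation can be sketched as below.  The
version constants and the "host accepts anything up to a
maximum" model are simplifying assumptions for this example;
the real version list and the per-version VMBus message round
trips are in hv_pci_protocol_negotiation() in pci-hyperv.c.

```c
#include <stdint.h>

/*
 * Illustrative sketch: propose protocol versions newest-first and
 * settle on the first one the host accepts.  Version values here
 * are invented placeholders.
 */
#define HVPCI_MAKE_VERSION(major, minor) (((uint32_t)(major) << 16) | (minor))

static const uint32_t hvpci_versions[] = {
	HVPCI_MAKE_VERSION(1, 4),	/* newest first */
	HVPCI_MAKE_VERSION(1, 3),
	HVPCI_MAKE_VERSION(1, 2),
	HVPCI_MAKE_VERSION(1, 1),
};

/* host_max models the newest version the host will accept. */
static uint32_t hvpci_negotiate(uint32_t host_max)
{
	for (unsigned int i = 0;
	     i < sizeof(hvpci_versions) / sizeof(hvpci_versions[0]); i++) {
		/* In the real driver, each attempt is a message round trip. */
		if (hvpci_versions[i] <= host_max)
			return hvpci_versions[i];
	}
	return 0;	/* no common version: negotiation fails */
}
```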

Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver.  See
hv_pci_assign_numa_node().  If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0.  But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options.  If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.
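
The fallback behavior reduces to a small helper like the one
below.  The signature is invented for this sketch; see
hv_pci_assign_numa_node() for the real logic.

```c
#include <stdbool.h>

/*
 * Illustrative sketch of the NUMA-node fallback.  The name and
 * signature are invented.
 */
static int hvpci_numa_node(bool proto_has_numa, int host_numa_node)
{
	/*
	 * Without NUMA information in the negotiated protocol, fall
	 * back to node 0.  Note the ambiguity described above: a
	 * value of 0 may equally mean "node 0" or "host had nothing
	 * to report".
	 */
	if (!proto_has_numa)
		return 0;
	return host_numa_node;
}
```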

PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver.  In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.
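
The split between the two access paths can be sketched as
follows.  Both helpers are stand-ins invented for this example:
a real normal-VM path issues an MMIO access that traps to
Hyper-V, and a real CoCo path issues a hypercall with explicit
arguments; see hv_pcifront_read_config() for the real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an MMIO read that would trap to Hyper-V. */
static uint32_t mock_mmio_read(uint16_t offset)
{
	return 0x8086 + offset;
}

/* Stand-in for a hypercall describing the access explicitly. */
static uint32_t mock_hypercall_read(uint16_t offset)
{
	return 0x8086 + offset;
}

static uint32_t hvpci_read_config(bool coco_vm, uint16_t offset)
{
	/*
	 * In a CoCo VM, memory encryption prevents the host from
	 * decoding the guest's MMIO instruction, so the access must
	 * be described explicitly in a hypercall instead.
	 */
	if (coco_vm)
		return mock_hypercall_read(offset);
	return mock_mmio_read(offset);
}
```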

Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest.  The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel.  As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.