.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=====================
In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor. This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor. The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA). Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs. A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware. See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux. But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms. Consequently, vPCI devices on Hyper-V
have a dual identity. They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a
bare-metal system. Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal. Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables. vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time. In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host. That
connection has a single VMBus channel. The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux.
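
For orientation, the general shape of this VMBus probe-and-open
sequence can be sketched as follows. This is a minimal illustration
only, assuming made-up sketch_* names, an arbitrary ring-buffer size,
and an empty channel callback; the real logic, including all of the
protocol handling, is in hv_pci_probe() in pci-hyperv.c::

  /*
   * Hedged sketch of the general VMBus probe shape only; the sketch_*
   * names, ring-buffer size, and empty callback are assumptions, not
   * the actual hv_pci_probe() code in pci-hyperv.c.
   */
  #include <linux/hyperv.h>
  #include <linux/slab.h>

  #define SKETCH_RING_SIZE (4 * PAGE_SIZE)  /* illustrative size only */

  struct sketch_pcibus_device {
          struct hv_device *hdev;
          /* ... protocol state, completions for setup replies, etc. ... */
  };

  /* Runs when the host places a packet in the channel's ring buffer. */
  static void sketch_channel_callback(void *context)
  {
          /*
           * The real driver reads the packet here and completes whichever
           * setup request (version query, BAR query, D0 entry, ...) was
           * waiting for this reply.
           */
  }

  static int sketch_probe(struct hv_device *hdev,
                          const struct hv_vmbus_device_id *dev_id)
  {
          struct sketch_pcibus_device *hbus;
          int ret;

          hbus = kzalloc(sizeof(*hbus), GFP_KERNEL);
          if (!hbus)
                  return -ENOMEM;
          hbus->hdev = hdev;
          hv_set_drvdata(hdev, hbus);

          /* Open the single VMBus channel used for vPCI setup messages. */
          ret = vmbus_open(hdev->channel, SKETCH_RING_SIZE, SKETCH_RING_SIZE,
                           NULL, 0, sketch_channel_callback, hbus);
          if (ret) {
                  kfree(hbus);
                  return ret;
          }

          /*
           * The real driver now negotiates the protocol version, allocates
           * MMIO for config space and BARs, enters D0, and fabricates the
           * PCI bus topology so that generic PCI code can scan it.
           */
          return 0;
  }
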
Once the device is fully configured in Linux as a PCI device,
the VMBus channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed from
the VM while the VM is running. The ongoing operation of the
device happens directly between the Linux device driver for
the device and the hardware, with VMBus and the VMBus channel
playing no role.

PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows. Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge. The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device. The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions. The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.

hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device. This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter D0. See
hv_pci_enter_d0(). When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs. That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus. At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM. The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device. Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device. When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed. At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device. Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel.
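
The shape of that asynchronous teardown can be sketched using
standard PCI subsystem calls. The sketch_* names and the surrounding
message plumbing are illustrative assumptions; the actual eject
handling in pci-hyperv.c also tears down the fabricated bus, host
bridge, and associated driver state::

  /*
   * Hedged sketch of the eject flow described above; the sketch_* names
   * and message plumbing are assumptions, not the pci-hyperv.c internals.
   */
  #include <linux/kernel.h>
  #include <linux/pci.h>
  #include <linux/workqueue.h>

  struct sketch_vpci_dev {
          struct work_struct eject_work;
          int domain;             /* PCI domain fabricated for this vPCI device */
          unsigned int devfn;     /* device/function on the fabricated bus */
  };

  static void sketch_eject_work(struct work_struct *work)
  {
          struct sketch_vpci_dev *vdev =
                  container_of(work, struct sketch_vpci_dev, eject_work);
          struct pci_dev *pdev;

          /* Serialize against other PCI bus add/remove activity. */
          pci_lock_rescan_remove();
          pdev = pci_get_domain_bus_and_slot(vdev->domain, 0, vdev->devfn);
          if (pdev) {
                  pci_stop_and_remove_bus_device(pdev);
                  pci_dev_put(pdev);
          }
          pci_unlock_rescan_remove();

          /*
           * At this point the real driver sends "Ejection Complete" back
           * to the host over the VMBus channel and releases its own state.
           */
  }

  /* Called from the VMBus channel callback when an Eject message arrives. */
  static void sketch_on_eject(struct sketch_vpci_dev *vdev)
  {
          INIT_WORK(&vdev->eject_work, sketch_eject_work);
          schedule_work(&vdev->eject_work);
  }

Deferring the PCI teardown to a work item is what makes the removal
asynchronous with respect to the VMBus channel callback that receives
the Eject message.
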
The rescind message also indicates to the guest that Hyper-V
has stopped providing support for the vPCI device in the
guest. If the guest were to attempt to access that device's
MMIO space, it would be an invalid reference. Hypercalls
affecting the device return errors, and any further messages
sent in the VMBus channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message. If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky. Ejection has been
observed even before a newly offered vPCI device has been
fully set up. The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times. Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces. For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU. Furthermore, hv_irq_unmask() is finally called
(on x86) or the GICD registers are set (on arm64) to specify
the real vCPU again. Each of these three calls interacts
with Hyper-V, which must decide which physical CPU should
receive the interrupt before it is forwarded to the guest VM.
Unfortunately, the Hyper-V decision-making process is a bit
limited, and can result in concentrating the physical
interrupts on a single CPU, causing a performance bottleneck.
See details about how this is resolved in the extensive
comment above the function hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message. Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt. Instead hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message. As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well. See comments in the code regarding this
very tricky area.

Most of the code in the Hyper-V virtual PCI driver
(pci-hyperv.c) applies to Hyper-V and Linux guests running on
x86 and on arm64 architectures. But there are differences in
how interrupt assignments are managed.
On x86, the Hyper-V virtual PCI driver in the guest must make
a hypercall to tell Hyper-V which guest vCPU should be
interrupted by each MSI/MSI-X interrupt, and the x86 interrupt
vector number that the x86_vector IRQ domain has picked for
the interrupt. This hypercall is made by hv_arch_irq_unmask().
On arm64, the Hyper-V virtual PCI driver manages the
allocation of an SPI for each MSI/MSI-X interrupt. The Hyper-V
virtual PCI driver stores the allocated SPI in the
architectural GICD registers, which Hyper-V emulates, so no
hypercall is necessary as with x86. Hyper-V does not support
using LPIs for vPCI devices in arm64 guest VMs because it does
not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs. If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly. However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break. Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen. Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory. Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers. The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host. From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA. When running on x86, this behavior is
required by the architecture. When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT. But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a
VMBus device and as a PCI device). See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

vPCI protocol versions
----------------------
As previously described, during vPCI device setup and
teardown, messages are passed over a VMBus channel between the
Hyper-V host and the Hyper-V vPCI driver in the Linux guest.
Some messages have been revised in newer versions of Hyper-V,
so the guest and host must agree on the vPCI protocol version
to be used. The version is negotiated when communication over
the VMBus channel is first established. See
hv_pci_protocol_negotiation().
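
Conceptually, the negotiation offers the protocol versions the guest
understands, from newest to oldest, and settles on the first one the
host accepts. The sketch below shows only that pattern, with assumed
version values and a simulated host reply standing in for the real
VMBus message exchange::

  /*
   * Hedged sketch of the negotiation pattern only; the version values
   * and the simulated host reply are assumptions, not the real message
   * exchange in hv_pci_protocol_negotiation().
   */
  #include <linux/errno.h>
  #include <linux/kernel.h>
  #include <linux/types.h>

  /* Illustrative protocol version identifiers, newest first. */
  static const u32 sketch_versions[] = { 0x00010004, 0x00010002, 0x00010001 };

  /*
   * Stand-in for "send a version-query message over the VMBus channel
   * and wait for the host's reply"; here the host is simulated as
   * accepting anything up to an assumed maximum.
   */
  static int sketch_query_version(u32 version)
  {
          const u32 sketch_host_max = 0x00010002; /* assumed host capability */

          return version <= sketch_host_max ? 0 : -EPROTO;
  }

  static int sketch_negotiate(u32 *negotiated)
  {
          int i;

          for (i = 0; i < ARRAY_SIZE(sketch_versions); i++) {
                  if (sketch_query_version(sketch_versions[i]) == 0) {
                          *negotiated = sketch_versions[i];
                          return 0;  /* highest mutually supported version */
                  }
          }
          return -EPROTO;  /* no common version; device setup fails */
  }
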
Newer versions of the protocol extend support to VMs with more
than 64 vCPUs, and provide additional information about the
vPCI device, such as the guest virtual NUMA node to which it
is most closely affined in the underlying hardware.

Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver. See
hv_pci_assign_numa_node(). If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0. But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options. If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.

PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver. In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.

Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest. The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel. As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
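
As a rough usage sketch, a driver that holds the struct pci_dev for a
vPCI device might call the back-channel interface as shown below. The
block ID and the sketch_* helpers are illustrative assumptions; see
include/linux/hyperv.h for the actual prototypes and the mlx5 driver
for a real caller::

  /*
   * Hedged usage sketch; the prototypes are declared in
   * include/linux/hyperv.h and the block ID here is an assumed,
   * device-specific value.
   */
  #include <linux/hyperv.h>
  #include <linux/pci.h>

  #define SKETCH_DIAG_BLOCK_ID 1  /* assumed block ID */

  /* Push a block of diagnostic data to the host. */
  static int sketch_push_diag_data(struct pci_dev *pdev, void *data,
                                   unsigned int len)
  {
          return hyperv_write_cfg_blk(pdev, data, len, SKETCH_DIAG_BLOCK_ID);
  }

  /* Read a block of data back from the host. */
  static int sketch_pull_diag_data(struct pci_dev *pdev, void *buf,
                                   unsigned int buf_len,
                                   unsigned int *bytes_returned)
  {
          return hyperv_read_cfg_blk(pdev, buf, buf_len, SKETCH_DIAG_BLOCK_ID,
                                     bytes_returned);
  }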