.. SPDX-License-Identifier: GPL-2.0

===========================
Affinity managed interrupts
===========================

The IRQ core provides support for managing interrupts according to a specified
CPU affinity. Under normal operation, an interrupt is associated with a
particular CPU. If that CPU is taken offline, the interrupt is migrated to
another online CPU.

Devices with large numbers of interrupt vectors can stress the available vector
space. For example, an NVMe device with 128 I/O queues typically requests one
interrupt per queue on systems with at least 128 CPUs. Two such devices
therefore request 256 interrupts. On x86, the interrupt vector space is
notoriously small, providing only 256 vectors per CPU, and the kernel reserves
a subset of these, further reducing the number available for device interrupts.
In practice this is not an issue because the interrupts are distributed across
many CPUs, so each CPU only receives a small number of vectors.

During system suspend, however, all secondary CPUs are taken offline and all
interrupts are migrated to the single CPU that remains online.
This can exhaust the available interrupt vectors on that CPU and cause the
suspend operation to fail.

Affinity-managed interrupts address this limitation. Each interrupt is assigned
a CPU affinity mask that specifies the set of CPUs on which the interrupt may
be targeted. When a CPU in the mask goes offline, the interrupt is moved to the
next CPU in the mask. If the last CPU in the mask goes offline, the interrupt
is shut down. Drivers using affinity-managed interrupts must ensure that the
associated queue is quiesced before the interrupt is disabled so that no
further interrupts are generated. When a CPU in the affinity mask comes back
online, the interrupt is re-enabled.

Implementation
--------------

Devices must provide per-instance interrupts, such as per-I/O-queue interrupts
for storage devices like NVMe. The driver allocates interrupt vectors with the
required affinity settings using struct irq_affinity. For MSI-X devices, this
is done via pci_alloc_irq_vectors_affinity() with the PCI_IRQ_AFFINITY flag
set.
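
The allocation step described above might look roughly like the following
sketch. It is illustrative only and builds solely inside a kernel tree. The
names demo_queue, demo_queue_irq() and demo_setup_irqs() are hypothetical, and
the layout of one unmanaged "admin" vector followed by one managed vector per
queue is an assumption rather than something this document prescribes.

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical per-queue context; a real driver has its own structure. */
struct demo_queue {
	int id;
};

/* Hypothetical handler: process completions for the queue in @data. */
static irqreturn_t demo_queue_irq(int irq, void *data)
{
	return IRQ_HANDLED;
}

/* Returns the number of queue vectors obtained, or a negative errno. */
static int demo_setup_irqs(struct pci_dev *pdev, struct demo_queue *queues,
			   unsigned int nr_queues)
{
	/* Assumption: vector 0 is an admin vector excluded from spreading. */
	struct irq_affinity affd = { .pre_vectors = 1 };
	int nvecs, i, ret;

	/* Affinity masks for the managed vectors are computed here. */
	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;

	/* Request only the queue vectors; the admin vector is left out. */
	for (i = 1; i < nvecs; i++) {
		ret = request_irq(pci_irq_vector(pdev, i), demo_queue_irq, 0,
				  "demo-queue", &queues[i - 1]);
		if (ret)
			goto err_free;
	}

	return nvecs - 1;

err_free:
	/* Release the handlers requested so far, then the vectors. */
	while (--i >= 1)
		free_irq(pci_irq_vector(pdev, i), &queues[i - 1]);
	pci_free_irq_vectors(pdev);
	return ret;
}
```

Because the call may return fewer vectors than requested, a driver written
this way would size its queue count from the return value instead of assuming
one vector per CPU.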

Based on the provided affinity information, the IRQ core attempts to spread the
interrupts evenly across the system. The affinity masks are computed during
this allocation step, but the final IRQ assignment is performed when
request_irq() is invoked.

Isolated CPUs
-------------

The affinity of managed interrupts is handled entirely in the kernel and cannot
be modified from user space through the /proc interfaces. The managed_irq
sub-parameter of the isolcpus boot option specifies a CPU mask that managed
interrupts should attempt to avoid. This isolation is best-effort and only
applies if the automatically assigned interrupt mask also contains online CPUs
outside the avoided mask. If the requested mask contains only isolated CPUs,
the setting has no effect.

CPUs listed in the avoided mask remain part of the interrupt's affinity mask.
This means that if all non-isolated CPUs go offline while isolated CPUs remain
online, the interrupt will be assigned to one of the isolated CPUs.

The following examples assume a system with 8 CPUs.

- A QEMU instance is booted with "-device virtio-scsi-pci".
  The MSI-X device exposes 11 interrupts: 3 "management" interrupts and 8
  "queue" interrupts. The driver requests the 8 queue interrupts, each of which
  is affine to exactly one CPU. If that CPU goes offline, the interrupt is shut
  down.

  Assuming interrupt 48 is one of the queue interrupts, the following appears::

    /proc/irq/48/effective_affinity_list:7
    /proc/irq/48/smp_affinity_list:7

  This indicates that the interrupt is served only by CPU7. Shutting down CPU7
  does not migrate the interrupt to another CPU::

    /proc/irq/48/effective_affinity_list:0
    /proc/irq/48/smp_affinity_list:7

  This can be verified via the debugfs interface
  (/sys/kernel/debug/irq/irqs/48). The dstate field will include
  IRQD_IRQ_DISABLED, IRQD_IRQ_MASKED and IRQD_MANAGED_SHUTDOWN.

- A QEMU instance is booted with "-device virtio-scsi-pci,num_queues=2"
  and the kernel command line includes:
  "irqaffinity=0,1 isolcpus=domain,2-7 isolcpus=managed_irq,1-3,5-7".
  The MSI-X device exposes 5 interrupts: 3 management interrupts and 2 queue
  interrupts.
  The management interrupts follow the irqaffinity= setting. The
  queue interrupts are spread across available CPUs::

    /proc/irq/47/effective_affinity_list:0
    /proc/irq/47/smp_affinity_list:0-3
    /proc/irq/48/effective_affinity_list:4
    /proc/irq/48/smp_affinity_list:4-7

  The two queue interrupts are evenly distributed. Interrupt 48 is placed on
  CPU4 because the managed_irq mask avoids CPUs 5-7 when possible.

  Replacing the managed_irq argument with "isolcpus=managed_irq,1-3,4-5,7"
  results in::

    /proc/irq/48/effective_affinity_list:6
    /proc/irq/48/smp_affinity_list:4-7

  Interrupt 48 is now served on CPU6 because the system avoids CPUs 4, 5 and
  7. If CPU6 is taken offline, the interrupt migrates to one of the "isolated"
  CPUs::

    /proc/irq/48/effective_affinity_list:7
    /proc/irq/48/smp_affinity_list:4-7

  The interrupt is shut down once all CPUs listed in its smp_affinity mask are
  offline.
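
The best-effort rule illustrated by these examples can be approximated with a
small user-space model. This is a sketch under stated assumptions and not the
kernel's implementation: cpu_mask_t, cpu_range() and managed_irq_candidates()
are hypothetical names, and the model only computes the set of CPUs an
interrupt may be programmed to. Which CPU of that set becomes the effective
target is decided elsewhere and is not modelled.

```c
#include <stdint.h>

/* One bit per CPU: bit n set means CPU n is in the mask (up to 64 CPUs). */
typedef uint64_t cpu_mask_t;

/* Hypothetical helper: mask covering CPUs first..last, inclusive. */
static cpu_mask_t cpu_range(unsigned int first, unsigned int last)
{
	cpu_mask_t mask = 0;

	for (unsigned int cpu = first; cpu <= last && cpu < 64; cpu++)
		mask |= (cpu_mask_t)1 << cpu;
	return mask;
}

/*
 * Model of the managed_irq rule: prefer online CPUs of the affinity mask
 * that are not isolated. If none are left, fall back to every online CPU
 * of the affinity mask. An empty result means that no CPU of the mask is
 * online, which corresponds to the interrupt being shut down.
 */
static cpu_mask_t managed_irq_candidates(cpu_mask_t affinity,
					 cpu_mask_t isolated,
					 cpu_mask_t online)
{
	cpu_mask_t preferred = affinity & ~isolated & online;

	if (preferred)
		return preferred;
	return affinity & online;
}
```

With an affinity mask of CPUs 4-7, all 8 CPUs online and managed_irq=1-3,5-7,
the model leaves only CPU4 as a candidate, matching the second example. With
CPUs 1-5 and 7 isolated it leaves CPU6. Once CPU6 is offline it falls back to
CPUs 4, 5 and 7, and with CPUs 4-7 all offline it returns an empty mask.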