.. SPDX-License-Identifier: GPL-2.0

===========================
Affinity managed interrupts
===========================

The IRQ core provides support for managing interrupts according to a specified
CPU affinity. Under normal operation, an interrupt is associated with a
particular CPU. If that CPU is taken offline, the interrupt is migrated to
another online CPU.

Devices with large numbers of interrupt vectors can stress the available vector
space. For example, an NVMe device with 128 I/O queues typically requests one
interrupt per queue on systems with at least 128 CPUs. Two such devices
therefore request 256 interrupts. On x86, the interrupt vector space is
notoriously small, providing only 256 vectors per CPU, and the kernel reserves
a subset of these, further reducing the number available for device interrupts.
In practice this is not an issue because the interrupts are distributed across
many CPUs, so each CPU only receives a small number of vectors.

During system suspend, however, all secondary CPUs are taken offline and all
interrupts are migrated to the single CPU that remains online. This can exhaust
the available interrupt vectors on that CPU and cause the suspend operation to
fail.
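
The arithmetic behind this failure can be sketched in a few lines of Python.
This is an illustration only: the function names are hypothetical, and the
figure of 32 reserved vectors is an assumption used as a lower bound (x86
reserves vectors 0-31 for exceptions, and the kernel sets aside further
vectors on top of that).

```python
import math

VECTORS_PER_CPU = 256   # size of the x86 per-CPU vector space (from the text)
RESERVED_MIN = 32       # assumption: lower bound, vectors 0-31 are exceptions


def vectors_per_cpu(total_irqs, online_cpus):
    """Worst-case number of vectors one CPU must hold if the
    interrupts are spread evenly over the online CPUs."""
    return math.ceil(total_irqs / online_cpus)


def fits(total_irqs, online_cpus, reserved=RESERVED_MIN):
    """True if every online CPU can hold its share of the vectors."""
    usable = VECTORS_PER_CPU - reserved
    return vectors_per_cpu(total_irqs, online_cpus) <= usable


total = 2 * 128                       # two NVMe devices, 128 queues each
print(vectors_per_cpu(total, 128))    # normal operation: 2 per CPU
print(fits(total, 128))               # True
print(fits(total, 1))                 # suspend, one CPU left: False
```

Because the real number of reserved vectors is larger than the assumed lower
bound, the single remaining CPU has even less room than this sketch suggests.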

Affinity‑managed interrupts address this limitation. Each interrupt is assigned
a CPU affinity mask that specifies the set of CPUs on which the interrupt may
be targeted. When a CPU in the mask goes offline, the interrupt is moved to the
next CPU in the mask. If the last CPU in the mask goes offline, the interrupt
is shut down. Drivers using affinity‑managed interrupts must ensure that the
associated queue is quiesced before the interrupt is disabled so that no
further interrupts are generated. When a CPU in the affinity mask comes back
online, the interrupt is re‑enabled.
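
The life cycle described above can be modelled with a small Python class.
This is a toy model, not kernel code: the class and method names are
hypothetical, picking the lowest usable CPU is an assumption (the kernel's
choice within the mask may differ), and the queue quiescing that drivers must
perform is not modelled.

```python
class ManagedIrq:
    """Toy model of one affinity-managed interrupt (not kernel code)."""

    def __init__(self, mask, online):
        self.mask = frozenset(mask)      # fixed affinity mask
        self.online = set(online)        # CPUs currently online
        self.target = None               # CPU currently serving the IRQ
        self._retarget()

    @property
    def shutdown(self):
        """True while no CPU of the mask is online."""
        return self.target is None

    def _retarget(self):
        usable = sorted(self.mask & self.online)
        if self.target not in usable:
            # Assumption: pick the lowest usable CPU in the mask.
            self.target = usable[0] if usable else None

    def cpu_offline(self, cpu):
        self.online.discard(cpu)
        self._retarget()

    def cpu_online(self, cpu):
        self.online.add(cpu)
        self._retarget()


irq = ManagedIrq(mask={4, 5, 6, 7}, online=range(8))
irq.cpu_offline(4)          # moved to another CPU in the mask
for cpu in (5, 6, 7):
    irq.cpu_offline(cpu)    # last CPU in the mask gone: shut down
irq.cpu_online(6)           # a CPU of the mask is back: served by CPU6
```

Taking a CPU outside the mask offline never changes the target, which is the
property that keeps all interrupts from piling up on the last online CPU.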

Implementation
--------------

Devices must provide per‑instance interrupts, such as per‑I/O‑queue interrupts
for storage devices like NVMe. The driver allocates interrupt vectors with the
required affinity settings using struct irq_affinity. For MSI‑X devices, this
is done via pci_alloc_irq_vectors_affinity() with the PCI_IRQ_AFFINITY flag
set.

Based on the provided affinity information, the IRQ core attempts to spread the
interrupts evenly across the system. The affinity masks are computed during
this allocation step, but the final IRQ assignment is performed when
request_irq() is invoked.
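
The effect of the spreading step can be approximated in Python. This is a
simplified sketch under stated assumptions: spread_vectors() is a hypothetical
helper, it splits the CPUs into contiguous groups of near-equal size and
ignores anything else the kernel may consider, such as NUMA topology. The
pre_vectors parameter mirrors the field of the same name in struct
irq_affinity, which excludes leading vectors from affinity management.

```python
def spread_vectors(cpus, nvecs, pre_vectors=0):
    """Return one affinity mask per vector.

    The first pre_vectors entries are None: they are not managed and
    keep the default affinity. The remaining vectors each get a
    contiguous, near-equal share of cpus.
    """
    cpus = sorted(cpus)
    managed = nvecs - pre_vectors
    if not 1 <= managed <= len(cpus):
        raise ValueError("need between 1 and len(cpus) managed vectors")
    base, extra = divmod(len(cpus), managed)
    masks, start = [None] * pre_vectors, 0
    for i in range(managed):
        # The first `extra` groups take one additional CPU.
        size = base + (1 if i < extra else 0)
        masks.append(set(cpus[start:start + size]))
        start += size
    return masks


# 8 CPUs, 5 vectors of which 3 are unmanaged: two masks, 0-3 and 4-7.
print(spread_vectors(range(8), 5, pre_vectors=3))
```

With 11 vectors and 3 unmanaged ones on the same 8 CPUs, each managed vector
ends up with a mask containing exactly one CPU.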

Isolated CPUs
-------------

The affinity of managed interrupts is handled entirely in the kernel and cannot
be modified from user space through the /proc interfaces. The managed_irq
sub‑parameter of the isolcpus boot option specifies a CPU mask that managed
interrupts should attempt to avoid. This isolation is best‑effort and only
applies if the automatically assigned interrupt mask also contains online CPUs
outside the avoided mask. If the assigned mask contains only isolated CPUs,
the setting has no effect.

CPUs listed in the avoided mask remain part of the interrupt’s affinity mask.
This means that if all non‑isolated CPUs go offline while isolated CPUs remain
online, the interrupt will be assigned to one of the isolated CPUs.
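
The best-effort rule can be written down as a short Python function. This is
a sketch of the behaviour described above, not the kernel's implementation;
target_candidates() and its parameter names are hypothetical. It returns the
set of CPUs the interrupt may be placed on and leaves the choice within that
set open.

```python
def target_candidates(mask, online, avoided):
    """CPUs a managed interrupt may be placed on.

    mask    -- the interrupt's affinity mask
    online  -- CPUs currently online
    avoided -- CPUs named in isolcpus=managed_irq,...
    An empty result means the interrupt is shut down.
    """
    usable = set(mask) & set(online)
    preferred = usable - set(avoided)
    # Best effort: honour the avoided mask only if something is left.
    return preferred if preferred else usable


mask = {4, 5, 6, 7}
avoided = {1, 2, 3, 4, 5, 7}
print(target_candidates(mask, range(8), avoided))               # {6}
print(target_candidates(mask, {0, 1, 2, 3, 4, 5, 7}, avoided))  # {4, 5, 7}
```

The second call shows the fallback: once the only non-isolated CPU in the mask
is offline, the isolated CPUs of the mask become eligible again.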

The following examples assume a system with 8 CPUs.

- A QEMU instance is booted with "-device virtio-scsi-pci".
  The MSI‑X device exposes 11 interrupts: 3 "management" interrupts and 8
  "queue" interrupts. The driver requests the 8 queue interrupts, each of which
  is affine to exactly one CPU. If that CPU goes offline, the interrupt is shut
  down.

  Assuming interrupt 48 is one of the queue interrupts, the following appears::

    /proc/irq/48/effective_affinity_list:7
    /proc/irq/48/smp_affinity_list:7

  This indicates that the interrupt is served only by CPU7. Shutting down CPU7
  does not migrate the interrupt to another CPU::

    /proc/irq/48/effective_affinity_list:0
    /proc/irq/48/smp_affinity_list:7

  This can be verified via the debugfs interface
  (/sys/kernel/debug/irq/irqs/48). The dstate field will include
  IRQD_IRQ_DISABLED, IRQD_IRQ_MASKED and IRQD_MANAGED_SHUTDOWN.

- A QEMU instance is booted with "-device virtio-scsi-pci,num_queues=2"
  and the kernel command line includes:
  "irqaffinity=0,1 isolcpus=domain,2-7 isolcpus=managed_irq,1-3,5-7".
  The MSI‑X device exposes 5 interrupts: 3 management interrupts and 2 queue
  interrupts. The management interrupts follow the irqaffinity= setting. The
  queue interrupts are spread across available CPUs::

    /proc/irq/47/effective_affinity_list:0
    /proc/irq/47/smp_affinity_list:0-3
    /proc/irq/48/effective_affinity_list:4
    /proc/irq/48/smp_affinity_list:4-7

  The two queue interrupts are evenly distributed. Interrupt 48 is placed on CPU4
  because the managed_irq mask avoids CPUs 5–7 when possible.

  Replacing the managed_irq argument with "isolcpus=managed_irq,1-3,4-5,7"
  results in::

    /proc/irq/48/effective_affinity_list:6
    /proc/irq/48/smp_affinity_list:4-7

  Interrupt 48 is now served on CPU6 because the system avoids CPUs 4, 5 and
  7. If CPU6 is taken offline, the interrupt migrates to one of the "isolated"
  CPUs::

    /proc/irq/48/effective_affinity_list:7
    /proc/irq/48/smp_affinity_list:4-7

  The interrupt is shut down once all CPUs listed in its smp_affinity mask are
  offline.
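
The /proc files shown in the examples can also be read programmatically. The
following Python sketch uses two hypothetical helpers, parse_cpu_list() and
irq_affinity(), and assumes that both files hold a plain comma-separated list
of CPU numbers and ranges, as in the output above. The proc_root parameter
exists only so that the helper can be pointed at a different directory.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Turn a kernel CPU list such as "1-3,5-7" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue                      # empty file or stray comma
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def irq_affinity(irq, proc_root="/proc"):
    """Return (effective, configured) CPU sets for one interrupt."""
    base = Path(proc_root) / "irq" / str(irq)

    def read(name):
        return parse_cpu_list((base / name).read_text())

    return read("effective_affinity_list"), read("smp_affinity_list")


print(parse_cpu_list("1-3,5-7"))   # {1, 2, 3, 5, 6, 7}
```

On a running system, irq_affinity(48) would return the two CPU sets for
interrupt 48, provided that interrupt exists and the files are readable.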