.. SPDX-License-Identifier: GPL-2.0-only

=============
 QAIC driver
=============

The QAIC driver is the Kernel Mode Driver (KMD) for the AIC100 family of AI
accelerator products.

Interrupts
==========

IRQ Storm Mitigation
--------------------

While the AIC100 DMA Bridge hardware implements an IRQ storm mitigation
mechanism, it is still possible for an IRQ storm to occur. A storm can happen
if the workload is particularly quick, and the host is responsive. If the host
can drain the response FIFO as quickly as the device can insert elements into
it, then the device will frequently transition the response FIFO from empty to
non-empty and generate MSIs at a rate equivalent to the speed of the
workload's ability to process inputs. The lprnet (license plate reader network)
workload is known to trigger this condition, and can generate in excess of 100k
MSIs per second. It has been observed that most systems cannot tolerate this
for long, and will crash when some form of watchdog fires because of the
overhead of the interrupt controller interrupting the host CPU.

To mitigate this issue, the QAIC driver implements specific IRQ handling. When
QAIC receives an IRQ, it disables that line. This prevents the interrupt
controller from interrupting the CPU. Then QAIC drains the FIFO. Once the FIFO
is drained, QAIC implements a "last chance" polling algorithm where QAIC will
sleep for a time to see if the workload will generate more activity. The IRQ
line remains disabled during this time. If no activity is detected, QAIC exits
polling mode and reenables the IRQ line.
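
The handling described above can be modelled in a few lines of Python. This is
only a sketch of the control flow, not the driver code: ``FakeDbc``, the
injected ``sleep`` callback and the ``max_idle_polls`` limit are hypothetical
names chosen for illustration, and the text does not say how long or how often
the driver actually polls.

```python
from collections import deque


class FakeDbc:
    """Hypothetical stand-in for one DBC: a response FIFO plus an IRQ line."""

    def __init__(self):
        self.fifo = deque()
        self.irq_enabled = True
        self.processed = []

    def drain(self):
        """Consume every queued response; return how many were handled."""
        count = 0
        while self.fifo:
            self.processed.append(self.fifo.popleft())
            count += 1
        return count


def handle_irq(dbc, sleep, max_idle_polls=3):
    """Disable the line, drain, then poll until the FIFO stays quiet."""
    dbc.irq_enabled = False      # keep further MSIs away from the CPU
    dbc.drain()
    idle_polls = 0
    while idle_polls < max_idle_polls:   # assumed limit, see lead-in
        sleep()                  # "last chance" wait for more activity
        if dbc.drain():
            idle_polls = 0       # activity seen: stay in polling mode
        else:
            idle_polls += 1
    dbc.irq_enabled = True       # quiet: go back to interrupt mode


# Demo: one extra response arrives while the handler is sleeping.
dbc = FakeDbc()
dbc.fifo.extend([1, 2])
irq_state_during_sleep = []


def fake_sleep():
    irq_state_during_sleep.append(dbc.irq_enabled)
    if len(irq_state_during_sleep) == 1:
        dbc.fifo.append(3)


handle_irq(dbc, fake_sleep)
```

In the demo the late response is picked up by polling rather than by a second
interrupt, and the line is only reenabled after the FIFO has stayed empty for
the assumed number of polls.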

This mitigation in QAIC is very effective. The same lprnet use case that
generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
IRQs over 5 minutes while keeping the host system stable, and having the same
workload throughput performance (within run to run noise variation).

Single MSI Mode
---------------

MultiMSI is not well supported on all systems; virtualized ones even less so
(circa 2023). Between hypervisors masking the PCIe MSI capability structure and
the large memory requirements for vIOMMUs (required for supporting MultiMSI),
it is useful to be able to fall back to a single MSI when needed.

To support this fallback, we allow the case where only one MSI is able to be
allocated, and share that one MSI between MHI and the DMA Bridge channels
(DBCs). The device detects when only one MSI has been configured and directs
the interrupts for the DBCs to the interrupt normally used for MHI.
Unfortunately this means that the interrupt handlers for every DBC and MHI wake
up for every interrupt that arrives; however, the DBC threaded IRQ handlers are
only started when work to be done is detected (MHI will always start its
threaded handler).

If the DBC is configured to force MSI interrupts, this can circumvent the
software IRQ storm mitigation mentioned above. Since the MSI is shared it is
never disabled, allowing each new entry to the FIFO to trigger a new interrupt.
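
As a rough illustration of the shared-MSI behaviour, the sketch below lists
which handlers run when the single MSI fires. ``shared_msi_fired`` and the
handler names are hypothetical; the real driver's handlers are not shown in
this document.

```python
def shared_msi_fired(dbc_fifo_depths):
    """Return (hard handlers that run, threaded handlers started) for one MSI.

    dbc_fifo_depths[i] is the number of pending responses on DBC i.
    """
    dbc_names = [f"dbc{i}" for i in range(len(dbc_fifo_depths))]
    # Every handler sharing the MSI wakes up for every interrupt.
    hard_handlers = ["mhi"] + dbc_names
    # MHI always starts its threaded handler; a DBC only does so when it
    # detects work to be done.
    threaded = ["mhi"] + [
        name for name, depth in zip(dbc_names, dbc_fifo_depths) if depth > 0
    ]
    return hard_handlers, threaded


hard, threaded = shared_msi_fired([0, 2, 0])
```

With three DBCs and work pending only on the second one, all four hard
handlers run but only two threaded handlers are started.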


Neural Network Control (NNC) Protocol
=====================================

The implementation of NNC is split between the KMD (QAIC) and the User Mode
Driver (UMD). In general QAIC understands how to encode/decode the NNC wire
protocol, and handles the elements of the protocol which require kernel space
knowledge to process (for example, mapping host memory to device IOVAs). QAIC
understands the structure of a message, and all of the transactions. QAIC does
not understand commands (the payload of a passthrough transaction).

QAIC handles and enforces the required little endianness and 64-bit alignment,
to the degree that it can. Since QAIC does not know the contents of a
passthrough transaction, it relies on the UMD to satisfy the requirements.
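
To make the two wire requirements concrete, here is a small Python sketch that
packs one transaction as little-endian data padded to a multiple of 8 bytes.
The header layout (a 32-bit type followed by a 32-bit length) and the names
``HDR``, ``pad8`` and ``encode_transaction`` are assumptions made for the
example; this document does not define the actual NNC message layout.

```python
import struct

# Hypothetical header: two little-endian u32 fields (type, total length).
HDR = struct.Struct("<II")


def pad8(n):
    """Round n up to the next multiple of 8 bytes (64-bit alignment)."""
    return (n + 7) & ~7


def encode_transaction(trans_type, payload):
    """Pack header + payload, zero-padded so the result is 8-byte aligned."""
    total = pad8(HDR.size + len(payload))
    raw = HDR.pack(trans_type, total) + payload
    return raw.ljust(total, b"\x00")


msg = encode_transaction(1, b"abc")
```

The explicit ``<`` in the format string forces little-endian encoding
regardless of the host byte order, which is the part a kernel driver can
enforce for the fields it understands.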

The terminate transaction is of particular use to QAIC. QAIC is not aware of
the resources that are loaded onto a device since the majority of that activity
occurs within NNC commands. As a result, QAIC does not have the means to
roll back userspace activity. To ensure that a userspace client's resources
are fully released in the case of a process crash, or a bug, QAIC uses the
terminate transaction to let QSM know when a user has gone away, and the
resources can be released.

QSM can report a version number of the NNC protocol it supports. This is in the
form of a Major number and a Minor number.

Major number updates indicate changes to the NNC protocol which impact the
message format, or transactions (impacts QAIC).

Minor number updates indicate changes to the NNC protocol which impact the
commands (does not impact QAIC).
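
The Major/Minor rule can be expressed as a one-line check. The sketch below
only encodes the statement above, that a Major difference affects QAIC while a
Minor difference does not; what the driver does when it sees a mismatch is not
described here, and ``qaic_affected`` is a hypothetical helper.

```python
def qaic_affected(kmd_version, qsm_version):
    """Return True if the version difference touches what QAIC parses.

    Versions are (major, minor) tuples. Only a Major difference changes the
    message format or transactions; Minor changes are confined to commands.
    """
    return kmd_version[0] != qsm_version[0]


same_major = qaic_affected((5, 0), (5, 3))   # only commands differ
new_major = qaic_affected((5, 0), (6, 0))    # message format may differ
```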

uAPI
====

QAIC defines a number of driver-specific IOCTLs as part of the userspace API.
This section describes those APIs.

DRM_IOCTL_QAIC_MANAGE
  This IOCTL allows userspace to send an NNC request to the QSM. The call will
  block until a response is received, or the request has timed out.

DRM_IOCTL_QAIC_CREATE_BO
  This IOCTL allows userspace to allocate a buffer object (BO) which can send
  or receive data from a workload. The call will return a GEM handle that
  represents the allocated buffer. The BO is not usable until it has been
  sliced (see DRM_IOCTL_QAIC_ATTACH_SLICE_BO).

DRM_IOCTL_QAIC_MMAP_BO
  This IOCTL allows userspace to prepare an allocated BO to be mmap'd into the
  userspace process.

DRM_IOCTL_QAIC_ATTACH_SLICE_BO
  This IOCTL allows userspace to slice a BO in preparation for sending the BO
  to the device. Slicing is the operation of describing what portions of a BO
  get sent where to a workload. This requires a set of DMA transfers for the
  DMA Bridge, and as such, locks the BO to a specific DBC.

DRM_IOCTL_QAIC_EXECUTE_BO
  This IOCTL allows userspace to submit a set of sliced BOs to the device. The
  call is non-blocking. Success only indicates that the BOs have been queued
  to the device, but does not guarantee they have been executed.

DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO
  This IOCTL operates like DRM_IOCTL_QAIC_EXECUTE_BO, but it allows userspace
  to shrink the BOs sent to the device for this specific call. If a BO
  typically has N inputs, but only a subset of those is available, this IOCTL
  allows userspace to indicate that only the first M bytes of the BO should be
  sent to the device to minimize data transfer overhead. This IOCTL dynamically
  recomputes the slicing, and therefore has some processing overhead before the
  BOs can be queued to the device.

DRM_IOCTL_QAIC_WAIT_BO
  This IOCTL allows userspace to determine when a particular BO has been
  processed by the device. The call will block until either the BO has been
  processed and can be re-queued to the device, or a timeout occurs.

DRM_IOCTL_QAIC_PERF_STATS_BO
  This IOCTL allows userspace to collect performance statistics on the most
  recent execution of a BO. This allows userspace to construct an end-to-end
  timeline of the BO processing for a performance analysis.

DRM_IOCTL_QAIC_PART_DEV
  This IOCTL allows userspace to request a duplicate "shadow device". This extra
  accelN device is associated with a specific partition of resources on the
  AIC100 device and can be used for limiting a process to some subset of
  resources.

DRM_IOCTL_QAIC_DETACH_SLICE_BO
  This IOCTL allows userspace to remove the slicing information from a BO that
  was originally provided by a call to DRM_IOCTL_QAIC_ATTACH_SLICE_BO. This
  is the inverse of DRM_IOCTL_QAIC_ATTACH_SLICE_BO. The BO must be idle for
  DRM_IOCTL_QAIC_DETACH_SLICE_BO to be called. After a successful detach slice
  operation the BO may have new slicing information attached with a new call
  to DRM_IOCTL_QAIC_ATTACH_SLICE_BO. After detach slice, the BO cannot be
  executed until after a new attach slice operation. Combining attach slice
  and detach slice calls allows userspace to use a BO with multiple workloads.
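
The ordering rules among the BO IOCTLs above (slice before execute, idle before
detach, re-attach before executing again) can be summarised as a small state
model. This is a toy Python model for reasoning about call order, not a binding
to the real IOCTLs: the class, its fields and the error strings are
hypothetical, and rejecting a second attach on an already sliced BO is an
assumption the text does not state.

```python
class BoStateError(Exception):
    """Raised when a call is made in a state the descriptions rule out."""


class BufferObject:
    """Toy model of a BO; 'dbc' and 'queued' are illustrative fields."""

    def __init__(self, size):
        self.size = size
        self.dbc = None        # slicing locks the BO to one DBC
        self.queued = False    # True between execute and wait

    def attach_slice(self, dbc):
        if self.dbc is not None:
            raise BoStateError("already sliced")   # assumption, see lead-in
        self.dbc = dbc

    def execute(self):
        if self.dbc is None:
            raise BoStateError("not sliced")
        self.queued = True     # non-blocking: queued, not yet executed

    def wait(self):
        self.queued = False    # stands in for "processed, can be re-queued"

    def detach_slice(self):
        if self.queued:
            raise BoStateError("not idle")
        self.dbc = None


bo = BufferObject(4096)
errors = []
for step in (bo.execute, lambda: bo.attach_slice(0), bo.execute, bo.detach_slice):
    try:
        step()
    except BoStateError as exc:
        errors.append(str(exc))
bo.wait()            # BO is idle again
bo.detach_slice()    # now allowed
bo.attach_slice(1)   # reuse the same BO with another workload's DBC
```

The two rejected calls in the demo are the execute before slicing and the
detach while the BO is still queued.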

Userspace Client Isolation
==========================

AIC100 supports multiple clients. Multiple DBCs can be consumed by a single
client, and multiple clients can each consume one or more DBCs. Workloads
may contain sensitive information; therefore, only the client that owns the
workload should be allowed to interface with the DBC.

Clients are identified by the instance associated with their open(). A client
may only use memory it allocates, and DBCs that are assigned to its
workloads. Attempts to access resources assigned to other clients will be
rejected.
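
A minimal way to picture this rule is an ownership table keyed by resource and
checked against the caller's open() instance. The Python sketch below is an
illustration under that assumption; ``Device``, ``assign`` and
``check_access`` are hypothetical names, and a plain ``object()`` stands in
for the per-open() instance.

```python
class Device:
    """Toy ownership table: resource id -> owning client instance."""

    def __init__(self):
        self.owner_of = {}

    def assign(self, client, resource):
        self.owner_of[resource] = client

    def check_access(self, client, resource):
        # Identity comparison: two open() calls are two distinct clients.
        if self.owner_of.get(resource) is not client:
            raise PermissionError(f"{resource} is not owned by this client")


dev = Device()
client_a, client_b = object(), object()   # one instance per open()
dev.assign(client_a, "dbc0")
dev.assign(client_b, "dbc1")

dev.check_access(client_a, "dbc0")        # owner: allowed
try:
    dev.check_access(client_b, "dbc0")    # other client: rejected
    rejected = False
except PermissionError:
    rejected = True
```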

Module parameters
=================

QAIC supports the following module parameters:

**datapath_polling (bool)**

Configures QAIC to use a polling thread for datapath events instead of relying
on the device interrupts. Useful for platforms with broken MultiMSI. Must be
set at QAIC driver initialization. Default is 0 (off).

**mhi_timeout_ms (unsigned int)**

Sets the timeout value for MHI operations in milliseconds (ms). Must be set
at the time the driver detects a device. Default is 2000 (2 seconds).

**control_resp_timeout_s (unsigned int)**

Sets the timeout value for QSM responses to NNC messages in seconds (s). Must
be set at the time the driver is sending a request to QSM. Default is 60 (one
minute).

**wait_exec_default_timeout_ms (unsigned int)**

Sets the default timeout for the wait_exec ioctl in milliseconds (ms). Must be
set prior to the wait_exec ioctl call. A value specified in the ioctl call
overrides this for that call. Default is 5000 (5 seconds).

**datapath_poll_interval_us (unsigned int)**

Sets the polling interval in microseconds (us) when datapath polling is active.
Takes effect at the next polling interval. Default is 100 (100 us).

**timesync_delay_ms (unsigned int)**

Sets the time interval in milliseconds (ms) between two consecutive timesync
operations. Default is 1000 (1 second).
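
As a convenience, the defaults listed above can be collected into a table and
turned into a modprobe.d-style ``options`` line. The Python helper below is a
sketch: ``modprobe_options`` is a hypothetical function, and it assumes the
module is loaded under the name ``qaic``.

```python
# Defaults as documented in this section.
DEFAULTS = {
    "datapath_polling": 0,
    "mhi_timeout_ms": 2000,
    "control_resp_timeout_s": 60,
    "wait_exec_default_timeout_ms": 5000,
    "datapath_poll_interval_us": 100,
    "timesync_delay_ms": 1000,
}


def modprobe_options(**overrides):
    """Build an 'options qaic ...' line containing only non-default values."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"unknown parameter(s): {', '.join(unknown)}")
    changed = {
        name: int(value)
        for name, value in overrides.items()
        if int(value) != DEFAULTS[name]
    }
    if not changed:
        return ""
    args = " ".join(f"{name}={value}" for name, value in sorted(changed.items()))
    return f"options qaic {args}"   # module name 'qaic' is an assumption


line = modprobe_options(datapath_polling=True, datapath_poll_interval_us=50)
```

Parameters left at their documented default are omitted from the generated
line, so an empty string means no configuration is needed.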