.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.

Multiple AIC100 cards can be hosted in a single system to scale overall
performance. AIC100 cards are multi-user capable and can execute workloads
from multiple users concurrently.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of
miscellaneous peripherals (PMICs, etc.).

An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
or a Dual M.2 card. Both use PCIe to connect to the host system.

As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
DeviceID(DID) combination to uniquely identify itself to the host. AIC100
uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
AIC100 DID (0xa100).

AIC100 does not implement FLR (function level reset).

AIC100 implements MSI but does not implement MSI-X. AIC100 prefers 17 MSIs to
operate (1 for MHI, 16 for the DMA Bridge). Because MSI vectors are allocated
in power-of-two blocks, obtaining 17 MSIs means reserving 32. Falling back to
1 MSI is possible in scenarios where reserving 32 MSIs isn't feasible.
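
The MSI arithmetic above can be sketched as follows. This is a minimal
illustration, not driver code: the macro and function names are hypothetical,
and the only facts assumed beyond the text are that plain MSI grants vectors in
power-of-two blocks and caps at 32 per function.

```c
#include <assert.h>

/* Vector counts from the text: 1 for MHI plus 16 for the DMA Bridge. */
#define AIC100_MSI_MHI		1u
#define AIC100_MSI_DBC		16u
/* Plain MSI (not MSI-X) allows at most 32 vectors per function. */
#define MSI_MAX_VECTORS		32u

/* Round a wanted vector count up to the power-of-two block MSI can grant. */
static unsigned int msi_block_size(unsigned int wanted)
{
	unsigned int block = 1;

	assert(wanted >= 1 && wanted <= MSI_MAX_VECTORS);
	while (block < wanted)
		block <<= 1;
	return block;
}

/*
 * Hypothetical policy helper: use all 17 vectors when the host granted the
 * full block, otherwise fall back to a single shared vector.
 */
static unsigned int aic100_usable_msis(unsigned int granted)
{
	unsigned int preferred = AIC100_MSI_MHI + AIC100_MSI_DBC;

	return granted >= msi_block_size(preferred) ? preferred : 1;
}
```

With these definitions, a request for 17 vectors rounds up to a block of 32,
and any smaller grant results in the single-MSI fallback.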

As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the MHI interface to the host.

* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
  host.

* The third BAR is variable in size based on an individual AIC100's
  configuration, but defaults to 64K. This BAR currently has no purpose.

From the host perspective, AIC100 has several key hardware components:

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processors)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI itself is documented at
Documentation/mhi/index.rst. MHI is the mechanism the host uses to communicate
with the QSM. Except for workload data via the DMA Bridge, all interaction with
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU that runs the primary
firmware of the card and performs on-card management tasks. It also
communicates with the host via MHI. Each AIC100 has one of
these.

NSP
---

Neural Signal Processor. Each AIC100 has up to 16 of these. These are
the processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
multiple NSPs may be assigned to a single workload. Since each NSP can only run
one workload, AIC100 is limited to 16 concurrent workloads. Workload
"scheduling" is under the purview of the host. AIC100 does not automatically
timeslice.

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manages the flow of data
in and out of workloads. AIC100 has one of these. The DMA Bridge has 16
channels, each consisting of a set of request/response FIFOs. Each active
workload is assigned a single DMA Bridge channel. The DMA Bridge exposes
hardware registers to manage the FIFOs (head/tail pointers), but requires host
memory to store the FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
This DDR is used to store workloads and data for the workloads, and is used by
the QSM for managing the device. NSPs are granted access to sections of the DDR
by the QSM. The host does not have direct access to the DDR, and must make
requests to the QSM to transfer data to the DDR.

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerator typically used for running
neural networks in inferencing mode to efficiently perform AI operations.
AIC100 is not intended for training neural networks. AIC100 can also be
utilized for generic compute workloads.

Assuming a user wants to utilize AIC100, they would follow these steps:

1. Compile the workload into an ELF targeting the NSP(s).
2. Make requests to the QSM to load the workload and related artifacts into the
   device DDR.
3. Make a request to the QSM to activate the workload onto a set of idle NSPs.
4. Make requests to the DMA Bridge to send input data to the workload to be
   processed, and other requests to receive processed output data from the
   workload.
5. Once the workload is no longer required, make a request to the QSM to
   deactivate the workload, thus putting the NSPs back into an idle state.
6. Once the workload and related artifacts are no longer needed for future
   sessions, make requests to the QSM to unload the data from DDR. This frees
   the DDR to be used by other users.


Boot Flow
=========

AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.

When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
from ROM. PBL enumerates the PCIe link, and initializes the BHI (Boot Host
Interface) component of MHI.

Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
image. The PBL pulls the image from the host, validates it, and begins
execution of SBL.

SBL initializes MHI, and uses MHI to notify the host that the device has entered
the SBL stage. SBL performs a number of operations:

* SBL initializes the majority of hardware (anything PBL left uninitialized),
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host for future logging.
* SBL uses the Sahara protocol to obtain the runtime firmware images from the
  host.

Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the device has entered the QSM stage
(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
ready to process workloads.
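
The boot stages described above can be modeled as a small state machine. The
sketch below is illustrative only: the enum and function names are hypothetical,
and it covers just the forward path PBL -> SBL -> QSM described in this section
(resets and error paths are not modeled).

```c
#include <assert.h>
#include <stdbool.h>

/* Boot stages named in the text; the enum itself is an illustration. */
enum aic100_stage {
	STAGE_PBL,	/* Primary Bootloader, executes from ROM */
	STAGE_SBL,	/* Secondary Bootloader, image supplied via BHI */
	STAGE_QSM,	/* runtime firmware (AMSS in MHI terms) */
};

/* A stage change is valid only if it advances exactly one step. */
static bool stage_transition_ok(enum aic100_stage from, enum aic100_stage to)
{
	return (from == STAGE_PBL && to == STAGE_SBL) ||
	       (from == STAGE_SBL && to == STAGE_QSM);
}

/* Advance to the next stage; QSM is the final stage of the boot flow. */
static enum aic100_stage stage_advance(enum aic100_stage cur)
{
	assert(cur != STAGE_QSM);
	return (enum aic100_stage)(cur + 1);
}

/* Workloads can only be processed once the device has reached QSM. */
static bool ready_for_workloads(enum aic100_stage s)
{
	return s == STAGE_QSM;
}
```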

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream LLVM can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kernel driver can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100

Sahara loader
-------------

An open implementation of the Sahara protocol called kickstart can be found at:
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for different purposes. This is a list
of the defined channels, and their uses.

+----------------+---------+----------+----------------------------------------+
| Channel name   | IDs     | EEs      | Purpose                                |
+================+=========+==========+========================================+
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any data sent to the device on this    |
|                |         |          | channel is sent back to the host.      |
+----------------+---------+----------+----------------------------------------+
| QAIC_SAHARA    | 2 & 3   | SBL      | Used by SBL to obtain the runtime      |
|                |         |          | firmware from the host.                |
+----------------+---------+----------+----------------------------------------+
| QAIC_DIAG      | 4 & 5   | AMSS     | Used to communicate with QSM via the   |
|                |         |          | DIAG protocol.                         |
+----------------+---------+----------+----------------------------------------+
| QAIC_SSR       | 6 & 7   | AMSS     | Used to notify the host of subsystem   |
|                |         |          | restart events, and to offload SSR     |
|                |         |          | crashdumps.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_QDSS      | 8 & 9   | AMSS     | Used for the Qualcomm Debug Subsystem. |
+----------------+---------+----------+----------------------------------------+
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used for the Neural Network Control    |
|                |         |          | (NNC) protocol. This is the primary    |
|                |         |          | channel between host and QSM for       |
|                |         |          | managing workloads.                    |
+----------------+---------+----------+----------------------------------------+
| QAIC_LOGGING   | 12 & 13 | SBL      | Used by the SBL to send the bootlog to |
|                |         |          | the host.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_STATUS    | 14 & 15 | AMSS     | Used to notify the host of Reliability,|
|                |         |          | Availability, Serviceability (RAS)     |
|                |         |          | events.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used to get/set power/thermal/etc      |
|                |         |          | attributes.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not used.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 20 & 21 | SBL      | Used to synchronize timestamps in the  |
|                |         |          | device side logs with the host time    |
|                |         |          | source.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 22 & 23 | AMSS     | Used to periodically synchronize       |
| _PERIODIC      |         |          | timestamps in the device side logs with|
|                |         |          | the host time source.                  |
+----------------+---------+----------+----------------------------------------+
| IPCR           | 24 & 25 | AMSS     | AF_QIPCRTR clients and servers.        |
+----------------+---------+----------+----------------------------------------+
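
For illustration, the table above can be expressed as a static lookup table. The
struct, enum, and function names in this sketch are hypothetical, and it
deliberately makes no assumption about which ID of each pair carries
host-to-device traffic, since the table does not say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Execution environments (EEs) in which a channel is available. */
enum ee { EE_SBL, EE_AMSS };

struct chan_pair {
	const char *name;
	unsigned int first_id;	/* the pair is first_id and first_id + 1 */
	enum ee ee;
};

/* Transcribed from the channel table. */
static const struct chan_pair aic100_chans[] = {
	{ "QAIC_LOOPBACK",           0, EE_AMSS },
	{ "QAIC_SAHARA",             2, EE_SBL  },
	{ "QAIC_DIAG",               4, EE_AMSS },
	{ "QAIC_SSR",                6, EE_AMSS },
	{ "QAIC_QDSS",               8, EE_AMSS },
	{ "QAIC_CONTROL",           10, EE_AMSS },
	{ "QAIC_LOGGING",           12, EE_SBL  },
	{ "QAIC_STATUS",            14, EE_AMSS },
	{ "QAIC_TELEMETRY",         16, EE_AMSS },
	{ "QAIC_DEBUG",             18, EE_AMSS },
	{ "QAIC_TIMESYNC",          20, EE_SBL  },
	{ "QAIC_TIMESYNC_PERIODIC", 22, EE_AMSS },
	{ "IPCR",                   24, EE_AMSS },
};

#define NUM_CHANS (sizeof(aic100_chans) / sizeof(aic100_chans[0]))

/* Sanity check: pairs are contiguous and each starts at an even ID. */
static void check_chan_table(void)
{
	size_t i;

	for (i = 0; i < NUM_CHANS; i++)
		assert(aic100_chans[i].first_id == 2 * i);
}

/* Look up a channel pair by name; returns NULL if it is not defined. */
static const struct chan_pair *find_chan(const char *name)
{
	size_t i;

	for (i = 0; i < NUM_CHANS; i++)
		if (strcmp(aic100_chans[i].name, name) == 0)
			return &aic100_chans[i];
	return NULL;
}

/* A channel is usable only while the device is in the matching EE. */
static bool chan_available(const struct chan_pair *c, enum ee current)
{
	return c->ee == current;
}
```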

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces to the host from the device
(the other being MHI). As part of activating a workload to run on NSPs, the QSM
assigns that workload a DMA Bridge channel. A workload's DMA Bridge channel
(DBC for short) is solely for the use of that workload and is not shared with
other workloads.

Each DBC is a pair of FIFOs that manage data in and out of the workload. One
FIFO is the request FIFO. The other FIFO is the response FIFO.

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
  latest item in the FIFO the device has consumed.
* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
  increments this register to add new items to the FIFO.
* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
  the latest item in the FIFO the host has consumed.
* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
  increments this register to add new items to the FIFO.

The values in each register are indexes in the FIFO. To get the location of the
FIFO element pointed to by the register: FIFO base address + register * element
size.
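
The address calculation above is simple enough to write down directly. The
function names here are hypothetical, and the wrap-around in fifo_next() is an
assumption (that the FIFO is circular and indexes wrap at the element count);
the text only says that the registers hold indexes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Location of the element a head/tail register points at:
 * FIFO base address + register value * element size.
 */
static uint64_t fifo_elem_addr(uint64_t fifo_base, uint32_t reg_val,
			       uint32_t elem_size, uint32_t nelem)
{
	assert(reg_val < nelem);	/* register values are FIFO indexes */
	return fifo_base + (uint64_t)reg_val * elem_size;
}

/* Assumption: indexes wrap back to 0 after the last element. */
static uint32_t fifo_next(uint32_t index, uint32_t nelem)
{
	assert(nelem > 0 && index < nelem);
	return (index + 1) % nelem;
}
```

For example, with a base address of 0x1000 and 64-byte elements, index 3 refers
to the element at 0x1000 + 3 * 64 = 0x10c0.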

DBC registers are exposed to the host via the second BAR. Each DBC consumes
4KB of space in the BAR.
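
A sketch of how a register offset within the second BAR could be derived from
the DBC index. The names are hypothetical. The offset at which the DBC register
area starts within the BAR is not stated in this document, so it is passed in
as a parameter rather than assumed, and the DBCs are assumed to be laid out
back to back in index order.

```c
#include <assert.h>
#include <stdint.h>

#define DBC_COUNT	16u	/* the DMA Bridge has 16 channels */
#define DBC_STRIDE	0x1000u	/* each DBC consumes 4KB of BAR space */

/* Per-DBC register offsets, as listed in the register description. */
enum dbc_reg {
	DBC_REQ_HEAD = 0x0,	/* read only by the host */
	DBC_REQ_TAIL = 0x4,	/* read/write by the host */
	DBC_RSP_HEAD = 0x8,	/* read/write by the host */
	DBC_RSP_TAIL = 0xc,	/* read only by the host */
};

/*
 * Offset of a DBC register from the start of the second BAR.
 * area_base is where the DBC register area begins within the BAR
 * (a caller-supplied value, not given in this document).
 */
static uint32_t dbc_reg_offset(uint32_t area_base, unsigned int dbc,
			       enum dbc_reg reg)
{
	assert(dbc < DBC_COUNT);
	return area_base + dbc * DBC_STRIDE + (uint32_t)reg;
}
```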

The actual FIFOs are backed by host memory. When sending a request to the QSM
to activate a workload, the host must donate memory to be used for the FIFOs.
Due to internal mapping limitations of the device, a single contiguous chunk of
memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
consume the beginning of the memory chunk, and the response FIFO will consume
the end of the memory chunk.
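
One way to picture the layout of the donated chunk is sketched below. The
element sizes follow from the request and response element structures defined
in the following sections (64 and 4 bytes). Everything else is an assumption
made for illustration: the names, the use of the same element count for both
FIFOs, and rounding the chunk up to a 4KB multiple.

```c
#include <assert.h>
#include <stddef.h>

#define REQ_ELEM_SIZE	64u	/* sizeof(struct request_elem) */
#define RSP_ELEM_SIZE	4u	/* sizeof(struct response_elem) */
#define CHUNK_ALIGN	4096u	/* assumed allocation granularity */

struct dbc_layout {
	size_t total;	/* size of the contiguous chunk */
	size_t req_off;	/* request FIFO: beginning of the chunk */
	size_t rsp_off;	/* response FIFO: end of the chunk */
};

/* Compute a chunk layout for FIFOs of nelem elements each. */
static struct dbc_layout dbc_layout(size_t nelem)
{
	struct dbc_layout l;
	size_t need;

	assert(nelem > 0);
	need = nelem * (REQ_ELEM_SIZE + RSP_ELEM_SIZE);
	l.total = (need + CHUNK_ALIGN - 1) / CHUNK_ALIGN * CHUNK_ALIGN;
	l.req_off = 0;
	l.rsp_off = l.total - nelem * RSP_ELEM_SIZE;
	return l;
}
```

With 128 elements per FIFO, the FIFOs need 128 * 68 = 8704 bytes, the chunk
rounds up to 12288 bytes, and the response FIFO occupies the last 512 bytes.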

Request FIFO
------------

A request FIFO element has the following structure:

.. code-block:: c

  struct request_elem {
	u16 req_id;
	u8  seq_id;
	u8  pcie_dma_cmd;
	u32 reserved0;
	u64 pcie_dma_source_addr;
	u64 pcie_dma_dest_addr;
	u32 pcie_dma_len;
	u32 reserved1;
	u64 doorbell_addr;
	u8  doorbell_attr;
	u8  reserved2;
	u16 reserved3;
	u32 doorbell_data;
	u32 sem_cmd0;
	u32 sem_cmd1;
	u32 sem_cmd2;
	u32 sem_cmd3;
  };

Request field descriptions:

req_id
	request ID. A request FIFO element and a response FIFO element with
	the same request ID refer to the same command.

seq_id
	sequence ID within a request. Ignored by the DMA Bridge.

pcie_dma_cmd
	describes the DMA element of this request.

	* Bit(7) is the force MSI flag. When QSM has configured the DMA Bridge
	  to look at this bit, it overrides the DMA Bridge MSI logic and
	  generates an MSI when this request is complete.
	* Bits(6:5) are reserved.
	* Bit(4) is the completion code flag, and indicates that the DMA Bridge
	  shall generate a response FIFO element when this request is
	  complete.
	* Bit(3) indicates if this request is a linked list transfer(0) or a bulk
	  transfer(1).
	* Bit(2) is reserved.
	* Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
	  from device(2). Value 3 is illegal.

pcie_dma_source_addr
	source address for a bulk transfer, or the address of the linked list.

pcie_dma_dest_addr
	destination address for a bulk transfer.

pcie_dma_len
	length of the bulk transfer. Note that the size of this field
	limits transfers to 4G in size.

doorbell_addr
	address of the doorbell to ring when this request is complete.

doorbell_attr
	doorbell attributes.

	* Bit(7) indicates if a write to a doorbell is to occur.
	* Bits(6:2) are reserved.
	* Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
	  1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
	  must be naturally aligned to the specified length.

doorbell_data
	data to write to the doorbell. Only the bits corresponding to
	the doorbell length are valid.

sem_cmdN
	semaphore command.

	* Bit(31) indicates this semaphore command is enabled.
	* Bit(30) is the to-device DMA fence. Block this request until all
	  to-device DMA transfers are complete.
	* Bit(29) is the from-device DMA fence. Block this request until all
	  from-device DMA transfers are complete.
	* Bits(28:27) are reserved.
	* Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
	  specified value. 2 is increment. 3 is decrement. 4 is wait
	  until the semaphore is equal to the specified value. 5 is wait
	  until the semaphore is greater than or equal to the specified value.
	  6 is "P", wait until semaphore is greater than 0, then
	  decrement by 1. 7 is reserved.
	* Bit(23) is reserved.
	* Bit(22) is the semaphore sync. 0 is post sync, which means that the
	  semaphore operation is done after the DMA transfer. 1 is
	  presync, which gates the DMA transfer. Only one presync is
	  allowed per request.
	* Bit(21) is reserved.
	* Bits(20:16) are the index of the semaphore to operate on.
	* Bits(15:12) are reserved.
	* Bits(11:0) are the semaphore value to use in operations.
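
The bit layouts above translate into a few small encoding helpers. This is a
sketch based only on the field descriptions in this section; all enum,
macro, and function names are hypothetical and reserved bits are left as zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* pcie_dma_cmd Bits(1:0): type of transfer. Value 3 is illegal. */
enum xfer_dir { XFER_NONE = 0, XFER_TO_DEVICE = 1, XFER_FROM_DEVICE = 2 };

static uint8_t encode_dma_cmd(enum xfer_dir dir, bool bulk,
			      bool want_response, bool force_msi)
{
	return (uint8_t)((force_msi ? 1u << 7 : 0) |	/* Bit(7) */
			 (want_response ? 1u << 4 : 0) | /* Bit(4) */
			 (bulk ? 1u << 3 : 0) |		/* Bit(3) */
			 (unsigned int)dir);		/* Bits(1:0) */
}

/* doorbell_attr Bits(1:0): doorbell length encoding. */
enum db_len { DB_LEN_32 = 0, DB_LEN_16 = 1, DB_LEN_8 = 2 };

static uint8_t encode_doorbell_attr(bool write, enum db_len len)
{
	return (uint8_t)((write ? 1u << 7 : 0) | (unsigned int)len);
}

/* sem_cmdN Bits(26:24): semaphore command. */
enum sem_op {
	SEM_NOP = 0, SEM_INIT = 1, SEM_INC = 2, SEM_DEC = 3,
	SEM_WAIT_EQ = 4, SEM_WAIT_GE = 5, SEM_P = 6,
};

/* Optional fence flags a caller may OR into an encoded command. */
#define SEM_FENCE_TO_DEVICE	(1u << 30)
#define SEM_FENCE_FROM_DEVICE	(1u << 29)

/* Build an enabled semaphore command (Bit(31) set). */
static uint32_t encode_sem_cmd(enum sem_op op, bool presync,
			       unsigned int index, unsigned int value)
{
	assert(index < 32);	/* Bits(20:16) is a 5-bit field */
	assert(value < 4096);	/* Bits(11:0) is a 12-bit field */
	return (1u << 31) | ((uint32_t)op << 24) |
	       ((uint32_t)presync << 22) | ((uint32_t)index << 16) | value;
}
```

For instance, a bulk to-device transfer that requests a response element
encodes as 0x19, and a presync "wait until semaphore 3 equals 1" command
encodes as 0x84430001.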

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore condition must be true
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore conditions must be true
4. If enabled, the doorbell is written

By using the semaphores in conjunction with the workload running on the NSPs,
the data pipeline can be synchronized such that the host can queue multiple
requests of data for the workload to process, but the DMA Bridge will only copy
the data into the memory of the workload when the workload is ready to process
the next input.

Response FIFO
-------------

Once a request is fully processed, a response FIFO element is generated if
specified in pcie_dma_cmd. The structure of a response FIFO element:

.. code-block:: c

  struct response_elem {
	u16 req_id;
	u16 completion_code;
  };

req_id
	matches the req_id of the request that generated this element.

completion_code
	status of this request. 0 is success. Non-zero is an error.

The DMA Bridge will generate an MSI to the host as a reaction to activity in the
response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm mitigation
algorithm, where it will only generate an MSI when the response FIFO transitions
from empty to non-empty (unless force MSI is enabled and triggered). In
response to this MSI, the host is expected to drain the response FIFO, and must
take care to handle any race conditions between draining the FIFO, and the
device inserting elements into the FIFO.
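
The race mentioned above can be handled by re-reading the tail pointer after
publishing the new head, as in the sketch below. This is an illustration under
stated assumptions, not the driver's implementation: the struct and function
names are hypothetical, plain pointers stand in for the MMIO registers, indexes
are assumed to wrap at the element count, and the memory barriers and MMIO
accessors a real host would need are omitted.

```c
#include <assert.h>
#include <stdint.h>

struct response_elem {
	uint16_t req_id;
	uint16_t completion_code;
};

/* Hypothetical host-side view of one DBC's response FIFO. */
struct rsp_fifo {
	const struct response_elem *elems;	/* host memory backing the FIFO */
	uint32_t nelem;
	volatile uint32_t *head_reg;		/* read/write by the host */
	const volatile uint32_t *tail_reg;	/* advanced by the device */
};

typedef void (*rsp_handler)(const struct response_elem *e, void *ctx);

/* Example handler: count responses with a non-zero completion code. */
static void count_errors(const struct response_elem *e, void *ctx)
{
	if (e->completion_code != 0)
		(*(unsigned int *)ctx)++;
}

/*
 * Drain the FIFO until head == tail. The tail is re-read after each head
 * update: elements added while the FIFO was still non-empty do not raise
 * a new MSI, so they must be picked up in the same pass.
 */
static unsigned int drain_responses(struct rsp_fifo *f, rsp_handler fn,
				    void *ctx)
{
	unsigned int handled = 0;
	uint32_t head = *f->head_reg;
	uint32_t tail = *f->tail_reg;

	assert(f->nelem > 0 && head < f->nelem && tail < f->nelem);
	while (head != tail) {
		while (head != tail) {
			fn(&f->elems[head], ctx);
			head = (head + 1) % f->nelem;
			handled++;
		}
		*f->head_reg = head;	/* tell the device what was consumed */
		tail = *f->tail_reg;	/* catch elements added meanwhile */
	}
	return handled;
}
```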

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes requests to the QSM to manage workloads.
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. Each message is a series of
transactions. A passthrough type transaction can contain elements known as
commands.

QSM requires that NNC messages be little-endian encoded and that fields be
naturally aligned. Since there are 64-bit elements in some NNC messages, 64-bit
alignment must be maintained.
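
The encoding rules above can be illustrated with a helper that appends a 64-bit
field to a message buffer. This is a generic sketch of little-endian,
naturally-aligned serialization; the function names are hypothetical and no
actual NNC message layout is implied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round an offset up to the next 8-byte (64-bit) boundary. */
static size_t align_up8(size_t off)
{
	return (off + 7u) & ~(size_t)7u;
}

/* Store a 64-bit value least-significant byte first, on any host. */
static void put_le64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Append a 64-bit field at the next naturally aligned offset, zero-filling
 * any padding. Returns the offset just past the field.
 */
static size_t append_u64(uint8_t *buf, size_t cap, size_t off, uint64_t v)
{
	size_t at = align_up8(off);

	assert(at + 8 <= cap);
	while (off < at)
		buf[off++] = 0;
	put_le64(buf + at, v);
	return at + 8;
}
```

Appending at offset 4, for example, zero-fills bytes 4 to 7 and places the
value in bytes 8 to 15.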

A message contains a header and then a series of transactions. A message may be
at most 4K in size from QSM to the host. From the host to the QSM, a message
can be at most 64K (maximum size of a single MHI packet), but there is a
continuation feature where message N+1 can be marked as a continuation of
message N. This is used for exceedingly large DMA transfer transactions.
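
The size limits can be captured in a short check, along with the arithmetic for
how many host-to-QSM messages a large payload would need when continuation is
used. The names are hypothetical, and the per-message overhead is a parameter
because the header size is not given in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NNC_MAX_FROM_QSM	(4u * 1024u)	/* QSM to host: at most 4K */
#define NNC_MAX_TO_QSM		(64u * 1024u)	/* host to QSM: at most 64K */

enum nnc_dir { NNC_TO_QSM, NNC_FROM_QSM };

/* Does a message of len bytes fit the limit for its direction? */
static bool nnc_len_ok(enum nnc_dir dir, size_t len)
{
	return len <= (dir == NNC_TO_QSM ? NNC_MAX_TO_QSM : NNC_MAX_FROM_QSM);
}

/*
 * Number of host-to-QSM messages needed to carry payload_len bytes when
 * each message spends per_msg_overhead bytes on headers (hypothetical
 * value supplied by the caller). Messages after the first would be marked
 * as continuations.
 */
static size_t nnc_msgs_needed(size_t payload_len, size_t per_msg_overhead)
{
	size_t room;

	assert(per_msg_overhead < NNC_MAX_TO_QSM);
	room = NNC_MAX_TO_QSM - per_msg_overhead;
	return payload_len == 0 ? 1 : (payload_len + room - 1) / room;
}
```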

Transaction descriptions
------------------------

passthrough
	Allows userspace to send an opaque payload directly to the QSM.
	This is used for NNC commands. Userspace is responsible for managing
	the QSM message requirements in the payload.

dma_xfer
	DMA transfer. Describes an object that the QSM should DMA into the
	device via address and size tuples.

activate
	Activate a workload onto NSPs. The host must provide memory to be
	used by the DBC.

deactivate
	Deactivate an active workload and return the NSPs to idle.

status
	Query the QSM about its NNC implementation. Returns the NNC version,
	and if CRC is used.

terminate
	Release a user's resources.

dma_xfer_cont
	Continuation of a previous DMA transfer. If a DMA transfer
	cannot be specified in a single message (highly fragmented), this
	transaction can be used to specify more ranges.

validate_partition
	Query the QSM to determine if a partition identifier is valid.

Each message is tagged with a user id, and a partition id. The user id allows
QSM to track resources, and release them when the user goes away (e.g. the
process crashes). A partition id identifies the resource partition that QSM
manages, which this message applies to.

Messages may have CRCs. Messages should have CRCs applied until the QSM
reports via the status transaction that CRCs are not needed. The QSM on the
SA9000P requires CRCs for black channel safing.

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of an error. An AIC100 device may
have multiple users, each with their own workload running. If the workload of
one user crashes, the fallout of that should be limited to that workload and not
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
channel. This notification identifies the workload by its assigned DBC. A
multi-stage recovery process is then used to clean up both sides, and get the
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is lost. Any inputs that were in
process, or queued but not yet serviced, are lost. The loaded artifacts will
remain in on-card DDR, but the host will need to re-activate the workload if
it desires to recover the workload.

Reliability, Availability, Serviceability (RAS)
===============================================

AIC100 is expected to be deployed in server systems where RAS ideology is
applied. Simply put, RAS is the concept of detecting, classifying, and
reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
into RAS, AER does not allow for a device to report details about internal
errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
occurs, QSM will report the event with appropriate details via the QAIC_STATUS
MHI channel. A sysadmin may determine that a particular device needs
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical attributes of the device, and in
some cases, to allow the host to control them. Examples include thermal limits,
thermal readings, and power readings. These items are communicated via the
QAIC_TELEMETRY MHI channel.

518