.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.

Multiple AIC100 cards can be hosted in a single system to scale overall
performance. AIC100 cards are multi-user capable and able to execute workloads
from multiple users concurrently.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of
miscellaneous peripherals (PMICs, etc.).

An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
or a Dual M.2 card. Both use PCIe to connect to the host system.

As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
DeviceID(DID) combination to uniquely identify itself to the host. AIC100
uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
AIC100 DID (0xa100).

AIC100 does not implement FLR (function level reset).

AIC100 implements MSI but does not implement MSI-X. AIC100 requires 17 MSIs to
operate (1 for MHI, 16 for the DMA Bridge).

As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the MHI interface to the host.

* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
  host.

* The third BAR is variable in size based on an individual AIC100's
  configuration, but defaults to 64K. This BAR currently has no purpose.

From the host perspective, AIC100 has several key hardware components:

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processors)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI itself is documented at
Documentation/mhi/index.rst. MHI is the mechanism the host uses to communicate
with the QSM. Except for workload data via the DMA Bridge, all interaction with
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU that runs the primary
firmware of the card and performs on-card management tasks. It also
communicates with the host via MHI. Each AIC100 has one of
these.

NSP
---

Neural Signal Processor. Each AIC100 has up to 16 of these. These are
the processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
multiple NSPs may be assigned to a single workload. Since each NSP can only run
one workload, AIC100 is limited to 16 concurrent workloads. Workload
"scheduling" is under the purview of the host. AIC100 does not automatically
timeslice.

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manages the flow of data
in and out of workloads. AIC100 has one of these. The DMA Bridge has 16
channels, each consisting of a set of request/response FIFOs. Each active
workload is assigned a single DMA Bridge channel. The DMA Bridge exposes
hardware registers to manage the FIFOs (head/tail pointers), but requires host
memory to store the FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
This DDR stores workloads and the data for the workloads, and is used by the
QSM for managing the device.
NSPs are granted access to sections of the DDR by
the QSM. The host does not have direct access to the DDR, and must make
requests to the QSM to transfer data to the DDR.

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerator typically used for running
neural networks in inferencing mode to efficiently perform AI operations.
AIC100 is not intended for training neural networks. AIC100 can be utilized
for generic compute workloads.

Assuming a user wants to utilize AIC100, they would follow these steps:

1. Compile the workload into an ELF targeting the NSP(s)
2. Make requests to the QSM to load the workload and related artifacts into the
   device DDR
3. Make a request to the QSM to activate the workload onto a set of idle NSPs
4. Make requests to the DMA Bridge to send input data to the workload to be
   processed, and other requests to receive processed output data from the
   workload.
5. Once the workload is no longer required, make a request to the QSM to
   deactivate the workload, thus putting the NSPs back into an idle state.
6. Once the workload and related artifacts are no longer needed for future
   sessions, make requests to the QSM to unload the data from DDR. This frees
   the DDR to be used by other users.


Boot Flow
=========

AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.

When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
from ROM. PBL enumerates the PCIe link, and initializes the BHI (Boot Host
Interface) component of MHI.

Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
image. The PBL pulls the image from the host, validates it, and begins
execution of SBL.

SBL initializes MHI, and uses MHI to notify the host that the device has entered
the SBL stage.
SBL performs a number of operations:

* SBL initializes the majority of hardware (anything PBL left uninitialized),
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host for future logging.
* SBL uses the Sahara protocol to obtain the runtime firmware images from the
  host.

Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the device has entered the QSM stage
(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
ready to process workloads.

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream LLVM can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kernel driver can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100

Sahara loader
-------------

An open implementation of the Sahara protocol called kickstart can be found at:
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for different purposes. This is a list
of the defined channels, and their uses.

+----------------+---------+----------+----------------------------------------+
| Channel name   | IDs     | EEs      | Purpose                                |
+================+=========+==========+========================================+
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any data sent to the device on this    |
|                |         |          | channel is sent back to the host.      |
+----------------+---------+----------+----------------------------------------+
| QAIC_SAHARA    | 2 & 3   | SBL      | Used by SBL to obtain the runtime      |
|                |         |          | firmware from the host.                |
+----------------+---------+----------+----------------------------------------+
| QAIC_DIAG      | 4 & 5   | AMSS     | Used to communicate with QSM via the   |
|                |         |          | DIAG protocol.                         |
+----------------+---------+----------+----------------------------------------+
| QAIC_SSR       | 6 & 7   | AMSS     | Used to notify the host of subsystem   |
|                |         |          | restart events, and to offload SSR     |
|                |         |          | crashdumps.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_QDSS      | 8 & 9   | AMSS     | Used for the Qualcomm Debug Subsystem. |
+----------------+---------+----------+----------------------------------------+
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used for the Neural Network Control    |
|                |         |          | (NNC) protocol. This is the primary    |
|                |         |          | channel between host and QSM for       |
|                |         |          | managing workloads.                    |
+----------------+---------+----------+----------------------------------------+
| QAIC_LOGGING   | 12 & 13 | SBL      | Used by the SBL to send the bootlog to |
|                |         |          | the host.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_STATUS    | 14 & 15 | AMSS     | Used to notify the host of Reliability,|
|                |         |          | Availability, Serviceability (RAS)     |
|                |         |          | events.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used to get/set power/thermal/etc      |
|                |         |          | attributes.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not used.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 20 & 21 | SBL/AMSS | Used to synchronize timestamps in the  |
|                |         |          | device side logs with the host time    |
|                |         |          | source.                                |
+----------------+---------+----------+----------------------------------------+

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces to the host from the device
(the other being MHI). As part of activating a workload to run on NSPs, the QSM
assigns that workload a DMA Bridge channel. A workload's DMA Bridge channel
(DBC for short) is solely for the use of that workload and is not shared with
other workloads.

Each DBC is a pair of FIFOs that manage data in and out of the workload. One
FIFO is the request FIFO. The other FIFO is the response FIFO.

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
  latest item in the FIFO the device has consumed.
* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
  increments this register to add new items to the FIFO.
* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
  the latest item in the FIFO the host has consumed.
* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
  increments this register to add new items to the FIFO.

The values in each register are indexes in the FIFO. To get the location of the
FIFO element pointed to by the register: FIFO base address + register * element
size.

DBC registers are exposed to the host via the second BAR. Each DBC consumes
4KB of space in the BAR.

The actual FIFOs are backed by host memory. When sending a request to the QSM
to activate a workload, the host must donate memory to be used for the FIFOs.
Due to internal mapping limitations of the device, a single contiguous chunk of
memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
consume the beginning of the memory chunk, and the response FIFO will consume
the end of the memory chunk.

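The register and FIFO arithmetic described above can be sketched in C as
follows. This is only an illustration: the helper names and the
``dbc_region_base`` parameter are hypothetical, since this document does not
state where in the second BAR the first DBC window begins, so the caller must
supply that offset.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets inside one DBC window, as listed above. */
enum dbc_reg {
	DBC_REQ_HEAD = 0x0,	/* request FIFO head, read only by the host */
	DBC_REQ_TAIL = 0x4,	/* request FIFO tail, read/write by the host */
	DBC_RSP_HEAD = 0x8,	/* response FIFO head, read/write by the host */
	DBC_RSP_TAIL = 0xc,	/* response FIFO tail, read only by the host */
};

#define DBC_WINDOW_SIZE	0x1000u	/* each DBC consumes 4KB of the second BAR */
#define DBC_COUNT	16u	/* the DMA Bridge has 16 channels */

/*
 * Offset of a DBC register from the start of the second BAR.
 * dbc_region_base is a hypothetical parameter: the BAR offset of the
 * window belonging to DBC 0, which this document does not specify.
 */
static inline uint32_t dbc_reg_offset(uint32_t dbc_region_base,
				      uint32_t dbc_id, enum dbc_reg reg)
{
	assert(dbc_id < DBC_COUNT);
	return dbc_region_base + dbc_id * DBC_WINDOW_SIZE + (uint32_t)reg;
}

/* FIFO base address + register value (an index) * element size. */
static inline uint64_t fifo_elem_addr(uint64_t fifo_base, uint32_t index,
				      uint32_t elem_size)
{
	return fifo_base + (uint64_t)index * elem_size;
}
```

For example, with a FIFO base of 0x10000 and 64-byte elements, index 3 refers
to the element at address 0x100c0.
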

Request FIFO
------------

A request FIFO element has the following structure:

.. code-block:: c

  struct request_elem {
        u16 req_id;
        u8  seq_id;
        u8  pcie_dma_cmd;
        u32 reserved0;          /* reserved fields are numbered so the */
        u64 pcie_dma_source_addr;
        u64 pcie_dma_dest_addr;
        u32 pcie_dma_len;
        u32 reserved1;          /* struct compiles; names are arbitrary */
        u64 doorbell_addr;
        u8  doorbell_attr;
        u8  reserved2;
        u16 reserved3;
        u32 doorbell_data;
        u32 sem_cmd0;
        u32 sem_cmd1;
        u32 sem_cmd2;
        u32 sem_cmd3;
  };

Request field descriptions:

req_id
        request ID. A request FIFO element and a response FIFO element with
        the same request ID refer to the same command.

seq_id
        sequence ID within a request. Ignored by the DMA Bridge.

pcie_dma_cmd
        describes the DMA element of this request.

        * Bit(7) is the force MSI flag, which overrides the DMA Bridge MSI logic
          and generates an MSI when this request is complete, if QSM
          configures the DMA Bridge to look at this bit.
        * Bits(6:5) are reserved.
        * Bit(4) is the completion code flag, and indicates that the DMA Bridge
          shall generate a response FIFO element when this request is
          complete.
        * Bit(3) indicates if this request is a linked list transfer(0) or a bulk
          transfer(1).
        * Bit(2) is reserved.
        * Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
          from device(2). Value 3 is illegal.

pcie_dma_source_addr
        source address for a bulk transfer, or the address of the linked list.

pcie_dma_dest_addr
        destination address for a bulk transfer.

pcie_dma_len
        length of the bulk transfer. Note that the size of this field
        limits transfers to 4G in size.

doorbell_addr
        address of the doorbell to ring when this request is complete.

doorbell_attr
        doorbell attributes.

        * Bit(7) indicates if a write to a doorbell is to occur.
        * Bits(6:2) are reserved.
        * Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
          1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
          must be naturally aligned to the specified length.

doorbell_data
        data to write to the doorbell. Only the bits corresponding to
        the doorbell length are valid.

sem_cmdN
        semaphore command.

        * Bit(31) indicates this semaphore command is enabled.
        * Bit(30) is the to-device DMA fence. Block this request until all
          to-device DMA transfers are complete.
        * Bit(29) is the from-device DMA fence. Block this request until all
          from-device DMA transfers are complete.
        * Bits(28:27) are reserved.
        * Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
          specified value. 2 is increment. 3 is decrement. 4 is wait
          until the semaphore is equal to the specified value. 5 is wait
          until the semaphore is greater than or equal to the specified
          value. 6 is "P", wait until semaphore is greater than 0, then
          decrement by 1. 7 is reserved.
        * Bit(23) is reserved.
        * Bit(22) is the semaphore sync. 0 is post sync, which means that the
          semaphore operation is done after the DMA transfer. 1 is
          presync, which gates the DMA transfer. Only one presync is
          allowed per request.
        * Bit(21) is reserved.
        * Bits(20:16) are the index of the semaphore to operate on.
        * Bits(15:12) are reserved.
        * Bits(11:0) are the semaphore value to use in operations.

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore condition must be true
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore conditions must be true
4. If enabled, the doorbell is written

By using the semaphores in conjunction with the workload running on the NSPs,
the data pipeline can be synchronized such that the host can queue multiple
requests of data for the workload to process, but the DMA Bridge will only copy
the data into the memory of the workload when the workload is ready to process
the next input.

Response FIFO
-------------

Once a request is fully processed, a response FIFO element is generated if
specified in pcie_dma_cmd. The structure of a response FIFO element:

.. code-block:: c

  struct response_elem {
        u16 req_id;
        u16 completion_code;
  };

req_id
        matches the req_id of the request that generated this element.

completion_code
        status of this request. 0 is success. Non-zero is an error.

The DMA Bridge will generate an MSI to the host as a reaction to activity in the
response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm mitigation
algorithm, where it will only generate an MSI when the response FIFO transitions
from empty to non-empty (unless force MSI is enabled and triggered). In
response to this MSI, the host is expected to drain the response FIFO, and must
take care to handle any race conditions between draining the FIFO, and the
device inserting elements into the FIFO.

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes requests to the QSM to manage workloads.
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. Each message is a series of
transactions. A passthrough type transaction can contain elements known as
commands.

QSM requires NNC messages be little endian encoded and the fields be naturally
aligned. Since there are 64-bit elements in some NNC messages, 64-bit alignment
must be maintained.

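The encoding rules above (little endian fields, natural alignment, 64-bit
alignment where 64-bit elements are present) can be illustrated with a small
portable C sketch. The ``nnc_*`` helper names are hypothetical and are not part
of any driver API; in-kernel code would more typically rely on helpers such as
``cpu_to_le32()`` and ``cpu_to_le64()``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a byte offset up to the next multiple of 8 (64-bit alignment). */
static inline size_t nnc_align8(size_t off)
{
	return (off + 7u) & ~(size_t)7u;
}

/*
 * Store a 32-bit value at buf[off] in little endian byte order,
 * independent of host endianness. Returns the offset past the field.
 */
static inline size_t nnc_put_le32(uint8_t *buf, size_t off, uint32_t val)
{
	assert(off % 4 == 0);	/* natural alignment of a 32-bit field */
	for (int i = 0; i < 4; i++)
		buf[off + i] = (uint8_t)(val >> (8 * i));
	return off + 4;
}

/* Same as above for a 64-bit field, which must be 8-byte aligned. */
static inline size_t nnc_put_le64(uint8_t *buf, size_t off, uint64_t val)
{
	assert(off % 8 == 0);	/* natural alignment of a 64-bit field */
	for (int i = 0; i < 8; i++)
		buf[off + i] = (uint8_t)(val >> (8 * i));
	return off + 8;
}
```

A caller that writes a 32-bit field followed by a 64-bit field would pass the
offset returned by ``nnc_put_le32()`` through ``nnc_align8()`` before calling
``nnc_put_le64()``, leaving four bytes of padding between the two fields.
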

A message contains a header and then a series of transactions. A message may be
at most 4K in size from QSM to the host. From the host to the QSM, a message
can be at most 64K (maximum size of a single MHI packet), but there is a
continuation feature where message N+1 can be marked as a continuation of
message N. This is used for exceedingly large DMA xfer transactions.

Transaction descriptions
------------------------

passthrough
        Allows userspace to send an opaque payload directly to the QSM.
        This is used for NNC commands. Userspace is responsible for managing
        the QSM message requirements in the payload.

dma_xfer
        DMA transfer. Describes an object that the QSM should DMA into the
        device via address and size tuples.

activate
        Activate a workload onto NSPs. The host must provide memory to be
        used by the DBC.

deactivate
        Deactivate an active workload and return the NSPs to idle.

status
        Query the QSM about its NNC implementation. Returns the NNC version,
        and if CRC is used.

terminate
        Release a user's resources.

dma_xfer_cont
        Continuation of a previous DMA transfer. If a DMA transfer
        cannot be specified in a single message (highly fragmented), this
        transaction can be used to specify more ranges.

validate_partition
        Query the QSM to determine if a partition identifier is valid.

Each message is tagged with a user id, and a partition id. The user id allows
QSM to track resources, and release them when the user goes away (e.g. the
process crashes). A partition id identifies the resource partition that QSM
manages, which this message applies to.

Messages may have CRCs. Messages should have CRCs applied until the QSM
reports via the status transaction that CRCs are not needed. The QSM on the
SA9000P requires CRCs for black channel safing.

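The message size limits and the continuation feature described earlier in this
section can be sketched as follows. This is a rough illustration only: the
helper names are hypothetical, and ``overhead`` and ``range_size`` are
caller-supplied byte counts because the wire layout of the message header and
of an address/size tuple is not given in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NNC_MAX_MSG_TO_HOST	(4u * 1024u)	/* QSM to host: at most 4K */
#define NNC_MAX_MSG_TO_QSM	(64u * 1024u)	/* host to QSM: at most 64K */

/* Check a message length against the limit for its direction. */
static inline bool nnc_msg_fits(bool to_qsm, size_t len)
{
	return len <= (to_qsm ? NNC_MAX_MSG_TO_QSM : NNC_MAX_MSG_TO_HOST);
}

/*
 * Number of host-to-QSM messages needed to describe n_ranges address/size
 * tuples, assuming every message spends `overhead` bytes on headers and
 * each tuple occupies `range_size` bytes (both values are hypothetical).
 * Any result above 1 means continuation messages are required.
 */
static inline size_t nnc_msgs_needed(size_t n_ranges, size_t overhead,
				     size_t range_size)
{
	size_t per_msg;

	assert(range_size > 0);
	assert(overhead + range_size <= NNC_MAX_MSG_TO_QSM);
	per_msg = (NNC_MAX_MSG_TO_QSM - overhead) / range_size;
	return (n_ranges + per_msg - 1) / per_msg;	/* round up */
}
```

With an assumed 64 bytes of overhead and 16 bytes per tuple, one message holds
4092 tuples, so a highly fragmented transfer of 10000 ranges would need three
messages: the initial one and two continuations.
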

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of an error. An AIC100 device may
have multiple users, each with their own workload running. If the workload of
one user crashes, the fallout of that should be limited to that workload and not
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
channel. This notification identifies the workload by its assigned DBC. A
multi-stage recovery process is then used to clean up both sides, and get the
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is lost. Any inputs that were in
process, or queued but not yet serviced, are lost. The loaded artifacts will
remain in on-card DDR, but the host will need to re-activate the workload if
it desires to recover the workload.

Reliability, Availability, Serviceability (RAS)
===============================================

AIC100 is expected to be deployed in server systems where RAS ideology is
applied. Simply put, RAS is the concept of detecting, classifying, and
reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
into RAS, AER does not allow for a device to report details about internal
errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
occurs, QSM will report the event with appropriate details via the QAIC_STATUS
MHI channel. A sysadmin may determine that a particular device needs
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical attributes of the device, and in
some cases, to allow the host to control them. Examples include thermal limits,
thermal readings, and power readings. These items are communicated via the
QAIC_TELEMETRY MHI channel.