.. SPDX-License-Identifier: GPL-2.0-only

.. include:: <isonum.txt>

=========
 AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <sonal.santan@amd.com>

Overview
========

AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of
Machine Learning applications like CNNs, LLMs, etc. The NPU is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.


Hardware Description
====================

AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

The AMD XDNA Array comprises a 2D array of compute and memory tiles built
with `AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and
1 row of memory tiles. Each compute tile contains a VLIW processor with its
own dedicated program and data memory. The memory tile acts as L2 memory. The
2D array can be partitioned at a column boundary, creating a spatially
isolated partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
the memory tile.

AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows
of compute tiles arranged into 5 columns. The AMD Strix Point client APU has
a 4x8 topology, i.e., 4 rows of compute tiles arranged into 8 columns.

Shared L2 Memory
----------------

The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and memory tiles.
AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2 memory. The
AMD Strix Point NPU has a total of 4096 KB of L2 memory.
Microcontroller
---------------

A microcontroller runs the NPU Firmware, which is responsible for command
processing, XDNA Array partition setup, XDNA Array configuration, workload
context management and workload orchestration.

NPU Firmware uses a dedicated instance of an isolated non-privileged context
called ERT to service each workload context. ERT is also used to execute user
provided ``ctrlcode`` associated with the workload context.

NPU Firmware uses a single isolated privileged context called MERT to service
management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and amdxdna driver use a privileged channel for
management tasks like setting up contexts, telemetry, queries, error
handling, setting up the user channel, etc. As mentioned before, privileged
channel requests are serviced by MERT. The privileged channel is bound to a
single mailbox.

The microcontroller and amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.

PCIe EP
-------

The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs
and some MSI-X interrupt vectors. The NPU uses a dedicated high bandwidth SoC
level fabric for reading from or writing into host memory. Each instance of
ERT gets its own dedicated MSI-X interrupt. MERT gets a single MSI-X
interrupt.

The number of PCIe BARs varies depending on the specific device. Based on
their functions, PCIe BARs can generally be categorized into the following
types.
* PSP BAR: Exposes the AMD PSP (Platform Security Processor) function
* SMU BAR: Exposes the AMD SMU (System Management Unit) function
* SRAM BAR: Exposes ring buffers for the mailbox
* Mailbox BAR: Exposes the mailbox control registers (head, tail and ISR
  registers, etc.)
* Public Register BAR: Exposes public registers

On specific devices, the above-mentioned BAR types might be combined into a
single physical PCIe BAR. Or a module might require two physical PCIe BARs to
be fully functional. For example,

* On the AMD Phoenix device, the PSP, SMU and Public Register BARs are on
  PCIe BAR index 0.
* On the AMD Strix Point device, the Mailbox and Public Register BARs are on
  PCIe BAR index 0. The PSP has some registers in PCIe BAR index 0 (Public
  Register BAR) and PCIe BAR index 4 (PSP BAR).

Process Isolation Hardware
--------------------------

As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller programming the column isolation
registers. Each spatial partition is associated with a PASID, which is also
programmed by the microcontroller. Hence multiple spatial partitions in the
NPU can make concurrent host accesses protected by PASID.

The NPU FW itself uses microcontroller MMU enforced isolated contexts for
servicing user and privileged channel requests.


Mixed Spatial and Temporal Scheduling
=====================================

The AMD XDNA architecture supports mixed spatial and temporal (time sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context while another
partition may be *temporarily* bound to more than one workload context.
The microcontroller updates the PASID for a temporarily shared partition to
match the context that has been bound to the partition at any moment.

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation
of the 2D array among various workloads. Every workload describes the number
of columns required to run the NPU binary in its metadata. The Resource
Solver component uses hints passed by the workload and its own heuristics to
decide the 2D array (re)partition strategy and the mapping of workloads for
spatial and temporal sharing of columns. The FW enforces the
context-to-column(s) resource binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPUs can support 6 concurrent workload
contexts. AMD Strix Point can support 16 concurrent workload contexts.


Application Binaries
====================

An NPU application workload comprises two separate binaries which are
generated by the NPU compiler.

1. The AMD XDNA Array overlay, which is used to configure an NPU spatial
   partition. The overlay contains instructions for setting up the stream
   switch configuration and ELF for the compute tiles. The overlay is loaded
   on the spatial partition bound to the workload by the associated ERT
   instance. Refer to the
   `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more
   details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
   partition. ``ctrlcode`` is executed by the ERT running in protected mode
   on the microcontroller in the context of the workload. ``ctrlcode`` is
   made up of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
   `AI Engine Run Time`_ for more details.
Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The
``ctrlcode`` used by the workload is copied into this special memory. This
buffer is protected by PASID like all other input/output buffers used by
that workload. The instruction buffer is also mapped into the user space of
the workload.

Global Privileged Buffer
------------------------

In addition, the driver also allocates a single buffer for maintenance tasks
like recording errors from MERT. This global buffer uses the global IOMMU
domain and is only accessible by MERT.


High-level Use Flow
===================

Here are the steps to run a workload on the AMD NPU:

1. Compile the workload into an overlay and a ``ctrlcode`` binary.
2. Userspace opens a context in the driver and provides the overlay.
3. The driver checks with the Resource Solver for provisioning a set of
   columns for the workload.
4. The driver then asks MERT to create a context on the device with the
   desired columns.
5. MERT then creates an instance of ERT. MERT also maps the Instruction
   Buffer into ERT memory.
6. Userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7. Userspace then creates a command buffer with pointers to the input,
   output and instruction buffers; it then submits the command buffer to the
   driver and goes to sleep waiting for completion.
8. The driver sends the command over the Mailbox to ERT.
9. ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR
    while the AMD XDNA Array is running.
11. When ERT reaches the end of the ``ctrlcode``, it raises an MSI-X
    interrupt to send a completion signal to the driver, which then wakes up
    the waiting workload.


Boot Flow
=========

The amdxdna driver uses the PSP to securely load the signed NPU FW and kick
off the boot of the NPU microcontroller. The amdxdna driver then waits for
the alive signal in a special location on BAR 0. The NPU is switched off
during SoC suspend and turned on after resume, where the NPU FW is reloaded
and the handshake is performed again.


Userspace components
====================

Compiler
--------

Peano is an LLVM based open-source compiler for the AMD XDNA Array compute
tile, available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for
the AMD NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for the NPU can be found at:
https://github.com/amd/xdna-driver


DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
operations between host DDR and L2 memory are effected.


Error Handling
==============

When MERT detects an error in the AMD XDNA Array, it pauses execution for
the faulting workload context and sends an asynchronous message to the
driver over the privileged channel. The driver then sends a buffer pointer
to MERT to capture the register states for the partition bound to the
faulting workload context. The driver then decodes the error by reading the
contents of the buffer.
Telemetry
=========

MERT can report various kinds of telemetry information like the following:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
* etc.


References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_