.. SPDX-License-Identifier: GPL-2.0-only

.. include:: <isonum.txt>

=========
 AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <sonal.santan@amd.com>

Overview
========

AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of
Machine Learning applications like CNNs, LLMs, etc. The NPU is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.


Hardware Description
====================

AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

The AMD XDNA Array comprises a 2D array of compute and memory tiles built with
`AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1
row of memory tiles. Each compute tile contains a VLIW processor with its own
dedicated program and data memory. The memory tile acts as L2 memory. The 2D
array can be partitioned at a column boundary, creating a spatially isolated
partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
the memory tile.

AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows of
compute tiles arranged into 5 columns. The AMD Strix Point client APU has a 4x8
topology, i.e., 4 rows of compute tiles arranged into 8 columns.

Shared L2 Memory
----------------

The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and memory tiles.
AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2 memory.
The AMD Strix Point NPU has a total of 4096 KB of L2 memory.

Microcontroller
---------------

A microcontroller runs NPU Firmware which is responsible for command processing,
XDNA Array partition setup, XDNA Array configuration, workload context
management and workload orchestration.

NPU Firmware uses a dedicated instance of an isolated non-privileged context
called ERT to service each workload context. ERT is also used to execute user
provided ``ctrlcode`` associated with the workload context.

NPU Firmware uses a single isolated privileged context called MERT to service
management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and amdxdna driver use a privileged channel for management
tasks like setting up contexts, telemetry, query, error handling, setting up
the user channel, etc. As mentioned before, privileged channel requests are
serviced by MERT. The privileged channel is bound to a single mailbox.

The microcontroller and amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.

PCIe EP
-------

The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and
some MSI-X interrupt vectors. The NPU uses a dedicated high-bandwidth SoC-level
fabric for reading from or writing into host memory. Each instance of ERT gets
its own dedicated MSI-X interrupt. MERT gets a single MSI-X interrupt.

The number of PCIe BARs varies depending on the specific device. Based on their
functions, PCIe BARs can generally be categorized into the following types.

* PSP BAR: Exposes the AMD PSP (Platform Security Processor) function
* SMU BAR: Exposes the AMD SMU (System Management Unit) function
* SRAM BAR: Exposes ring buffers for the mailbox
* Mailbox BAR: Exposes the mailbox control registers (head, tail and ISR
  registers etc.)
* Public Register BAR: Exposes public registers

On specific devices, the above-mentioned BAR types might be combined into a
single physical PCIe BAR. Or a module might require two physical PCIe BARs to
be fully functional. For example:

* On the AMD Phoenix device, the PSP, SMU and Public Register BARs are on PCIe
  BAR index 0.
* On the AMD Strix Point device, the Mailbox and Public Register BARs are on
  PCIe BAR index 0. The PSP has some registers in PCIe BAR index 0 (Public
  Register BAR) and PCIe BAR index 4 (PSP BAR).

Process Isolation Hardware
--------------------------

As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller programming the column isolation
registers. Each spatial partition is associated with a PASID, which is also
programmed by the microcontroller. Hence, multiple spatial partitions in the
NPU can access host memory concurrently, with each access protected by PASID.

The NPU FW itself uses microcontroller MMU enforced isolated contexts for
servicing user and privileged channel requests.


Mixed Spatial and Temporal Scheduling
=====================================

AMD XDNA architecture supports mixed spatial and temporal (time sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context while another
partition may be *temporally* bound to more than one workload context. The
microcontroller updates the PASID for a temporally shared partition to match
the context that has been bound to the partition at any moment.

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation
of the 2D array among various workloads. Every workload describes the number
of columns required to run the NPU binary in its metadata. The Resource Solver
component uses hints passed by the workload and its own heuristics to decide
the 2D array (re)partition strategy and the mapping of workloads for spatial
and temporal sharing of columns. The FW enforces the context-to-column(s)
resource binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPUs can support 6 concurrent workload
contexts. AMD Strix Point can support 16 concurrent workload contexts.


Application Binaries
====================

An NPU application workload comprises two separate binaries which are
generated by the NPU compiler.

1. AMD XDNA Array overlay, which is used to configure an NPU spatial partition.
   The overlay contains instructions for setting up the stream switch
   configuration and ELF for the compute tiles. The overlay is loaded on the
   spatial partition bound to the workload by the associated ERT instance.
   Refer to the
   `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
   partition. ``ctrlcode`` is executed by the ERT running in protected mode on
   the microcontroller in the context of the workload. ``ctrlcode`` is made up
   of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
   `AI Engine Run Time`_ for more details.


Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host-resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The ``ctrlcode``
used by the workload is copied into this special memory. This buffer is
protected by PASID like all other input/output buffers used by that workload.
The instruction buffer is also mapped into the user space of the workload.

Global Privileged Buffer
------------------------

In addition, the driver also allocates a single buffer for maintenance tasks
like recording errors from MERT. This global buffer uses the global IOMMU
domain and is only accessible by MERT.


High-level Use Flow
===================

Here are the steps to run a workload on AMD NPU:

1.  Compile the workload into an overlay and a ``ctrlcode`` binary.
2.  Userspace opens a context in the driver and provides the overlay.
3.  The driver checks with the Resource Solver for provisioning a set of columns
    for the workload.
4.  The driver then asks MERT to create a context on the device with the desired
    columns.
5.  MERT then creates an instance of ERT. MERT also maps the Instruction Buffer
    into ERT memory.
6.  The userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7.  Userspace then creates a command buffer with pointers to input, output, and
    instruction buffer; it then submits the command buffer to the driver and
    goes to sleep waiting for completion.
8.  The driver sends the command over the Mailbox to ERT.
9.  ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR while
    AMD XDNA Array is running.
11. When ERT reaches the end of ``ctrlcode``, it raises an MSI-X interrupt to
    send a completion signal to the driver, which then wakes up the waiting
    workload.


Boot Flow
=========

The amdxdna driver uses the PSP to securely load signed NPU FW and kick off
the boot of the NPU microcontroller. The amdxdna driver then waits for the
alive signal in a special location on BAR 0. The NPU is switched off during
SoC suspend and turned on after resume, where the NPU FW is reloaded and the
handshake is performed again.


Userspace components
====================

Compiler
--------

Peano is an LLVM-based open-source compiler for the AMD XDNA Array compute
tile, available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for AMD
NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for NPU can be found at:
https://github.com/amd/xdna-driver


DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
operations between host DDR and L2 memory are performed.


Error Handling
==============

When MERT detects an error in the AMD XDNA Array, it pauses execution for that
workload context and sends an asynchronous message to the driver over the
privileged channel. The driver then sends a buffer pointer to MERT to capture
the register states for the partition bound to the faulting workload context.
The driver then decodes the error by reading the contents of the buffer.


Telemetry
=========

MERT can report various kinds of telemetry information like the following:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
* etc.


References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_