.. _amdgpu-userq:

==================
 User Mode Queues
==================

Introduction
============

Similar to the KFD, GPU engine queues move into userspace.  The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work.  This reduces overhead and also allows
the GPU to submit work to itself.  Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.

UMDs (user mode drivers) directly interface with firmware via per-application
shared memory areas.  The main vehicle for this is the queue.  A queue is a ring
buffer with a read pointer (rptr) and a write pointer (wptr).  The UMD writes
IP-specific packets into the queue and the firmware processes those packets,
kicking off work on the GPU engines.  The CPU in the application (or another
queue or device) updates the wptr to tell the firmware how far into the ring
buffer to process packets, and the rptr provides feedback to the UMD on how far
the firmware has progressed in executing those packets.  When the wptr and the
rptr are equal, the queue is idle.
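
The relationship between the two pointers can be modeled with a short C sketch.
The structure and helper names are hypothetical, and the sketch assumes
free-running 64-bit pointers counted in dwords and a power-of-two ring size;
each engine defines its own pointer units and wrap rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a queue; not a driver or firmware structure. */
struct ring_model {
	uint64_t rptr;     /* advanced by the consumer (firmware) */
	uint64_t wptr;     /* advanced by the producer (UMD) */
	uint32_t size_dw;  /* ring size in dwords, assumed power of two */
};

/* The queue is idle when the consumer has caught up with the producer. */
static bool ring_idle(const struct ring_model *r)
{
	return r->rptr == r->wptr;
}

/* Dwords written by the producer but not yet consumed. */
static uint64_t ring_pending_dw(const struct ring_model *r)
{
	return r->wptr - r->rptr;
}

/* Dwords the producer may still write without overrunning the consumer. */
static uint64_t ring_free_dw(const struct ring_model *r)
{
	return r->size_dw - ring_pending_dw(r);
}

/* Slot in the ring buffer that a free-running pointer refers to. */
static uint32_t ring_slot(const struct ring_model *r, uint64_t ptr)
{
	/* Masking only works for power-of-two sizes. */
	assert(r->size_dw != 0 && (r->size_dw & (r->size_dw - 1)) == 0);
	return (uint32_t)(ptr & (r->size_dw - 1));
}
```

In this model a producer would check ``ring_free_dw()`` before writing packets
and a waiter would poll ``ring_idle()`` to learn that all submitted work has
been consumed.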

Theory of Operation
===================

The various engines on modern AMD GPUs support multiple queues per engine with a
scheduling firmware which handles dynamically scheduling user queues on the
available hardware queue slots.  When the user queues outnumber the available
hardware queue slots, the scheduling firmware dynamically maps and unmaps queues
based on priority and time quanta.  The state of each user queue is managed in
the kernel driver in an MQD (Memory Queue Descriptor).  This is a buffer in GPU
accessible memory that stores the state of a user queue.  The scheduling
firmware uses the MQD to load the queue state into an HQD (Hardware Queue
Descriptor) when a user queue is mapped.

Each user queue requires a number of additional buffers which represent the ring
buffer and any metadata needed by the engine for runtime operation.  On most
engines this consists of the ring buffer itself, an rptr buffer (where the
firmware will shadow the rptr to userspace), a wptr buffer (where the
application will write the wptr for the firmware to fetch it), and a doorbell.

A doorbell is a piece of one of the device's MMIO BARs which can be mapped to
specific user queues.  Writing to the doorbell signals the firmware: it wakes
up, fetches the wptr, and starts processing the packets in the queue.  Each 4K
page of the doorbell BAR supports specific offset ranges for specific engines.
The doorbell of a queue must be mapped into the aperture aligned to the IP used
by the queue (e.g., GFX, VCN, SDMA, etc.).  These doorbell apertures are set up
via NBIO registers.  Doorbells are 32-bit or 64-bit (depending on the engine)
chunks of the doorbell BAR.  A 4K doorbell page provides 512 64-bit doorbells
for up to 512 user queues.  A subset of each page is reserved for each IP type
supported on the device.  The user can query the doorbell ranges for each IP via
the INFO IOCTL.  See the IOCTL Interfaces section for more information.
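
The doorbell arithmetic above (4096 bytes / 8 bytes per doorbell = 512
doorbells per page) can be written out as a small C sketch.  The macro,
structure, and function names are hypothetical, and the per-IP range is modeled
as a half-open interval of doorbell slots; the real ranges are whatever the INFO
IOCTL reports for the device.

```c
#include <assert.h>
#include <stdint.h>

#define DOORBELL_PAGE_SIZE  4096u  /* one 4K doorbell page */
#define DOORBELL_SIZE_64BIT 8u     /* bytes per 64-bit doorbell */
#define DOORBELLS_PER_PAGE  (DOORBELL_PAGE_SIZE / DOORBELL_SIZE_64BIT)

/* Hypothetical per-IP range inside a page, in doorbell slots [start, end). */
struct doorbell_range {
	uint32_t start;
	uint32_t end;
};

/* Byte offset of a 64-bit doorbell slot within its page. */
static uint32_t doorbell_byte_offset(uint32_t slot)
{
	assert(slot < DOORBELLS_PER_PAGE);
	return slot * DOORBELL_SIZE_64BIT;
}

/* Returns 1 if the slot lies in the range reserved for the queue's IP. */
static int doorbell_slot_in_range(const struct doorbell_range *r, uint32_t slot)
{
	return slot >= r->start && slot < r->end;
}
```

A UMD following this model would pick a free slot inside the range for the
queue's IP and pass that index to the kernel when creating the queue.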

When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer.  The application
would map the buffer(s) into its GPUVM and use the GPU virtual addresses of the
areas of memory it wants to use for the user queue.  It would also allocate a
doorbell page for the doorbells used by the user queues.  The application would
then populate the MQD in the USERQ IOCTL structure with the GPU virtual
addresses and doorbell index it wants to use.  The user can also specify the
attributes for the user queue (priority, whether the queue is secure for
protected content, etc.).  The application would then call the USERQ CREATE
IOCTL to create the queue using the specified MQD details in the IOCTL.  The
kernel driver then validates the MQD provided by the application and translates
the MQD into the engine-specific MQD format for the IP.  The IP-specific MQD
would be allocated and the queue would be added to the run list maintained by
the scheduling firmware.  Once the queue has been created, the application can
write packets directly into the queue, update the wptr, and write to the
doorbell offset to kick off work in the user queue.
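
The final step, submitting work once the queue exists, can be sketched in C as
follows.  Everything here is a model under stated assumptions: the structure and
function names are hypothetical, the doorbell is represented by an ordinary
atomic variable rather than an MMIO mapping, the value written to the doorbell
is assumed to be the new wptr, and the check for free ring space is omitted for
brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace view of one user queue; all fields are assumptions. */
struct userq_view {
	uint32_t *ring;              /* CPU mapping of the ring buffer */
	uint32_t size_dw;            /* ring size in dwords, power of two */
	uint64_t wptr;               /* UMD's private copy of the wptr */
	_Atomic uint64_t *wptr_mem;  /* wptr buffer the firmware fetches from */
	_Atomic uint64_t *doorbell;  /* stands in for the mapped doorbell slot */
};

/* Copy one packet into the ring, publish the wptr, then ring the doorbell. */
static void userq_submit(struct userq_view *q, const uint32_t *pkt, size_t ndw)
{
	assert(ndw <= q->size_dw);
	for (size_t i = 0; i < ndw; i++)
		q->ring[(q->wptr + i) & (q->size_dw - 1)] = pkt[i];
	q->wptr += ndw;

	/* Release ordering: packet contents become visible before the wptr. */
	atomic_store_explicit(q->wptr_mem, q->wptr, memory_order_release);
	/* The doorbell write comes last; it is what wakes the firmware. */
	atomic_store_explicit(q->doorbell, q->wptr, memory_order_release);
}
```

The ordering is the point of the sketch: packets first, then the wptr, then the
doorbell, so that the firmware never fetches a wptr that covers packets which
are not yet visible in memory.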

When the application is done with the user queue, it would call the USERQ
FREE IOCTL to destroy it.  The kernel driver would preempt the queue and
remove it from the scheduling firmware's run list.  Then the IP-specific MQD
would be freed and the user queue state would be cleaned up.

Some engines may require the aggregated doorbell too if the engine does not
support doorbells from unmapped queues.  The aggregated doorbell is a special
page of doorbell space which wakes the scheduler.  In cases where the engine may
be oversubscribed, some queues may not be mapped.  If the doorbell is rung when
the queue is not mapped, the engine firmware may miss the request.  Some
scheduling firmware may work around this by polling wptr shadows when the
hardware is oversubscribed; other engines may support doorbell updates from
unmapped queues.  In the event that neither of these options is available, the
kernel driver will map a page of aggregated doorbell space into each GPUVM
space.  The UMD will then update the doorbell and wptr as normal and then write
to the aggregated doorbell as well.
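
A minimal C sketch of that last case is shown here.  The names are
hypothetical, the doorbells are modeled as plain atomic variables, and the value
written to the aggregated doorbell is a placeholder, since the text above does
not specify what the scheduler expects there.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical doorbell pointers for one queue; names are assumptions. */
struct userq_doorbells {
	_Atomic uint64_t *wptr_mem;       /* wptr buffer read by the firmware */
	_Atomic uint64_t *queue_db;       /* per-queue doorbell */
	_Atomic uint64_t *aggregated_db;  /* NULL when the engine needs none */
};

/* Publish a new wptr and notify the firmware; returns doorbell writes made. */
static int userq_kick(struct userq_doorbells *d, uint64_t wptr)
{
	int writes = 0;

	assert(d->wptr_mem != NULL && d->queue_db != NULL);
	atomic_store_explicit(d->wptr_mem, wptr, memory_order_release);
	atomic_store_explicit(d->queue_db, wptr, memory_order_release);
	writes++;

	/* Also wake the scheduler in case this queue is currently unmapped. */
	if (d->aggregated_db != NULL) {
		atomic_store_explicit(d->aggregated_db, wptr,
				      memory_order_release);
		writes++;
	}
	return writes;
}
```

On engines that poll wptr shadows or accept doorbells from unmapped queues, the
``aggregated_db`` pointer in this model would simply stay ``NULL``.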

Special Packets
---------------

In order to support legacy implicit synchronization, as well as mixed user and
kernel queues, we need a synchronization mechanism that is secure.  Because
kernel queues or memory management tasks depend on kernel fences, we need a way
for user queues to update memory that the kernel can use for a fence and that a
bad actor cannot tamper with.  To support this, we've added a protected fence
packet.  This packet works by writing a monotonically increasing value to
a memory location that only privileged clients have write access to.  User
queues only have read access.  When this packet is executed, the memory location
is updated and other queues (kernel or user) can see the results.  The
user application would submit this packet in its command stream.  The actual
packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the
behavior is the same.  The packet submission is handled in userspace.  The
kernel driver sets up the privileged memory used for each user queue when the
application creates the queue.
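
Because the value only ever increases, a reader can decide whether a given
point has been reached with a single comparison.  The C sketch below shows the
reader side only; the function name is hypothetical, and it assumes a 64-bit
value that does not wrap in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reader-side check.  fence_mem points at the protected
 * location, which user queues can read but not write.
 */
static bool fence_signaled(const volatile uint64_t *fence_mem,
			   uint64_t wait_value)
{
	assert(fence_mem != NULL);
	/* The value only increases, so ">=" means the point was reached. */
	return *fence_mem >= wait_value;
}
```

A waiter that needs the work associated with value N would treat it as complete
once ``fence_signaled(mem, N)`` returns true, and it stays true afterwards.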

Memory Management
=================

It is assumed that all buffers mapped into the GPUVM space for the process are
valid when engines on the GPU are running.  The kernel driver will only allow
user queues to run when all buffers are mapped.  If there is a memory event that
requires buffer migration, the kernel driver will preempt the user queues,
migrate buffers to where they need to be, update the GPUVM page tables and
invalidate the TLB, and then resume the user queues.

Interaction with Kernel Queues
==============================

Depending on the IP and the scheduling firmware, you can enable kernel queues
and user queues at the same time; however, you are limited by the number of HQD
slots.  Kernel queues are always mapped so any work that goes into kernel queues
will take priority.  This limits the available HQD slots for user queues.

Not all IPs will support user queues on all GPUs.  As such, UMDs will need to
support both user queues and kernel queues depending on the IP.  For example, a
GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
and VPE.  The kernel driver provides a way to determine if user queues and
kernel queues are supported on a per-IP basis.  UMDs can query this information
via the INFO IOCTL and determine whether to use kernel queues or user queues for
each IP.
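
The per-IP decision a UMD has to make can be sketched in C as below.  The
structure, enum, and function are hypothetical and stand in for whatever the
UMD caches after the INFO query; preferring user queues whenever they are
available is a policy assumption of this sketch, not a driver requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bits as a UMD might cache them after querying. */
struct ip_queue_caps {
	bool kernel_queues;
	bool user_queues;
};

enum queue_choice { QUEUE_NONE, QUEUE_KERNEL, QUEUE_USER };

/* Prefer user queues when the IP supports them, else fall back. */
static enum queue_choice choose_queue_type(const struct ip_queue_caps *caps)
{
	assert(caps != NULL);
	if (caps->user_queues)
		return QUEUE_USER;
	if (caps->kernel_queues)
		return QUEUE_KERNEL;
	return QUEUE_NONE;
}
```

With the example above, such a UMD would end up with user queues for GFX,
compute, and SDMA, and kernel queues for VCN, JPEG, and VPE.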

Queue Resets
============

For most engines, queues can be reset individually.  This applies to GFX,
compute, and SDMA queues.  When a hung queue is detected, it can be reset either
via the scheduling firmware or MMIO.  Since there are no kernel fences for most
user queues, hangs will usually only be detected when some other event happens;
e.g., a memory event which requires migration of buffers.  When the queues are
preempted, if a queue is hung, the preemption will fail.  The driver will then
look up the queues that failed to preempt, reset them, and record which queues
are hung.

On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
status.  The UMD will provide the queue id in the IOCTL and the kernel driver
will check if it has already recorded the queue as hung (e.g., due to failed
preemption) and report back the status.
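
The detect, reset, and record flow can be modeled in a few lines of C.  This is
a toy model with hypothetical names: the ``responds_to_preempt`` flag stands in
for the hardware's behavior and the reset itself is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue bookkeeping; not the driver's data structure. */
struct userq_state {
	bool responds_to_preempt;  /* stand-in for the hardware's behavior */
	bool hung;                 /* recorded by the driver after a failure */
	unsigned int resets;
};

/* Preempt every queue; reset and record the ones that fail to preempt. */
static size_t preempt_all(struct userq_state *queues, size_t count)
{
	size_t failed = 0;

	assert(queues != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (queues[i].responds_to_preempt)
			continue;
		queues[i].hung = true;  /* remembered for later status queries */
		queues[i].resets++;     /* per-queue reset, firmware or MMIO */
		failed++;
	}
	return failed;
}

/* Model of QUERY_STATUS: report what the driver recorded earlier. */
static bool query_hung(const struct userq_state *queues, size_t id)
{
	return queues[id].hung;
}
```

Note that the status query does no probing of its own in this model; it only
reports what an earlier preemption attempt already recorded.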

IOCTL Interfaces
================

GPU virtual addresses used for queues and related data (rptrs, wptrs, context
save areas, etc.) should be validated by the kernel mode driver to prevent the
user from specifying invalid GPU virtual addresses.  If the user provides
invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
error.  These buffers should also be tracked in the kernel driver so that if the
user attempts to unmap the buffer(s) from the GPUVM, the unmap call would return
an error.
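
A minimal sketch of such an address check, written as standalone C rather than
kernel code: the mapping record and function name are hypothetical, and the
sketch assumes a flat array of mappings whose ``start + size`` does not
overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one mapping in the process's GPUVM. */
struct va_mapping {
	uint64_t start;
	uint64_t size;
};

/* Returns 0 if [va, va + len) lies inside one mapping, else -EINVAL. */
static int validate_va(const struct va_mapping *maps, size_t count,
		       uint64_t va, uint64_t len)
{
	assert(maps != NULL || count == 0);
	if (len == 0 || va + len < va)  /* empty range or address overflow */
		return -EINVAL;
	for (size_t i = 0; i < count; i++) {
		uint64_t end = maps[i].start + maps[i].size;

		if (va >= maps[i].start && va + len <= end)
			return 0;
	}
	return -EINVAL;
}
```

The same mapping records could also serve the tracking requirement: an unmap
request that hits a range still referenced by a queue would be refused.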

INFO
----
There are several new INFO queries related to user queues in order to query the
size of user queue metadata needed for a user queue (e.g., context save areas
or shadow buffers), whether kernel or user queues or both are supported
for each IP type, and the offsets for each IP type in each doorbell page.

USERQ
-----
The USERQ IOCTL is used for creating, freeing, and querying the status of user
queues.  It supports 3 opcodes:

1. CREATE - Create a user queue.  The application provides an MQD-like structure
   that defines the type of queue and associated metadata and flags for that
   queue type.  Returns the queue id.
2. FREE - Free a user queue.
3. QUERY_STATUS - Query the status of a queue.  Used to check if the queue is
   healthy or not.  E.g., if the queue has been reset. (WIP)

USERQ_SIGNAL
------------
The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.

USERQ_WAIT
----------
The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.

Kernel and User Queues
======================

In order to properly validate and test performance, we have a driver option to
select what type of queues are enabled (kernel queues, user queues or both).
The user_queue driver parameter allows you to enable kernel queues only (0),
user queues and kernel queues (1), and user queues only (2).  Enabling user
queues only will free up static queue assignments that would otherwise be used
by kernel queues for use by the scheduling firmware.  Some kernel queues are
required for kernel driver operation and they will always be created.  When the
kernel queues are not enabled, they are not registered with the DRM scheduler
and the CS IOCTL will reject any incoming command submissions which target those
queue types.  The kernel-queues-only mode mirrors the behavior on all existing
GPUs.  Enabling both queue types allows for backwards compatibility with old
userspace while still supporting user queues.
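
The three values can be summarized as a small decode function.  This is an
illustrative C sketch with hypothetical names, not the driver's parameter
handling; ``kernel_queues`` here means kernel queues that accept command
submissions, since some internal kernel queues are always created.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what a user_queue value enables. */
struct queue_modes {
	bool kernel_queues;  /* kernel queues accept CS IOCTL submissions */
	bool user_queues;    /* user queues can be created */
};

/* Decode the documented values: 0 kernel only, 1 both, 2 user only. */
static struct queue_modes decode_user_queue_param(int value)
{
	struct queue_modes m = { false, false };

	assert(value >= 0 && value <= 2);
	m.kernel_queues = (value == 0 || value == 1);
	m.user_queues = (value == 1 || value == 2);
	return m;
}
```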