xref: /linux/Documentation/gpu/amdgpu/userq.rst (revision 58809f614e0e3f4e12b489bddf680bfeb31c0a20)
==================
 User Mode Queues
==================

Introduction
============

Similar to the KFD, GPU engine queues move into userspace.  The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work.  This reduces overhead and also allows
the GPU to submit work to itself.  Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.

UMDs directly interface with firmware via per-application shared memory areas.
The main vehicle for this is the queue.  A queue is a ring buffer with a read
pointer (rptr) and a write pointer (wptr).  The UMD writes IP specific packets
into the queue and the firmware processes those packets, kicking off work on the
GPU engines.  The CPU in the application (or another queue or device) updates
the wptr to tell the firmware how far into the ring buffer to process packets,
and the rptr provides feedback to the UMD on how far the firmware has progressed
in executing those packets.  When the wptr and the rptr are equal, the queue is
idle.

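The rptr/wptr protocol above can be sketched as a small conceptual model (this
is illustrative only, not driver code; the class and method names are made up):

```python
class Ring:
    """Toy model of a user queue ring buffer with rptr/wptr semantics."""

    def __init__(self, size):
        self.buf = [None] * size
        self.rptr = 0  # advanced by the firmware as packets are executed
        self.wptr = 0  # advanced by the UMD as packets are written

    def idle(self):
        # The queue is idle when the firmware has caught up to the UMD.
        return self.rptr == self.wptr

    def submit(self, packet):
        # UMD side: write an IP specific packet, then bump the wptr.
        self.buf[self.wptr % len(self.buf)] = packet
        self.wptr += 1

    def process(self):
        # Firmware side: execute packets up to the current wptr,
        # reporting progress back through the rptr.
        while self.rptr < self.wptr:
            _ = self.buf[self.rptr % len(self.buf)]
            self.rptr += 1

ring = Ring(256)
ring.submit("PACKET")
assert not ring.idle()   # work pending: wptr is ahead of rptr
ring.process()
assert ring.idle()       # firmware caught up: rptr == wptr
```
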
Theory of Operation
===================

The various engines on modern AMD GPUs support multiple queues per engine, with
scheduling firmware which handles dynamically scheduling user queues on the
available hardware queue slots.  When the number of user queues exceeds the
number of available hardware queue slots, the scheduling firmware dynamically
maps and unmaps queues based on priority and time quanta.  The state of each
user queue is managed in the kernel driver in an MQD (Memory Queue Descriptor).
This is a buffer in GPU accessible memory that stores the state of a user
queue.  The scheduling firmware uses the MQD to load the queue state into an
HQD (Hardware Queue Descriptor) when a user queue is mapped.  Each user queue
requires a number of additional buffers which represent the ring buffer and any
metadata needed by the engine for runtime operation.  On most engines this
consists of the ring buffer itself, a rptr buffer (where the firmware will
shadow the rptr to userspace), a wptr buffer (where the application will write
the wptr for the firmware to fetch), and a doorbell.  A doorbell is a piece of
one of the device's MMIO BARs which can be mapped to specific user queues.
When the application writes to the doorbell, it signals the firmware to take
some action: writing to the doorbell wakes the firmware and causes it to fetch
the wptr and start processing the packets in the queue.  Each 4K page of the
doorbell BAR supports specific offset ranges for specific engines.  The doorbell
of a queue must be mapped into the aperture aligned to the IP used by the queue
(e.g., GFX, VCN, SDMA, etc.).  These doorbell apertures are set up via NBIO
registers.  Doorbells are 32-bit or 64-bit (depending on the engine) chunks of
the doorbell BAR.  A 4K doorbell page provides 512 64-bit doorbells for up to
512 user queues.  A subset of each page is reserved for each IP type supported
on the device.  The user can query the doorbell ranges for each IP via the INFO
IOCTL.  See the IOCTL Interfaces section for more information.

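The 4K-page / 64-bit doorbell arithmetic above works out as follows (a small
illustrative sketch; the helper name is hypothetical):

```python
DOORBELL_PAGE_SIZE = 4096   # bytes per doorbell page
DOORBELL_SIZE_64 = 8        # a 64-bit doorbell occupies 8 bytes

# As the text states, 512 64-bit doorbells fit in one 4K page.
DOORBELLS_PER_PAGE = DOORBELL_PAGE_SIZE // DOORBELL_SIZE_64
assert DOORBELLS_PER_PAGE == 512

def doorbell_byte_offset(index):
    """Byte offset of a 64-bit doorbell within its 4K page."""
    if not 0 <= index < DOORBELLS_PER_PAGE:
        raise ValueError("doorbell index out of range for this page")
    return index * DOORBELL_SIZE_64
```
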
When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer.  The application
maps the buffer(s) into its GPUVM and uses the GPU virtual addresses for the
areas of memory it wants to use for the user queue.  It also allocates a
doorbell page for the doorbells used by the user queues.  The application then
populates the MQD in the USERQ IOCTL structure with the GPU virtual addresses
and doorbell index it wants to use.  The user can also specify the attributes
for the user queue (priority, whether the queue is secure for protected
content, etc.).  The application then calls the USERQ CREATE IOCTL to create
the queue using the specified MQD details in the IOCTL.  The kernel driver
validates the MQD provided by the application and translates the MQD into the
engine specific MQD format for the IP.  The IP specific MQD is allocated and
the queue is added to the run list maintained by the scheduling firmware.  Once
the queue has been created, the application can write packets directly into the
queue, update the wptr, and write to the doorbell offset to kick off work in
the user queue.

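The create path described above can be summarized with a hypothetical sketch.
The structure and field names below are illustrative only and do not reflect
the actual amdgpu uAPI:

```python
import itertools

_queue_ids = itertools.count()  # stand-in for driver-side queue id allocation

class MqdRequest:
    """Hypothetical stand-in for the MQD details the UMD passes to CREATE."""
    def __init__(self, ring_va, rptr_va, wptr_va, doorbell_index, priority=0):
        self.ring_va = ring_va
        self.rptr_va = rptr_va
        self.wptr_va = wptr_va
        self.doorbell_index = doorbell_index
        self.priority = priority

def userq_create(mqd, mapped_vas, run_list):
    """Toy model of the kernel side of USERQ CREATE: validate the MQD,
    translate it into an (abstracted) IP specific MQD, add it to the
    scheduling firmware's run list, and return a queue id."""
    # Validate that every GPU VA the user supplied is actually mapped.
    for va in (mqd.ring_va, mqd.rptr_va, mqd.wptr_va):
        if va not in mapped_vas:
            raise ValueError("invalid GPU virtual address")
    ip_mqd = {"ring": mqd.ring_va, "rptr": mqd.rptr_va,
              "wptr": mqd.wptr_va, "db": mqd.doorbell_index,
              "prio": mqd.priority}
    run_list.append(ip_mqd)
    return next(_queue_ids)  # queue id returned to the application
```
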
When the application is done with the user queue, it calls the USERQ
FREE IOCTL to destroy it.  The kernel driver preempts the queue and
removes it from the scheduling firmware's run list.  Then the IP specific MQD
is freed and the user queue state is cleaned up.

Some engines may also require the aggregated doorbell if the engine does not
support doorbells from unmapped queues.  The aggregated doorbell is a special
page of doorbell space which wakes the scheduler.  In cases where the engine may
be oversubscribed, some queues may not be mapped.  If the doorbell is rung when
the queue is not mapped, the engine firmware may miss the request.  Some
scheduling firmware may work around this by polling wptr shadows when the
hardware is oversubscribed; other engines may support doorbell updates from
unmapped queues.  When neither of these options is available, the kernel
driver will map a page of aggregated doorbell space into each GPUVM space.
The UMD will then update the doorbell and wptr as normal and then write to the
aggregated doorbell as well.

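The UMD-side submit path with the aggregated doorbell fallback can be sketched
as follows (an illustrative model with made-up names, not real UMD code):

```python
class Doorbell:
    """Toy stand-in for a doorbell MMIO location; records writes."""
    def __init__(self):
        self.last = None
        self.rings = 0
    def write(self, value):
        self.last = value
        self.rings += 1

def umd_submit(wptr, doorbell, agg_doorbell=None):
    """Toy UMD submit path: ring the per-queue doorbell with the new wptr
    and, when the engine cannot take doorbells from unmapped queues, also
    ring the aggregated doorbell to wake the scheduler."""
    doorbell.write(wptr)
    if agg_doorbell is not None:
        agg_doorbell.write(wptr)
```
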
Special Packets
---------------

In order to support legacy implicit synchronization, as well as mixed user and
kernel queues, we need a synchronization mechanism that is secure.  Because
kernel queues or memory management tasks depend on kernel fences, we need a way
for user queues to update memory that the kernel can use for a fence, that can't
be messed with by a bad actor.  To support this, we've added a protected fence
packet.  This packet works by writing a monotonically increasing value to
a memory location that only privileged clients have write access to; user
queues only have read access.  When this packet is executed, the memory location
is updated and other queues (kernel or user) can see the results.  The
user application submits this packet in its command stream.  The actual
packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the
behavior is the same.  The packet submission is handled in userspace.  The
kernel driver sets up the privileged memory used for each user queue when the
application creates the queue.


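The behavior of the protected fence location can be modeled abstractly (this is
a conceptual model of the semantics, not the packet format or driver code):

```python
class ProtectedFence:
    """Toy model of a protected fence location: only the privileged path
    (the engine executing the protected fence packet) may advance the
    value; everyone else may only read it."""

    def __init__(self):
        self._value = 0

    def signal(self, seq):
        # Privileged path: values must increase monotonically.
        if seq <= self._value:
            raise ValueError("fence values must be monotonically increasing")
        self._value = seq

    def is_signaled(self, seq):
        # Any reader (kernel or another queue) can test completion.
        return self._value >= seq
```
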
Memory Management
=================

It is assumed that all buffers mapped into the GPUVM space for the process are
valid when engines on the GPU are running.  The kernel driver will only allow
user queues to run when all buffers are mapped.  If there is a memory event that
requires buffer migration, the kernel driver will preempt the user queues,
migrate buffers to where they need to be, update the GPUVM page tables and
invalidate the TLB, and then resume the user queues.

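The preempt/migrate/resume ordering described above can be sketched as a toy
sequence (illustrative only; the callbacks stand in for driver internals):

```python
class ToyQueue:
    """Stand-in for a user queue that can be preempted and resumed."""
    def __init__(self):
        self.running = True
        self.log = []
    def preempt(self):
        self.running = False
        self.log.append("preempt")
    def resume(self):
        self.running = True
        self.log.append("resume")

def handle_memory_event(queues, migrate, update_page_tables, invalidate_tlb):
    """Toy model of the migration sequence: user queues never run while
    any of their buffers are being moved."""
    for q in queues:
        q.preempt()          # stop the queues on the hardware
    migrate()                # move buffers to where they need to be
    update_page_tables()     # reflect the new locations in the GPUVM
    invalidate_tlb()         # drop stale translations
    for q in queues:
        q.resume()           # queues run again with valid mappings
```
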
Interaction with Kernel Queues
==============================

Depending on the IP and the scheduling firmware, kernel queues and user queues
can be enabled at the same time; however, both share the available HQD slots.
Kernel queues are always mapped, so any work that goes into kernel queues will
take priority.  This limits the available HQD slots for user queues.

Not all IPs will support user queues on all GPUs.  As such, UMDs will need to
support both user queues and kernel queues depending on the IP.  For example, a
GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
and VPE.  UMDs need to support both.  The kernel driver provides a way to
determine if user queues and kernel queues are supported on a per IP basis.
UMDs can query this information via the INFO IOCTL and determine whether to use
kernel queues or user queues for each IP.

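A UMD's per-IP decision might look like the following sketch (illustrative
only; `caps` stands in for data a UMD would derive from the INFO IOCTL):

```python
def choose_queue_type(ip, caps):
    """Toy UMD decision: prefer user queues when the kernel reports
    support for this IP, otherwise fall back to kernel queues."""
    supported = caps.get(ip, set())
    if "user" in supported:
        return "user"
    if "kernel" in supported:
        return "kernel"
    raise ValueError("no queue support reported for " + ip)

# Example capability table matching the scenario in the text above.
caps = {
    "GFX": {"user", "kernel"},
    "SDMA": {"user", "kernel"},
    "VCN": {"kernel"},
}
```
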
Queue Resets
============

For most engines, queues can be reset individually; GFX, compute, and SDMA
queues all support this.  When a hung queue is detected, it can be reset
either via the scheduling firmware or MMIO.  Since there are no kernel fences
for most user queues, hangs will usually only be detected when some other
event happens; e.g., a memory event which requires migration of buffers.  When
the queues are preempted, the preemption will fail for any queue that is hung.
The driver will then look up the queues that failed to preempt, reset them,
and record which queues are hung.

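The hang-detection-via-preemption flow can be sketched as follows (a toy model
with made-up names; real detection and reset happen in the driver):

```python
class UserQueue:
    """Stand-in for a user queue whose preemption can fail when hung."""
    def __init__(self, qid, hung=False):
        self.qid = qid
        self.hung = hung
        self.was_reset = False
    def preempt(self):
        return not self.hung       # a hung queue fails to preempt
    def reset(self):
        self.was_reset = True
        self.hung = False

def preempt_all(queues, hung_record):
    """Toy model: queues whose preemption fails are reset and recorded
    so a later QUERY_STATUS-style check can report them as hung."""
    for q in queues:
        if not q.preempt():
            q.reset()
            hung_record.add(q.qid)
```
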
On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
status.  The UMD will provide the queue id in the IOCTL and the kernel driver
will check if it has already recorded the queue as hung (e.g., due to failed
preemption) and report back the status.

IOCTL Interfaces
================

GPU virtual addresses used for queues and related data (rptrs, wptrs, context
save areas, etc.) should be validated by the kernel mode driver to prevent the
user from specifying invalid GPU virtual addresses.  If the user provides
invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
error.  These buffers should also be tracked in the kernel driver so that if
the user attempts to unmap the buffer(s) from the GPUVM, the unmap call would
return an error.

INFO
----
There are several new INFO queries related to user queues in order to query the
size of user queue metadata needed for a user queue (e.g., context save areas
or shadow buffers), whether kernel or user queues or both are supported
for each IP type, and the offsets for each IP type in each doorbell page.

USERQ
-----
The USERQ IOCTL is used for creating, freeing, and querying the status of user
queues.  It supports 3 opcodes:

1. CREATE - Create a user queue.  The application provides an MQD-like structure
   that defines the type of queue and associated metadata and flags for that
   queue type.  Returns the queue id.
2. FREE - Free a user queue.
3. QUERY_STATUS - Query the status of a queue.  Used to check if the queue is
   healthy or not.  E.g., if the queue has been reset. (WIP)

USERQ_SIGNAL
------------
The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.

USERQ_WAIT
----------
The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.

Kernel and User Queues
======================

In order to properly validate and test performance, we have a driver option to
select what type of queues are enabled (kernel queues, user queues, or both).
The user_queue driver parameter allows you to enable kernel queues only (0),
user queues and kernel queues (1), or user queues only (2).  Enabling user
queues only frees up static queue assignments that would otherwise be used
by kernel queues for use by the scheduling firmware.  Some kernel queues are
required for kernel driver operation and they will always be created.  When the
kernel queues are not enabled, they are not registered with the drm scheduler
and the CS IOCTL will reject any incoming command submissions which target those
queue types.  Kernel queues only (0) mirrors the behavior of all existing GPUs.
Enabling both queue types allows for backwards compatibility with old userspace
while still supporting user queues.
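
The three user_queue modes can be summarized with a small model (illustrative
only; the real checks live in the amdgpu driver):

```python
def queue_modes(user_queue):
    """Map the user_queue module parameter to the enabled queue types.
    0 = kernel queues only, 1 = both, 2 = user queues only."""
    if user_queue == 0:
        return {"kernel"}
    if user_queue == 1:
        return {"kernel", "user"}
    if user_queue == 2:
        return {"user"}
    raise ValueError("user_queue must be 0, 1, or 2")

def cs_ioctl_allowed(user_queue):
    # The CS IOCTL only accepts submissions when kernel queues are enabled.
    return "kernel" in queue_modes(user_queue)
```
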