==================
 User Mode Queues
==================

Introduction
============

Similar to the KFD, GPU engine queues move into userspace. The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work. This reduces overhead and also allows
the GPU to submit work to itself. Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.

UMDs directly interface with firmware via per-application shared memory areas.
The main vehicle for this is the queue. A queue is a ring buffer with a read
pointer (rptr) and a write pointer (wptr). The UMD writes IP specific packets
into the queue and the firmware processes those packets, kicking off work on
the GPU engines. The application (or another queue or device) updates the wptr
to tell the firmware how far into the ring buffer to process packets, and the
rptr provides feedback to the UMD on how far the firmware has progressed in
executing those packets. When the wptr and the rptr are equal, the queue is
idle.

Theory of Operation
===================

The various engines on modern AMD GPUs support multiple queues per engine, with
scheduling firmware that handles dynamically scheduling user queues on the
available hardware queue slots. When the number of user queues exceeds the
number of available hardware queue slots, the scheduling firmware dynamically
maps and unmaps queues based on priority and time quanta. The state of each
user queue is managed in the kernel driver in an MQD (Memory Queue Descriptor).
This is a buffer in GPU accessible memory that stores the state of a user
queue. The scheduling firmware uses the MQD to load the queue state into an
HQD (Hardware Queue Descriptor) when a user queue is mapped. Each user queue
requires a number of additional buffers which represent the ring buffer and any
metadata needed by the engine for runtime operation. On most engines this
consists of the ring buffer itself, a rptr buffer (where the firmware will
shadow the rptr to userspace), a wptr buffer (where the application will write
the wptr for the firmware to fetch), and a doorbell. A doorbell is a piece of
one of the device's MMIO BARs which can be mapped to specific user queues.
When the application writes to the doorbell, it signals the firmware to take
action: the write wakes the firmware and causes it to fetch the wptr and start
processing the packets in the queue.
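
As a rough illustration of the flow above, the sketch below shows a UMD-side
submission: packets are copied into the ring, the wptr shadow is updated, and
the doorbell is rung. All names here are hypothetical and the packet contents
are IP specific; the mappings are assumed to have been set up at queue
creation time, and the ring size is assumed to be a power of two.

.. code-block:: c

   #include <stdint.h>

   static void example_submit(uint32_t *ring, uint64_t ring_size_dw,
                              volatile uint64_t *wptr_shadow,
                              volatile uint64_t *doorbell,
                              const uint32_t *pkt, uint32_t ndw,
                              uint64_t *wptr)
   {
           uint32_t i;

           /* copy the IP specific packets into the ring buffer */
           for (i = 0; i < ndw; i++)
                   ring[(*wptr + i) & (ring_size_dw - 1)] = pkt[i];
           *wptr += ndw;

           /* make the packet writes visible before publishing the wptr */
           __atomic_thread_fence(__ATOMIC_RELEASE);

           *wptr_shadow = *wptr;   /* the firmware fetches the wptr from here */
           *doorbell = *wptr;      /* ring the doorbell to wake the firmware */
   }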

Each 4K page of the doorbell BAR supports specific offset ranges for specific
engines. The doorbell of a queue must be mapped into the aperture aligned to
the IP used by the queue (e.g., GFX, VCN, SDMA, etc.). These doorbell apertures
are set up via NBIO registers. Doorbells are 32-bit or 64-bit (depending on the
engine) chunks of the doorbell BAR. A 4K doorbell page provides 512 64-bit
doorbells for up to 512 user queues. A subset of each page is reserved for
each IP type supported on the device. The user can query the doorbell ranges
for each IP via the INFO IOCTL. See the IOCTL Interfaces section for more
information.

When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer. The application
would map the buffer(s) into its GPUVM and use the GPU virtual addresses for
the areas of memory it wants to use for the user queue. It would also allocate
a doorbell page for the doorbells used by the user queues. The application
would then populate the MQD in the USERQ IOCTL structure with the GPU virtual
addresses and doorbell index it wants to use. The user can also specify the
attributes for the user queue (priority, whether the queue is secure for
protected content, etc.). The application would then call the USERQ CREATE
IOCTL to create the queue using the specified MQD details in the IOCTL. The
kernel driver then validates the MQD provided by the application and translates
the MQD into the engine specific MQD format for the IP. The IP specific MQD
would be allocated and the queue would be added to the run list maintained by
the scheduling firmware. Once the queue has been created, the application can
write packets directly into the queue, update the wptr, and write to the
doorbell offset to kick off work in the user queue.

When the application is done with the user queue, it would call the USERQ FREE
IOCTL to destroy it. The kernel driver would preempt the queue and remove it
from the scheduling firmware's run list. Then the IP specific MQD would be
freed and the user queue state would be cleaned up.
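
A minimal sketch of this lifecycle is shown below. The USERQ IOCTL and its
CREATE/FREE opcodes are described in the IOCTL Interfaces section; the
structure layout, field names, and opcode values used here are illustrative
stand-ins rather than the real uapi (see include/uapi/drm/amdgpu_drm.h for the
actual definitions).

.. code-block:: c

   #include <stdint.h>
   #include <sys/ioctl.h>

   /* illustrative stand-ins, not the real amdgpu uapi */
   #define EXAMPLE_USERQ_OP_CREATE 1
   #define EXAMPLE_USERQ_OP_FREE   2

   struct example_userq_req {
           uint32_t op;             /* CREATE, FREE, or QUERY_STATUS */
           uint32_t ip_type;        /* GFX, compute, SDMA, etc. */
           uint32_t doorbell_index; /* doorbell slot used by this queue */
           uint32_t flags;          /* priority, secure queue, etc. */
           uint64_t queue_va;       /* GPU VA of the ring buffer */
           uint64_t queue_size;     /* ring buffer size in bytes */
           uint64_t rptr_va;        /* GPU VA the firmware shadows the rptr to */
           uint64_t wptr_va;        /* GPU VA the firmware fetches the wptr from */
           uint32_t queue_id;       /* returned by CREATE, consumed by FREE */
   };

   /* userq_ioctl stands in for the real USERQ IOCTL request number */
   static int example_userq_create(int drm_fd, unsigned long userq_ioctl,
                                   struct example_userq_req *req)
   {
           req->op = EXAMPLE_USERQ_OP_CREATE;
           /* the kernel validates the MQD contents before creating the queue */
           if (ioctl(drm_fd, userq_ioctl, req))
                   return -1;
           /* on success, req->queue_id holds the new queue id */
           return 0;
   }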

Some engines may also require an aggregated doorbell if the engine does not
support doorbells from unmapped queues. The aggregated doorbell is a special
page of doorbell space which wakes the scheduler. In cases where the engine may
be oversubscribed, some queues may not be mapped. If the doorbell is rung when
the queue is not mapped, the engine firmware may miss the request. Some
scheduling firmware may work around this by polling wptr shadows when the
hardware is oversubscribed; other engines may support doorbell updates from
unmapped queues. If neither of these options is available, the kernel driver
will map a page of aggregated doorbell space into each GPUVM space. The UMD
will then update the doorbell and wptr as normal and then write to the
aggregated doorbell as well.

Special Packets
---------------

In order to support legacy implicit synchronization, as well as mixed user and
kernel queues, we need a synchronization mechanism that is secure. Because
kernel queues and memory management tasks depend on kernel fences, we need a
way for user queues to update memory that the kernel can use for a fence and
that can't be tampered with by a bad actor. To support this, we've added a
protected fence packet. This packet works by writing a monotonically increasing
value to a memory location that only privileged clients have write access to;
user queues only have read access. When this packet is executed, the memory
location is updated and other queues (kernel or user) can see the results. The
user application submits this packet in its command stream. The actual packet
format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the behavior is
the same. The packet submission is handled in userspace. The kernel driver
sets up the privileged memory used by each user queue when the application
creates the queue.
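
As a rough sketch of how such a fence might be consumed (the mapping and the
64-bit fence width are illustrative assumptions):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /*
    * The fence location is written only by the privileged fence packet;
    * user mappings are read-only, so the value cannot be forged. Since
    * the value increases monotonically, >= is a sufficient check.
    */
   static bool example_fence_reached(const volatile uint64_t *fence_addr,
                                     uint64_t seqno)
   {
           return *fence_addr >= seqno;
   }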

Memory Management
=================

It is assumed that all buffers mapped into the GPUVM space for the process are
valid when engines on the GPU are running. The kernel driver will only allow
user queues to run when all buffers are mapped. If there is a memory event that
requires buffer migration, the kernel driver will preempt the user queues,
migrate buffers to where they need to be, update the GPUVM page tables and
invalidate the TLB, and then resume the user queues.

Interaction with Kernel Queues
==============================

Depending on the IP and the scheduling firmware, kernel queues and user queues
can be enabled at the same time; however, the number of available HQD slots is
a limiting factor. Kernel queues are always mapped, so any work that goes into
kernel queues will take priority. This limits the available HQD slots for user
queues.

Not all IPs will support user queues on all GPUs. As such, UMDs will need to
support both user queues and kernel queues depending on the IP. For example, a
GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
and VPE. The kernel driver provides a way to determine if user queues and
kernel queues are supported on a per IP basis. UMDs can query this information
via the INFO IOCTL and determine whether to use kernel queues or user queues
for each IP.

Queue Resets
============

For most engines, queues can be reset individually; GFX, compute, and SDMA
queues all support per-queue reset. When a hung queue is detected, it can be
reset either via the scheduling firmware or MMIO. Since there are no kernel
fences for most user queues, a hang will usually only be detected when some
other event happens; e.g., a memory event which requires migration of buffers.
When the queues are preempted, if a queue is hung, the preemption will fail.
The driver will then look up the queues that failed to preempt, reset them,
and record which queues are hung.

On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
status. The UMD will provide the queue id in the IOCTL and the kernel driver
will check if it has already recorded the queue as hung (e.g., due to failed
preemption) and report back the status.

IOCTL Interfaces
================

GPU virtual addresses used for queues and related data (rptrs, wptrs, context
save areas, etc.) should be validated by the kernel mode driver to prevent the
user from specifying invalid GPU virtual addresses. If the user provides
invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
error. These buffers should also be tracked in the kernel driver so that if
the user attempts to unmap the buffer(s) from the GPUVM, the unmap call would
return an error.

INFO
----

There are several new INFO queries related to user queues in order to query the
size of user queue metadata needed for a user queue (e.g., context save areas
or shadow buffers), whether kernel or user queues or both are supported for
each IP type, and the offsets for each IP type in each doorbell page.
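
For illustration, such a query might look like the sketch below.
DRM_IOCTL_AMDGPU_INFO and the return_pointer/return_size/query fields are the
existing amdgpu INFO plumbing, but the user queue query id and the layout of
the result buffer are placeholders here; consult include/uapi/drm/amdgpu_drm.h
for the actual values.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <drm/amdgpu_drm.h>

   /* query_id stands in for one of the new user queue INFO queries */
   static int example_query_userq_info(int drm_fd, uint32_t query_id,
                                       void *out, uint32_t out_size)
   {
           struct drm_amdgpu_info request;

           memset(&request, 0, sizeof(request));
           request.return_pointer = (uintptr_t)out; /* kernel writes the result here */
           request.return_size = out_size;
           request.query = query_id;

           return ioctl(drm_fd, DRM_IOCTL_AMDGPU_INFO, &request);
   }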

USERQ
-----

The USERQ IOCTL is used for creating, freeing, and querying the status of user
queues. It supports 3 opcodes:

1. CREATE - Create a user queue. The application provides an MQD-like
   structure that defines the type of queue and associated metadata and flags
   for that queue type. Returns the queue id.
2. FREE - Free a user queue.
3. QUERY_STATUS - Query the status of a queue. Used to check if the queue is
   healthy or not, e.g., if the queue has been reset. (WIP)

USERQ_SIGNAL
------------

The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be
signaled.

USERQ_WAIT
----------

The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited
on.

Kernel and User Queues
======================

In order to properly validate and test performance, we have a driver option to
select what type of queues are enabled (kernel queues, user queues, or both).
The user_queue driver parameter allows you to enable kernel queues only (0),
user queues and kernel queues (1), or user queues only (2). Enabling user
queues only will free up static queue assignments that would otherwise be used
by kernel queues for use by the scheduling firmware. Some kernel queues are
required for kernel driver operation and they will always be created. When
kernel queues are not enabled, they are not registered with the drm scheduler
and the CS IOCTL will reject any incoming command submissions which target
those queue types. Kernel queues only (0) mirrors the behavior of all existing
GPUs. Enabling both queue types allows for backwards compatibility with old
userspace while still supporting user queues.
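
For example, to bring the driver up with only user queues enabled, the
parameter can be set at module load time (or as amdgpu.user_queue=2 on the
kernel command line)::

  modprobe amdgpu user_queue=2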