==================
 User Mode Queues
==================

Introduction
============

Similar to the KFD, GPU engine queues move into userspace. The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work. This reduces overhead and also allows
the GPU to submit work to itself. Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.

UMDs directly interface with firmware via per-application shared memory areas.
The main vehicle for this is the queue. A queue is a ring buffer with a read
pointer (rptr) and a write pointer (wptr). The UMD writes IP specific packets
into the queue and the firmware processes those packets, kicking off work on
the GPU engines. The application (or another queue or device) updates the wptr
to tell the firmware how far into the ring buffer to process packets, and the
rptr provides feedback to the UMD on how far the firmware has progressed in
executing those packets. When the wptr and the rptr are equal, the queue is
idle.

Theory of Operation
===================

The various engines on modern AMD GPUs support multiple queues per engine, with
scheduling firmware that handles dynamically scheduling user queues on the
available hardware queue slots. When the number of user queues exceeds the
number of available hardware queue slots, the scheduling firmware dynamically
maps and unmaps queues based on priority and time quanta. The state of each
user queue is managed in the kernel driver in an MQD (Memory Queue Descriptor).
This is a buffer in GPU accessible memory that stores the state of a user
queue. The scheduling firmware uses the MQD to load the queue state into an
HQD (Hardware Queue Descriptor) when a user queue is mapped. Each user queue
requires a number of additional buffers which represent the ring buffer and any
metadata needed by the engine for runtime operation. On most engines this
consists of the ring buffer itself, a rptr buffer (where the firmware will
shadow the rptr to userspace), a wptr buffer (where the application will write
the wptr for the firmware to fetch), and a doorbell. A doorbell is a piece of
one of the device's MMIO BARs which can be mapped to specific user queues.
When the application writes to the doorbell, it signals the firmware to take
action: the write wakes the firmware and causes it to fetch the wptr and start
processing the packets in the queue.
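
As a rough illustration of the flow above, the sketch below shows a UMD-side
submission: packets are copied into the ring, the wptr shadow is updated, and
the doorbell is rung. All names here are hypothetical and the packet contents
are IP specific; the mappings are assumed to have been set up at queue
creation time, and the ring size is assumed to be a power of two.

.. code-block:: c

   #include <stdint.h>

   static void example_submit(uint32_t *ring, uint64_t ring_size_dw,
                              volatile uint64_t *wptr_shadow,
                              volatile uint64_t *doorbell,
                              const uint32_t *pkt, uint32_t ndw,
                              uint64_t *wptr)
   {
           uint32_t i;

           /* copy the IP specific packets into the ring buffer */
           for (i = 0; i < ndw; i++)
                   ring[(*wptr + i) & (ring_size_dw - 1)] = pkt[i];
           *wptr += ndw;

           /* make the packet writes visible before publishing the wptr */
           __atomic_thread_fence(__ATOMIC_RELEASE);

           *wptr_shadow = *wptr;   /* the firmware fetches the wptr from here */
           *doorbell = *wptr;      /* ring the doorbell to wake the firmware */
   }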

Each 4K page of the doorbell BAR supports specific offset ranges for specific
engines. The doorbell of a queue must be mapped into the aperture aligned to
the IP used by the queue (e.g., GFX, VCN, SDMA, etc.). These doorbell apertures
are set up via NBIO registers. Doorbells are 32-bit or 64-bit (depending on the
engine) chunks of the doorbell BAR. A 4K doorbell page provides 512 64-bit
doorbells for up to 512 user queues. A subset of each page is reserved for
each IP type supported on the device. The user can query the doorbell ranges
for each IP via the INFO IOCTL. See the IOCTL Interfaces section for more
information.

When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer. The application
would map the buffer(s) into its GPUVM and use the GPU virtual addresses for
the areas of memory it wants to use for the user queue. It would also allocate
a doorbell page for the doorbells used by the user queues. The application
would then populate the MQD in the USERQ IOCTL structure with the GPU virtual
addresses and doorbell index it wants to use. The user can also specify the
attributes for the user queue (priority, whether the queue is secure for
protected content, etc.). The application would then call the USERQ CREATE
IOCTL to create the queue using the specified MQD details in the IOCTL. The
kernel driver then validates the MQD provided by the application and translates
the MQD into the engine specific MQD format for the IP. The IP specific MQD
would be allocated and the queue would be added to the run list maintained by
the scheduling firmware. Once the queue has been created, the application can
write packets directly into the queue, update the wptr, and write to the
doorbell offset to kick off work in the user queue.

When the application is done with the user queue, it would call the USERQ FREE
IOCTL to destroy it. The kernel driver would preempt the queue and remove it
from the scheduling firmware's run list. Then the IP specific MQD would be
freed and the user queue state would be cleaned up.
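
A minimal sketch of this lifecycle is shown below. The USERQ IOCTL and its
CREATE/FREE opcodes are described in the IOCTL Interfaces section; the
structure layout, field names, and opcode values used here are illustrative
stand-ins rather than the real uapi (see include/uapi/drm/amdgpu_drm.h for the
actual definitions).

.. code-block:: c

   #include <stdint.h>
   #include <sys/ioctl.h>

   /* illustrative stand-ins, not the real amdgpu uapi */
   #define EXAMPLE_USERQ_OP_CREATE 1
   #define EXAMPLE_USERQ_OP_FREE   2

   struct example_userq_req {
           uint32_t op;             /* CREATE, FREE, or QUERY_STATUS */
           uint32_t ip_type;        /* GFX, compute, SDMA, etc. */
           uint32_t doorbell_index; /* doorbell slot used by this queue */
           uint32_t flags;          /* priority, secure queue, etc. */
           uint64_t queue_va;       /* GPU VA of the ring buffer */
           uint64_t queue_size;     /* ring buffer size in bytes */
           uint64_t rptr_va;        /* GPU VA the firmware shadows the rptr to */
           uint64_t wptr_va;        /* GPU VA the firmware fetches the wptr from */
           uint32_t queue_id;       /* returned by CREATE, consumed by FREE */
   };

   /* userq_ioctl stands in for the real USERQ IOCTL request number */
   static int example_userq_create(int drm_fd, unsigned long userq_ioctl,
                                   struct example_userq_req *req)
   {
           req->op = EXAMPLE_USERQ_OP_CREATE;
           /* the kernel validates the MQD contents before creating the queue */
           if (ioctl(drm_fd, userq_ioctl, req))
                   return -1;
           /* on success, req->queue_id holds the new queue id */
           return 0;
   }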

Some engines may also require an aggregated doorbell if the engine does not
support doorbells from unmapped queues. The aggregated doorbell is a special
page of doorbell space which wakes the scheduler. In cases where the engine may
be oversubscribed, some queues may not be mapped. If the doorbell is rung when
the queue is not mapped, the engine firmware may miss the request. Some
scheduling firmware may work around this by polling wptr shadows when the
hardware is oversubscribed; other engines may support doorbell updates from
unmapped queues. If neither of these options is available, the kernel driver
will map a page of aggregated doorbell space into each GPUVM space. The UMD
will then update the doorbell and wptr as normal and then write to the
aggregated doorbell as well.

Special Packets
---------------

In order to support legacy implicit synchronization, as well as mixed user and
kernel queues, we need a synchronization mechanism that is secure. Because
kernel queues and memory management tasks depend on kernel fences, we need a
way for user queues to update memory that the kernel can use for a fence and
that can't be tampered with by a bad actor. To support this, we've added a
protected fence packet. This packet works by writing a monotonically increasing
value to a memory location that only privileged clients have write access to;
user queues only have read access. When this packet is executed, the memory
location is updated and other queues (kernel or user) can see the results. The
user application submits this packet in its command stream. The actual packet
format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the behavior is
the same. The packet submission is handled in userspace. The kernel driver
sets up the privileged memory used by each user queue when the application
creates the queue.
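
As a rough sketch of how such a fence might be consumed (the mapping and the
64-bit fence width are illustrative assumptions):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /*
    * The fence location is written only by the privileged fence packet;
    * user mappings are read-only, so the value cannot be forged. Since
    * the value increases monotonically, >= is a sufficient check.
    */
   static bool example_fence_reached(const volatile uint64_t *fence_addr,
                                     uint64_t seqno)
   {
           return *fence_addr >= seqno;
   }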

Memory Management
=================

It is assumed that all buffers mapped into the GPUVM space for the process are
valid when engines on the GPU are running. The kernel driver will only allow
user queues to run when all buffers are mapped. If there is a memory event that
requires buffer migration, the kernel driver will preempt the user queues,
migrate buffers to where they need to be, update the GPUVM page tables and
invalidate the TLB, and then resume the user queues.

Interaction with Kernel Queues
==============================

Depending on the IP and the scheduling firmware, kernel queues and user queues
can be enabled at the same time; however, the number of available HQD slots is
a limiting factor. Kernel queues are always mapped, so any work that goes into
kernel queues will take priority. This limits the available HQD slots for user
queues.

Not all IPs will support user queues on all GPUs. As such, UMDs will need to
support both user queues and kernel queues depending on the IP. For example, a
GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
and VPE. The kernel driver provides a way to determine if user queues and
kernel queues are supported on a per IP basis. UMDs can query this information
via the INFO IOCTL and determine whether to use kernel queues or user queues
for each IP.

Queue Resets
============

For most engines, queues can be reset individually; GFX, compute, and SDMA
queues all support per-queue reset. When a hung queue is detected, it can be
reset either via the scheduling firmware or MMIO. Since there are no kernel
fences for most user queues, a hang will usually only be detected when some
other event happens; e.g., a memory event which requires migration of buffers.
When the queues are preempted, if a queue is hung, the preemption will fail.
The driver will then look up the queues that failed to preempt, reset them,
and record which queues are hung.

On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
status. The UMD will provide the queue id in the IOCTL and the kernel driver
will check if it has already recorded the queue as hung (e.g., due to failed
preemption) and report back the status.

IOCTL Interfaces
================

GPU virtual addresses used for queues and related data (rptrs, wptrs, context
save areas, etc.) should be validated by the kernel mode driver to prevent the
user from specifying invalid GPU virtual addresses. If the user provides
invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
error. These buffers should also be tracked in the kernel driver so that if
the user attempts to unmap the buffer(s) from the GPUVM, the unmap call would
return an error.

INFO
----

There are several new INFO queries related to user queues in order to query the
size of user queue metadata needed for a user queue (e.g., context save areas
or shadow buffers), whether kernel or user queues or both are supported for
each IP type, and the offsets for each IP type in each doorbell page.
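
For illustration, such a query might look like the sketch below.
DRM_IOCTL_AMDGPU_INFO and the return_pointer/return_size/query fields are the
existing amdgpu INFO plumbing, but the user queue query id and the layout of
the result buffer are placeholders here; consult include/uapi/drm/amdgpu_drm.h
for the actual values.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <drm/amdgpu_drm.h>

   /* query_id stands in for one of the new user queue INFO queries */
   static int example_query_userq_info(int drm_fd, uint32_t query_id,
                                       void *out, uint32_t out_size)
   {
           struct drm_amdgpu_info request;

           memset(&request, 0, sizeof(request));
           request.return_pointer = (uintptr_t)out; /* kernel writes the result here */
           request.return_size = out_size;
           request.query = query_id;

           return ioctl(drm_fd, DRM_IOCTL_AMDGPU_INFO, &request);
   }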

USERQ
-----

The USERQ IOCTL is used for creating, freeing, and querying the status of user
queues. It supports 3 opcodes:

1. CREATE - Create a user queue. The application provides an MQD-like
   structure that defines the type of queue and associated metadata and flags
   for that queue type. Returns the queue id.
2. FREE - Free a user queue.
3. QUERY_STATUS - Query the status of a queue. Used to check if the queue is
   healthy or not, e.g., if the queue has been reset. (WIP)

USERQ_SIGNAL
------------

The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be
signaled.

USERQ_WAIT
----------

The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited
on.

Kernel and User Queues
======================

In order to properly validate and test performance, we have a driver option to
select what type of queues are enabled (kernel queues, user queues, or both).
The user_queue driver parameter allows you to enable kernel queues only (0),
user queues and kernel queues (1), or user queues only (2). Enabling user
queues only will free up static queue assignments that would otherwise be used
by kernel queues for use by the scheduling firmware. Some kernel queues are
required for kernel driver operation and they will always be created. When
kernel queues are not enabled, they are not registered with the drm scheduler
and the CS IOCTL will reject any incoming command submissions which target
those queue types. Kernel queues only (0) mirrors the behavior of all existing
GPUs. Enabling both queue types allows for backwards compatibility with old
userspace while still supporting user queues.
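
For example, to bring the driver up with only user queues enabled, the
parameter can be set at module load time (or as amdgpu.user_queue=2 on the
kernel command line)::

  modprobe amdgpu user_queue=2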