Documentation/gpu/amdgpu/userq.rst
Source file repositories/reference/linux-study-clean/Documentation/gpu/amdgpu/userq.rst
File Facts
- System: Linux kernel
- Corpus path: Documentation/gpu/amdgpu/userq.rst
- Extension: .rst
- Size: 11358 bytes
- Lines: 206
- Domain: Support Tooling And Documentation
- Bucket: Documentation
- Inferred role: Support Tooling And Documentation: documentation
- Status: atlas-only
Why This File Exists
Repository support layer: documentation, build tooling, samples, user-space helper tools, generated initramfs support, licenses, and validation utilities.
Dependency Surface
- No C-style include directives detected by the generator.
Detected Declarations
- No top-level syscall, struct, function, initcall, or export declaration detected by the generator.
Annotated Snippet
.. _amdgpu-userq:
==================
User Mode Queues
==================
Introduction
============
Similar to the KFD, GPU engine queues move into userspace. The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work. This reduces overhead and also allows
the GPU to submit work to itself. Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.
UMDs directly interface with firmware via per application shared memory areas.
The main vehicle for this is the queue. A queue is a ring buffer with a read
pointer (rptr) and a write pointer (wptr). The UMD writes IP specific packets
into the queue and the firmware processes those packets, kicking off work on the
GPU engines. The CPU in the application (or another queue or device) updates
the wptr to tell the firmware how far into the ring buffer to process packets,
and the rptr provides feedback to the UMD on how far the firmware has progressed
in executing those packets. When the wptr and the rptr are equal, the queue is
idle.
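The rptr/wptr relationship described above can be modeled in a few lines of C.
This is a minimal host-side sketch, not driver or firmware code: the
``toy_queue`` structure, the 8-dword ring size, and all function names are
hypothetical, and the consumer function merely stands in for the firmware. It
assumes monotonically increasing pointers that are masked to index the ring,
which is one common ring buffer convention; the text above does not specify how
real queues wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring size in dwords; must be a power of two for the mask. */
#define TOY_RING_DWORDS 8u

struct toy_queue {
	uint32_t ring[TOY_RING_DWORDS];
	uint64_t wptr; /* advanced by the producer (the UMD in the text) */
	uint64_t rptr; /* advanced by the consumer (the firmware in the text) */
};

/* The queue is idle when the consumer has caught up with the producer. */
static bool toy_queue_idle(const struct toy_queue *q)
{
	return q->wptr == q->rptr;
}

/* Free dwords the producer may still write without overrunning rptr. */
static uint64_t toy_queue_space(const struct toy_queue *q)
{
	assert(q->wptr - q->rptr <= TOY_RING_DWORDS);
	return TOY_RING_DWORDS - (q->wptr - q->rptr);
}

/* Producer side: copy a packet into the ring, then publish the new wptr. */
static bool toy_queue_write(struct toy_queue *q, const uint32_t *pkt,
			    uint32_t ndw)
{
	if (ndw > toy_queue_space(q))
		return false;
	for (uint32_t i = 0; i < ndw; i++)
		q->ring[(q->wptr + i) & (TOY_RING_DWORDS - 1)] = pkt[i];
	q->wptr += ndw;
	return true;
}

/* Consumer side: "process" everything up to wptr, report progress via rptr. */
static uint64_t toy_queue_consume(struct toy_queue *q)
{
	uint64_t done = q->wptr - q->rptr;

	q->rptr = q->wptr;
	return done;
}
```

A real producer would most likely also need a memory barrier between the packet
writes and the wptr update so that the consumer never observes the new wptr
before the packet contents; that detail is omitted from this sketch.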
Theory of Operation
===================
The various engines on modern AMD GPUs support multiple queues per engine with a
scheduling firmware which handles dynamically scheduling user queues on the
available hardware queue slots. When the number of user queues outnumbers the
available hardware queue slots, the scheduling firmware dynamically maps and
unmaps queues based on priority and time quanta. The state of each user queue
is managed in the kernel driver in an MQD (Memory Queue Descriptor). This is a
buffer in GPU accessible memory that stores the state of a user queue. The
scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
Queue Descriptor) when a user queue is mapped. Each user queue requires a
number of additional buffers which represent the ring buffer and any metadata
needed by the engine for runtime operation. On most engines this consists of
the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
to userspace), a wptr buffer (where the application will write the wptr for the
firmware to fetch it), and a doorbell. A doorbell is a piece of one of the
device's MMIO BARs which can be mapped to specific user queues. When the
application writes to the doorbell, it will signal the firmware to take some
action. Writing to the doorbell wakes the firmware and causes it to fetch the
wptr and start processing the packets in the queue. Each 4K page of the doorbell
BAR supports specific offset ranges for specific engines. The doorbell of a
queue must be mapped into the aperture aligned to the IP used by the queue
(e.g., GFX, VCN, SDMA, etc.). These doorbell apertures are set up via NBIO
registers. Doorbells are 32-bit or 64-bit (depending on the engine) chunks of
the doorbell BAR. A 4K doorbell page provides 512 64-bit doorbells for up to
512 user queues. A subset of each page is reserved for each IP type supported
on the device. The user can query the doorbell ranges for each IP via the INFO
IOCTL. See the IOCTL Interfaces section for more information.
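The doorbell arithmetic above can be checked with a short C sketch. It only
restates the numbers given in the text (a 4K page, 32-bit or 64-bit doorbells);
the helper names and the ``toy_doorbell_range`` structure are hypothetical, and
the actual per-IP ranges come from the INFO IOCTL rather than from anything
shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_DOORBELL_PAGE_BYTES 4096u

/* Number of doorbells that fit in one 4K page for a given doorbell width. */
static uint32_t toy_doorbells_per_page(uint32_t doorbell_bytes)
{
	assert(doorbell_bytes == 4 || doorbell_bytes == 8);
	return TOY_DOORBELL_PAGE_BYTES / doorbell_bytes;
}

/* Byte offset of a doorbell slot within its page. */
static uint32_t toy_doorbell_byte_offset(uint32_t index,
					 uint32_t doorbell_bytes)
{
	assert(index < toy_doorbells_per_page(doorbell_bytes));
	return index * doorbell_bytes;
}

/* Hypothetical description of the slots reserved for one IP type. */
struct toy_doorbell_range {
	uint32_t first; /* first doorbell index reserved for the IP */
	uint32_t count; /* number of consecutive indices reserved */
};

/* True when a doorbell index falls inside the range reserved for an IP. */
static bool toy_doorbell_in_range(const struct toy_doorbell_range *r,
				  uint32_t index)
{
	return index >= r->first && index - r->first < r->count;
}
```

With 8-byte doorbells, ``toy_doorbells_per_page(8)`` gives the 512 slots per
page quoted above; by the same division, a page of 4-byte doorbells would hold
1024 slots.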
When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer. The application
would map the buffer(s) into its GPUVM and use the GPU virtual addresses of
the areas of memory it wants to use for the user queue. It would also
allocate a doorbell page for the doorbells used by the user queues. The
application would then populate the MQD in the USERQ IOCTL structure with the
GPU virtual addresses and doorbell index it wants to use. The user can also
specify the attributes for the user queue (priority, whether the queue is secure
for protected content, etc.). The application would then call the USERQ
CREATE IOCTL to create the queue using the specified MQD details in the IOCTL.
The kernel driver then validates the MQD provided by the application and
translates the MQD into the engine specific MQD format for the IP. The IP
specific MQD would be allocated and the queue would be added to the run list
maintained by the scheduling firmware. Once the queue has been created, the
Annotation
- Atlas domain: Support Tooling And Documentation / Documentation.
- Implementation status: atlas-only.
Implementation Notes
- This generated page is the file-by-file coverage layer; curated subsystem chapters should link here when they synthesize a multi-file control flow.
- Core OS pages should be promoted from atlas-only to deep-reviewed when they explain data structures, invariants, locking, lifecycle, and C implementation snippets.
- Driver-family pages are intentionally pattern-oriented unless they are part of the selected PCIe/NVMe representative device path.