===============
 GPU Debugging
===============

GPUVM Debugging
===============

To aid in debugging GPU virtual memory related problems, the driver supports
the following module parameters:

`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.

`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.


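To check which values the two parameters above currently have, a small
script can read them back.  The following is a minimal sketch: it assumes the
parameters are exposed under ``/sys/module/amdgpu/parameters`` (the usual
sysfs location for module parameters, but not stated in this document), and
the helper name ``read_vm_debug_params`` is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the amdgpu module parameters; verify on your system.
PARAM_DIR = Path("/sys/module/amdgpu/parameters")


def read_vm_debug_params(param_dir=PARAM_DIR):
    """Return the GPUVM debug parameters as ints.

    A parameter whose file is missing, unreadable, or not an integer
    maps to None.
    """
    values = {}
    for name in ("vm_fault_stop", "vm_update_mode"):
        try:
            values[name] = int((Path(param_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_debug_params().items():
        print(f"{name} = {value}")
```

Module parameters are normally set when the module is loaded, for example
with ``amdgpu.vm_fault_stop=1`` on the kernel command line; the script only
reads the values back.
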
Decoding a GPUVM Page Fault
===========================

If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application.  A page fault in your kernel
log may look something like this:

::

 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
 	Faulty UTCL2 client ID: TCP (0x8)
 	MORE_FAULTS: 0x0
 	WALKER_ERROR: 0x0
 	PERMISSION_FAULTS: 0x3
 	MAPPING_ERROR: 0x0
 	RW: 0x0

The first field identifies the memory hub, either gfxhub or mmhub.  gfxhub is
the memory hub used for graphics, compute, and, on some chips, sdma.  mmhub is
the memory hub used for multi-media and, on some chips, sdma.

Next come the vmid and pasid.  If the vmid is 0, the fault was likely
caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
a fault in a user application.  The pasid is used to link a vmid to a system
process id.  If the process is active when the fault happens, the process
information will be printed.

The GPU virtual address that caused the fault comes next.

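The fields described so far can be pulled out of the log text mechanically.
The sketch below is written against the sample message shown earlier; the
regular expressions and the ``parse_fault`` helper are illustrative
assumptions, and other kernel versions may format the message differently.

```python
import re

# Patterns modelled on the sample log message only (an assumption).
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\]\s+(?P<kind>[\w-]+) page fault \("
    r"src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
)
ADDR_RE = re.compile(r"in page starting at address (?P<addr>0x[0-9a-fA-F]+)")


def parse_fault(text):
    """Extract hub, vmid, pasid and faulting address from a fault message.

    Returns None if the text does not contain a recognisable fault header.
    """
    head = FAULT_RE.search(text)
    if head is None:
        return None
    vmid = int(head["vmid"])
    addr = ADDR_RE.search(text)
    return {
        "hub": head["hub"],
        "vmid": vmid,
        "pasid": int(head["pasid"]),
        "address": int(addr["addr"], 16) if addr else None,
        # vmid 0 points at the kernel driver or firmware, non-0 at userspace.
        "likely_origin": (
            "kernel driver or firmware" if vmid == 0 else "user application"
        ),
    }
```

Passing the two header lines of the sample message to ``parse_fault`` gives
hub ``gfxhub0``, vmid 3, pasid 32777 and a likely origin of a user
application.
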
The client ID indicates the GPU block that caused the fault.
Some common client IDs:

- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines

PERMISSION_FAULTS describes which faults were encountered:

- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set

Finally, RW indicates whether the access was a read (0) or a write (1).

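The bit assignments above translate directly into a small decoder.  This is
a minimal sketch; the function names are made up for illustration, and the
inputs are the PERMISSION_FAULTS and RW values as already printed in the log,
not the raw status register.

```python
# Bit positions and meanings taken from the PERMISSION_FAULTS list above.
PERMISSION_FAULT_BITS = (
    (0, "PTE not valid"),
    (1, "PTE read bit not set"),
    (2, "PTE write bit not set"),
    (3, "PTE execute bit not set"),
)


def decode_permission_faults(value):
    """Return the description of every fault bit set in value."""
    return [label for bit, label in PERMISSION_FAULT_BITS if value & (1 << bit)]


def decode_rw(value):
    """Return 'read' for RW = 0 and 'write' for RW = 1."""
    return "write" if value & 1 else "read"


if __name__ == "__main__":
    # Values from the sample log: PERMISSION_FAULTS: 0x3, RW: 0x0
    print(decode_permission_faults(0x3))
    print(decode_rw(0x0))
```

For the sample log, PERMISSION_FAULTS of 0x3 decodes to bits 0 and 1 (PTE
not valid, PTE read bit not set) and RW of 0x0 decodes to a read.
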
In the example above, a shader (client id = TCP) generated a read (RW = 0x0) to
an invalid page (PERMISSION_FAULTS = 0x3, i.e. bits 0 and 1 set) at GPU virtual
address 0x0000800102800000.  The user can then inspect their shader code and
resource descriptor state to determine what caused the GPU page fault.

UMR
===

`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
GPU debugging and diagnostics tool.  Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.
