========================
Display Core Debug tools
========================

In this section, you will find helpful information on debugging the amdgpu
driver from the display perspective. This page introduces debug mechanisms and
procedures to help you identify whether an issue is related to display code.

Narrow down display issues
==========================

Since the display is the driver's visual component, it is common to see users
reporting a problem as a display issue when another component causes it. This
section equips users to determine if a specific issue was caused by the display
component or another part of the driver.

DC dmesg important messages
---------------------------

The dmesg log is the first source of information to check, and amdgpu logs
valuable information there. When looking for issues associated with amdgpu,
remember that each component of the driver (e.g., smu, PSP, dm, etc.) is
loaded one by one, and this information can be found in the dmesg log. Look
for the part of the log that resembles the snippet below::

  [    4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
  [    4.254718] [drm] register mmio base: 0xFCB00000
  [    4.254918] [drm] register mmio size: 1048576
  [    4.260095] [drm] add ip block number 0 <soc21_common>
  [    4.260318] [drm] add ip block number 1 <gmc_v11_0>
  [    4.260510] [drm] add ip block number 2 <ih_v6_0>
  [    4.260696] [drm] add ip block number 3 <psp>
  [    4.260878] [drm] add ip block number 4 <smu>
  [    4.261057] [drm] add ip block number 5 <dm>
  [    4.261231] [drm] add ip block number 6 <gfx_v11_0>
  [    4.261402] [drm] add ip block number 7 <sdma_v6_0>
  [    4.261568] [drm] add ip block number 8 <vcn_v4_0>
  [    4.261729] [drm] add ip block number 9 <jpeg_v4_0>
  [    4.261887] [drm] add ip block number 10 <mes_v11_0>

From the above example, you can see the line reporting that `<dm>`
(**Display Manager**) was loaded, which means that display can be part of the
issue. If you do not see that line, something else might have failed before
amdgpu loaded the display component, indicating that the problem is not a
display issue.
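
The check described above can be scripted. The following is a minimal sketch:
the helper name `has_dm_block` is hypothetical, and the pattern assumes the
log format shown in the snippet above::

  # List the IP blocks registered by amdgpu (may require root):
  #   dmesg | grep -i 'add ip block'

  # Hypothetical helper: succeeds if the log on stdin reports the <dm> block.
  has_dm_block() {
      grep -q 'add ip block number [0-9]* <dm>'
  }

  # Usage: dmesg | has_dm_block && echo "DM was loaded"

If the helper fails on a full boot log, the problem most likely happened
before the display component was added.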

After you have identified that the DM was loaded correctly, you can check the
display version of the hardware in use, which can be retrieved from the dmesg
log with the command::

  dmesg | grep -i 'display core'

This command shows a message that looks like this::

  [    4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2

This message has two key pieces of information:

* **The DC version (e.g., v3.2.285)**: Display developers release a new DC version
  every week, and this information is useful when a user/developer must find a
  good point versus a bad point based on a tested version of the display code.
  Remember from page :ref:`Display Core <amdgpu-display-core>` that every week
  the new patches for display are heavily tested with IGT and manual tests.
* **The DCN version (e.g., DCN 3.2)**: The DCN block is associated with the
  hardware generation, and the DCN version conveys the hardware generation that
  the driver is currently running on. This information helps to narrow down the
  code debug area since each DCN version has its files in the DC folder per DCN
  component (from the example, the developer might want to focus on
  files/folders/functions/structs with the dcn32 label, since those are the
  ones likely to be executed). However, keep in mind that DC reuses code across
  different DCN versions; for example, it is expected to have some callbacks
  set in one DCN that are the same as those from another DCN. In summary, use
  the DCN version just as a guide.
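
As an illustration, the two fields can be split out of that message with
`sed`. This is only a sketch: the helper names are hypothetical, and the
patterns assume the exact message format shown above::

  # Hypothetical helpers; each reads dmesg output on stdin.
  dc_version() {
      sed -n 's/.*Display Core v\([0-9.]*\) initialized.*/\1/p'
  }

  dcn_version() {
      sed -n 's/.*initialized on \(DCN [0-9.]*\).*/\1/p'
  }

  # Usage:
  #   dmesg | dc_version     # DC version, useful when looking for a good/bad point
  #   dmesg | dcn_version    # hardware generation

For the sample message above, these print `3.2.285` and `DCN 3.2`.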

From the dmesg log, it is also possible to get the ATOM BIOS code by using::

  dmesg | grep -i 'ATOM BIOS'

which generates an output that looks like this::

  [    4.274534] amdgpu: ATOM BIOS: 113-D7020100-102

This information is useful to include in bug reports.

Avoid loading display core
--------------------------

Sometimes, it might be hard to figure out which part of the driver is causing
the issue; if you suspect that the display is not part of the problem and your
bug scenario is simple (e.g., some desktop configuration), you can try to remove
the display component from the equation. First, you need to identify the `dm`
ID from the dmesg log; for example, search for the following log::

  [    4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
  [..]
  [    4.260095] [drm] add ip block number 0 <soc21_common>
  [    4.260318] [drm] add ip block number 1 <gmc_v11_0>
  [..]
  [    4.261057] [drm] add ip block number 5 <dm>

Notice from the above example that the `dm` ID is 5 for this specific hardware.
Next, you need to run the following bitwise operation to identify the IP block
mask::

  0xffffffff & ~(1 << [DM ID])

From our example, the IP mask is::

  0xffffffff & ~(1 << 5) = 0xffffffdf
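
The same arithmetic can be done with shell arithmetic expansion. This is a
minimal sketch; `dm_block_id` and `ip_block_mask` are hypothetical helper
names, and the `sed` pattern assumes the log format shown above::

  # Hypothetical helper: print the block number of <dm> from dmesg on stdin.
  dm_block_id() {
      sed -n 's/.*add ip block number \([0-9]*\) <dm>.*/\1/p'
  }

  # Hypothetical helper: clear bit $1 in a 32-bit all-ones mask.
  ip_block_mask() {
      printf '0x%08x\n' $(( 0xffffffff & ~(1 << $1) ))
  }

  # Usage: ip_block_mask "$(dmesg | dm_block_id)"
  ip_block_mask 5    # prints 0xffffffdf

If the log contains more than one `<dm>` line (for example, with several GPUs),
`dm_block_id` prints one ID per line, so pick the one for the device under
test.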

Finally, to disable DC, you just need to set the below parameter in your
bootloader::

  amdgpu.ip_block_mask=0xffffffdf

If you can boot your system with the DC disabled and still see the issue, it
means you can rule DC out of the equation. However, if the bug disappears, you
still need to consider DC as part of the problem and keep narrowing down the
issue. In some scenarios, disabling DC is impossible since it might be
necessary to use the display component to reproduce the issue (e.g., play a
game).

**Note: This will probably lead to the absence of a display output.**

Display flickering
------------------

Display flickering might have multiple causes; one is the lack of proper power
to the GPU or problems in the DPM switches. A good first generic verification
is to force the GPU to the high performance level::

  bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

The above command sets the GPU/APU to use the maximum power allowed, which
disables DPM switches. If forcing DPM levels high does not fix the issue, it
is less likely that the issue is related to power management. If the issue
disappears, there is a good chance that other components might be involved, and
the display should not be ignored since this could be a DPM issue. From the
display side, if the power increase fixes the issue, it is worth debugging the
clock configuration and the pipe split policy used in the specific
configuration.
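
Because the forced level stays in effect until it is changed again, it helps
to note the previous value and restore it after the test. A minimal sketch,
assuming the GPU is `card0` as in the command above; the helper name
`set_dpm_level` is hypothetical::

  # Path from the command above; card0 is an assumption and may differ.
  DPM_LEVEL=/sys/class/drm/card0/device/power_dpm_force_performance_level

  # Hypothetical helper: print the current level, then write a new one.
  # $1 = new level (e.g. high or auto), $2 = sysfs file (defaults to DPM_LEVEL).
  set_dpm_level() {
      file=${2:-$DPM_LEVEL}
      cat "$file"
      echo "$1" > "$file"
  }

  # Usage (as root):
  #   set_dpm_level high    # prints the previous level
  #   ...try to reproduce the flickering...
  #   set_dpm_level auto    # hand control back to the driver

Writing `auto` returns the GPU to driver-managed DPM switching.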

Display artifacts
-----------------

Users may see some screen artifacts that can be categorized into two different
types: localized artifacts and general artifacts. The localized artifacts
happen in some specific areas, such as around the UI window corners; if you see
this type of issue, there is a considerable chance that you have a userspace
problem, likely Mesa or similar. The general artifacts usually happen on the
entire screen. They might be caused by a misconfiguration at the driver level
of the display parameters, but the userspace might also cause this issue. One
way to identify the source of the problem is to take a screenshot or make a
desktop video capture when the problem happens; after checking the
screenshot/video recording, if you don't see any of the artifacts, it means
that the issue is likely on the driver side. If you can still see the
problem in the data collected, it is an issue that probably happened during
rendering, and the display code just received an already corrupted framebuffer.

Disabling/Enabling specific features
====================================

DC has a struct named `dc_debug_options`, which is statically initialized by
all DCE/DCN components based on the specific hardware characteristics. This
structure usually facilitates the bring-up phase since developers can start
with many disabled features and enable them individually. This is also an
important debug feature since users can change it when debugging specific
issues.

For example, dGPU users sometimes see a problem where a horizontal strip of
flickering happens in some specific part of the screen. This could be an
indication of Sub-Viewport issues; after the users have identified the target
DCN, they can set the `force_disable_subvp` field to true in the statically
initialized version of `dc_debug_options` to see if the issue gets fixed. Along
the same lines, users/developers can also try to turn off `fams2_config` and
`enable_single_display_2to1_odm_policy`. In summary, `dc_debug_options` is
a useful tool for identifying the problem.
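
To find where a given field is set for the target DCN, a plain `grep` over the
DC folder of a kernel source tree is usually enough. The sketch below is an
illustration only; `find_debug_option` is a hypothetical helper, and it
assumes the DC code lives under `drivers/gpu/drm/amd/display/dc` in the tree
passed as the second argument::

  # Hypothetical helper: list where a dc_debug_options field is set.
  # $1 = field name, $2 = kernel source tree (defaults to the current directory).
  find_debug_option() {
      grep -rn --include='*.c' "\.$1[[:space:]]*=" \
          "${2:-.}/drivers/gpu/drm/amd/display/dc"
  }

  # Usage: find_debug_option force_disable_subvp ~/linux | grep dcn32

The pattern matches any assignment to the field, so the output also includes
run-time assignments and comparisons, not only the static initializers.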

DC Visual Confirmation
======================

Display core provides a feature named visual confirmation, which is a set of
bars added at the scanout time by the driver to convey some specific
information. In general, you can enable this debug option by using::

  echo <N> > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

where `N` is an integer that selects the specific scenario the developer
wants to enable. Some of these debug cases are described in the following
subsections.
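
A small wrapper makes it easier to switch modes while testing. This is a
sketch under assumptions: the helper name `visual_confirm` is hypothetical,
and reading the file back and using `0` to turn the bars off are not stated
above::

  # debugfs file from the command above; DRI index 0 is an assumption.
  VISUAL_CONFIRM=/sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

  # Hypothetical helper: write mode $1 and read it back.
  # $2 = debugfs file (defaults to VISUAL_CONFIRM).
  visual_confirm() {
      file=${2:-$VISUAL_CONFIRM}
      echo "$1" > "$file"
      cat "$file"
  }

  # Usage (as root):
  #   visual_confirm 1    # enable a mode
  #   visual_confirm 0    # assumed to turn the bars off again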

Multiple Planes Debug
---------------------

If you want to enable or debug multiple planes in a specific user-space
application, you can leverage the visual confirm debug feature. To enable it,
use::

  echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

You need to reload your GUI to see the visual confirmation. When the plane
configuration changes or a full update occurs, there will be a colored bar at
the bottom of each hardware plane being drawn on the screen.

* The color indicates the format. For example, red is AR24 and green is NV12
* The height of the bar indicates the index of the plane
* Pipe split can be observed if there are two bars with a difference in height
  covering the same plane

Consider the video playback case in which a video is played in a specific
plane, and the desktop is drawn in another plane. The video plane should
feature one or two green bars at the bottom of the video depending on pipe
split configuration.

* There should **not** be any visual corruption
* There should **not** be any underflow or screen flashes
* There should **not** be any black screens
* There should **not** be any cursor corruption
* Multiple planes **may** be briefly disabled during window transitions or
  resizing but should come back after the action has finished

Pipe Split Debug
----------------

Sometimes we need to debug if DCN is splitting pipes correctly, and visual
confirmation is also handy for this case. Similar to the multiple planes (MPO)
case, you can use the below command to enable visual confirmation::

  echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

In this case, if you have a pipe split, you will see one small red bar at the
bottom of the display covering the entire display width and another bar
covering the second pipe. In other words, you will see a slightly taller bar in
the second pipe.

DTN Debug
=========

DC (DCN) provides an extensive log that dumps multiple details from the
hardware configuration. You can capture those status values with the Display
Test Next (DTN) log, which is available via debugfs::

  cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

Since this log is updated according to the DCN status, you can also follow the
changes in real time by using something like::

  sudo watch -d cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

When reporting a bug related to DC, consider attaching this log before and
after you reproduce the bug.
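
One way to collect the before and after logs is to save a labeled copy at each
point. This is a minimal sketch; the helper name `save_dtn_log` and the output
file naming are hypothetical::

  # debugfs file from the commands above; DRI index 0 is an assumption.
  DTN_LOG=/sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

  # Hypothetical helper: save a timestamped copy of the log in the current
  # directory and print the name of the file that was written.
  # $1 = label (e.g. before or after), $2 = log file (defaults to DTN_LOG).
  save_dtn_log() {
      out="dtn-$1-$(date +%Y%m%d-%H%M%S).log"
      cat "${2:-$DTN_LOG}" > "$out" && echo "$out"
  }

  # Usage (as root):
  #   before=$(save_dtn_log before)
  #   ...reproduce the bug...
  #   after=$(save_dtn_log after)
  #   diff -u "$before" "$after"

The `diff` output gives a quick view of which hardware state changed while the
bug was being reproduced.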

Collect Firmware information
============================

When reporting issues, it is important to include the firmware information
since it can be helpful for debugging purposes. To get all the firmware
information, use the command::

  cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

From the display perspective, pay attention to the firmware of the DMCU and
DMCUB.
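
To keep only the display-related entries, the report can be filtered. This is
a sketch; the helper name `display_firmware` is hypothetical, and the filter
assumes that each firmware is reported on its own line containing its name::

  # Hypothetical helper: keep only the DMCU/DMCUB lines of the report on stdin.
  display_firmware() {
      grep -i 'dmcu'
  }

  # Usage (as root):
  #   display_firmware < /sys/kernel/debug/dri/0/amdgpu_firmware_info

The single pattern matches both names because `DMCUB` contains `DMCU`.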

DMUB Firmware Debug
===================

Sometimes, dmesg logs aren't enough. This is especially true if a feature is
implemented primarily in DMUB firmware. In such cases, all we see in dmesg when
an issue arises is some generic timeout error. So, to get more relevant
information, we can trace DMUB commands by enabling the relevant bits in
`amdgpu_dm_dmub_trace_mask`.

Currently, we support the tracing of the following groups:

Trace Groups
------------

.. csv-table::
   :header-rows: 1
   :widths: 1, 1
   :file: ./trace-groups-table.csv

**Note: Not all ASICs support all of the listed trace groups**

So, to enable just PSR tracing, you can use the following command::

  # echo 0x8020 > /sys/kernel/debug/dri/0/amdgpu_dm_dmub_trace_mask

Then, you need to enable logging trace events to the buffer, which you can do
using the following::

  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en

Lastly, after you are able to reproduce the issue you are trying to debug,
you can disable tracing and read the trace log by using the following::

  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en
  # cat /sys/kernel/debug/dri/0/amdgpu_dm_dmub_tracebuffer

So, when reporting bugs related to features such as PSR and ABM, consider
enabling the relevant bits in the mask before reproducing the issue, and
attach the log that you obtain from the trace buffer to any bug reports that
you create.
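
The three steps above can be wrapped into a pair of helpers. This is a minimal
sketch that only chains the commands already shown; the helper names
`dmub_trace_start` and `dmub_trace_stop` are hypothetical::

  # debugfs directory from the commands above; DRI index 0 is an assumption.
  DRI_DEBUG=/sys/kernel/debug/dri/0

  # Hypothetical helper: set the trace mask $1 and start logging events.
  # $2 = debugfs directory (defaults to DRI_DEBUG).
  dmub_trace_start() {
      dir=${2:-$DRI_DEBUG}
      echo "$1" > "$dir/amdgpu_dm_dmub_trace_mask"
      echo 1 > "$dir/amdgpu_dm_dmcub_trace_event_en"
  }

  # Hypothetical helper: stop logging and print the trace buffer.
  # $1 = debugfs directory (defaults to DRI_DEBUG).
  dmub_trace_stop() {
      dir=${1:-$DRI_DEBUG}
      echo 0 > "$dir/amdgpu_dm_dmcub_trace_event_en"
      cat "$dir/amdgpu_dm_dmub_tracebuffer"
  }

  # Usage (as root):
  #   dmub_trace_start 0x8020
  #   ...reproduce the issue...
  #   dmub_trace_stop > dmub-trace.log

The resulting `dmub-trace.log` is the file to attach to the bug report.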