.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible, future-proof interface that can be extended to support
  additional types of RAS data in the future.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
---------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  using ``node-id`` as a parameter.
* Query a specific error counter value with the ``get-error-counter`` command,
  using both ``node-id`` and ``error-id`` as parameters.
* Clear a specific error counter with the ``clear-error-counter`` command,
  using both ``node-id`` and ``error-id`` as parameters.

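The mapping from these operations to command lines can be sketched in Python.
This is a minimal sketch and not part of the kernel tree: ``ynl_command()`` is a
hypothetical helper, and it assumes the ``ynl`` CLI syntax used in the examples
in this document (``--family drm_ras``, ``--dump`` or ``--do``, ``--json``). It
only builds the argument list and does not talk to the kernel.

.. code-block:: python

    import json
    import shlex


    def ynl_command(op, node_id=None, error_id=None):
        """Build the ynl argv for one drm-ras operation (hypothetical helper)."""
        argv = ["ynl", "--family", "drm_ras"]
        if op == "list-nodes":
            # Enumerating nodes is a dump and takes no attributes.
            return argv + ["--dump", op]
        if node_id is None:
            raise ValueError(f"{op} requires node-id")

        attrs = {"node-id": node_id}
        if error_id is not None:
            # node-id plus error-id: act on one counter with a "do" request.
            attrs["error-id"] = error_id
            mode = "--do"
        elif op == "get-error-counter":
            # node-id only: dump every counter of that node.
            mode = "--dump"
        else:
            raise ValueError(f"{op} requires both node-id and error-id")
        return argv + [mode, op, "--json", json.dumps(attrs)]


    for args in (("list-nodes",), ("get-error-counter", 0),
                 ("get-error-counter", 0, 1), ("clear-error-counter", 0, 1)):
        print(shlex.join(ynl_command(*args)))
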
YAML-based Interface
--------------------

The interface is described in the YAML specification
``Documentation/netlink/specs/drm_ras.yaml``.

This specification is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or node names can be used for all further queries, such as error
  counters.
* Error counters can be queried by either the error ID or the error name.
* Query parameters should be defined as part of the uAPI to ensure user
  interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

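Since user space has to enumerate nodes before querying counters, a tool would
typically turn a reply like the one above into a lookup table. The following is
a minimal sketch: ``node_ids_by_name()`` is a hypothetical helper, the data is
copied from the example reply, and the sketch assumes that a node name is unique
within one device, which this document does not state.

.. code-block:: python

    # Reply of the list-nodes dump from the example, as Python data.
    nodes = [
        {"device-name": "0000:03:00.0", "node-id": 0,
         "node-name": "correctable-errors", "node-type": "error-counter"},
        {"device-name": "0000:03:00.0", "node-id": 1,
         "node-name": "uncorrectable-errors", "node-type": "error-counter"},
    ]


    def node_ids_by_name(nodes, device=None):
        """Map (device-name, node-name) to node-id, optionally for one device."""
        return {
            (n["device-name"], n["node-name"]): n["node-id"]
            for n in nodes
            if device is None or n["device-name"] == device
        }


    ids = node_ids_by_name(nodes)
    # Node ID to pass as "node-id" in later get-error-counter requests.
    print(ids[("0000:03:00.0", "uncorrectable-errors")])
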
Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

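The reply printed above uses Python literal syntax (single quotes), so it is not
valid JSON as shown. A script that captures this text can parse it with
``ast.literal_eval()``. The sketch below does that and keeps only counters with
a non-zero value; ``nonzero_counters()`` is a hypothetical helper and the input
is the example reply, not real hardware data.

.. code-block:: python

    import ast

    # Text printed by the dump in the example (Python literals, not JSON).
    reply_text = """
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]
    """


    def nonzero_counters(text):
        """Return {error-name: error-value} for counters that are not zero."""
        counters = ast.literal_eval(text.strip())
        return {c["error-name"]: c["error-value"]
                for c in counters if c["error-value"] != 0}


    # All counters in the example reply are zero, so nothing is reported.
    print(nonzero_counters(reply_text))
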
Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}

Example: Clear an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
    None

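Because counters can be cleared, a monitoring tool that reports increases
between two polls has to cope with a value that goes down. The following is a
minimal sketch under a stated assumption: clearing is taken to restart the
counter from zero, which this document does not specify. ``counter_deltas()`` is
a hypothetical helper and the sample values are made up for illustration.

.. code-block:: python

    def counter_deltas(previous, current):
        """Return per-counter increase between two {error-id: value} samples."""
        deltas = {}
        for error_id, value in current.items():
            before = previous.get(error_id, 0)
            # A smaller value than before is taken to mean the counter was
            # cleared in between (assumption), so count from zero again.
            deltas[error_id] = value - before if value >= before else value
        return deltas


    # Counter 1 grew from 5 to 8; counter 2 dropped from 7 to 2 (cleared).
    print(counter_deltas({1: 5, 2: 7}, {1: 8, 2: 2}))
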
112