.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible interface that can later be extended to support additional
  types of RAS data.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
---------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  passing ``node-id`` as a parameter.
* Query a specific error counter value with the ``get-error-counter`` command,
  passing both ``node-id`` and ``error-id`` as parameters.

YAML-based Interface
--------------------

The interface is described in the YAML specification
``Documentation/netlink/specs/drm_ras.yaml``.

This YAML is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or node names can be used for all further queries, such as error
  counters.
* Error counters can be queried by either the error ID or the error name.
* Query parameters should be defined as part of the uAPI to ensure user
  interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

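The enumerate-then-query flow described in the usage notes can be sketched in
Python. This is a minimal illustration, not part of the interface: the helper
names ``index_nodes()`` and ``counter_request()`` are hypothetical, and the node
list is hard-coded from the sample ``list-nodes`` reply in this document instead
of being read from a live netlink socket.

.. code-block:: python

    import json

    # Assumption: sample reply copied from the list-nodes example in this
    # document; a real tool would obtain it from the drm-ras family.
    NODES = [
        {"device-name": "0000:03:00.0", "node-id": 0,
         "node-name": "correctable-errors", "node-type": "error-counter"},
        {"device-name": "0000:03:00.0", "node-id": 1,
         "node-name": "uncorrectable-errors", "node-type": "error-counter"},
    ]


    def index_nodes(nodes):
        """Map (device-name, node-name) to node-id for error-counter nodes."""
        return {
            (n["device-name"], n["node-name"]): n["node-id"]
            for n in nodes
            if n["node-type"] == "error-counter"
        }


    def counter_request(node_id, error_id=None):
        """Build a JSON request body for get-error-counter.

        With only node-id the request lists every counter of the node;
        adding error-id narrows it to a single counter.
        """
        req = {"node-id": node_id}
        if error_id is not None:
            req["error-id"] = error_id
        return json.dumps(req)


    index = index_nodes(NODES)
    node_id = index[("0000:03:00.0", "uncorrectable-errors")]
    print(counter_request(node_id))     # {"node-id": 1}
    print(counter_request(node_id, 2))  # {"node-id": 1, "error-id": 2}

The resulting strings have the same shape as the ``--json`` arguments used with
``get-error-counter`` in the examples of this document.
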
Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}

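The sample replies above appear to be printed as Python literals (single-quoted
strings), so a strict JSON parser would reject them. A monitoring script that
captures this text could, as one possible approach, parse it with
``ast.literal_eval()``. The sketch below assumes the output format shown in this
document; ``parse_reply()`` and ``nonzero_counters()`` are hypothetical helper
names, and the input is the sample dump rather than live data.

.. code-block:: python

    import ast

    # Assumption: text copied from the get-error-counter dump example in
    # this document.
    DUMP_OUTPUT = """\
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]
    """


    def parse_reply(text):
        """Parse a reply printed as a Python literal into lists and dicts."""
        return ast.literal_eval(text.strip())


    def nonzero_counters(counters):
        """Return {error-name: error-value} for counters that are not zero."""
        return {
            c["error-name"]: c["error-value"]
            for c in counters
            if c["error-value"] != 0
        }


    counters = parse_reply(DUMP_OUTPUT)
    print(len(counters))               # 2
    print(nonzero_counters(counters))  # {}

``ast.literal_eval()`` only evaluates literals, so it is a safer choice than
``eval()`` for text captured from another process.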