.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible, future-proof interface that can be extended to support
  additional types of RAS data in the future.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

.. contents::

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
---------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  using ``node-id`` as a parameter.
* Query specific error counter values with the ``get-error-counter`` command,
  using both ``node-id`` and ``error-id`` as parameters.
* Clear specific error counters with the ``clear-error-counter`` command,
  using both ``node-id`` and ``error-id`` as parameters.

YAML-based Interface
--------------------

The interface is described in the YAML specification
``Documentation/netlink/specs/drm_ras.yaml``.

This YAML is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or node names can be used for all further queries, such as error
  counters.
* Error counters can be queried by either the error ID or the error name.
* Query parameters should be defined as part of the uAPI to ensure user
  interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}

Example: Clear an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
    None
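
Example: Sketch of a monitoring loop over the enumeration workflow

The usage notes describe a two-step flow: enumerate nodes first, then query
the counters of each node by ``node-id``. The following Python sketch
illustrates that flow only; it is not part of the kernel tree. The ``dump``
callable is a hypothetical stand-in for whatever Netlink backend a tool uses,
and ``fake_dump`` returns sample data modeled on the outputs above, with one
counter value changed to 3 purely for illustration. Only the operation and
attribute names (``list-nodes``, ``get-error-counter``, ``node-id``,
``error-name``, ``error-value``, and so on) come from this document.

```python
from typing import Callable, Dict, List

# Hypothetical backend signature: (operation name, request attributes) -> replies.
DumpFn = Callable[[str, dict], List[dict]]


def collect_nonzero_counters(dump: DumpFn) -> Dict[str, Dict[str, int]]:
    """Enumerate error-counter nodes, then return their non-zero counters."""
    report: Dict[str, Dict[str, int]] = {}
    for node in dump("list-nodes", {}):
        # Skip node types this sketch does not understand.
        if node.get("node-type") != "error-counter":
            continue
        counters = dump("get-error-counter", {"node-id": node["node-id"]})
        nonzero = {
            c["error-name"]: c["error-value"]
            for c in counters
            if c["error-value"]
        }
        if nonzero:
            report[f'{node["device-name"]}/{node["node-name"]}'] = nonzero
    return report


# Sample data shaped like the ynl outputs shown in this document.
# The value 3 is invented so that the report is not empty.
SAMPLE_NODES = [
    {"device-name": "0000:03:00.0", "node-id": 0,
     "node-name": "correctable-errors", "node-type": "error-counter"},
    {"device-name": "0000:03:00.0", "node-id": 1,
     "node-name": "uncorrectable-errors", "node-type": "error-counter"},
]
SAMPLE_COUNTERS = {
    0: [{"error-id": 1, "error-name": "error_name1", "error-value": 0},
        {"error-id": 2, "error-name": "error_name2", "error-value": 3}],
    1: [{"error-id": 1, "error-name": "error_name1", "error-value": 0},
        {"error-id": 2, "error-name": "error_name2", "error-value": 0}],
}


def fake_dump(op: str, attrs: dict) -> List[dict]:
    """In-memory stand-in for a real Netlink dump request."""
    if op == "list-nodes":
        return SAMPLE_NODES
    if op == "get-error-counter":
        return SAMPLE_COUNTERS[attrs["node-id"]]
    raise ValueError(f"unsupported operation: {op}")


if __name__ == "__main__":
    print(collect_nonzero_counters(fake_dump))
```

Passing the backend in as a callable keeps the traversal logic independent of
how requests reach the kernel, so the same function can be exercised with
canned data in tests and with a real Netlink client on hardware.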