.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible, future-proof interface that can be extended to support
  additional types of RAS data in the future.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
---------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  passing ``node-id`` as a parameter.
* Query specific error counter values with the ``get-error-counter`` command,
  using both ``node-id`` and ``error-id`` as parameters.
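
The three queries listed above differ only in the request mode and in the
attributes passed along. As a minimal sketch, the helper below assembles the
matching ``ynl`` command lines in Python. It assumes the ``--family``,
``--dump``, ``--do`` and ``--json`` flags used in this document's ``ynl``
examples; the ``ynl_command()`` helper itself is hypothetical and not part of
any kernel tool.

.. code-block:: python

    import json

    def ynl_command(mode, op, params=None):
        """Build an argv list for the ynl CLI (hypothetical helper).

        mode is "dump" (list many objects) or "do" (query a single one).
        """
        if mode not in ("dump", "do"):
            raise ValueError(f"unsupported mode: {mode}")
        argv = ["ynl", "--family", "drm_ras", f"--{mode}", op]
        if params:
            # Request attributes are passed as one JSON object.
            argv += ["--json", json.dumps(params)]
        return argv

    # List registered nodes.
    list_nodes = ynl_command("dump", "list-nodes")
    # List all error counters of node 0.
    all_counters = ynl_command("dump", "get-error-counter", {"node-id": 0})
    # Query a single error counter of node 0.
    one_counter = ynl_command("do", "get-error-counter",
                              {"node-id": 0, "error-id": 1})

Each returned list can be handed to ``subprocess.run()`` on a system where
``ynl`` is installed; depending on the setup, elevated privileges may be
needed.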

YAML-based Interface
--------------------

The interface is described in the YAML specification
``Documentation/netlink/specs/drm_ras.yaml``.

This YAML is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or node names can be used for all further queries, such as error counters.
* Error counters can be queried by either the error ID or the error name.
* Query parameters should be defined as part of the uAPI to ensure user interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
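
Because nodes must be enumerated before their IDs can be used, a monitoring
script typically turns the ``list-nodes`` reply into a lookup table first. The
sketch below does this in Python for the sample output shown above. It assumes
the reply is printed as a Python literal, as in that sample; the
``node_ids_by_name()`` helper and the ``SAMPLE`` string are illustrative only.

.. code-block:: python

    import ast

    # Sample "list-nodes" output, copied from the example above.
    SAMPLE = """[{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]"""

    def node_ids_by_name(text):
        """Map (device-name, node-name) to node-id (hypothetical helper)."""
        # literal_eval only accepts literals, so no code from the reply runs.
        nodes = ast.literal_eval(text)
        return {(n["device-name"], n["node-name"]): n["node-id"] for n in nodes}

    ids = node_ids_by_name(SAMPLE)
    print(ids[("0000:03:00.0", "uncorrectable-errors")])  # prints 1

The table is keyed on both the device name and the node name, since every
entry in the sample reply carries a ``device-name`` and several devices may
plausibly register nodes with the same name.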