.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is intended for real-time alerting, so that
the user knows when something bad has happened to a PCI device. Its goals are:

  * Provide alert debug information.
  * Self healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.
 
Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error,
rx/tx error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures

Drivers also provide default values for generic reporter parameters when
creating a health reporter.

Different parts of the driver can register different types of health reporters
with different handlers.
 
Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer.
  * Health status and statistics are updated for the reporter instance.
  * An object dump is taken and saved at the reporter instance. This is
    best effort and is skipped when recovery is aborted, auto-dump is disabled,
    no dump callback is registered, or a dump is already stored.
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period (and burst period) vs. time passed since the last recovery
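The actions listed above can be pictured with a small userspace model. This is
only a sketch under stated assumptions, not the kernel implementation: every
``toy_*`` name is hypothetical, time is passed in explicitly in milliseconds,
the trace event and the burst period are left out, recovery is assumed to
always succeed, and a report that arrives while the reporter is still in the
error state is assumed to abort recovery.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical userspace model; none of these names are kernel APIs. */
    enum toy_state { TOY_HEALTHY, TOY_ERROR };
    enum toy_result { TOY_NO_RECOVERY, TOY_RECOVERED, TOY_ABORTED };

    struct toy_reporter {
        enum toy_state state;
        bool auto_recover;          /* auto-recovery configuration */
        bool auto_dump;             /* auto-dump configuration */
        bool has_dump_cb;           /* driver registered a dump callback */
        bool dump_stored;           /* a dump is already saved */
        uint64_t grace_period_ms;
        uint64_t last_recovery_ms;
        unsigned int error_count;
        unsigned int recovery_count;
    };

    /* Handle one error report arriving at time now_ms. */
    enum toy_result toy_report(struct toy_reporter *r, uint64_t now_ms)
    {
        enum toy_state prev = r->state;
        bool in_grace;

        /* Update health status and statistics. */
        r->error_count++;
        r->state = TOY_ERROR;

        /*
         * Abort auto recovery if the previous error is still unrecovered
         * (an assumption of this sketch) or if the grace period since the
         * last recovery has not elapsed yet.
         */
        in_grace = r->recovery_count > 0 &&
                   now_ms - r->last_recovery_ms < r->grace_period_ms;
        if (r->auto_recover && (prev != TOY_HEALTHY || in_grace))
            return TOY_ABORTED;

        /* Best-effort dump: only one dump is kept per reporter. */
        if (r->auto_dump && r->has_dump_cb && !r->dump_stored)
            r->dump_stored = true;

        if (!r->auto_recover)
            return TOY_NO_RECOVERY;

        /* Assumption: the driver's recovery callback always succeeds. */
        r->state = TOY_HEALTHY;
        r->recovery_count++;
        r->last_recovery_ms = now_ms;
        return TOY_RECOVERED;
    }

In this model, with a grace period of 1000 ms, a first report at time 5000
stores a dump and recovers, while a second report at time 5500 falls inside the
grace period and is aborted, leaving the reporter in the error state.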
 
Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's
callback, which fills in the data using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
a JSON-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

The driver should use this API to fill the fmsg context in a format which
devlink later translates to a netlink message. When devlink needs to send
the data to the netlink layer using SKBs, it fragments the data between
different SKBs. To do this fragmentation, it uses virtual nest
attributes, avoiding actual nesting, which cannot be divided between
different SKBs.
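To make the JSON-like nesting more concrete, the following userspace sketch
builds a message out of an object, a name/value pair and a named value array.
It is an illustration only: the ``toy_*`` helpers are hypothetical stand-ins
and are not the kernel fmsg API, the field names are invented, and the
fragmentation of the data between SKBs is not modelled.

.. code-block:: c

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical toy formatter; NOT the kernel devlink_fmsg API. */
    struct toy_fmsg {
        char buf[256];
        size_t len;
        int need_sep;   /* nonzero if the next item needs a leading ", " */
    };

    /* Append raw text; text that does not fit is dropped in this toy. */
    static void toy_raw(struct toy_fmsg *f, const char *s)
    {
        size_t n = strlen(s);

        if (f->len + n >= sizeof(f->buf))
            return;
        memcpy(f->buf + f->len, s, n + 1);
        f->len += n;
    }

    /* Begin a new item or nest, preceded by a separator when required. */
    static void toy_begin(struct toy_fmsg *f, const char *s)
    {
        if (f->need_sep)
            toy_raw(f, ", ");
        toy_raw(f, s);
    }

    static void toy_obj_start(struct toy_fmsg *f)
    {
        toy_begin(f, "{");
        f->need_sep = 0;
    }

    static void toy_obj_end(struct toy_fmsg *f)
    {
        toy_raw(f, "}");
        f->need_sep = 1;
    }

    static void toy_arr_pair_start(struct toy_fmsg *f, const char *name)
    {
        char tmp[64];

        snprintf(tmp, sizeof(tmp), "\"%s\": [", name);
        toy_begin(f, tmp);
        f->need_sep = 0;
    }

    static void toy_arr_end(struct toy_fmsg *f)
    {
        toy_raw(f, "]");
        f->need_sep = 1;
    }

    static void toy_u32_put(struct toy_fmsg *f, unsigned int v)
    {
        char tmp[16];

        snprintf(tmp, sizeof(tmp), "%u", v);
        toy_begin(f, tmp);
        f->need_sep = 1;
    }

    static void toy_string_pair_put(struct toy_fmsg *f, const char *name,
                                    const char *v)
    {
        char tmp[96];

        snprintf(tmp, sizeof(tmp), "\"%s\": \"%s\"", name, v);
        toy_begin(f, tmp);
        f->need_sep = 1;
    }

    /* Plays the role of a driver callback filling in the message. */
    void toy_diagnose(struct toy_fmsg *f)
    {
        toy_obj_start(f);
        toy_string_pair_put(f, "state", "ok");
        toy_arr_pair_start(f, "queues");
        toy_u32_put(f, 0);
        toy_u32_put(f, 1);
        toy_arr_end(f);
        toy_obj_end(f);
    }

Calling ``toy_diagnose()`` on a zero-initialized ``struct toy_fmsg`` leaves
``{"state": "ok", "queues": [0, 1]}`` in ``buf``.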
 
User Interface
==============

The user can access and change each reporter's parameters and invoke its
driver-specific callbacks via ``devlink``, per error type (per health
reporter). Reporters may be registered for the whole devlink instance or for
a specific devlink port. The user can:

  * Configure the reporter's generic parameters (e.g. disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Retrieve an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
 
The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+