.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
===========

About this guide
----------------

This guide describes the basics of the PCI Express (PCIe) Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of Endpoint devices to conform with
the PCIe AER driver.


What is the PCIe AER Driver?
----------------------------

PCIe error signaling can occur on the PCIe link itself
or on behalf of transactions initiated on the link. PCIe
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCIe components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCIe Advanced Error Reporting
extended capability structure and provides more robust error reporting.

The PCIe AER driver provides the infrastructure to support the PCIe
Advanced Error Reporting capability. It provides three basic
functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports and Root Complex Event
Collectors (RCECs) that support the PCIe AER capability.


User Guide
==========

Include the PCIe AER Root Driver into the Linux Kernel
------------------------------------------------------

The PCIe AER driver is a Root Port service driver attached
via the PCIe Port Bus driver. To use it, the driver
must be compiled in. It is enabled with CONFIG_PCIEAER, which
depends on CONFIG_PCIEPORTBUS.

Load PCIe AER Root Driver
-------------------------

Some systems have AER support in firmware. Enabling Linux AER support
while the firmware also handles AER would result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI Firmware
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is output as a warning message; any other
error is printed as an error. Users can therefore choose a log level
that filters out correctable error messages.

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Requester ID)
  0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0:    [20] UnsupReq               (First)
  0000:50:00.0:   TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Refer to the PCIe specification for
the other fields.

AER Ratelimits
--------------

Since error messages can be generated for each transaction, large
volumes of errors may be reported. To prevent spammy devices from flooding
the console or stalling execution, messages are throttled by device and
error type (correctable vs. non-fatal uncorrectable). Fatal errors,
including DPC errors, are not ratelimited.

AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
DEFAULT_RATELIMIT_INTERVAL (5 seconds).

Ratelimits are exposed as sysfs attributes and are configurable.
See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
as sysfs attributes, which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer.

Developer Guide
===============

To enable error recovery, a software driver must provide callbacks.

To support AER better, developers need to understand how AER works.

PCIe errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCIe protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware.

Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCIe link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCIe link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When PCIe error reporting is enabled, a device will automatically send an
error message to the Root Port above it when it captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
Capability structure. Logging includes storing
the error reporting agent's Requester ID in the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.

Note that the errors described above are related to the PCIe
hierarchy and links. These errors do not include any device-specific
errors because device-specific errors will still get sent directly to
the device driver.

Provide callbacks
-----------------

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCIe AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

struct pci_driver has a pointer, err_handler, to a struct
pci_error_handlers, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except for the PCIe-specific parts (see
below). Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCIe protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Uncorrectable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AER driver performs a Secondary Bus Reset to recover from
uncorrectable errors. The reset is applied at the port above
the originating device: If the originating device is an Endpoint,
only the Endpoint is reset. If on the other hand the originating
device has subordinate devices, those are all affected by the
reset as well.

If the originating device is a Root Complex Integrated Endpoint,
there's no port above where a Secondary Bus Reset could be applied.
In this case, the AER driver instead applies a Function Level Reset.

If an error message indicates a non-fatal error, performing a reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and Endpoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover without a reset, considers the device unrecoverable,
or needs a reset for recovery. If all affected drivers agree that they can
recover without a reset, the reset is skipped. Should one driver request a
reset, it overrides all other drivers.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A reset upstream is then
necessary. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
to indicate that recovery without a reset is possible, the error
handling goes to mmio_enabled, but afterwards a reset is still
performed.

In other words, for non-fatal errors, drivers may opt in to a reset.
But for fatal errors, they cannot opt out of a reset, based on the
assumption that the link is unreliable.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCIe device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices bound to the driver won't be recovered.
  The kernel will print out informational messages to identify
  unrecoverable devices.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting into the new kernel or inserting the module, a device file
named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://github.com/intel/aer-inject.git

More information about aer-inject can be found in the documentation
included with its source code.
