.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
===========

About this guide
----------------

This guide describes the basics of the PCI Express (PCIe) Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of Endpoint devices to conform with
the PCIe AER driver.


What is the PCIe AER Driver?
----------------------------

PCIe error signaling can occur on the PCIe link itself
or on behalf of transactions initiated on the link. PCIe
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCIe components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCIe Advanced Error Reporting
extended capability structure and provides more robust error reporting.

The PCIe AER driver provides the infrastructure to support PCIe Advanced
Error Reporting capability. The PCIe AER driver provides three basic
functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports and RCECs that support the PCIe
AER capability.


User Guide
==========

Include the PCIe AER Root Driver into the Linux Kernel
------------------------------------------------------

The PCIe AER driver is a Root Port service driver attached
via the PCIe Port Bus driver. To use it, the driver
must be compiled. It is enabled with CONFIG_PCIEAER, which
depends on CONFIG_PCIEPORTBUS.

Load PCIe AER Root Driver
-------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER would result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI Firmware
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is output as a warning message; any other
error is printed as an error. Users can therefore choose a log level
that filters out correctable error messages.

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Requester ID)
  0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0:    [20] UnsupReq               (First)
  0000:50:00.0:   TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Refer to the PCIe specification for
the other fields.

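As a rough illustration of how the 'error status/mask' line can be read,
the sketch below lists the bits that are set in a status value and not
masked. It is not kernel code: aer_unmasked_bits() and unc_bit_name() are
hypothetical helper names, and only three bit names are filled in (bit 20
is taken from the example above; bits 14 and 18 are assumed from the
Uncorrectable Error Status register layout in the PCIe specification).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: short names for a few Uncorrectable Error Status
 * bits. Bits without an entry return NULL; the PCIe specification has
 * the full register layout.
 */
const char *unc_bit_name(unsigned int bit)
{
	switch (bit) {
	case 14: return "CmpltTO";	/* Completion Timeout */
	case 18: return "MalfTLP";	/* Malformed TLP */
	case 20: return "UnsupReq";	/* Unsupported Request */
	default: return NULL;
	}
}

/*
 * Hypothetical helper: store the positions of all bits that are set in
 * 'status' but clear in 'mask' (lowest first) and return their count.
 */
int aer_unmasked_bits(uint32_t status, uint32_t mask, unsigned int bits[32])
{
	uint32_t active = status & ~mask;
	int n = 0;

	for (unsigned int i = 0; i < 32; i++)
		if (active & (UINT32_C(1) << i))
			bits[n++] = i;
	return n;
}
```

With the values from the example (status 00100000, mask 00000000),
aer_unmasked_bits() reports the single bit 20, which unc_bit_name() maps
to UnsupReq.
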
The 'TLP Header' is the prefix/header of the TLP that caused the error
in raw hex format. To decode the TLP Header into human-readable form
one may use tlp-tool:

https://github.com/mmpg-x86/tlp-tool

Example usage::

  curl -L https://git.kernel.org/linus/2ca1c94ce0b6 | rtlp-tool --aer

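For a quick look without an external tool, the first header DWORD can be
split by hand. The sketch below is an illustration only: tlp_decode_dw0()
and struct tlp_dw0 are hypothetical names, and it assumes that the first
value on the 'TLP Header' line is DW0 with the Fmt, Type and Length fields
at the bit positions given in the PCIe specification.

```c
#include <stdint.h>

/* Fields of the first TLP header DWORD (hypothetical struct name). */
struct tlp_dw0 {
	unsigned int fmt;	/* bits 31:29 */
	unsigned int type;	/* bits 28:24 */
	unsigned int length;	/* bits 9:0, Length field as encoded */
};

/* Extract the Fmt, Type and Length fields from DW0. */
struct tlp_dw0 tlp_decode_dw0(uint32_t dw0)
{
	struct tlp_dw0 f = {
		.fmt = (dw0 >> 29) & 0x7,
		.type = (dw0 >> 24) & 0x1f,
		.length = dw0 & 0x3ff,
	};

	return f;
}
```

For the DW0 value 0x04000001 from the log example, this yields Fmt 0,
Type 4 (binary 00100) and Length 1, the combination the PCIe
specification assigns to a Type 0 Configuration Read. The remaining
DWORDs depend on the TLP type and are left to tlp-tool.
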
AER Ratelimits
--------------

Since error messages can be generated for each transaction, we may see
large volumes of errors reported. To prevent spammy devices from flooding
the console/stalling execution, messages are throttled by device and error
type (correctable vs. non-fatal uncorrectable).  Fatal errors, including
DPC errors, are not ratelimited.

AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
DEFAULT_RATELIMIT_INTERVAL (5 seconds).

Ratelimits are exposed as sysfs attributes and are configurable.
See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.

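The throttling policy described above can be modelled in user space as
follows. This is a simplified fixed-window sketch, not the kernel's
ratelimit implementation; all names (aer_should_print(), struct
dev_ratelimits and so on) are hypothetical, and only the burst and
interval values come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define BURST		10	/* DEFAULT_RATELIMIT_BURST per the text */
#define INTERVAL_MS	5000	/* DEFAULT_RATELIMIT_INTERVAL (5 seconds) */

enum aer_sev { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/* One window per device and ratelimited error type. */
struct rl_window {
	bool started;
	uint64_t begin_ms;
	unsigned int printed;
};

struct dev_ratelimits {
	struct rl_window win[2];	/* [SEV_CORRECTABLE], [SEV_NONFATAL] */
};

/* Return true if a message of this severity may be printed at now_ms. */
bool aer_should_print(struct dev_ratelimits *rl, enum aer_sev sev,
		      uint64_t now_ms)
{
	struct rl_window *w;

	if (sev == SEV_FATAL)
		return true;		/* fatal errors are never throttled */

	w = &rl->win[sev];
	if (!w->started || now_ms - w->begin_ms >= INTERVAL_MS) {
		w->started = true;	/* open a new window */
		w->begin_ms = now_ms;
		w->printed = 0;
	}
	if (w->printed >= BURST)
		return false;		/* burst used up in this window */
	w->printed++;
	return true;
}
```

In this model, a device that reports twelve correctable errors within a
few milliseconds gets ten messages printed and two dropped, while its
non-fatal and fatal messages are unaffected because each ratelimited type
has its own window.
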
AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the corresponding counters / statistics
are also exposed as sysfs attributes, which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer.

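A monitoring tool might pick a single counter out of such an attribute
once its contents have been read into a buffer. The sketch below assumes
that the attribute text (for example from aer_dev_correctable) consists
of "<name> <count>" lines; that layout is an assumption here, and the ABI
document named above is authoritative. aer_counter_lookup() is a
hypothetical helper name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up 'name' in text made of "<name> <count>" lines and return the
 * count, or -1 if no line matches. The name may contain spaces; the
 * count must be the only field after the name.
 */
long aer_counter_lookup(const char *text, const char *name)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);

		/* Match "<name><space><digits>" against the whole line. */
		if (len > name_len + 1 &&
		    strncmp(line, name, name_len) == 0 &&
		    line[name_len] == ' ') {
			char *stop;
			long val = strtol(line + name_len + 1, &stop, 10);

			if (stop == line + len && stop != line + name_len + 1)
				return val;
		}
		line = end ? end + 1 : line + len;
	}
	return -1;
}
```

Given the illustrative text "RxErr 2\nBadTLP 0\nTOTAL_ERR_COR 2\n", a
lookup of TOTAL_ERR_COR returns 2 and a lookup of a name that is not
present returns -1.
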
Developer Guide
===============

To enable error recovery, a software driver must provide callbacks.

To support AER better, developers need to understand how AER works.

PCIe errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCIe protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware.

Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCIe link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCIe link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When PCIe error reporting is enabled, a device will automatically send an
error message to the Root Port above it when it captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
Capability structure. Logging the error includes storing
the error reporting agent's Requester ID in the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.

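To make the logged Requester ID more concrete, the sketch below splits a
32-bit Error Source Identification value into its two 16-bit source IDs
and decodes one ID into bus, device and function numbers. The helper
names are hypothetical. The field positions are assumed from the PCIe
specification, and the device/function split assumes a non-ARI device.

```c
#include <stdint.h>

/* Bus/device/function triple encoded in a 16-bit Requester ID. */
struct pci_bdf {
	unsigned int bus;	/* bits 15:8 */
	unsigned int dev;	/* bits 7:3 */
	unsigned int fn;	/* bits 2:0 */
};

/* Decode a Requester ID into its bus, device and function numbers. */
struct pci_bdf requester_id_to_bdf(uint16_t id)
{
	struct pci_bdf b = {
		.bus = id >> 8,
		.dev = (id >> 3) & 0x1f,
		.fn = id & 0x7,
	};

	return b;
}

/* Lower half: source of the first correctable error message. */
uint16_t err_src_correctable(uint32_t err_src)
{
	return err_src & 0xffff;
}

/* Upper half: source of the first fatal or non-fatal error message. */
uint16_t err_src_uncorrectable(uint32_t err_src)
{
	return err_src >> 16;
}
```

A Requester ID of 0x5000 decodes to bus 0x50, device 0, function 0, which
corresponds to the 50:00.0 part of a device address such as 0000:50:00.0.
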
Note that the errors as described above are related to the PCIe
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

Provide callbacks
-----------------

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCIe AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.

struct pci_driver has a pointer, err_handler, to a struct
pci_error_handlers, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except for the PCIe-specific parts (see
below). Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCIe protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Uncorrectable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AER driver performs a Secondary Bus Reset to recover from
uncorrectable errors. The reset is applied at the port above
the originating device: If the originating device is an Endpoint,
only the Endpoint is reset. If on the other hand the originating
device has subordinate devices, those are all affected by the
reset as well.

If the originating device is a Root Complex Integrated Endpoint,
there's no port above where a Secondary Bus Reset could be applied.
In this case, the AER driver instead applies a Function Level Reset.

If an error message indicates a non-fatal error, performing a reset
at upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and Endpoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover without a reset, considers the device unrecoverable,
or needs a reset for recovery. If all affected drivers agree that they can
recover without a reset, the reset is skipped. Should one driver request a
reset, it overrides all other drivers.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Performing a reset at upstream is then
necessary. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
to indicate that recovery without a reset is possible, the error
handling goes to mmio_enabled, but afterwards a reset is still
performed.

In other words, for non-fatal errors, drivers may opt in to a reset.
But for fatal errors, they cannot opt out of a reset, based on the
assumption that the link is unreliable.

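The reset policy above can be summarised in a small user-space model.
This is a sketch of the rules as stated in this section, not the AER
driver's code: the enum values are simplified stand-ins for the
PCI_ERS_RESULT_* return values, and reset_is_performed() and
channel_state() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

enum severity { NONFATAL, FATAL };

/* Simplified stand-ins for the PCI_ERS_RESULT_* votes named in the text. */
enum vote { CAN_RECOVER, NEED_RESET, DISCONNECT };

/* Channel state passed to error_detected() for each severity. */
const char *channel_state(enum severity sev)
{
	return sev == FATAL ? "pci_channel_io_frozen" : "pci_channel_io_normal";
}

/*
 * Decide whether a reset is performed once every driver in the
 * hierarchy has voted from its error_detected() callback.
 */
bool reset_is_performed(enum severity sev, const enum vote *votes, size_t n)
{
	if (sev == FATAL)
		return true;	/* drivers cannot opt out of the reset */

	for (size_t i = 0; i < n; i++)
		if (votes[i] == NEED_RESET)
			return true;	/* one request overrides the others */

	return false;		/* no driver asked for a reset */
}
```

For a non-fatal error where every driver votes CAN_RECOVER, the model
skips the reset; a single NEED_RESET vote, or a fatal severity, makes it
report that a reset is performed.
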
Frequently Asked Questions
--------------------------

Q:
  What happens if a PCIe device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices bound to the driver won't be recovered.
  The kernel will print out informational messages to identify
  unrecoverable devices.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

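A test harness may want to confirm that the device file is present before
it starts injecting errors. The sketch below checks that a path exists
and is a character device. is_char_device() is a hypothetical helper
name, and the sketch assumes /dev/aer_inject is created as a character
device node.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Return true if 'path' exists and is a character device node. */
bool is_char_device(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false;	/* missing or inaccessible */
	return S_ISCHR(st.st_mode);
}
```

Calling is_char_device("/dev/aer_inject") then indicates whether the
injection interface is available on the running kernel.
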
Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://github.com/intel/aer-inject.git

More information about aer-inject can be found in the documentation in
its source tree.
268