.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

==========================
EDAC Memory Repair Control
==========================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)
:Original Reviewers:

- Written for: 6.15

Introduction
------------

Some memory devices support repair operations to address issues in their
memory media. Post Package Repair (PPR) and memory sparing are examples of
such features.

Post Package Repair (PPR)
~~~~~~~~~~~~~~~~~~~~~~~~~

Post Package Repair is a maintenance operation which requests the memory
device to perform a repair operation on its media. It is a memory self-healing
feature that fixes a failing memory location by replacing it with a spare row
in a DRAM device.

For example, a CXL memory device with DRAM components that support PPR
features may implement PPR maintenance operations. DRAM components support
the following types of PPR functions:

 - hard PPR, for a permanent row repair, and
 - soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost after a power
cycle.

Data may not be retained and memory requests may not be processed correctly
during a repair operation. In such cases, the repair operation should not be
executed at runtime.

For CXL memory devices, see CXL spec rev 3.1 [1]_ sections
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation
and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.

Memory Sparing
~~~~~~~~~~~~~~

Memory sparing is a repair function that replaces a portion of memory with
a portion of functional memory at a particular granularity. Memory sparing
supports cacheline, row, bank and rank granularities. For example, in
rank memory-sparing mode, one memory rank serves as a spare for other ranks on
the same channel in case they fail.

The spare rank is held in reserve and is not used as active memory until
a failure is indicated. The reserved capacity is subtracted from the total
available memory in the system.

After an error threshold is surpassed in a system protected by memory sparing,
the content of a failing rank of DIMMs is copied to the spare rank. The
failing rank is then taken offline and the spare rank is placed online for use
as active memory in place of the failed rank.

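The rank-sparing sequence described above can be sketched as a toy model.
This is only an illustration of the order of steps (threshold check, copy,
offline, online). The ``Rank`` class, the ``maybe_spare`` helper and the
threshold value are hypothetical, and real sparing is carried out by the
memory device or controller, not by software like this.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    """Toy model of one memory rank on a channel (hypothetical, illustrative)."""
    name: str
    data: list = field(default_factory=list)
    corrected_errors: int = 0
    online: bool = True


def maybe_spare(active, spare, threshold):
    """Fail over from `active` to `spare` once the error threshold is surpassed.

    Returns the rank that should serve memory requests afterwards.
    """
    if active.corrected_errors <= threshold:
        return active                  # threshold not surpassed, nothing to do
    spare.data = list(active.data)     # copy the failing rank's content
    active.online = False              # take the failing rank offline
    spare.online = True                # place the spare rank online instead
    return spare


rank0 = Rank("rank0", data=[0xAA, 0xBB], corrected_errors=12)
spare = Rank("spare", online=False)
in_use = maybe_spare(rank0, spare, threshold=10)
print(in_use.name, rank0.online, spare.online)
```

Until the threshold is crossed the spare stays offline, which is why its
capacity is not counted as available memory.
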
For example, CXL memory devices can support various sparing subclasses, which
vary in the scope of the sparing being performed.

The cacheline sparing subclass refers to a sparing action that can replace a
full cacheline. Row sparing is provided as an alternative to PPR sparing
functions and its scope is that of a single DDR row. Bank sparing allows an
entire bank to be replaced. Rank sparing is defined as an operation in which
an entire DDR rank is replaced.

See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
Operations for more details.

.. [1] https://computeexpresslink.org/cxl-specification/

Use cases of generic memory repair features control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. The soft PPR, hard PPR and memory-sparing features share similar control
   attributes. Therefore, there is a need for a standardized, generic sysfs
   repair control that is exposed to userspace and used by administrators,
   scripts and tools.

2. When a CXL device detects an error in a memory component, it informs the
   host of the need for a repair maintenance operation by using an event
   record where the "maintenance needed" flag is set. The event record
   specifies the device physical address (DPA) and attributes of the memory
   that requires repair. The kernel reports the corresponding CXL general
   media or DRAM trace event to userspace, and userspace tools (e.g.
   rasdaemon) initiate a repair maintenance operation in response to the
   device request using the sysfs repair control.

3. Userspace tools, such as rasdaemon, request a repair operation on a memory
   region when the "maintenance needed" flag is set, when an uncorrected
   memory error is reported, when the number of corrected memory errors
   exceeds a threshold value, or when the "corrected errors threshold
   exceeded" flag is set for that memory.

4. Multiple PPR/sparing instances may be present per memory device.

5. Drivers should enforce that live repair is safe. In systems where memory
   mapping functions can change between boots, one approach is to log the
   memory errors seen during the current boot and to check live memory repair
   requests against that log.

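The triggers in use case 3 and the boot-time log in use case 5 can be sketched
as follows. This is a minimal illustration of the logic only, not rasdaemon or
driver code: every name here (``should_request_repair``, ``BootErrorLog`` and
their parameters) is hypothetical, and in the kernel the use case 5 check
would live in a driver written in C.

```python
def should_request_repair(maintenance_needed, uncorrected_error,
                          corrected_count, ce_threshold,
                          threshold_exceeded_flag=False):
    """Return True if any of the triggers listed in use case 3 applies."""
    return bool(maintenance_needed
                or uncorrected_error
                or corrected_count > ce_threshold
                or threshold_exceeded_flag)


class BootErrorLog:
    """Addresses for which errors were reported during the current boot."""

    def __init__(self):
        self._seen = set()

    def record_error(self, dpa):
        self._seen.add(dpa)

    def live_repair_allowed(self, dpa):
        # Accept a live repair request only for an address whose error was
        # seen on this boot, so an address recorded under a previous boot's
        # memory mapping is never acted upon.
        return dpa in self._seen


log = BootErrorLog()
log.record_error(0x300000)
print(should_request_repair(False, False, corrected_count=25, ce_threshold=20))
print(log.live_repair_allowed(0x300000), log.live_repair_allowed(0x400000))
```
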
The File System
---------------

The control attributes of a registered memory repair instance can be
accessed in the /sys/bus/edac/devices/<dev-name>/mem_repairX/ directory.

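As a sketch of how a tool might discover the registered instances, the
following walks the directory layout given above. The function name
``list_mem_repair_instances`` is hypothetical, and the code assumes only that
instances appear as ``mem_repairX`` directories (``X`` being a number) under
each EDAC device.

```python
import re
from pathlib import Path

EDAC_DEVICES = Path("/sys/bus/edac/devices")


def list_mem_repair_instances(root=EDAC_DEVICES):
    """Return {device name: [mem_repairX directory names]} found under root."""
    found = {}
    root = Path(root)
    if not root.is_dir():
        return found          # no EDAC bus directory on this system
    for dev in sorted(root.iterdir()):
        if not dev.is_dir():  # is_dir() follows the sysfs symlinks
            continue
        names = sorted(p.name for p in dev.iterdir()
                       if p.is_dir() and re.fullmatch(r"mem_repair\d+", p.name))
        if names:
            found[dev.name] = names
    return found


if __name__ == "__main__":
    for dev, instances in list_mem_repair_instances().items():
        print(dev, instances)
```

On a system without EDAC memory repair support the function returns an empty
mapping rather than raising an error.
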
sysfs
-----

Sysfs files are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.

Examples
--------

Memory repair usage takes the form shown in these examples:

1. CXL memory sparing

Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
for this operation, cacheline/row/bank/rank sparing, vary in terms of the
scope of the sparing being performed.

Memory sparing maintenance operations may be supported by CXL devices that
implement the CXL.mem protocol. A sparing maintenance operation requests the
CXL device to perform a repair operation on its media. For example, a CXL
device with DRAM components that support memory sparing features may
implement sparing maintenance operations.

2. CXL memory Soft Post Package Repair (sPPR)

Post Package Repair (PPR) maintenance operations may be supported by CXL
devices that implement the CXL.mem protocol. A PPR maintenance operation
requests the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support PPR features
may implement PPR maintenance operations. Soft PPR (sPPR) is a temporary
row repair. Soft PPR may be faster, but the repair is lost with a power
cycle.

Sysfs files for memory repair are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.

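A userspace tool could drive such a repair through the sysfs control roughly
as sketched below. The attribute file names used here (``dpa`` for the
address to repair and ``repair`` as the trigger, with ``1`` meaning "issue
the operation") are assumptions and must be checked against the ABI file
named above, as must the attributes a given repair type requires.
``request_repair`` itself is a hypothetical helper.

```python
from pathlib import Path


def request_repair(instance_dir, attrs):
    """Write the repair attributes, then trigger the repair.

    instance_dir: path of one mem_repairX directory.
    attrs: mapping of attribute file name to value, e.g. {"dpa": 0x300000}.
           The valid names are defined by the sysfs ABI file, not by this code.
    """
    instance = Path(instance_dir)
    for name, value in attrs.items():
        text = hex(value) if isinstance(value, int) else str(value)
        (instance / name).write_text(text)
    # Assumption: writing 1 to the "repair" file issues the operation.
    (instance / "repair").write_text("1")


# Hypothetical usage, requires root and a device that supports the feature:
# request_repair("/sys/bus/edac/devices/<dev-name>/mem_repairX",
#                {"dpa": 0x300000})
```

Because data may not be retained during a repair, a real tool would first
confirm that the operation is safe to run on memory that is in use.
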