.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

==========================
EDAC Memory Repair Control
==========================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)
:Original Reviewers:

- Written for: 6.15

Introduction
------------

Some memory devices support repair operations to address issues in their
memory media. Post Package Repair (PPR) and memory sparing are examples of
such features.

Post Package Repair (PPR)
~~~~~~~~~~~~~~~~~~~~~~~~~

Post Package Repair is a maintenance operation which requests the memory
device to perform a repair operation on its media. It is a memory self-healing
feature that fixes a failing memory location by replacing it with a spare row
in a DRAM device.

For example, a CXL memory device with DRAM components that support PPR
features may implement PPR maintenance operations. DRAM components support
the following types of PPR functions:

 - hard PPR, for a permanent row repair, and
 - soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost after a power
cycle.

The data may not be retained and memory requests may not be correctly
processed during a repair operation. In such a case, the repair operation
should not be executed at runtime.

For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation
and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.

Memory Sparing
~~~~~~~~~~~~~~

Memory sparing is a repair function that replaces a portion of memory with
a portion of functional memory at a particular granularity.
Memory sparing has cacheline/row/bank/rank sparing granularities. For example,
in rank memory-sparing mode, one memory rank serves as a spare for other ranks
on the same channel in case they fail.

The spare rank is held in reserve and not used as active memory until
a failure is indicated, with reserved capacity subtracted from the total
available memory in the system.

After an error threshold is surpassed in a system protected by memory sparing,
the content of a failing rank of DIMMs is copied to the spare rank. The
failing rank is then taken offline and the spare rank placed online for use as
active memory in place of the failed rank.

For example, CXL memory devices can support various subclasses of the sparing
operation, which vary in terms of the scope of the sparing being performed.

The cacheline sparing subclass refers to a sparing action that can replace
a full cacheline. Row sparing is provided as an alternative to PPR sparing
functions and its scope is that of a single DDR row. Bank sparing allows an
entire bank to be replaced. Rank sparing is defined as an operation in which
an entire DDR rank is replaced.

See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
Operations for more details.

.. [1] https://computeexpresslink.org/cxl-specification/

Use cases of generic memory repair features control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. The soft PPR, hard PPR and memory-sparing features share similar control
   attributes. Therefore, there is a need for a standardized, generic sysfs
   repair control that is exposed to userspace and used by administrators,
   scripts and tools.

2. When a CXL device detects an error in a memory component, it informs the
   host of the need for a repair maintenance operation by using an event
   record where the "maintenance needed" flag is set.
   The event record specifies the device physical address (DPA) and
   attributes of the memory that requires repair. The kernel reports the
   corresponding CXL general media or DRAM trace event to userspace, and
   userspace tools (e.g. rasdaemon) initiate a repair maintenance operation
   in response to the device request using the sysfs repair control.

3. Userspace tools, such as rasdaemon, request a repair operation on a memory
   region when the "maintenance needed" flag is set, when an uncorrected
   memory error or an excess of corrected memory errors above a threshold
   value is reported, or when the flag indicating that the corrected errors
   threshold has been exceeded is set for that memory.

4. Multiple PPR/sparing instances may be present per memory device.

5. Drivers should enforce that live repair is safe. In systems where memory
   mapping functions can change between boots, one approach to this is to log
   memory errors seen on this boot against which to check live memory repair
   requests.

The File System
---------------

The control attributes of a registered memory repair instance can be
accessed in /sys/bus/edac/devices/<dev-name>/mem_repairX/

sysfs
-----

Sysfs files are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.

Examples
--------

The memory repair usage takes the form shown in these examples:

1. CXL memory sparing

Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
for this operation (cacheline, row, bank and rank sparing) vary in terms of
the scope of the sparing being performed.

Memory sparing maintenance operations may be supported by CXL devices that
implement the CXL.mem protocol. A sparing maintenance operation requests the
CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support memory sparing
features may implement sparing maintenance operations.

2. CXL memory Soft Post Package Repair (sPPR)

Post Package Repair (PPR) maintenance operations may be supported by CXL
devices that implement the CXL.mem protocol. A PPR maintenance operation
requests the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support PPR features
may implement PPR maintenance operations. Soft PPR (sPPR) is a temporary
row repair. Soft PPR may be faster than hard PPR, but the repair is lost
with a power cycle.

Sysfs files for memory repair are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.
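
Because several repair instances may be registered per memory device, a tool
usually has to discover the mem_repairX directories before it can use them.
The following is a minimal Python sketch of that discovery step, not part of
the kernel or of rasdaemon. It assumes only the directory layout given in
"The File System" above; the helper names ``list_repair_instances`` and
``read_attributes`` are hypothetical, and no attribute names are assumed: the
sketch reads whatever regular files it finds, so the ABI document remains the
authority on what each file means.

```python
import os
import re

# Base path taken from "The File System" section of this document.
EDAC_DEVICES = "/sys/bus/edac/devices"


def list_repair_instances(dev_dir):
    """Return the mem_repairX directory names under dev_dir, ordered by X.

    Hypothetical helper: dev_dir is expected to be a device directory such
    as EDAC_DEVICES/<dev-name>.
    """
    pattern = re.compile(r"^mem_repair(\d+)$")
    found = []
    for name in os.listdir(dev_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(dev_dir, name)):
            found.append((int(match.group(1)), name))
    # Sort numerically so that mem_repair2 comes before mem_repair10.
    return [name for _, name in sorted(found)]


def read_attributes(instance_dir):
    """Read every readable regular file in one repair instance directory.

    Hypothetical helper: attribute names are not hard-coded; files that
    cannot be read (for example write-only controls) are skipped.
    """
    values = {}
    for name in sorted(os.listdir(instance_dir)):
        path = os.path.join(instance_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as attr:
                values[name] = attr.read().strip()
        except OSError:
            continue
    return values
```

A caller would join ``EDAC_DEVICES`` with a device name (for instance the
hypothetical ``cxl_mem0``), pass the result to ``list_repair_instances()``,
and then call ``read_attributes()`` on each returned directory. The sketch
only reads; issuing a repair means writing to the control files described in
the ABI document and should be done only when the driver reports that it is
safe.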