Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc...

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!).  In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick.  These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel.  In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. Most often it is also
called a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two channels
in the same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. In exchange, lockstep mode
is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g., if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g., if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows are 64 bits wide for single channel
and 128 bits wide for dual channel. It may not be visible to the memory
controller, as some DIMM types have a memory buffer that can hide direct
access to it from the memory controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices.  The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row.  A single socket
set has two chip-select rows, and if double-sided sticks are used these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.

* High Bandwidth Memory (HBM)

HBM is a new memory type with low power consumption and ultra-wide
communication lanes. It uses vertically stacked memory chips (DRAM dies)
interconnected by microscopic wires called "through-silicon vias," or
TSVs.

Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
interconnect called the "interposer". Therefore, HBM's characteristics
are nearly indistinguishable from on-chip integrated RAM.

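The width arithmetic used in the definitions above (x4/x8 memory devices
filling a 72-bit bus, and single- versus double-channel accesses) can be
sketched as follows. This is only an illustration: the function names and
default values are assumptions made for this example and are not part of
any EDAC API.

```python
# Illustrative sketch only; names and defaults are assumptions.

def devices_per_rank(device_width_bits, bus_width_bits=72):
    """Number of DRAM devices needed in parallel to fill the data bus."""
    if bus_width_bits % device_width_bits != 0:
        raise ValueError("bus width must be a multiple of the device width")
    return bus_width_bits // device_width_bits


def access_width_bits(dimms_in_parallel, data_bits_per_dimm=64):
    """Data bits delivered per access when DIMMs are read at the same time."""
    return dimms_in_parallel * data_bits_per_dimm


print(devices_per_rank(4))   # x4 devices on a 64 + 8 bit bus
print(devices_per_rank(8))   # x8 devices on a 64 + 8 bit bus
print(access_width_bits(1))  # single-channel
print(access_width_bits(2))  # double-channel
```
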
Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Memory controllers are allocated with :c:func:`edac_mc_alloc`, which
internally uses the struct ``mem_ctl_info`` to describe them. This struct
is opaque to the EDAC drivers: only the EDAC core is allowed to touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation in sysfs.

This set of structures, and the code that implements the APIs for them,
provides for registering EDAC type devices which are NOT standard memory
or PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

	/sys/devices/system/edac/..

	pci/		<existing pci directory (if available)>
	mc/		<existing memory device directory>
	cpu/cpu0/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	cpu/cpu1/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	...

	the L1 and L2 directories would be "edac_device_block's"

.. kernel-doc:: drivers/edac/edac_device.h

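As a rough user-space illustration of the sysfs layout sketched above for
CPU caches, the following Python snippet walks a directory tree and
collects every ``ce_count`` and ``ue_count`` it finds. The helper name
``read_error_counts`` is hypothetical, and which directories actually
exist under ``/sys/devices/system/edac`` depends on the drivers loaded on
a given system.

```python
import os


def read_error_counts(root):
    """Collect ce_count/ue_count values found under `root`.

    Returns a dict mapping each directory (relative to `root`) to a dict
    such as {"ce_count": 0, "ue_count": 0}.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        found = {}
        for name in ("ce_count", "ue_count"):
            if name in filenames:
                # Each counter file is expected to hold one decimal number.
                with open(os.path.join(dirpath, name)) as f:
                    found[name] = int(f.read().strip())
        if found:
            counts[os.path.relpath(dirpath, root)] = found
    return counts
```

Calling ``read_error_counts("/sys/devices/system/edac")`` on a system with
EDAC drivers loaded would return one entry per directory that exposes
these counters. On a system without them, the result is simply empty.
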

Heterogeneous system support
----------------------------

An AMD heterogeneous system is built by connecting the data fabrics of
both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
GPU nodes can be accessed the same way as the data fabric on CPU nodes.

The MI200 accelerators are data center GPUs. They have 2 data fabrics,
and each GPU data fabric contains four Unified Memory Controllers (UMC).
Each UMC contains eight channels. Each UMC channel controls one 128-bit
HBM2e (2GB) channel (equivalent to 8 X 2GB ranks).  This creates a
4096-bit DRAM data bus in total.

While the UMC is interfacing a 16GB (8-high X 2GB DRAM) HBM stack, each UMC
channel is interfacing 2GB of DRAM (represented as rank).

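The figures above can be cross-checked with a few lines of arithmetic.
This sketch reads the 4096-bit bus width as a per-data-fabric total, which
is an interpretation of the text rather than something it states
explicitly. The constant names are made up for the example.

```python
# Figures taken from the description above; names are illustrative.
UMCS_PER_DATA_FABRIC = 4
CHANNELS_PER_UMC = 8
CHANNEL_WIDTH_BITS = 128
CHANNEL_SIZE_GB = 2

channels = UMCS_PER_DATA_FABRIC * CHANNELS_PER_UMC
bus_width_bits = channels * CHANNEL_WIDTH_BITS
umc_size_gb = CHANNELS_PER_UMC * CHANNEL_SIZE_GB
data_fabric_size_gb = UMCS_PER_DATA_FABRIC * umc_size_gb

print(channels)             # UMC channels per data fabric
print(bus_width_bits)       # DRAM data bus width in bits
print(umc_size_gb)          # capacity behind one UMC, in GB
print(data_fabric_size_gb)  # capacity behind one data fabric, in GB
```
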
Memory controllers on AMD GPU nodes can be represented in EDAC as follows::

	GPU DF / GPU Node -> EDAC MC
	GPU UMC           -> EDAC CSROW
	GPU UMC channel   -> EDAC CHANNEL

For example, consider a heterogeneous system with one AMD CPU connected to
four MI200 (Aldebaran) GPUs using xGMI.

Some more heterogeneous hardware details:

- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
  They have chip selects (csrows) and channels. However, the layouts are different
  for performance, physical layout, or other reasons.
- CPU UMCs use 1 channel. In this case, UMC = EDAC channel. This follows the
  marketing speak: a CPU has X memory channels, etc.
- CPU UMCs use up to 4 chip selects. So UMC chip select = EDAC CSROW.
- GPU UMCs use 1 chip select. So UMC = EDAC CSROW.
- GPU UMCs use 8 channels. So UMC channel = EDAC channel.

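The mapping rules in the list above can be written down as two small
functions. This is a sketch for illustration only: the function names are
hypothetical and do not correspond to functions in the EDAC drivers.

```python
def gpu_umc_to_edac(umc, umc_channel):
    """GPU: a UMC has one chip select, so the UMC index is the csrow."""
    return {"csrow": umc, "channel": umc_channel}


def cpu_umc_to_edac(umc, chip_select):
    """CPU: a UMC is one channel, and its chip selects are the csrows."""
    return {"csrow": chip_select, "channel": umc}


# GPU UMC 3, channel 7 and CPU UMC 1, chip select 2.
print(gpu_umc_to_edac(3, 7))
print(cpu_umc_to_edac(1, 2))
```
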
The EDAC subsystem provides a mechanism to handle AMD heterogeneous
systems by calling system specific ops for both CPUs and GPUs.

AMD GPU nodes are enumerated in sequential order based on the PCI
hierarchy, and the first GPU node is assumed to have a Node ID value
following those of the CPU nodes after the latter are fully populated::

	$ ls /sys/devices/system/edac/mc/
		mc0   - CPU MC node 0
		mc1  |
		mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
		mc3  |
		mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
		mc5  |
		mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
		mc7  |
		mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)

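The numbering in the listing above follows a simple pattern, which the
sketch below reproduces. It assumes one CPU node and two nodes per GPU
card, as in this example system. Both values are parameters, and the
function names are made up for illustration.

```python
def mc_index(gpu_card, gpu_node, cpu_nodes=1, nodes_per_gpu=2):
    """EDAC mc number for a GPU node, counted after all CPU nodes."""
    return cpu_nodes + gpu_card * nodes_per_gpu + gpu_node


def mc_to_gpu(mc, cpu_nodes=1, nodes_per_gpu=2):
    """Inverse mapping: (gpu_card, gpu_node), or None for a CPU mc."""
    if mc < cpu_nodes:
        return None
    return divmod(mc - cpu_nodes, nodes_per_gpu)


for card in range(4):
    print(card, [mc_index(card, node) for node in range(2)])
```
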
For example, a heterogeneous system with one AMD CPU is connected to
four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
via the following sysfs entries::

	/sys/devices/system/edac/mc/..

	CPU			# CPU node
	├── mc 0

	GPU Nodes are enumerated sequentially after CPU nodes have been populated
	GPU card 1		# Each MI200 GPU has 2 nodes/mcs
	├── mc 1		# GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
	│   ├── csrow 0		# UMC 0
	│   │   ├── channel 0	# Each UMC has 8 channels
	│   │   ├── channel 1   # size of each channel is 2 GB, so each UMC has 16 GB
	│   │   ├── channel 2
	│   │   ├── channel 3
	│   │   ├── channel 4
	│   │   ├── channel 5
	│   │   ├── channel 6
	│   │   ├── channel 7
	│   ├── csrow 1		# UMC 1
	│   │   ├── channel 0
	│   │   ├── ..
	│   │   ├── channel 7
	│   ├── ..		..
	│   ├── csrow 3		# UMC 3
	│   │   ├── channel 0
	│   │   ├── ..
	│   │   ├── channel 7
	│   ├── rank 0
	│   ├── ..		..
	│   ├── rank 31		# total 32 ranks/dimms from 4 UMCs
	│
	├── mc 2		# GPU node 1 == mc2
	│   ├── ..		# each GPU has total 64 GB

	GPU card 2
	├── mc 3
	│   ├── ..
	├── mc 4
	│   ├── ..

	GPU card 3
	├── mc 5
	│   ├── ..
	├── mc 6
	│   ├── ..

	GPU card 4
	├── mc 7
	│   ├── ..
	├── mc 8
	│   ├── ..

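The tree above shows 32 ranks per mc (4 csrows times 8 channels), but does
not say how a rank number relates to a csrow/channel pair. The sketch
below assumes row-major numbering, where all channels of csrow 0 come
first. That ordering is an assumption made for illustration and should be
checked against the running kernel before relying on it.

```python
CSROWS_PER_MC = 4       # one csrow per GPU UMC
CHANNELS_PER_CSROW = 8  # one EDAC channel per UMC channel


def rank_index(csrow, channel):
    """Assumed row-major rank number for a (csrow, channel) pair."""
    if not (0 <= csrow < CSROWS_PER_MC and 0 <= channel < CHANNELS_PER_CSROW):
        raise ValueError("csrow or channel out of range")
    return csrow * CHANNELS_PER_CSROW + channel


print(rank_index(0, 0))  # first rank
print(rank_index(3, 7))  # last rank
```
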