58029c39 | 27-Feb-2025 |
Yazen Ghannam <yazen.ghannam@amd.com> |
RAS/AMD/FMPM: Get masked address
Some operations require checking, or ignoring, specific bits in an address value. For example, this can be comparing address values to identify unique structures.
Currently, the full address value is compared when filtering for duplicates. This results in overcounting and the creation of extra records, which gives the impression that more unique events occurred than actually did.
Mask the address for physical rows on MI300.
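The idea of comparing addresses only on the bits that matter can be sketched as follows. This is a minimal illustration, not the driver code: the mask value and the `example_*` names are assumptions made for this sketch and do not reflect the actual MI300 row layout.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical mask: bits that do NOT distinguish one physical row from
 * another. The real MI300 bit positions are not given in the text above.
 */
#define EXAMPLE_IGNORED_BITS 0xFFFULL

/* Clear the ignored bits so that only the row-identifying bits remain. */
static uint64_t example_masked_addr(uint64_t addr)
{
	return addr & ~EXAMPLE_IGNORED_BITS;
}

/* Two addresses are duplicates if they agree on every non-ignored bit. */
static bool example_same_row(uint64_t a, uint64_t b)
{
	return example_masked_addr(a) == example_masked_addr(b);
}
```

With a filter like this, two errors that differ only in the ignored bits map to the same record instead of creating two.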
[ bp: Simplify. ]
Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
|
03a9b670 | 15-Jul-2024 |
Borislav Petkov (AMD) <bp@alien8.de> |
Merge remote-tracking branches 'ras/edac-amd-atl' and 'ras/edac-misc' into edac-updates
* ras/edac-amd-atl:
  RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
  RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
  RAS/AMD/ATL: Validate address map when information is gathered
  RAS/AMD/ATL: Expand helpers for adding and removing base and hole
  RAS/AMD/ATL: Read DRAM hole base early
  RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
  RAS/AMD/ATL: Add a missing module description
* ras/edac-misc:
  EDAC: Add missing MODULE_DESCRIPTION() macros
  EDAC/dmc520: Use devm_platform_ioremap_resource()
  EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support
  EDAC, i10nm: make skx_common.o a separate module
  EDAC/skx: Switch to new Intel CPU model defines
  EDAC/sb_edac: Switch to new Intel CPU model defines
  EDAC, pnd2: Switch to new Intel CPU model defines
  EDAC/i10nm: Switch to new Intel CPU model defines
  EDAC/ghes: Add missing newline to pr_info() statement
  RAS/AMD/ATL: Add missing newline to pr_info() statement
  EDAC/thunderx: Remove unused struct error_syndrome
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
ba437905 | 07-Jun-2024 |
Yazen Ghannam <yazen.ghannam@amd.com> |
RAS/AMD/ATL: Use system settings for MI300 DRAM to normalized address translation
The currently used normalized address format is not applicable to all MI300 systems. This leads to incorrect results during address translation.
Drop the fixed layout and construct the normalized address from system settings.
Fixes: 87a612375307 ("RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240607-mi300-dram-xl-fix-v1-2-2f11547a178c@amd.com
|
f4c0cd18 | 06-Jun-2024 |
John Allen <john.allen@amd.com> |
RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
Both the AMD ATL and the FMPM driver define INVALID_SPA. Include the definition from the ATL internal.h header in the FMPM driver.
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240606203313.51197-7-john.allen@amd.com
|
e0372d69 | 06-Jun-2024 |
John Allen <john.allen@amd.com> |
RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
Unlike with previous Data Fabric versions, with Data Fabric 4.5 non-power-of-2 denormalization, there are bits of the system physical address that can't be fully reconstructed from the normalized address.
To determine the proper combination of missing system physical address bits, iterate through each possible combination of these bits, normalize the resulting system physical address, and compare to the original address that is being translated. If the addresses match, then the correct permutation of bits has been found.
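The search described above can be sketched as a brute-force loop over the unknown bits. This is only an illustration under stated assumptions: the `example_*` names, the position and count of the missing bits, and the toy normalization function are all hypothetical and stand in for the real Data Fabric 4.5 logic.

```c
#include <stdint.h>

/* Hypothetical sentinel for "no matching address found". */
#define EXAMPLE_INVALID_SPA UINT64_MAX

typedef uint64_t (*example_norm_fn)(uint64_t spa);

/*
 * Try every value of the 'nbits' unknown bits starting at bit 'shift'.
 * A candidate is accepted when normalizing it reproduces the normalized
 * address that is being translated.
 */
static uint64_t example_recover_spa(uint64_t partial_spa, unsigned int shift,
				    unsigned int nbits, uint64_t norm_addr,
				    example_norm_fn normalize)
{
	uint64_t mask = ((1ULL << nbits) - 1) << shift;
	uint64_t combo;

	for (combo = 0; combo < (1ULL << nbits); combo++) {
		uint64_t candidate = (partial_spa & ~mask) | (combo << shift);

		if (normalize(candidate) == norm_addr)
			return candidate;
	}

	return EXAMPLE_INVALID_SPA;
}

/* Toy normalization, chosen only so that the example can be exercised. */
static uint64_t example_toy_normalize(uint64_t spa)
{
	return spa / 3;
}
```

The cost grows as 2^nbits normalizations per translation, which stays small as long as only a few bits are unrecoverable.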
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240606203313.51197-6-john.allen@amd.com
|
d5811a16 | 06-Jun-2024 |
John Allen <john.allen@amd.com> |
RAS/AMD/ATL: Validate address map when information is gathered
Validate address maps at the time the information is gathered as the address map will not change during translation.
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240606203313.51197-5-john.allen@amd.com
|
6cce048c | 06-Jun-2024 |
John Allen <john.allen@amd.com> |
RAS/AMD/ATL: Expand helpers for adding and removing base and hole
The ret_addr field in struct addr_ctx contains the intermediate value of the returned address as it passes through multiple steps in the translation process. Currently, adding the DRAM base and legacy hole is only done once, so it operates directly on the intermediate value.
However, for DF 4.5 non-power-of-2 denormalization, adding and removing the DRAM base and legacy hole needs to be done for multiple temporary address values. During this process, the intermediate value should not be lost so the ret_addr value can't be reused.
Update the existing 'add' helper to operate on an arbitrary address and introduce a new 'remove' helper to do the inverse operations.
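A pair of inverse helpers that take the address as a parameter, rather than modifying a shared intermediate value, might look roughly like this. The struct layout and the `example_*` names are assumptions made for this sketch; the real helpers work on struct addr_ctx, whose fields are not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_4G (1ULL << 32)

/* Hypothetical, simplified view of the address map fields that matter. */
struct example_map {
	uint64_t dram_base;	/* base address of the DRAM range */
	uint64_t hole_base;	/* start of the legacy hole, below 4 GiB */
	bool hole_en;		/* legacy hole enabled for this range */
};

/* Add the DRAM base, then skip over the legacy hole if the address reaches it. */
static uint64_t example_add_base_and_hole(const struct example_map *map, uint64_t addr)
{
	addr += map->dram_base;

	if (map->hole_en && addr >= map->hole_base)
		addr += EXAMPLE_4G - map->hole_base;

	return addr;
}

/* Inverse operation: undo the hole adjustment first, then remove the base. */
static uint64_t example_remove_base_and_hole(const struct example_map *map, uint64_t addr)
{
	if (map->hole_en && addr >= EXAMPLE_4G)
		addr -= EXAMPLE_4G - map->hole_base;

	return addr - map->dram_base;
}
```

Because neither helper writes to shared state, a caller can apply them to any number of temporary addresses without disturbing the value it is still building.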
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240606203313.51197-4-john.allen@amd.com
|
1233aa3f | 06-Jun-2024 |
John Allen <john.allen@amd.com> |
RAS/AMD/ATL: Read DRAM hole base early
Read DRAM hole base when constructing the address map as the value will not change during run time.
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240606203313.51197-3-john.allen@amd.com
|
efdbe82a | 06-Jun-2024 |
John Allen <john.allen@amd.com> |
RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
Prefix all AMD ATL pr_* statements with "amd_atl:".
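The pr_fmt() convention can be mimicked in user space to show how a single macro prefixes every message. This is a sketch only: `example_pr_info` is a hypothetical stand-in for the kernel's pr_* helpers, the exact macro text in the driver is assumed rather than quoted, and `##__VA_ARGS__` is a GCC/Clang extension.

```c
#include <stdio.h>

/* Every format string passed through pr_fmt() gains the module prefix. */
#define pr_fmt(fmt) "amd_atl: " fmt

/*
 * Hypothetical user-space stand-in for pr_info(): formats into a caller
 * supplied buffer so the result can be inspected.
 */
#define example_pr_info(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)
```

Because the prefix is applied at the macro level, individual call sites do not need to repeat it.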
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240606203313.51197-2-john.allen@amd.com
|
9b195439 | 19-Mar-2024 |
Yazen Ghannam <yazen.ghannam@amd.com> |
RAS/AMD/FMPM: Safely handle saved records of various sizes
Currently, the size of the locally cached FRU record structures is based on the module parameter "max_nr_entries".
This creates issues when restoring records if a user changes the parameter.
If the number of entries is reduced, then old, larger records will not be restored. The opportunity to take action on the saved data is missed. Also, new records will be created and written to storage, even as the old records remain in storage, resulting in wasted space.
If the number of entries is increased, then the length of the old, smaller records will not be adjusted. This causes a checksum failure which leads to the old record being cleared from storage. Again this results in another missed opportunity for action on the saved data.
Allocate the temporary record with the maximum possible size based on the current maximum number of supported entries (255). This allows the ERST read operation to succeed if max_nr_entries has been increased.
Warn the user if a saved record exceeds the expected size and fail to load the module. This allows the user to adjust the module parameter without losing data or the opportunity to restore larger records.
Increase the size of a saved record up to the current max_rec_len. The checksum will be recalculated, and the updated record will be written to storage.
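Growing a saved record and recomputing its checksum can be sketched as below. The record layout, the byte-sum checksum scheme, and all `example_*` names are assumptions made for this illustration; they are not the FRU memory poison record format.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_MAX_LEN 64

/* Hypothetical record: the first 'len' bytes must sum to zero (mod 256). */
struct example_rec {
	uint8_t checksum;
	uint8_t len;		/* bytes in use, header included */
	uint8_t entries[EXAMPLE_MAX_LEN - 2];
};

/* Sum the first 'len' bytes of the record. */
static uint8_t example_sum(const struct example_rec *rec, size_t len)
{
	const uint8_t *p = (const uint8_t *)rec;
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += p[i];

	return sum;
}

/* Recompute the checksum so that the bytes in use sum to zero. */
static void example_seal(struct example_rec *rec)
{
	rec->checksum = 0;
	rec->checksum = (uint8_t)-example_sum(rec, rec->len);
}

static bool example_valid(const struct example_rec *rec)
{
	return example_sum(rec, rec->len) == 0;
}

/* Grow a smaller saved record to new_len; larger records are rejected. */
static int example_grow(struct example_rec *rec, uint8_t new_len)
{
	if (new_len > sizeof(*rec) || new_len < rec->len)
		return -1;

	/* Zero the newly exposed bytes, then update length and checksum. */
	memset((uint8_t *)rec + rec->len, 0, new_len - rec->len);
	rec->len = new_len;
	example_seal(rec);

	return 0;
}
```

Since the length is part of the checksummed data in this sketch, changing it without resealing would make the record fail validation, which mirrors the checksum failure described above.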
Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Muralidhara M K <muralidhara.mk@amd.com>
Link: https://lore.kernel.org/r/20240319113322.280096-3-yazen.ghannam@amd.com
|
bd17b7c3 | 06-Mar-2024 |
Dan Carpenter <dan.carpenter@linaro.org> |
RAS/AMD/FMPM: Fix off by one when unwinding on error
Decrement the index variable i before the first iteration when freeing the remaining elements on error. Depending on where this fails it could free something from one element beyond the end of the fru_records[] array.
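The unwinding pattern can be illustrated with a small user-space sketch. The function name and the `fail_at` parameter are hypothetical and exist only to simulate an allocation failure; this is not the driver code.

```c
#include <stdlib.h>

/*
 * Allocate n buffers. 'fail_at' simulates an allocation failure at that
 * index (use -1 for no failure). On error, everything allocated so far
 * is released again.
 */
static int example_alloc_all(void *recs[], int n, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		recs[i] = (i == fail_at) ? NULL : malloc(16);
		if (!recs[i])
			goto out_free;
	}

	return 0;

out_free:
	/*
	 * Element i was never successfully allocated (or i == n if a later
	 * step failed), so step back BEFORE the first free. Starting the
	 * cleanup at i itself would touch one element too many.
	 */
	while (--i >= 0) {
		free(recs[i]);
		recs[i] = NULL;
	}

	return -1;
}
```

Using `while (--i >= 0)` keeps the cleanup correct both when the loop fails partway through and when a failure happens after the loop has finished with i equal to n.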
[ bp: Massage commit message. ]
Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/6fdec71a-846b-4cd0-af69-e5f6cd12f4f6@moroto.mountain
|
7d19eea5 | 01-Mar-2024 |
Yazen Ghannam <yazen.ghannam@amd.com> |
RAS/AMD/FMPM: Add debugfs interface to print record entries
It is helpful to see the saved record entries during run time in human-readable format. This is useful for testing during module development. It can also be used by system admins to quickly and easily see the state of the system.
Provide a sequential file in debugfs to print fields of interest from the FRU records and their entries.
Don't fail to load the module if the debugfs interface is not available. This is a convenience feature which does not affect other module functionality.
The new interface reads the record entries and should hold the mutex. Expand the mutex code comment to clarify when it should be held.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240301143748.854090-4-yazen.ghannam@amd.com
|