| da0e2197 | 24-Aug-2025 |
Shahar Shitrit <shshitrit@nvidia.com> |
devlink: Make health reporter burst period configurable
Enable configuration of the burst period — a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error.
This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient burst period could allow all affected TX queues to be recovered within that window. Without this period, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins.
Configuration example:
$ devlink health set pci/0000:00:09.0 reporter tx burst_period 500

Configuration example with ynl:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/devlink.yaml \
 --do health-reporter-set --json '{
 "bus-name": "auxiliary",
 "dev-name": "mlx5_core.eth.0",
 "port-index": 65535,
 "health-reporter-name": "tx",
 "health-reporter-burst-period": 500
 }'
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| 6a06d8c4 | 24-Aug-2025 |
Shahar Shitrit <shshitrit@nvidia.com> |
devlink: Introduce burst period for health reporter
Currently, the devlink health reporter starts the grace period immediately after handling an error, blocking any further recoveries until it finishes.
However, when a single root cause triggers multiple errors in a short time frame, it is desirable to treat them as one batch of errors and allow all of them to be recovered. This avoids premature blocking of subsequent related errors and reduces the risk of inconsistent or incomplete error handling.
To address this, introduce a configurable burst period for the devlink health reporter. Start this period when the first error is handled, and allow recovery attempts for reported errors during this window. Once the burst period expires, begin the grace period to block further recoveries until it concludes.
Timeline summary:
----|--------|------------------------------/----------------------/--
error is  error is       burst period            grace period
reported  recovered  (recoveries allowed)    (recoveries blocked)
For calculating the burst period duration, use the same last_recovery_ts as the grace period. Update it on recovery only when the burst period is inactive (either disabled or at the first error).
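The timestamp rule above can be illustrated with a small model. This is a minimal sketch, not the kernel implementation: the class name ReporterModel, its methods, and the use of plain integer time ticks are assumptions made for illustration, and the exact boundary comparisons in the kernel may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReporterModel:
    """Toy model of a health reporter (hypothetical, not the kernel struct)."""
    burst_period: int                       # 0 means the burst period is disabled
    grace_period: int
    last_recovery_ts: Optional[int] = None  # shared by both windows

    def in_burst(self, now: int) -> bool:
        # The burst window opens at the first recovered error.
        if self.burst_period == 0 or self.last_recovery_ts is None:
            return False
        return now < self.last_recovery_ts + self.burst_period

    def in_grace(self, now: int) -> bool:
        # The grace period starts where the burst period ends.
        if self.last_recovery_ts is None:
            return False
        end = self.last_recovery_ts + self.burst_period + self.grace_period
        return now < end

    def report_error(self, now: int) -> bool:
        """Return True if a recovery attempt is allowed at time `now`."""
        if self.in_burst(now):
            return True                     # keep last_recovery_ts untouched
        if self.in_grace(now):
            return False                    # recoveries are blocked
        self.last_recovery_ts = now         # first error: open a new window
        return True


legacy = ReporterModel(burst_period=0, grace_period=1000)
print([legacy.report_error(t) for t in (0, 10, 20)])

bursty = ReporterModel(burst_period=500, grace_period=1000)
print([bursty.report_error(t) for t in (0, 10, 20, 600, 1600)])
```

With a burst period of 0 the model recovers only the first of three closely spaced errors, matching the pre-existing behavior. With a burst period of 500 all three are recovered, the error at tick 600 falls into the grace period and is blocked, and the error at tick 1600 starts a new window.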
This patch implements the framework for the burst period and effectively sets its value to 0 at reporter creation, so the current behavior remains unchanged, which ensures backward compatibility.
A downstream patch will make the burst period configurable.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| 20597fb9 | 24-Aug-2025 |
Shahar Shitrit <shshitrit@nvidia.com> |
devlink: Move health reporter recovery abort logic to a separate function
Extract the health reporter recovery abort logic into a separate function devlink_health_recover_abort(). The function encapsulates the conditions for aborting recovery:
- When auto-recovery is disabled
- When previous error wasn't recovered
- When within the grace period after last recovery
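The three conditions listed above can be written as a single predicate. This is a hedged sketch in Python rather than the kernel's C code: the function name should_abort_recovery, its parameters, and the integer time ticks are all hypothetical, and only the three conditions themselves come from the commit message.

```python
from typing import Optional


def should_abort_recovery(auto_recover: bool,
                          prev_error_recovered: bool,
                          now: int,
                          last_recovery_ts: Optional[int],
                          grace_period: int) -> bool:
    """Return True when a recovery attempt should be skipped."""
    # Condition 1: auto-recovery is disabled.
    if not auto_recover:
        return True
    # Condition 2: the previous error was not recovered.
    if not prev_error_recovered:
        return True
    # Condition 3: still within the grace period after the last recovery.
    if last_recovery_ts is not None and now < last_recovery_ts + grace_period:
        return True
    return False
```

Keeping the checks in one function means a later change, such as an additional time window, only has to touch this predicate and not its callers.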
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| 41a6e8ab | 13-Aug-2025 |
Parav Pandit <parav@nvidia.com> |
devlink/port: Check attributes early and constify
Constify the devlink port attributes to indicate that they are read-only and do not depend on anything else. Therefore, validate them early, before setting them in the devlink port.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20250813094417.7269-3-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| de9ccf22 | 04-Jul-2025 |
Ivan Vecera <ivecera@redhat.com> |
devlink: Add new "clock_id" generic device param
Add a new generic device parameter to specify the clock ID that the device should use when registering DPLL devices and pins.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| 88debb52 | 05-May-2025 |
Jiri Pirko <jiri@nvidia.com> |
devlink: use DEVLINK_VAR_ATTR_TYPE_* instead of NLA_* in fmsg
Use the newly introduced DEVLINK_VAR_ATTR_TYPE_* enum values instead of the internal NLA_* values in the fmsg health reporter code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| f9e78932 | 05-May-2025 |
Jiri Pirko <jiri@nvidia.com> |
devlink: avoid param type value translations
Assign DEVLINK_PARAM_TYPE_* enum values to DEVLINK_VAR_ATTR_TYPE_* to ensure the same values are used internally and in UAPI. Benefit from that by removing the value translations.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|