.. SPDX-License-Identifier: GPL-2.0

====================
mlx5 devlink support
====================

This document describes the devlink features implemented by the ``mlx5``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented

   * - Name
     - Mode
     - Validation
     - Notes
   * - ``enable_roce``
     - driverinit
     - Boolean
     - If the device supports RoCE disablement, RoCE enablement state controls
       device support for RoCE capability. Otherwise, the control occurs in the
       driver stack. When RoCE is disabled at the driver level, only raw
       ethernet QPs are supported.
   * - ``io_eq_size``
     - driverinit
     - The range is between 64 and 4096.
     -
   * - ``event_eq_size``
     - driverinit
     - The range is between 64 and 4096.
     -
   * - ``max_macs``
     - driverinit
     - The range is between 1 and 2^31. Only power of 2 values are supported.
     -
   * - ``enable_sriov``
     - permanent
     - Boolean
     - Applies to each physical function (PF) independently, if the device
       supports it. Otherwise, it applies symmetrically to all PFs.
   * - ``total_vfs``
     - permanent
     - The range is between 1 and a device-specific max.
     - Applies to each physical function (PF) independently, if the device
       supports it. Otherwise, it applies symmetrically to all PFs.

Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require
a FW reset to take effect.

.. code-block:: bash

   # Set the permanent parameters
   devlink dev param set pci/0000:01:00.0 name enable_sriov value true cmode permanent
   devlink dev param set pci/0000:01:00.0 name total_vfs value 8 cmode permanent

   # FW reset
   devlink dev reload pci/0000:01:00.0 action fw_activate

   # For PCI related config such as SR-IOV, a PCI reset/rescan is required:
   echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove
   echo 1 >/sys/bus/pci/rescan
   grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_*


The ``mlx5`` driver also implements the following driver-specific
parameters.

.. list-table:: Driver-specific parameters implemented
   :widths: 5 5 5 85

   * - Name
     - Type
     - Mode
     - Description
   * - ``flow_steering_mode``
     - string
     - runtime
     - Controls the flow steering mode of the driver

       * ``dmfs`` Device managed flow steering. In DMFS mode, the HW
         steering entities are created and managed through firmware.
       * ``smfs`` Software managed flow steering. In SMFS mode, the HW
         steering entities are created and managed through the driver without
         firmware intervention.
       * ``hmfs`` Hardware managed flow steering. In HMFS mode, the driver
         configures steering rules directly to the HW using Work Queues with
         a special new type of WQE (Work Queue Element).

       SMFS mode is faster and provides a better rule insertion rate compared
       to the default DMFS mode.
   * - ``fdb_large_groups``
     - u32
     - driverinit
     - Control the number of large groups (size > 1) in the FDB table.

       * The default value is 15, and the range is between 1 and 1024.
   * - ``esw_multiport``
     - Boolean
     - runtime
     - Control MultiPort E-Switch shared fdb mode.

       An experimental mode where a single E-Switch is used and all the vports
       and physical ports on the NIC are connected to it.

       An example is to send traffic from a VF that is created on PF0 to an
       uplink that is natively associated with the uplink of PF1.

       Note: Future devices, ConnectX-8 and onward, will eventually have this
       as the default to allow forwarding between all NIC ports in a single
       E-switch environment and the dual E-switch mode will likely get
       deprecated.

       Default: disabled
   * - ``esw_port_metadata``
     - Boolean
     - runtime
     - When applicable, disabling eswitch metadata can increase packet rate up
       to 20% depending on the use case and packet sizes.

       Eswitch port metadata state controls whether to internally tag packets
       with metadata. Metadata tagging must be enabled for multi-port RoCE,
       failover between representors and stacked devices. By default metadata is
       enabled on the supported devices in E-switch. Metadata is applicable only
       for E-switch in switchdev mode and users may disable it when NONE of the
       below use cases will be in use:

       1. HCA is in Dual/multi-port RoCE mode.
       2. VF/SF representor bonding (usually used for live migration).
       3. Stacked devices.

       When metadata is disabled, the above use cases will fail to initialize if
       users try to enable them.

       Note: Setting this parameter does not take effect immediately. Setting
       must happen in legacy mode and eswitch port metadata takes effect after
       enabling switchdev mode.
   * - ``hairpin_num_queues``
     - u32
     - driverinit
     - We refer to a TC NIC rule that involves forwarding as "hairpin".
       Hairpin queues are the mlx5 hardware-specific implementation for
       hardware forwarding of such packets.

       Control the number of hairpin queues.
   * - ``hairpin_queue_size``
     - u32
     - driverinit
     - Control the size (in packets) of the hairpin queues.
   * - ``pcie_cong_inbound_high``
     - u16
     - driverinit
     - High threshold configuration for PCIe congestion events. The firmware
       will send an event once device-side inbound PCIe traffic has stayed
       above the configured high threshold for a long enough period (at least
       200ms).

       See the ``pci_bw_inbound_high`` ethtool stat.

       Units are 0.01 %. Accepted values are in range [0, 10000].
       The constraint ``pcie_cong_inbound_low`` < ``pcie_cong_inbound_high``
       must hold.
       Default value: 9000 (Corresponds to 90%).
   * - ``pcie_cong_inbound_low``
     - u16
     - driverinit
     - Low threshold configuration for PCIe congestion events. The firmware
       will send an event once device-side inbound PCIe traffic has dropped
       below the configured low threshold, only after having been previously in
       a congested state.

       See the ``pci_bw_inbound_low`` ethtool stat.

       Units are 0.01 %. Accepted values are in range [0, 10000].
       The constraint ``pcie_cong_inbound_low`` < ``pcie_cong_inbound_high``
       must hold.
       Default value: 7500 (Corresponds to 75%).
   * - ``pcie_cong_outbound_high``
     - u16
     - driverinit
     - High threshold configuration for PCIe congestion events. The firmware
       will send an event once device-side outbound PCIe traffic has stayed
       above the configured high threshold for a long enough period (at least
       200ms).

       See the ``pci_bw_outbound_high`` ethtool stat.

       Units are 0.01 %. Accepted values are in range [0, 10000].
       The constraint ``pcie_cong_outbound_low`` < ``pcie_cong_outbound_high``
       must hold.
       Default value: 9000 (Corresponds to 90%).
   * - ``pcie_cong_outbound_low``
     - u16
     - driverinit
     - Low threshold configuration for PCIe congestion events. The firmware
       will send an event once device-side outbound PCIe traffic has dropped
       below the configured low threshold, only after having been previously in
       a congested state.

       See the ``pci_bw_outbound_low`` ethtool stat.

       Units are 0.01 %. Accepted values are in range [0, 10000].
       The constraint ``pcie_cong_outbound_low`` < ``pcie_cong_outbound_high``
       must hold.
       Default value: 7500 (Corresponds to 75%).

   * - ``cqe_compress_type``
     - string
     - permanent
     - Configure which mechanism/algorithm should be used by the NIC that will
       affect the rate (aggressiveness) of compressed CQEs depending on PCIe bus
       conditions and other internal NIC factors. This mode affects all queues
       that enable compression.

       * ``balanced`` : Merges fewer CQEs, resulting in a moderate compression
         ratio but maintaining a balance between bandwidth savings and
         performance.
       * ``aggressive`` : Merges more CQEs into a single entry, achieving a
         higher compression rate and maximizing performance, particularly
         under high traffic loads.

The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.

Info versions
=============

The ``mlx5`` driver reports the following versions:

.. list-table:: devlink info versions implemented
   :widths: 5 5 90

   * - Name
     - Type
     - Description
   * - ``fw.psid``
     - fixed
     - Used to represent the board id of the device.
   * - ``fw.version``
     - stored, running
     - Three digit major.minor.subminor firmware version number.

Health reporters
================

tx reporter
-----------
The tx reporter is responsible for reporting and recovering from the following
three error scenarios:

- tx timeout
    Report on kernel tx timeout detection.
    Recover by searching lost interrupts.
- tx error completion
    Report on error tx completion.
    Recover by flushing the tx queue and resetting it.
- tx PTP port timestamping CQ unhealthy
    Report too many CQEs never delivered on port ts CQ.
    Recover by flushing and re-creating all PTP channels.

The tx reporter also supports an on-demand diagnose callback, which provides
real-time information about the status of its send queues.

User commands examples:

- Diagnose send queues status::

    $ devlink health diagnose pci/0000:82:00.0 reporter tx

.. note::
   This command has valid output only when the interface is up. Otherwise, the
   command has empty output.

- Show the number of tx errors indicated, the number of recover flows that
  ended successfully, whether autorecover is enabled, and the graceful period
  from the last recover::

    $ devlink health show pci/0000:82:00.0 reporter tx

rx reporter
-----------
The rx reporter is responsible for reporting and recovering from the following
two error scenarios:

- rx queues' initialization (population) timeout
    Population of rx queues' descriptors on ring initialization is done
    in napi context via triggering an irq. In case of a failure to get
    the minimum amount of descriptors, a timeout would occur, and
    descriptors could be recovered by polling the EQ (Event Queue).
- rx completions with errors (reported by HW on interrupt context)
    Report on rx completion error.
    Recover (if needed) by flushing the related queue and resetting it.

The rx reporter also supports an on-demand diagnose callback, which
provides real-time information about the status of its receive queues.

- Diagnose rx queues' status and corresponding completion queue::

    $ devlink health diagnose pci/0000:82:00.0 reporter rx

.. note::
   This command has valid output only when the interface is up. Otherwise, the
   command has empty output.

- Show the number of rx errors indicated, the number of recover flows that
  ended successfully, whether autorecover is enabled, and the graceful period
  from the last recover::

    $ devlink health show pci/0000:82:00.0 reporter rx

fw reporter
-----------
The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
current fw status.

User commands examples:

- Check fw health status::

    $ devlink health diagnose pci/0000:82:00.0 reporter fw

- Read FW core dump if already stored or trigger new one::

    $ devlink health dump show pci/0000:82:00.0 reporter fw

.. note::
   This command can run only on the PF which has fw tracer ownership,
   running it on other PF or any VF will return "Operation not permitted".

fw fatal reporter
-----------------
The fw fatal reporter implements `dump` and `recover` callbacks.
It follows fatal error indications with a CR-space dump and a recover flow.
The CR-space dump uses the vsc interface, which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
The recover function runs the recover flow, which reloads the driver and
triggers fw reset if needed.
On firmware error, the health buffer is dumped into the dmesg. The log
level is derived from the error's severity (given in health buffer).

User commands examples:

- Run fw recover flow manually::

    $ devlink health recover pci/0000:82:00.0 reporter fw_fatal

- Read FW CR-space dump if already stored or trigger new one::

    $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal

.. note::
   This command can run only on PF.

vnic reporter
-------------
The vnic reporter implements only the `diagnose` callback.
It is responsible for querying the vnic diagnostic counters from fw and
displaying them in real time.

Description of the vnic counters:

- total_error_queues
        number of queues in an error state due to
        an async error or errored command.
- send_queue_priority_update_flow
        number of QP/SQ priority/SL update events.
- cq_overrun
        number of times CQ entered an error state due to an overflow.
- async_eq_overrun
        number of times an EQ mapped to async events was overrun.
- comp_eq_overrun
        number of times an EQ mapped to completion events was
        overrun.
- quota_exceeded_command
        number of commands issued and failed due to quota exceeded.
- invalid_command
        number of commands issued and failed due to any reason other than quota
        exceeded.
- nic_receive_steering_discard
        number of packets that completed RX flow
        steering but were discarded due to a mismatch in flow table.
- generated_pkt_steering_fail
        number of packets generated by the VNIC experiencing unexpected steering
        failure (at any point in steering flow).
- handled_pkt_steering_fail
        number of packets handled by the VNIC experiencing unexpected steering
        failure (at any point in steering flow owned by the VNIC, including the FDB
        for the eswitch owner).
- icm_consumption
        amount of Interconnect Host Memory (ICM) consumed by the vnic in
        granularity of 4KB. ICM is host memory allocated by SW upon HCA request
        and is used for storing data structures that control HCA operation.

User commands examples:

- Diagnose PF/VF vnic counters::

    $ devlink health diagnose pci/0000:82:00.1 reporter vnic

- Diagnose representor vnic counters (performed by supplying devlink port of the
  representor, which can be obtained via devlink port command)::

    $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic

.. note::
   This command can run over all interfaces such as PF/VF and representor ports.
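
The ``tx``, ``rx``, ``fw`` and ``vnic`` reporters described in this document
all implement a `diagnose` callback, so they can be queried in a single pass.
The following is only an illustrative sketch: the loop and the ``dev`` shell
variable are assumptions made for this example, and the PCI address is the
same placeholder used in the commands above. The ``fw_fatal`` reporter is
left out because it implements only `dump` and `recover`.

.. code-block:: bash

   # Illustrative sketch, not a prescribed procedure.
   # "dev" is a hypothetical variable; replace it with a real devlink handle.
   dev=pci/0000:82:00.0

   # Run the diagnose callback of every reporter that implements one.
   for reporter in tx rx fw vnic; do
       echo "== ${reporter} =="
       devlink health diagnose "${dev}" reporter "${reporter}"
   done

As noted for the individual reporters, the ``tx`` and ``rx`` output is empty
while the interface is down.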