.. SPDX-License-Identifier: GPL-2.0

====================
mlx5 devlink support
====================

This document describes the devlink features implemented by the ``mlx5``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented

   * - Name
     - Mode
     - Validation
     - Notes
   * - ``enable_roce``
     - driverinit
     - Boolean
     - If the device supports RoCE disablement, RoCE enablement state controls
       device support for RoCE capability. Otherwise, the control occurs in the
       driver stack. When RoCE is disabled at the driver level, only raw
       ethernet QPs are supported.
   * - ``io_eq_size``
     - driverinit
     - The range is between 64 and 4096.
     -
   * - ``event_eq_size``
     - driverinit
     - The range is between 64 and 4096.
     -
   * - ``max_macs``
     - driverinit
     - The range is between 1 and 2^31. Only power of 2 values are supported.
     -
   * - ``enable_sriov``
     - permanent
     - Boolean
     - Applies to each physical function (PF) independently, if the device
       supports it. Otherwise, it applies symmetrically to all PFs.
   * - ``total_vfs``
     - permanent
     - The range is between 1 and a device-specific max.
     - Applies to each physical function (PF) independently, if the device
       supports it. Otherwise, it applies symmetrically to all PFs.

Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require
a FW reset to take effect.

.. code-block:: bash

   # Set the permanent parameters
   devlink dev param set pci/0000:01:00.0 name enable_sriov value true cmode permanent
   devlink dev param set pci/0000:01:00.0 name total_vfs value 8 cmode permanent

   # Reset the FW so that the new values are activated
   devlink dev reload pci/0000:01:00.0 action fw_activate

   # PCI-related configuration such as SR-IOV also requires a PCI remove/rescan
   echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove
   echo 1 >/sys/bus/pci/rescan
   grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_*
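
The driverinit parameters in the table above follow a different flow: the new
value is stored first and is applied only when the driver is re-initialized.
The following is a minimal sketch of that flow for ``max_macs``, assuming bash
and the same example device address. ``is_valid_max_macs`` and
``set_max_macs`` are hypothetical helper names (not part of devlink), and the
``DEVLINK`` variable is an assumed override that only exists here to allow a
dry run.

.. code-block:: bash

   # Hypothetical helper: succeeds only if $1 is a power of two in [1, 2^31],
   # which is the documented validation rule for max_macs.
   is_valid_max_macs() {
       local v=$1
       # Reject empty input, non-digits and leading zeros (this covers 0 too)
       case $v in
           ''|*[!0-9]*|0*) return 1 ;;
       esac
       [ ${#v} -le 10 ] && [ "$v" -le 2147483648 ] && [ $(( v & (v - 1) )) -eq 0 ]
   }

   # Store a new max_macs value, then re-initialize the driver to apply it.
   # Assumption: DEVLINK may be overridden (e.g. DEVLINK=echo) for a dry run.
   set_max_macs() {
       local dev=$1 value=$2
       is_valid_max_macs "$value" || { echo "invalid max_macs: $value" >&2; return 1; }
       ${DEVLINK:-devlink} dev param set "$dev" name max_macs value "$value" cmode driverinit &&
           ${DEVLINK:-devlink} dev reload "$dev"
   }

For example, ``set_max_macs pci/0000:01:00.0 256`` stores the value and reloads
the driver, while a value such as 100 is rejected before devlink is invoked.
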

The ``mlx5`` driver also implements the following driver-specific
parameters.

.. list-table:: Driver-specific parameters implemented
   :widths: 5 5 5 85

   * - Name
     - Type
     - Mode
     - Description
   * - ``flow_steering_mode``
     - string
     - runtime
     - Controls the flow steering mode of the driver.

       * ``dmfs`` Device managed flow steering. In DMFS mode, the HW
         steering entities are created and managed through firmware.
       * ``smfs`` Software managed flow steering. In SMFS mode, the HW
         steering entities are created and managed through the driver without
         firmware intervention.
       * ``hmfs`` Hardware managed flow steering. In HMFS mode, the driver
         configures steering rules directly in the HW using Work Queues with
         a special new type of WQE (Work Queue Element).

       SMFS mode is faster and provides a better rule insertion rate than
       the default DMFS mode.
   * - ``fdb_large_groups``
     - u32
     - driverinit
     - Control the number of large groups (size > 1) in the FDB table.

       * The default value is 15, and the range is between 1 and 1024.
   * - ``esw_multiport``
     - Boolean
     - runtime
     - Control MultiPort E-Switch shared FDB mode.

       An experimental mode where a single E-Switch is used and all the vports
       and physical ports on the NIC are connected to it.

       An example is to send traffic from a VF that is created on PF0 to an
       uplink that is natively associated with the uplink of PF1.

       Note: Future devices, ConnectX-8 and onward, will eventually have this
       as the default to allow forwarding between all NIC ports in a single
       E-Switch environment, and the dual E-Switch mode will likely get
       deprecated.

       Default: disabled
   * - ``esw_port_metadata``
     - Boolean
     - runtime
     - When applicable, disabling eswitch metadata can increase packet rate up
       to 20% depending on the use case and packet sizes.

       Eswitch port metadata state controls whether to internally tag packets
       with metadata. Metadata tagging must be enabled for multi-port RoCE,
       failover between representors and stacked devices. By default metadata is
       enabled on the supported devices in E-switch. Metadata is applicable only
       for E-switch in switchdev mode and users may disable it when NONE of the
       below use cases will be in use:

       1. HCA is in Dual/multi-port RoCE mode.
       2. VF/SF representor bonding (usually used for live migration).
       3. Stacked devices.

       When metadata is disabled, the above use cases will fail to initialize if
       users try to enable them.

       Note: Setting this parameter does not take effect immediately. Setting
       must happen in legacy mode and eswitch port metadata takes effect after
       enabling switchdev mode.
   * - ``hairpin_num_queues``
     - u32
     - driverinit
     - We refer to a TC NIC rule that involves forwarding as "hairpin".
       Hairpin queues are an mlx5 hardware-specific implementation for hardware
       forwarding of such packets.

       Control the number of hairpin queues.
   * - ``hairpin_queue_size``
     - u32
     - driverinit
     - Control the size (in packets) of the hairpin queues.
   * - ``pcie_cong_inbound_high``
     - u16
     - driverinit
     - High threshold configuration for PCIe congestion events. The firmware
       sends an event once device-side inbound PCIe traffic has been above
       the configured high threshold for a long enough period (at least
       200ms).

       See the ``pci_bw_inbound_high`` ethtool stat.

       Units are 0.01%. Accepted values are in the range [0, 10000], and
       ``pcie_cong_inbound_low`` must be lower than
       ``pcie_cong_inbound_high``.
       Default value: 9000 (corresponds to 90%).
   * - ``pcie_cong_inbound_low``
     - u16
     - driverinit
     - Low threshold configuration for PCIe congestion events. The firmware
       sends an event once device-side inbound PCIe traffic drops below the
       configured low threshold, only after having previously been in a
       congested state.

       See the ``pci_bw_inbound_low`` ethtool stat.

       Units are 0.01%. Accepted values are in the range [0, 10000], and
       ``pcie_cong_inbound_low`` must be lower than
       ``pcie_cong_inbound_high``.
       Default value: 7500 (corresponds to 75%).
   * - ``pcie_cong_outbound_high``
     - u16
     - driverinit
     - High threshold configuration for PCIe congestion events. The firmware
       sends an event once device-side outbound PCIe traffic has been above
       the configured high threshold for a long enough period (at least
       200ms).

       See the ``pci_bw_outbound_high`` ethtool stat.

       Units are 0.01%. Accepted values are in the range [0, 10000], and
       ``pcie_cong_outbound_low`` must be lower than
       ``pcie_cong_outbound_high``.
       Default value: 9000 (corresponds to 90%).
   * - ``pcie_cong_outbound_low``
     - u16
     - driverinit
     - Low threshold configuration for PCIe congestion events. The firmware
       sends an event once device-side outbound PCIe traffic drops below the
       configured low threshold, only after having previously been in a
       congested state.

       See the ``pci_bw_outbound_low`` ethtool stat.

       Units are 0.01%. Accepted values are in the range [0, 10000], and
       ``pcie_cong_outbound_low`` must be lower than
       ``pcie_cong_outbound_high``.
       Default value: 7500 (corresponds to 75%).
   * - ``cqe_compress_type``
     - string
     - permanent
     - Selects the mechanism/algorithm the NIC uses to decide how aggressively
       CQEs are compressed, depending on PCIe bus conditions and other
       internal NIC factors. This mode affects all queues that enable
       compression.

       * ``balanced`` : Merges fewer CQEs, resulting in a moderate compression
         ratio but maintaining a balance between bandwidth savings and
         performance.
       * ``aggressive`` : Merges more CQEs into a single entry, achieving a
         higher compression rate and maximizing performance, particularly
         under high traffic loads.
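
The ``pcie_cong_*`` thresholds in the table above are expressed in units of
0.01%, so a percentage is multiplied by 100 before it is passed to devlink.
The following is a minimal sketch of that arithmetic and of the documented
low < high constraint, assuming bash and whole-number percentages.
``pct_to_cong_units``, ``valid_cong_pair`` and ``set_pcie_cong_param`` are
hypothetical helper names, the device address is an example, and ``DEVLINK``
is an assumed override used only to allow a dry run.

.. code-block:: bash

   # Convert a whole-number percentage (0..100) to units of 0.01%
   pct_to_cong_units() {
       local pct=$1
       # Reject empty input, non-digits and leading zeros
       case $pct in
           ''|*[!0-9]*|0?*) return 1 ;;
       esac
       [ ${#pct} -le 3 ] && [ "$pct" -le 100 ] || return 1
       echo $(( pct * 100 ))
   }

   # Documented constraint: the low threshold must be strictly below the high one
   valid_cong_pair() {
       [ "$1" -lt "$2" ]
   }

   # Store one threshold given as a percentage. These are driverinit
   # parameters, so a "devlink dev reload" is still needed afterwards.
   set_pcie_cong_param() {
       local dev=$1 name=$2 units
       units=$(pct_to_cong_units "$3") || return 1
       ${DEVLINK:-devlink} dev param set "$dev" name "$name" value "$units" cmode driverinit
   }

For example, ``pct_to_cong_units 90`` prints ``9000``, which matches the
documented default of the high thresholds.
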

The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.
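
As an illustration of the ordering described for ``esw_port_metadata`` in the
table above (set the parameter in legacy mode, then enable switchdev mode),
the following is a minimal sketch, assuming bash and an example device
address. ``disable_esw_port_metadata`` is a hypothetical helper name and
``DEVLINK`` is an assumed override used only to allow a dry run.

.. code-block:: bash

   disable_esw_port_metadata() {
       local dev=$1 devlink=${DEVLINK:-devlink}
       # The parameter must be set while the E-switch is in legacy mode
       $devlink dev eswitch set "$dev" mode legacy &&
       $devlink dev param set "$dev" name esw_port_metadata value false cmode runtime &&
       # The new metadata state takes effect once switchdev mode is enabled
       $devlink dev eswitch set "$dev" mode switchdev
   }

Running ``DEVLINK=echo disable_esw_port_metadata pci/0000:06:00.0`` prints the
three commands without touching the device.
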

Info versions
=============

The ``mlx5`` driver reports the following versions:

.. list-table:: devlink info versions implemented
   :widths: 5 5 90

   * - Name
     - Type
     - Description
   * - ``fw.psid``
     - fixed
     - Used to represent the board id of the device.
   * - ``fw.version``
     - stored, running
     - Three-part major.minor.subminor firmware version number.
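
These versions are reported by ``devlink dev info``. The following is a small
sketch, assuming bash, of splitting a ``fw.version`` string into its three
components. ``split_fw_version`` is a hypothetical helper, and the version
string in the usage note is a made-up placeholder rather than output from a
real device.

.. code-block:: bash

   # Query the versions reported by the driver (example device address):
   #   devlink dev info pci/0000:82:00.0

   # Hypothetical helper: print "major minor subminor" for a version string
   split_fw_version() {
       local IFS=.
       # Split the argument on dots into the positional parameters
       set -- $1
       # Require exactly three non-empty components
       [ $# -eq 3 ] && [ -n "$1" ] && [ -n "$2" ] && [ -n "$3" ] || return 1
       echo "$1 $2 $3"
   }

For example, ``split_fw_version 1.2.3`` prints ``1 2 3``, while a string with
fewer or more components is rejected.
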

Health reporters
================

tx reporter
-----------
The tx reporter is responsible for reporting and recovering from the following
three error scenarios:

- tx timeout
    Report on kernel tx timeout detection.
    Recover by searching for lost interrupts.
- tx error completion
    Report on error tx completion.
    Recover by flushing the tx queue and resetting it.
- tx PTP port timestamping CQ unhealthy
    Report too many CQEs never delivered on port ts CQ.
    Recover by flushing and re-creating all PTP channels.

The tx reporter also supports an on-demand diagnose callback, which provides
real-time information about the status of its send queues.

User command examples:

- Diagnose send queues status::

    $ devlink health diagnose pci/0000:82:00.0 reporter tx

.. note::
   This command has valid output only when the interface is up. Otherwise, the
   command has empty output.

- Show the number of tx errors indicated, the number of recover flows that
  ended successfully, whether autorecover is enabled, and the grace period
  since the last recover::

    $ devlink health show pci/0000:82:00.0 reporter tx
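
The autorecover setting and grace period shown by the last command can be
changed with ``devlink health set``. The following is a minimal sketch,
assuming bash. ``set_reporter_policy`` is a hypothetical wrapper, the 500 ms
grace period in the usage note is an arbitrary example value rather than a
recommendation, and ``DEVLINK`` is an assumed override used only to allow a
dry run.

.. code-block:: bash

   # Hypothetical wrapper: set the grace period (in msec) and autorecover
   # state of one health reporter.
   set_reporter_policy() {
       local dev=$1 reporter=$2 grace_ms=$3 auto=$4
       # Only "true" or "false" are accepted for auto_recover
       case $auto in
           true|false) ;;
           *) return 1 ;;
       esac
       # The grace period must be a non-negative integer
       case $grace_ms in
           ''|*[!0-9]*) return 1 ;;
       esac
       ${DEVLINK:-devlink} health set "$dev" reporter "$reporter" \
           grace_period "$grace_ms" auto_recover "$auto"
   }

For example, ``set_reporter_policy pci/0000:82:00.0 tx 500 true`` keeps
autorecover enabled with a 500 ms grace period.
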

rx reporter
-----------
The rx reporter is responsible for reporting and recovering from the following
two error scenarios:

- rx queues' initialization (population) timeout
    Population of rx queues' descriptors on ring initialization is done
    in napi context via triggering an irq. In case of a failure to get
    the minimum amount of descriptors, a timeout would occur, and
    descriptors could be recovered by polling the EQ (Event Queue).
- rx completions with errors (reported by HW on interrupt context)
    Report on rx completion error.
    Recover (if needed) by flushing the related queue and resetting it.

The rx reporter also supports an on-demand diagnose callback, which
provides real-time information about the status of its receive queues.

- Diagnose rx queues' status and corresponding completion queue::

    $ devlink health diagnose pci/0000:82:00.0 reporter rx

.. note::
   This command has valid output only when the interface is up. Otherwise, the
   command has empty output.

- Show the number of rx errors indicated, the number of recover flows that
  ended successfully, whether autorecover is enabled, and the grace period
  since the last recover::

    $ devlink health show pci/0000:82:00.0 reporter rx

fw reporter
-----------
The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of a fw error, such as a fw syndrome, by triggering
a fw core dump and storing it in the dump buffer.
The fw reporter diagnose command can be triggered at any time by the user to
check the current fw status.

User command examples:

- Check fw health status::

    $ devlink health diagnose pci/0000:82:00.0 reporter fw

- Read the FW core dump if already stored, or trigger a new one::

    $ devlink health dump show pci/0000:82:00.0 reporter fw

.. note::
   This command can run only on the PF which has fw tracer ownership;
   running it on another PF or on any VF will return "Operation not permitted".
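
Because ``dump show`` returns a stored dump when one exists, getting a fresh
dump means clearing the stored one first. The following is a minimal sketch,
assuming bash. ``fresh_dump`` is a hypothetical helper name and ``DEVLINK`` is
an assumed override used only to allow a dry run.

.. code-block:: bash

   fresh_dump() {
       local dev=$1 reporter=$2 devlink=${DEVLINK:-devlink}
       # Drop the stored dump; ignore the result in case nothing was stored
       $devlink health dump clear "$dev" reporter "$reporter" || true
       # With no stored dump left, this triggers a new one and prints it
       $devlink health dump show "$dev" reporter "$reporter"
   }

For example, ``fresh_dump pci/0000:82:00.0 fw`` would be run on the PF that
owns the fw tracer.
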

fw fatal reporter
-----------------
The fw fatal reporter implements `dump` and `recover` callbacks.
It follows fatal error indications with a CR-space dump and a recover flow.
The CR-space dump uses the vsc interface, which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
The recover function runs the recover flow, which reloads the driver and
triggers a fw reset if needed.
On firmware error, the health buffer is dumped into dmesg. The log
level is derived from the error's severity (given in the health buffer).

User command examples:

- Run the fw recover flow manually::

    $ devlink health recover pci/0000:82:00.0 reporter fw_fatal

- Read the FW CR-space dump if already stored, or trigger a new one::

    $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal

.. note::
   This command can run only on a PF.

vnic reporter
-------------
The vnic reporter implements only the `diagnose` callback.
It is responsible for querying the vnic diagnostic counters from fw and
displaying them in real time.

Description of the vnic counters:

- total_error_queues
        number of queues in an error state due to
        an async error or errored command.
- send_queue_priority_update_flow
        number of QP/SQ priority/SL update events.
- cq_overrun
        number of times CQ entered an error state due to an overflow.
- async_eq_overrun
        number of times an EQ mapped to async events was overrun.
- comp_eq_overrun
        number of times an EQ mapped to completion events was
        overrun.
- quota_exceeded_command
        number of commands issued and failed due to quota exceeded.
- invalid_command
        number of commands issued and failed due to any reason other than quota
        exceeded.
- nic_receive_steering_discard
        number of packets that completed RX flow
        steering but were discarded due to a mismatch in flow table.
- generated_pkt_steering_fail
        number of packets generated by the VNIC experiencing unexpected steering
        failure (at any point in steering flow).
- handled_pkt_steering_fail
        number of packets handled by the VNIC experiencing unexpected steering
        failure (at any point in steering flow owned by the VNIC, including the FDB
        for the eswitch owner).
- icm_consumption
        amount of Interconnect Host Memory (ICM) consumed by the vnic in
        granularity of 4KB. ICM is host memory allocated by SW upon HCA request
        and is used for storing data structures that control HCA operation.
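
Since ``icm_consumption`` is reported in units of 4KB, converting it to a
plain size is a single multiplication. The following is a small sketch,
assuming bash and that 4KB means 4096 bytes. ``icm_units_to_kib`` is a
hypothetical helper, and the counter value in the usage note is a made-up
input rather than a value read from a device.

.. code-block:: bash

   # Hypothetical helper: convert an icm_consumption counter value
   # (number of 4KB units) to KiB.
   icm_units_to_kib() {
       # Reject empty input, non-digits and leading zeros
       case $1 in
           ''|*[!0-9]*|0?*) return 1 ;;
       esac
       echo $(( $1 * 4 ))
   }

For example, ``icm_units_to_kib 256`` prints ``1024``, that is, 1 MiB of ICM.
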

User command examples:

- Diagnose PF/VF vnic counters::

        $ devlink health diagnose pci/0000:82:00.1 reporter vnic

- Diagnose representor vnic counters (performed by supplying the devlink port
  of the representor, which can be obtained via the devlink port command)::

        $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic

.. note::
   This command can run over all interfaces such as PF/VF and representor ports.
393