.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, while the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues, enabling
       fine-grained control because scheduling parameters can be configured
       at any given layer of the tree.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example is sending traffic to a number of queues that is not a
       multiple of 8. Because the maximum radix is limited to 8 in the 9-layer
       topology, the 9th queue has a different parent than the rest, and it
       is given more bandwidth credits. This causes a problem when the system
       is sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from the 9th queue

       To address this need, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change, all
       queues can be assigned to the same parent in the tree, so they
       perform equally. The obvious drawback of this solution is a lower
       configuration depth of the tree.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use the 5-layer topology,
       use a value of 5. For example:

       | $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       | $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
   * - ``msix_vec_per_pf_max``
     - driverinit
     - Set the maximum number of MSI-X vectors that can be used by the PF;
       the rest can be used for SR-IOV. The range is from the minimum value
       set in ``msix_vec_per_pf_min`` to 2k/number of ports.
   * - ``msix_vec_per_pf_min``
     - driverinit
     - Set the minimum number of MSI-X vectors that will be used by the PF.
       This value determines how many MSI-X vectors are allocated statically.
       The range is from 2 to the value set in ``msix_vec_per_pf_max``.
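
The radix limit described for ``tx_scheduling_layers`` can be illustrated
with a small model. The sketch below is not driver or firmware code: the
``group_queues`` helper is hypothetical, and it assumes that queues simply
fill one parent node up to the maximum radix before spilling into the next.
It only shows why 9 queues cannot share one parent when the radix is 8, and
why a radix of 512 removes that imbalance.

.. code:: python

    def group_queues(num_queues, max_radix):
        """Return how many queues land under each parent node.

        Simplified model (an assumption, not the firmware algorithm):
        parents are filled in order, each holding at most max_radix queues.
        """
        if num_queues < 0 or max_radix < 1:
            raise ValueError("num_queues must be >= 0 and max_radix >= 1")
        full, rest = divmod(num_queues, max_radix)
        return [max_radix] * full + ([rest] if rest else [])


    # 9-layer topology: radix 8, so the 9th queue gets a parent of its own.
    print(group_queues(9, 8))    # [8, 1]
    # 5-layer topology: radix 512, so all 9 queues share one parent.
    print(group_queues(9, 512))  # [9]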

.. list-table:: Driver specific parameters implemented
    :widths: 5 5 90

    * - Name
      - Mode
      - Description
    * - ``local_forwarding``
      - runtime
      - Controls loopback behavior by tuning scheduler bandwidth.
        It impacts all kinds of functions: physical, virtual and
        subfunctions.
        Supported values are:

        ``enabled`` - loopback traffic is allowed on this port

        ``disabled`` - loopback traffic is not allowed on this port

        ``prioritized`` - loopback traffic is prioritized on this port

        The default value of the ``local_forwarding`` parameter is ``enabled``.
        ``prioritized`` makes it possible to adjust the loopback traffic rate
        to increase the capacity of one port at the cost of another. The user
        needs to disable local forwarding on one of the ports in order to
        have increased capacity on the ``prioritized`` port.

Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
    :widths: 5 5 5 90

    * - Name
      - Type
      - Example
      - Description
119      - Description
120    * - ``board.id``
121      - fixed
122      - K65390-000
123      - The Product Board Assembly (PBA) identifier of the board.
124    * - ``cgu.id``
125      - fixed
126      - 36
127      - The Clock Generation Unit (CGU) hardware revision identifier.
128    * - ``fw.mgmt``
129      - running
130      - 2.1.7
131      - 3-digit version number of the management firmware running on the
132        Embedded Management Processor of the device. It controls the PHY,
133        link, access to device resources, etc. Intel documentation refers to
134        this as the EMP firmware.
135    * - ``fw.mgmt.api``
136      - running
137      - 1.5.1
138      - 3-digit version number (major.minor.patch) of the API exported over
139        the AdminQ by the management firmware. Used by the driver to
140        identify what commands are supported. Historical versions of the
141        kernel only displayed a 2-digit version number (major.minor).
142    * - ``fw.mgmt.build``
143      - running
144      - 0x305d955f
145      - Unique identifier of the source for the management firmware.
146    * - ``fw.undi``
147      - running
148      - 1.2581.0
149      - Version of the Option ROM containing the UEFI driver. The version is
150        reported in ``major.minor.patch`` format. The major version is
151        incremented whenever a major breaking change occurs, or when the
152        minor version would overflow. The minor version is incremented for
153        non-breaking changes and reset to 1 when the major version is
154        incremented. The patch version is normally 0 but is incremented when
155        a fix is delivered as a patch against an older base Option ROM.
156    * - ``fw.psid.api``
157      - running
158      - 0.80
159      - Version defining the format of the flash contents.
160    * - ``fw.bundle_id``
161      - running
162      - 0x80002ec0
163      - Unique identifier of the firmware image file that was loaded onto
164        the device. Also referred to as the EETRACK identifier of the NVM.
165    * - ``fw.app.name``
166      - running
167      - ICE OS Default Package
168      - The name of the DDP package that is active in the device. The DDP
169        package is loaded by the driver during initialization. Each
170        variation of the DDP package has a unique name.
171    * - ``fw.app``
172      - running
173      - 1.3.1.0
174      - The version of the DDP package that is active in the device. Note
175        that both the name (as reported by ``fw.app.name``) and version are
176        required to uniquely identify the package.
177    * - ``fw.app.bundle_id``
178      - running
179      - 0xc0000001
180      - Unique identifier for the DDP package loaded in the device. Also
181        referred to as the DDP Track ID. Can be used to uniquely identify
182        the specific DDP package.
183    * - ``fw.netlist``
184      - running
185      - 1.1.2000-6.7.0
186      - The version of the netlist module. This module defines the device's
187        Ethernet capabilities and default settings, and is used by the
188        management firmware as part of managing link and device
189        connectivity.
190    * - ``fw.netlist.build``
191      - running
192      - 0xee16ced7
193      - The first 4 bytes of the hash of the netlist module contents.
194    * - ``fw.cgu``
195      - running
196      - 8032.16973825.6021
197      - The version of Clock Generation Unit (CGU). Format:
198        <CGU type>.<configuration version>.<firmware version>.
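
As an illustration of the ``fw.cgu`` format listed in the table, the sketch
below splits the example value into its three fields. The ``CguVersion``
type and the ``parse_cgu_version`` helper are hypothetical names, and the
code assumes that each field is a decimal integer, as in the example value.

.. code:: python

    from typing import NamedTuple


    class CguVersion(NamedTuple):
        cgu_type: int
        config_version: int
        firmware_version: int


    def parse_cgu_version(text):
        """Split '<CGU type>.<configuration version>.<firmware version>'."""
        parts = text.strip().split(".")
        if len(parts) != 3:
            raise ValueError(f"expected three dot-separated fields, got {text!r}")
        # Assumption: every field is a decimal integer.
        return CguVersion(*(int(part) for part in parts))


    version = parse_cgu_version("8032.16973825.6021")
    print(version.cgu_type, version.config_version, version.firmware_version)
    # 8032 16973825 6021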

Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.
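
The accepted overwrite masks can be summarized in a few lines. This is a
sketch of the rules stated above, not the driver's implementation: the
``Overwrite`` flag class and ``describe_overwrite`` are hypothetical, and
the numeric bit values are chosen only for illustration.

.. code:: python

    from enum import Flag


    class Overwrite(Flag):
        """Stand-ins for the DEVLINK_FLASH_OVERWRITE_* bits (values illustrative)."""
        NONE = 0
        SETTINGS = 1
        IDENTIFIERS = 2


    def describe_overwrite(mask):
        """Return the behavior described above for an overwrite mask."""
        if mask == Overwrite.NONE:
            return "preserve all settings and identifiers"
        if mask == Overwrite.SETTINGS:
            return "overwrite settings, preserve identifiers"
        if mask == Overwrite.SETTINGS | Overwrite.IDENTIFIERS:
            return "overwrite settings and identifiers"
        # IDENTIFIERS on its own is the one combination that is rejected.
        raise ValueError("unsupported overwrite mask: " + str(mask))


    print(describe_overwrite(Overwrite.SETTINGS))
    # overwrite settings, preserve identifiers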

Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device-specific Embedded
Management Processor reset, which requests the device to reset and reload the
EMP firmware image.

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.
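
The selection rule for repeated split requests can be sketched as follows.
This is a model of the behavior described above, not the driver's code:
``next_split_option`` is a hypothetical helper, and it assumes that the
search starts after the pending option (or the active one if nothing is
pending) and wraps around at the end of the list.

.. code:: python

    def next_split_option(split_counts, current_index, requested_count):
        """Pick the next option index whose split count matches the request.

        split_counts: split count of each FW option, in the listed order.
        current_index: index of the pending option if one exists, else the
            active option (an assumption of this sketch).
        The search starts after current_index and wraps around (also an
        assumption), so repeating a request cycles through the options
        that share the same split count.
        """
        total = len(split_counts)
        for step in range(1, total + 1):
            candidate = (current_index + step) % total
            if split_counts[candidate] == requested_count:
                return candidate
        raise LookupError(f"no port option with split count {requested_count}")


    # Split counts from the option table above, in order.
    counts = [2, 2, 4, 4, 8, 1]
    first = next_split_option(counts, 0, 4)       # from the active option
    second = next_split_option(counts, first, 4)  # same request issued again
    print(first, second)  # 2 3

With these inputs, an unsplit request corresponds to
``next_split_option(counts, second, 1)``, which selects the last option in
the table.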

Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
    :widths: 15 85

    * - Name
      - Description
    * - ``nvm-flash``
      - The contents of the entire flash chip, sometimes referred to as
        the device's Non Volatile Memory.
    * - ``shadow-ram``
      - The contents of the Shadow RAM, which is loaded from the beginning
        of the flash. Although the contents are primarily from the flash,
        this area also contains data generated during device boot which is
        not stored in flash.
    * - ``device-caps``
      - The contents of the device firmware's capabilities buffer. Useful to
        determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
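
Dump output like the above can be post-processed with a short script. The
following is a sketch rather than a supported tool: ``parse_dump`` and
``capability_ids`` are hypothetical helpers. The 32-byte record size and the
little-endian 16-bit ID at the start of each record are assumptions inferred
from the sample ``device-caps`` dump, not a documented format. The parser
keeps bytes in the order they are printed and does not interpret the 16-bit
groups shown for ``nvm-flash``.

.. code:: python

    import struct


    def parse_dump(text):
        """Turn 'offset hex hex ...' dump lines into one bytes object."""
        data = bytearray()
        for line in text.strip().splitlines():
            offset_field, *hex_fields = line.split()
            # Each line must continue exactly where the previous one ended.
            if int(offset_field, 16) != len(data):
                raise ValueError(f"unexpected offset in line: {line!r}")
            data += bytes.fromhex("".join(hex_fields))
        return bytes(data)


    def capability_ids(data, record_size=32):
        """Read a little-endian 16-bit ID from the start of each record.

        The record size and the ID field are assumptions inferred from
        the sample dump.
        """
        return [struct.unpack_from("<H", data, pos)[0]
                for pos in range(0, len(data) - record_size + 1, record_size)]


    sample = """
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    """
    print(capability_ids(parse_dump(sample)))  # [1, 2]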

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API. It allows Hierarchical
QoS to be offloaded to the hardware. It lets the user group Virtual
Functions in a tree structure and assign the supported parameters
(``tx_share``, ``tx_max``, ``tx_priority`` and ``tx_weight``) to each node
in the tree. Effectively, the user gains the ability to control how much
bandwidth is allocated to each VF group. This is then enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and with ADQ, or with any driver feature that would trigger changes
in QoS, for example the creation of a new traffic class. The driver will
prevent DCB or ADQ configuration if the user has started making any changes
to the nodes using the devlink-rate API. To configure those features, a
driver reload is necessary. Correspondingly, if ADQ or DCB is configured,
the driver won't export the hierarchy at all, or will remove the untouched
hierarchy if those features are enabled after the hierarchy is exported
but before any changes are made.

This feature also depends on switchdev being enabled in the system.
This is required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by ``node_0``. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
    :widths: 15 85

    * - Name
      - Description
    * - ``tx_max``
      - maximum bandwidth to be consumed by the tree node. The rate limit is
        an absolute number specifying the maximum number of bytes a node may
        consume during the course of one second. The rate limit guarantees
        that a link will not oversaturate the receiver on the remote end
        and also enforces an SLA between the subscriber and network
        provider.
    * - ``tx_share``
      - minimum bandwidth allocated to a tree node when it is not blocked.
        It specifies an absolute BW. While ``tx_max`` defines the maximum
        bandwidth the node may consume, ``tx_share`` marks the committed BW
        for the node.
    * - ``tx_priority``
      - allows the use of a strict priority arbiter among siblings. This
        arbitration scheme attempts to schedule nodes based on their
        priority as long as the nodes remain within their bandwidth limit.
        Range 0-7. Nodes with priority 7 have the highest priority and are
        selected first, while nodes with priority 0 have the lowest
        priority. Nodes that have the same priority are treated equally.
    * - ``tx_weight``
      - allows the use of the Weighted Fair Queuing arbitration scheme among
        siblings. This arbitration scheme can be used simultaneously with
        strict priority. Range 1-200. Only relative values matter for
        arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case,
nodes with the same priority form a WFQ subgroup in the sibling group,
and arbitration among them is based on the assigned weights.
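
The interaction between ``tx_priority`` and ``tx_weight`` can be pictured
with a simplified model. The sketch below is not how the hardware arbiter is
implemented: ``wfq_shares`` is a hypothetical helper, it ignores ``tx_max``
and ``tx_share``, and it assumes that nodes in the same priority subgroup
split bandwidth in direct proportion to their weights.

.. code:: python

    def wfq_shares(siblings):
        """Return bandwidth fractions for the highest-priority sibling group.

        siblings maps a node name to a (tx_priority, tx_weight) pair.
        Simplified model: only the highest-priority nodes are served, and
        they split bandwidth in proportion to their weights. Bandwidth
        limits (tx_max, tx_share) are ignored here.
        """
        if not siblings:
            return {}
        top = max(priority for priority, _ in siblings.values())
        group = {name: weight
                 for name, (priority, weight) in siblings.items()
                 if priority == top}
        total = sum(group.values())
        return {name: weight / total for name, weight in group.items()}


    siblings = {
        "vf0": (7, 3),   # priority 7, weight 3
        "vf1": (7, 1),   # priority 7, weight 1
        "vf2": (0, 5),   # lower priority, not part of the top subgroup
    }
    print(wfq_shares(siblings))  # {'vf0': 0.75, 'vf1': 0.25}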

.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
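
To inspect the exported hierarchy, the ``rate show`` output can be turned
into a parent map. This is a minimal sketch that assumes the plain-text
output format shown above; ``parse_rate_show`` and ``path_to_root`` are
hypothetical helpers, not part of devlink.

.. code:: python

    def parse_rate_show(text):
        """Map each rate object name to its parent name (None for the root)."""
        parents = {}
        for line in text.strip().splitlines():
            handle, _, attrs = line.strip().partition(": ")
            name = handle.rsplit("/", 1)[-1]
            words = attrs.split()
            parent = None
            if "parent" in words:
                parent = words[words.index("parent") + 1]
            parents[name] = parent
        return parents


    def path_to_root(parents, name):
        """Follow parent links from name up to the root node."""
        path = [name]
        while parents[path[-1]] is not None:
            path.append(parents[path[-1]])
        return path


    sample = """
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    """
    print(path_to_root(parse_rate_show(sample), "1"))
    # ['1', 'node_25', 'node_24', 'node_0']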
488