.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, and the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues, enabling
       fine-grained control because scheduling parameters can be configured
       at any given layer of the tree.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example is sending traffic to a number of queues that is not a
       multiple of 8. Because the maximum radix is limited to 8 in the
       9-layer topology, the 9th queue has a different parent than the rest,
       and it is given more bandwidth credits. This causes a problem when
       the system is sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from the 9th queue

       To address this need, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change, all
       queues can be assigned to the same parent in the tree, so their
       performance characteristics are equal. The obvious drawback of this
       solution is a lower configuration depth of the tree.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use the 5-layer
       topology, use a value of 5. For example:

       .. code:: shell

           $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       .. code:: shell

           $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
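
The queue imbalance described for the 9-layer topology can be spotted
mechanically from per-queue counters. The following is a minimal sketch, not
part of devlink or the ``ice`` driver: the helper name ``find_hot_queues`` and
the "more than twice the mean" threshold are assumptions made for this
example. It reads lines in the ``tx_queue_<N>_packets: <count>`` format shown
in the table above on standard input and prints the queues that stand out.

.. code:: shell

    # Hypothetical helper, not part of devlink or the ice driver.
    # Reads "tx_queue_<N>_packets: <count>" lines on stdin and prints the
    # name of every queue whose count is more than twice the mean.
    find_hot_queues() {
        awk '
            BEGIN { n = 0 }
            $1 ~ /^tx_queue_[0-9]+_packets:$/ {
                q = $1
                sub(/:$/, "", q)         # drop the trailing colon
                sub(/_packets$/, "", q)  # keep only "tx_queue_<N>"
                name[n] = q
                count[n] = $2 + 0        # force a numeric value
                sum += count[n]
                n++
            }
            END {
                if (n == 0)
                    exit
                mean = sum / n
                for (i = 0; i < n; i++)
                    if (count[i] > 2 * mean)
                        print name[i]
            }'
    }

Counters in this format are typically available from ``ethtool -S`` for the
interface in question, so the output of that command can be piped into the
helper. With the nine counters listed above, only the 9th queue exceeds the
threshold.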

Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
    :widths: 5 5 5 90

    * - Name
      - Type
      - Example
      - Description
    * - ``board.id``
      - fixed
      - K65390-000
      - The Product Board Assembly (PBA) identifier of the board.
    * - ``cgu.id``
      - fixed
      - 36
      - The Clock Generation Unit (CGU) hardware revision identifier.
    * - ``fw.mgmt``
      - running
      - 2.1.7
      - 3-digit version number of the management firmware running on the
        Embedded Management Processor of the device. It controls the PHY,
        link, access to device resources, etc. Intel documentation refers to
        this as the EMP firmware.
    * - ``fw.mgmt.api``
      - running
      - 1.5.1
      - 3-digit version number (major.minor.patch) of the API exported over
        the AdminQ by the management firmware. Used by the driver to
        identify what commands are supported. Historical versions of the
        kernel only displayed a 2-digit version number (major.minor).
    * - ``fw.mgmt.build``
      - running
      - 0x305d955f
      - Unique identifier of the source for the management firmware.
    * - ``fw.undi``
      - running
      - 1.2581.0
      - Version of the Option ROM containing the UEFI driver. The version is
        reported in ``major.minor.patch`` format. The major version is
        incremented whenever a major breaking change occurs, or when the
        minor version would overflow. The minor version is incremented for
        non-breaking changes and reset to 1 when the major version is
        incremented. The patch version is normally 0 but is incremented when
        a fix is delivered as a patch against an older base Option ROM.
    * - ``fw.psid.api``
      - running
      - 0.80
      - Version defining the format of the flash contents.
    * - ``fw.bundle_id``
      - running
      - 0x80002ec0
      - Unique identifier of the firmware image file that was loaded onto
        the device. Also referred to as the EETRACK identifier of the NVM.
    * - ``fw.app.name``
      - running
      - ICE OS Default Package
      - The name of the DDP package that is active in the device. The DDP
        package is loaded by the driver during initialization. Each
        variation of the DDP package has a unique name.
    * - ``fw.app``
      - running
      - 1.3.1.0
      - The version of the DDP package that is active in the device. Note
        that both the name (as reported by ``fw.app.name``) and version are
        required to uniquely identify the package.
    * - ``fw.app.bundle_id``
      - running
      - 0xc0000001
      - Unique identifier for the DDP package loaded in the device. Also
        referred to as the DDP Track ID. Can be used to uniquely identify
        the specific DDP package.
    * - ``fw.netlist``
      - running
      - 1.1.2000-6.7.0
      - The version of the netlist module. This module defines the device's
        Ethernet capabilities and default settings, and is used by the
        management firmware as part of managing link and device
        connectivity.
    * - ``fw.netlist.build``
      - running
      - 0xee16ced7
      - The first 4 bytes of the hash of the netlist module contents.
    * - ``fw.cgu``
      - running
      - 8032.16973825.6021
      - The version of Clock Generation Unit (CGU). Format:
        <CGU type>.<configuration version>.<firmware version>.

Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.
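
The accepted overwrite combinations can be summarized in a small wrapper.
This is only a sketch under stated assumptions: the helper name
``ice_overwrite_args`` and its mode names (``preserve``, ``settings``,
``all``) are invented for this example, and the ``overwrite settings`` and
``overwrite identifier`` keywords are assumed to be the ones accepted by the
``devlink dev flash`` command of the installed iproute2 version. The helper
only builds the argument string and refuses any other request, including the
identifiers-only case.

.. code:: shell

    # Hypothetical helper: map a preservation choice to the "overwrite"
    # arguments of "devlink dev flash". The mode names are made up for
    # this example.
    ice_overwrite_args() {
        case "$1" in
            preserve) echo "" ;;
            settings) echo "overwrite settings" ;;
            all)      echo "overwrite settings overwrite identifier" ;;
            *)
                echo "unsupported overwrite mode: $1" >&2
                return 1
                ;;
        esac
    }

    # Example use (the image file name is a placeholder):
    #   devlink dev flash pci/0000:01:00.0 file <image> $(ice_overwrite_args settings)

Keeping the mapping in one place makes it harder to request the
identifiers-only combination that the driver rejects.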

Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device-specific Embedded
Management Processor reset which requests the device to reset and reload the
EMP firmware image.

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.

Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
    :widths: 15 85

    * - Name
      - Description
    * - ``nvm-flash``
      - The contents of the entire flash chip, sometimes referred to as
        the device's Non Volatile Memory.
    * - ``shadow-ram``
      - The contents of the Shadow RAM, which is loaded from the beginning
        of the flash. Although the contents are primarily from the flash,
        this area also contains data generated during device boot which is
        not stored in flash.
    * - ``device-caps``
      - The contents of the device firmware's capabilities buffer. Useful to
        determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API. It allows for offload of
Hierarchical QoS to the hardware. It lets the user group Virtual
Functions in a tree structure and assign the supported parameters
(tx_share, tx_max, tx_priority and tx_weight) to each node in the tree.
Effectively, the user gains the ability to control how much bandwidth is
allocated to each VF group. This is then enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of a new traffic class. The driver will prevent DCB
or ADQ configuration once the user has started making changes to the nodes
using the devlink-rate API. To configure those features, a driver reload is
necessary. Conversely, if ADQ or DCB is configured, the driver won't export
the hierarchy at all, or will remove the untouched hierarchy if those
features are enabled after the hierarchy is exported, but before any
changes are made.

This feature is also dependent on switchdev being enabled in the system.
It's required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it exports its internal
hierarchy as soon as VFs are created. The root of the tree is always
represented by node_0. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
    :widths: 15 85

    * - Name
      - Description
    * - ``tx_max``
      - maximum bandwidth to be consumed by the tree node. The rate limit is
        an absolute number specifying the maximum number of bytes a node may
        consume during the course of one second. The rate limit guarantees
        that a link will not oversaturate the receiver on the remote end
        and also enforces an SLA between the subscriber and network
        provider.
    * - ``tx_share``
      - minimum bandwidth allocated to a tree node when it is not blocked.
        It specifies an absolute BW. While ``tx_max`` defines the maximum
        bandwidth the node may consume, ``tx_share`` marks the committed BW
        for the node.
    * - ``tx_priority``
      - allows for usage of a strict priority arbiter among siblings. This
        arbitration scheme attempts to schedule nodes based on their
        priority as long as the nodes remain within their bandwidth limit.
        Range 0-7. Nodes with priority 7 have the highest priority and are
        selected first, while nodes with priority 0 have the lowest
        priority. Nodes that have the same priority are treated equally.
    * - ``tx_weight``
      - allows for usage of the Weighted Fair Queuing arbitration scheme
        among siblings. This arbitration scheme can be used simultaneously
        with strict priority. Range 1-200. Only relative values matter for
        arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
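
As a rough illustration of how the two attributes combine, the sketch below
models a single sibling group. It is not driver code: the helper name
``wfq_shares`` and its input format (one ``name priority weight`` triple per
line) are assumptions made for this example, and the real hardware
arbitration also depends on ``tx_share``, ``tx_max`` and the actual traffic
demand. For each sibling it prints the weight as a percentage of the total
weight of its same-priority WFQ subgroup, which shows why only relative
weights matter.

.. code:: shell

    # Hypothetical model of one sibling group, not driver code.
    # Input on stdin: "<name> <tx_priority> <tx_weight>" per line.
    # Output: each sibling's share of its same-priority WFQ subgroup.
    wfq_shares() {
        awk '
            BEGIN { n = 0 }
            NF == 3 {
                name[n] = $1
                prio[n] = $2
                weight[n] = $3
                total[$2] += $3     # sum of weights per priority value
                n++
            }
            END {
                for (i = 0; i < n; i++)
                    printf "%s priority %d share %.1f%%\n",
                           name[i], prio[i], 100 * weight[i] / total[prio[i]]
            }'
    }

For example, two siblings at priority 7 with weights 5 and 10 split their
subgroup roughly one third to two thirds; the same result would be obtained
with weights 50 and 100.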

.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps