.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, while the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues. Because
       scheduling parameters can be configured at any layer of the tree,
       this enables fine-grained control.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example is sending traffic to a number of queues that is not a
       multiple of 8. Because the maximum radix is limited to 8 in the
       9-layer topology, the 9th queue has a different parent than the rest,
       and it is given more bandwidth credits. This causes a problem when
       the system is sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from the 9th queue

       To address this, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change, all
       queues can be assigned to the same parent in the tree, so they are
       scheduled equally. The drawback of this solution is a shallower
       tree, which offers fewer layers at which scheduling parameters can
       be configured.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use the 5-layer
       topology, use a value of 5. For example:

       | $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       | $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers

Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
    :widths: 5 5 5 90

    * - Name
      - Type
      - Example
      - Description
    * - ``board.id``
      - fixed
      - K65390-000
      - The Product Board Assembly (PBA) identifier of the board.
    * - ``cgu.id``
      - fixed
      - 36
      - The Clock Generation Unit (CGU) hardware revision identifier.
    * - ``fw.mgmt``
      - running
      - 2.1.7
      - 3-digit version number of the management firmware running on the
        Embedded Management Processor of the device. It controls the PHY,
        link, access to device resources, etc. Intel documentation refers to
        this as the EMP firmware.
    * - ``fw.mgmt.api``
      - running
      - 1.5.1
      - 3-digit version number (major.minor.patch) of the API exported over
        the AdminQ by the management firmware. Used by the driver to
        identify what commands are supported. Historical versions of the
        kernel only displayed a 2-digit version number (major.minor).
    * - ``fw.mgmt.build``
      - running
      - 0x305d955f
      - Unique identifier of the source for the management firmware.
    * - ``fw.undi``
      - running
      - 1.2581.0
      - Version of the Option ROM containing the UEFI driver. The version is
        reported in ``major.minor.patch`` format. The major version is
        incremented whenever a major breaking change occurs, or when the
        minor version would overflow. The minor version is incremented for
        non-breaking changes and reset to 1 when the major version is
        incremented. The patch version is normally 0 but is incremented when
        a fix is delivered as a patch against an older base Option ROM.
    * - ``fw.psid.api``
      - running
      - 0.80
      - Version defining the format of the flash contents.
    * - ``fw.bundle_id``
      - running
      - 0x80002ec0
      - Unique identifier of the firmware image file that was loaded onto
        the device. Also referred to as the EETRACK identifier of the NVM.
    * - ``fw.app.name``
      - running
      - ICE OS Default Package
      - The name of the DDP package that is active in the device. The DDP
        package is loaded by the driver during initialization. Each
        variation of the DDP package has a unique name.
    * - ``fw.app``
      - running
      - 1.3.1.0
      - The version of the DDP package that is active in the device. Note
        that both the name (as reported by ``fw.app.name``) and version are
        required to uniquely identify the package.
    * - ``fw.app.bundle_id``
      - running
      - 0xc0000001
      - Unique identifier for the DDP package loaded in the device. Also
        referred to as the DDP Track ID. Can be used to uniquely identify
        the specific DDP package.
    * - ``fw.netlist``
      - running
      - 1.1.2000-6.7.0
      - The version of the netlist module. This module defines the device's
        Ethernet capabilities and default settings, and is used by the
        management firmware as part of managing link and device
        connectivity.
    * - ``fw.netlist.build``
      - running
      - 0xee16ced7
      - The first 4 bytes of the hash of the netlist module contents.
    * - ``fw.cgu``
      - running
      - 8032.16973825.6021
      - The version of the Clock Generation Unit (CGU). Format:
        <CGU type>.<configuration version>.<firmware version>.

Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.

Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device-specific Embedded
Management Processor reset, which requests the device to reset and reload
the EMP firmware image.

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.
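
Because repeated requests cycle through the options, it can help to check
which option is currently marked ``Pending`` before rebooting. The following
is a minimal sketch, not part of the driver or of devlink: the helper name
``pending_split_count`` is hypothetical, and it assumes the debug lines keep
the column layout shown above.

.. code:: shell

    # Hypothetical helper: read ice "Available port split options" debug
    # lines on stdin and print the split count of the row marked "Pending".
    pending_split_count() {
        awk '{
            for (i = 1; i < NF; i++) {
                if ($i == "Pending") { print $(i + 1); exit }
            }
        }'
    }

    # Sample rows copied from the debug output above.
    sample='ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -'

    printf '%s\n' "$sample" | pending_split_count    # prints 4

On a live system the same helper could be fed from the kernel log, for
example ``dmesg | pending_split_count``. This assumes dynamic debug has been
enabled for the ``ice`` module, which on kernels built with
``CONFIG_DYNAMIC_DEBUG`` is typically done by writing ``module ice +p`` to
``/sys/kernel/debug/dynamic_debug/control``.
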

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.

Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
    :widths: 15 85

    * - Name
      - Description
    * - ``nvm-flash``
      - The contents of the entire flash chip, sometimes referred to as
        the device's Non Volatile Memory.
    * - ``shadow-ram``
      - The contents of the Shadow RAM, which is loaded from the beginning
        of the flash. Although the contents are primarily from the flash,
        this area also contains data generated during device boot which is
        not stored in flash.
    * - ``device-caps``
      - The contents of the device firmware's capabilities buffer. Useful to
        determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API, which allows
Hierarchical QoS to be offloaded to the hardware. It enables the user to
group Virtual Functions in a tree structure and assign the supported
parameters ``tx_share``, ``tx_max``, ``tx_priority`` and ``tx_weight`` to
each node in the tree. The user can thereby control how much bandwidth is
allocated to each VF group. The allocation is then enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and with ADQ, or with any driver feature that would trigger changes in
QoS, for example the creation of a new traffic class. The driver will prevent
DCB or ADQ configuration once the user has started making changes to the
nodes using the devlink-rate API. To configure those features, a driver
reload is necessary. Correspondingly, if ADQ or DCB is configured first, the
driver won't export the hierarchy at all. If those features are enabled
after the hierarchy is exported, but before any changes are made to it, the
driver will remove the untouched hierarchy.

This feature also depends on switchdev being enabled in the system.
This is required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by ``node_0``. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
    :widths: 15 85

    * - Name
      - Description
    * - ``tx_max``
      - Maximum bandwidth to be consumed by the tree node. The rate limit is
        an absolute number specifying the maximum amount of bytes a node may
        consume during the course of one second. The rate limit guarantees
        that a link will not oversaturate the receiver on the remote end
        and also enforces an SLA between the subscriber and network
        provider.
    * - ``tx_share``
      - Minimum bandwidth allocated to a tree node when it is not blocked.
        It specifies an absolute BW. While ``tx_max`` defines the maximum
        bandwidth the node may consume, ``tx_share`` marks the committed BW
        for the node.
    * - ``tx_priority``
      - Allows for usage of a strict priority arbiter among siblings. This
        arbitration scheme attempts to schedule nodes based on their
        priority as long as the nodes remain within their bandwidth limit.
        Range 0-7. Nodes with priority 7 have the highest priority and are
        selected first, while nodes with priority 0 have the lowest
        priority. Nodes that have the same priority are treated equally.
    * - ``tx_weight``
      - Allows for usage of the Weighted Fair Queuing arbitration scheme
        among siblings. This arbitration scheme can be used simultaneously
        with strict priority. Range 1-200. Only relative values matter for
        arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
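
The flat ``rate show`` listing does not make it obvious which chain of nodes
sits between a VF and the root. The following is a minimal sketch of a helper
that reconstructs that chain. The name ``rate_path`` is hypothetical and not
part of devlink, and the sketch assumes each output line has the
``<device>/<name>: ... parent <name>`` shape shown in the example above.

.. code:: shell

    # Hypothetical helper: read "devlink port function rate show" output on
    # stdin and print the chain of parents from the given object to the root.
    # $1 is the object name after the device prefix, e.g. "1" or "node_25".
    rate_path() {
        awk -v start="$1" '
            {
                name = $1
                sub(/^.*\//, "", name)   # drop the "pci/0000:4b:00.0/" prefix
                sub(/:$/, "", name)      # drop the trailing colon
                parent[name] = ""
                for (i = 2; i < NF; i++)
                    if ($i == "parent") parent[name] = $(i + 1)
            }
            END {
                path = start
                for (n = parent[start]; n != ""; n = parent[n])
                    path = path " -> " n
                print path
            }'
    }

    # Sample rows copied from the rate show output above.
    sample='pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25'

    printf '%s\n' "$sample" | rate_path 1
    # prints: 1 -> node_25 -> node_24 -> node_0

On a live system the helper could be fed directly, for example
``devlink port function rate show | rate_path 2``. Running it before and
after a ``rate set ... parent`` command shows where the VF moved in the tree.
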