.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, while the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues, enabling
       fine-grained control because scheduling parameters can be configured
       at any given layer of the tree.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example would be sending traffic to a number of queues that is not
       a multiple of 8. Because the maximum radix is limited to 8 in the
       9-layer topology, the 9th queue has a different parent than the rest,
       and it's given more bandwidth credits.
       This causes a problem when the system is sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from 9th

       To address this need, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change, all
       queues can be assigned to the same parent in the tree, so they get
       equal performance. The obvious drawback of this solution is a lower
       configuration depth of the tree.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use the 5-layer
       topology, use a value of 5. For example:

       | $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       | $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
   * - ``msix_vec_per_pf_max``
     - driverinit
     - Set the maximum number of MSI-X vectors that can be used by the PF;
       the rest can be used for SR-IOV. The range is from the minimum value
       set in ``msix_vec_per_pf_min`` to 2k/number of ports.
   * - ``msix_vec_per_pf_min``
     - driverinit
     - Set the minimum number of MSI-X vectors that will be used by the PF.
       This value determines how many MSI-X vectors are allocated
       statically. The range is from 2 to the value set in
       ``msix_vec_per_pf_max``.

.. list-table:: Driver specific parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Description
   * - ``local_forwarding``
     - runtime
     - Controls loopback behavior by tuning scheduler bandwidth.
       It impacts all kinds of functions: physical, virtual and
       subfunctions.
       Supported values are:

       ``enabled`` - loopback traffic is allowed on this port

       ``disabled`` - loopback traffic is not allowed on this port

       ``prioritized`` - loopback traffic is prioritized on this port

       The default value of the ``local_forwarding`` parameter is
       ``enabled``. ``prioritized`` provides the ability to adjust the
       loopback traffic rate to increase one port's capacity at the cost of
       another. The user needs to disable local forwarding on one of the
       ports in order to have increased capacity on the ``prioritized``
       port.

Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
    :widths: 5 5 5 90

    * - Name
      - Type
      - Example
      - Description
    * - ``board.id``
      - fixed
      - K65390-000
      - The Product Board Assembly (PBA) identifier of the board.
    * - ``cgu.id``
      - fixed
      - 36
      - The Clock Generation Unit (CGU) hardware revision identifier.
    * - ``fw.mgmt``
      - running
      - 2.1.7
      - 3-digit version number of the management firmware running on the
        Embedded Management Processor of the device. It controls the PHY,
        link, access to device resources, etc. Intel documentation refers to
        this as the EMP firmware.
    * - ``fw.mgmt.api``
      - running
      - 1.5.1
      - 3-digit version number (major.minor.patch) of the API exported over
        the AdminQ by the management firmware. Used by the driver to
        identify what commands are supported. Historical versions of the
        kernel only displayed a 2-digit version number (major.minor).
    * - ``fw.mgmt.build``
      - running
      - 0x305d955f
      - Unique identifier of the source for the management firmware.
    * - ``fw.undi``
      - running
      - 1.2581.0
      - Version of the Option ROM containing the UEFI driver. The version is
        reported in ``major.minor.patch`` format.
        The major version is
        incremented whenever a major breaking change occurs, or when the
        minor version would overflow. The minor version is incremented for
        non-breaking changes and reset to 1 when the major version is
        incremented. The patch version is normally 0 but is incremented when
        a fix is delivered as a patch against an older base Option ROM.
    * - ``fw.psid.api``
      - running
      - 0.80
      - Version defining the format of the flash contents.
    * - ``fw.bundle_id``
      - running
      - 0x80002ec0
      - Unique identifier of the firmware image file that was loaded onto
        the device. Also referred to as the EETRACK identifier of the NVM.
    * - ``fw.app.name``
      - running
      - ICE OS Default Package
      - The name of the DDP package that is active in the device. The DDP
        package is loaded by the driver during initialization. Each
        variation of the DDP package has a unique name.
    * - ``fw.app``
      - running
      - 1.3.1.0
      - The version of the DDP package that is active in the device. Note
        that both the name (as reported by ``fw.app.name``) and version are
        required to uniquely identify the package.
    * - ``fw.app.bundle_id``
      - running
      - 0xc0000001
      - Unique identifier for the DDP package loaded in the device. Also
        referred to as the DDP Track ID. Can be used to uniquely identify
        the specific DDP package.
    * - ``fw.netlist``
      - running
      - 1.1.2000-6.7.0
      - The version of the netlist module. This module defines the device's
        Ethernet capabilities and default settings, and is used by the
        management firmware as part of managing link and device
        connectivity.
    * - ``fw.netlist.build``
      - running
      - 0xee16ced7
      - The first 4 bytes of the hash of the netlist module contents.
    * - ``fw.cgu``
      - running
      - 8032.16973825.6021
      - The version of Clock Generation Unit (CGU). Format:
        <CGU type>.<configuration version>.<firmware version>.
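
The three-part ``fw.cgu`` format described in the table can be split into
its fields with plain POSIX shell. The following is only a sketch: the
variable names are illustrative, and the version string is the example
value from the table rather than output read from a device.

.. code:: shell

    # Example value from the table; on a live system the string would
    # come from the "fw.cgu" entry of "devlink dev info".
    cgu_version="8032.16973825.6021"

    # Split on the first two dots using POSIX parameter expansion.
    cgu_type=${cgu_version%%.*}      # text before the first dot
    rest=${cgu_version#*.}           # text after the first dot
    cgu_config=${rest%%.*}           # configuration version
    cgu_fw=${rest#*.}                # firmware version

    echo "type=$cgu_type config=$cgu_config firmware=$cgu_fw"

For the example value this prints
``type=8032 config=16973825 firmware=6021``.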

Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.

Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device specific Embedded
Management Processor reset which requests the device to reset and reload the
EMP firmware image.
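
Flashing and activation are typically performed back to back. The sketch
below chains the two steps in a small shell function. The function name
``flash_and_activate`` and the ``DEVLINK`` variable are hypothetical
helpers introduced for illustration, and the image name is a placeholder;
only the ``devlink dev flash`` and ``devlink dev reload`` commands
themselves come from the devlink tool.

.. code:: shell

    # Hypothetical wrapper, not part of the driver or of iproute2.
    # DEVLINK can be overridden, for example to dry-run with "echo".
    DEVLINK=${DEVLINK:-devlink}

    # Flash a combined image, then activate it without a reboot.
    # The reload is attempted only if the flash step succeeded.
    flash_and_activate() {
        dev=$1      # devlink handle, e.g. pci/0000:01:00.0
        image=$2    # firmware image name (placeholder)
        "$DEVLINK" dev flash "$dev" file "$image" &&
            "$DEVLINK" dev reload "$dev" action fw_activate
    }

Running it with ``DEVLINK=echo`` prints the two commands instead of
executing them, which is a convenient way to check the arguments first.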

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.

Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
    :widths: 15 85

    * - Name
      - Description
    * - ``nvm-flash``
      - The contents of the entire flash chip, sometimes referred to as
        the device's Non Volatile Memory.
    * - ``shadow-ram``
      - The contents of the Shadow RAM, which is loaded from the beginning
        of the flash. Although the contents are primarily from the flash,
        this area also contains data generated during device boot which is
        not stored in flash.
    * - ``device-caps``
      - The contents of the device firmware's capabilities buffer. Useful to
        determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API. It allows Hierarchical
QoS to be offloaded to the hardware. The user can group Virtual Functions
in a tree structure and assign the supported parameters ``tx_share``,
``tx_max``, ``tx_priority`` and ``tx_weight`` to each node in the tree. In
effect, the user controls how much bandwidth is allocated to each VF
group, and the HW enforces that allocation.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example the creation of a new traffic class. The driver will prevent
DCB or ADQ configuration if the user has started making any changes to the
nodes using the devlink-rate API. To configure those features, a driver
reload is necessary. Conversely, if ADQ or DCB is configured, the driver
won't export the hierarchy at all, or will remove the untouched hierarchy
if those features are enabled after the hierarchy is exported, but before
any changes are made.

This feature is also dependent on switchdev being enabled in the system.
It's required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by ``node_0``. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
    :widths: 15 85

    * - Name
      - Description
    * - ``tx_max``
      - maximum bandwidth to be consumed by the tree node.
        The rate limit is an absolute number specifying the maximum number
        of bytes a node may consume during the course of one second. The
        rate limit guarantees that a link will not oversaturate the
        receiver on the remote end and also enforces an SLA between the
        subscriber and network provider.
    * - ``tx_share``
      - minimum bandwidth allocated to a tree node when it is not blocked.
        It specifies an absolute BW. While ``tx_max`` defines the maximum
        bandwidth the node may consume, ``tx_share`` marks the committed BW
        for the node.
    * - ``tx_priority``
      - allows for usage of strict priority arbiter among siblings. This
        arbitration scheme attempts to schedule nodes based on their
        priority as long as the nodes remain within their bandwidth limit.
        Range 0-7. Nodes with priority 7 have the highest priority and are
        selected first, while nodes with priority 0 have the lowest
        priority. Nodes that have the same priority are treated equally.
    * - ``tx_weight``
      - allows for usage of Weighted Fair Queuing arbitration scheme among
        siblings. This arbitration scheme can be used simultaneously with
        the strict priority. Range 1-200. Only relative values matter for
        arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # create a custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # create a second custom node below it
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign the second VF to the newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
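
Because only relative ``tx_weight`` values matter, scaling every sibling's
weight by the same factor leaves the arbitration unchanged. The arithmetic
can be sketched in shell as follows. This assumes an idealized Weighted
Fair Queuing model in which each sibling's share is its weight divided by
the sum of the sibling weights; the function name ``wfq_shares`` is
hypothetical, and real hardware behavior may differ from this model.

.. code:: shell

    # Print the idealized WFQ share (integer percent, rounded down) for
    # each sibling weight given on the command line.
    wfq_shares() {
        total=0
        for w in "$@"; do
            total=$((total + w))
        done
        for w in "$@"; do
            echo "$((w * 100 / total))"
        done
    }

    wfq_shares 5 10 5       # prints 25, 50 and 25, one per line
    wfq_shares 50 100 50    # same ratios, so the same shares

Both calls print identical shares, which is why a sibling with weight 5
next to one with weight 10 is treated the same as 50 next to 100.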