.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - Any virtual port facing the user.

A devlink port can have a different type based on the link layer, as described
below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when a link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when a link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when driver should detect the port
       type automatically.
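
As a rough illustration of how the flavours above surface in user space, the
sketch below extracts the flavour word from one line of ``devlink port show``
output and maps it back to the kernel constant. The helper names
``port_flavour`` and ``flavour_to_const`` are hypothetical and not part of
devlink; ``pcivf`` and ``pcisf`` appear in the listings in this document, and
the remaining strings are assumed to follow the same lower-case naming.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the word that follows "flavour" in one line of `devlink port show`.
port_flavour() {
    printf '%s\n' "$1" | sed -n 's/.* flavour \([a-z]*\).*/\1/p'
}

# Map a flavour string to the corresponding kernel constant.
# Strings other than pcivf/pcisf are assumptions about the CLI naming.
flavour_to_const() {
    case "$1" in
        physical) echo DEVLINK_PORT_FLAVOUR_PHYSICAL ;;
        dsa)      echo DEVLINK_PORT_FLAVOUR_DSA ;;
        cpu)      echo DEVLINK_PORT_FLAVOUR_CPU ;;
        pcipf)    echo DEVLINK_PORT_FLAVOUR_PCI_PF ;;
        pcivf)    echo DEVLINK_PORT_FLAVOUR_PCI_VF ;;
        pcisf)    echo DEVLINK_PORT_FLAVOUR_PCI_SF ;;
        virtual)  echo DEVLINK_PORT_FLAVOUR_VIRTUAL ;;
        *)        echo "unknown flavour: $1" >&2; return 1 ;;
    esac
}

line='pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1'
flavour_to_const "$(port_flavour "$line")"
```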

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch resides on the PCI device and supports ports of multiple
controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller number = 1)
doesn't have the eswitch. The local controller (identified by controller number = 0)
has the eswitch. The devlink instance on the local controller has eswitch
devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure a function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus
right away. Hence, a function attribute should be configured before binding
the virtual function device to the driver. For subfunctions, this means the
user should configure a port function attribute before activating the port
function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

The ``migratable`` attribute may be set only on ports with
``DEVLINK_PORT_FLAVOUR_PCI_VF``.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.
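
The ordering rule above (configure first, bind or activate afterwards) can be
sketched for a VF as follows. This is a minimal dry-run sketch under stated
assumptions: the ``DEVLINK`` variable and the ``configure_vf_port`` helper are
hypothetical names introduced here, the port handle and MAC address are only
example values, and by default the script prints the commands instead of
running them.

```shell
#!/bin/sh
# Dry run by default: print the devlink commands instead of executing them.
# DEVLINK is a hypothetical knob introduced for this sketch.
DEVLINK=${DEVLINK:-echo devlink}

# Configure the function attributes of a VF through its devlink port.
# $1: devlink port handle of the VF (example value below)
# $2: hardware (MAC) address to assign
configure_vf_port() {
    $DEVLINK port function set "$1" hw_addr "$2"
    $DEVLINK port function set "$1" roce disable
}

# Step 1: set the attributes while the VF is not yet bound to a driver.
configure_vf_port pci/0000:06:00.0/2 00:11:22:33:44:55

# Step 2 (not shown): bind the VF device to its driver. How this is done
# is driver and platform specific, so it is left out of this sketch.
```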

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
rdma device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table
for this VF/SF will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce disable

migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device will disable
features which cannot be migrated. Because the migratable capability can thus
impose limitations on a VF, the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When the user enables IPsec crypto capability for a VF, user applications can
offload XFRM state crypto operations (Encrypt/Decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When the user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (Encrypt/Decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------
When the user sets the maximum number of IO event queues for an SF or
a VF, the function driver is limited to consuming only the enforced
number of IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. Subfunctions are created and deployed in units of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use
the e-switch port representor to do settings, such as putting it into a
bridge, adding TC rules, etc. A user might as well configure the hardware
address (such as the MAC address) of the subfunction while the subfunction is
inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

Rate object management
======================

Devlink provides an API to manage the tx rates of a single devlink port or a
group. This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a one-to-one mapping to its devlink port, in user space it is
  referred to as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from the userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.
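
The naming rule above can be illustrated with a small check. This is a
minimal sketch, not devlink code: ``rate_object_type`` is a hypothetical
helper that looks only at the last path component of a handle and treats a
purely decimal name as a leaf (a port index) and anything else as a node name.
The handles used are example values.

```shell
#!/bin/sh
# Hypothetical helper: classify a rate object handle by its last component.
# A purely decimal component is a port index (leaf); any other non-empty
# identifier is a node name.
rate_object_type() {
    name=${1##*/}                  # text after the last '/'
    case "$name" in
        '')        return 1 ;;     # nothing after the last '/'
        *[!0-9]*)  echo node ;;    # contains a non-digit character
        *)         echo leaf ;;    # digits only
    esac
}

rate_object_type pci/0000:03:00.0/1
rate_object_type pci/0000:03:00.0/group1
```

A name such as ``1`` would be ambiguous between a node and the leaf of port
index 1, which is why decimal node names are excluded.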

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or rate objects
  that are part of the parent group, if it is a part of the same group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with the
  strict priority. As a node is configured with a higher rate it gets more
  BW relative to its siblings. Values are relative, like percentage
  points; they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tc_bw``
  Allows users to set the bandwidth allocation per traffic class on rate
  objects. This enables fine-grained QoS configurations by assigning a relative
  share value to each traffic class. The bandwidth is distributed in proportion
  to the share value for each class, relative to the sum of all shares.
  When applied to a non-leaf node, tc_bw determines how bandwidth is shared
  among its child elements.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes have the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses consists of one or
       more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.
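
Using the terms above, the create, configure and deploy sequence from the
Subfunction section can be sketched as a dry run. Several things here are
assumptions made for illustration: the ``DEVLINK`` variable and the
``sf_lifecycle`` helper are hypothetical, the device handle, ``sfnum 88`` and
MAC address are example values, and the port index ``32768`` is assumed to be
the one the driver assigns (in practice it should be read from the output of
the add command). By default the script only prints the commands.

```shell
#!/bin/sh
# Dry run by default: print the devlink commands instead of executing them.
# DEVLINK, DEV and SF_PORT are names introduced for this sketch.
DEVLINK=${DEVLINK:-echo devlink}
DEV=pci/0000:06:00.0          # subfunction management device (example)
SF_PORT=$DEV/32768            # assumed port index assigned by the driver

sf_lifecycle() {
    # (1) Create: add a devlink port of subfunction flavour.
    $DEVLINK port add "$DEV" flavour pcisf pfnum 0 sfnum 88
    # (2) Configure: set the hardware address while the SF is inactive.
    $DEVLINK port function set "$SF_PORT" hw_addr 00:00:00:00:88:88
    # (3) Deploy: activate the port function so the SF device is created.
    $DEVLINK port function set "$SF_PORT" state active
}

sf_lifecycle
```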