.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It has a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour along with port attributes
describe what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

Devlink port can have a different type based on the link layer described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when a link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when a link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when driver should detect the port
       type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch resides on the PCI device and supports ports of multiple
controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller number = 1)
doesn't have the eswitch. The local controller (identified by controller number = 0)
has the eswitch. The devlink instance on the local controller has eswitch
devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure a function attribute
before a bus-specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, a function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means that the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
rdma device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table for
this VF/SF will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can migrate
the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device will disable
features which cannot be migrated. Thus the migratable capability can impose
limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach the VF to the VM.
- Start the VM.
- Perform live migration.

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. Subfunctions are created and deployed in units of one. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use the
e-switch port representor to do settings, putting it into a bridge, adding TC
rules, etc. A user might as well configure the hardware address (such as the
MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

Rate object management
======================

Devlink provides an API to manage the tx rates of a single devlink port or a
group. This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in userspace it is referred to as
  ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.
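
As a rough illustration of the naming rule above, the following sketch tells a
``leaf`` handle from a ``node`` handle by checking whether the last component
is a decimal number. The helper ``classify_rate_object`` is hypothetical and is
not part of devlink or iproute2; it only assumes the three-component
``pci/<bus_addr>/<name>`` form described above.

.. code-block:: python

    def classify_rate_object(handle: str) -> str:
        """Classify a rate object handle as "leaf" or "node".

        Hypothetical helper: assumes handles of the form
        <bus>/<bus_addr>/<port_index or node_name>.
        """
        parts = handle.split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"unexpected rate object handle: {handle!r}")
        name = parts[2]
        # Leafs are addressed by a decimal port index; node names must not
        # be decimal numbers, so the two kinds of handle cannot collide.
        if name.isascii() and name.isdigit():
            return "leaf"
        return "node"


    print(classify_rate_object("pci/0000:06:00.0/1"))       # leaf
    print(classify_rate_object("pci/0000:06:00.0/group1"))  # node

Because the check looks only at the last component, a name such as ``1st_grp``
is still classified as a node: it contains digits but is not a decimal number.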

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the rate
  objects that are part of the parent group, if the object belongs to a group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with
  strict priority. As a node is configured with a higher rate it gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

Driver implementations are allowed to support either or both rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses. It consists of one
       or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.