.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type based on its link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller potentially
consists of multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch resides on the PCI device and supports ports of multiple
controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) does not have an eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

Function configuration
======================

A user can configure function attributes before the PCI function is
enumerated. Usually this means that the user should configure a function
attribute before a bus-specific device for the function is created. However,
when SR-IOV is enabled, virtual function devices are created on the PCI bus.
Hence, a function attribute should be configured before the virtual function
device is bound to its driver. For subfunctions, this means that the user
should configure the port function attribute before activating the port
function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Subfunction
============

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. A subfunction is created and deployed in units of one.
Unlike SR-IOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

(1) create - create a subfunction;
(2) configure - configure subfunction attributes;
(3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs the setup on the subfunction management device.

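
As a rough illustration of the three steps, the sketch below assembles the
corresponding ``devlink`` command lines. The helper name
``sf_setup_commands`` and all example values (device address, ``pfnum``,
``sfnum``, port index, MAC address) are assumptions made for this example. In
particular, the port index is assigned when the port is created and has to be
taken from the output of the create step. The sketch only builds the
commands; it does not run them.

.. code-block:: python

   import shlex


   def sf_setup_commands(dev, pfnum, sfnum, port_index, hw_addr):
       """Return the argv lists for creating, configuring and deploying one SF.

       All arguments are illustrative; port_index is normally learned from
       the output of the "devlink port add" step.
       """
       port = f"{dev}/{port_index}"
       return [
           # (1) create: add a devlink port of subfunction flavour
           ["devlink", "port", "add", dev, "flavour", "pcisf",
            "pfnum", str(pfnum), "sfnum", str(sfnum)],
           # (2) configure: set the hardware (MAC) address while inactive
           ["devlink", "port", "function", "set", port, "hw_addr", hw_addr],
           # (3) deploy: activate the port function
           ["devlink", "port", "function", "set", port, "state", "active"],
       ]


   if __name__ == "__main__":
       # Hypothetical values, chosen only to make the example concrete.
       commands = sf_setup_commands(
           "pci/0000:06:00.0", 0, 88, 32768, "00:00:00:00:88:88")
       for argv in commands:
           print(shlex.join(argv))

Keeping the commands as argument lists means they can later be passed to
``subprocess.run`` unchanged, without shell quoting concerns.
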

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates
the subfunction port and any associated objects such as health reporters and
a representor netdevice.

(2) Configure
-------------
At this point the subfunction devlink port is created but it is not active
yet. That means the entities are created on the devlink side and the e-switch
port representor is created, but the subfunction device itself is not
created. A user might use the e-switch port representor to apply settings,
such as putting it into a bridge or adding TC rules. A user can also
configure the hardware address (such as the MAC address) of the subfunction
while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group of ports. This is done through rate objects, which can be one of two
types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in userspace it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from the userspace; initially empty (no rate objects added).
  In userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the same parent group, if it belongs to one.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme
  among siblings. This arbitration scheme can be used simultaneously with
  strict priority. A node configured with a higher value gets more bandwidth
  relative to its siblings. Values are relative, like percentage points:
  they tell how much bandwidth a node should take relative to its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or a group of nodes, with the highest priority that stays
   within the bandwidth limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.
#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.
#. If all the nodes from the highest priority subgroup are satisfied, or
   have overused their assigned bandwidth, move to the lower priority nodes.

Driver implementations are allowed to support both or either of the rate
object types and the methods for setting their parameters. Additionally, a
driver implementation may export nodes/leafs and their child-parent
relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device with one or more PCI buses. It consists of one
       or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.