.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - Any virtual port facing the user.

A devlink port can have a different type based on its link layer, as described
below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - The driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - The driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes, or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch on the PCI device supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller number = 1)
doesn't have the eswitch. The local controller (identified by controller number = 0)
has the eswitch. The devlink instance on the local controller has eswitch
devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means the user should configure the function attribute
before a bus-specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, the function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command. The ``migratable``
attribute may be set only on ports with ``DEVLINK_PORT_FLAVOUR_PCI_VF``.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.
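
The attribute rules above can be captured in a small command builder. The
following is a minimal sketch, not part of devlink itself: the helper name
``build_function_set`` and its ``flavour`` argument are hypothetical, and the
helper only assembles an argument list from the attribute names listed in this
document without running anything.

.. code-block:: python

    import shlex

    # Attribute names are the ones listed in this document; the helper
    # itself is hypothetical and only assembles a command line.
    BOOLEAN_ATTRS = {"roce", "migratable", "ipsec_crypto", "ipsec_packet"}
    VALUE_ATTRS = {"hw_addr", "max_io_eqs"}

    def build_function_set(port, flavour, **attrs):
        """Return argv for ``devlink port function set`` with basic checks."""
        if not attrs:
            raise ValueError("at least one function attribute is required")
        argv = ["devlink", "port", "function", "set", port]
        for name, value in attrs.items():
            if name == "migratable" and flavour != "pcivf":
                # Mirrors the rule that migratable applies to PCI VF ports only.
                raise ValueError("migratable may be set only on PCI VF ports")
            if name in BOOLEAN_ATTRS:
                value = "enable" if value else "disable"
            elif name not in VALUE_ATTRS:
                raise ValueError(f"unknown function attribute: {name}")
            argv += [name, str(value)]
        return argv

    cmd = build_function_set("pci/0000:06:00.0/2", "pcivf",
                             hw_addr="00:11:22:33:44:55", roce=False)
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))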

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF is used by the netdevice and RDMA
device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88
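
When scripting around the text output shown above, the function attributes can
be picked out with a few lines of parsing. This is a rough sketch that assumes
the output consists of whitespace-separated key/value pairs exactly as in the
examples in this document; ``parse_port_show`` is a hypothetical helper and
other devlink versions may print additional fields that break this assumption.

.. code-block:: python

    SAMPLE = """\
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55
    """.replace("\n    ", "\n").lstrip()

    def parse_port_show(text):
        """Map each port handle to its attributes, function attributes included."""
        ports = {}
        current = None
        for line in text.splitlines():
            if not line.strip():
                continue
            if not line[0].isspace():
                # Port line: "<handle>: key value key value ..."
                handle, _, rest = line.partition(": ")
                tokens = rest.split()
                current = dict(zip(tokens[0::2], tokens[1::2]))
                ports[handle] = current
            elif line.strip() != "function:" and current is not None:
                # Indented line below "function:" holds the function attributes.
                tokens = line.split()
                current.update(zip(tokens[0::2], tokens[1::2]))
        return ports

    ports = parse_port_show(SAMPLE)
    print(ports["pci/0000:06:00.0/2"]["hw_addr"])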

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When a user disables RoCE capability for a VF/SF, user applications cannot send
or receive any RoCE packets through this VF/SF, and the RoCE GID table for this
PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When a user enables the migratable capability for a VF, and the hypervisor (HV)
binds the VF to a VFIO driver with migration support, the user can migrate the
VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device disables
features which cannot be migrated. The migratable capability can thus impose
limitations on a VF, so the choice is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach the VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When a user enables the IPsec crypto capability for a VF, a user application
can offload XFRM state crypto operations (encrypt/decrypt) to this VF.

When the IPsec crypto capability is disabled (default) for a VF, the XFRM state
is processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When a user enables the IPsec packet capability for a VF, a user application
can offload XFRM state and policy crypto operations (encrypt/decrypt) to this
VF, as well as IPsec encapsulation.

When the IPsec packet capability is disabled (default) for a VF, the XFRM state
and policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------
When a user sets the maximum number of IO event queues for an SF or
a VF, the driver of that function may consume at most the enforced
number of IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. A subfunction is created and deployed in units of one. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

Using a subfunction involves a three-step setup sequence:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use the
e-switch port representor to do settings, such as putting it into a bridge or
adding TC rules. A user might as well configure the hardware address (such as
the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
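
The create, configure and deploy sequence can be pictured as a small state
machine. The sketch below is only a model of the ordering described above:
``SubfunctionPort``, ``SfState`` and their methods are hypothetical names, and
the real work is done by the devlink core and the subfunction management
driver.

.. code-block:: python

    from enum import Enum

    class SfState(Enum):
        INACTIVE = "inactive"  # devlink port exists, no subfunction device yet
        ACTIVE = "active"      # subfunction device has been instantiated

    class SubfunctionPort:
        """Hypothetical model of a subfunction devlink port."""

        def __init__(self, pfnum, sfnum):
            # Step 1 (create): the port exists but is not active.
            self.pfnum = pfnum
            self.sfnum = sfnum
            self.hw_addr = "00:00:00:00:00:00"
            self.state = SfState.INACTIVE

        def configure(self, hw_addr):
            # Step 2 (configure): attributes are set while the port is inactive.
            if self.state is not SfState.INACTIVE:
                raise RuntimeError("configure the function before activating it")
            self.hw_addr = hw_addr

        def deploy(self):
            # Step 3 (deploy): activation instantiates the subfunction device.
            self.state = SfState.ACTIVE

    sf = SubfunctionPort(pfnum=0, sfnum=88)
    sf.configure("00:00:00:00:88:88")
    sf.deploy()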

Rate object management
======================

Devlink provides an API to manage TX rates of a single devlink port or a group.
This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in user space it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from user space; initially empty (no rate objects added). In
  user space it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leafs.
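
The naming rule is what lets a single handle syntax cover both object types. As
a small illustration, the hypothetical helper below classifies a handle purely
by whether its last component is a decimal number; it assumes the ``pci`` bus
prefix used in the examples of this document.

.. code-block:: python

    def rate_object_kind(handle):
        """Classify 'pci/<bus_addr>/<name>' as a 'leaf' or a 'node'."""
        parts = handle.split("/", 2)
        if len(parts) != 3 or not parts[2]:
            raise ValueError(f"malformed rate object handle: {handle!r}")
        name = parts[2]
        # A decimal last component is a port index, hence a leaf.
        is_port_index = name.isascii() and name.isdigit()
        return "leaf" if is_port_index else "node"

    print(rate_object_kind("pci/0000:03:00.0/1"))
    print(rate_object_kind("pci/0000:03:00.0/group1"))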

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects or, if the
  object is part of a group, among the rate objects that are part of the
  same parent group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with
  strict priority. As a node is configured with a higher weight it gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tc_bw``
  Allows users to set the bandwidth allocation per traffic class on rate
  objects. This enables fine-grained QoS configurations by assigning a relative
  share value to each traffic class. The bandwidth is distributed in proportion
  to the share value for each class, relative to the sum of all shares.
  When applied to a non-leaf node, ``tc_bw`` determines how bandwidth is shared
  among its child elements.
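
The proportional rule for ``tc_bw`` is simple arithmetic: each traffic class
receives its share divided by the sum of all shares. The sketch below
illustrates only that calculation; the function name and the eight example
share values are assumptions chosen for illustration, not values taken from a
device.

.. code-block:: python

    def tc_bw_fractions(shares):
        """Return the fraction of bandwidth each traffic class receives."""
        total = sum(shares)
        if total <= 0 or any(s < 0 for s in shares):
            raise ValueError("shares must be non-negative with a positive sum")
        return [s / total for s in shares]

    # Example shares for eight traffic classes (illustrative values only).
    fractions = tc_bw_fractions([20, 0, 0, 0, 0, 80, 0, 0])
    print(fractions)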

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.
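
The flow above can be approximated in a few lines. This is a simplified model
under stated assumptions, not how any device implements arbitration: a larger
``priority`` value is assumed to mean higher priority, the ``eligible`` flag
stands in for "within the BW limit and not blocked", and WFQ is replaced by a
deterministic choice of the sibling with the least service relative to its
weight. ``Node`` and ``pick_leaf`` are hypothetical names.

.. code-block:: python

    class Node:
        def __init__(self, name, priority=0, weight=1, children=()):
            self.name = name
            self.priority = priority
            self.weight = weight
            self.children = list(children)
            self.eligible = True  # within BW limit and not blocked
            self.served = 0       # scheduling rounds won so far

    def pick_leaf(siblings):
        """Return the winning leaf among siblings, or None if none is eligible."""
        candidates = [n for n in siblings if n.eligible]
        if not candidates:
            return None
        # Steps 1 and 4: highest priority among the eligible nodes wins.
        top = max(n.priority for n in candidates)
        group = [n for n in candidates if n.priority == top]
        # Step 2: within equal priority, weights decide (WFQ stand-in).
        winner = min(group, key=lambda n: n.served / n.weight)
        # Step 3: continue among the winner's children until a leaf is reached.
        leaf = pick_leaf(winner.children) if winner.children else winner
        if leaf is not None:
            winner.served += 1
        return leaf

    a = Node("a", priority=1, weight=1)
    b = Node("b", priority=1, weight=3)
    c = Node("c", priority=0, weight=1)
    order = [pick_leaf([a, b, c]).name for _ in range(4)]
    print(order)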

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses; it consists of one
       or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.