.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of a PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of a PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of a PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type based on the link layer, as described
below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - The driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - The driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch on the PCI device supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure the function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, the function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means that the user
should configure the port function attribute before activating the port
function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
RDMA device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

When RoCE capability is disabled, it saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table for
this PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device will disable
features which cannot be migrated. Thus the migratable capability can impose
limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When the user enables the IPsec crypto capability for a VF, user applications
can offload XFRM state crypto operations (Encrypt/Decrypt) to this VF.

When the IPsec crypto capability is disabled (default) for a VF, the XFRM state
is processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When the user enables the IPsec packet capability for a VF, user applications
can offload XFRM state and policy crypto operations (Encrypt/Decrypt) to this
VF, as well as IPsec encapsulation.

When the IPsec packet capability is disabled (default) for a VF, the XFRM state
and policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------
When the user sets the maximum number of IO event queues for an SF or
a VF, the driver of that function is limited to consuming at most the
enforced number of IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32
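
In all of the example outputs in this section, the line printed under
``function:`` reads as a flat list of attribute names, each followed by its
value. A script that checks these attributes could split that line into a
dictionary. The following is a minimal sketch under that assumption; the
helper name ``parse_function_attrs`` is hypothetical and not part of devlink,
and a different output layout would need a different parser.

```python
def parse_function_attrs(line: str) -> dict:
    """Parse the attribute line printed under ``function:``.

    Assumes the line is a whitespace-separated sequence of
    ``name value`` pairs, as in the example outputs of this section.
    """
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected name/value pairs, got: {line!r}")
    # Even-indexed tokens are attribute names, odd-indexed ones are values.
    return dict(zip(tokens[0::2], tokens[1::2]))


attrs = parse_function_attrs(
    "hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32"
)
print(attrs["max_io_eqs"])  # prints 32
```

All values are kept as strings; converting ``max_io_eqs`` to an integer is
left to the caller.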

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. A subfunction is created and deployed in units of one. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction.

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use
the e-switch port representor to do settings, putting it into a bridge, adding
TC rules, etc. A user might as well configure the hardware address (such as
the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
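
The create, configure and deploy steps can be sketched as a sequence of
``devlink`` invocations. The helper below only builds the argument lists and
does not run anything. The ``devlink port add ... flavour pcisf`` and
``devlink port function set ... state active`` forms are assumptions taken
from the iproute2 ``devlink-port(8)`` manual page rather than from this
document, and the function name and the example port index 32768 are purely
illustrative.

```python
def subfunction_setup_commands(dev, pfnum, sfnum, port_index, hw_addr):
    """Return the devlink command lines for create, configure and deploy.

    Nothing is executed; the commands are only built as argument lists.
    ``port_index`` is assigned when the port is created, so in practice it
    has to be read back from the output of the first command.
    """
    port = f"{dev}/{port_index}"
    return [
        # (1) create: add a devlink port of subfunction flavour
        ["devlink", "port", "add", dev, "flavour", "pcisf",
         "pfnum", str(pfnum), "sfnum", str(sfnum)],
        # (2) configure: set the hardware address while the port is inactive
        ["devlink", "port", "function", "set", port, "hw_addr", hw_addr],
        # (3) deploy: activate the port function
        ["devlink", "port", "function", "set", port, "state", "active"],
    ]


commands = subfunction_setup_commands(
    "pci/0000:06:00.0", 0, 88, 32768, "00:00:00:00:88:88"
)
for argv in commands:
    print(" ".join(argv))
```

Keeping the three steps as separate commands mirrors the sequence described
above: the hardware address is set before the subfunction device exists, and
only the last command causes the auxiliary device to be created.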

Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group of ports. This is done through rate objects, which can be one of two
types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in userspace it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leafs.
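
The naming rule above means the last component of a rate object handle is
enough to tell the two types apart. The sketch below illustrates that rule
only; ``rate_object_kind`` is a hypothetical helper, and the kernel and the
``devlink`` tool may validate names differently.

```python
def rate_object_kind(handle: str) -> str:
    """Classify ``pci/<bus_addr>/<name>`` as "leaf" or "node"."""
    # The last path component is either a port index or a node name.
    name = handle.rsplit("/", 1)[-1]
    if not name:
        raise ValueError(f"missing port index or node name in {handle!r}")
    # str.isdecimal() also accepts non-ASCII digits, so require ASCII too.
    if name.isascii() and name.isdecimal():
        return "leaf"
    return "node"


print(rate_object_kind("pci/0000:06:00.0/2"))       # prints leaf
print(rate_object_kind("pci/0000:06:00.0/group1"))  # prints node
```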

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the parent group, if it is a part of the
  same group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme
  among siblings. This arbitration scheme can be used simultaneously with
  strict priority. As a node is configured with a higher rate it gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points; they indicate how much BW a node should take relative
  to its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.
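
Because a parent's ``tx_max`` is an upper limit for its children, the ceiling
that applies to a rate object can be thought of as the smallest ``tx_max``
found on the path from the object up to the root. The following sketch models
only that idea. The ``RateObject`` class, the use of ``None`` for "no limit
configured" and the unit-less numbers are assumptions made for illustration;
how a device actually enforces the hierarchy is driver specific.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateObject:
    name: str
    tx_max: Optional[int] = None          # None: no limit configured (assumption)
    parent: Optional["RateObject"] = None


def effective_tx_max(obj: RateObject) -> Optional[int]:
    """Return the smallest tx_max on the path from obj to the root."""
    limit = None
    node = obj
    while node is not None:
        if node.tx_max is not None:
            limit = node.tx_max if limit is None else min(limit, node.tx_max)
        node = node.parent
    return limit


group = RateObject("group1", tx_max=100)
fast_leaf = RateObject("1", tx_max=250, parent=group)
slow_leaf = RateObject("2", tx_max=40, parent=group)
print(effective_tx_max(fast_leaf))  # prints 100
print(effective_tx_max(slow_leaf))  # prints 40
```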

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose the node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.
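
The arbitration flow above can be modelled in a few lines. This is a
simplified sketch, not a description of any device scheduler. It assumes that
a larger ``tx_priority`` value means a higher priority, that ``tx_weight`` is
greater than zero, and that a single ``eligible`` flag stands for "within the
BW limit and not blocked". The WFQ step is approximated by picking the node
with the least service received per unit of weight; the ``Node`` class and
its ``served`` counter are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    tx_priority: int = 0      # assumption: larger value = higher priority
    tx_weight: int = 1        # assumption: always greater than zero
    eligible: bool = True     # within its BW limit and not blocked
    served: int = 0           # amount sent so far, used by the WFQ step
    children: List["Node"] = field(default_factory=list)


def pick(siblings: List[Node]) -> Optional[Node]:
    """Steps 1, 2 and 4: choose one node among siblings."""
    candidates = [n for n in siblings if n.eligible]
    if not candidates:
        return None
    # Strict priority: only the highest-priority eligible nodes compete.
    top = max(n.tx_priority for n in candidates)
    subgroup = [n for n in candidates if n.tx_priority == top]
    # Simplified WFQ: least service received relative to weight wins.
    return min(subgroup, key=lambda n: n.served / n.tx_weight)


def arbitrate(siblings: List[Node]) -> Optional[Node]:
    """Step 3: descend from the winner until a leaf is reached."""
    winner = pick(siblings)
    while winner is not None and winner.children:
        winner = pick(winner.children)
    return winner


vf1 = Node("vf1", tx_priority=1, tx_weight=1, served=100)
vf2 = Node("vf2", tx_priority=1, tx_weight=3, served=150)
vf3 = Node("vf3", tx_priority=0)
group = Node("group1", children=[vf1, vf2, vf3])
print(arbitrate([group]).name)  # prints vf2
```

Here ``vf1`` and ``vf2`` share the highest priority and form the WFQ
subgroup; ``vf2`` wins because 150/3 is smaller than 100/1. ``vf3`` is only
considered once both higher-priority nodes stop being eligible.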

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses; it consists of one
       or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.
477