.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type depending on its link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
63potentially multiple physical, virtual functions and subfunctions. A function
64consists of one or more ports. This port is represented by the devlink eswitch
65port.
66
67A PCI device connected to multiple CPUs or multiple PCI root complexes or a
68SmartNIC, however, may have multiple controllers. For a device with multiple
69controllers, each controller is distinguished by a unique controller number.
70An eswitch is on the PCI device which supports ports of multiple controllers.
71
72An example view of a system with two controllers::
73
74                 ---------------------------------------------------------
75                 |                                                       |
76                 |           --------- ---------         ------- ------- |
77    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
78    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
79    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
80    | connect |  | -------                       -------                 |
81    -----------  |     | controller_num=1 (no eswitch)                   |
82                 ------|--------------------------------------------------
83                 (internal wire)
84                       |
85                 ---------------------------------------------------------
86                 | devlink eswitch ports and reps                        |
87                 | ----------------------------------------------------- |
88                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
89                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
90                 | ----------------------------------------------------- |
91                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
92                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
93                 | ----------------------------------------------------- |
94                 |                                                       |
95                 |                                                       |
96    -----------  |           --------- ---------         ------- ------- |
97    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
98    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
99    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
100    -----------  | -------                       -------                 |
101                 |                                                       |
102                 |  local controller_num=0 (eswitch)                     |
103                 ---------------------------------------------------------
104
105In the above example, the external controller (identified by controller number = 1)
106doesn't have the eswitch. Local controller (identified by controller number = 0)
107has the eswitch. The Devlink instance on the local controller has eswitch
108devlink ports for both the controllers.
109
110Function configuration
111======================
112
113Users can configure one or more function attributes before enumerating the PCI
114function. Usually it means, user should configure function attribute
115before a bus specific device for the function is created. However, when
116SRIOV is enabled, virtual function devices are created on the PCI bus.
117Hence, function attribute should be configured before binding virtual
118function device to the driver. For subfunctions, this means user should
119configure port function attribute before activating the port function.
120
121A user may set the hardware address of the function using
122`devlink port function set hw_addr` command. For Ethernet port function
123this means a MAC address.
124
125Users may also set the RoCE capability of the function using
126`devlink port function set roce` command.
127
128Users may also set the function as migratable using
129`devlink port function set migratable` command.
130
131Users may also set the IPsec crypto capability of the function using
132`devlink port function set ipsec_crypto` command.
133
134Users may also set the IPsec packet capability of the function using
135`devlink port function set ipsec_packet` command.
136
137Function attributes
138===================
139
140MAC address setup
141-----------------
142The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
143device created for the PCI VF/SF.
144
145- Get the MAC address of the VF identified by its unique devlink port index::
146
147    $ devlink port show pci/0000:06:00.0/2
148    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
149      function:
150        hw_addr 00:00:00:00:00:00
151
152- Set the MAC address of the VF identified by its unique devlink port index::
153
154    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
155
156    $ devlink port show pci/0000:06:00.0/2
157    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
158      function:
159        hw_addr 00:11:22:33:44:55
160
161- Get the MAC address of the SF identified by its unique devlink port index::
162
163    $ devlink port show pci/0000:06:00.0/32768
164    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
165      function:
166        hw_addr 00:00:00:00:00:00
167
168- Set the MAC address of the SF identified by its unique devlink port index::
169
170    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
171
172    $ devlink port show pci/0000:06:00.0/32768
173    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
174      function:
175        hw_addr 00:00:00:00:88:88
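As a minimal sketch, and not something devlink itself provides, a script could
read the ``function:`` attributes from the plain-text ``devlink port show``
layout used in the examples above. The helper name ``parse_port_show`` is
hypothetical, and the parser assumes that the output consists of alternating
key/value tokens exactly as printed here. For real tooling, the JSON output
selected with the ``-j`` option of ``devlink`` is likely a more robust source.

.. code-block:: python

    def parse_port_show(text):
        """Parse one port entry printed in the layout shown above."""
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        # First line: "<handle>: key value key value ..."
        handle, _, rest = lines[0].partition(": ")
        tokens = rest.split()
        port = dict(zip(tokens[0::2], tokens[1::2]))
        port["handle"] = handle
        port["function"] = {}
        in_function = False
        for line in lines[1:]:
            if line == "function:":
                in_function = True
            elif in_function:
                # Function attributes are also alternating key/value tokens.
                tokens = line.split()
                port["function"].update(zip(tokens[0::2], tokens[1::2]))
        return port


    sample = (
        "pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1\n"
        "  function:\n"
        "    hw_addr 00:11:22:33:44:55\n"
    )
    port = parse_port_show(sample)
    print(port["flavour"], port["function"]["hw_addr"])  # prints: pcivf 00:11:22:33:44:55

The same helper also picks up capability attributes such as ``roce`` when they
are printed on the ``hw_addr`` line.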

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table
for this PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables migratable capability for a VF, and the hypervisor (HV)
binds the VF to a VFIO driver with migration support, the user can migrate the
VM with this VF from one HV to a different one.

However, when migratable capability is enabled, the device disables features
which cannot be migrated. The migratable capability can thus impose
limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach the VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When the user enables IPsec crypto capability for a VF, user applications can
offload XFRM state crypto operations (encrypt/decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When the user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (encrypt/decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. Subfunctions are created and deployed individually
(in units of one). Unlike SRIOV VFs, a subfunction doesn't require its own
PCI virtual function. A subfunction communicates with the hardware through
the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs the setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and the
representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user may use the
e-switch port representor to apply settings, such as putting it into a bridge,
adding TC rules, etc. A user may also configure the hardware address (such as
the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
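The create, configure and deploy sequence can be pictured with a small toy
model. This is only an illustration of the ordering described above, not
driver code. The class ``SubfunctionPort`` and its methods are hypothetical,
and rejecting configuration after activation is an assumption of this sketch
rather than documented kernel behaviour.

.. code-block:: python

    class SubfunctionPort:
        """Toy model of the create -> configure -> deploy sequence."""

        def __init__(self, pfnum, sfnum):
            # (1) Create: the devlink port exists, the function is inactive
            # and no subfunction device has been created yet.
            self.pfnum = pfnum
            self.sfnum = sfnum
            self.hw_addr = "00:00:00:00:00:00"
            self.state = "inactive"
            self.device_created = False

        def set_hw_addr(self, hw_addr):
            # (2) Configure: attributes are set while the function is inactive.
            # Refusing later changes is an assumption made by this model.
            if self.state != "inactive":
                raise RuntimeError("configure the function before activating it")
            self.hw_addr = hw_addr

        def activate(self):
            # (3) Deploy: activation instantiates the subfunction device.
            self.state = "active"
            self.device_created = True


    sf = SubfunctionPort(pfnum=0, sfnum=88)
    sf.set_hw_addr("00:00:00:00:88:88")
    sf.activate()
    print(sf.state, sf.hw_addr)  # prints: active 00:00:00:00:88:88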

Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group. This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in userspace it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leaves and/or nodes); created/deleted
  by request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leaves.
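The naming rule for the two rate object types can be illustrated with a small
helper. ``classify_rate_handle`` is a hypothetical name, ``group1`` is an
example node name, and treating "decimal number" as "ASCII digits only" is an
assumption of this sketch.

.. code-block:: python

    def classify_rate_handle(handle):
        """Return ("leaf", port_index) or ("node", node_name)."""
        # Handles look like pci/<bus_addr>/<port_index or node_name>.
        bus, bus_addr, name = handle.split("/", 2)
        if not name:
            raise ValueError("missing port index or node name")
        # A purely numeric last component refers to a leaf (devlink port).
        if name.isascii() and name.isdigit():
            return ("leaf", int(name))
        return ("node", name)


    print(classify_rate_handle("pci/0000:06:00.0/2"))       # prints: ('leaf', 2)
    print(classify_rate_handle("pci/0000:06:00.0/group1"))  # prints: ('node', 'group1')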

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the parent group, if the object is part of
  a group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows the use of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows the use of a Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with
  strict priority. A node configured with a higher value gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.
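One way to read the ``parent`` rule for ``tx_max`` is that a rate object can
never exceed the smallest ``tx_max`` found on itself or on any of its ancestor
nodes. The following sketch models only that reading. The function name
``effective_tx_max``, the dictionary layout and the node name ``group1`` are
hypothetical, values are plain numbers in an arbitrary unit, and how a device
enforces the limit across all children together is not modelled.

.. code-block:: python

    def effective_tx_max(name, objects):
        """Smallest tx_max set on the object itself or on any ancestor."""
        limit = None  # None means "no limit configured"
        while name is not None:
            obj = objects[name]
            tx_max = obj.get("tx_max")
            if tx_max is not None and (limit is None or tx_max < limit):
                limit = tx_max
            # Walk up to the parent node, if there is one.
            name = obj.get("parent")
        return limit


    objects = {
        "group1": {"tx_max": 100},
        "1": {"tx_max": 500, "parent": "group1"},
        "2": {"parent": "group1"},
    }
    print(effective_tx_max("1", objects))  # prints: 100
    print(effective_tx_max("2", objects))  # prints: 100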

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority subgroup are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.
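The arbitration flow above can be sketched as a toy model. It only illustrates
strict priority combined with a weighted choice and does not describe how any
driver or device implements scheduling. ``Node``, ``pick_winner`` and
``arbitrate`` are hypothetical names. The model assumes that a larger
``tx_priority`` value means a higher priority, represents "within the BW limit
and not blocked" as a boolean ``eligible`` flag, and approximates WFQ by
counting wins instead of transmitted bytes.

.. code-block:: python

    class Node:
        def __init__(self, name, tx_priority=0, tx_weight=1, children=()):
            self.name = name
            self.tx_priority = tx_priority
            self.tx_weight = tx_weight
            self.children = list(children)
            self.eligible = True  # within its BW limit and not blocked
            self.served = 0       # number of arbitration rounds won


    def pick_winner(siblings):
        candidates = [n for n in siblings if n.eligible]
        if not candidates:
            return None
        # Steps 1 and 4: strict priority among the eligible nodes only, so
        # lower priorities are reached once higher ones stop being eligible.
        top = max(n.tx_priority for n in candidates)
        group = [n for n in candidates if n.tx_priority == top]
        # Step 2: weighted choice inside the equal-priority subgroup; the node
        # that has been served least relative to its weight wins.
        return min(group, key=lambda n: n.served / n.tx_weight)


    def arbitrate(siblings):
        # Step 3: descend from each winner to its children until a leaf wins.
        path = []
        while siblings:
            winner = pick_winner(siblings)
            if winner is None:
                return None
            path.append(winner)
            siblings = winner.children
        for node in path:
            node.served += 1
        return path[-1] if path else None

With siblings ``a`` (priority 1), ``b`` (priority 0, weight 3) and ``c``
(priority 0, weight 1), ``a`` wins every round while it is eligible. Once ``a``
is marked ineligible, four further rounds go to ``b`` three times and to ``c``
once.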

Driver implementations are allowed to support either or both rate object types
and the methods for setting their parameters. Additionally, a driver
implementation may export nodes/leaves and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device that has one or more PCI buses and consists of
       one or more PCI controllers.
   * - ``PCI controller``
     -  A controller consists of potentially multiple physical functions,
        virtual functions and subfunctions.
   * - ``Port function``
     -  An object to manage the function of a port.
   * - ``Subfunction``
     -  A lightweight function that has a parent PCI function on which it is
        deployed.
   * - ``Subfunction device``
     -  A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     -  A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     -  A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     -  A device driver for a PCI physical function that supports
        subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     -  A device driver for a PCI physical function that hosts subfunction
        devices. In most cases it is the same as the subfunction management
        driver. When a subfunction is used on an external controller, the
        subfunction management and host drivers are different.