.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. This is done either through a connection
harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single
card. As a result, network traffic no longer traverses the internal bus between the sockets, which
significantly reduces overhead and latency, in addition to reducing CPU utilization and increasing
network throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like PCI function,
sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMA nodes to still stay close to
the device and achieve improved performance.

mlx5 implementation
===================

Multi-PF, or Socket-direct, in mlx5 is achieved by grouping together PFs that belong to the same
NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.
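
The lifecycle rule above can be modeled with a small sketch. This is only an illustration and not
the mlx5 driver code: the ``SdGroup`` class, its method names, and the use of PCI addresses as PF
identifiers are hypothetical, and the default group size of two simply mirrors the current two-PF
limit.

.. code-block:: python

   class SdGroup:
       """Hypothetical model of a socket-direct group of PFs (not driver code)."""

       def __init__(self, expected_pfs=2):
           self.expected_pfs = expected_pfs  # assumed group size
           self.probed = set()               # PFs probed so far
           self.netdev_exists = False

       def probe(self, pf):
           self.probed.add(pf)
           # The single netdev is created only once every PF of the group is probed.
           if len(self.probed) == self.expected_pfs:
               self.netdev_exists = True

       def remove(self, pf):
           self.probed.discard(pf)
           # Removing any PF of the group tears the shared netdev down.
           self.netdev_exists = False


   group = SdGroup()
   group.probe("0000:08:00.0")
   after_first_probe = group.netdev_exists
   group.probe("0000:09:00.0")
   after_second_probe = group.netdev_exists
   group.remove("0000:09:00.0")
   after_remove = group.netdev_exists
   print(after_first_probe, after_second_probe, after_remove)

In this model the netdev exists only between the probe of the last PF and the removal of the first
one, which is the symmetry described above.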

The netdev network channels are distributed between all devices. A proper configuration would
utilize the closest NUMA node when working on a certain app/CPU.

We pick one PF to be a primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).

Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs using a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+


We prefer round-robin because it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
As the channel stats are persistent across a channel's closure, changing the mapping every single
time would make the accumulated stats less representative of the channel's history.
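
The stability argument can be checked with a short sketch. It assumes the round-robin rule is
"channel index modulo number of PFs", which is consistent with the table above; the function names
are illustrative only, and the contiguous split is a hypothetical alternative shown purely for
comparison.

.. code-block:: python

   def round_robin_pf(ch_idx, num_pfs=2):
       """Assumed round-robin rule: channel index modulo the number of PFs."""
       return ch_idx % num_pfs


   def contiguous_pf(ch_idx, num_channels, num_pfs=2):
       """Hypothetical alternative: split the channel range into contiguous blocks."""
       return ch_idx * num_pfs // num_channels


   rr_five = {ch: round_robin_pf(ch) for ch in range(5)}
   rr_eight = {ch: round_robin_pf(ch) for ch in range(8)}
   print(rr_five)  # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}

   # Growing from 5 to 8 channels leaves the first five round-robin assignments untouched.
   rr_stable = all(rr_eight[ch] == rr_five[ch] for ch in rr_five)

   # With the contiguous split, some of the first five channels move to another PF.
   moved = [ch for ch in range(5) if contiguous_pf(ch, 5) != contiguous_pf(ch, 8)]
   print(rr_stable, moved)

Under round-robin a channel's per-channel stats always describe the same PF, whereas under the
contiguous split a channel can change PF when the channel count changes.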

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".

Observability
=============

The relation between PF, irq, napi, and queue can be observed via netlink spec::

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
  [{'id': 543, 'ifindex': 13, 'irq': 42},
   {'id': 542, 'ifindex': 13, 'irq': 41},
   {'id': 541, 'ifindex': 13, 'irq': 40},
   {'id': 540, 'ifindex': 13, 'irq': 39},
   {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy::

  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
  /proc/irq/36/mlx5_comp0@pci:0000:08:00.0
  /proc/irq/39/mlx5_comp0@pci:0000:09:00.0
  /proc/irq/40/mlx5_comp1@pci:0000:08:00.0
  /proc/irq/41/mlx5_comp1@pci:0000:09:00.0
  /proc/irq/42/mlx5_comp2@pci:0000:08:00.0
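
The three outputs can be joined by hand or with a few lines of script. The sketch below is one
possible way to do it; it hard-codes the sample values shown above instead of querying a live
system, and the variable and function names are illustrative only.

.. code-block:: python

   import re

   # Sample data copied from the queue-get, napi-get and /proc/irq outputs above.
   rx_queues = [
       {"id": 0, "napi-id": 539},
       {"id": 1, "napi-id": 540},
       {"id": 2, "napi-id": 541},
       {"id": 3, "napi-id": 542},
       {"id": 4, "napi-id": 543},
   ]
   napis = [
       {"id": 543, "irq": 42},
       {"id": 542, "irq": 41},
       {"id": 541, "irq": 40},
       {"id": 540, "irq": 39},
       {"id": 539, "irq": 36},
   ]
   irq_names = {
       36: "mlx5_comp0@pci:0000:08:00.0",
       39: "mlx5_comp0@pci:0000:09:00.0",
       40: "mlx5_comp1@pci:0000:08:00.0",
       41: "mlx5_comp1@pci:0000:09:00.0",
       42: "mlx5_comp2@pci:0000:08:00.0",
   }

   napi_to_irq = {napi["id"]: napi["irq"] for napi in napis}


   def queue_to_pf(queue):
       """Follow queue -> napi -> irq, then read the PCI address from the IRQ name."""
       irq = napi_to_irq[queue["napi-id"]]
       return re.search(r"@pci:(\S+)$", irq_names[irq]).group(1)


   queue_pf = {queue["id"]: queue_to_pf(queue) for queue in rx_queues}
   for queue_id, pf in queue_pf.items():
       print(queue_id, pf)

For this sample, even queue indices resolve to 0000:08:00.0 and odd ones to 0000:09:00.0, matching
the round-robin distribution.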

Steering
========

Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
table, which is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
the PF on the same node as the CPU.

XPS default config example::

  NUMA node(s):          2
  NUMA node0 CPU(s):     0-11
  NUMA node1 CPU(s):     12-23

PF0 on node0, PF1 on node1.

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
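
The pattern in the masks above can be reproduced with a short sketch. It only mirrors this specific
example (two nodes, 12 CPUs each, 24 Tx queues) and is not how the driver computes its defaults: it
assumes that Tx queue ``i`` belongs to PF ``i % 2`` and is paired with the next unused CPU on that
PF's node. All names are illustrative.

.. code-block:: python

   # Topology from the example above: PF0 sits on node0, PF1 on node1.
   node_cpus = {0: list(range(0, 12)), 1: list(range(12, 24))}
   num_pfs = 2
   num_queues = 24


   def default_xps_mask(txq):
       """Return the xps_cpus mask of a Tx queue as a 6-digit hex string."""
       pf = txq % num_pfs                    # assumed round-robin queue-to-PF rule
       cpu = node_cpus[pf][txq // num_pfs]   # next CPU on the node local to that PF
       return f"{1 << cpu:06x}"


   masks = {txq: default_xps_mask(txq) for txq in range(num_queues)}
   for txq, mask in masks.items():
       print(f"/sys/class/net/eth2/queues/tx-{txq}/xps_cpus:{mask}")

Each mask has a single bit set, so every CPU selects exactly one SQ, and that SQ belongs to the PF
on the CPU's own node.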

Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.