.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface, either through a connection harness that
splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card. This
eliminates network traffic traversing the internal bus between the sockets, significantly reducing
overhead and latency, in addition to reducing CPU utilization and increasing network throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMA nodes to remain close to the
device and achieve improved performance.

mlx5 implementation
===================

Multi-PF or Socket-direct in mlx5 is achieved by grouping together PFs which belong to the same
NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.

The netdev network channels are distributed between all devices. A proper configuration would
utilize the closest NUMA node when working on a certain app/CPU.

We pick one PF to be a primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).
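
The netdev lifecycle described above can be illustrated with a toy model. This is only a sketch
under stated assumptions, not mlx5 driver code: the ``MultiPfGroup`` class and its method names are
hypothetical, and the PCI addresses are borrowed from the Observability example purely as labels.

```python
# Toy model of the netdev lifecycle described above; not mlx5 driver code.
# MultiPfGroup and its method names are hypothetical.

MAX_PFS = 2  # the text states support is currently limited to two PFs


class MultiPfGroup:
    def __init__(self, expected_pfs):
        expected = list(expected_pfs)
        if not 2 <= len(expected) <= MAX_PFS:
            raise ValueError(f"a group must hold between 2 and {MAX_PFS} PFs")
        self.expected = set(expected)
        self.probed = set()
        self.netdev_exists = False

    def probe(self, pf):
        if pf not in self.expected:
            raise ValueError(f"{pf} does not belong to this group")
        self.probed.add(pf)
        # The single netdev is created only once every PF has been probed.
        if self.probed == self.expected:
            self.netdev_exists = True

    def remove(self, pf):
        self.probed.discard(pf)
        # Removing any one PF destroys the shared netdev.
        self.netdev_exists = False


group = MultiPfGroup(["0000:08:00.0", "0000:09:00.0"])
group.probe("0000:08:00.0")
print(group.netdev_exists)   # False: one PF is still missing
group.probe("0000:09:00.0")
print(group.netdev_exists)   # True: all PFs probed
group.remove("0000:08:00.0")
print(group.netdev_exists)   # False: any removal tears the netdev down
```

The model deliberately leaves out how the primary PF is chosen, since the text does not specify it.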

Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+


We prefer round-robin because it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
As the channel stats are persistent across a channel's closure, changing the mapping every single
time would make the accumulated stats less representative of the channel's history.

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".
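
The stability argument above can be checked with a short sketch. It assumes the round-robin policy
reduces to ``ch_idx % num_pfs``, which matches the example table; the function names are
hypothetical, and the contiguous block split is included only as a contrasting policy, not as
something the driver implements.

```python
def round_robin_pf(ch_idx, num_pfs):
    """Assumed round-robin policy: channel index modulo the number of PFs."""
    return ch_idx % num_pfs


def block_split_pf(ch_idx, num_channels, num_pfs):
    """Hypothetical alternative: contiguous blocks of channels per PF."""
    per_pf = -(-num_channels // num_pfs)  # ceiling division
    return ch_idx // per_pf


# Round-robin: the first five channels keep their PF when going from 5 to 8 channels.
rr_5 = [round_robin_pf(ch, 2) for ch in range(5)]
rr_8 = [round_robin_pf(ch, 2) for ch in range(8)]
print(rr_5)      # [0, 1, 0, 1, 0]
print(rr_8[:5])  # [0, 1, 0, 1, 0]

# Block split: channel 3 moves from PF 1 to PF 0 when the channel count changes.
blk_5 = [block_split_pf(ch, 5, 2) for ch in range(5)]
blk_8 = [block_split_pf(ch, 8, 2) for ch in range(8)]
print(blk_5)      # [0, 0, 0, 1, 1]
print(blk_8[:5])  # [0, 0, 0, 0, 1]
```

With the block split, per-channel stats for channel 3 would mix traffic from two different PFs
after a reconfiguration, which is the effect round-robin avoids.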

Observability
=============

The relation between PF, irq, napi, and queue can be observed via netlink spec::

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
  [{'id': 543, 'ifindex': 13, 'irq': 42},
   {'id': 542, 'ifindex': 13, 'irq': 41},
   {'id': 541, 'ifindex': 13, 'irq': 40},
   {'id': 540, 'ifindex': 13, 'irq': 39},
   {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy::

  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
  /proc/irq/36/mlx5_comp0@pci:0000:08:00.0
  /proc/irq/39/mlx5_comp0@pci:0000:09:00.0
  /proc/irq/40/mlx5_comp1@pci:0000:08:00.0
  /proc/irq/41/mlx5_comp1@pci:0000:09:00.0
  /proc/irq/42/mlx5_comp2@pci:0000:08:00.0

Steering
========

Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
table, which is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the
PF on the same node as the CPU.

XPS default config example:

NUMA node(s): 2
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23

PF0 on node0, PF1 on node1.
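
For the topology above, the default ``xps_cpus`` masks can be reproduced with a small sketch. It
assumes Tx queue ``i`` belongs to PF ``i % num_pfs`` (the round-robin policy) and is assigned the
next unused CPU on that PF's node; the function name and the wrap-around when a node runs out of
CPUs are assumptions for illustration, not taken from the driver.

```python
def default_xps_masks(num_queues, node_cpus, pf_node):
    """Return one hex CPU mask per Tx queue.

    node_cpus: dict mapping NUMA node id -> list of CPU ids on that node
    pf_node:   list where pf_node[pf] is the NUMA node of that PF
    """
    num_pfs = len(pf_node)
    masks = []
    for q in range(num_queues):
        pf = q % num_pfs                        # round-robin queue -> PF
        cpus = node_cpus[pf_node[pf]]           # CPUs local to that PF
        cpu = cpus[(q // num_pfs) % len(cpus)]  # wrap-around is an assumption
        masks.append(format(1 << cpu, "06x"))
    return masks


node_cpus = {0: list(range(0, 12)), 1: list(range(12, 24))}
masks = default_xps_masks(24, node_cpus, pf_node=[0, 1])
for q in (0, 1, 2, 3, 22, 23):
    print(f"tx-{q}: {masks[q]}")
# tx-0: 000001
# tx-1: 001000
# tx-2: 000002
# tx-3: 002000
# tx-22: 000800
# tx-23: 800000
```

Even-numbered queues land on node0 CPUs and odd-numbered queues on node1 CPUs, so a thread
transmitting from a given CPU uses an SQ on the PF local to its node.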

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000

Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.