.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. This is done either through a connection
harness that splits the PCIe lanes between two cards, or by bifurcating a PCIe slot for a single card.
This eliminates network traffic traversing the internal bus between the sockets,
significantly reducing overhead and latency, in addition to reducing CPU utilization and increasing
network throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMA nodes to still feel a sense of
proximity to the device and achieve improved performance.

mlx5 implementation
===================

Multi-PF or Socket-direct in mlx5 is achieved by grouping together PFs which belong to the same
NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.

The netdev network channels are distributed between all devices. A proper configuration would utilize
the correct close NUMA node when working on a certain app/CPU.

We pick one PF to be a primary (leader), and it fills a special role.
The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).

Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+


We prefer round-robin because it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
As the channel stats are persistent across channel closure, changing the mapping every time
would make the accumulated stats less representative of the channel's history.

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".
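
The stability argument above can be illustrated with a minimal sketch. It is not the mlx5 driver
code: the modulo rule is an assumption that is merely consistent with the example table, and the
names ``channel_to_pf``, ``distribute_channels`` and ``block_split`` are hypothetical. The
contiguous block split is shown only as a contrasting policy for comparison.

```python
def channel_to_pf(ch_idx: int, num_pfs: int) -> int:
    """Round-robin (assumed rule): the PF depends only on the channel index."""
    return ch_idx % num_pfs


def distribute_channels(num_channels: int, num_pfs: int) -> dict[int, int]:
    """Map every channel index to a PF index using round-robin."""
    return {ch: channel_to_pf(ch, num_pfs) for ch in range(num_channels)}


def block_split(ch_idx: int, num_channels: int, num_pfs: int) -> int:
    """Hypothetical alternative: contiguous blocks of channels per PF.

    The result depends on num_channels, so reconfiguring the channel
    count can move an existing channel to a different PF.
    """
    return ch_idx * num_pfs // num_channels


five = distribute_channels(5, 2)
eight = distribute_channels(8, 2)
print(five)  # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}

# Round-robin: growing from 5 to 8 channels keeps channels 0..4 on the same PF.
assert all(eight[ch] == pf for ch, pf in five.items())

# Block split: channel 3 moves from PF 1 to PF 0 when going from 5 to 8 channels.
print(block_split(3, 5, 2), block_split(3, 8, 2))  # 1 0
```

With round-robin, per-channel stats keep describing traffic handled by the same PF across
reconfigurations, whereas the block split would mix the history of two PFs into one channel's
counters.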

Observability
=============
The relation between PF, irq, napi, and queue can be observed via netlink spec::

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
  [{'id': 543, 'ifindex': 13, 'irq': 42},
   {'id': 542, 'ifindex': 13, 'irq': 41},
   {'id': 541, 'ifindex': 13, 'irq': 40},
   {'id': 540, 'ifindex': 13, 'irq': 39},
   {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy::

  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
  /proc/irq/36/mlx5_comp1@pci:0000:08:00.0
  /proc/irq/39/mlx5_comp1@pci:0000:09:00.0
  /proc/irq/40/mlx5_comp2@pci:0000:08:00.0
  /proc/irq/41/mlx5_comp2@pci:0000:09:00.0
  /proc/irq/42/mlx5_comp3@pci:0000:08:00.0

Steering
========
Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
table, which is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to the
PF on the same node as the CPU.

XPS default config example::

  NUMA node(s):          2
  NUMA node0 CPU(s):     0-11
  NUMA node1 CPU(s):     12-23

PF0 on node0, PF1 on node1.

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000

Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.