.. SPDX-License-Identifier: GPL-2.0
.. _representors:

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs. For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support. This led to a desire to offload
software-defined networks (such as OpenVSwitch) to these NICs to specify the
network connectivity of each function. The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices. Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors. The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents. So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc. For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch. Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice. (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as OpenVSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not. E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF. Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs. While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch. The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function. For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice. For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
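
A helper that a udev rule could call to build such a name might be sketched as
follows. This is only an illustration: ``rep_alias`` is a hypothetical
function, and the rule it applies (drop the leading physical-port component
such as ``p0``, then append ``rep``) is an assumption inferred from the single
example above, not an established convention::

 # Hypothetical helper, not part of any existing tool.
 # $1: netdev name of the switchdev function (e.g. eth4)
 # $2: phys_port_name of the representor (e.g. p0pf1vf2)
 rep_alias() {
     parent=$1
     port_name=$2
     # Assumed rule: strip a leading "p<digits>" only when it is followed
     # by a function component ("pf..."); otherwise keep the name as-is.
     suffix=$(printf '%s\n' "$port_name" | sed 's/^p[0-9][0-9]*\(pf\)/\1/')
     printf '%s%srep\n' "$parent" "$suffix"
 }

 rep_alias eth4 p0pf1vf2

The final call prints ``eth4pf1vf2rep``. On a running system the second
argument would typically be read from the representor's ``phys_port_name``
sysfs node rather than passed by hand.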

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice. Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

 tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
    action mirred egress redirect dev $PORT_DEV
 tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
    action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
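
As a sketch of such packet-modifying rules, the following pair tags the VF's
IPv4 traffic on its way out of the physical port and strips the tag in the
reverse direction. The VLAN ID 100 is an arbitrary illustrative value, and
whether these actions can actually be offloaded depends on the NIC and its
driver::

 # VF -> wire: push VLAN 100, then send out the physical port.
 tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
    action vlan push id 100 \
    action mirred egress redirect dev $PORT_DEV
 # wire -> VF: match VLAN 100, pop the tag, then deliver to the VF.
 tc filter add dev $PORT_DEV parent ffff: protocol 802.1q flower vlan_id 100 \
    action vlan pop \
    action mirred egress redirect dev $REP_DEV

As with the earlier example, these commands assume an ingress qdisc is already
attached to both netdevices.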

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor). TC rules such as::

 tc filter add dev $REP_DEV parent ffff: flower \
    action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                          dst_port 4789 \
    action mirred egress redirect dev vxlan0
 tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
    enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
    action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``,
decapsulated and forwarded to the VF.

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor. Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
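
The settings described in this section map onto ordinary commands. In the
sketch below ``REP_DEV`` stands for a VF representor, and the MTU value, PCI
address, devlink port index and MAC address are all placeholders to be replaced
with values from the system at hand::

 # Link state and MTU are set on the representor and seen by the representee.
 ip link set dev $REP_DEV up
 ip link set dev $REP_DEV mtu 9000
 # The MAC address is set through the devlink port function instead.
 devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55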