.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

====================================
Marvell OcteonTx2 RVU Kernel Drivers
====================================

Copyright (c) 2020 Marvell International Ltd.

Contents
========

- `Overview`_
- `Drivers`_
- `Basic packet flow`_
- `Devlink health reporters`_
- `Quality of service`_
- `RVU representors`_

Overview
========

Resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW
resources from the network, crypto and other functional blocks into
PCI-compatible physical and virtual functions. Each functional block
again has multiple local functions (LFs) for provisioning to PCI devices.
RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
functions (VFs). PF0 is called the administrative / admin function (AF)
and has privileges to provision RVU functional block's LFs to each of the
PF/VF.

RVU managed networking functional blocks
 - Network pool or buffer allocator (NPA)
 - Network interface controller (NIX)
 - Network parser CAM (NPC)
 - Schedule/Synchronize/Order unit (SSO)
 - Loopback interface (LBK)

RVU managed non-networking functional blocks
 - Crypto accelerator (CPT)
 - Scheduled timers unit (TIM)
 - Schedule/Synchronize/Order unit (SSO)
   Used for both networking and non-networking use cases

Resource provisioning examples
 - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device.
 - A PF/VF with CPT-LF resource works as a pure crypto offload device.

RVU functional blocks are highly configurable as per software requirements.

Firmware sets up the following before the kernel boots
 - Enables required number of RVU PFs based on number of physical links.
 - Number of VFs per PF are either static or configurable at compile time.
   Based on config, firmware assigns VFs to each of the PFs.
 - Also assigns MSIX vectors to each of PF and VFs.
 - These are not changed after kernel boot.

Drivers
=======

Linux kernel will have multiple drivers registering to different PF and VFs
of RVU. With respect to networking there will be 3 flavours of drivers.

Admin Function driver
---------------------

As mentioned above, RVU PF0 is called the admin function (AF). This driver
supports resource provisioning and configuration of functional blocks.
It does not handle any I/O. It sets up a few basics, but most of the
functionality is achieved via configuration requests from PFs and VFs.

PFs/VFs communicate with AF via a shared memory region (mailbox). Upon
receiving requests AF does resource provisioning and other HW configuration.
AF is always attached to host kernel, but PFs and their VFs may be used by host
kernel itself, or attached to VMs or to userspace applications like
DPDK etc. So AF has to handle provisioning/configuration requests sent
by any device from any domain.

AF driver also interacts with underlying firmware to
 - Manage physical ethernet links, i.e. CGX LMACs.
 - Retrieve information like speed, duplex, autoneg etc.
 - Retrieve PHY EEPROM and stats.
 - Configure FEC, PAM modes.
 - etc.

From a pure networking side, the AF driver supports the following functionality.
 - Map a physical link to a RVU PF to which a netdev is registered.
 - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
   for regular networking functionality.
 - Flow control (pause frames) enable/disable/config.
 - HW PTP timestamping related config.
 - NPC parser profile config, basically how to parse pkt and what info to extract.
 - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
 - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
 - Defines receive side scaling (RSS) algorithms.
 - Defines segmentation offload algorithms (e.g. TSO).
 - VLAN stripping, capture and insertion config.
 - SSO and TIM blocks config which provide packet scheduling support.
 - Debugfs support, to check current resource provisioning, current status of
   NPA pools, NIX RQ, SQ and CQs, various stats etc which helps in debugging issues.
 - And many more.

Physical Function driver
------------------------

This RVU PF handles IO and is mapped to a physical ethernet link, and this
driver registers a netdev. It supports SR-IOV. As said above, this driver
communicates with AF via a mailbox. To retrieve information from physical
links this driver talks to AF, and AF gets that info from firmware and responds
back, i.e. the PF driver cannot talk to firmware directly.

Supports ethtool for configuring links, RSS, queue count, queue size,
flow control, ntuple filters, dump PHY EEPROM, config FEC etc.

Virtual Function driver
-----------------------

There are two types of VFs: VFs that share the physical link with their parent
SR-IOV PF and VFs which work in pairs using internal HW loopback channels (LBK).

Type1:
 - These VFs and their parent PF share a physical link and are used for outside communication.
 - VFs cannot communicate with AF directly, they send mbox message to PF and PF
   forwards that to AF. AF after processing, responds back to PF and PF forwards
   the reply to VF.
 - From a functionality point of view there is no difference between PF and VF, as the same type
   of HW resources are attached to both. But the user can configure a few things only
   from PF, as PF is treated as owner/admin of the link.

Type2:
 - RVU PF0, i.e. the admin function, creates these VFs and maps them to loopback block's channels.
 - A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair, i.e. pkts sent out of
   VF0 will be received by VF1 and vice versa.
 - These VFs can be used by applications or virtual machines to communicate between them
   without sending traffic outside. There is no switch present in HW, hence the support
   for loopback VFs.
 - These communicate directly with AF (PF0) via mbox.

Except for the IO channels or links used for packet reception and transmission there is
no other difference between these VF types. AF driver takes care of IO channel mapping,
hence same VF driver works for both types of devices.

Basic packet flow
=================

Ingress
-------

1. CGX LMAC receives packet.
2. Forwards the packet to the NIX block.
3. Then submitted to NPC block for parsing and then MCAM lookup to get the destination RVU device.
4. NIX LF attached to the destination RVU device allocates a buffer from RQ mapped buffer pool of NPA block LF.
5. RQ may be selected by RSS or by configuring MCAM rule with a RQ number.
6. Packet is DMA'ed and driver is notified.

Egress
------

1. Driver prepares a send descriptor and submits to SQ for transmission.
2. The SQ is already configured (by AF) to transmit on a specific link/channel.
3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
4. NIX block transmits the pkt on the designated channel.
5. NPC MCAM entries can be installed to divert pkt onto a different channel.

Devlink health reporters
========================

NPA Reporters
-------------
The NPA reporters are responsible for reporting and recovering the following group of errors:

1. GENERAL events

   - Error due to operation of unmapped PF.
   - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).

2. ERROR events

   - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
   - AQ Doorbell Error.

3. RAS events

   - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.

4. RVU events

   - Error due to unmapped slot.

Sample Output::

	~# devlink health
	pci/0002:01:00.0:
	  reporter hw_npa_intr
	      state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_gen
	      state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_err
	      state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_ras
	      state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true

Each reporter dumps the

 - Error Type
 - Error Register value
 - Reason in words

For example::

	~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_gen
	 NPA_AF_GENERAL:
	         NPA General Interrupt Reg : 1
	         NIX0: free disabled RX
	~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_intr
	 NPA_AF_RVU:
	         NPA RVU Interrupt Reg : 1
	         Unmap Slot Error
	~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_err
	 NPA_AF_ERR:
	        NPA Error Interrupt Reg : 4096
	        AQ Doorbell Error


NIX Reporters
-------------
The NIX reporters are responsible for reporting and recovering the following group of errors:

1. GENERAL events

   - Receive mirror/multicast packet drop due to insufficient buffer.
   - SMQ Flush operation.

2. ERROR events

   - Memory Fault due to WQE read/write from multicast/mirror buffer.
   - Receive multicast/mirror replication list error.
   - Receive packet on an unmapped PF.
   - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
   - AQ Doorbell Error.

3. RAS events

   - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
   - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer.
   - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.

4. RVU events

   - Error due to unmapped slot.

Sample Output::

	~# ./devlink health
	pci/0002:01:00.0:
	  reporter hw_npa_intr
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_gen
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_err
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_ras
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_intr
	    state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_gen
	    state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_err
	    state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_ras
	    state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true

Each reporter dumps the

 - Error Type
 - Error Register value
 - Reason in words

For example::

	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
	 NIX_AF_RVU:
	        NIX RVU Interrupt Reg : 1
	        Unmap Slot Error
	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
	 NIX_AF_GENERAL:
	        NIX General Interrupt Reg : 1
	        Rx multicast pkt drop
	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
	 NIX_AF_ERR:
	        NIX Error Interrupt Reg : 64
	        Rx on unmapped PF_FUNC


Quality of service
==================


Hardware algorithms used in scheduling
--------------------------------------

OcteonTx2 silicon and CN10K transmit interface consists of five transmit levels
starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse MDQ, TL4 to TL1
levels. Each level contains an array of queues to support scheduling and shaping.
The hardware uses the below algorithms depending on the priority of scheduler queues.
Once the user creates tc classes with different priorities, the driver configures
schedulers allocated to the class with specified priority along with rate-limiting
configuration.

1. Strict Priority

      -  Once packets are submitted to MDQ, hardware picks all active MDQs having different priority
         using strict priority.

2. Round Robin

      - Active MDQs having the same priority level are chosen using round robin.


Setup HTB offload
-----------------

1. Enable HW TC offload on the interface::

        # ethtool -K <interface> hw-tc-offload on

2. Create htb root::

        # tc qdisc add dev <interface> clsact
        # tc qdisc replace dev <interface> root handle 1: htb offload

3. Create tc classes with different priorities::

        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1

        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7

4. Create tc classes with same priorities and different quantum::

        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600

        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416

        # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768


RVU Representors
================

RVU representor driver adds support for creation of representor devices for
RVU PFs' VFs in the system. Representor devices are created when user enables
the switchdev mode.
Switchdev mode can be enabled either before or after setting up SRIOV numVFs.
All representor devices share a single NIXLF but each has dedicated Rx/Tx
queues. RVU PF representor driver registers a separate netdev for each
Rx/Tx queue pair.

Current HW does not have a built-in switch that can do L2 learning and
forward packets between representee and representor. Hence, the packet path
between a representee and its representor is achieved by setting up appropriate
NPC MCAM filters.
Transmit packets matching these filters will be looped back through the hardware
loopback channel/interface (i.e., instead of sending them out of the MAC interface).
The looped-back packets will again match the installed filters and will be forwarded.
This way representee => representor and representor => representee packet
path is achieved. These rules get installed when representors are created
and get activated/deactivated based on the representor/representee interface state.

Usage example:

 - Change device to switchdev mode::

	# devlink dev eswitch set pci/0002:1c:00.0 mode switchdev

 - List of representor devices on the system::

	# ip link show
	Rpf1vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether f6:43:83:ee:26:21 brd ff:ff:ff:ff:ff:ff
	Rpf1vf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 12:b2:54:0e:24:54 brd ff:ff:ff:ff:ff:ff
	Rpf1vf2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 4a:12:c4:4c:32:62 brd ff:ff:ff:ff:ff:ff
	Rpf1vf3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether ca:cb:68:0e:e2:6e brd ff:ff:ff:ff:ff:ff
	Rpf2vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 06:cc:ad:b4:f0:93 brd ff:ff:ff:ff:ff:ff


To delete the representor devices from the system, change the device to legacy mode.

 - Change device to legacy mode::

	# devlink dev eswitch set pci/0002:1c:00.0 mode legacy

RVU representors can be managed using devlink ports
(see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.

 - Show devlink ports of representors::

	# devlink port
	pci/0002:1c:00.0/0: type eth netdev Rpf1vf0 flavour physical port 0 splittable false
	pci/0002:1c:00.0/1: type eth netdev Rpf1vf1 flavour pcivf controller 0 pfnum 1 vfnum 1 external false splittable false
	pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
	pci/0002:1c:00.0/3: type eth netdev Rpf1vf3 flavour pcivf controller 0 pfnum 1 vfnum 3 external false splittable false

Function attributes
===================

The RVU representor supports function attributes for representors.
Port function configuration of the representors is supported through devlink eswitch port.

MAC address setup
-----------------

RVU representor driver supports the devlink port function attr mechanism to set up the MAC
address. (refer to Documentation/networking/devlink/devlink-port.rst)

 - To set up the MAC address for port 2::

	# devlink port function set pci/0002:1c:00.0/2 hw_addr 5c:a1:1b:5e:43:11
	# devlink port show pci/0002:1c:00.0/2
	pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
	function:
		hw_addr 5c:a1:1b:5e:43:11


TC offload
==========

The RVU representor driver implements support for offloading tc rules using port representors.

 - Drop packets with vlan id 3::

	# tc filter add dev Rpf1vf0 protocol 802.1Q parent ffff: flower vlan_id 3 vlan_ethtype ipv4 skip_sw action drop

 - Redirect packets with vlan id 5 and IPv4 packets to eth1, after stripping vlan header::

	# tc filter add dev Rpf1vf0 ingress protocol 802.1Q flower vlan_id 5 vlan_ethtype ipv4 skip_sw action vlan pop action mirred ingress redirect dev eth1
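
When several VLAN drop rules are needed, the command from the first TC offload
example can be generated rather than typed by hand. The following is only a
minimal sketch: ``vlan_drop_cmd`` is a hypothetical helper name, not part of
the driver or of iproute2, and it only prints the command (a dry run) so it can
be reviewed before being executed as root. The 0..4095 range check assumes
nothing beyond the VLAN id being a 12-bit field::

	#!/bin/sh
	# Hypothetical dry-run helper: print the tc flower rule that drops
	# IPv4 packets carrying the given VLAN id on a representor netdev.
	vlan_drop_cmd() {
	    rep="$1"   # representor netdev, e.g. Rpf1vf0
	    vid="$2"   # VLAN id to match
	    # A VLAN id is a 12-bit field, so reject anything outside 0..4095
	    # (non-numeric input fails the test as well).
	    if ! [ "$vid" -ge 0 ] 2>/dev/null || ! [ "$vid" -le 4095 ] 2>/dev/null; then
	        echo "invalid VLAN id: $vid" >&2
	        return 1
	    fi
	    printf 'tc filter add dev %s protocol 802.1Q parent ffff: flower vlan_id %s vlan_ethtype ipv4 skip_sw action drop\n' "$rep" "$vid"
	}

	# Print the rule from the example above; nothing is installed.
	vlan_drop_cmd Rpf1vf0 3

The printed line matches the drop rule shown above for ``Rpf1vf0`` and vlan id 3.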