.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by taking the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash) as a key into the indirection table and
reading the corresponding value.
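
The masked-hash table lookup described above can be sketched in Python. This is only an illustrative model: the table contents and the hash value are made up, and the real Toeplitz hash is computed by the NIC hardware.

```python
# Sketch of RSS queue selection with a 128-entry indirection table.
# The hash value below is a stand-in for the hardware Toeplitz hash.

NUM_QUEUES = 8

# Default mapping: queues distributed evenly across the 128 entries.
indirection_table = [i % NUM_QUEUES for i in range(128)]

def rss_queue(flow_hash: int) -> int:
    """Keep the low-order seven bits of the hash and index the table."""
    return indirection_table[flow_hash & 0x7F]

# Packets with the same flow hash always land on the same queue.
assert rss_queue(0xDEADBEEF) == rss_queue(0xDEADBEEF)
```

Reweighting the table (for example, repeating queue 0 in more entries) changes the relative share of flows each queue receives, which is what the ethtool indirection-table commands manipulate.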

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.


RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see
Documentation/core-api/irq/irq-affinity.rst.
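
The value written to /proc/irq/<N>/smp_affinity is a hexadecimal bitmap with one bit per CPU. A small hypothetical helper (not part of any kernel tooling) shows how such a mask is built from a CPU set:

```python
# Sketch: build the hex bitmask written to /proc/irq/<N>/smp_affinity
# to pin an IRQ to a chosen set of CPUs. CPU numbers are examples.

def cpus_to_affinity_mask(cpus) -> str:
    """Each CPU n contributes bit n; the file takes the mask in hex."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

# Steer one queue's IRQ to CPU 0, another's to CPU 2, spreading load:
assert cpus_to_affinity_mask([0]) == "1"
assert cpus_to_affinity_mask([2]) == "4"
assert cpus_to_affinity_mask([0, 1, 2, 3]) == "f"
```

Writing a distinct single-CPU mask to each receive queue's IRQ is one way to spread receive interrupts between CPUs as suggested above.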

Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


Dedicated RSS contexts
~~~~~~~~~~~~~~~~~~~~~~

Modern NICs support creating multiple co-existing RSS configurations
which are selected based on explicit matching rules. This can be very
useful when an application wants to constrain the set of queues receiving
traffic for e.g. a particular destination port or IP address.
The example below shows how to direct all traffic to TCP port 22
to queues 0 and 1.

To create an additional RSS context use::

  # ethtool -X eth0 hfunc toeplitz context new
  New RSS context is 1

The kernel reports back the ID of the allocated context (the default, always
present RSS context has ID of 0).
The new context can be queried and
modified using the same APIs as the default context::

  # ethtool -x eth0 context 1
  RX flow hash indirection table for eth0 with 13 RX ring(s):
      0:      0     1     2     3     4     5     6     7
      8:      8     9    10    11    12     0     1     2
  [...]
  # ethtool -X eth0 equal 2 context 1
  # ethtool -x eth0 context 1
  RX flow hash indirection table for eth0 with 13 RX ring(s):
      0:      0     1     0     1     0     1     0     1
      8:      0     1     0     1     0     1     0     1
  [...]

To make use of the new context direct traffic to it using an n-tuple
filter::

  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1
  Added rule with ID 1023

When done, remove the context and the rule::

  # ethtool -N eth0 delete 1023
  # ethtool -X eth0 context 1 delete


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence the CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the queue that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol).
This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. the computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system.
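
The hash-to-CPU mapping described in this section (flow hash modulo the size of the configured CPU list) can be sketched as follows. This is a simplified model of get_rps_cpu(), with a made-up bitmap and hash value; the in-kernel code works on the bitmap directly:

```python
# Sketch of RPS target-CPU selection: the rps_cpus bitmap for a receive
# queue is expanded to a CPU list, which the flow hash indexes modulo
# its size. Simplified model; values are illustrative.

def parse_rps_cpus(bitmap_hex: str):
    """Expand the hex bitmap from rx-<n>/rps_cpus into a CPU list."""
    mask = int(bitmap_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]

def rps_target_cpu(flow_hash: int, bitmap_hex: str) -> int:
    cpus = parse_rps_cpus(bitmap_hex)
    return cpus[flow_hash % len(cpus)]

assert parse_rps_cpus("f") == [0, 1, 2, 3]   # CPUs 0-3 may process
assert rps_target_cpu(13, "f") == 1          # 13 % 4 == 1
```

Because the mapping is a pure function of the flow hash, all packets of one flow land on the same backlog queue, preserving per-flow ordering.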

At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that CPU already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on.
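
The drop decision described above can be sketched as follows. This is a deliberately simplified, illustrative model (single CPU, one table, a fixed-size history window instead of the kernel's decaying counters); constants mirror the defaults mentioned in the text:

```python
# Simplified sketch of the flow limit drop decision: only active once
# the backlog exceeds half netdev_max_backlog, and a flow is penalized
# when it accounts for more than half of the recent packet history.

NETDEV_MAX_BACKLOG = 1000  # sysctl net.core.netdev_max_backlog
HISTORY = 256              # per-flow count window, in packets
TABLE_LEN = 4096           # default number of hash buckets

class FlowLimit:
    def __init__(self):
        self.buckets = [0] * TABLE_LEN
        self.count = 0

    def should_drop(self, flow_hash: int, backlog_len: int) -> bool:
        if backlog_len <= NETDEV_MAX_BACKLOG // 2:
            return False              # CPU not saturated: never drop here
        if self.count >= HISTORY:     # start a fresh history window
            self.buckets = [0] * TABLE_LEN
            self.count = 0
        b = flow_hash % TABLE_LEN
        self.buckets[b] += 1
        self.count += 1
        # Drop if this flow exceeds half of the recent packets.
        return self.buckets[b] > HISTORY // 2

# Example: a flow that dominates a congested CPU starts getting dropped,
# while small flows on the same CPU continue to pass.
fl = FlowLimit()
drops = [fl.should_drop(0xABC, backlog_len=600) for _ in range(200)]
```

Note how a small flow hashing to a different bucket is unaffected even while the dominant flow is being dropped, which is the prioritization the feature provides.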

Flow limit is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) plus the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase datacache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running.
RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see the RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.
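
A condensed sketch of how the two tables interact when choosing a CPU is shown below. It folds in the migration rules spelled out later in this section (a flow may only move to the desired CPU once the old backlog has drained, or the current CPU is unset or offline). All names and values are illustrative, not the kernel's:

```python
# Condensed sketch of the RFS CPU decision: compare the desired CPU
# (rps_sock_flow_table) with the current CPU and tail counter recorded
# in rps_dev_flow_table, and only migrate when it cannot reorder packets.

NR_CPU_IDS = 8
UNSET = NR_CPU_IDS  # ">= nr_cpu_ids" means no valid CPU recorded

def rfs_select_cpu(desired_cpu, entry, backlog_heads, online):
    """entry is (current_cpu, tail_counter) from rps_dev_flow_table."""
    current_cpu, tail = entry
    if desired_cpu == current_cpu:
        return current_cpu
    if (current_cpu >= NR_CPU_IDS                   # current CPU unset
            or current_cpu not in online            # current CPU offline
            or backlog_heads[current_cpu] >= tail): # old backlog drained
        return desired_cpu                          # safe to migrate
    return current_cpu                # packets still outstanding: stay

heads = {0: 100, 1: 50}   # per-CPU backlog head counters
online = {0, 1}
# Outstanding packets on CPU 0 (head 100 < tail 120): stay on CPU 0.
assert rfs_select_cpu(1, (0, 120), heads, online) == 0
# Old backlog drained (head 100 >= tail 90): move to desired CPU 1.
assert rfs_select_cpu(1, (0, 90), heads, online) == 1
```

The key property is that the flow follows the application thread, but never so eagerly that packets enqueued on the old CPU could be overtaken.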

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured.
The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow.
The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU-to-hardware-queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPUs to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set.
This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Second, the cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device.
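
The CPUs-map side of this reverse mapping can be sketched as follows: the per-queue xps_cpus bitmaps are inverted into a CPU-to-queues map, and the flow hash breaks ties when several queues match. This is a simplified model of the selection done in get_xps_queue(); the masks and hash value are illustrative:

```python
# Sketch of XPS transmit-queue selection from a CPUs map. Simplified
# model: real XPS works per network device inside the kernel.

def build_reverse_map(xps_cpus):
    """xps_cpus maps tx queue -> CPU bitmap (as an int)."""
    rev = {}
    for queue, mask in xps_cpus.items():
        cpu = 0
        while mask >> cpu:
            if mask & (1 << cpu):
                rev.setdefault(cpu, []).append(queue)
            cpu += 1
    return rev

def xps_select_queue(cpu, flow_hash, rev):
    queues = rev.get(cpu)
    if not queues:
        return None                       # fall back to default selection
    if len(queues) == 1:
        return queues[0]
    return queues[flow_hash % len(queues)]  # hash breaks ties

# tx-0 serves CPUs 0-1 (mask 0x3), tx-1 serves CPUs 2-3 (mask 0xc):
rev = build_reverse_map({0: 0x3, 1: 0xC})
assert xps_select_queue(1, 12345, rev) == 0
assert xps_select_queue(3, 12345, rev) == 1
```

With one queue per CPU, every CPU resolves to exactly one transmit queue, which is the contention-free configuration suggested below.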

When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on receive queue(s)
map, the transmit device is not validated against the receive device as it
requires an expensive lookup operation in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.


XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init.
The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs.

For selection based on the CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on the receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured, mapping receive-queue(s) to transmit queue(s). If the
user configuration for the receive-queue map does not apply, then the
transmit queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by HW, where currently
a max-rate attribute is supported, by setting a Mbps value to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.


Further Information
===================
RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35.
Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)