
.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================
This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.
The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering

RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. This mechanism is generally
known as "Receive-side Scaling" (RSS). The goal of RSS and the other
scaling techniques is to increase performance uniformly. Multi-queue
distribution can also be used for traffic prioritization, but that is not
the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number: the low-order bits of the packet hash index the
table, and the packet is steered to the queue stored there.

Some NICs can compute a symmetric hash, so that swapping the source and
destination addresses and ports of a packet yields the same hash value.
This is useful for applications (such as IDS or firewall software) that need
both directions of the flow to land on the same Rx queue (and CPU).
"Symmetric-XOR" and "Symmetric-OR-XOR" are types of RSS algorithms that
achieve this hash symmetry by transforming the input tuple before hashing.
Specifically, the "Symmetric-XOR" algorithm XORs the input
as follows::

    # (SRC_IP ^ DST_IP, SRC_PORT ^ DST_PORT)

The "Symmetric-OR-XOR" algorithm, on the other hand, transforms the input as
follows::

    # (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)

Some advanced NICs allow steering packets to queues based on programmable
filters. For example, webserver-bound TCP port 80 packets can be directed
to their own receive queue. Such "n-tuple" filters can
be configured from ethtool (--config-ntuple).

RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter specifying the number of hardware queues to configure.
A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The default
mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.

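
As a sketch, for a hypothetical four-queue device eth0, the table can be
inspected and then reweighted so the first two queues receive most of
the traffic (the device name and weights are illustrative)::

  # ethtool --show-rxfh-indir eth0
  # ethtool --set-rxfh-indir eth0 weight 6 6 1 1
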
Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see
Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.

For low latency networking, the optimal setting is to allocate as many
queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.

Modern NICs support creating multiple co-existing RSS configurations
which are selected based on explicit matching rules.

To create an additional RSS context::

  # ethtool -X eth0 hfunc toeplitz context new
  New RSS context is 1

The indirection table of the new context can be inspected and modified::

  # ethtool -x eth0 context 1
  # ethtool -X eth0 equal 2 context 1
  # ethtool -x eth0 context 1

To make use of the new context, direct traffic to it using an n-tuple
filter::

  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1
  Added rule with ID 1023

When done, remove the context and the rule::

  # ethtool -N eth0 delete 1023
  # ethtool -X eth0 context 1 delete

RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Whereas RSS selects the queue and hence the CPU that will run the
hardware interrupt handler, RPS selects the CPU to perform protocol
processing above the interrupt handler. This is accomplished by placing
the packet on the desired CPU's backlog queue and waking up the CPU for
processing.

RPS does not increase the hardware device interrupt rate, although it
does introduce inter-processor interrupts (IPIs).

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet's addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). The hash is either provided by capable
hardware in the receive descriptor or computed in the stack; it is saved
in skb->hash and can be used elsewhere in the stack as a hash of the
packet's flow.

Each receive hardware queue has an associated list of CPUs to which RPS
may enqueue packets for processing. For each received packet, an index
into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU's backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs with
newly queued packets. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.

RPS Configuration
-----------------

The list of CPUs to which RPS may forward traffic can be configured for
each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.

For a single queue device, a typical RPS configuration would be to set
rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

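
As a sketch, on a hypothetical single-queue device eth0 the following
would allow RPS to steer packets to CPUs 0-3 (bitmap 0xf; the device
name and mask are illustrative)::

  # echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
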
For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.

RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped.

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs.

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives.

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.

RFS: Receive Flow Steering
==========================

The goal of RFS is to increase datacache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packets is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, the flow hash is used as an index into a flow lookup table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue, and a tail
counter computed as head counter + queue length. In other words, the
counter in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out-of-order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU's backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.

RFS Configuration
-----------------

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of queues.

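
As a sketch, for a hypothetical 16-queue device and a global table of
32768 entries, each queue's flow count could be set to 32768 / 16 = 2048
(the device name and sizes are illustrative; the second command would be
repeated for each rx queue)::

  # echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  # echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
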
Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread's CPU in the cache hierarchy.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap ("CPU affinity reverse map") kernel library
to populate the map, or can delegate the cpu_rmap
management to the kernel by calling netif_enable_cpu_rmap(). For each CPU,
the corresponding queue in the map is set to be one whose processing CPU is
closest in cache locality.

Accelerated RFS Configuration
-----------------------------

The map of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.

XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This is accomplished by recording one of two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

In the first case, the goal is usually to assign transmit queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
reduces contention on the device queue lock, since fewer CPUs contend
for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue), and lowers the cache miss rate on transmit completion.

The second kind of map is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs; each thread handles packets
received on a single queue, and the receive queue number is cached in the
socket for the connection. Sending on the corresponding transmit queue
helps
in keeping the CPU overhead low: transmit completion work is locked into
the same queue association that a given application is polling on, which
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during busy poll, transmit completion
may be processed in the same thread context, reducing latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection as a key into the receive queue-to-transmit queue
lookup table, or alternatively the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set.

The queue chosen for a flow is saved in the corresponding socket structure
and is reused for subsequent packets in the flow to prevent out-of-order
(ooo) packets, unless
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out-of-order packets.

XPS Configuration
-----------------

If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to a transmit queue can be inspected and configured using sysfs.

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

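
As a sketch, to let only CPUs 0-3 use transmit queue 0 of a hypothetical
device eth0 (bitmap 0xf; the device name and mask are illustrative)::

  # echo f > /sys/class/net/eth0/queues/tx-0/xps_cpus
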
For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured, mapping receive-queue(s) to transmit queue(s). If
the user configuration for the receive-queue map does not apply, then the
transmit queue is selected based on the CPUs map.

This is a rate-limitation mechanism implemented in hardware; currently
only a max-rate attribute is supported, set as a Mbps value via::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.

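
As a sketch, to cap transmit queue 0 of a hypothetical device eth0 at
1 Gbit/s (device name and rate are illustrative; writing 0 removes the
cap)::

  # echo 1000 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
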
Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)