.. SPDX-License-Identifier: GPL-2.0

====================
Considering hardware
====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The way a workload is handled can be influenced by the hardware it runs on.
Key components include the CPU, memory, and the buses that connect them.
These resources are shared among all applications on the system.
As a result, heavy utilization of one resource by a single application
can affect the deterministic handling of workloads in other applications.

Below is a brief overview.

System memory and cache
-----------------------

Main memory and the associated caches are the most common shared resources among
tasks in a system. One task can dominate the available caches, forcing another
task to wait until a cache line is written back to main memory before it can
proceed. The impact of this contention varies based on write patterns and the
size of the caches available. Larger caches may reduce stalls because more lines
can be buffered before being written back.
Conversely, certain write patterns
may trigger the cache controller to flush many lines at once, causing
applications to stall until the operation completes.

This issue can be partly mitigated if applications do not share the same CPU
cache. The kernel is aware of the cache topology and exports this information to
user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.

Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
is minimized, bottlenecks can still occur when accessing system memory. Memory
is used not only by the CPU but also by peripheral devices via DMA, such as
graphics cards or network adapters.

In some cases, cache and memory bottlenecks can be controlled if the hardware
provides the necessary support. On x86 systems, Intel offers Cache Allocation
Technology (CAT), which enables cache partitioning among applications and
provides control over the interconnect. AMD provides similar functionality under
Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
System Resource Partitioning and Monitoring (MPAM).
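
The cache topology that the kernel exports to user space can also be read
directly from sysfs, without a visualization tool. The following is a minimal
sketch, not a definitive implementation. It assumes the usual layout under
/sys/devices/system/cpu/cpuN/cache/indexM/ with the files type, level and
shared_cpu_list. The helper names parse_cpu_list and cpus_sharing_cache are
made up for this illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A single CPU ("8") has no range end, so reuse the start value.
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def cpus_sharing_cache(cpu, level, sysfs_root="/sys/devices/system/cpu"):
    """Return the CPUs sharing a data or unified cache of `level` with `cpu`.

    Returns an empty list if no such cache is described in sysfs.
    """
    cache_dir = Path(sysfs_root) / f"cpu{cpu}" / "cache"
    for index in sorted(cache_dir.glob("index*")):
        # Skip instruction caches; only data and unified caches are reported.
        if (index / "type").read_text().strip() == "Instruction":
            continue
        if int((index / "level").read_text()) == level:
            return parse_cpu_list((index / "shared_cpu_list").read_text())
    return []


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(f"L{level} of CPU 0 is shared with:", cpus_sharing_cache(0, level))
```

A CPU that does not appear in the list returned for a given level does not
share that cache with the queried CPU, which makes it a candidate for placing a
task that should not compete for it.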

CAT, PQoS, and MPAM can be configured through the Linux Resource Control
(resctrl) interface. For details, see Documentation/filesystems/resctrl.rst.

The perf tool can be used to monitor cache behavior. It can analyze
cache misses of an application and compare how they change under
different workloads on a neighboring CPU. Even more powerful, the perf
c2c tool can help identify cache-to-cache issues, where multiple CPU
cores repeatedly access and modify data on the same cache line.

Hardware buses
--------------

Real-time systems often need to access hardware directly to perform their work.
Any latency in this process is undesirable, as it can affect the outcome of the
task. For example, on an I/O bus, a changed output may not become immediately
visible but instead appear with variable delay depending on the latency of the
bus used for communication.

A bus such as PCI is relatively simple because register accesses are routed
directly to the connected device. In the worst case, a read operation stalls the
CPU until the device responds.

A bus such as USB is more complex, involving multiple layers.
A register read
or write is wrapped in a USB Request Block (URB), which is then sent by the
USB host controller to the device. Timing and latency are influenced by the
underlying USB bus. Requests cannot be sent immediately; they must align with
the next frame boundary according to the endpoint type and the host controller's
scheduling rules. This can introduce delays and additional latency. For example,
a network device connected via USB may still deliver sufficient throughput, but
the added latency when sending or receiving packets may fail to meet the
requirements of certain real-time use cases.

Additional restrictions on bus latency can arise from power management. For
instance, PCIe with Active State Power Management (ASPM) enabled can suspend
the link between the device and the host. While this behavior is beneficial for
power savings, it delays device access and adds latency to responses. This issue
is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
affected by power management mechanisms.

Virtualization
--------------

In a virtualized environment such as KVM, each guest CPU is represented as a
thread on the host.
If such a thread runs with real-time priority, the system
should be tested to confirm it can sustain this behavior over extended periods.
Because of its priority, the thread will not be preempted by lower-priority
threads (such as SCHED_OTHER), which may then receive no CPU time. This can
cause problems if a lower-priority thread is pinned to a CPU already occupied by
a real-time task and is therefore unable to make progress. Even if a CPU has
been isolated, the system may still (accidentally) start a per-CPU thread on
that CPU. Ensuring that a guest CPU goes idle is difficult, as it requires
avoiding both task scheduling and interrupt handling. Furthermore, if the guest
system is booted with the option **idle=poll**, a guest CPU with no work will
not enter an idle state and will instead spin until an event arrives.

Device handling introduces additional considerations. Emulated PCI devices or
VirtIO devices require a counterpart on the host to complete requests. This
adds latency because the host must intercept and either process the request
directly or schedule a thread for its completion. These delays can be avoided if
the required PCI device is passed directly through to the guest. Some devices,
such as networking or storage controllers, support the PCIe SR-IOV feature.
SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
which can then be assigned to different guests.

Networking
----------

For low-latency networking, the full networking stack may be undesirable, as it
can introduce additional sources of delay. In this context, XDP can be used
as a shortcut to bypass much of the stack while still relying on the kernel's
network driver.

The requirements are that the network driver must support XDP, preferably using
an "skb pool", and that the application must use an XDP socket. Additional
configuration may involve BPF filters, tuning networking queues, or configuring
qdiscs for time-based transmission. These techniques are often
applied in Time-Sensitive Networking (TSN) environments.

Documenting all required steps exceeds the scope of this text. For detailed
guidance, see the TSN documentation at https://tsn.readthedocs.io.

Another useful resource is the Linux Real-Time Communication Testbench
(https://github.com/Linutronix/RTC-Testbench).
The goal of this project is to validate real-time network communication.
It can
be thought of as a "cyclictest" for networking and also serves as a starting
point for application development.