1========================= 2Capacity Aware Scheduling 3========================= 4 51. CPU Capacity 6=============== 7 81.1 Introduction 9---------------- 10 11Conventional, homogeneous SMP platforms are composed of purely identical 12CPUs. Heterogeneous platforms on the other hand are composed of CPUs with 13different performance characteristics - on such platforms, not all CPUs can be 14considered equal. 15 16CPU capacity is a measure of the performance a CPU can reach, normalized against 17the most performant CPU in the system. Heterogeneous systems are also called 18asymmetric CPU capacity systems, as they contain CPUs of different capacities. 19 20Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems 21from two factors: 22 23- not all CPUs may have the same microarchitecture (µarch). 24- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be 25 physically able to attain the higher Operating Performance Points (OPP). 26 27Arm big.LITTLE systems are an example of both. The big CPUs are more 28performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, 29smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones 30can. 31 32CPU performance is usually expressed in Millions of Instructions Per Second 33(MIPS), which can also be expressed as a given amount of instructions attainable 34per Hz, leading to:: 35 36 capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) 37 381.2 Scheduler terms 39------------------- 40 41Two different capacity values are used within the scheduler. A CPU's 42``original capacity`` is its maximum attainable capacity, i.e. its maximum 43attainable performance level. This original capacity is returned by 44the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original 45capacity`` to which some loss of available performance (e.g. time spent 46handling IRQs) is subtracted. 47 48Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, 49while ``original capacity`` is class-agnostic. The rest of this document will use 50the term ``capacity`` interchangeably with ``original capacity`` for the sake of 51brevity. 52 531.3 Platform examples 54--------------------- 55 561.3.1 Identical OPPs 57~~~~~~~~~~~~~~~~~~~~ 58 59Consider an hypothetical dual-core asymmetric CPU capacity system where 60 61- work_per_hz(CPU0) = W 62- work_per_hz(CPU1) = W/2 63- all CPUs are running at the same fixed frequency 64 65By the above definition of capacity: 66 67- capacity(CPU0) = C 68- capacity(CPU1) = C/2 69 70To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would 71be a LITTLE. 72 73With a workload that periodically does a fixed amount of work, you will get an 74execution trace like so:: 75 76 CPU0 work ^ 77 | ____ ____ ____ 78 | | | | | | | 79 +----+----+----+----+----+----+----+----+----+----+-> time 80 81 CPU1 work ^ 82 | _________ _________ ____ 83 | | | | | | 84 +----+----+----+----+----+----+----+----+----+----+-> time 85 86CPU0 has the highest capacity in the system (C), and completes a fixed amount of 87work W in T units of time. On the other hand, CPU1 has half the capacity of 88CPU0, and thus only completes W/2 in T. 89 901.3.2 Different max OPPs 91~~~~~~~~~~~~~~~~~~~~~~~~ 92 93Usually, CPUs of different capacity values also have different maximum 94OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: 95 96- max_freq(CPU0) = F 97- max_freq(CPU1) = 2/3 * F 98 99This yields: 100 101- capacity(CPU0) = C 102- capacity(CPU1) = C/3 103 104Executing the same workload as described in 1.3.1, which each CPU running at its 105maximum frequency results in:: 106 107 CPU0 work ^ 108 | ____ ____ ____ 109 | | | | | | | 110 +----+----+----+----+----+----+----+----+----+----+-> time 111 112 workload on CPU1 113 CPU1 work ^ 114 | ______________ ______________ ____ 115 | | | | | | 116 +----+----+----+----+----+----+----+----+----+----+-> time 117 1181.4 Representation caveat 119------------------------- 120 121It should be noted that having a *single* value to represent differences in CPU 122performance is somewhat of a contentious point. The relative performance 123difference between two different µarchs could be X% on integer operations, Y% on 124floating point operations, Z% on branches, and so on. Still, results using this 125simple approach have been satisfactory for now. 126 1272. Task utilization 128=================== 129 1302.1 Introduction 131---------------- 132 133Capacity aware scheduling requires an expression of a task's requirements with 134regards to CPU capacity. Each scheduler class can express this differently, and 135while task utilization is specific to CFS, it is convenient to describe it here 136in order to introduce more generic concepts. 137 138Task utilization is a percentage meant to represent the throughput requirements 139of a task. A simple approximation of it is the task's duty cycle, i.e.:: 140 141 task_util(p) = duty_cycle(p) 142 143On an SMP system with fixed frequencies, 100% utilization suggests the task is a 144busy loop. Conversely, 10% utilization hints it is a small periodic task that 145spends more time sleeping than executing. Variable CPU frequencies and 146asymmetric CPU capacities complexify this somewhat; the following sections will 147expand on these. 148 1492.2 Frequency invariance 150------------------------ 151 152One issue that needs to be taken into account is that a workload's duty cycle is 153directly impacted by the current OPP the CPU is running at. Consider running a 154periodic workload at a given frequency F:: 155 156 CPU work ^ 157 | ____ ____ ____ 158 | | | | | | | 159 +----+----+----+----+----+----+----+----+----+----+-> time 160 161This yields duty_cycle(p) == 25%. 162 163Now, consider running the *same* workload at frequency F/2:: 164 165 CPU work ^ 166 | _________ _________ ____ 167 | | | | | | 168 +----+----+----+----+----+----+----+----+----+----+-> time 169 170This yields duty_cycle(p) == 50%, despite the task having the exact same 171behaviour (i.e. executing the same amount of work) in both executions. 172 173The task utilization signal can be made frequency invariant using the following 174formula:: 175 176 task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) 177 178Applying this formula to the two examples above yields a frequency invariant 179task utilization of 25%. 180 1812.3 CPU invariance 182------------------ 183 184CPU capacity has a similar effect on task utilization in that running an 185identical workload on CPUs of different capacity values will yield different 186duty cycles. 187 188Consider the system described in 1.3.2., i.e.:: 189 190- capacity(CPU0) = C 191- capacity(CPU1) = C/3 192 193Executing a given periodic workload on each CPU at their maximum frequency would 194result in:: 195 196 CPU0 work ^ 197 | ____ ____ ____ 198 | | | | | | | 199 +----+----+----+----+----+----+----+----+----+----+-> time 200 201 CPU1 work ^ 202 | ______________ ______________ ____ 203 | | | | | | 204 +----+----+----+----+----+----+----+----+----+----+-> time 205 206IOW, 207 208- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency 209- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency 210 211The task utilization signal can be made CPU invariant using the following 212formula:: 213 214 task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) 215 216with ``max_capacity`` being the highest CPU capacity value in the 217system. Applying this formula to the above example above yields a CPU 218invariant task utilization of 25%. 219 2202.4 Invariant task utilization 221------------------------------ 222 223Both frequency and CPU invariance need to be applied to task utilization in 224order to obtain a truly invariant signal. The pseudo-formula for a task 225utilization that is both CPU and frequency invariant is thus, for a given 226task p:: 227 228 curr_frequency(cpu) capacity(cpu) 229 task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- 230 max_frequency(cpu) max_capacity 231 232In other words, invariant task utilization describes the behaviour of a task as 233if it were running on the highest-capacity CPU in the system, running at its 234maximum frequency. 235 236Any mention of task utilization in the following sections will imply its 237invariant form. 238 2392.5 Utilization estimation 240-------------------------- 241 242Without a crystal ball, task behaviour (and thus task utilization) cannot 243accurately be predicted the moment a task first becomes runnable. The CFS class 244maintains a handful of CPU and task signals based on the Per-Entity Load 245Tracking (PELT) mechanism, one of those yielding an *average* utilization (as 246opposed to instantaneous). 247 248This means that while the capacity aware scheduling criteria will be written 249considering a "true" task utilization (using a crystal ball), the implementation 250will only ever be able to use an estimator thereof. 251 2523. Capacity aware scheduling requirements 253========================================= 254 2553.1 CPU capacity 256---------------- 257 258Linux cannot currently figure out CPU capacity on its own, this information thus 259needs to be handed to it. Architectures must define arch_scale_cpu_capacity() 260for that purpose. 261 262The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver 263CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see 264Documentation/devicetree/bindings/cpu/cpu-capacity.txt. 265 2663.2 Frequency invariance 267------------------------ 268 269As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task 270utilization. Architectures must define arch_scale_freq_capacity(cpu) for that 271purpose. 272 273Implementing this function requires figuring out at which frequency each CPU 274have been running at. One way to implement this is to leverage hardware counters 275whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86, 276AMU on arm64). Another is to directly hook into cpufreq frequency transitions, 277when the kernel is aware of the switched-to frequency (also employed by 278arm/arm64). 279 2804. Scheduler topology 281===================== 282 283During the construction of the sched domains, the scheduler will figure out 284whether the system exhibits asymmetric CPU capacities. Should that be the 285case: 286 287- The sched_asym_cpucapacity static key will be enabled. 288- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain 289 level that spans all unique CPU capacity values. 290- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans 291 CPUs with any range of asymmetry. 292 293The sched_asym_cpucapacity static key is intended to guard sections of code that 294cater to asymmetric CPU capacity systems. Do note however that said key is 295*system-wide*. Imagine the following setup using cpusets:: 296 297 capacity C/2 C 298 ________ ________ 299 / \ / \ 300 CPUs 0 1 2 3 4 5 6 7 301 \__/ \______________/ 302 cpusets cs0 cs1 303 304Which could be created via: 305 306.. code-block:: sh 307 308 mkdir /sys/fs/cgroup/cpuset/cs0 309 echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus 310 echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems 311 312 mkdir /sys/fs/cgroup/cpuset/cs1 313 echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus 314 echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems 315 316 echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance 317 318Since there *is* CPU capacity asymmetry in the system, the 319sched_asym_cpucapacity static key will be enabled. However, the sched_domain 320hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't 321set in that hierarchy, it describes an SMP island and should be treated as such. 322 323Therefore, the 'canonical' pattern for protecting codepaths that cater to 324asymmetric CPU capacities is to: 325 326- Check the sched_asym_cpucapacity static key 327- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in 328 the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific 329 CPU or group thereof) 330 3315. Capacity aware scheduling implementation 332=========================================== 333 3345.1 CFS 335------- 336 3375.1.1 Capacity fitness 338~~~~~~~~~~~~~~~~~~~~~~ 339 340The main capacity scheduling criterion of CFS is:: 341 342 task_util(p) < capacity(task_cpu(p)) 343 344This is commonly called the capacity fitness criterion, i.e. CFS must ensure a 345task "fits" on its CPU. If it is violated, the task will need to achieve more 346work than what its CPU can provide: it will be CPU-bound. 347 348Furthermore, uclamp lets userspace specify a minimum and a maximum utilization 349value for a task, either via sched_setattr() or via the cgroup interface (see 350Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to 351clamp task_util() in the previous criterion. 352 3535.1.2 Wakeup CPU selection 354~~~~~~~~~~~~~~~~~~~~~~~~~~ 355 356CFS task wakeup CPU selection follows the capacity fitness criterion described 357above. On top of that, uclamp is used to clamp the task utilization values, 358which lets userspace have more leverage over the CPU selection of CFS 359tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: 360 361 clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) 362 363By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run 364on any CPU by giving it a low uclamp.max value. Conversely, it can force a small 365periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by 366giving it a high uclamp.min value. 367 368.. note:: 369 370 Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling 371 (EAS), which is described in Documentation/scheduler/sched-energy.rst. 372 3735.1.3 Load balancing 374~~~~~~~~~~~~~~~~~~~~ 375 376A pathological case in the wakeup CPU selection occurs when a task rarely 377sleeps, if at all - it thus rarely wakes up, if at all. Consider:: 378 379 w == wakeup event 380 381 capacity(CPU0) = C 382 capacity(CPU1) = C / 3 383 384 workload on CPU0 385 CPU work ^ 386 | _________ _________ ____ 387 | | | | | | 388 +----+----+----+----+----+----+----+----+----+----+-> time 389 w w w 390 391 workload on CPU1 392 CPU work ^ 393 | ____________________________________________ 394 | | 395 +----+----+----+----+----+----+----+----+----+----+-> 396 w 397 398This workload should run on CPU0, but if the task either: 399 400- was improperly scheduled from the start (inaccurate initial 401 utilization estimation) 402- was properly scheduled from the start, but suddenly needs more 403 processing power 404 405then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; 406the CPU capacity scheduling criterion is violated, and there may not be any more 407wakeup event to fix this up via wakeup CPU selection. 408 409Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism 410put in place to handle this shares the same name. Misfit task migration 411leverages the CFS load balancer, more specifically the active load balance part 412(which caters to migrating currently running tasks). When load balance happens, 413a misfit active load balance will be triggered if a misfit task can be migrated 414to a CPU with more capacity than its current one. 415 4165.2 RT 417------ 418 4195.2.1 Wakeup CPU selection 420~~~~~~~~~~~~~~~~~~~~~~~~~~ 421 422RT task wakeup CPU selection searches for a CPU that satisfies:: 423 424 task_uclamp_min(p) <= capacity(task_cpu(cpu)) 425 426while still following the usual priority constraints. If none of the candidate 427CPUs can satisfy this capacity criterion, then strict priority based scheduling 428is followed and CPU capacities are ignored. 429 4305.3 DL 431------ 432 4335.3.1 Wakeup CPU selection 434~~~~~~~~~~~~~~~~~~~~~~~~~~ 435 436DL task wakeup CPU selection searches for a CPU that satisfies:: 437 438 task_bandwidth(p) < capacity(task_cpu(p)) 439 440while still respecting the usual bandwidth and deadline constraints. If 441none of the candidate CPUs can satisfy this capacity criterion, then the 442task will remain on its current CPU. 443