=========
Schedutil
=========

.. note::

   All this assumes a linear relation between frequency and work capacity;
   we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
===============================

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA); each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contributes the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track two key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU, 'running'
will decrease to reflect the fraction of time each task spends on the CPU,
while 'runnable' will increase to reflect the amount of contention.
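The decay and the composability property can be sketched numerically (an
illustration in plain floating point, not the kernel's fixed-point
implementation):

```python
# Per-period decay factor y, chosen such that y^32 = 0.5: 32 periods
# (~32ms at 1024us each) halve a sample's contribution.
y = 0.5 ** (1 / 32)

def ewma(samples):
    """EWMA over samples, most recent first: ewma_sum(u) / ewma_sum(1)."""
    num = sum(u * y**i for i, u in enumerate(samples))
    den = sum(y**i for i in range(len(samples)))
    return num / den

# Composability: both sums share the same geometric weights, so
# ewma(A) + ewma(B) == ewma(A+B) for equal-length histories, which is
# what lets the averages be recomposed when tasks migrate.
a = [1.0, 0.0, 1.0, 0.0]
b = [0.0, 0.5, 0.0, 0.5]
ab = [x + z for x, z in zip(a, b)]
assert abs(ewma(a) + ewma(b) - ewma(ab)) < 1e-12
```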
For more detail see: kernel/sched/pelt.c


Frequency / CPU Invariance
==========================

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as::

            f_cur
  r_dvfs := -----
            f_max

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use::

           APERF
  f_cur := ----- * P0
           MPERF

             4C-turbo;  if available and turbo enabled
  f_max := { 1C-turbo;  if turbo enabled
             P0;        otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.

r_cpu is determined as the ratio of the highest performance level of the
current CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.

For more detail see:

 - kernel/sched/pelt.h:update_rq_clock_pelt()
 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST
========

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.
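The size of this effect can be illustrated with a small numeric sketch (plain
floats instead of the kernel's fixed-point arithmetic; utilization is out of
1024 as in PELT):

```python
# A task's tracked utilization decays while it sleeps, with per-period
# decay factor y where y^32 = 0.5 (periods of 1024us, so ~1ms each).
y = 0.5 ** (1 / 32)

util = 512          # utilization (out of 1024) when the task last ran
sleep_periods = 32  # sleep for 32 periods, ~32ms

decayed = util * y ** sleep_periods
# After a 32ms sleep the tracked utilization has halved: on wakeup DVFS
# starts from ~256/1024 and must ramp back up to ~512/1024, even though
# the task will behave exactly as before.
assert round(decayed) == 256
```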
To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. The UTIL_EST filter instantly increases and only decays on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

  util_est := \Sum_t max( t_running, t_util_est_ewma )

For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
======

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
================

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above is the
frequency invariant utilization estimate of the CPU. From this we compute a
desired frequency like::

             max( running, util_est );  if UTIL_EST
  u_cfs := { running;                   otherwise

               clamp( u_cfs + u_rt, u_min, u_max );  if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;                         otherwise

  u := u_clamp + u_irq + u_dl;  [approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking.
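The desired-frequency computation above can be sketched as follows (an
illustration only: plain floats with utilizations as fractions of capacity
in [0, 1], and ``desired_freq`` is a made-up helper name, not a kernel
function):

```python
# Sketch of the schedutil frequency selection described above.
def desired_freq(running, util_est, u_rt, u_irq, u_dl,
                 u_min, u_max, f_max,
                 util_est_enabled=True, uclamp_enabled=True):
    # u_cfs := max(running, util_est) if UTIL_EST, else running
    u_cfs = max(running, util_est) if util_est_enabled else running
    # u_clamp := clamp(u_cfs + u_rt, u_min, u_max) if UCLAMP_TASK
    if uclamp_enabled:
        u_clamp = min(max(u_cfs + u_rt, u_min), u_max)
    else:
        u_clamp = u_cfs + u_rt
    u = u_clamp + u_irq + u_dl   # approximation; see the source for detail
    # 1.25x headroom, capped at the maximum frequency
    return min(f_max, 1.25 * u * f_max)

# A CPU at ~40% utilization with 1.25x headroom asks for ~50% of f_max:
f = desired_freq(running=0.4, util_est=0.4, u_rt=0.0, u_irq=0.0, u_dl=0.0,
                 u_min=0.0, u_max=1.0, f_max=2_000_000)  # f_max in kHz
assert abs(f - 1_000_000) < 1e-6
```

Note how the uclamp step caps the request regardless of actual utilization,
while a saturated CPU (u = 1) simply pins the request at f_max.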
Schedutil supports rate-limiting DVFS requests for when hardware interaction
is slow and expensive; this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
=====

 - In low-load scenarios, where DVFS is most relevant, the 'running' numbers
   will closely reflect utilization.

 - In saturated scenarios task movement will cause some transient dips:
   suppose we have a CPU saturated with 4 tasks, then when we migrate a task
   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
   new CPU will gain 0.25. This is inevitable and time progression will
   correct this. XXX do we still guarantee f_max due to no idle-time?

 - Much of the above is about avoiding DVFS dips, and independent DVFS domains
   having to re-learn / ramp-up when load shifts.
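The arithmetic in the second note, as a tiny sketch (illustrative only, not
kernel code):

```python
# A CPU saturated by 4 equal tasks has running = 1.0; migrating one task
# immediately moves that task's 0.25 contribution to the destination CPU,
# and only time progression rebuilds the old CPU's value.
tasks = [0.25, 0.25, 0.25, 0.25]   # per-task 'running' contributions

migrated = tasks.pop()             # move one task to an idle CPU
old_cpu = sum(tasks)               # drops to 0.75 despite no idle time yet
new_cpu = migrated                 # the idle CPU gains 0.25

assert old_cpu == 0.75 and new_cpu == 0.25
```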