.. SPDX-License-Identifier: GPL-2.0

====================
Utilization Clamping
====================

1. Introduction
===============

Utilization clamping, also known as util clamp or uclamp, is a scheduler
feature that allows user space to help in managing the performance requirement
of tasks. It was introduced in the v5.3 release. The cgroup support was merged
in v5.4.

Uclamp is a hinting mechanism that allows the scheduler to understand the
performance requirements and restrictions of the tasks, and thus helps the
scheduler make better decisions. When the schedutil cpufreq governor is used,
util clamp influences the CPU frequency selection as well.

Since the scheduler and schedutil are both driven by PELT (util_avg) signals,
util clamp acts on that to achieve its goal by clamping the signal to a certain
point; hence the name. That is, by clamping utilization we are making the
system run at a certain performance point.

The right way to view util clamp is as a mechanism to make a request or hint on
performance constraints. It consists of two tunables:

        * UCLAMP_MIN, which sets the lower bound.
        * UCLAMP_MAX, which sets the upper bound.

These two bounds will ensure a task will operate within this performance range
of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
capping a task.

One can tell the system (scheduler) that some tasks require a minimum
performance point to operate at to deliver the desired user experience. Or one
can tell the system that some tasks should be restricted from consuming too
many resources and should not go above a specific performance point. Viewing
the uclamp values as performance points rather than utilization is a better
abstraction from the user space point of view.

As an example, a game can use util clamp to form a feedback loop with its
perceived Frames Per Second (FPS). It can dynamically increase the minimum
performance point required by its display pipeline to ensure no frame is
dropped. It can also dynamically 'prime' up these tasks if it knows that a
computationally intensive scene is about to happen in the coming few hundred
milliseconds.

On mobile hardware where the capability of the devices varies a lot, this
dynamic feedback loop offers great flexibility to ensure the best user
experience given the capabilities of any system.

Of course a static configuration is possible too. The exact usage will depend
on the system, application and the desired outcome.

Another example is in Android where tasks are classified as background,
foreground, top-app, etc. Util clamp can be used to constrain how many
resources background tasks consume by capping the performance point they
can run at. This constraint helps reserve resources for important tasks, like
the ones belonging to the currently active app (top-app group). Besides, this
helps in limiting how much power they consume. This can be more obvious in
heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
background tasks to stay on the little cores which will ensure that:

        1. The big cores are free to run top-app tasks immediately. top-app
           tasks are the tasks the user is currently interacting with, hence
           the most important tasks in the system.
        2. Background tasks don't run on a power hungry core and drain battery
           even if they are CPU intensive tasks.

.. note::
  **little cores**:
    CPUs with capacity < 1024

  **big cores**:
    CPUs with capacity = 1024

By making these uclamp performance requests, or rather hints, user space can
ensure system resources are used optimally to deliver the best possible user
experience.

Another use case is to help with **overcoming the ramp up latency inherent in
how the scheduler utilization signal is calculated**.

For instance, a busy task that requires to run at the maximum performance
point will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) for the scheduler
to realize that. This is known to affect workloads like gaming on
mobile devices where frames will drop due to slow response time to select the
higher frequency required for the tasks to finish their work in time. Setting
UCLAMP_MIN=1024 will ensure such tasks will always see the highest performance
level when they start running.
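
To get a feel for this ramp-up delay, the sketch below models the utilization
of a task that starts at zero and then runs continuously. It assumes a
simplified geometric model in which the remaining gap to 1024 halves every
32ms. The helper name and the per-millisecond update step are illustrative
assumptions; the kernel's actual PELT code uses fixed-point arithmetic and is
not reproduced here.

```c
/* Per-millisecond decay factor y, chosen so that y^32 == 0.5 (32ms half-life). */
#define PELT_DECAY_PER_MS 0.9785720620877

/*
 * Approximate utilization (0..1024) of a task that starts at 0 and runs
 * continuously for @ms milliseconds. Hypothetical helper for illustration.
 */
static double pelt_util_after_ms(unsigned int ms)
{
    double gap = 1024.0;    /* distance left to full utilization */
    unsigned int i;

    for (i = 0; i < ms; i++)
        gap *= PELT_DECAY_PER_MS;

    return 1024.0 - gap;
}
```

Under this model the task reaches roughly 512 after 32ms and only approaches
the top of the range (about 1010) after 200ms, which is the delay that
UCLAMP_MIN=1024 avoids.
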

The overall visible effect goes beyond better perceived user
experience/performance and stretches to help achieve a better overall
performance/watt if used effectively.

User space can form a feedback loop with the thermal subsystem too to ensure
the device doesn't heat up to the point where it will throttle.

Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.

In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
performance point rather than being tied to MAX frequency all the time, which
can be useful on general purpose systems that run on battery powered devices.

Note that by design RT tasks don't have a per-task PELT signal and must always
run at a constant frequency to combat non-deterministic DVFS ramp-up delays.

Note that using schedutil always implies a single delay to modify the frequency
when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
helps picking what frequency to request instead of schedutil always requesting
MAX for all RT tasks.

See :ref:`section 3.4 <uclamp-default-values>` for default values and
:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change the RT tasks
default value.

2. Design
=========

Util clamp is a property of every task in the system. It sets the boundaries of
its utilization signal; acting as a bias mechanism that influences certain
decisions within the scheduler.

The actual utilization signal of a task is never clamped in reality. If you
inspect PELT signals at any point of time you should continue to see them
intact. Clamping happens only when needed, e.g. when a task wakes up
and the scheduler needs to select a suitable CPU for it to run on.

Since the goal of util clamp is to allow requesting a minimum and maximum
performance point for a task to run on, it must be able to influence the
frequency selection as well as task placement to be most effective. Both of
which have implications on the utilization value at CPU runqueue (rq for short)
level, which brings us to the main design challenge.

When a task wakes up on an rq, the utilization signal of the rq will be
affected by the uclamp settings of all the tasks enqueued on it. For example if
a task requests to run at UCLAMP_MIN = 512, then the util signal of the rq
needs to respect this request as well as all other requests from all of the
enqueued tasks.

To be able to aggregate the util clamp value of all the tasks attached to the
rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
scheduler hot path. Hence care must be taken since any slow down will have
significant impact on a lot of use cases and could hinder its usability in
practice.

The way this is handled is by dividing the utilization range into buckets
(struct uclamp_bucket) which allows us to reduce the search space from every
task on the rq to only a subset of tasks on the top-most bucket.

When a task is enqueued, the counter in the matching bucket is incremented,
and on dequeue it is decremented. This makes keeping track of the effective
uclamp value at rq level a lot easier.

As tasks are enqueued and dequeued, we keep track of the current effective
uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
how this works.

Later, any path that wants to identify the effective uclamp value of the rq
simply needs to read this effective uclamp value at the exact moment it needs
to take a decision.

For the task placement case, only Energy Aware and Capacity Aware Scheduling
(EAS/CAS) make use of uclamp for now, which implies that it is applied on
heterogeneous systems only.
When a task wakes up, the scheduler will look at the current effective uclamp
value of every rq and compare it with the potential new value if the task were
to be enqueued there, favoring the rq that will end up with the most energy
efficient combination.

Similarly in schedutil, when it needs to make a frequency update it will look
at the current effective uclamp value of the rq which is influenced by the set
of tasks currently enqueued there and select the appropriate frequency that
will satisfy constraints from requests.
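
As a rough illustration of the frequency selection case, the sketch below
clamps the rq utilization with the rq's effective uclamp values and maps the
result to a frequency. The linear mapping and both function names are
assumptions made for this example; a real governor may apply headroom and
driver specific adjustments that are ignored here. It also assumes
rq_min <= rq_max.

```c
#define SCHED_CAPACITY_SCALE 1024U

/* Clamp the rq utilization with the rq's effective uclamp values. */
static unsigned int uclamp_rq_util(unsigned int util,
                                   unsigned int rq_min, unsigned int rq_max)
{
    if (util < rq_min)
        return rq_min;
    if (util > rq_max)
        return rq_max;
    return util;
}

/*
 * Hypothetical linear mapping from clamped utilization to a frequency in
 * kHz. Headroom and hardware quirks are deliberately left out.
 */
static unsigned int pick_freq_khz(unsigned int util, unsigned int rq_min,
                                  unsigned int rq_max,
                                  unsigned int max_freq_khz)
{
    unsigned long long clamped = uclamp_rq_util(util, rq_min, rq_max);

    return (unsigned int)(clamped * max_freq_khz / SCHED_CAPACITY_SCALE);
}
```

With a maximum frequency of 2000000 kHz, an rq whose utilization is only 100
but whose effective UCLAMP_MIN is 512 maps to 1000000 kHz in this model, i.e.
the boost request rather than the raw utilization drives the selection.
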

Other paths like setting overutilization state (which effectively disables EAS)
make use of uclamp as well. Such cases are considered necessary housekeeping to
allow the 2 main use cases above and will not be covered in detail here as they
could change with implementation details.

.. _uclamp-buckets:

2.1. Buckets
------------

::

                           [struct rq]

  (bottom)                                                    (top)

    0                                                          1024
    |                                                           |
    +-----------+-----------+-----------+----   ----+-----------+
    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
    +-----------+-----------+-----------+----   ----+-----------+
       :           :                                   :
       +- p0       +- p3                               +- p4
       :                                               :
       +- p1                                           +- p5
       :
       +- p2


.. note::
  The diagram above is an illustration rather than a true depiction of the
  internal data structure.

To reduce the search space when trying to decide the effective uclamp value of
an rq as tasks are enqueued/dequeued, the whole utilization range is divided
into N buckets where N is configured at compile time by setting
CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.

The rq has a set of buckets for each uclamp_id tunable: [UCLAMP_MIN,
UCLAMP_MAX].

The range of each bucket is 1024/N. For example, for the default value of
5 there will be 5 buckets, each of which will cover the following range:

::

        DELTA = round_closest(1024/5) = round_closest(204.8) = 205

        Bucket 0: [0:204]
        Bucket 1: [205:409]
        Bucket 2: [410:614]
        Bucket 3: [615:819]
        Bucket 4: [820:1024]

When a task p with the following tunable parameters

::

        p->uclamp[UCLAMP_MIN] = 300
        p->uclamp[UCLAMP_MAX] = 1024

is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
4 will be incremented for UCLAMP_MAX to reflect the fact that the rq has a task
in this range.
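
The mapping from a clamp value to its bucket can be sketched as follows. This
is a standalone illustration that follows the DELTA formula above; the macro
and function names are chosen for this example and the code is not a copy of
the kernel implementation.

```c
#define SCHED_CAPACITY_SCALE 1024U
#define UCLAMP_BUCKETS       5U   /* CONFIG_UCLAMP_BUCKETS_COUNT default */

/* Bucket width: 1024 / N rounded to the closest integer (205 for N = 5). */
#define UCLAMP_BUCKET_DELTA \
    ((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

/* Map a clamp value (0..1024) to its bucket index (0..N-1). */
static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
    unsigned int id = clamp_value / UCLAMP_BUCKET_DELTA;

    /* 1024 / 205 is 4 here; cap anyway so other N values stay in range. */
    return id < UCLAMP_BUCKETS ? id : UCLAMP_BUCKETS - 1;
}
```

For the task above, a UCLAMP_MIN of 300 lands in bucket 1 and a UCLAMP_MAX of
1024 lands in bucket 4, matching the ranges listed earlier.
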

The rq then keeps track of its current effective uclamp value for each
uclamp_id.

When a task p is enqueued, the rq value changes to:

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
        // repeat for UCLAMP_MAX

Similarly, when p is dequeued the rq value changes to:

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
        // repeat for UCLAMP_MAX

When all buckets are empty, the rq uclamp values are reset to system defaults.
See :ref:`section 3.4 <uclamp-default-values>` for details on default values.
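
The enqueue/dequeue bookkeeping for a single uclamp_id can be sketched as
below. This is a simplified model written for this document: the struct
layout, the nr_tasks counter and all function names are assumptions, and
details such as locking are left out. Callers are expected to dequeue only
values they previously enqueued.

```c
#define UCLAMP_BUCKETS 5
#define BUCKET_DELTA   205U   /* round_closest(1024 / 5) */

struct bucket {
    unsigned int tasks;   /* number of enqueued tasks in this bucket */
    unsigned int value;   /* highest clamp value seen in this bucket */
};

struct rq_clamp {
    struct bucket bucket[UCLAMP_BUCKETS];
    unsigned int nr_tasks;        /* tasks across all buckets */
    unsigned int value;           /* current effective value of the rq */
    unsigned int default_value;   /* used when every bucket is empty */
};

static unsigned int bucket_id(unsigned int v)
{
    unsigned int id = v / BUCKET_DELTA;

    return id < UCLAMP_BUCKETS ? id : UCLAMP_BUCKETS - 1;
}

/* Enqueue: bump the bucket refcount and max-aggregate into the rq value. */
static void rq_clamp_enqueue(struct rq_clamp *rq, unsigned int v)
{
    struct bucket *b = &rq->bucket[bucket_id(v)];

    if (b->tasks == 0 || v > b->value)
        b->value = v;
    b->tasks++;

    if (rq->nr_tasks == 0 || v > rq->value)
        rq->value = v;
    rq->nr_tasks++;
}

/* Dequeue: drop the refcount, then search from the top-most bucket down. */
static void rq_clamp_dequeue(struct rq_clamp *rq, unsigned int v)
{
    int i;

    rq->bucket[bucket_id(v)].tasks--;
    rq->nr_tasks--;

    for (i = UCLAMP_BUCKETS - 1; i >= 0; i--) {
        if (rq->bucket[i].tasks) {
            rq->value = rq->bucket[i].value;
            return;
        }
    }
    rq->value = rq->default_value;   /* all buckets are empty */
}
```

Note that the search on dequeue touches at most N buckets instead of every
task on the rq, which is the point of the bucket design. The price in this
sketch is precision: a bucket keeps the highest value it has seen while it is
non-empty, so the rq value is accurate only to within one bucket.
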


2.2. Max aggregation
--------------------

Util clamp is tuned to honour the request for the task that requires the
highest performance point.

When multiple tasks are attached to the same rq, then util clamp must make sure
the task that needs the highest performance point gets it even if there's
another task that doesn't need it or is disallowed from reaching this point.

For example, if there are multiple tasks attached to an rq with the following
values:

::

        p0->uclamp[UCLAMP_MIN] = 300
        p0->uclamp[UCLAMP_MAX] = 900

        p1->uclamp[UCLAMP_MIN] = 500
        p1->uclamp[UCLAMP_MAX] = 500

then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
and UCLAMP_MAX become:

::

        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
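
The same aggregation over an arbitrary set of tasks can be written as a short
loop. The struct and function names below are assumptions made for this
example, and an empty task list simply yields zeros here, whereas a real rq
falls back to the system defaults.

```c
#include <stddef.h>

struct task_clamp {
    unsigned int min;   /* UCLAMP_MIN request */
    unsigned int max;   /* UCLAMP_MAX request */
};

/* Max-aggregate both clamps over the tasks enqueued on one rq. */
static struct task_clamp rq_max_aggregate(const struct task_clamp *tasks,
                                          size_t nr)
{
    struct task_clamp rq = { 0, 0 };
    size_t i;

    for (i = 0; i < nr; i++) {
        if (tasks[i].min > rq.min)
            rq.min = tasks[i].min;
        if (tasks[i].max > rq.max)
            rq.max = tasks[i].max;
    }
    return rq;
}
```

Feeding p0 and p1 from the example into this function gives 500 for
UCLAMP_MIN and 900 for UCLAMP_MAX, so p1's cap of 500 is not honoured while
p0 shares the rq.
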

As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
aggregation is the cause of one of the limitations when using util clamp, in
particular for the UCLAMP_MAX hint when user space would like to save power.

2.3. Hierarchical aggregation
-----------------------------

As stated earlier, util clamp is a property of every task in the system. But
the actual applied (effective) value can be influenced by more than just the
request made by the task or another actor on its behalf (middleware library).

The effective util clamp value of any task is restricted as follows:

  1. By the uclamp settings defined by the cgroup CPU controller it is attached
     to, if any.
  2. The restricted value in (1) is then further restricted by the system wide
     uclamp settings.

:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand
further on that.

For now suffice to say that if a task makes a request, its actual effective
value will have to adhere to some restrictions imposed by cgroup and system
wide settings.

The system will still accept the request even if it is effectively beyond the
constraints, but as soon as the task moves to a different cgroup or a sysadmin
modifies the system settings, the request will be satisfied only if it is
within the new constraints.

In other words, this aggregation will not cause an error when a task changes
its uclamp values, but rather the system may not be able to satisfy requests
based on those factors.
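
The two restriction steps can be sketched as a pure function of the stored
request. The sketch uses the cgroup rules from section 3.2 and the system wide
knobs from section 3.3 as this document describes them; the type and function
names are assumptions, and the kernel's own implementation may differ in
detail.

```c
#include <stddef.h>

struct uclamp_pair {
    unsigned int min;
    unsigned int max;
};

static unsigned int umin(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

static unsigned int umax(unsigned int a, unsigned int b)
{
    return a > b ? a : b;
}

/*
 * Compute the effective clamps. @req is taken by value and never modified,
 * so the original request applies again once restrictions are relaxed.
 * @cgroup may be NULL when the task is not attached to a CPU controller.
 * @sys holds sched_util_clamp_min and sched_util_clamp_max.
 */
static struct uclamp_pair uclamp_effective(struct uclamp_pair req,
                                           const struct uclamp_pair *cgroup,
                                           struct uclamp_pair sys)
{
    struct uclamp_pair eff = req;

    /* (1) cgroup: uclamp.min is a protection, uclamp.max is a limit. */
    if (cgroup) {
        eff.min = umax(eff.min, cgroup->min);
        eff.max = umin(eff.max, cgroup->max);
    }

    /* (2) system wide knobs cap whatever step (1) produced. */
    eff.min = umin(eff.min, sys.min);
    eff.max = umin(eff.max, sys.max);
    return eff;
}
```

For a request of UCLAMP_MIN = 800 with sched_util_clamp_min set to 512, this
yields an effective minimum of 512. Raising the knob back to 1024 yields 800
again without the task having to repeat its request.
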

2.4. Range
----------

A uclamp performance request has a range of 0 to 1024 inclusive.

For the cgroup interface a percentage is used (that is 0 to 100 inclusive).
Just like other cgroup interfaces, you can use 'max' instead of 100.

.. _uclamp-interfaces:

3. Interfaces
=============

3.1. Per task interface
-----------------------

The sched_setattr() syscall was extended to accept two new fields:

* sched_util_min: requests the minimum performance point the system should run
  at when this task is running, i.e. the lower performance bound.
* sched_util_max: requests the maximum performance point the system should run
  at when this task is running, i.e. the upper performance bound.

For example, the following scenario has 40% to 80% utilization constraints:

::

        attr->sched_util_min = 40% * 1024;
        attr->sched_util_max = 80% * 1024;

When task @p is running, **the scheduler should try its best to ensure it
starts at 40% performance level**. If the task runs for a long enough time so
that its actual utilization goes above 80%, the utilization, or performance
level, will be capped.
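
A minimal user space sketch of such a request is shown below. It invokes the
raw syscall and uses a local copy of the attribute layout under a different
name (struct uclamp_sched_attr) so that it does not depend on which libc or
kernel header happens to define struct sched_attr. The flag values are
believed to match the SCHED_FLAG_KEEP_POLICY, SCHED_FLAG_KEEP_PARAMS,
SCHED_FLAG_UTIL_CLAMP_MIN and SCHED_FLAG_UTIL_CLAMP_MAX definitions in the
kernel UAPI header linux/sched.h; please verify them against your headers. The
helper names and the round-down percentage conversion are assumptions made for
this example.

```c
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define UCLAMP_FLAG_KEEP_ALL (0x08 | 0x10) /* KEEP_POLICY | KEEP_PARAMS */
#define UCLAMP_FLAG_MIN      0x20          /* SCHED_FLAG_UTIL_CLAMP_MIN */
#define UCLAMP_FLAG_MAX      0x40          /* SCHED_FLAG_UTIL_CLAMP_MAX */

/* Local copy of the sched_attr layout, renamed to avoid header clashes. */
struct uclamp_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

/* Convert a percentage (0..100) to the 0..1024 range, rounding down. */
static uint32_t pct_to_util(unsigned int pct)
{
    return pct * 1024 / 100;
}

/* Request [min_pct, max_pct] for @pid (0 means the calling thread). */
static int set_uclamp_pct(pid_t pid, unsigned int min_pct,
                          unsigned int max_pct)
{
    struct uclamp_sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    /* Keep policy and parameters unchanged; only update the clamps. */
    attr.sched_flags = UCLAMP_FLAG_KEEP_ALL | UCLAMP_FLAG_MIN |
                       UCLAMP_FLAG_MAX;
    attr.sched_util_min = pct_to_util(min_pct);
    attr.sched_util_max = pct_to_util(max_pct);

    /* Returns 0 on success, -1 with errno set on failure. */
    return (int)syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

Calling set_uclamp_pct(0, 40, 80) would request the 40% to 80% range from the
example for the calling thread. The call can fail, for instance when the
kernel was built without uclamp support, so the return value should be
checked.
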

The special value -1 is used to reset the uclamp settings to the system
default.

Note that resetting the uclamp value to system default using -1 is not the same
as manually setting uclamp value to system default. This distinction is
important because as we shall see in system interfaces, the default value for
RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the
future.

3.2. cgroup interface
---------------------

There are two uclamp related values in the CPU cgroup controller:

* cpu.uclamp.min
* cpu.uclamp.max

When a task is attached to a CPU controller, its uclamp values will be impacted
as follows:

* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
  v2 documentation <cgroupv2-protections-distributor>`.

  If a task's uclamp_min value is lower than cpu.uclamp.min, then the task will
  inherit the cgroup cpu.uclamp.min value.

  In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
  parent).

* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
  documentation <cgroupv2-limits-distributor>`.

  If a task's uclamp_max value is higher than cpu.uclamp.max, then the task
  will inherit the cgroup cpu.uclamp.max value.

  In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
  parent).

For example, given the following parameters:

::

        p0->uclamp[UCLAMP_MIN] = // system default;
        p0->uclamp[UCLAMP_MAX] = // system default;

        p1->uclamp[UCLAMP_MIN] = 40% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024;

        cgroup0->cpu.uclamp.min = 20% * 1024;
        cgroup0->cpu.uclamp.max = 60% * 1024;

        cgroup1->cpu.uclamp.min = 60% * 1024;
        cgroup1->cpu.uclamp.max = 100% * 1024;

when p0 and p1 are attached to cgroup0, the values become:

::

        p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
        p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;

        p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact

when p0 and p1 are attached to cgroup1, these instead become:

::

        p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
        p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;

        p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact

Note that the cgroup interface allows the cpu.uclamp.max value to be lower than
cpu.uclamp.min. Other interfaces don't allow that.
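
The attach rules above can be checked with a few lines of code. The sketch
works in percent for readability and implements the two rules exactly as this
section states them; the struct and function names are assumptions, and p0 is
given the SCHED_NORMAL defaults of 0% and 100% from section 3.4.

```c
struct clamp_pct {
    unsigned int min;   /* percent, 0..100 */
    unsigned int max;   /* percent, 0..100 */
};

/*
 * Apply the rules from this section: cpu.uclamp.min acts as a floor
 * (protection) and cpu.uclamp.max acts as a cap (limit).
 */
static struct clamp_pct cgroup_attach(struct clamp_pct task,
                                      struct clamp_pct cg)
{
    struct clamp_pct eff = task;

    if (eff.min < cg.min)
        eff.min = cg.min;
    if (eff.max > cg.max)
        eff.max = cg.max;
    return eff;
}
```

Attaching p1 (40%, 50%) to cgroup1 (60%, 100%) gives a minimum of 60% and a
maximum of 50% under these rules, which reproduces the last case of the
example, including an effective minimum that sits above the maximum.
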
438*acbee592SQais Yousef
439*acbee592SQais Yousef3.3. System interface
440*acbee592SQais Yousef---------------------
441*acbee592SQais Yousef
442*acbee592SQais Yousef3.3.1 sched_util_clamp_min
443*acbee592SQais Yousef--------------------------
444*acbee592SQais Yousef
445*acbee592SQais YousefSystem wide limit of allowed UCLAMP_MIN range. By default it is set to 1024,
446*acbee592SQais Yousefwhich means that permitted effective UCLAMP_MIN range for tasks is [0:1024].
447*acbee592SQais YousefBy changing it to 512 for example the range reduces to [0:512]. This is useful
448*acbee592SQais Yousefto restrict how much boosting tasks are allowed to acquire.
449*acbee592SQais Yousef
450*acbee592SQais YousefRequests from tasks to go above this knob value will still succeed, but
451*acbee592SQais Yousefthey won't be satisfied until it is more than p->uclamp[UCLAMP_MIN].
452*acbee592SQais Yousef
453*acbee592SQais YousefThe value must be smaller than or equal to sched_util_clamp_max.
454*acbee592SQais Yousef
455*acbee592SQais Yousef3.3.2 sched_util_clamp_max
456*acbee592SQais Yousef--------------------------
457*acbee592SQais Yousef
458*acbee592SQais YousefSystem wide limit of allowed UCLAMP_MAX range. By default it is set to 1024,
459*acbee592SQais Yousefwhich means that permitted effective UCLAMP_MAX range for tasks is [0:1024].
460*acbee592SQais Yousef
461*acbee592SQais YousefBy changing it to 512 for example the effective allowed range reduces to
462*acbee592SQais Yousef[0:512]. This means is that no task can run above 512, which implies that all
463*acbee592SQais Yousefrqs are restricted too. IOW, the whole system is capped to half its performance
464*acbee592SQais Yousefcapacity.
465*acbee592SQais Yousef
466*acbee592SQais YousefThis is useful to restrict the overall maximum performance point of the system.
467*acbee592SQais YousefFor example, it can be handy to limit performance when running low on battery
468*acbee592SQais Yousefor when the system wants to limit access to more energy hungry performance
469*acbee592SQais Youseflevels when it's in idle state or screen is off.
470*acbee592SQais Yousef
471*acbee592SQais YousefRequests from tasks to go above this knob value will still succeed, but they
472*acbee592SQais Yousefwon't be satisfied until it is more than p->uclamp[UCLAMP_MAX].
473*acbee592SQais Yousef
474*acbee592SQais YousefThe value must be greater than or equal to sched_util_clamp_min.
475*acbee592SQais Yousef
476*acbee592SQais Yousef.. _uclamp-default-values:
477*acbee592SQais Yousef
478*acbee592SQais Yousef3.4. Default values
479*acbee592SQais Yousef-------------------
480*acbee592SQais Yousef
481*acbee592SQais YousefBy default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
482*acbee592SQais Yousef
483*acbee592SQais Yousef::
484*acbee592SQais Yousef
485*acbee592SQais Yousef        p_fair->uclamp[UCLAMP_MIN] = 0
486*acbee592SQais Yousef        p_fair->uclamp[UCLAMP_MAX] = 1024
487*acbee592SQais Yousef
488*acbee592SQais YousefThat is, by default they're boosted to run at the maximum performance point of
489*acbee592SQais Yousefchanged at boot or runtime. No argument was made yet as to why we should
490*acbee592SQais Yousefprovide this, but can be added in the future.
491*acbee592SQais Yousef
492*acbee592SQais YousefFor SCHED_FIFO/SCHED_RR tasks:
493*acbee592SQais Yousef
494*acbee592SQais Yousef::
495*acbee592SQais Yousef
496*acbee592SQais Yousef        p_rt->uclamp[UCLAMP_MIN] = 1024
497*acbee592SQais Yousef        p_rt->uclamp[UCLAMP_MAX] = 1024
498*acbee592SQais Yousef
499*acbee592SQais YousefThat is by default they're boosted to run at the maximum performance point of
500*acbee592SQais Yousefthe system which retains the historical behavior of the RT tasks.
501*acbee592SQais Yousef
502*acbee592SQais YousefRT tasks default uclamp_min value can be modified at boot or runtime via
503*acbee592SQais Yousefsysctl. See below section.
504*acbee592SQais Yousef
505*acbee592SQais Yousef.. _sched-util-clamp-min-rt-default:
506*acbee592SQais Yousef
507*acbee592SQais Yousef3.4.1 sched_util_clamp_min_rt_default
508*acbee592SQais Yousef-------------------------------------

Running RT tasks at the maximum performance point is expensive on
battery-powered devices and is not necessary. This sysctl knob lets system
developers offer good performance guarantees for these tasks by tuning the
boost value to the system's requirements, without burning power by running at
the maximum performance point all the time.

Application developers are encouraged to use the per-task util clamp interface
to ensure their applications are performance and power aware. Ideally, system
designers should set this knob to 0 and leave the task of managing performance
requirements to the apps.
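
As a minimal sketch, the current value of this knob could be inspected from
user space by reading its procfs file. The path below assumes the sysctl is
exposed under /proc/sys/kernel/, the helper name read_clamp_sysctl is
hypothetical, and the range check only mirrors the 0 to 1024 scale used in this
document.

```python
from pathlib import Path

# Assumed procfs location of the knob; adjust if your system differs.
RT_DEFAULT_KNOB = Path("/proc/sys/kernel/sched_util_clamp_min_rt_default")
SCHED_CAPACITY_SCALE = 1024  # top of the uclamp range used in this document


def read_clamp_sysctl(path=RT_DEFAULT_KNOB):
    """Return the clamp value stored in a sysctl file as an int."""
    value = int(Path(path).read_text().strip())
    if not 0 <= value <= SCHED_CAPACITY_SCALE:
        raise ValueError(f"unexpected clamp value: {value}")
    return value


if __name__ == "__main__":
    if RT_DEFAULT_KNOB.exists():
        print("RT default UCLAMP_MIN:", read_clamp_sysctl())
    else:
        print("knob not found; the kernel may lack uclamp support")
```

The same helper can be pointed at another clamp sysctl file by passing a
different path.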

4. How to use util clamp
========================

Util clamp promotes the concept of user space assisted power and performance
management. At the scheduler level, the information required to make the best
decision is not available. However, with util clamp user space can hint to the
scheduler to make better decisions about task placement and frequency
selection.

Best results are achieved by not making any assumptions about the system the
application is running on, and by using util clamp in conjunction with a
feedback loop to dynamically monitor and adjust. Ultimately this will allow for
a better user experience at a better perf/watt.

For some systems and use cases, a static setup will help to achieve good
results. Portability will be a problem in this case. How much work one can do
at 100, 200 or 1024 is different for each system. Unless there's a specific
target system, static setup should be avoided.

There are enough possibilities to create a whole framework based on util clamp,
or a self-contained app that makes use of it directly.

4.1. Boost important and DVFS-latency-sensitive tasks
-----------------------------------------------------

A GUI task might not be busy enough to warrant driving the frequency high when
it wakes up. However, it needs to finish its work within a specific time window
to deliver the desired user experience. The right frequency it requires at
wakeup will be system dependent. On some underpowered systems it will be high;
on other, overpowered ones it will be low or 0.

This task can increase its UCLAMP_MIN value every time it misses the deadline
to ensure that on the next wakeup it runs at a higher performance point. It
should try to approach the lowest UCLAMP_MIN value that allows it to meet its
deadline on any particular system, to achieve the best possible perf/watt for
that system.
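
The feedback loop described above could be sketched as follows. This is only an
illustration under stated assumptions: the class name UclampMinController, the
step sizes and the frame times are all hypothetical, and actually applying the
returned value to the task is left out.

```python
SCHED_CAPACITY_SCALE = 1024


class UclampMinController:
    """Raise UCLAMP_MIN on a missed deadline, probe lower otherwise."""

    def __init__(self, step_up=64, step_down=8):
        self.step_up = step_up      # hypothetical tuning constants
        self.step_down = step_down
        self.uclamp_min = 0

    def update(self, runtime_ms, deadline_ms):
        """Return the UCLAMP_MIN to request for the next wakeup."""
        if runtime_ms > deadline_ms:
            # Missed the deadline: ask for a higher performance point.
            self.uclamp_min = min(SCHED_CAPACITY_SCALE,
                                  self.uclamp_min + self.step_up)
        else:
            # Met the deadline: slowly probe for a lower, cheaper value.
            self.uclamp_min = max(0, self.uclamp_min - self.step_down)
        return self.uclamp_min


ctrl = UclampMinController()
for runtime_ms in (20.0, 18.0, 15.0, 15.5):   # made-up frame times
    print(ctrl.update(runtime_ms, deadline_ms=16.0))
```

Stepping up quickly and down slowly is one possible policy; it favours meeting
the deadline over saving power while the controller converges.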

On heterogeneous systems, it might be important for this task to run on
a faster CPU.

**Generally it is advised to perceive the input as a performance level or point
which will imply both task placement and frequency selection**.

4.2. Cap background tasks
-------------------------

As explained for the Android case in the introduction, any app can lower
UCLAMP_MAX for background tasks that don't care about performance but could
end up being busy and consuming system resources unnecessarily.
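
A minimal sketch of how an app might cap one of its own background threads
through the sched_setattr() syscall, assuming a Linux system with glibc. The
helper names (SchedAttr, make_cap_attr, cap_task) are hypothetical, and the
syscall numbers are assumptions that cover only x86_64 and arm64.

```python
import ctypes
import os
import platform

# Flag values as defined in include/uapi/linux/sched.h.
SCHED_FLAG_KEEP_POLICY = 0x08
SCHED_FLAG_KEEP_PARAMS = 0x10
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40

# Assumed syscall numbers for sched_setattr; other architectures differ.
NR_SCHED_SETATTR = {"x86_64": 314, "aarch64": 274}


class SchedAttr(ctypes.Structure):
    """Mirror of struct sched_attr including the util clamp fields."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
        ("sched_util_min", ctypes.c_uint32),
        ("sched_util_max", ctypes.c_uint32),
    ]


def make_cap_attr(uclamp_max):
    """Build an attr that changes only UCLAMP_MAX, keeping policy/params."""
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_flags = (SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
                        SCHED_FLAG_UTIL_CLAMP_MAX)
    attr.sched_util_max = uclamp_max
    return attr


def cap_task(pid, uclamp_max):
    """Request a UCLAMP_MAX cap for pid (0 means the calling thread)."""
    nr = NR_SCHED_SETATTR[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    attr = make_cap_attr(uclamp_max)
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_long(pid),
                       ctypes.byref(attr), ctypes.c_long(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

With these assumptions, cap_task(0, 200) would ask the kernel to cap the
calling thread at 200. The call may fail, for example on kernels built without
uclamp support, in which case OSError is raised.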

4.3. Powersave mode
-------------------

The sched_util_clamp_max system wide interface can be used to limit all tasks
from operating at the higher performance points, which are usually energy
inefficient.

This is not unique to uclamp, as one can achieve the same by reducing the max
frequency of the cpufreq governor. It can be considered a more convenient
alternative interface.

4.4. Per-app performance restriction
------------------------------------

Middleware or a utility can give the user an option to set UCLAMP_MIN/MAX for
an app every time it is executed, to guarantee a minimum performance point
and/or limit it from draining system power, at the cost of reduced performance
for that app.

If you want to prevent your laptop from heating up while compiling the kernel
on the go, and are happy to sacrifice performance to save power, but would
still like to keep your browser performance intact, uclamp makes it possible.

5. Limitations
==============

.. _uclamp-capping-fail:

5.1. Capping frequency with uclamp_max fails under certain conditions
---------------------------------------------------------------------

If task p0 is capped to run at 512:

::

        p0->uclamp[UCLAMP_MAX] = 512

and it shares the rq with p1 which is free to run at any performance point:

::

        p1->uclamp[UCLAMP_MAX] = 1024

then due to max aggregation the rq will be allowed to reach max performance
point:

::

        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024

Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
the rq will depend on the actual utilization value of the tasks.

If p1 is a small task but p0 is a CPU intensive task, then because both are
running on the same rq, p1 will cause the frequency capping to be lifted from
the rq, although p1, which is allowed to run at any performance point, doesn't
actually need to run at that frequency.
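
The effect of max aggregation can be illustrated with a small model. This is a
simplified sketch, not the kernel's implementation: the function names are
hypothetical, the util_avg values of 800 and 50 are made up, and UCLAMP_MIN is
assumed to be 0 for both tasks.

```python
SCHED_CAPACITY_SCALE = 1024


def rq_uclamp_max(task_caps):
    """Max aggregation: the rq takes the largest UCLAMP_MAX of its tasks."""
    return max(task_caps, default=SCHED_CAPACITY_SCALE)


def rq_effective_util(tasks):
    """tasks: list of (util_avg, uclamp_max) for runnable tasks on one rq."""
    total = min(sum(util for util, _ in tasks), SCHED_CAPACITY_SCALE)
    cap = rq_uclamp_max([cap for _, cap in tasks])
    return min(total, cap)


p0 = (800, 512)    # CPU intensive but capped; util value is made up
p1 = (50, 1024)    # small and uncapped; util value is made up

print(rq_effective_util([p0]))      # p0 alone: capped at 512
print(rq_effective_util([p0, p1]))  # with p1: cap lifted, 850
```

In this model, p1 contributes only 50 to the utilization, yet its presence
raises the value used for frequency selection from 512 to 850.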

5.2. UCLAMP_MAX can break PELT (util_avg) signal
------------------------------------------------

PELT assumes that frequency will always increase as the signals grow to ensure
there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
increase will be prevented, which can lead to no idle time in some
circumstances. When there's no idle time, a task will be stuck in a busy loop,
which would result in util_avg being 1024.

Combined with the issue described in section 5.1, this can lead to unwanted
frequency spikes when severely capped tasks share the rq with a small
non-capped task.

As an example, if task p0, which has:

::

        p0->util_avg = 300
        p0->uclamp[UCLAMP_MAX] = 0

wakes up on an idle CPU, then it will run at the min frequency (Fmin) this
CPU is capable of. The max CPU frequency (Fmax) matters here as well,
since it designates the shortest computational time to finish the task's
work on this CPU.

::

        rq->uclamp[UCLAMP_MAX] = 0

If the ratio of Fmax/Fmin is 3, then the maximum value will be:

::

        300 * (Fmax/Fmin) = 900

which indicates the CPU will still see idle time since 900 is < 1024. The
*actual* util_avg will not be 900 though, but somewhere between 300 and 900. As
long as there's idle time, p0->util_avg updates will be off by some margin,
but not proportional to Fmax/Fmin.

::

        p0->util_avg = 300 + small_error

Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:

::

        300 * (Fmax/Fmin) = 1200

which is higher than 1024 and indicates that the CPU has no idle time. When
this happens, the *actual* util_avg will become:

::

        p0->util_avg = 1024

If task p1, which has:

::

        p1->util_avg = 200
        p1->uclamp[UCLAMP_MAX] = 1024

wakes up on this CPU, then the effective UCLAMP_MAX for the CPU will be 1024
according to the max aggregation rule. But since the capped p0 task was running
and was throttled severely, the rq->util_avg will be:

::

        p0->util_avg = 1024
        p1->util_avg = 200

        rq->util_avg = 1024
        rq->uclamp[UCLAMP_MAX] = 1024

This leads to a frequency spike, since if p0 wasn't throttled we should get:

::

        p0->util_avg = 300
        p1->util_avg = 200

        rq->util_avg = 500

and run somewhere near the mid performance point of that CPU, not the Fmax we
get.
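
The arithmetic of this example can be reproduced with a short sketch. The
function names are hypothetical, and the saturating sum is a simplification
chosen to match the numbers in the text; it is not how PELT computes util_avg.

```python
SCHED_CAPACITY_SCALE = 1024


def util_bound_at_fmin(util, fmax_over_fmin):
    """Upper bound on util_avg when the task is held at Fmin.

    Returns (bound, has_idle_time). The bound saturates at 1024.
    """
    demand = util * fmax_over_fmin
    has_idle_time = demand < SCHED_CAPACITY_SCALE
    return min(demand, SCHED_CAPACITY_SCALE), has_idle_time


def rq_util(task_utils):
    """Sum of task utilizations, saturating at the scale maximum."""
    return min(sum(task_utils), SCHED_CAPACITY_SCALE)


print(util_bound_at_fmin(300, 3))   # (900, True): idle time remains
print(util_bound_at_fmin(300, 4))   # (1024, False): no idle time
print(rq_util([1024, 200]))         # 1024: the frequency spike case
print(rq_util([300, 200]))          # 500: the unthrottled case
```

The jump from 500 to 1024 in the last two lines is the spike: once p0's signal
has saturated, the rq looks fully busy as soon as the cap is lifted.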

5.3. Schedutil response time issues
-----------------------------------

schedutil has three limitations:

        1. Hardware takes non-zero time to respond to any frequency change
           request. On some platforms this can be on the order of a few ms.
        2. Non fast-switch systems require a worker deadline thread to wake up
           and perform the frequency change, which adds measurable overhead.
        3. schedutil rate_limit_us drops any requests during this rate_limit_us
           window.

If a relatively small task is doing a critical job and requires a certain
performance point when it wakes up and starts running, then all these
limitations will prevent it from getting what it wants in the time scale it
expects.
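
The third limitation can be illustrated with a simplified model of a rate
limit window. The function name, the request timestamps and the 2000 us limit
are made up for illustration; the real governor logic is more involved.

```python
def applied_requests(request_times_us, rate_limit_us):
    """Return the request times that are not dropped by the rate limit.

    A request is applied only if at least rate_limit_us has elapsed since
    the previously applied one; anything inside the window is dropped.
    """
    applied = []
    last = None
    for t in request_times_us:
        if last is None or t - last >= rate_limit_us:
            applied.append(t)
            last = t
    return applied


# Made-up requests at 0, 300, 900, 2100 and 2500 us with a 2000 us limit.
print(applied_requests([0, 300, 900, 2100, 2500], rate_limit_us=2000))
# [0, 2100]
```

In this model, a task that wakes up at 300 us and needs a higher performance
point has its request dropped, and nothing changes until a new request arrives
after the window has expired.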

This limitation is not specific to uclamp, but it will be more prevalent with
it, as we no longer gradually ramp up or down. We could easily be jumping
between frequencies depending on the order in which tasks wake up and their
respective uclamp values.

We regard that as a limitation of the capabilities of the underlying system
itself.

There is room to improve the behavior of schedutil rate_limit_us, but not much
can be done for 1 or 2. They are considered hard limitations of the system.