cpuidle.rst - OpenGrok cross reference for /linux/Documentation/admin-guide/pm/cpuidle.rst

Lines Matching +full:as +full:- +full:is
1 .. SPDX-License-Identifier: GPL-2.0
20 a program is suspended and instructions belonging to it are not fetched from
23 Since part of the processor hardware is not used in idle states, entering them
25 it is an opportunity to save energy.
27 CPU idle time management is an energy-efficiency feature concerned with using
31 ------------
33 CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
34 is the part of the kernel responsible for the distribution of computational
35 work in the system).  In its view, CPUs are *logical* units.  That is, they need
37 software as individual single-core processors.  In other words, a CPU is an
43 program) at a time, it is a CPU.  In that case, if the hardware is asked to
44 enter an idle state, that applies to the processor as a whole.
46 Second, if the processor is multi-core, each core in it is able to follow at
51 time.  The entire cores are CPUs in that case and if the hardware is asked to
61 Finally, each core in a multi-core processor may be able to follow more than one
62 program in the same time frame (that is, each core may be able to fetch
65 the cores present themselves to software as "bundles" each consisting of
66 multiple individual single-core "processors", referred to as *hardware threads*
67 (or hyper-threads specifically on Intel hardware), that each can follow one
69 time management perspective and if the processor is asked to enter an idle state
70 by one of them, the hardware thread (or CPU) that asked for it is stopped, but
74 it may be put into an idle state as a whole (if the other cores within the
78 ---------
80 Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
87 processor every time the task's code is run by a CPU.  The CPU scheduler
91 no specific conditions preventing their code from being run by a CPU as long as
92 there is a CPU available for that (for example, they are not waiting for any
102 assigned to the given CPU and the CPU is then regarded as idle.  In other words,
106 idle states, or there is not enough time to spend in an idle state before the
109 useless instructions in a loop until it is assigned a new task to run.
112 .. _idle-loop:
118 calls into a code module referred to as the *governor* that belongs to the CPU
124 The role of the governor is to find an idle state most suitable for the
127 the platform or the processor architecture and organized in a one-dimensional
129 driver matching the platform the kernel is running on at the initialization
133 Each idle state present in that array is characterized by two parameters to be
134 taken into account by the governor, the *target residency* and the (worst-case)
135 *exit latency*.  The target residency is the minimum time the hardware must
140 latency, in turn, is the maximum time it will take a CPU asking the processor
144 hardware is entering it and it must be entered completely to be exited in an
149 time is known exactly, because the kernel programs timers and it knows exactly
150 when they will trigger, and it is the maximum time the hardware that the given
152 and exit it.  However, the CPU may be woken up by a non-timer event at any time
153 (in particular, before the closest timer triggers) and it generally is not known
155 was idle after it has been woken up (that time will be referred to as the *idle
158 governor uses that information depends on what algorithm is implemented by it
159 and that is the primary reason for having more than one governor in the
162 There are four ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_,
163 ``ladder`` and ``haltpoll``.  Which of them is used by default depends on the
165 tick can be `stopped by the idle loop <idle-cpus-and-tick_>`_.  Available
172 Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
173 platform the kernel is running on, but there are platforms with more than one
180 the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
186 .. _idle-cpus-and-tick:
191 The scheduler tick is a timer that triggers periodically in order to implement
194 allow them to make reasonable progress in a given time frame is to make them
195 share the available CPU time.  Namely, in rough approximation, each task is
197 prioritization and so on and when that time slice is used up, the CPU should be
200 is there to make the switch happen regardless.  That is not the only role of the
201 tick, but it is the primary reason for using it.
203 The scheduler tick is problematic from the CPU idle time management perspective,
205 configuration, the length of the tick period is between 1 ms and 10 ms).
206 Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
212 Fortunately, it is not really necessary to allow the tick to trigger on idle
215 of the CPU time on them is the idle loop.  Since the time of an idle CPU need
217 tick goes away if the given CPU is idle.  Consequently, it is possible to stop
222 depends on what is expected by the governor.  First, if there is another
223 (non-tick) timer due to trigger within the tick range, stopping the tick clearly
225 reprogrammed in that case.  Second, if the governor is expecting a non-timer
226 wakeup within the tick range, stopping the tick is not necessary and it may even
228 the target residency within the time until the expected wakeup, so that state is
230 state then, as that would contradict its own expectation of a wakeup in short
233 which is expensive.  On the other hand, if the tick is stopped and the wakeup
236 energy.  Hence, if the governor is expecting a wakeup of any kind within the
237 tick range, it is better to allow the tick to trigger.  Otherwise, however, the
241 In any case, the governor knows what it is expecting and the decision on whether
243 stopped already (in one of the previous iterations of the loop), it is better
244 to leave it as is and the governor needs to take that into account.
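
The reasoning above about when the governor should ask for the tick to be stopped can be summarized in a minimal sketch.  It is only an illustration in Python, not kernel code: the function name and its parameters are hypothetical, and real governors weigh additional inputs.

```python
def should_stop_tick(predicted_idle_us, tick_period_us, tick_already_stopped=False):
    """Return True if the tick should be (or stay) stopped for this idle period.

    All names here are hypothetical; durations are in microseconds.
    """
    if tick_already_stopped:
        # A tick stopped in an earlier iteration of the loop is left as is.
        return True
    # A wakeup expected within the tick range makes stopping the tick
    # pointless, so the tick is only stopped for longer predicted idle times.
    return predicted_idle_us >= tick_period_us
```

For example, with a 4 ms tick period, a predicted idle duration of 500 us would leave the tick running, while a prediction of 20 ms would allow it to be stopped.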
247 loop altogether.  That can be done through the build-time configuration of it
249 ``nohz=off`` to it in the command line.  In both cases, as the stopping of the
250 scheduler tick is disabled, the governor's decisions regarding it are simply
251 ignored by the idle loop code and the tick is never stopped.
254 stopped on idle CPUs are referred to as *tickless* systems and they are
255 generally regarded as more energy-efficient than the systems running kernels in
256 which the tick cannot be stopped.  If the given system is tickless, it will use
257 the ``menu`` governor by default and if it is not tickless, the default
261 .. _menu-gov:
266 The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
267 It is quite complex, but the basic principle of its design is straightforward.
275 and variance of them.  If the variance is small (smaller than 400 square
276 microseconds) or it is small relative to the average (the average is greater
277 than 6 times the standard deviation), the average is regarded as the "typical
279 which one is farther from the average) of the saved observed idle duration
280 values is discarded and the computation is repeated for the remaining ones.
282 Again, if the variance of them is small (in the above sense), the average is
283 taken as the "typical interval" value and so on, until either the "typical
284 interval" is determined or too many data points are disregarded.  In the latter
285 case, if the size of the set of data points still under consideration is
286 sufficiently large, the next idle duration is not likely to be above the largest
287 idle duration value still in that set, so that value is taken as the predicted
289 consideration is too small, no prediction is made.
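
The "typical interval" search described above can be sketched in Python as follows.  This is a simplified illustration under stated assumptions, not the code of the ``menu`` governor: the function name, the ``max_dropped`` and ``min_left_for_max`` cut-offs and the default threshold values are hypothetical parameters chosen for the example, and the samples are assumed to share one time unit.

```python
from math import sqrt
from statistics import fmean, pvariance


def typical_interval(samples, var_limit=400.0, ratio=6.0,
                     max_dropped=2, min_left_for_max=6):
    """Estimate a "typical interval" from observed idle durations.

    Returns the average when the variance is small, the largest remaining
    value when too many outliers had to be dropped but enough points are
    left, and None when no prediction can be made.
    """
    data = sorted(samples)
    dropped = 0
    while data:
        avg = fmean(data)
        var = pvariance(data, mu=avg)
        # "Small" variance: below the absolute limit, or small relative
        # to the average (average greater than `ratio` standard deviations).
        if var < var_limit or avg > ratio * sqrt(var):
            return avg
        if dropped == max_dropped:
            break
        # Discard the extreme value that is farther from the average.
        if avg - data[0] > data[-1] - avg:
            data.pop(0)
        else:
            data.pop()
        dropped += 1
    # Too many points were disregarded (or there were none to begin with).
    if len(data) >= min_left_for_max:
        return float(data[-1])
    return None
```

With seven samples of 100 and a single outlier of 5000, the outlier is discarded in the first pass and the average of the remaining values, 100, is returned.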
291 If the preliminary prediction of the next idle duration computed this way is
294 as the *sleep length* in what follows, is the upper bound on the time before the
295 next CPU wakeup.  It is used to determine the sleep length range, which in turn
296 is needed to get the sleep length correction factor.
300 range represented in the array is approximately 10 times wider than the previous
304 selecting the idle state for the CPU) is updated after the CPU has been woken
305 up and the closer the sleep length is to the observed idle duration, the closer
307 The sleep length is multiplied by the correction factor for the range that it
308 falls into to obtain an approximation of the predicted idle duration that is
310 the two is taken as the final idle duration prediction.
312 If the "typical interval" value is small, which means that the CPU is likely
313 to be woken up soon enough, the sleep length computation is skipped as it may
314 be costly and the idle duration is simply predicted to equal the "typical
317 Now, the governor is ready to walk the list of idle states and choose one of
320 limit coming from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
326 if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
327 happens if the idle duration predicted by it is less than the tick period and
330 the real time until the closest timer event and if it really is greater than
335 .. _teo-gov:
340 The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
341 for tickless systems.  It follows the same basic strategy as the ``menu`` `one
342 <menu-gov_>`_: it always tries to find the deepest idle state suitable for the
345 .. kernel-doc:: drivers/cpuidle/governors/teo.c
346    :doc: teo-description
348 .. _idle-states-representation:
354 supported by the processor have to be represented as a one-dimensional array of
357 is a hierarchy of units in the processor, one |struct cpuidle_state| object can
360 of it <idle-loop_>`_, must reflect the properties of the idle state at the
364 For example, take a processor with two cores in a larger unit referred to as
367 enter a specific idle state of its own (say "MX") if the other core is in idle
369 level gives the hardware a license to go as deep as to idle state "MX" at the
370 "module" level, but there is no guarantee that this is going to happen (the core
374 the module (including the time needed to enter it), because that is the minimum
378 because that is the maximum delay between a wakeup signal and the time the CPU
380 module will always be ready to execute instructions as soon as the module
381 becomes operational as a whole).
386 example, in any way and the ``CPUIdle`` driver is responsible for the entire
387 handling of the hierarchy.  Then, the definition of the idle state objects is
398 |struct cpuidle_state| object, there is a corresponding
400 statistics of the given idle state.  That information is exposed by the kernel
403 For each CPU in the system, there is a :file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
404 directory in ``sysfs``, where the number ``<N>`` is assigned to the given
411 object corresponding to it, as follows:
427 	Whether or not this idle state is disabled.
446 	Total time spent in this idle state by the given CPU (as measured by the
458 between them is that the name is expected to be more concise, while the
462 The :file:`disable` attribute is the only writeable one.  If it contains 1, the
463 given idle state is disabled for this particular CPU, which means that the
465 driver will never ask the hardware to enter it for that CPU as a result.
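
As an illustration of the ``sysfs`` layout described above, the following Python sketch lists the idle states of one CPU together with a few of their attributes.  It assumes that each idle state is represented by a ``state<n>`` subdirectory of the per-CPU ``cpuidle`` directory; the helper name and the returned dictionary keys are hypothetical, and only reading is shown, because writing to :file:`disable` requires sufficient privileges.

```python
import re
from pathlib import Path


def read_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return a list of dicts describing the idle states of one CPU.

    Attributes that are absent are reported as None; a missing cpuidle
    directory yields an empty list.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    found = []
    for entry in base.iterdir():
        match = re.fullmatch(r"state(\d+)", entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry))
    states = []
    for index, entry in sorted(found):  # numeric order: state2 before state10
        info = {"index": index}
        for attr in ("name", "disable", "time"):
            path = entry / attr
            info[attr] = path.read_text().strip() if path.is_file() else None
        states.append(info)
    return states
```

Calling ``read_idle_states(0)`` on a system without a ``CPUIdle`` driver simply returns an empty list.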
469 governor is implemented, disabling an idle state prevents that governor from
472 If the :file:`disable` attribute contains 0, the given idle state is enabled for
480 The :file:`power` attribute is not defined very well, especially for idle state
482 hierarchy of units in the processor, and it generally is hard to obtain idle
488 really spent by the given CPU in the given idle state, because it is measured by
499 it is to use idle state residency counters in the hardware, if available.
507 .. _cpu-pm-qos:
514 energy-efficiency features of the kernel to prevent performance from dropping
522 device file under :file:`/dev/` and writing a binary value (interpreted as a
523 signed 32-bit integer) to it.  In turn, the resume latency constraint for a CPU
525 32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under
527 ``<N>`` is allocated at the system initialization time.  Negative values
529 number will be interpreted as a requested PM QoS constraint in microseconds.
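
The binary write used for the global CPU latency limit can be sketched in Python as follows.  This is a hedged illustration rather than a reference implementation: the helper names are hypothetical, the default path assumes that the special device file is :file:`/dev/cpu_dma_latency`, and the caller is assumed to keep the returned file descriptor open for as long as the request should stay in effect.

```python
import os
import struct


def encode_latency_limit(usec):
    """Pack a latency limit (in microseconds) as a native signed 32-bit integer."""
    return struct.pack("i", usec)


def request_cpu_latency_limit(usec, path="/dev/cpu_dma_latency"):
    """Open the device file, write the requested limit and return the open
    file descriptor that represents the request."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encode_latency_limit(usec))
    except OSError:
        os.close(fd)
        raise
    return fd
```

The resume latency constraint for an individual CPU takes a string instead, so writing ``str(usec)`` to the :file:`power/pm_qos_resume_latency_us` file is sufficient in that case.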
531 The requested value is not automatically applied as a new constraint, however,
532 as it may be less restrictive (greater in this particular case) than another
536 applies the effective (minimum in this particular case) value as the new
542 represents that request.  If that file descriptor is then used for writing, the
544 it as a new requested limit value.  Next, the priority list mechanism will be
546 that effective value will be set as a new CPU latency limit.  Thus requesting a
547 new limit value will only change the real limit if the effective "list" value is
548 affected by it, which is the case if it is the minimum of the requested values
563 In turn, for each CPU there is one resume latency PM QoS request associated with
567 process does that.  In other words, this PM QoS request is shared by the entire
570 practice is to pin a process to the CPU in question and let it use the
571 ``sysfs`` interface to control the resume latency constraint for it.]  It is
572 still only a request, however.  It is an entry in a priority list used to
573 determine the effective value to be set as the resume latency constraint for the
574 CPU in question every time the list of requests is updated this way or another
579 the given CPU as the upper limit for the exit latency of the idle states that
588 `disabled for individual CPUs <idle-states-representation_>`_, there are kernel
594 from being invoked.  If it is added to the kernel command line, the idle loop
596 support code that is expected to provide a default mechanism for this purpose.
597 That default mechanism usually is the least common denominator for all of the
599 however, so it is rather crude and not very energy-efficient.  For this reason,
600 it is not recommended for production use.
605 governor will be used instead of the default one.  It is possible to force
619 which of the two parameters is added to the kernel command line.  In the
621 instruction of the CPUs (which, as a rule, suspends the execution of the program
623 for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
625 that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
628 P-states (see |cpufreq|) that require any number of CPUs in a package to be
629 idle, so it very well may hurt single-thread computations performance as well as
630 energy-efficiency.  Thus using it for performance reasons may not be a good idea
634 the CPU to enter idle states. When this option is used, the ``acpi_idle``
639 by it is in the system's ACPI tables.
641 In addition to the architecture-level kernel command line options affecting CPU
645 where ``<n>`` is an idle state index also used in the name of the given
647 `Representation of Idle States <idle-states-representation_>`_), causes the
651 the two drivers is different for ``<n>`` equal to ``0``.  Adding
654 ``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
655 Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
656 can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
657 parameter when it is loaded.]