.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``). It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with :doc:`cpuidle` if you
have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states. That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.

``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy. The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states. The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system. The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).
Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them. The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables. [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver. In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states. If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).
Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables. In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).

If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration. For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states. The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value. Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.


.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction. If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case). Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).

Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine). That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``. If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.

Apart from that there are two module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver. It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all). Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable. In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
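
The command line rules described so far in this section can be summarized in a
short user-space sketch. It is only an illustration and not code taken from the
driver: the function name is hypothetical, the ``intel_idle.max_cstate=N``
spelling assumes the usual ``module.parameter`` form used for parameters of
built-in code, and the parsing is deliberately simplified (for instance, it
does not model how the kernel treats repeated or malformed options).

.. code-block:: python

   # Options that, per the text above, forbid the use of MWAIT.
   MWAIT_BLOCKERS = {"idle=poll", "idle=halt", "idle=nomwait"}

   # Assumed command line spelling of the max_cstate module parameter.
   MAX_CSTATE_PREFIX = "intel_idle.max_cstate="


   def intel_idle_init_expected_to_fail(cmdline):
       """Return True if the given kernel command line contains an option
       that, according to the rules above, makes intel_idle fail to
       initialize (hypothetical helper, simplified parsing)."""
       for option in cmdline.split():
           if option in MWAIT_BLOCKERS:
               return True
           if option.startswith(MAX_CSTATE_PREFIX):
               value = option[len(MAX_CSTATE_PREFIX):]
               # max_cstate set to 0 causes the initialization to fail.
               if value.isdigit() and int(value) == 0:
                   return True
       return False


   if __name__ == "__main__":
       # /proc/cmdline holds the command line of the running kernel.
       try:
           with open("/proc/cmdline") as f:
               running_cmdline = f.read()
       except OSError:
           running_cmdline = ""
       print(intel_idle_init_expected_to_fail(running_cmdline))

A result of ``False`` only means that none of the options listed above were
found; the driver may still fail to initialize for other reasons, such as
missing ``MWAIT`` support in the processor.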

The ``noacpi`` module parameter (which is recognized by ``intel_idle`` if the
kernel has been configured with ACPI support) can be set to make the driver
ignore the system's ACPI tables entirely (it is unset by default).


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states). One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.
For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).

As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state. [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.] If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications