.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================
Intel Uncore Frequency Scaling
==============================

:Copyright: |copy| 2022-2023 Intel Corporation

:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

Introduction
------------

The uncore can consume a significant amount of power in Intel's Xeon servers,
depending on the workload characteristics. To optimize the total power and
improve overall performance, SoCs have internal algorithms for scaling uncore
frequency. These algorithms monitor workload usage of the uncore and set a
desirable frequency.

It is possible that users have different expectations of uncore performance and
want to have control over it. The objective is similar to allowing users to set
the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
Users may have some latency sensitive workloads where they do not want any
change to uncore frequency. Also, users may have workloads which require
different core and uncore performance at distinct phases and they may want to
use both cpufreq and the uncore scaling interface to distribute power and
improve overall performance.

Sysfs Interface
---------------

To control uncore frequency, a sysfs interface is provided in the directory:
`/sys/devices/system/cpu/intel_uncore_frequency/`.

There is one directory for each package and die combination, as the scope of
uncore scaling control is per die in multiple die/package SoCs or per
package for single die per package SoCs. The name represents the
scope of control. For example: 'package_00_die_00' is for package id 0 and
die 0.

Each package_*_die_* contains the following attributes:

``initial_max_freq_khz``
    Out of reset, this attribute represents the maximum possible frequency.
    This is a read-only attribute.
    If users adjust max_freq_khz,
    they can always go back to maximum using the value from this attribute.

``initial_min_freq_khz``
    Out of reset, this attribute represents the minimum possible frequency.
    This is a read-only attribute. If users adjust min_freq_khz,
    they can always go back to minimum using the value from this attribute.

``max_freq_khz``
    This attribute is used to set the maximum uncore frequency.

``min_freq_khz``
    This attribute is used to set the minimum uncore frequency.

``current_freq_khz``
    This attribute is used to get the current uncore frequency.

SoCs with TPMI (Topology Aware Register and PM Capsule Interface)
-----------------------------------------------------------------

An SoC can contain multiple power domains, each with an individual mesh
partition or a collection of mesh partitions. Such a partition is called a
fabric cluster.

Certain types of meshes need to run at the same frequency, so they are
placed in the same fabric cluster. The benefit of a fabric cluster is that it
offers a scalable mechanism to deal with partitioned fabrics in an SoC.

The current sysfs interface supports controls at package and die level.
This interface is not enough to support more granular control at
fabric cluster level.

SoCs with support for TPMI (Topology Aware Register and PM Capsule
Interface) can have multiple power domains. Each power domain can
contain one or more fabric clusters.

To represent controls at fabric cluster level in addition to the
controls at package and die level (like systems without TPMI
support), sysfs is enhanced. This granular interface is presented in
sysfs with directory names prefixed with "uncore". For example:
uncore00, uncore01 etc.

The scope of control is specified by attributes "package_id", "domain_id"
and "fabric_cluster_id" in the directory.
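
As an illustration of how the scope attributes can be consumed, the following
is a minimal Python sketch that lists the uncore* directories and reads their
"package_id", "domain_id" and "fabric_cluster_id" attributes. The helper name
``read_uncore_scopes`` is hypothetical and not part of any kernel or library
interface; the values are returned as the raw strings read from sysfs.

```python
from pathlib import Path

# Sysfs directory named in this document.
UNCORE_ROOT = Path("/sys/devices/system/cpu/intel_uncore_frequency")

# Attributes that specify the scope of control of an uncore* directory.
SCOPE_ATTRS = ("package_id", "domain_id", "fabric_cluster_id")


def read_uncore_scopes(root=UNCORE_ROOT):
    """Map each uncore* directory name to its scope attributes.

    Hypothetical helper for illustration. Attributes that are missing
    from a directory are simply left out of that directory's mapping.
    """
    root = Path(root)
    scopes = {}
    if not root.is_dir():
        # Driver not loaded or interface not available on this system.
        return scopes
    for entry in sorted(root.glob("uncore*")):
        if not entry.is_dir():
            continue
        scope = {}
        for attr in SCOPE_ATTRS:
            attr_path = entry / attr
            if attr_path.is_file():
                scope[attr] = attr_path.read_text().strip()
        scopes[entry.name] = scope
    return scopes


if __name__ == "__main__":
    for name, scope in read_uncore_scopes().items():
        print(name, scope)
```

The root directory is a parameter so that the sketch can be exercised against
a temporary directory tree instead of a live system.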

Attributes in each directory:

``domain_id``
    This attribute is used to get the power domain id of this instance.

``die_id``
    This attribute is used to get the Linux die id of this instance.
    This attribute is only present for domains with core agents and
    when the CPUID leaf 0x1f presents die ID.

``fabric_cluster_id``
    This attribute is used to get the fabric cluster id of this instance.

``package_id``
    This attribute is used to get the package id of this instance.

``agent_types``
    This attribute displays all the hardware agents present within the
    domain. Each agent has the capability to control one or more hardware
    subsystems, which include: core, cache, memory, and I/O.

The other attributes are the same as those presented at package_*_die_* level.

In most current use cases, "max_freq_khz" and "min_freq_khz"
are updated at "package_*_die_*" level. This model will still be supported
with the following approach:

When a user uses the controls at "package_*_die_*" level, every fabric
cluster in that package and die is affected. For example, if a user changes
"max_freq_khz" in package_00_die_00, then "max_freq_khz" in each uncore*
directory with the same package id will be updated. In this case the user can
still update "max_freq_khz" at each uncore* level, which is more restrictive.
Similarly, the user can update "min_freq_khz" at "package_*_die_*" level
to apply it at each uncore* level.

Support for "current_freq_khz" is available only at each fabric cluster
level (i.e., in the uncore* directories).

Efficiency vs. Latency Tradeoff
-------------------------------

The Efficiency Latency Control (ELC) feature improves performance
per watt. With this feature, hardware power management algorithms
optimize the trade-off between latency and power consumption. For some
latency sensitive workloads, further tuning can be done by SW to
get the desired performance.

The hardware monitors the average CPU utilization across all cores
in a power domain at regular intervals and decides an uncore frequency.
While this may result in the best performance per watt, a workload may
expect higher performance at the expense of power. Consider an
application that intermittently wakes up to perform memory reads on an
otherwise idle system. In such cases, if the hardware lowers the uncore
frequency, then there may be a delay in ramping up the frequency to meet
the target performance.

The ELC control defines some parameters which can be changed from SW.
If the average CPU utilization is below a user-defined threshold
(elc_low_threshold_percent attribute below), the user-defined uncore
floor frequency will be used (elc_floor_freq_khz attribute below)
instead of the hardware calculated minimum.

Similarly, in a high load scenario where the CPU utilization goes above
the high threshold value (elc_high_threshold_percent attribute below),
the frequency is increased in 100MHz steps instead of jumping to the
maximum uncore frequency. This avoids consuming unnecessarily high power
immediately when CPU utilization spikes.

Attributes for efficiency latency control:

``elc_floor_freq_khz``
    This attribute is used to get/set the efficiency latency floor frequency.
    If this value is lower than 'min_freq_khz', it is ignored by
    the firmware.

``elc_low_threshold_percent``
    This attribute is used to get/set the efficiency latency control low
    threshold. This attribute is in percent of CPU utilization.

``elc_high_threshold_percent``
    This attribute is used to get/set the efficiency latency control high
    threshold. This attribute is in percent of CPU utilization.

``elc_high_threshold_enable``
    This attribute is used to enable/disable the efficiency latency control
    high threshold. Write '1' to enable, '0' to disable.

The example system configuration below does the following:
  * when CPU utilization is less than 10%: sets uncore frequency to 800MHz
  * when CPU utilization is higher than 95%: increases uncore frequency in
    100MHz steps, until power limit is reached

  elc_floor_freq_khz:800000
  elc_high_threshold_percent:95
  elc_high_threshold_enable:1
  elc_low_threshold_percent:10
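
The example configuration could be applied from user space roughly as in the
following Python sketch. It assumes that the ELC attributes are present in the
directory passed in (for example an uncore* directory) and that the caller has
permission to write to them. The helper name ``apply_elc_settings`` is
hypothetical, and the order in which the attributes are written is an
arbitrary choice of this sketch, not a documented requirement.

```python
from pathlib import Path

# Values taken from the example configuration in this document.
ELC_EXAMPLE = {
    "elc_floor_freq_khz": 800000,
    "elc_high_threshold_percent": 95,
    "elc_high_threshold_enable": 1,
    "elc_low_threshold_percent": 10,
}


def apply_elc_settings(uncore_dir, settings=ELC_EXAMPLE):
    """Write each ELC attribute in `settings` to a file under `uncore_dir`.

    Hypothetical helper for illustration. Returns the list of attribute
    names that were skipped because the attribute file does not exist.
    """
    uncore_dir = Path(uncore_dir)
    skipped = []
    for attr, value in settings.items():
        attr_path = uncore_dir / attr
        if not attr_path.exists():
            # Attribute not exposed in this directory; leave it alone.
            skipped.append(attr)
            continue
        attr_path.write_text(f"{value}\n")
    return skipped
```

A call such as
``apply_elc_settings("/sys/devices/system/cpu/intel_uncore_frequency/uncore00")``
would then attempt to apply the example values to the first fabric cluster
directory; a non-empty return value indicates attributes that were not found
there.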