/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 *
 * Copyright 2024 Oxide Computer Company
 */

/*
 * AMD-specific CPU power management support.
 *
 * And, a brief history of AMD CPU power management. Or, "Why you care about CPU
 * power management even when you are not worried about a few watts from the
 * wall." This history is intended to provide lodestones for this domain, but is
 * not a fully comprehensive AMD power management feature chronology.
 *
 * In the early 2000s, AMD shipped a feature called PowerNow! in the K6 era -
 * K6-2E+ and K6-III+ cores, according to "AMD PowerNow! Technology Dynamically
 * Manages Power and Performance", publication number 24404A. This feature
 * allowed operating systems to control power and performance settings in a way
 * that is very similar to ACPI P-states. That is, selectable core voltage and
 * frequency levels, with default "power-saver" and "high-performance" modes
 * that are reflective of Pmin and Pmax on a 2024-era AMD processor.
 *
 * With Thuban and Zosma parts later in the K10 era, AMD extended power and
 * frequency management with the "Turbo Core" feature. They talk about this in
 * more detail in blogs about the Bulldozer architecture, though many materials
 * are now dead links. Exactly how Turbo Core is informed and managed is less
 * discussed, or at least I have been unable to find good technical material on
 * the topic, but we can draw some inferences from what *is* discussed with
 * those Bulldozer cores:
 * * introduces the notion of boosting all cores beyond a "base frequency"
 * * introduces the notion of boosting further with only half or fewer cores
 *   active
 * * introduces the notion of power-governed turbo boost
 *
 * Somewhere in the K10 era, AMD also introduced C-state support, allowing cores
 * to be put into low-power idle states when not used. Some articles from
 * reviewers and system integrators around this time indicate that setting the
 * "C-state mode to C6" is "required to get the highest Turbo Core frequencies."
 *
 * As the AMD 15h BIOS and Kernel Developer's Guide (BKDG) is careful to note,
 * AMD C-states do not directly correspond to ACPI C-states. But when an ACPI
 * low-power C-state is entered, the CPU's low-power implementation is one of
 * these AMD C-states, and C6 is the lowest-power of them.
 *
 * Further, note that in the Bulldozer era, CPUs were in the range of 4-8 cores,
 * so "half or fewer cores" means "2-4 active cores."
 *
 * At this point and onwards, for some families of AMD parts, best single-core
 * performance can only be achieved if an operating system parks idle CPU cores
 * in the lowest-power states - AMD's C6, aka ACPI C3.
 *
 * Boosting beyond a base clock, in an AMD-defined and approved manner,
 * potentially on all cores, has since also been branded as "AMD Core
 * Performance Boost." This is the name under which the behavior is known in
 * Zen and later parts.
 *
 * Zen included a more expansive power management approach, "Precision Boost."
 * "Precision Boost" is where we start to see power management more explicitly
 * *managed* - core clocks and voltages are decided by some software running on
 * the new System Management Unit (SMU). Correspondingly, exactly what
 * voltage/temperature/power inputs will produce what operational outcomes
 * from the processor becomes less and less clearly documented.
 *
 * For example, a (Zen 1) Ryzen 7 1700 part is labeled as 3.0GHz base clock,
 * with up to 3.8GHz boost clock. This 800MHz gamut is the purview of the SMU
 * implementing "Precision Boost."
 *
 * In practice, later AMD marketing material implies that Precision Boost
 * retained the "Turbo Core" behavior in which peak boost frequencies are only
 * attainable when one or two cores are actually active. Additionally, even if
 * all cores are loaded, Precision Boost provides some amount of boost if
 * thermal and power headroom allows.
 *
 * Taking the above Ryzen 7 1700 part as an example, the "base clock" of 3.0 GHz
 * is relatively unlikely to be an actual operational frequency of the part.
 * Either a core will be off (as in AMD-defined C1 or C6), on in a low-power
 * P-state (the processor's minimum operational frequency, probably P1 or
 * whatever Pmin the part supports), or on in a high-power P-state (P0). In the
 * high-power P-state, if the "boost above base clock" feature is enabled, a
 * core will probably be some hundreds of MHz above its requested clock speed!
 *
 * Further, somewhere around the Zen architecture AMD introduced the "Extended
 * Frequency Range" (XFR) feature, which allows the processor to clock
 * 100-150MHz (depending on SKU) higher than "max turbo." This is still
 * constrained by the silicon, power, and thermal limits indicated by a
 * combination of fused values set at fabrication time, platform firmware, and
 * potentially user customization (if firmware allows). Specifics here are
 * still slim pickings.
 *
 * Forum-goers in 2018 would discuss their Ryzen 7 1700s having a clock speed of
 * 3.1-3.2GHz under all-core load, going up to 3.7GHz under one- or two-core
 * loads. All frequency selection in this range is up to the SMU, potentially
 * capped by BIOS or OS-provided parameters.
 *
 * As of Zen 5, the latest development here is "Precision Boost 2", which began
 * shipping with Zen 2. This seems to be an upgrade of the power/frequency
 * selection regime used by the SMU - instead of "all-core" and "low-core"
 * turbos, the processor measures its utilization of system-specific parameters
 * such as package temperature and power draw. Exactly how frequency choices are
 * made at this point appears to be a black box, other than blanket statements
 * like "the processor will pick the highest permissible frequency given its
 * operating environment."
 *
 * An interesting detail in the marketing material and slide decks surrounding
 * the introduction of Precision Boost 2.0 is an implicit confirmation that
 * Precision Boost did maintain a strict "all-core" and "low-core" pair of
 * frequencies. This comes from the marketing statement that Precision Boost 2.0
 * has done away with those concepts from previous generations, instead
 * providing a "linear scaling" of frequencies under increasing load levels.
 *
 * This brings us to 2024; empirically, the above blanket statements are only
 * correct if the operating system manages CPU cores in a way roughly
 * commensurate with how AMD would expect an operating system to manage them.
 *
 * This is especially dramatic on AMD's server parts - Naples, Rome, Milan, and
 * onward - where having all cores in high-power C-states, even if in low-power
 * P-states, still prevents individual cores from boosting closer to a part's
 * Fmax. The difference between a peak clock without C-state management, and
 * peak clock with C-state management, can be as much as 20% of a part's Fmax.
 * This has also been seen on Threadripper systems. But the impact of C-state
 * management seems much less dramatic on "desktop" parts; a 7950X without
 * C-state management can see individual cores clocking to 5.4 GHz or above,
 * much closer to its rated Fmax of 5.75 GHz.
 *
 * From empirical measurement, the difference here appears to be an undocumented
 * "all-core" turbo that the part limits itself to if all cores are in C0, even
 * if they are in C0 but in Pmax and stopped in hlt/mwait idle - the actual
 * power draw differences between these states may be small, but simply being
 * powered seems to trip some threshold.
 *
 * One conclusion from all this is that across the board, C-state management can
 * have a surprising relationship to performance. Unfortunately, the direct
 * relationships are undocumented. We are entirely dependent on ACPI-provided
 * latency information to decide if C-state transitions are profitable given
 * instantaneous workloads and performance needs.
 *
 * Finally, CPPC (Collaborative Processor Performance Control) is a feature
 * that currently seems to be more oriented towards desktop enthusiast parts,
 * but stretches the above even further. CPPC includes an abstract "performance
 * scale" a processor supports, where the operating system requests some factor
 * along this scale based on workloads it must run. CPPC also introduces the
 * idea of "Preferred Cores", where at manufacturing time individual cores in a
 * die are fused with information indicating how highly they can be driven.
 * This is reportedly reflected as higher peak clocks under load, lower voltage
 * (and less power) at intermediate clocks.
 *
 * It would be nice, in the limit of time, to find if a given processor supports
 * CPPC, collect its preferred cores, and prefer scheduling tasks on those cores
 * if they are not already busy. This extends somewhat beyond simply managing
 * power states of loaded cores.
 */

#include <sys/x86_archext.h>
#include <sys/cpu_acpi.h>
#include <sys/cpu_idle.h>
#include <sys/pwrnow.h>

boolean_t
cpupm_amd_init(cpu_t *cp)
{
	cpupm_mach_state_t *mach_state =
	    (cpupm_mach_state_t *)(cp->cpu_m.mcpu_pm_mach_state);

	/* AMD or Hygon? */
	if (x86_vendor != X86_VENDOR_AMD &&
	    x86_vendor != X86_VENDOR_HYGON)
		return (B_FALSE);

	/*
	 * If we support PowerNow! on this processor, then set the
	 * correct cma_ops for the processor.
	 */
	mach_state->ms_pstate.cma_ops = pwrnow_supported() ?
	    &pwrnow_ops : NULL;

	/*
	 * AMD systems may support C-states, so optimistically set cma_ops to
	 * drive C-states. If the system does not *actually* support C-states,
	 * ACPI tables will not include _CST objects and `cpus_init` will fail.
	 * This, in turn, will cause `cpupm_init` to reset idle handling to not
	 * use C-states, including clearing `ms_cstate.cma_ops`.
	 */
	mach_state->ms_cstate.cma_ops = &cpu_idle_ops;

	return (B_TRUE);
}