1 /* 2 * CDDL HEADER START 3 * 4 * The contents of this file are subject to the terms of the 5 * Common Development and Distribution License (the "License"). 6 * You may not use this file except in compliance with the License. 7 * 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE 9 * or http://www.opensolaris.org/os/licensing. 10 * See the License for the specific language governing permissions 11 * and limitations under the License. 12 * 13 * When distributing Covered Code, include this CDDL HEADER in each 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE. 15 * If applicable, add the following below this CDDL HEADER, with the 16 * fields enclosed by brackets "[]" replaced with your own identifying 17 * information: Portions Copyright [yyyy] [name of copyright owner] 18 * 19 * CDDL HEADER END 20 */ 21 /* 22 * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved. 23 * Copyright (c) 2011, 2016 by Delphix. All rights reserved. 24 * Copyright 2013 Nexenta Systems, Inc. All rights reserved. 25 * Copyright 2014 Josef "Jeff" Sipek <jeffpc@josefsipek.net> 26 * Copyright 2020 Joyent, Inc. 27 * Copyright 2025 Oxide Computer Company 28 * Copyright 2024 MNX Cloud, Inc. 29 * Copyright 2025 Edgecast Cloud LLC. 30 */ 31 /* 32 * Copyright (c) 2010, Intel Corporation. 33 * All rights reserved. 34 */ 35 /* 36 * Portions Copyright 2009 Advanced Micro Devices, Inc. 37 */ 38 39 /* 40 * CPU Identification logic 41 * 42 * The purpose of this file and its companion, cpuid_subr.c, is to help deal 43 * with the identification of CPUs, their features, and their topologies. More 44 * specifically, this file helps drive the following: 45 * 46 * 1. Enumeration of features of the processor which are used by the kernel to 47 * determine what features to enable or disable. These may be instruction set 48 * enhancements or features that we use. 49 * 50 * 2. Enumeration of instruction set architecture (ISA) additions that userland 51 * will be told about through the auxiliary vector. 52 * 53 * 3. Understanding the physical topology of the CPU such as the number of 54 * caches, how many cores it has, whether or not it supports symmetric 55 * multi-processing (SMT), etc. 56 * 57 * ------------------------ 58 * CPUID History and Basics 59 * ------------------------ 60 * 61 * The cpuid instruction was added by Intel roughly around the time that the 62 * original Pentium was introduced. The purpose of cpuid was to tell in a 63 * programmatic fashion information about the CPU that previously was guessed 64 * at. For example, an important part of cpuid is that we can know what 65 * extensions to the ISA exist. If you use an invalid opcode you would get a 66 * #UD, so this method allows a program (whether a user program or the kernel) 67 * to determine what exists without crashing or getting a SIGILL. Of course, 68 * this was also during the era of the clones and the AMD Am5x86. The vendor 69 * name shows up first in cpuid for a reason. 70 * 71 * cpuid information is broken down into ranges called a 'leaf'. Each leaf puts 72 * unique values into the registers %eax, %ebx, %ecx, and %edx and each leaf has 73 * its own meaning. The different leaves are broken down into different regions: 74 * 75 * [ 0, 7fffffff ] This region is called the 'basic' 76 * region. This region is generally defined 77 * by Intel, though some of the original 78 * portions have different meanings based 79 * on the manufacturer. These days, Intel 80 * adds most new features to this region. 
81 * AMD adds non-Intel compatible 82 * information in the third, extended 83 * region. Intel uses this for everything 84 * including ISA extensions, CPU 85 * features, cache information, topology, 86 * and more. 87 * 88 * There is a hole carved out of this 89 * region which is reserved for 90 * hypervisors. 91 * 92 * [ 40000000, 4fffffff ] This region, which is found in the 93 * middle of the previous region, is 94 * explicitly promised to never be used by 95 * CPUs. Instead, it is used by hypervisors 96 * to communicate information about 97 * themselves to the operating system. The 98 * values and details are unique for each 99 * hypervisor. 100 * 101 * [ 80000000, ffffffff ] This region is called the 'extended' 102 * region. Some of the low leaves mirror 103 * parts of the basic leaves. This region 104 * has generally been used by AMD for 105 * various extensions. For example, AMD- 106 * specific information about caches, 107 * features, and topology are found in this 108 * region. 109 * 110 * To specify a range, you place the desired leaf into %eax, zero %ebx, %ecx, 111 * and %edx, and then issue the cpuid instruction. At the first leaf in each of 112 * the ranges, one of the primary things returned is the maximum valid leaf in 113 * that range. This allows for discovery of what range of CPUID is valid. 114 * 115 * The CPUs have potentially surprising behavior when using an invalid leaf or 116 * unimplemented leaf. If the requested leaf is within the valid basic or 117 * extended range, but is unimplemented, then %eax, %ebx, %ecx, and %edx will be 118 * set to zero. However, if you specify a leaf that is outside of a valid range, 119 * then instead it will be filled with the last valid _basic_ leaf. For example, 120 * if the maximum basic value is on leaf 0x3, then issuing a cpuid for leaf 4 or 121 * an invalid extended leaf will return the information for leaf 3. 122 * 123 * Some leaves are broken down into sub-leaves. This means that the value 124 * depends on both the leaf asked for in %eax and a secondary register. For 125 * example, Intel uses the value in %ecx on leaf 7 to indicate a sub-leaf to get 126 * additional information. Or when getting topology information in leaf 0xb, the 127 * initial value in %ecx changes which level of the topology that you are 128 * getting information about. 129 * 130 * cpuid values are always kept to 32 bits regardless of whether or not the 131 * program is in 64-bit mode. When executing in 64-bit mode, the upper 132 * 32 bits of the register are always set to zero so that way the values are the 133 * same regardless of execution mode. 134 * 135 * ---------------------- 136 * Identifying Processors 137 * ---------------------- 138 * 139 * We can identify a processor in two steps. The first step looks at cpuid leaf 140 * 0. Leaf 0 contains the processor's vendor information. This is done by 141 * putting a 12 character string in %ebx, %ecx, and %edx. On AMD, it is 142 * 'AuthenticAMD' and on Intel it is 'GenuineIntel'. 143 * 144 * From there, a processor is identified by a combination of three different 145 * values: 146 * 147 * 1. Family 148 * 2. Model 149 * 3. Stepping 150 * 151 * Each vendor uses the family and model to uniquely identify a processor. The 152 * way that family and model are changed depends on the vendor. For example, 153 * Intel has been using family 0x6 for almost all of their processor since the 154 * Pentium Pro/Pentium II era, often called the P6. The model is used to 155 * identify the exact processor. 
 * Different models are often used for the client (consumer) and server parts.
 * Even though the processors often have major architectural differences, they
 * are still considered the same family by Intel.
 *
 * On the other hand, each major AMD architecture generally has its own family.
 * For example, the K8 is family 0xf, Bulldozer 0x15, and Zen 0x17. Within a
 * family, the model number is used to help identify specific processors. As
 * AMD's product lines have expanded, they have started putting a mixed bag of
 * processors into the same family, with each processor under a single
 * identifying banner (e.g., Milan, Cezanne) using a range of model numbers. We
 * refer to each such collection as a processor family, distinct from cpuid
 * family. Importantly, each processor family has a BIOS and Kernel Developer's
 * Guide (BKDG, older parts) or Processor Programming Reference (PPR) that
 * defines the processor family's non-architectural features. In general, we'll
 * use "family" here to mean the family number reported by the cpuid instruction
 * and distinguish the processor family from it where appropriate.
 *
 * The stepping is used to refer to a revision of a specific microprocessor. The
 * term comes from equipment used to produce masks that are used to create
 * integrated circuits.
 *
 * The information is present in leaf 1, %eax. In technical documentation you
 * will see the terms extended model and extended family. The original family,
 * model, and stepping fields were each 4 bits wide. If the value of the family
 * field is 0xf, then one must consult the extended model and extended family,
 * which occupy previously reserved bits; the extended family is added to the
 * base value of 0xf, while the extended model provides the upper bits of the
 * model number.
 *
 * When we process this information, we store the full family, model, and
 * stepping in the struct cpuid_info members cpi_family, cpi_model, and
 * cpi_step, respectively. Whenever you are performing comparisons with the
 * family, model, and stepping, you should use these members and not the raw
 * values from cpuid. If you must use the raw values from cpuid directly, you
 * must make sure that you add the extended model and family to the base model
 * and family.
 *
 * In general, we do not use information about the family, model, and stepping
 * to determine whether or not a feature is present; that is generally driven by
 * specific leaves. However, when something we care about on the processor is
 * not considered 'architectural', meaning that it is specific to a set of
 * processors and not promised in the architecture model to be consistent from
 * generation to generation, then we will fall back on this information. The
 * most common cases where this comes up are when we have to work around errata
 * in the processor, are dealing with processor-specific features such as CPU
 * performance counters, or want to provide additional information for things
 * such as fault management.
 *
 * While processors also have a brand string, which is the name that people are
 * familiar with when buying the processor, it is not meant for programmatic
 * consumption. That is what the family, model, and stepping are for.
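 *
 * As a concrete illustration of the encoding described above, the following
 * userland sketch derives the vendor string and the full family, model, and
 * stepping from leaves 0 and 1. It is not the kernel's implementation: it uses
 * the compiler-provided <cpuid.h> helper rather than the kernel's own cpuid
 * routines, and it omits vendor-specific nuances (for example, Intel also
 * applies the extended model when the base family is 0x6):
 *
 *	#include <stdio.h>
 *	#include <string.h>
 *	#include <cpuid.h>		// __get_cpuid() from GCC/clang
 *
 *	int
 *	main(void)
 *	{
 *		unsigned int eax, ebx, ecx, edx;
 *		unsigned int family, model, step;
 *		char vendor[13];
 *
 *		// Leaf 0: %eax holds the maximum basic leaf; the vendor
 *		// string is assembled from %ebx, %edx, %ecx in that order.
 *		if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) == 0)
 *			return (1);
 *		(void) memcpy(vendor, &ebx, 4);
 *		(void) memcpy(vendor + 4, &edx, 4);
 *		(void) memcpy(vendor + 8, &ecx, 4);
 *		vendor[12] = '\0';
 *
 *		// Leaf 1, %eax: stepping [3:0], model [7:4], family [11:8],
 *		// extended model [19:16], extended family [27:20].
 *		(void) __get_cpuid(1, &eax, &ebx, &ecx, &edx);
 *		step = eax & 0xf;
 *		model = (eax >> 4) & 0xf;
 *		family = (eax >> 8) & 0xf;
 *
 *		// Fold in the extended fields when the base family is 0xf.
 *		if (family == 0xf) {
 *			family += (eax >> 20) & 0xff;
 *			model += ((eax >> 16) & 0xf) << 4;
 *		}
 *
 *		(void) printf("%s family 0x%x model 0x%x stepping 0x%x\n",
 *		    vendor, family, model, step);
 *		return (0);
 *	}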
207 * 208 * We use the x86_chiprev_t to encode a combination of vendor, processor family, 209 * and stepping(s) that refer to a single or very closely related set of silicon 210 * implementations; while there are sometimes more specific ways to learn of the 211 * presence or absence of a particular erratum or workaround, one may generally 212 * assume that all processors of the same chiprev have the same errata and we 213 * have chosen to represent them this way precisely because that is how AMD 214 * groups them in their revision guides (errata documentation). The processor 215 * family (x86_processor_family_t) may be extracted from the chiprev if that 216 * level of detail is not needed. Processor families are considered unordered 217 * but revisions within a family may be compared for either an exact match or at 218 * least as recent as a reference revision. See the chiprev_xxx() functions 219 * below. 220 * 221 * Similarly, each processor family implements a particular microarchitecture, 222 * which itself may have multiple revisions. In general, non-architectural 223 * features are specific to a processor family, but some may exist across 224 * families containing cores that implement the same microarchitectural revision 225 * (and, such cores share common bugs, too). We provide utility routines 226 * analogous to those for extracting and comparing chiprevs for 227 * microarchitectures as well; see the uarch_xxx() functions. 228 * 229 * Both chiprevs and uarchrevs are defined in x86_archext.h and both are at 230 * present used and available only for AMD and AMD-like processors. 231 * 232 * ------------ 233 * CPUID Passes 234 * ------------ 235 * 236 * As part of performing feature detection, we break this into several different 237 * passes. There used to be a pass 0 that was done from assembly in locore.s to 238 * support processors that have a missing or broken cpuid instruction (notably 239 * certain Cyrix processors) but those were all 32-bit processors which are no 240 * longer supported. Passes are no longer numbered explicitly to make it easier 241 * to break them up or move them around as needed; however, they still have a 242 * well-defined execution ordering enforced by the definition of cpuid_pass_t in 243 * x86_archext.h. The external interface to execute a cpuid pass or determine 244 * whether a pass has been completed consists of cpuid_execpass() and 245 * cpuid_checkpass() respectively. The passes now, in that execution order, 246 * are as follows: 247 * 248 * PRELUDE This pass does not have any dependencies on system 249 * setup; in particular, unlike all subsequent passes it is 250 * guaranteed not to require PCI config space access. It 251 * sets the flag indicating that the processor we are 252 * running on supports the cpuid instruction, which all 253 * 64-bit processors do. This would also be the place to 254 * add any other basic state that is required later on and 255 * can be learned without dependencies. 256 * 257 * IDENT Determine which vendor manufactured the CPU, the family, 258 * model, and stepping information, and compute basic 259 * identifying tags from those values. This is done first 260 * so that machine-dependent code can control the features 261 * the cpuid instruction will report during subsequent 262 * passes if needed, and so that any intervening 263 * machine-dependent code that needs basic identity will 264 * have it available. 
 *            This includes synthesised identifiers such as chiprev and
 *            uarchrev as well as the values obtained directly from cpuid.
 *            Prior to executing this pass, machine-dependent boot code is
 *            responsible for ensuring that the PCI configuration space
 *            access functions have been set up and, if necessary, that
 *            determine_platform() has been called.
 *
 * BASIC      This is the primary pass and is responsible for doing a large
 *            number of different things:
 *
 *            1. Gathering a large number of feature flags to determine which
 *            features the CPU supports and which indicate things that we
 *            need to do other work in the OS to enable. Features detected
 *            this way are added to the x86_featureset which can be queried
 *            to determine what we should do. This includes processing all of
 *            the basic and extended CPU features that we care about.
 *
 *            2. Determining the CPU's topology. This includes information
 *            about how many cores and threads are present in the package. It
 *            is also responsible for figuring out which logical CPUs are
 *            potentially part of the same core and what other resources they
 *            might share. For more information see the 'Topology' section.
 *
 *            3. Determining the set of CPU security-specific features that
 *            we need to worry about and determining the appropriate set of
 *            workarounds.
 *
 *            On the boot CPU, this pass occurs before KMDB is started.
 *
 * EXTENDED   This pass is done after startup(). Here, we check other
 *            miscellaneous features. Most of this is gathering additional
 *            basic and extended features that we'll use in later passes or
 *            for debugging support.
 *
 * DYNAMIC    This pass occurs after the kernel memory allocator has been
 *            fully initialized. It gathers information for which we might
 *            need dynamic memory available. This includes several varying
 *            width leaves that have cache information and the processor's
 *            brand string.
 *
 * RESOLVE    The final normal pass is performed after the kernel has brought
 *            most everything online. This is invoked from post_startup(). In
 *            this pass, we go through the set of features that we have
 *            enabled and turn that into the hardware auxiliary vector
 *            features that userland receives. This is used by userland,
 *            primarily by the run-time link-editor (RTLD), though userland
 *            software could also refer to it directly.
 *
 * The function that performs a pass is currently assumed to be infallible, and
 * all existing implementations are. This simplifies callers by allowing
 * cpuid_execpass() to return void. Similarly, implementers do not need to check
 * for a NULL CPU argument; the current CPU's cpu_t is substituted if necessary.
 * Both of these assumptions can be relaxed if needed by future developments.
 * Tracking of completed states is handled by cpuid_execpass(). It is a
 * programmer error to attempt to execute a pass before all previous passes have
 * been completed on the specified CPU, or to request cpuid information before
 * the pass that captures it has been executed. These conditions can be tested
 * using cpuid_checkpass().
 *
 * ---------
 * Microcode
 * ---------
 *
 * Microcode updates may be applied by the firmware (BIOS/UEFI) and/or by the
 * operating system and may result in architecturally visible changes (e.g.,
 * changed MSR or CPUID bits).
As such, we want to apply any updates as early 335 * as possible during the boot process -- right after the IDENT pass. 336 * 337 * Microcode may also be updated at runtime via ucodeadm(8), after which we do 338 * a selective rescan of the cpuid leaves to determine what features have 339 * changed. Microcode updates can provide more details about security related 340 * features to deal with issues like Spectre and L1TF. On occasion, vendors have 341 * violated their contract and removed bits. However, we don't try to detect 342 * that because that puts us in a situation that we really can't deal with. As 343 * such, the only thing we rescan are security related features today. See 344 * cpuid_pass_ucode(). This is not a pass in the same sense as the others and 345 * is run on demand, via cpuid_post_ucodeadm(). 346 * 347 * 348 * All of the passes are run on all CPUs. However, for the most part we only 349 * care about what the boot CPU says about this information and use the other 350 * CPUs as a rough guide to sanity check that we have the same feature set. 351 * 352 * We do not support running multiple logical CPUs with disjoint, let alone 353 * different, feature sets. 354 * 355 * ------------------ 356 * Processor Topology 357 * ------------------ 358 * 359 * One of the important things that we need to do is to understand the topology 360 * of the underlying processor. When we say topology in this case, we're trying 361 * to understand the relationship between the logical CPUs that the operating 362 * system sees and the underlying physical layout. Different logical CPUs may 363 * share different resources which can have important consequences for the 364 * performance of the system. For example, they may share caches, execution 365 * units, and more. 366 * 367 * The topology of the processor changes from generation to generation and 368 * vendor to vendor. Along with that, different vendors use different 369 * terminology, and the operating system itself uses occasionally overlapping 370 * terminology. It's important to understand what this topology looks like so 371 * one can understand the different things that we try to calculate and 372 * determine. 373 * 374 * To get started, let's talk about a little bit of terminology that we've used 375 * so far, is used throughout this file, and is fairly generic across multiple 376 * vendors: 377 * 378 * CPU 379 * A central processing unit (CPU) refers to a logical and/or virtual 380 * entity that the operating system can execute instructions on. The 381 * underlying resources for this CPU may be shared between multiple 382 * entities; however, to the operating system it is a discrete unit. 383 * 384 * PROCESSOR and PACKAGE 385 * 386 * Generally, when we use the term 'processor' on its own, we are referring 387 * to the physical entity that one buys and plugs into a board. However, 388 * because processor has been overloaded and one might see it used to mean 389 * multiple different levels, we will instead use the term 'package' for 390 * the rest of this file. The term package comes from the electrical 391 * engineering side and refers to the physical entity that encloses the 392 * electronics inside. Strictly speaking the package can contain more than 393 * just the CPU, for example, on many processors it may also have what's 394 * called an 'integrated graphical processing unit (GPU)'. Because the 395 * package can encapsulate multiple units, it is the largest physical unit 396 * that we refer to. 
 *
 * SOCKET
 *
 *    A socket refers to a unit on a system board (generally the motherboard)
 *    that can receive a package. A single package, or processor, is plugged
 *    into a single socket. A system may have multiple sockets. Oftentimes,
 *    the term socket is used interchangeably with package and refers to the
 *    electrical component that is plugged in, and not the receptacle itself.
 *
 * CORE
 *
 *    A core refers to the physical instantiation of a CPU, generally with a
 *    full set of hardware resources available to it. A package may contain
 *    multiple cores inside of it or it may just have a single one. A
 *    processor with more than one core is often referred to as 'multi-core'.
 *    In illumos, we will use the feature X86FSET_CMP to refer to a system
 *    that has 'multi-core' processors.
 *
 *    A core may expose a single logical CPU to the operating system, or it
 *    may expose multiple CPUs, which we call threads, defined below.
 *
 *    Some resources may still be shared by cores in the same package. For
 *    example, many processors will share the level 3 cache between cores.
 *    Some AMD generations share hardware resources between cores. For more
 *    information on that see the section 'AMD Topology'.
 *
 * THREAD and STRAND
 *
 *    In this file, a thread generally refers to a hardware resource and not
 *    the operating system's logical abstraction. A thread is always exposed
 *    as an independent logical CPU to the operating system. A thread belongs
 *    to a specific core. A core may have more than one thread. When that is
 *    the case, the threads that are part of the same core are often referred
 *    to as 'siblings'.
 *
 *    When multiple threads exist, this is generally referred to as
 *    simultaneous multi-threading (SMT). When Intel introduced this in their
 *    processors they called it hyper-threading (HT). When multiple threads
 *    are active in a core, they split the resources of the core. For
 *    example, two threads may share the same set of hardware execution
 *    units.
 *
 *    The operating system often uses the term 'strand' to refer to a thread.
 *    This helps disambiguate it from the software concept.
 *
 * CHIP
 *
 *    Unfortunately, the term 'chip' is dramatically overloaded. At its most
 *    basic, it is used to refer to a single integrated circuit, which may or
 *    may not be the only thing in the package. In illumos, when you see the
 *    term 'chip' it is almost always referring to the same thing as the
 *    'package'. However, many vendors may use chip to refer to one of many
 *    integrated circuits that have been placed in the package. As an
 *    example, see the subsequent definition.
 *
 *    To try and keep things consistent, we will only use chip when referring
 *    to the entire integrated circuit package, with the exception of the
 *    definition of multi-chip module (because it is in the name), and use
 *    the term 'die' when we want the more general definition that may refer
 *    to a sub-component.
 *
 * DIE
 *
 *    A die refers to an integrated circuit. Inside of the package there may
 *    be a single die or multiple dies. This is sometimes called a 'chip' in
 *    vendors' parlance, but in this file, we use the term die to refer to a
 *    subcomponent.
 *
 * MULTI-CHIP MODULE
 *
 *    A multi-chip module (MCM) refers to placing multiple distinct chips,
 *    connected together, in the same package.
When a multi-chip design is 468 * used, generally each chip is manufactured independently and then joined 469 * together in the package. For example, on AMD's Zen microarchitecture 470 * (family 0x17), the package contains several dies (the second meaning of 471 * chip from above) that are connected together. 472 * 473 * CACHE 474 * 475 * A cache is a part of the processor that maintains copies of recently 476 * accessed memory. Caches are split into levels and then into types. 477 * Commonly there are one to three levels, called level one, two, and 478 * three. The lower the level, the smaller it is, the closer it is to the 479 * execution units of the CPU, and the faster it is to access. The layout 480 * and design of the cache come in many different flavors, consult other 481 * resources for a discussion of those. 482 * 483 * Caches are generally split into two types, the instruction and data 484 * cache. The caches contain what their names suggest, the instruction 485 * cache has executable program text, while the data cache has all other 486 * memory that the processor accesses. As of this writing, data is kept 487 * coherent between all of the caches on x86, so if one modifies program 488 * text before it is executed, that will be in the data cache, and the 489 * instruction cache will be synchronized with that change when the 490 * processor actually executes those instructions. This coherency also 491 * covers the fact that data could show up in multiple caches. 492 * 493 * Generally, the lowest level caches are specific to a core. However, the 494 * last layer cache is shared between some number of cores. The number of 495 * CPUs sharing this last level cache is important. This has implications 496 * for the choices that the scheduler makes, as accessing memory that might 497 * be in a remote cache after thread migration can be quite expensive. 498 * 499 * Sometimes, the word cache is abbreviated with a '$', because in US 500 * English the word cache is pronounced the same as cash. So L1D$ refers to 501 * the L1 data cache, and L2$ would be the L2 cache. This will not be used 502 * in the rest of this theory statement for clarity. 503 * 504 * MEMORY CONTROLLER 505 * 506 * The memory controller is a component that provides access to DRAM. Each 507 * memory controller can access a set number of DRAM channels. Each channel 508 * can have a number of DIMMs (sticks of memory) associated with it. A 509 * given package may have more than one memory controller. The association 510 * of the memory controller to a group of cores is important as it is 511 * cheaper to access memory on the controller that you are associated with. 512 * 513 * NUMA 514 * 515 * NUMA or non-uniform memory access, describes a way that systems are 516 * built. On x86, any processor core can address all of the memory in the 517 * system. However, When using multiple sockets or possibly within a 518 * multi-chip module, some of that memory is physically closer and some of 519 * it is further. Memory that is further away is more expensive to access. 
520 * Consider the following image of multiple sockets with memory: 521 * 522 * +--------+ +--------+ 523 * | DIMM A | +----------+ +----------+ | DIMM D | 524 * +--------+-+ | | | | +-+------+-+ 525 * | DIMM B |=======| Socket 0 |======| Socket 1 |=======| DIMM E | 526 * +--------+-+ | | | | +-+------+-+ 527 * | DIMM C | +----------+ +----------+ | DIMM F | 528 * +--------+ +--------+ 529 * 530 * In this example, Socket 0 is closer to DIMMs A-C while Socket 1 is 531 * closer to DIMMs D-F. This means that it is cheaper for socket 0 to 532 * access DIMMs A-C and more expensive to access D-F as it has to go 533 * through Socket 1 to get there. The inverse is true for Socket 1. DIMMs 534 * D-F are cheaper than A-C. While the socket form is the most common, when 535 * using multi-chip modules, this can also sometimes occur. For another 536 * example of this that's more involved, see the AMD topology section. 537 * 538 * 539 * Intel Topology 540 * -------------- 541 * 542 * Most Intel processors since Nehalem, (as of this writing the current gen 543 * is Skylake / Cannon Lake) follow a fairly similar pattern. The CPU portion of 544 * the package is a single monolithic die. MCMs currently aren't used. Most 545 * parts have three levels of caches, with the L3 cache being shared between 546 * all of the cores on the package. The L1/L2 cache is generally specific to 547 * an individual core. The following image shows at a simplified level what 548 * this looks like. The memory controller is commonly part of something called 549 * the 'Uncore', that used to be separate physical chips that were not a part of 550 * the package, but are now part of the same chip. 551 * 552 * +-----------------------------------------------------------------------+ 553 * | Package | 554 * | +-------------------+ +-------------------+ +-------------------+ | 555 * | | Core | | Core | | Core | | 556 * | | +--------+ +---+ | | +--------+ +---+ | | +--------+ +---+ | | 557 * | | | Thread | | L | | | | Thread | | L | | | | Thread | | L | | | 558 * | | +--------+ | 1 | | | +--------+ | 1 | | | +--------+ | 1 | | | 559 * | | +--------+ | | | | +--------+ | | | | +--------+ | | | | 560 * | | | Thread | | | | | | Thread | | | | | | Thread | | | | | 561 * | | +--------+ +---+ | | +--------+ +---+ | | +--------+ +---+ | | 562 * | | +--------------+ | | +--------------+ | | +--------------+ | | 563 * | | | L2 Cache | | | | L2 Cache | | | | L2 Cache | | | 564 * | | +--------------+ | | +--------------+ | | +--------------+ | | 565 * | +-------------------+ +-------------------+ +-------------------+ | 566 * | +-------------------------------------------------------------------+ | 567 * | | Shared L3 Cache | | 568 * | +-------------------------------------------------------------------+ | 569 * | +-------------------------------------------------------------------+ | 570 * | | Memory Controller | | 571 * | +-------------------------------------------------------------------+ | 572 * +-----------------------------------------------------------------------+ 573 * 574 * A side effect of this current architecture is that what we care about from a 575 * scheduling and topology perspective, is simplified. In general we care about 576 * understanding which logical CPUs are part of the same core and socket. 577 * 578 * To determine the relationship between threads and cores, Intel initially used 579 * the identifier in the advanced programmable interrupt controller (APIC). 
 * They also added cpuid leaf 4 to give additional information about the number
 * of threads and CPUs in the processor. With the addition of x2apic (which
 * increased the number of addressable logical CPUs from 8 bits to 32 bits), an
 * additional cpuid topology leaf 0xB was added.
 *
 * AMD Topology
 * ------------
 *
 * When discussing AMD topology, we want to break this into three distinct
 * generations of topology. There's the basic topology that has been used in
 * family 0xf+ (Opteron, Athlon64), there's the topology that was introduced
 * with family 0x15 (Bulldozer), and there's the topology that was introduced
 * with family 0x17 (Zen), evolved more dramatically in Zen 2 (still family
 * 0x17), and tweaked slightly in Zen 3 (family 0x19). AMD also has some
 * additional terminology that's worth talking about.
 *
 * Until the introduction of family 0x17 (Zen), AMD did not implement something
 * that they considered SMT. Whether or not the AMD processors have SMT
 * influences many things including scheduling and reliability, availability,
 * and serviceability (RAS) features.
 *
 * NODE
 *
 *    AMD uses the term node to refer to a die that contains a number of
 *    cores and I/O resources. Depending on the processor family and model,
 *    more than one node can be present in the package. When there is more
 *    than one node, this indicates a multi-chip module. Usually each node
 *    has its own access to memory and I/O devices. This is important and
 *    generally different from the corresponding Intel Nehalem-Skylake+
 *    processors. As a result, we track this relationship in the operating
 *    system.
 *
 *    In processors with an L3 cache, the L3 cache is generally shared across
 *    the entire node, though the way this is carved up varies from
 *    generation to generation.
 *
 * BULLDOZER
 *
 *    Starting with the Bulldozer family (0x15) and continuing until the
 *    introduction of the Zen microarchitecture, AMD introduced the idea of a
 *    compute unit. In a compute unit, two traditional cores share a number
 *    of hardware resources. Critically, they share the FPU, L1 instruction
 *    cache, and the L2 cache. Several compute units were then combined
 *    inside of a single node. Because the integer execution units, L1 data
 *    cache, and some other resources were not shared between the cores, AMD
 *    never considered this to be SMT.
 *
 * ZEN
 *
 *    The Zen family (0x17) uses a multi-chip module (MCM) design; the module
 *    is called Zeppelin. These modules are similar to the idea of nodes used
 *    previously. Each of these nodes has two DRAM channels which all of the
 *    cores in the node can access uniformly. These nodes are linked together
 *    in the package, creating a NUMA environment.
 *
 *    The Zeppelin die itself contains two different 'core complexes'. Each
 *    core complex consists of four cores, each of which has two threads, for
 *    a total of 8 logical CPUs per complex. Unlike other generations, where
 *    all the logical CPUs in a given node share the L3 cache, here each core
 *    complex has its own shared L3 cache.
 *
 *    A further thing that we need to consider is that in some
 *    configurations, particularly with the Threadripper line of processors,
 *    not every die actually has its memory controllers wired up to actual
 *    memory channels. This means that some cores have memory attached to
 *    them and others don't.
645 * 646 * To put Zen in perspective, consider the following images: 647 * 648 * +--------------------------------------------------------+ 649 * | Core Complex | 650 * | +-------------------+ +-------------------+ +---+ | 651 * | | Core +----+ | | Core +----+ | | | | 652 * | | +--------+ | L2 | | | +--------+ | L2 | | | | | 653 * | | | Thread | +----+ | | | Thread | +----+ | | | | 654 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | L | | 655 * | | | Thread | |L1| | | | Thread | |L1| | | 3 | | 656 * | | +--------+ +--+ | | +--------+ +--+ | | | | 657 * | +-------------------+ +-------------------+ | C | | 658 * | +-------------------+ +-------------------+ | a | | 659 * | | Core +----+ | | Core +----+ | | c | | 660 * | | +--------+ | L2 | | | +--------+ | L2 | | | h | | 661 * | | | Thread | +----+ | | | Thread | +----+ | | e | | 662 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | | | 663 * | | | Thread | |L1| | | | Thread | |L1| | | | | 664 * | | +--------+ +--+ | | +--------+ +--+ | | | | 665 * | +-------------------+ +-------------------+ +---+ | 666 * | | 667 * +--------------------------------------------------------+ 668 * 669 * This first image represents a single Zen core complex that consists of four 670 * cores. 671 * 672 * 673 * +--------------------------------------------------------+ 674 * | Zeppelin Die | 675 * | +--------------------------------------------------+ | 676 * | | I/O Units (PCIe, SATA, USB, etc.) | | 677 * | +--------------------------------------------------+ | 678 * | HH | 679 * | +-----------+ HH +-----------+ | 680 * | | | HH | | | 681 * | | Core |==========| Core | | 682 * | | Complex |==========| Complex | | 683 * | | | HH | | | 684 * | +-----------+ HH +-----------+ | 685 * | HH | 686 * | +--------------------------------------------------+ | 687 * | | Memory Controller | | 688 * | +--------------------------------------------------+ | 689 * | | 690 * +--------------------------------------------------------+ 691 * 692 * This image represents a single Zeppelin Die. Note how both cores are 693 * connected to the same memory controller and I/O units. While each core 694 * complex has its own L3 cache as seen in the first image, they both have 695 * uniform access to memory. 696 * 697 * 698 * PP PP 699 * PP PP 700 * +----------PP---------------------PP---------+ 701 * | PP PP | 702 * | +-----------+ +-----------+ | 703 * | | | | | | 704 * MMMMMMMMM| Zeppelin |==========| Zeppelin |MMMMMMMMM 705 * MMMMMMMMM| Die |==========| Die |MMMMMMMMM 706 * | | | | | | 707 * | +-----------+ooo ...+-----------+ | 708 * | HH ooo ... HH | 709 * | HH oo.. HH | 710 * | HH ..oo HH | 711 * | HH ... ooo HH | 712 * | +-----------+... ooo+-----------+ | 713 * | | | | | | 714 * MMMMMMMMM| Zeppelin |==========| Zeppelin |MMMMMMMMM 715 * MMMMMMMMM| Die |==========| Die |MMMMMMMMM 716 * | | | | | | 717 * | +-----------+ +-----------+ | 718 * | PP PP | 719 * +----------PP---------------------PP---------+ 720 * PP PP 721 * PP PP 722 * 723 * This image represents a single Zen package. In this example, it has four 724 * Zeppelin dies, though some configurations only have a single one. In this 725 * example, each die is directly connected to the next. Also, each die is 726 * represented as being connected to memory by the 'M' character and connected 727 * to PCIe devices and other I/O, by the 'P' character. Because each Zeppelin 728 * die is made up of two core complexes, we have multiple different NUMA 729 * domains that we care about for these systems. 
730 * 731 * ZEN 2 732 * 733 * Zen 2 changes things in a dramatic way from Zen 1. Whereas in Zen 1 734 * each Zeppelin Die had its own I/O die, that has been moved out of the 735 * core complex in Zen 2. The actual core complex looks pretty similar, but 736 * now the die actually looks much simpler: 737 * 738 * +--------------------------------------------------------+ 739 * | Zen 2 Core Complex Die HH | 740 * | HH | 741 * | +-----------+ HH +-----------+ | 742 * | | | HH | | | 743 * | | Core |==========| Core | | 744 * | | Complex |==========| Complex | | 745 * | | | HH | | | 746 * | +-----------+ HH +-----------+ | 747 * | HH | 748 * | HH | 749 * +--------------------------------------------------------+ 750 * 751 * From here, when we add the central I/O die, this changes things a bit. 752 * Each die is connected to the I/O die, rather than trying to interconnect 753 * them directly. The following image takes the same Zen 1 image that we 754 * had earlier and shows what it looks like with the I/O die instead: 755 * 756 * PP PP 757 * PP PP 758 * +---------------------PP----PP---------------------+ 759 * | PP PP | 760 * | +-----------+ PP PP +-----------+ | 761 * | | | PP PP | | | 762 * | | Zen 2 | +-PP----PP-+ | Zen 2 | | 763 * | | Die _| | PP PP | |_ Die | | 764 * | | |o|oooo| |oooo|o| | | 765 * | +-----------+ | | +-----------+ | 766 * | | I/O | | 767 * MMMMMMMMMMMMMMMMMMMMMMMMMM Die MMMMMMMMMMMMMMMMMMMMMMMMMM 768 * MMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMM 769 * | | | | 770 * MMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMM 771 * MMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMM 772 * | | | | 773 * | +-----------+ | | +-----------+ | 774 * | | |o|oooo| PP PP |oooo|o| | | 775 * | | Zen 2 -| +-PP----PP-+ |- Zen 2 | | 776 * | | Die | PP PP | Die | | 777 * | | | PP PP | | | 778 * | +-----------+ PP PP +-----------+ | 779 * | PP PP | 780 * +---------------------PP----PP---------------------+ 781 * PP PP 782 * PP PP 783 * 784 * The above has four core complex dies installed, though the Zen 2 EPYC 785 * and ThreadRipper parts allow for up to eight, while the Ryzen parts 786 * generally only have one to two. The more notable difference here is how 787 * everything communicates. Note that memory and PCIe come out of the 788 * central die. This changes the way that one die accesses a resource. It 789 * basically always has to go to the I/O die, where as in Zen 1 it may have 790 * satisfied it locally. In general, this ends up being a better strategy 791 * for most things, though it is possible to still treat everything in four 792 * distinct NUMA domains with each Zen 2 die slightly closer to some memory 793 * and PCIe than otherwise. This also impacts the 'amdzen' nexus driver as 794 * now there is only one 'node' present. 795 * 796 * ZEN 3 797 * 798 * From an architectural perspective, Zen 3 is a much smaller change from 799 * Zen 2 than Zen 2 was from Zen 1, though it makes up for most of that in 800 * its microarchitectural changes. The biggest thing for us is how the die 801 * changes. In Zen 1 and Zen 2, each core complex still had its own L3 802 * cache. However, in Zen 3, the L3 is now shared between the entire core 803 * complex die and is no longer partitioned between each core complex. This 804 * means that all cores on the die can share the same L3 cache. Otherwise, 805 * the general layout of the overall package with various core complexes 806 * and an I/O die stays the same. 
Here's what the Core Complex Die looks 807 * like in a bit more detail: 808 * 809 * +-------------------------------------------------+ 810 * | Zen 3 Core Complex Die | 811 * | +-------------------+ +-------------------+ | 812 * | | Core +----+ | | Core +----+ | | 813 * | | +--------+ | L2 | | | +--------+ | L2 | | | 814 * | | | Thread | +----+ | | | Thread | +----+ | | 815 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 816 * | | | Thread | |L1| | | | Thread | |L1| | | 817 * | | +--------+ +--+ | | +--------+ +--+ | | 818 * | +-------------------+ +-------------------+ | 819 * | +-------------------+ +-------------------+ | 820 * | | Core +----+ | | Core +----+ | | 821 * | | +--------+ | L2 | | | +--------+ | L2 | | | 822 * | | | Thread | +----+ | | | Thread | +----+ | | 823 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 824 * | | | Thread | |L1| | | | Thread | |L1| | | 825 * | | +--------+ +--+ | | +--------+ +--+ | | 826 * | +-------------------+ +-------------------+ | 827 * | | 828 * | +--------------------------------------------+ | 829 * | | L3 Cache | | 830 * | +--------------------------------------------+ | 831 * | | 832 * | +-------------------+ +-------------------+ | 833 * | | Core +----+ | | Core +----+ | | 834 * | | +--------+ | L2 | | | +--------+ | L2 | | | 835 * | | | Thread | +----+ | | | Thread | +----+ | | 836 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 837 * | | | Thread | |L1| | | | Thread | |L1| | | 838 * | | +--------+ +--+ | | +--------+ +--+ | | 839 * | +-------------------+ +-------------------+ | 840 * | +-------------------+ +-------------------+ | 841 * | | Core +----+ | | Core +----+ | | 842 * | | +--------+ | L2 | | | +--------+ | L2 | | | 843 * | | | Thread | +----+ | | | Thread | +----+ | | 844 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 845 * | | | Thread | |L1| | | | Thread | |L1| | | 846 * | | +--------+ +--+ | | +--------+ +--+ | | 847 * | +-------------------+ +-------------------+ | 848 * +-------------------------------------------------+ 849 * 850 * While it is not pictured, there are connections from the die to the 851 * broader data fabric and additional functional blocks to support that 852 * communication and coherency. 853 * 854 * CPUID LEAVES 855 * 856 * There are a few different CPUID leaves that we can use to try and understand 857 * the actual state of the world. As part of the introduction of family 0xf, AMD 858 * added CPUID leaf 0x80000008. This leaf tells us the number of logical 859 * processors that are in the system. Because families before Zen didn't have 860 * SMT, this was always the number of cores that were in the system. However, it 861 * should always be thought of as the number of logical threads to be consistent 862 * between generations. In addition we also get the size of the APIC ID that is 863 * used to represent the number of logical processors. This is important for 864 * deriving topology information. 865 * 866 * In the Bulldozer family, AMD added leaf 0x8000001E. The information varies a 867 * bit between Bulldozer and later families, but it is quite useful in 868 * determining the topology information. Because this information has changed 869 * across family generations, it's worth calling out what these mean 870 * explicitly. The registers have the following meanings: 871 * 872 * %eax The APIC ID. The entire register is defined to have a 32-bit 873 * APIC ID, even though on systems without x2apic support, it will 874 * be limited to 8 bits. 
 *
 * %ebx    On Bulldozer-era systems this contains information about the
 *         number of cores that are in a compute unit (cores that share
 *         resources). It also contains a per-package compute unit ID that
 *         identifies which compute unit the logical CPU is a part of.
 *
 *         On Zen-era systems this instead contains the number of threads per
 *         core and the ID of the core that the logical CPU is a part of.
 *         Note that this ID is unique only to the package; it is not
 *         globally unique across the entire system.
 *
 * %ecx    This contains the number of nodes that exist in the package. It
 *         also contains an ID that identifies which node the logical CPU is
 *         a part of.
 *
 * Finally, we also use cpuid leaf 0x8000001D to determine information about the
 * cache layout, so that we know which logical CPUs are sharing which caches.
 *
 * illumos Topology
 * ----------------
 *
 * Based on the above we synthesize the information into several different
 * variables that we store in the 'struct cpuid_info'. We'll go into the details
 * of what each member is supposed to represent and how unique it is. In
 * general, there are two levels of uniqueness that we care about. We care about
 * an ID that is globally unique. That means that it will be unique across all
 * entities in the system. For example, the default logical CPU ID is globally
 * unique. On the other hand, there is some information that we only care about
 * being unique within the context of a single package / socket. Here are the
 * variables that we keep track of and their meaning.
 *
 * Several of the values that identify something, with the exception of
 * cpi_apicid, are allowed to be synthetic.
 *
 * cpi_apicid
 *
 *    This is the value of the CPU's APIC ID. This should be the full 32-bit
 *    ID if the CPU is using the x2apic. Otherwise, it should be the 8-bit
 *    APIC ID. This value is globally unique among all logical CPUs across
 *    all packages. This is usually required by the APIC.
 *
 * cpi_chipid
 *
 *    This value indicates the ID of the package that the logical CPU is a
 *    part of. This value is allowed to be synthetic. It is usually derived
 *    by taking the CPU's APIC ID and determining how many bits are used to
 *    represent CPU cores in the package. All logical CPUs that are part of
 *    the same package must have the same value.
 *
 * cpi_coreid
 *
 *    This represents the ID of a CPU core. Two logical CPUs should only have
 *    the same cpi_coreid value if they are part of the same core. These
 *    values may be synthetic. On systems that support SMT, this value is
 *    usually derived from the APIC ID, otherwise it is often synthetic and
 *    just set to the value of the cpu_id in the cpu_t.
 *
 * cpi_pkgcoreid
 *
 *    This is similar to the cpi_coreid in that logical CPUs that are part of
 *    the same core should have the same ID. The main difference is that
 *    these values are only required to be unique to a given socket.
 *
 * cpi_clogid
 *
 *    This represents the logical ID of a logical CPU. This value should be
 *    unique within a given socket for each logical CPU. This is allowed to
 *    be synthetic, though it is usually based on the CPU's APIC ID. The
 *    broader system expects that logical CPUs that are part of the same core
 *    have contiguous numbers.
 *    For example, if there were two threads per core, then the IDs of two
 *    sibling logical CPUs divided by two should be the same, with one ID
 *    being even and the other odd. For example, IDs 4 and 5 indicate two
 *    logical CPUs that are part of the same core, while IDs 5 and 6
 *    represent two logical CPUs that are part of different cores.
 *
 *    While it is common for the cpi_coreid and the cpi_clogid to be derived
 *    from the same source, strictly speaking, they don't have to be and the
 *    two values should be considered logically independent. One should not
 *    try to compare a logical CPU's cpi_coreid and cpi_clogid to determine
 *    some kind of relationship. While this is tempting, we've seen cases on
 *    AMD family 0xf where the system's cpu id is not related to its APIC ID.
 *
 * cpi_ncpu_per_chip
 *
 *    This value indicates the total number of logical CPUs that exist in the
 *    physical package. Critically, this is not the number of logical CPUs
 *    that exist for just the single core.
 *
 *    This value should be the same for all logical CPUs in the same package.
 *
 * cpi_ncore_per_chip
 *
 *    This value indicates the total number of physical CPU cores that exist
 *    in the package. The system compares this value with cpi_ncpu_per_chip
 *    to determine if simultaneous multi-threading (SMT) is enabled. When
 *    cpi_ncpu_per_chip equals cpi_ncore_per_chip, then there is no SMT and
 *    the X86FSET_HTT feature is not set. If this value is greater than one,
 *    then we consider the processor to have the feature X86FSET_CMP, to
 *    indicate that there is support for more than one core.
 *
 *    This value should be the same for all logical CPUs in the same package.
 *
 * cpi_procnodes_per_pkg
 *
 *    This value indicates the number of 'nodes' that exist in the package.
 *    When a processor is actually a multi-chip module, this represents the
 *    number of such modules that exist in the package. Currently, on Intel
 *    based systems this member is always set to 1.
 *
 *    This value should be the same for all logical CPUs in the same package.
 *
 * cpi_procnodeid
 *
 *    This value indicates the ID of the node that the logical CPU is a part
 *    of. All logical CPUs that are in the same node must have the same value
 *    here. This value must be unique across all of the packages in the
 *    system. On Intel based systems, this is currently set to the value in
 *    cpi_chipid because there is only one node.
 *
 * cpi_cores_per_compunit
 *
 *    This value indicates the number of cores that are part of a compute
 *    unit. See the AMD topology section for more on this. This member only
 *    has real meaning currently for AMD Bulldozer family processors. For all
 *    other processors, this should currently be set to 1.
 *
 * cpi_compunitid
 *
 *    This indicates the compute unit that the logical CPU belongs to. For
 *    processors without AMD Bulldozer-style compute units this should be set
 *    to the value of cpi_coreid.
 *
 * cpi_ncpu_shr_last_cache
 *
 *    This indicates the number of logical CPUs that are sharing the same
 *    last level cache. This value should be the same for all CPUs that are
 *    sharing that cache. The last cache refers to the cache that is closest
 *    to memory and furthest away from the CPU.
 *
 * cpi_last_lvl_cacheid
 *
 *    This indicates the ID of the last cache that the logical CPU uses.
This 1018 * cache is often shared between multiple logical CPUs and is the cache 1019 * that is closest to memory and furthest away from the CPU. This value 1020 * should be the same for a group of logical CPUs only if they actually 1021 * share the same last level cache. IDs should not overlap between 1022 * packages. 1023 * 1024 * cpi_ncore_bits 1025 * 1026 * This indicates the number of bits that are required to represent all of 1027 * the cores in the system. As cores are derived based on their APIC IDs, 1028 * we aren't guaranteed a run of APIC IDs starting from zero. It's OK for 1029 * this value to be larger than the actual number of IDs that are present 1030 * in the system. This is used to size tables by the CMI framework. It is 1031 * only filled in for Intel and AMD CPUs. 1032 * 1033 * cpi_nthread_bits 1034 * 1035 * This indicates the number of bits required to represent all of the IDs 1036 * that cover the logical CPUs that exist on a given core. It's OK for this 1037 * value to be larger than the actual number of IDs that are present in the 1038 * system. This is used to size tables by the CMI framework. It is 1039 * only filled in for Intel and AMD CPUs. 1040 * 1041 * ----------- 1042 * Hypervisors 1043 * ----------- 1044 * 1045 * If trying to manage the differences between vendors wasn't bad enough, it can 1046 * get worse thanks to our friend hardware virtualization. Hypervisors are given 1047 * the ability to interpose on all cpuid instructions and change them to suit 1048 * their purposes. In general, this is necessary as the hypervisor wants to be 1049 * able to present a more uniform set of features or not necessarily give the 1050 * guest operating system kernel knowledge of all features so it can be 1051 * more easily migrated between systems. 1052 * 1053 * When it comes to trying to determine topology information, this can be a 1054 * double edged sword. When a hypervisor doesn't actually implement a cpuid 1055 * leaf, it'll often return all zeros. Because of that, you'll often see various 1056 * checks scattered about fields being non-zero before we assume we can use 1057 * them. 1058 * 1059 * When it comes to topology information, the hypervisor is often incentivized 1060 * to lie to you about topology. This is because it doesn't always actually 1061 * guarantee that topology at all. The topology path we take in the system 1062 * depends on how the CPU advertises itself. If it advertises itself as an Intel 1063 * or AMD CPU, then we basically do our normal path. However, when they don't 1064 * use an actual vendor, then that usually turns into multiple one-core CPUs 1065 * that we enumerate that are often on different sockets. The actual behavior 1066 * depends greatly on what the hypervisor actually exposes to us. 1067 * 1068 * -------------------- 1069 * Exposing Information 1070 * -------------------- 1071 * 1072 * We expose CPUID information in three different forms in the system. 1073 * 1074 * The first is through the x86_featureset variable. This is used in conjunction 1075 * with the is_x86_feature() function. This is queried by x86-specific functions 1076 * to determine which features are or aren't present in the system and to make 1077 * decisions based upon them. For example, users of this include everything from 1078 * parts of the system dedicated to reliability, availability, and 1079 * serviceability (RAS), to making decisions about how to handle security 1080 * mitigations, to various x86-specific drivers. 
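 *
 * As an illustration of this first form, an x86-specific kernel component
 * might gate an optional code path on a feature bit roughly as follows. This
 * is a minimal sketch: X86FSET_AVX is simply used as an example flag, and the
 * surrounding driver routines are hypothetical, not existing interfaces.
 *
 *	#include <sys/x86_archext.h>
 *
 *	static void
 *	mydrv_choose_copy_routine(void)
 *	{
 *		// Use the AVX-based copy path only when the feature was
 *		// detected during the cpuid passes; otherwise fall back to
 *		// the generic implementation.
 *		if (is_x86_feature(x86_featureset, X86FSET_AVX))
 *			mydrv_install_avx_copy();
 *		else
 *			mydrv_install_generic_copy();
 *	}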
General purpose or 1081 * architecture independent drivers should never be calling this function. 1082 * 1083 * The second means is through the auxiliary vector. The auxiliary vector is a 1084 * series of tagged data that the kernel passes down to a user program when it 1085 * begins executing. This information is used to indicate to programs what 1086 * instruction set extensions are present. For example, information about the 1087 * CPU supporting the machine check architecture (MCA) wouldn't be passed down 1088 * since user programs cannot make use of it. However, things like the AVX 1089 * instruction sets are. Programs use this information to make run-time 1090 * decisions about what features they should use. As an example, the run-time 1091 * link-editor (rtld) can relocate different functions depending on the hardware 1092 * support available. 1093 * 1094 * The final form is through a series of accessor functions that all have the 1095 * form cpuid_get*. This is used by a number of different subsystems in the 1096 * kernel to determine more detailed information about what we're running on, 1097 * topology information, etc. Some of these subsystems include processor groups 1098 * (uts/common/os/pg.c.), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI, 1099 * microcode, and performance monitoring. These functions all ASSERT that the 1100 * CPU they're being called on has reached a certain cpuid pass. If the passes 1101 * are rearranged, then this needs to be adjusted. 1102 * 1103 * ----------------------------------------------- 1104 * Speculative Execution CPU Side Channel Security 1105 * ----------------------------------------------- 1106 * 1107 * With the advent of the Spectre and Meltdown attacks which exploit speculative 1108 * execution in the CPU to create side channels there have been a number of 1109 * different attacks and corresponding issues that the operating system needs to 1110 * mitigate against. The following list is some of the common, but not 1111 * exhaustive, set of issues that we know about and have done some or need to do 1112 * more work in the system to mitigate against: 1113 * 1114 * - Spectre v1 1115 * - swapgs (Spectre v1 variant) 1116 * - Spectre v2 1117 * - Branch History Injection (BHI). 1118 * - Meltdown (Spectre v3) 1119 * - Rogue Register Read (Spectre v3a) 1120 * - Speculative Store Bypass (Spectre v4) 1121 * - ret2spec, SpectreRSB 1122 * - L1 Terminal Fault (L1TF) 1123 * - Microarchitectural Data Sampling (MDS) 1124 * - Register File Data Sampling (RFDS) 1125 * 1126 * Each of these requires different sets of mitigations and has different attack 1127 * surfaces. For the most part, this discussion is about protecting the kernel 1128 * from non-kernel executing environments such as user processes and hardware 1129 * virtual machines. Unfortunately, there are a number of user vs. user 1130 * scenarios that exist with these. The rest of this section will describe the 1131 * overall approach that the system has taken to address these as well as their 1132 * shortcomings. Unfortunately, not all of the above have been handled today. 1133 * 1134 * SPECTRE v2, ret2spec, SpectreRSB 1135 * 1136 * The second variant of the spectre attack focuses on performing branch target 1137 * injection. This generally impacts indirect call instructions in the system. 1138 * There are four different ways to mitigate this issue that are commonly 1139 * described today: 1140 * 1141 * 1. Using Indirect Branch Restricted Speculation (IBRS). 1142 * 2. 
Using Retpolines and RSB Stuffing
1143 * 3. Using Enhanced Indirect Branch Restricted Speculation (eIBRS)
1144 * 4. Using Automated Indirect Branch Restricted Speculation (AIBRS)
1145 *
1146 * IBRS uses a feature added to microcode to restrict speculation, among other
1147 * things. This form of mitigation has not been used as it has been generally
1148 * seen as too expensive and requires reactivation upon various transitions in
1149 * the system.
1150 *
1151 * As a less impactful alternative to IBRS, retpolines were developed by
1152 * Google. These basically require one to replace indirect calls with a specific
1153 * trampoline that will cause speculation to fail and break the attack.
1154 * Retpolines require compiler support. We always build with retpolines in the
1155 * external thunk mode. This means that a traditional indirect call is replaced
1156 * with a call to one of the __x86_indirect_thunk_<reg> functions. A side effect
1157 * of this is that all indirect function calls are performed through a register.
1158 *
1159 * We have to use a common external location of the thunk and not inline it into
1160 * the callsite so that we have a single place to patch these functions.
1161 * As it turns out, we currently have two different forms of retpolines that
1162 * exist in the system:
1163 *
1164 * 1. A full retpoline
1165 * 2. A no-op version
1166 *
1167 * The first one is used in the general case. Historically, there was an
1168 * AMD-specific optimized retpoline variant that was based around using a
1169 * serializing lfence instruction; however, in March 2022 it was announced that
1170 * this was actually still vulnerable to Spectre v2 and therefore we no longer
1171 * use it and it is no longer available in the system.
1172 *
1173 * The third form described above is the most curious. It turns out that the way
1174 * that retpolines are implemented is that they rely on how speculation is
1175 * performed on a 'ret' instruction. Intel has continued to optimize this
1176 * process (which is partly why we need to have return stack buffer stuffing,
1177 * but more on that in a bit) and in processors starting with Cascade Lake
1178 * on the server side, it's dangerous to rely on retpolines. Instead, a new
1179 * mechanism has been introduced called Enhanced IBRS (eIBRS).
1180 *
1181 * Unlike IBRS, eIBRS is designed to be enabled once at boot and left on each
1182 * physical core. However, if this is the case, we don't want to use retpolines
1183 * any more. Therefore if eIBRS is present, we end up turning each retpoline
1184 * function (called a thunk) into a jmp instruction. This means that we're still
1185 * paying the cost of an extra jump to the external thunk, but it gives us
1186 * flexibility and the ability to have a single kernel image that works across a
1187 * wide variety of systems and hardware features.
1188 *
1189 * Unfortunately, this alone is insufficient. First, Skylake systems have
1190 * additional speculation for the Return Stack Buffer (RSB) which is used to
1191 * return from call instructions and which retpolines take advantage of.
1192 * However, this problem is not just limited to Skylake and is actually more
1193 * pernicious. The SpectreRSB paper describes several more problems that can
1194 * arise in dealing with this. The RSB can be poisoned just like the indirect
1195 * branch predictor. This means that one needs to clear the RSB when
1196 * transitioning between two different privilege domains.
Some examples include:
1197 *
1198 * - Switching between two different user processes
1199 * - Going between user land and the kernel
1200 * - Returning to the kernel from a hardware virtual machine
1201 *
1202 * Mitigating this involves combining a couple of different things. The first is
1203 * SMEP (supervisor mode execution protection) which was introduced in Ivy
1204 * Bridge. When an RSB entry refers to a user address and we're executing in the
1205 * kernel, speculation through it will be stopped when SMEP is enabled. This
1206 * protects against a number of the different cases that we would normally be
1207 * worried about such as when we enter the kernel from user land.
1208 *
1209 * To protect against additional manipulation of the RSB from other contexts,
1210 * such as a non-root VMX context attacking the kernel, we first look to
1211 * enhanced IBRS. When eIBRS is present and enabled, then there should be
1212 * nothing else that we need to do to protect the kernel at this time.
1213 *
1214 * Unfortunately, not all eIBRS implementations are sufficient to guard
1215 * against RSB manipulations, so we still need to manually overwrite the
1216 * contents of the return stack buffer unless the hardware specifies we are
1217 * covered. We do this through the x86_rsb_stuff() function. Currently this
1218 * is employed on context switch and vmx_exit. The x86_rsb_stuff() function is
1219 * disabled only when mitigations in general are, or if we have hardware
1220 * indicating no need for post-barrier RSB protections, either in one place
1221 * (old hardware), or on both (newer hardware).
1222 *
1223 * If SMEP is not present, then we would have to stuff the RSB every time we
1224 * transitioned from user mode to the kernel, which isn't very practical right
1225 * now.
1226 *
1227 * To fully protect against user to user and vmx to vmx attacks from these
1228 * classes of issues, we would also need to allow them to opt into performing
1229 * an Indirect Branch Prediction Barrier (IBPB) on switch. This is not currently
1230 * wired up.
1231 *
1232 * The fourth form of mitigation here is specific to AMD and is called Automated
1233 * IBRS (AIBRS). This is similar in spirit to eIBRS; however rather than set the
1234 * IBRS bit in MSR_IA32_SPEC_CTRL (0x48) we instead set a bit in the EFER
1235 * (extended feature enable register) MSR. This bit basically says that IBRS
1236 * acts as though it is always active when executing at CPL0 and when executing
1237 * in the 'host' context when SEV-SNP is enabled.
1238 *
1239 * When this is active, AMD states that the RSB is cleared on VMEXIT and
1240 * therefore manual stuffing of it is unnecessary. While this handles RSB
1241 * stuffing attacks from SVM to the kernel, we must still consider the remaining
1242 * cases that exist, just like above. While traditionally AMD employed a 32
1243 * entry RSB allowing the traditional technique to work, this is not true on all
1244 * CPUs. While a write to IBRS would clear the RSB if the processor supports
1245 * more than 32 entries (but not otherwise), AMD states that as long as at least
1246 * a single 4 KiB unmapped guard page is present between user and kernel address
1247 * spaces and SMEP is enabled, then there is no need to clear the RSB at all.
1248 *
1249 * By default, the system will enable RSB stuffing and the required variant of
 * retpolines and store that information in the x86_spectrev2_mitigation value.
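 *
 * As a rough illustration of that default selection (a sketch only; the real
 * decision is made during the cpuid security scan and weighs more conditions
 * than are shown here, and "fset" and "v2mit" are simply placeholder names for
 * the feature set and the chosen value):
 *
 *	if (x86_disable_spectrev2 != 0)
 *		v2mit = X86_SPECTREV2_DISABLED;
 *	else if (is_x86_feature(fset, X86FSET_AUTO_IBRS))
 *		v2mit = X86_SPECTREV2_AUTO_IBRS;
 *	else if (is_x86_feature(fset, X86FSET_IBRS_ALL))
 *		v2mit = X86_SPECTREV2_ENHANCED_IBRS;
 *	else
 *		v2mit = X86_SPECTREV2_RETPOLINE;
 *
 * with the result recorded in x86_spectrev2_mitigation.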
1250 * This determination will be re-evaluated after a microcode update as well,
1251 * though it is expected that microcode updates will not take away features.
1252 * This may mean that a system with late loaded microcode may not end up in the
1253 * optimal configuration (though this should be rare).
1254 *
1255 * Currently we do not build kmdb with retpolines or perform any additional side
1256 * channel security mitigations for it. One complication with kmdb is that it
1257 * requires its own retpoline thunks and it would need to adjust itself based on
1258 * what the kernel does. The threat model of kmdb is more limited and therefore
1259 * it may make more sense to investigate using prediction barriers as the whole
1260 * system is only executing a single instruction at a time while in kmdb.
1261 *
1262 * Branch History Injection (BHI)
1263 *
1264 * BHI is a specific form of SPECTREv2 where an attacker may manipulate branch
1265 * history before transitioning from user to supervisor mode (or from VMX
1266 * non-root/guest to root mode). The attacker can then exploit certain
1267 * compiler-generated code-sequences ("gadgets") to disclose information from
1268 * other contexts or domains. Recent (late-2023/early-2024) research in
1269 * object code analysis discovered many more potential gadgets than what was
1270 * initially reported (which previously was confined to Linux use of
1271 * unprivileged eBPF).
1272 *
1273 * The BHI threat doesn't exist in processors that predate eIBRS, or in AMD
1274 * ones. Some eIBRS processors have the ability to disable branch history in
1275 * certain (but not all) cases using an MSR write. eIBRS processors that don't
1276 * have the ability to disable must use a software sequence to scrub the
1277 * branch history buffer.
1278 *
1279 * BHI_DIS_S (the aforementioned MSR) protects ring 0 from ring 3 (VMX guest
1280 * or VMX root). It does not protect different user processes from each other,
1281 * or ring 3 VMX guest from ring 3 VMX root or vice versa.
1282 *
1283 * The BHI clearing sequence prevents user code from exploiting kernel gadgets,
1284 * and prevents user A from exploiting user B's gadgets.
1285 *
1286 * SMEP and eIBRS are a continuing defense-in-depth measure protecting the
1287 * kernel.
1288 *
1289 * SPECTRE v1, v4
1290 *
1291 * The v1 and v4 variants of spectre are not currently mitigated in the
1292 * system and require other classes of changes to occur in the code.
1293 *
1294 * SPECTRE v1 (SWAPGS VARIANT)
1295 *
1296 * The class of Spectre v1 vulnerabilities isn't all about bounds checks, but
1297 * can generally affect any branch-dependent code. The swapgs issue is one
1298 * variant of this. If we are coming in from userspace, we can have code like
1299 * this:
1300 *
1301 *	cmpw	$KCS_SEL, REGOFF_CS(%rsp)
1302 *	je	1f
1303 *	movq	$0, REGOFF_SAVFP(%rsp)
1304 *	swapgs
1305 *	1:
1306 *	movq	%gs:CPU_THREAD, %rax
1307 *
1308 * If an attacker can cause a mis-speculation of the branch here, we could skip
1309 * the needed swapgs, and use the /user/ %gsbase as the base of the %gs-based
1310 * load. If subsequent code can act as the usual Spectre cache gadget, this
1311 * would potentially allow KPTI bypass. To fix this, we need an lfence prior to
1312 * any use of the %gs override.
1313 *
1314 * The other case is also an issue: if we're coming into a trap from kernel
1315 * space, we could mis-speculate and swapgs the user %gsbase back in prior to
1316 * using it. AMD systems are not vulnerable to this version, as a swapgs is
1317 * serializing with respect to subsequent uses.
But as AMD /does/ need the other 1318 * case, and the fix is the same in both cases (an lfence at the branch target 1319 * 1: in this example), we'll just do it unconditionally. 1320 * 1321 * Note that we don't enable user-space "wrgsbase" via CR4_FSGSBASE, making it 1322 * harder for user-space to actually set a useful %gsbase value: although it's 1323 * not clear, it might still be feasible via lwp_setprivate(), though, so we 1324 * mitigate anyway. 1325 * 1326 * MELTDOWN 1327 * 1328 * Meltdown, or spectre v3, allowed a user process to read any data in their 1329 * address space regardless of whether or not the page tables in question 1330 * allowed the user to have the ability to read them. The solution to meltdown 1331 * is kernel page table isolation. In this world, there are two page tables that 1332 * are used for a process, one in user land and one in the kernel. To implement 1333 * this we use per-CPU page tables and switch between the user and kernel 1334 * variants when entering and exiting the kernel. For more information about 1335 * this process and how the trampolines work, please see the big theory 1336 * statements and additional comments in: 1337 * 1338 * - uts/i86pc/ml/kpti_trampolines.s 1339 * - uts/i86pc/vm/hat_i86.c 1340 * 1341 * While Meltdown only impacted Intel systems and there are also Intel systems 1342 * that have Meltdown fixed (called Rogue Data Cache Load), we always have 1343 * kernel page table isolation enabled. While this may at first seem weird, an 1344 * important thing to remember is that you can't speculatively read an address 1345 * if it's never in your page table at all. Having user processes without kernel 1346 * pages present provides us with an important layer of defense in the kernel 1347 * against any other side channel attacks that exist and have yet to be 1348 * discovered. As such, kernel page table isolation (KPTI) is always enabled by 1349 * default, no matter the x86 system. 1350 * 1351 * L1 TERMINAL FAULT 1352 * 1353 * L1 Terminal Fault (L1TF) takes advantage of an issue in how speculative 1354 * execution uses page table entries. Effectively, it is two different problems. 1355 * The first is that it ignores the not present bit in the page table entries 1356 * when performing speculative execution. This means that something can 1357 * speculatively read the listed physical address if it's present in the L1 1358 * cache under certain conditions (see Intel's documentation for the full set of 1359 * conditions). Secondly, this can be used to bypass hardware virtualization 1360 * extended page tables (EPT) that are part of Intel's hardware virtual machine 1361 * instructions. 1362 * 1363 * For the non-hardware virtualized case, this is relatively easy to deal with. 1364 * We must make sure that all unmapped pages have an address of zero. This means 1365 * that they could read the first 4k of physical memory; however, we never use 1366 * that first page in the operating system and always skip putting it in our 1367 * memory map, even if firmware tells us we can use it in our memory map. While 1368 * other systems try to put extra metadata in the address and reserved bits, 1369 * which led to this being problematic in those cases, we do not. 1370 * 1371 * For hardware virtual machines things are more complicated. Because they can 1372 * construct their own page tables, it isn't hard for them to perform this 1373 * attack against any physical address. The one wrinkle is that this physical 1374 * address must be in the L1 data cache. 
Thus Intel added an MSR that we can use
1375 * to flush the L1 data cache. We wrap this up in the function
1376 * spec_uarch_flush(). This function is also used in the mitigation of
1377 * microarchitectural data sampling (MDS) discussed later on. Kernel based
1378 * hypervisors such as KVM or bhyve are responsible for performing this before
1379 * entering the guest.
1380 *
1381 * Because this attack takes place in the L1 cache, there's another wrinkle
1382 * here. The L1 cache is shared between all logical CPUs in a core in most Intel
1383 * designs. This means that when a thread enters a hardware virtualized context
1384 * and flushes the L1 data cache, the other thread on the processor may then go
1385 * ahead and put new data in it that can be potentially attacked. While one
1386 * solution is to disable SMT on the system, another option that is available is
1387 * to use a feature for hardware virtualization called 'SMT exclusion'. This
1388 * goes through and makes sure that if an HVM is being scheduled on one thread,
1389 * then whatever runs on the other thread is from the same hardware virtual
1390 * machine. If an interrupt comes in or the guest exits to the broader system,
1391 * then the other SMT thread will be kicked out.
1392 *
1393 * L1TF can be fully mitigated by hardware. If the RDCL_NO feature is set in the
1394 * architecture capabilities MSR (MSR_IA32_ARCH_CAPABILITIES), then we will not
1395 * perform L1TF related mitigations.
1396 *
1397 * MICROARCHITECTURAL DATA SAMPLING
1398 *
1399 * Microarchitectural data sampling (MDS) is a combination of four discrete
1400 * vulnerabilities that are similar issues affecting various parts of the CPU's
1401 * microarchitectural implementation around load, store, and fill buffers.
1402 * Specifically it is made up of the following subcomponents:
1403 *
1404 * 1. Microarchitectural Store Buffer Data Sampling (MSBDS)
1405 * 2. Microarchitectural Fill Buffer Data Sampling (MFBDS)
1406 * 3. Microarchitectural Load Port Data Sampling (MLPDS)
1407 * 4. Microarchitectural Data Sampling Uncacheable Memory (MDSUM)
1408 *
1409 * To begin addressing these, Intel has introduced another feature in microcode
1410 * called MD_CLEAR. This changes the verw instruction to operate in a different
1411 * way. This allows us to execute the verw instruction in a particular way to
1412 * flush the state of the affected parts. The L1TF L1D flush mechanism is also
1413 * updated when this microcode is present to flush this state.
1414 *
1415 * Primarily we need to flush this state whenever we transition from the kernel
1416 * to a less privileged context such as user mode or an HVM guest. MSBDS is a
1417 * little bit different. Here the structures are statically sized when a logical
1418 * CPU is in use and resized when it goes to sleep. Therefore, we also need to
1419 * flush the microarchitectural state before the CPU goes idle by calling hlt,
1420 * mwait, or another ACPI method. To perform these flushes, we call
1421 * x86_md_clear() at all of these transition points.
1422 *
1423 * If hardware enumerates RDCL_NO, indicating that it is not vulnerable to L1TF,
1424 * then we change the spec_uarch_flush() function to point to x86_md_clear(). If
1425 * MDS_NO has been set, then this is fully mitigated and x86_md_clear() becomes
1426 * a no-op.
1427 *
1428 * Unfortunately, with this issue hyperthreading rears its ugly head. In
1429 * particular, everything we've discussed above is only valid for a single
1430 * thread executing on a core.
In the case where you have hyper-threading 1431 * present, this attack can be performed between threads. The theoretical fix 1432 * for this is to ensure that both threads are always in the same security 1433 * domain. This means that they are executing in the same ring and mutually 1434 * trust each other. Practically speaking, this would mean that a system call 1435 * would have to issue an inter-processor interrupt (IPI) to the other thread. 1436 * Rather than implement this, we recommend that one disables hyper-threading 1437 * through the use of psradm -aS. 1438 * 1439 * TSX ASYNCHRONOUS ABORT 1440 * 1441 * TSX Asynchronous Abort (TAA) is another side-channel vulnerability that 1442 * behaves like MDS, but leverages Intel's transactional instructions as another 1443 * vector. Effectively, when a transaction hits one of these cases (unmapped 1444 * page, various cache snoop activity, etc.) then the same data can be exposed 1445 * as in the case of MDS. This means that you can attack your twin. 1446 * 1447 * Intel has described that there are two different ways that we can mitigate 1448 * this problem on affected processors: 1449 * 1450 * 1) We can use the same techniques used to deal with MDS. Flushing the 1451 * microarchitectural buffers and disabling hyperthreading will mitigate 1452 * this in the same way. 1453 * 1454 * 2) Using microcode to disable TSX. 1455 * 1456 * Now, most processors that are subject to MDS (as in they don't have MDS_NO in 1457 * the IA32_ARCH_CAPABILITIES MSR) will not receive microcode to disable TSX. 1458 * That's OK as we're already doing all such mitigations. On the other hand, 1459 * processors with MDS_NO are all supposed to receive microcode updates that 1460 * enumerate support for disabling TSX. In general, we'd rather use this method 1461 * when available as it doesn't require disabling hyperthreading to be 1462 * effective. Currently we basically are relying on microcode for processors 1463 * that enumerate MDS_NO. 1464 * 1465 * Another MDS-variant in a few select Intel Atom CPUs is Register File Data 1466 * Sampling: RFDS. This allows an attacker to sample values that were in any 1467 * of integer, floating point, or vector registers. This was discovered by 1468 * Intel during internal validation work. The existence of the RFDS_NO 1469 * capability, or the LACK of a RFDS_CLEAR capability, means we do not have to 1470 * act. Intel has said some CPU models immune to RFDS MAY NOT enumerate 1471 * RFDS_NO. If RFDS_NO is not set, but RFDS_CLEAR is, we must set x86_md_clear, 1472 * and make sure it's using VERW. Unlike MDS, RFDS can't be helped by the 1473 * MSR that L1D uses. 1474 * 1475 * The microcode features are enumerated as part of the IA32_ARCH_CAPABILITIES. 1476 * When bit 7 (IA32_ARCH_CAP_TSX_CTRL) is present, then we are given two 1477 * different powers. The first allows us to cause all transactions to 1478 * immediately abort. The second gives us a means of disabling TSX completely, 1479 * which includes removing it from cpuid. If we have support for this in 1480 * microcode during the first cpuid pass, then we'll disable TSX completely such 1481 * that user land never has a chance to observe the bit. However, if we are late 1482 * loading the microcode, then we must use the functionality to cause 1483 * transactions to automatically abort. This is necessary for user land's sake. 1484 * Once a program sees a cpuid bit, it must not be taken away. 1485 * 1486 * We track whether or not we should do this based on what cpuid pass we're in. 
1487 * Whenever we hit cpuid_scan_security() on the boot CPU and we're still on pass 1488 * 1 of the cpuid logic, then we can completely turn off TSX. Notably this 1489 * should happen twice. Once in the normal cpuid_pass_basic() code and then a 1490 * second time after we do the initial microcode update. As a result we need to 1491 * be careful in cpuid_apply_tsx() to only use the MSR if we've loaded a 1492 * suitable microcode on the current CPU (which happens prior to 1493 * cpuid_pass_ucode()). 1494 * 1495 * If TAA has been fixed, then it will be enumerated in IA32_ARCH_CAPABILITIES 1496 * as TAA_NO. In such a case, we will still disable TSX: it's proven to be an 1497 * unfortunate feature in a number of ways, and taking the opportunity to 1498 * finally be able to turn it off is likely to be of benefit in the future. 1499 * 1500 * SUMMARY 1501 * 1502 * The following table attempts to summarize the mitigations for various issues 1503 * and what's done in various places: 1504 * 1505 * - Spectre v1: Not currently mitigated 1506 * - swapgs: lfences after swapgs paths 1507 * - Spectre v2: Retpolines/RSB Stuffing or eIBRS/AIBRS if HW support 1508 * - Meltdown: Kernel Page Table Isolation 1509 * - Spectre v3a: Updated CPU microcode 1510 * - Spectre v4: Not currently mitigated 1511 * - SpectreRSB: SMEP and RSB Stuffing 1512 * - L1TF: spec_uarch_flush, SMT exclusion, requires microcode 1513 * - MDS: x86_md_clear, requires microcode, disabling SMT 1514 * - TAA: x86_md_clear and disabling SMT OR microcode and disabling TSX 1515 * - RFDS: microcode with x86_md_clear if RFDS_CLEAR set and RFDS_NO not. 1516 * - BHI: software sequence, and use of BHI_DIS_S if microcode has it. 1517 * 1518 * The following table indicates the x86 feature set bits that indicate that a 1519 * given problem has been solved or a notable feature is present: 1520 * 1521 * - RDCL_NO: Meltdown, L1TF, MSBDS subset of MDS 1522 * - MDS_NO: All forms of MDS 1523 * - TAA_NO: TAA 1524 * - RFDS_NO: RFDS 1525 * - BHI_NO: BHI 1526 */ 1527 1528 #include <sys/types.h> 1529 #include <sys/archsystm.h> 1530 #include <sys/x86_archext.h> 1531 #include <sys/kmem.h> 1532 #include <sys/systm.h> 1533 #include <sys/cmn_err.h> 1534 #include <sys/sunddi.h> 1535 #include <sys/sunndi.h> 1536 #include <sys/cpuvar.h> 1537 #include <sys/processor.h> 1538 #include <sys/stdbool.h> 1539 #include <sys/sysmacros.h> 1540 #include <sys/pg.h> 1541 #include <sys/fp.h> 1542 #include <sys/controlregs.h> 1543 #include <sys/bitmap.h> 1544 #include <sys/auxv_386.h> 1545 #include <sys/memnode.h> 1546 #include <sys/pci_cfgspace.h> 1547 #include <sys/comm_page.h> 1548 #include <sys/mach_mmu.h> 1549 #include <sys/ucode.h> 1550 #include <sys/tsc.h> 1551 #include <sys/kobj.h> 1552 #include <sys/asm_misc.h> 1553 #include <sys/bitmap.h> 1554 1555 #ifdef __xpv 1556 #include <sys/hypervisor.h> 1557 #else 1558 #include <sys/ontrap.h> 1559 #endif 1560 1561 uint_t x86_vendor = X86_VENDOR_IntelClone; 1562 uint_t x86_type = X86_TYPE_OTHER; 1563 uint_t x86_clflush_size = 0; 1564 1565 #if defined(__xpv) 1566 int x86_use_pcid = 0; 1567 int x86_use_invpcid = 0; 1568 #else 1569 int x86_use_pcid = -1; 1570 int x86_use_invpcid = -1; 1571 #endif 1572 1573 typedef enum { 1574 X86_SPECTREV2_RETPOLINE, 1575 X86_SPECTREV2_ENHANCED_IBRS, 1576 X86_SPECTREV2_AUTO_IBRS, 1577 X86_SPECTREV2_DISABLED 1578 } x86_spectrev2_mitigation_t; 1579 1580 uint_t x86_disable_spectrev2 = 0; 1581 static x86_spectrev2_mitigation_t x86_spectrev2_mitigation = 1582 X86_SPECTREV2_RETPOLINE; 1583 1584 /* 1585 * The 
mitigation status for TAA: 1586 * X86_TAA_NOTHING -- no mitigation available for TAA side-channels 1587 * X86_TAA_DISABLED -- mitigation disabled via x86_disable_taa 1588 * X86_TAA_MD_CLEAR -- MDS mitigation also suffices for TAA 1589 * X86_TAA_TSX_FORCE_ABORT -- transactions are forced to abort 1590 * X86_TAA_TSX_DISABLE -- force abort transactions and hide from CPUID 1591 * X86_TAA_HW_MITIGATED -- TSX potentially active but H/W not TAA-vulnerable 1592 */ 1593 typedef enum { 1594 X86_TAA_NOTHING, 1595 X86_TAA_DISABLED, 1596 X86_TAA_MD_CLEAR, 1597 X86_TAA_TSX_FORCE_ABORT, 1598 X86_TAA_TSX_DISABLE, 1599 X86_TAA_HW_MITIGATED 1600 } x86_taa_mitigation_t; 1601 1602 uint_t x86_disable_taa = 0; 1603 static x86_taa_mitigation_t x86_taa_mitigation = X86_TAA_NOTHING; 1604 1605 uint_t pentiumpro_bug4046376; 1606 1607 uchar_t x86_featureset[BT_SIZEOFMAP(NUM_X86_FEATURES)]; 1608 1609 static char *x86_feature_names[NUM_X86_FEATURES] = { 1610 "lgpg", 1611 "tsc", 1612 "msr", 1613 "mtrr", 1614 "pge", 1615 "de", 1616 "cmov", 1617 "mmx", 1618 "mca", 1619 "pae", 1620 "cv8", 1621 "pat", 1622 "sep", 1623 "sse", 1624 "sse2", 1625 "htt", 1626 "asysc", 1627 "nx", 1628 "sse3", 1629 "cx16", 1630 "cmp", 1631 "tscp", 1632 "mwait", 1633 "sse4a", 1634 "cpuid", 1635 "ssse3", 1636 "sse4_1", 1637 "sse4_2", 1638 "1gpg", 1639 "clfsh", 1640 "64", 1641 "aes", 1642 "pclmulqdq", 1643 "xsave", 1644 "avx", 1645 "vmx", 1646 "svm", 1647 "topoext", 1648 "f16c", 1649 "rdrand", 1650 "x2apic", 1651 "avx2", 1652 "bmi1", 1653 "bmi2", 1654 "fma", 1655 "smep", 1656 "smap", 1657 "adx", 1658 "rdseed", 1659 "mpx", 1660 "avx512f", 1661 "avx512dq", 1662 "avx512pf", 1663 "avx512er", 1664 "avx512cd", 1665 "avx512bw", 1666 "avx512vl", 1667 "avx512fma", 1668 "avx512vbmi", 1669 "avx512_vpopcntdq", 1670 "avx512_4vnniw", 1671 "avx512_4fmaps", 1672 "xsaveopt", 1673 "xsavec", 1674 "xsaves", 1675 "sha", 1676 "umip", 1677 "pku", 1678 "ospke", 1679 "pcid", 1680 "invpcid", 1681 "ibrs", 1682 "ibpb", 1683 "stibp", 1684 "ssbd", 1685 "ssbd_virt", 1686 "rdcl_no", 1687 "ibrs_all", 1688 "rsba", 1689 "ssb_no", 1690 "stibp_all", 1691 "flush_cmd", 1692 "l1d_vmentry_no", 1693 "fsgsbase", 1694 "clflushopt", 1695 "clwb", 1696 "monitorx", 1697 "clzero", 1698 "xop", 1699 "fma4", 1700 "tbm", 1701 "avx512_vnni", 1702 "amd_pcec", 1703 "md_clear", 1704 "mds_no", 1705 "core_thermal", 1706 "pkg_thermal", 1707 "tsx_ctrl", 1708 "taa_no", 1709 "ppin", 1710 "vaes", 1711 "vpclmulqdq", 1712 "lfence_serializing", 1713 "gfni", 1714 "avx512_vp2intersect", 1715 "avx512_bitalg", 1716 "avx512_vbmi2", 1717 "avx512_bf16", 1718 "auto_ibrs", 1719 "rfds_no", 1720 "rfds_clear", 1721 "pbrsb_no", 1722 "bhi_no", 1723 "bhi_clear" 1724 }; 1725 1726 boolean_t 1727 is_x86_feature(void *featureset, uint_t feature) 1728 { 1729 ASSERT(feature < NUM_X86_FEATURES); 1730 return (BT_TEST((ulong_t *)featureset, feature)); 1731 } 1732 1733 void 1734 add_x86_feature(void *featureset, uint_t feature) 1735 { 1736 ASSERT(feature < NUM_X86_FEATURES); 1737 BT_SET((ulong_t *)featureset, feature); 1738 } 1739 1740 void 1741 remove_x86_feature(void *featureset, uint_t feature) 1742 { 1743 ASSERT(feature < NUM_X86_FEATURES); 1744 BT_CLEAR((ulong_t *)featureset, feature); 1745 } 1746 1747 boolean_t 1748 compare_x86_featureset(void *setA, void *setB) 1749 { 1750 /* 1751 * We assume that the unused bits of the bitmap are always zero. 
1752 */ 1753 if (memcmp(setA, setB, BT_SIZEOFMAP(NUM_X86_FEATURES)) == 0) { 1754 return (B_TRUE); 1755 } else { 1756 return (B_FALSE); 1757 } 1758 } 1759 1760 void 1761 print_x86_featureset(void *featureset) 1762 { 1763 uint_t i; 1764 1765 for (i = 0; i < NUM_X86_FEATURES; i++) { 1766 if (is_x86_feature(featureset, i)) { 1767 cmn_err(CE_CONT, "?x86_feature: %s\n", 1768 x86_feature_names[i]); 1769 } 1770 } 1771 } 1772 1773 /* Note: This is the maximum size for the CPU, not the size of the structure. */ 1774 static size_t xsave_state_size = 0; 1775 uint64_t xsave_bv_all = (XFEATURE_LEGACY_FP | XFEATURE_SSE); 1776 boolean_t xsave_force_disable = B_FALSE; 1777 extern int disable_smap; 1778 1779 /* 1780 * This is set to platform type we are running on. 1781 */ 1782 static int platform_type = -1; 1783 1784 #if !defined(__xpv) 1785 /* 1786 * Variable to patch if hypervisor platform detection needs to be 1787 * disabled (e.g. platform_type will always be HW_NATIVE if this is 0). 1788 */ 1789 int enable_platform_detection = 1; 1790 #endif 1791 1792 /* 1793 * monitor/mwait info. 1794 * 1795 * size_actual and buf_actual are the real address and size allocated to get 1796 * proper mwait_buf alignement. buf_actual and size_actual should be passed 1797 * to kmem_free(). Currently kmem_alloc() and mwait happen to both use 1798 * processor cache-line alignment, but this is not guarantied in the furture. 1799 */ 1800 struct mwait_info { 1801 size_t mon_min; /* min size to avoid missed wakeups */ 1802 size_t mon_max; /* size to avoid false wakeups */ 1803 size_t size_actual; /* size actually allocated */ 1804 void *buf_actual; /* memory actually allocated */ 1805 uint32_t support; /* processor support of monitor/mwait */ 1806 }; 1807 1808 /* 1809 * xsave/xrestor info. 1810 * 1811 * This structure contains HW feature bits and the size of the xsave save area. 1812 * Note: the kernel declares a fixed size (AVX_XSAVE_SIZE) structure 1813 * (xsave_state) to describe the xsave layout. However, at runtime the 1814 * per-lwp xsave area is dynamically allocated based on xsav_max_size. The 1815 * xsave_state structure simply represents the legacy layout of the beginning 1816 * of the xsave area. 1817 */ 1818 struct xsave_info { 1819 uint32_t xsav_hw_features_low; /* Supported HW features */ 1820 uint32_t xsav_hw_features_high; /* Supported HW features */ 1821 size_t xsav_max_size; /* max size save area for HW features */ 1822 size_t ymm_size; /* AVX: size of ymm save area */ 1823 size_t ymm_offset; /* AVX: offset for ymm save area */ 1824 size_t bndregs_size; /* MPX: size of bndregs save area */ 1825 size_t bndregs_offset; /* MPX: offset for bndregs save area */ 1826 size_t bndcsr_size; /* MPX: size of bndcsr save area */ 1827 size_t bndcsr_offset; /* MPX: offset for bndcsr save area */ 1828 size_t opmask_size; /* AVX512: size of opmask save */ 1829 size_t opmask_offset; /* AVX512: offset for opmask save */ 1830 size_t zmmlo_size; /* AVX512: size of zmm 256 save */ 1831 size_t zmmlo_offset; /* AVX512: offset for zmm 256 save */ 1832 size_t zmmhi_size; /* AVX512: size of zmm hi reg save */ 1833 size_t zmmhi_offset; /* AVX512: offset for zmm hi reg save */ 1834 size_t pkru_size; /* PKRU size */ 1835 size_t pkru_offset; /* PKRU offset */ 1836 }; 1837 1838 1839 /* 1840 * These constants determine how many of the elements of the 1841 * cpuid we cache in the cpuid_info data structure; the 1842 * remaining elements are accessible via the cpuid instruction. 1843 */ 1844 1845 #define NMAX_CPI_STD 8 /* eax = 0 .. 
7 */ 1846 #define NMAX_CPI_EXTD 0x22 /* eax = 0x80000000 .. 0x80000021 */ 1847 #define NMAX_CPI_TOPO 0x10 /* Sanity check on leaf 8X26, 1F */ 1848 1849 /* 1850 * See the big theory statement for a more detailed explanation of what some of 1851 * these members mean. 1852 */ 1853 struct cpuid_info { 1854 uint_t cpi_pass; /* last pass completed */ 1855 /* 1856 * standard function information 1857 */ 1858 uint_t cpi_maxeax; /* fn 0: %eax */ 1859 char cpi_vendorstr[13]; /* fn 0: %ebx:%ecx:%edx */ 1860 uint_t cpi_vendor; /* enum of cpi_vendorstr */ 1861 1862 uint_t cpi_family; /* fn 1: extended family */ 1863 uint_t cpi_model; /* fn 1: extended model */ 1864 uint_t cpi_step; /* fn 1: stepping */ 1865 chipid_t cpi_chipid; /* fn 1: %ebx: Intel: chip # */ 1866 /* AMD: package/socket # */ 1867 uint_t cpi_brandid; /* fn 1: %ebx: brand ID */ 1868 int cpi_clogid; /* fn 1: %ebx: thread # */ 1869 uint_t cpi_ncpu_per_chip; /* fn 1: %ebx: logical cpu count */ 1870 uint8_t cpi_cacheinfo[16]; /* fn 2: intel-style cache desc */ 1871 uint_t cpi_ncache; /* fn 2: number of elements */ 1872 uint_t cpi_ncpu_shr_last_cache; /* fn 4: %eax: ncpus sharing cache */ 1873 id_t cpi_last_lvl_cacheid; /* fn 4: %eax: derived cache id */ 1874 uint_t cpi_cache_leaf_size; /* Number of cache elements */ 1875 /* Intel fn: 4, AMD fn: 8000001d */ 1876 struct cpuid_regs **cpi_cache_leaves; /* Actual leaves from above */ 1877 struct cpuid_regs cpi_std[NMAX_CPI_STD]; /* 0 .. 7 */ 1878 struct cpuid_regs cpi_sub7[2]; /* Leaf 7, sub-leaves 1-2 */ 1879 /* 1880 * extended function information 1881 */ 1882 uint_t cpi_xmaxeax; /* fn 0x80000000: %eax */ 1883 char cpi_brandstr[49]; /* fn 0x8000000[234] */ 1884 uint8_t cpi_pabits; /* fn 0x80000006: %eax */ 1885 uint8_t cpi_vabits; /* fn 0x80000006: %eax */ 1886 uint8_t cpi_fp_amd_save; /* AMD: FP error pointer save rqd. */ 1887 struct cpuid_regs cpi_extd[NMAX_CPI_EXTD]; /* 0x800000XX */ 1888 1889 id_t cpi_coreid; /* same coreid => strands share core */ 1890 int cpi_pkgcoreid; /* core number within single package */ 1891 uint_t cpi_ncore_per_chip; /* AMD: fn 0x80000008: %ecx[7-0] */ 1892 /* Intel: fn 4: %eax[31-26] */ 1893 1894 /* 1895 * These values represent the number of bits that are required to store 1896 * information about the number of cores and threads. 1897 */ 1898 uint_t cpi_ncore_bits; 1899 uint_t cpi_nthread_bits; 1900 /* 1901 * supported feature information 1902 */ 1903 uint32_t cpi_support[6]; 1904 #define STD_EDX_FEATURES 0 1905 #define AMD_EDX_FEATURES 1 1906 #define TM_EDX_FEATURES 2 1907 #define STD_ECX_FEATURES 3 1908 #define AMD_ECX_FEATURES 4 1909 #define STD_EBX_FEATURES 5 1910 /* 1911 * Synthesized information, where known. 1912 */ 1913 x86_chiprev_t cpi_chiprev; /* See X86_CHIPREV_* in x86_archext.h */ 1914 const char *cpi_chiprevstr; /* May be NULL if chiprev unknown */ 1915 uint32_t cpi_socket; /* Chip package/socket type */ 1916 x86_uarchrev_t cpi_uarchrev; /* Microarchitecture and revision */ 1917 1918 struct mwait_info cpi_mwait; /* fn 5: monitor/mwait info */ 1919 uint32_t cpi_apicid; 1920 uint_t cpi_procnodeid; /* AMD: nodeID on HT, Intel: chipid */ 1921 uint_t cpi_procnodes_per_pkg; /* AMD: # of nodes in the package */ 1922 /* Intel: 1 */ 1923 uint_t cpi_compunitid; /* AMD: ComputeUnit ID, Intel: coreid */ 1924 uint_t cpi_cores_per_compunit; /* AMD: # of cores in the ComputeUnit */ 1925 1926 struct xsave_info cpi_xsave; /* fn D: xsave/xrestor info */ 1927 1928 /* 1929 * AMD and Intel extended topology information. 
Leaf 8X26 (AMD) and 1930 * eventually leaf 0x1F (Intel). 1931 */ 1932 uint_t cpi_topo_nleaves; 1933 struct cpuid_regs cpi_topo[NMAX_CPI_TOPO]; 1934 }; 1935 1936 1937 static struct cpuid_info cpuid_info0; 1938 1939 /* 1940 * These bit fields are defined by the Intel Application Note AP-485 1941 * "Intel Processor Identification and the CPUID Instruction" 1942 */ 1943 #define CPI_FAMILY_XTD(cpi) BITX((cpi)->cpi_std[1].cp_eax, 27, 20) 1944 #define CPI_MODEL_XTD(cpi) BITX((cpi)->cpi_std[1].cp_eax, 19, 16) 1945 #define CPI_TYPE(cpi) BITX((cpi)->cpi_std[1].cp_eax, 13, 12) 1946 #define CPI_FAMILY(cpi) BITX((cpi)->cpi_std[1].cp_eax, 11, 8) 1947 #define CPI_STEP(cpi) BITX((cpi)->cpi_std[1].cp_eax, 3, 0) 1948 #define CPI_MODEL(cpi) BITX((cpi)->cpi_std[1].cp_eax, 7, 4) 1949 1950 #define CPI_FEATURES_EDX(cpi) ((cpi)->cpi_std[1].cp_edx) 1951 #define CPI_FEATURES_ECX(cpi) ((cpi)->cpi_std[1].cp_ecx) 1952 #define CPI_FEATURES_XTD_EDX(cpi) ((cpi)->cpi_extd[1].cp_edx) 1953 #define CPI_FEATURES_XTD_ECX(cpi) ((cpi)->cpi_extd[1].cp_ecx) 1954 #define CPI_FEATURES_7_0_EBX(cpi) ((cpi)->cpi_std[7].cp_ebx) 1955 #define CPI_FEATURES_7_0_ECX(cpi) ((cpi)->cpi_std[7].cp_ecx) 1956 #define CPI_FEATURES_7_0_EDX(cpi) ((cpi)->cpi_std[7].cp_edx) 1957 #define CPI_FEATURES_7_1_EAX(cpi) ((cpi)->cpi_sub7[0].cp_eax) 1958 #define CPI_FEATURES_7_2_EDX(cpi) ((cpi)->cpi_sub7[1].cp_edx) 1959 1960 #define CPI_BRANDID(cpi) BITX((cpi)->cpi_std[1].cp_ebx, 7, 0) 1961 #define CPI_CHUNKS(cpi) BITX((cpi)->cpi_std[1].cp_ebx, 15, 7) 1962 #define CPI_CPU_COUNT(cpi) BITX((cpi)->cpi_std[1].cp_ebx, 23, 16) 1963 #define CPI_APIC_ID(cpi) BITX((cpi)->cpi_std[1].cp_ebx, 31, 24) 1964 1965 #define CPI_MAXEAX_MAX 0x100 /* sanity control */ 1966 #define CPI_XMAXEAX_MAX 0x80000100 1967 #define CPI_FN4_ECX_MAX 0x20 /* sanity: max fn 4 levels */ 1968 #define CPI_FNB_ECX_MAX 0x20 /* sanity: max fn B levels */ 1969 1970 /* 1971 * Function 4 (Deterministic Cache Parameters) macros 1972 * Defined by Intel Application Note AP-485 1973 */ 1974 #define CPI_NUM_CORES(regs) BITX((regs)->cp_eax, 31, 26) 1975 #define CPI_NTHR_SHR_CACHE(regs) BITX((regs)->cp_eax, 25, 14) 1976 #define CPI_FULL_ASSOC_CACHE(regs) BITX((regs)->cp_eax, 9, 9) 1977 #define CPI_SELF_INIT_CACHE(regs) BITX((regs)->cp_eax, 8, 8) 1978 #define CPI_CACHE_LVL(regs) BITX((regs)->cp_eax, 7, 5) 1979 #define CPI_CACHE_TYPE(regs) BITX((regs)->cp_eax, 4, 0) 1980 #define CPI_CACHE_TYPE_DONE 0 1981 #define CPI_CACHE_TYPE_DATA 1 1982 #define CPI_CACHE_TYPE_INSTR 2 1983 #define CPI_CACHE_TYPE_UNIFIED 3 1984 #define CPI_CPU_LEVEL_TYPE(regs) BITX((regs)->cp_ecx, 15, 8) 1985 1986 #define CPI_CACHE_WAYS(regs) BITX((regs)->cp_ebx, 31, 22) 1987 #define CPI_CACHE_PARTS(regs) BITX((regs)->cp_ebx, 21, 12) 1988 #define CPI_CACHE_COH_LN_SZ(regs) BITX((regs)->cp_ebx, 11, 0) 1989 1990 #define CPI_CACHE_SETS(regs) BITX((regs)->cp_ecx, 31, 0) 1991 1992 #define CPI_PREFCH_STRIDE(regs) BITX((regs)->cp_edx, 9, 0) 1993 1994 1995 /* 1996 * A couple of shorthand macros to identify "later" P6-family chips 1997 * like the Pentium M and Core. 
First, the "older" P6-based stuff 1998 * (loosely defined as "pre-Pentium-4"): 1999 * P6, PII, Mobile PII, PII Xeon, PIII, Mobile PIII, PIII Xeon 2000 */ 2001 #define IS_LEGACY_P6(cpi) ( \ 2002 cpi->cpi_family == 6 && \ 2003 (cpi->cpi_model == 1 || \ 2004 cpi->cpi_model == 3 || \ 2005 cpi->cpi_model == 5 || \ 2006 cpi->cpi_model == 6 || \ 2007 cpi->cpi_model == 7 || \ 2008 cpi->cpi_model == 8 || \ 2009 cpi->cpi_model == 0xA || \ 2010 cpi->cpi_model == 0xB) \ 2011 ) 2012 2013 /* A "new F6" is everything with family 6 that's not the above */ 2014 #define IS_NEW_F6(cpi) ((cpi->cpi_family == 6) && !IS_LEGACY_P6(cpi)) 2015 2016 /* Extended family/model support */ 2017 #define IS_EXTENDED_MODEL_INTEL(cpi) (cpi->cpi_family == 0x6 || \ 2018 cpi->cpi_family >= 0xf) 2019 2020 /* 2021 * Info for monitor/mwait idle loop. 2022 * 2023 * See cpuid section of "Intel 64 and IA-32 Architectures Software Developer's 2024 * Manual Volume 2A: Instruction Set Reference, A-M" #25366-022US, November 2025 * 2006. 2026 * See MONITOR/MWAIT section of "AMD64 Architecture Programmer's Manual 2027 * Documentation Updates" #33633, Rev 2.05, December 2006. 2028 */ 2029 #define MWAIT_SUPPORT (0x00000001) /* mwait supported */ 2030 #define MWAIT_EXTENSIONS (0x00000002) /* extenstion supported */ 2031 #define MWAIT_ECX_INT_ENABLE (0x00000004) /* ecx 1 extension supported */ 2032 #define MWAIT_SUPPORTED(cpi) ((cpi)->cpi_std[1].cp_ecx & CPUID_INTC_ECX_MON) 2033 #define MWAIT_INT_ENABLE(cpi) ((cpi)->cpi_std[5].cp_ecx & 0x2) 2034 #define MWAIT_EXTENSION(cpi) ((cpi)->cpi_std[5].cp_ecx & 0x1) 2035 #define MWAIT_SIZE_MIN(cpi) BITX((cpi)->cpi_std[5].cp_eax, 15, 0) 2036 #define MWAIT_SIZE_MAX(cpi) BITX((cpi)->cpi_std[5].cp_ebx, 15, 0) 2037 /* 2038 * Number of sub-cstates for a given c-state. 2039 */ 2040 #define MWAIT_NUM_SUBC_STATES(cpi, c_state) \ 2041 BITX((cpi)->cpi_std[5].cp_edx, c_state + 3, c_state) 2042 2043 /* 2044 * XSAVE leaf 0xD enumeration 2045 */ 2046 #define CPUID_LEAFD_2_YMM_OFFSET 576 2047 #define CPUID_LEAFD_2_YMM_SIZE 256 2048 2049 /* 2050 * Common extended leaf names to cut down on typos. 2051 */ 2052 #define CPUID_LEAF_EXT_0 0x80000000 2053 #define CPUID_LEAF_EXT_8 0x80000008 2054 #define CPUID_LEAF_EXT_1d 0x8000001d 2055 #define CPUID_LEAF_EXT_1e 0x8000001e 2056 #define CPUID_LEAF_EXT_21 0x80000021 2057 #define CPUID_LEAF_EXT_26 0x80000026 2058 2059 /* 2060 * Functions we consume from cpuid_subr.c; don't publish these in a header 2061 * file to try and keep people using the expected cpuid_* interfaces. 2062 */ 2063 extern uint32_t _cpuid_skt(uint_t, uint_t, uint_t, uint_t); 2064 extern const char *_cpuid_sktstr(uint_t, uint_t, uint_t, uint_t); 2065 extern x86_chiprev_t _cpuid_chiprev(uint_t, uint_t, uint_t, uint_t); 2066 extern const char *_cpuid_chiprevstr(uint_t, uint_t, uint_t, uint_t); 2067 extern x86_uarchrev_t _cpuid_uarchrev(uint_t, uint_t, uint_t, uint_t); 2068 extern uint_t _cpuid_vendorstr_to_vendorcode(char *); 2069 2070 /* 2071 * Apply up various platform-dependent restrictions where the 2072 * underlying platform restrictions mean the CPU can be marked 2073 * as less capable than its cpuid instruction would imply. 2074 */ 2075 #if defined(__xpv) 2076 static void 2077 platform_cpuid_mangle(uint_t vendor, uint32_t eax, struct cpuid_regs *cp) 2078 { 2079 switch (eax) { 2080 case 1: { 2081 uint32_t mcamask = DOMAIN_IS_INITDOMAIN(xen_info) ? 
2082 0 : CPUID_INTC_EDX_MCA; 2083 cp->cp_edx &= 2084 ~(mcamask | 2085 CPUID_INTC_EDX_PSE | 2086 CPUID_INTC_EDX_VME | CPUID_INTC_EDX_DE | 2087 CPUID_INTC_EDX_SEP | CPUID_INTC_EDX_MTRR | 2088 CPUID_INTC_EDX_PGE | CPUID_INTC_EDX_PAT | 2089 CPUID_AMD_EDX_SYSC | CPUID_INTC_EDX_SEP | 2090 CPUID_INTC_EDX_PSE36 | CPUID_INTC_EDX_HTT); 2091 break; 2092 } 2093 2094 case 0x80000001: 2095 cp->cp_edx &= 2096 ~(CPUID_AMD_EDX_PSE | 2097 CPUID_INTC_EDX_VME | CPUID_INTC_EDX_DE | 2098 CPUID_AMD_EDX_MTRR | CPUID_AMD_EDX_PGE | 2099 CPUID_AMD_EDX_PAT | CPUID_AMD_EDX_PSE36 | 2100 CPUID_AMD_EDX_SYSC | CPUID_INTC_EDX_SEP | 2101 CPUID_AMD_EDX_TSCP); 2102 cp->cp_ecx &= ~CPUID_AMD_ECX_CMP_LGCY; 2103 break; 2104 default: 2105 break; 2106 } 2107 2108 switch (vendor) { 2109 case X86_VENDOR_Intel: 2110 switch (eax) { 2111 case 4: 2112 /* 2113 * Zero out the (ncores-per-chip - 1) field 2114 */ 2115 cp->cp_eax &= 0x03fffffff; 2116 break; 2117 default: 2118 break; 2119 } 2120 break; 2121 case X86_VENDOR_AMD: 2122 case X86_VENDOR_HYGON: 2123 switch (eax) { 2124 2125 case 0x80000001: 2126 cp->cp_ecx &= ~CPUID_AMD_ECX_CR8D; 2127 break; 2128 2129 case CPUID_LEAF_EXT_8: 2130 /* 2131 * Zero out the (ncores-per-chip - 1) field 2132 */ 2133 cp->cp_ecx &= 0xffffff00; 2134 break; 2135 default: 2136 break; 2137 } 2138 break; 2139 default: 2140 break; 2141 } 2142 } 2143 #else 2144 #define platform_cpuid_mangle(vendor, eax, cp) /* nothing */ 2145 #endif 2146 2147 /* 2148 * Some undocumented ways of patching the results of the cpuid 2149 * instruction to permit running Solaris 10 on future cpus that 2150 * we don't currently support. Could be set to non-zero values 2151 * via settings in eeprom. 2152 */ 2153 2154 uint32_t cpuid_feature_ecx_include; 2155 uint32_t cpuid_feature_ecx_exclude; 2156 uint32_t cpuid_feature_edx_include; 2157 uint32_t cpuid_feature_edx_exclude; 2158 2159 /* 2160 * Allocate space for mcpu_cpi in the machcpu structure for all non-boot CPUs. 2161 */ 2162 void 2163 cpuid_alloc_space(cpu_t *cpu) 2164 { 2165 /* 2166 * By convention, cpu0 is the boot cpu, which is set up 2167 * before memory allocation is available. All other cpus get 2168 * their cpuid_info struct allocated here. 2169 */ 2170 ASSERT(cpu->cpu_id != 0); 2171 ASSERT(cpu->cpu_m.mcpu_cpi == NULL); 2172 cpu->cpu_m.mcpu_cpi = 2173 kmem_zalloc(sizeof (*cpu->cpu_m.mcpu_cpi), KM_SLEEP); 2174 } 2175 2176 void 2177 cpuid_free_space(cpu_t *cpu) 2178 { 2179 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 2180 int i; 2181 2182 ASSERT(cpi != NULL); 2183 ASSERT(cpi != &cpuid_info0); 2184 2185 /* 2186 * Free up any cache leaf related dynamic storage. The first entry was 2187 * cached from the standard cpuid storage, so we should not free it. 2188 */ 2189 for (i = 1; i < cpi->cpi_cache_leaf_size; i++) 2190 kmem_free(cpi->cpi_cache_leaves[i], sizeof (struct cpuid_regs)); 2191 if (cpi->cpi_cache_leaf_size > 0) 2192 kmem_free(cpi->cpi_cache_leaves, 2193 cpi->cpi_cache_leaf_size * sizeof (struct cpuid_regs *)); 2194 2195 kmem_free(cpi, sizeof (*cpi)); 2196 cpu->cpu_m.mcpu_cpi = NULL; 2197 } 2198 2199 #if !defined(__xpv) 2200 /* 2201 * Determine the type of the underlying platform. This is used to customize 2202 * initialization of various subsystems (e.g. TSC). determine_platform() must 2203 * only ever be called once to prevent two processors from seeing different 2204 * values of platform_type. 
Must be called before cpuid_pass_ident(), the 2205 * earliest consumer to execute; the identification pass will call 2206 * synth_amd_info() to compute the chiprev, which in turn calls get_hwenv(). 2207 */ 2208 void 2209 determine_platform(void) 2210 { 2211 struct cpuid_regs cp; 2212 uint32_t base; 2213 uint32_t regs[4]; 2214 char *hvstr = (char *)regs; 2215 2216 ASSERT(platform_type == -1); 2217 2218 platform_type = HW_NATIVE; 2219 2220 if (!enable_platform_detection) 2221 return; 2222 2223 /* 2224 * If Hypervisor CPUID bit is set, try to determine hypervisor 2225 * vendor signature, and set platform type accordingly. 2226 * 2227 * References: 2228 * http://lkml.org/lkml/2008/10/1/246 2229 * http://kb.vmware.com/kb/1009458 2230 */ 2231 cp.cp_eax = 0x1; 2232 (void) __cpuid_insn(&cp); 2233 if ((cp.cp_ecx & CPUID_INTC_ECX_HV) != 0) { 2234 cp.cp_eax = 0x40000000; 2235 (void) __cpuid_insn(&cp); 2236 regs[0] = cp.cp_ebx; 2237 regs[1] = cp.cp_ecx; 2238 regs[2] = cp.cp_edx; 2239 regs[3] = 0; 2240 if (strcmp(hvstr, HVSIG_XEN_HVM) == 0) { 2241 platform_type = HW_XEN_HVM; 2242 return; 2243 } 2244 if (strcmp(hvstr, HVSIG_VMWARE) == 0) { 2245 platform_type = HW_VMWARE; 2246 return; 2247 } 2248 if (strcmp(hvstr, HVSIG_KVM) == 0) { 2249 platform_type = HW_KVM; 2250 return; 2251 } 2252 if (strcmp(hvstr, HVSIG_BHYVE) == 0) { 2253 platform_type = HW_BHYVE; 2254 return; 2255 } 2256 if (strcmp(hvstr, HVSIG_MICROSOFT) == 0) { 2257 platform_type = HW_MICROSOFT; 2258 return; 2259 } 2260 if (strcmp(hvstr, HVSIG_QEMU_TCG) == 0) { 2261 platform_type = HW_QEMU_TCG; 2262 return; 2263 } 2264 if (strcmp(hvstr, HVSIG_VIRTUALBOX) == 0) { 2265 platform_type = HW_VIRTUALBOX; 2266 return; 2267 } 2268 if (strcmp(hvstr, HVSIG_ACRN) == 0) { 2269 platform_type = HW_ACRN; 2270 return; 2271 } 2272 } else { 2273 /* 2274 * Check older VMware hardware versions. VMware hypervisor is 2275 * detected by performing an IN operation to VMware hypervisor 2276 * port and checking that value returned in %ebx is VMware 2277 * hypervisor magic value. 2278 * 2279 * References: http://kb.vmware.com/kb/1009458 2280 */ 2281 vmware_port(VMWARE_HVCMD_GETVERSION, regs); 2282 if (regs[1] == VMWARE_HVMAGIC) { 2283 platform_type = HW_VMWARE; 2284 return; 2285 } 2286 } 2287 2288 /* 2289 * Check Xen hypervisor. In a fully virtualized domain, 2290 * Xen's pseudo-cpuid function returns a string representing the 2291 * Xen signature in %ebx, %ecx, and %edx. %eax contains the maximum 2292 * supported cpuid function. We need at least a (base + 2) leaf value 2293 * to do what we want to do. Try different base values, since the 2294 * hypervisor might use a different one depending on whether Hyper-V 2295 * emulation is switched on by default or not. 
2296 */ 2297 for (base = 0x40000000; base < 0x40010000; base += 0x100) { 2298 cp.cp_eax = base; 2299 (void) __cpuid_insn(&cp); 2300 regs[0] = cp.cp_ebx; 2301 regs[1] = cp.cp_ecx; 2302 regs[2] = cp.cp_edx; 2303 regs[3] = 0; 2304 if (strcmp(hvstr, HVSIG_XEN_HVM) == 0 && 2305 cp.cp_eax >= (base + 2)) { 2306 platform_type &= ~HW_NATIVE; 2307 platform_type |= HW_XEN_HVM; 2308 return; 2309 } 2310 } 2311 } 2312 2313 int 2314 get_hwenv(void) 2315 { 2316 ASSERT(platform_type != -1); 2317 return (platform_type); 2318 } 2319 2320 int 2321 is_controldom(void) 2322 { 2323 return (0); 2324 } 2325 2326 #else 2327 2328 int 2329 get_hwenv(void) 2330 { 2331 return (HW_XEN_PV); 2332 } 2333 2334 int 2335 is_controldom(void) 2336 { 2337 return (DOMAIN_IS_INITDOMAIN(xen_info)); 2338 } 2339 2340 #endif /* __xpv */ 2341 2342 /* 2343 * Gather the extended topology information. This should be the same for both 2344 * AMD leaf 8X26 and Intel leaf 0x1F (though the data interpretation varies). 2345 */ 2346 static void 2347 cpuid_gather_ext_topo_leaf(struct cpuid_info *cpi, uint32_t leaf) 2348 { 2349 uint_t i; 2350 2351 for (i = 0; i < ARRAY_SIZE(cpi->cpi_topo); i++) { 2352 struct cpuid_regs *regs = &cpi->cpi_topo[i]; 2353 2354 bzero(regs, sizeof (struct cpuid_regs)); 2355 regs->cp_eax = leaf; 2356 regs->cp_ecx = i; 2357 2358 (void) __cpuid_insn(regs); 2359 if (CPUID_AMD_8X26_ECX_TYPE(regs->cp_ecx) == 2360 CPUID_AMD_8X26_TYPE_DONE) { 2361 break; 2362 } 2363 } 2364 2365 cpi->cpi_topo_nleaves = i; 2366 } 2367 2368 /* 2369 * Make sure that we have gathered all of the CPUID leaves that we might need to 2370 * determine topology. We assume that the standard leaf 1 has already been done 2371 * and that xmaxeax has already been calculated. 2372 */ 2373 static void 2374 cpuid_gather_amd_topology_leaves(cpu_t *cpu) 2375 { 2376 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 2377 2378 if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) { 2379 struct cpuid_regs *cp; 2380 2381 cp = &cpi->cpi_extd[8]; 2382 cp->cp_eax = CPUID_LEAF_EXT_8; 2383 (void) __cpuid_insn(cp); 2384 platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_8, cp); 2385 } 2386 2387 if (is_x86_feature(x86_featureset, X86FSET_TOPOEXT) && 2388 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) { 2389 struct cpuid_regs *cp; 2390 2391 cp = &cpi->cpi_extd[0x1e]; 2392 cp->cp_eax = CPUID_LEAF_EXT_1e; 2393 (void) __cpuid_insn(cp); 2394 } 2395 2396 if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_26) { 2397 cpuid_gather_ext_topo_leaf(cpi, CPUID_LEAF_EXT_26); 2398 } 2399 } 2400 2401 /* 2402 * Get the APIC ID for this processor. If Leaf B is present and valid, we prefer 2403 * it to everything else. If not, and we're on an AMD system where 8000001e is 2404 * valid, then we use that. Othewrise, we fall back to the default value for the 2405 * APIC ID in leaf 1. 2406 */ 2407 static uint32_t 2408 cpuid_gather_apicid(struct cpuid_info *cpi) 2409 { 2410 /* 2411 * Leaf B changes based on the arguments to it. Because we don't cache 2412 * it, we need to gather it again. 
2413 	 */
2414 	if (cpi->cpi_maxeax >= 0xB) {
2415 		struct cpuid_regs regs;
2416 		struct cpuid_regs *cp;
2417 
2418 		cp = &regs;
2419 		cp->cp_eax = 0xB;
2420 		cp->cp_edx = cp->cp_ebx = cp->cp_ecx = 0;
2421 		(void) __cpuid_insn(cp);
2422 
2423 		if (cp->cp_ebx != 0) {
2424 			return (cp->cp_edx);
2425 		}
2426 	}
2427 
2428 	if ((cpi->cpi_vendor == X86_VENDOR_AMD ||
2429 	    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
2430 	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
2431 	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
2432 		return (cpi->cpi_extd[0x1e].cp_eax);
2433 	}
2434 
2435 	return (CPI_APIC_ID(cpi));
2436 }
2437 
2438 /*
2439  * For AMD processors, attempt to calculate the number of chips and cores that
2440  * exist. The way that we do this varies based on the generation, because the
2441  * generations themselves have changed dramatically.
2442  *
2443  * If cpuid leaf 0x80000008 exists, that generally tells us the number of cores.
2444  * However, with the advent of family 17h (Zen) it actually tells us the number
2445  * of threads, so we need to look at leaf 0x8000001e if available to determine
2446  * its value. Otherwise, for all prior families, the number of enabled cores is
2447  * the same as threads.
2448  *
2449  * If we do not have leaf 0x80000008, then we assume that this processor does
2450  * not have anything. AMD's older CPUID specification says there's no reason to
2451  * fall back to leaf 1.
2452  *
2453  * In some virtualization cases we will not have leaf 8000001e or it will be
2454  * zero. When that happens we assume the number of threads is one.
2455  */
2456 static void
2457 cpuid_amd_ncores(struct cpuid_info *cpi, uint_t *ncpus, uint_t *ncores)
2458 {
2459 	uint_t nthreads, nthread_per_core;
2460 
2461 	nthreads = nthread_per_core = 1;
2462 
2463 	if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) {
2464 		nthreads = BITX(cpi->cpi_extd[8].cp_ecx, 7, 0) + 1;
2465 	} else if ((cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_HTT) != 0) {
2466 		nthreads = CPI_CPU_COUNT(cpi);
2467 	}
2468 
2469 	/*
2470 	 * For us to have threads, and know about it, we have to be at least at
2471 	 * family 17h and have the cpuid bit that says we have extended
2472 	 * topology.
2473 	 */
2474 	if (cpi->cpi_family >= 0x17 &&
2475 	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
2476 	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
2477 		nthread_per_core = BITX(cpi->cpi_extd[0x1e].cp_ebx, 15, 8) + 1;
2478 	}
2479 
2480 	*ncpus = nthreads;
2481 	*ncores = nthreads / nthread_per_core;
2482 }
2483 
2484 /*
2485  * Seed the initial values for the cores and threads for an Intel based
2486  * processor. These values will be overwritten if we detect that the processor
2487  * supports CPUID leaf 0xb.
2488  */
2489 static void
2490 cpuid_intel_ncores(struct cpuid_info *cpi, uint_t *ncpus, uint_t *ncores)
2491 {
2492 	/*
2493 	 * Only seed the number of physical cores from the first level leaf 4
2494 	 * information. The number of threads there indicates how many share the
2495 	 * L1 cache, which may or may not have anything to do with the number of
2496 	 * logical CPUs per core.
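	 *
	 *	As a purely hypothetical example, if leaf 4 %eax[31:26] were to
	 *	read back as 7, we would seed *ncores = 7 + 1 = 8 below.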
2497 	 */
2498 	if (cpi->cpi_maxeax >= 4) {
2499 		*ncores = BITX(cpi->cpi_std[4].cp_eax, 31, 26) + 1;
2500 	} else {
2501 		*ncores = 1;
2502 	}
2503 
2504 	if ((cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_HTT) != 0) {
2505 		*ncpus = CPI_CPU_COUNT(cpi);
2506 	} else {
2507 		*ncpus = *ncores;
2508 	}
2509 }
2510 
2511 static boolean_t
2512 cpuid_leafB_getids(cpu_t *cpu)
2513 {
2514 	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
2515 	struct cpuid_regs regs;
2516 	struct cpuid_regs *cp;
2517 
2518 	if (cpi->cpi_maxeax < 0xB)
2519 		return (B_FALSE);
2520 
2521 	cp = &regs;
2522 	cp->cp_eax = 0xB;
2523 	cp->cp_edx = cp->cp_ebx = cp->cp_ecx = 0;
2524 
2525 	(void) __cpuid_insn(cp);
2526 
2527 	/*
2528 	 * Check CPUID.EAX=0BH, ECX=0H:EBX is non-zero, which
2529 	 * indicates that the extended topology enumeration leaf is
2530 	 * available.
2531 	 */
2532 	if (cp->cp_ebx != 0) {
2533 		uint32_t x2apic_id = 0;
2534 		uint_t coreid_shift = 0;
2535 		uint_t ncpu_per_core = 1;
2536 		uint_t chipid_shift = 0;
2537 		uint_t ncpu_per_chip = 1;
2538 		uint_t i;
2539 		uint_t level;
2540 
2541 		for (i = 0; i < CPI_FNB_ECX_MAX; i++) {
2542 			cp->cp_eax = 0xB;
2543 			cp->cp_ecx = i;
2544 
2545 			(void) __cpuid_insn(cp);
2546 			level = CPI_CPU_LEVEL_TYPE(cp);
2547 
2548 			if (level == 1) {
2549 				x2apic_id = cp->cp_edx;
2550 				coreid_shift = BITX(cp->cp_eax, 4, 0);
2551 				ncpu_per_core = BITX(cp->cp_ebx, 15, 0);
2552 			} else if (level == 2) {
2553 				x2apic_id = cp->cp_edx;
2554 				chipid_shift = BITX(cp->cp_eax, 4, 0);
2555 				ncpu_per_chip = BITX(cp->cp_ebx, 15, 0);
2556 			}
2557 		}
2558 
2559 		/*
2560 		 * cpi_apicid is taken care of in cpuid_gather_apicid.
2561 		 */
2562 		cpi->cpi_ncpu_per_chip = ncpu_per_chip;
2563 		cpi->cpi_ncore_per_chip = ncpu_per_chip /
2564 		    ncpu_per_core;
2565 		cpi->cpi_chipid = x2apic_id >> chipid_shift;
2566 		cpi->cpi_clogid = x2apic_id & ((1 << chipid_shift) - 1);
2567 		cpi->cpi_coreid = x2apic_id >> coreid_shift;
2568 		cpi->cpi_pkgcoreid = cpi->cpi_clogid >> coreid_shift;
2569 		cpi->cpi_procnodeid = cpi->cpi_chipid;
2570 		cpi->cpi_compunitid = cpi->cpi_coreid;
2571 
2572 		if (coreid_shift > 0 && chipid_shift > coreid_shift) {
2573 			cpi->cpi_nthread_bits = coreid_shift;
2574 			cpi->cpi_ncore_bits = chipid_shift - coreid_shift;
2575 		}
2576 
2577 		return (B_TRUE);
2578 	} else {
2579 		return (B_FALSE);
2580 	}
2581 }
2582 
2583 static void
2584 cpuid_intel_getids(cpu_t *cpu, void *feature)
2585 {
2586 	uint_t i;
2587 	uint_t chipid_shift = 0;
2588 	uint_t coreid_shift = 0;
2589 	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
2590 
2591 	/*
2592 	 * There are no compute units or processor nodes currently on Intel.
2593 	 * Always set these to one.
2594 	 */
2595 	cpi->cpi_procnodes_per_pkg = 1;
2596 	cpi->cpi_cores_per_compunit = 1;
2597 
2598 	/*
2599 	 * If cpuid Leaf B is present, use that to try and get this information.
2600 	 * It will be the most accurate for Intel CPUs.
2601 	 */
2602 	if (cpuid_leafB_getids(cpu))
2603 		return;
2604 
2605 	/*
2606 	 * In this case, we have the leaf 1 and leaf 4 values for ncpu_per_chip
2607 	 * and ncore_per_chip. These represent the largest power of two values
2608 	 * that we need to cover all of the IDs in the system. Therefore, we use
2609 	 * those values to seed the number of bits needed to cover information
2610 	 * in the case when leaf B is not available. These values will probably
2611 	 * be larger than required, but that's OK.
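	 *
	 *	As a hypothetical example, a package reporting 8 logical CPUs
	 *	and 4 cores would get cpi_nthread_bits = ddi_fls(8) = 4 and
	 *	cpi_ncore_bits = ddi_fls(4) = 3, slightly more than the 3 and
	 *	2 bits strictly needed, which is fine as noted above.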
2612 */ 2613 cpi->cpi_nthread_bits = ddi_fls(cpi->cpi_ncpu_per_chip); 2614 cpi->cpi_ncore_bits = ddi_fls(cpi->cpi_ncore_per_chip); 2615 2616 for (i = 1; i < cpi->cpi_ncpu_per_chip; i <<= 1) 2617 chipid_shift++; 2618 2619 cpi->cpi_chipid = cpi->cpi_apicid >> chipid_shift; 2620 cpi->cpi_clogid = cpi->cpi_apicid & ((1 << chipid_shift) - 1); 2621 2622 if (is_x86_feature(feature, X86FSET_CMP)) { 2623 /* 2624 * Multi-core (and possibly multi-threaded) 2625 * processors. 2626 */ 2627 uint_t ncpu_per_core = 0; 2628 2629 if (cpi->cpi_ncore_per_chip == 1) 2630 ncpu_per_core = cpi->cpi_ncpu_per_chip; 2631 else if (cpi->cpi_ncore_per_chip > 1) 2632 ncpu_per_core = cpi->cpi_ncpu_per_chip / 2633 cpi->cpi_ncore_per_chip; 2634 /* 2635 * 8bit APIC IDs on dual core Pentiums 2636 * look like this: 2637 * 2638 * +-----------------------+------+------+ 2639 * | Physical Package ID | MC | HT | 2640 * +-----------------------+------+------+ 2641 * <------- chipid --------> 2642 * <------- coreid ---------------> 2643 * <--- clogid --> 2644 * <------> 2645 * pkgcoreid 2646 * 2647 * Where the number of bits necessary to 2648 * represent MC and HT fields together equals 2649 * to the minimum number of bits necessary to 2650 * store the value of cpi->cpi_ncpu_per_chip. 2651 * Of those bits, the MC part uses the number 2652 * of bits necessary to store the value of 2653 * cpi->cpi_ncore_per_chip. 2654 */ 2655 for (i = 1; i < ncpu_per_core; i <<= 1) 2656 coreid_shift++; 2657 cpi->cpi_coreid = cpi->cpi_apicid >> coreid_shift; 2658 cpi->cpi_pkgcoreid = cpi->cpi_clogid >> coreid_shift; 2659 } else if (is_x86_feature(feature, X86FSET_HTT)) { 2660 /* 2661 * Single-core multi-threaded processors. 2662 */ 2663 cpi->cpi_coreid = cpi->cpi_chipid; 2664 cpi->cpi_pkgcoreid = 0; 2665 } else { 2666 /* 2667 * Single-core single-thread processors. 2668 */ 2669 cpi->cpi_coreid = cpu->cpu_id; 2670 cpi->cpi_pkgcoreid = 0; 2671 } 2672 cpi->cpi_procnodeid = cpi->cpi_chipid; 2673 cpi->cpi_compunitid = cpi->cpi_coreid; 2674 } 2675 2676 /* 2677 * Historically, AMD has had CMP chips with only a single thread per core. 2678 * However, starting in family 17h (Zen), this has changed and they now have 2679 * multiple threads. Our internal core id needs to be a unique value. 2680 * 2681 * To determine the core id of an AMD system, if we're from a family before 17h, 2682 * then we just use the cpu id, as that gives us a good value that will be 2683 * unique for each core. If instead, we're on family 17h or later, then we need 2684 * to do something more complicated. CPUID leaf 0x8000001e can tell us 2685 * how many threads are in the system. Based on that, we'll shift the APIC ID. 2686 * We can't use the normal core id in that leaf as it's only unique within the 2687 * socket, which is perfect for cpi_pkgcoreid, but not us. 2688 */ 2689 static id_t 2690 cpuid_amd_get_coreid(cpu_t *cpu) 2691 { 2692 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 2693 2694 if (cpi->cpi_family >= 0x17 && 2695 is_x86_feature(x86_featureset, X86FSET_TOPOEXT) && 2696 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) { 2697 uint_t nthreads = BITX(cpi->cpi_extd[0x1e].cp_ebx, 15, 8) + 1; 2698 if (nthreads > 1) { 2699 VERIFY3U(nthreads, ==, 2); 2700 return (cpi->cpi_apicid >> 1); 2701 } 2702 } 2703 2704 return (cpu->cpu_id); 2705 } 2706 2707 /* 2708 * IDs on AMD is a more challenging task. This is notable because of the 2709 * following two facts: 2710 * 2711 * 1. 
Before family 0x17 (Zen), there was no support for SMT and there was 2712 * also no way to get an actual unique core id from the system. As such, we 2713 * synthesize this case by using cpu->cpu_id. This scheme does not, 2714 * however, guarantee that sibling cores of a chip will have sequential 2715 * coreids starting at a multiple of the number of cores per chip - that is 2716 * usually the case, but if the APIC IDs have been set up in a different 2717 * order then we need to perform a few more gymnastics for the pkgcoreid. 2718 * 2719 * 2. In families 0x15 and 0x16 (Bulldozer and co.) the cores came in groups 2720 * called compute units. These compute units share the L1I cache, L2 cache, 2721 * and the FPU. To deal with this, a new topology leaf was added in 2722 * 0x8000001e. However, parts of this leaf have different meanings 2723 * once we get to family 0x17. 2724 */ 2725 2726 static void 2727 cpuid_amd_getids(cpu_t *cpu, uchar_t *features) 2728 { 2729 int i, first_half, coreidsz; 2730 uint32_t nb_caps_reg; 2731 uint_t node2_1; 2732 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 2733 struct cpuid_regs *cp; 2734 2735 /* 2736 * Calculate the core id (this comes from hardware in family 0x17 if it 2737 * hasn't been stripped by virtualization). We always set the compute 2738 * unit id to the same value. Also, initialize the default number of 2739 * cores per compute unit and nodes per package. This will be 2740 * overwritten when we know information about a particular family. 2741 */ 2742 cpi->cpi_coreid = cpuid_amd_get_coreid(cpu); 2743 cpi->cpi_compunitid = cpi->cpi_coreid; 2744 cpi->cpi_cores_per_compunit = 1; 2745 cpi->cpi_procnodes_per_pkg = 1; 2746 2747 /* 2748 * To construct the logical ID, we need to determine how many APIC IDs 2749 * are dedicated to the cores and threads. This is provided for us in 2750 * 0x80000008. However, if it's not present (say due to virtualization), 2751 * then we assume it's one. This should be present on all 64-bit AMD 2752 * processors. It was added in family 0xf (Hammer). 2753 */ 2754 if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) { 2755 coreidsz = BITX((cpi)->cpi_extd[8].cp_ecx, 15, 12); 2756 2757 /* 2758 * In AMD parlance chip is really a node while illumos 2759 * uses chip as equivalent to socket/package. 2760 */ 2761 if (coreidsz == 0) { 2762 /* Use legacy method */ 2763 for (i = 1; i < cpi->cpi_ncore_per_chip; i <<= 1) 2764 coreidsz++; 2765 if (coreidsz == 0) 2766 coreidsz = 1; 2767 } 2768 } else { 2769 /* Assume single-core part */ 2770 coreidsz = 1; 2771 } 2772 cpi->cpi_clogid = cpi->cpi_apicid & ((1 << coreidsz) - 1); 2773 2774 /* 2775 * The package core ID varies depending on the family. While it may be 2776 * tempting to use the CPUID_LEAF_EXT_1e %ebx core id, unfortunately, 2777 * this value is the core id in the given node. For non-virtualized 2778 * family 17h, we need to take the logical core id and shift off the 2779 * threads like we do when getting the core id. Otherwise, we can use 2780 * the clogid as is. When family 17h is virtualized, the clogid should 2781 * be sufficient: if we don't have valid data in the leaf, then we 2782 * won't think we have SMT, in which case the cpi_clogid is what we 2783 * want anyway.
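/*
 * Illustrative sketch, not part of the actual implementation: the masking
 * described above with concrete, hypothetical numbers. If leaf 0x80000008
 * %ecx[15:12] reports coreidsz = 4 and the APIC ID is 0x15, then
 * clogid = 0x15 & ((1 << 4) - 1) = 5. On a family 17h part with two threads
 * per core the package core ID computed below becomes clogid >> 1 = 2; on an
 * older single-thread-per-core part it stays 5. The example_* name is
 * hypothetical.
 */
static uint_t
example_amd_clogid(uint32_t apicid, uint_t coreidsz)
{
	return (apicid & ((1 << coreidsz) - 1));
}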
2784 */ 2785 if (cpi->cpi_family >= 0x17 && 2786 is_x86_feature(x86_featureset, X86FSET_TOPOEXT) && 2787 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e && 2788 cpi->cpi_extd[0x1e].cp_ebx != 0) { 2789 uint_t nthreads = BITX(cpi->cpi_extd[0x1e].cp_ebx, 15, 8) + 1; 2790 if (nthreads > 1) { 2791 VERIFY3U(nthreads, ==, 2); 2792 cpi->cpi_pkgcoreid = cpi->cpi_clogid >> 1; 2793 } else { 2794 cpi->cpi_pkgcoreid = cpi->cpi_clogid; 2795 } 2796 } else { 2797 cpi->cpi_pkgcoreid = cpi->cpi_clogid; 2798 } 2799 2800 /* 2801 * Obtain the node ID and compute unit IDs. If we're on family 0x15 2802 * (bulldozer) or newer, then we can derive all of this from leaf 2803 * CPUID_LEAF_EXT_1e. Otherwise, the method varies by family. 2804 */ 2805 if (is_x86_feature(x86_featureset, X86FSET_TOPOEXT) && 2806 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) { 2807 cp = &cpi->cpi_extd[0x1e]; 2808 2809 cpi->cpi_procnodes_per_pkg = BITX(cp->cp_ecx, 10, 8) + 1; 2810 cpi->cpi_procnodeid = BITX(cp->cp_ecx, 7, 0); 2811 2812 /* 2813 * For Bulldozer-era CPUs, recalculate the compute unit 2814 * information. 2815 */ 2816 if (cpi->cpi_family >= 0x15 && cpi->cpi_family < 0x17) { 2817 cpi->cpi_cores_per_compunit = 2818 BITX(cp->cp_ebx, 15, 8) + 1; 2819 cpi->cpi_compunitid = BITX(cp->cp_ebx, 7, 0) + 2820 (cpi->cpi_ncore_per_chip / 2821 cpi->cpi_cores_per_compunit) * 2822 (cpi->cpi_procnodeid / 2823 cpi->cpi_procnodes_per_pkg); 2824 } 2825 } else if (cpi->cpi_family == 0xf || cpi->cpi_family >= 0x11) { 2826 cpi->cpi_procnodeid = (cpi->cpi_apicid >> coreidsz) & 7; 2827 } else if (cpi->cpi_family == 0x10) { 2828 /* 2829 * See if we are a multi-node processor. 2830 * All processors in the system have the same number of nodes 2831 */ 2832 nb_caps_reg = pci_getl_func(0, 24, 3, 0xe8); 2833 if ((cpi->cpi_model < 8) || BITX(nb_caps_reg, 29, 29) == 0) { 2834 /* Single-node */ 2835 cpi->cpi_procnodeid = BITX(cpi->cpi_apicid, 5, 2836 coreidsz); 2837 } else { 2838 2839 /* 2840 * Multi-node revision D (2 nodes per package 2841 * are supported) 2842 */ 2843 cpi->cpi_procnodes_per_pkg = 2; 2844 2845 first_half = (cpi->cpi_pkgcoreid <= 2846 (cpi->cpi_ncore_per_chip/2 - 1)); 2847 2848 if (cpi->cpi_apicid == cpi->cpi_pkgcoreid) { 2849 /* We are BSP */ 2850 cpi->cpi_procnodeid = (first_half ? 0 : 1); 2851 } else { 2852 2853 /* We are AP */ 2854 /* NodeId[2:1] bits to use for reading F3xe8 */ 2855 node2_1 = BITX(cpi->cpi_apicid, 5, 4) << 1; 2856 2857 nb_caps_reg = 2858 pci_getl_func(0, 24 + node2_1, 3, 0xe8); 2859 2860 /* 2861 * Check IntNodeNum bit (31:30, but bit 31 is 2862 * always 0 on dual-node processors) 2863 */ 2864 if (BITX(nb_caps_reg, 30, 30) == 0) 2865 cpi->cpi_procnodeid = node2_1 + 2866 !first_half; 2867 else 2868 cpi->cpi_procnodeid = node2_1 + 2869 first_half; 2870 } 2871 } 2872 } else { 2873 cpi->cpi_procnodeid = 0; 2874 } 2875 2876 cpi->cpi_chipid = 2877 cpi->cpi_procnodeid / cpi->cpi_procnodes_per_pkg; 2878 2879 cpi->cpi_ncore_bits = coreidsz; 2880 cpi->cpi_nthread_bits = ddi_fls(cpi->cpi_ncpu_per_chip / 2881 cpi->cpi_ncore_per_chip); 2882 } 2883 2884 static void 2885 spec_uarch_flush_noop(void) 2886 { 2887 } 2888 2889 /* 2890 * When microcode is present that mitigates MDS, this wrmsr will also flush the 2891 * MDS-related micro-architectural state that would normally happen by calling 2892 * x86_md_clear(). 
2893 */ 2894 static void 2895 spec_uarch_flush_msr(void) 2896 { 2897 wrmsr(MSR_IA32_FLUSH_CMD, IA32_FLUSH_CMD_L1D); 2898 } 2899 2900 /* 2901 * This function points to a function that will flush certain 2902 * micro-architectural state on the processor. This flush is used to mitigate 2903 * three different classes of Intel CPU vulnerabilities: L1TF, MDS, and RFDS. 2904 * This function can point to one of three functions: 2905 * 2906 * - A noop which is done because we either are vulnerable, but do not have 2907 * microcode available to help deal with a fix, or because we aren't 2908 * vulnerable. 2909 * 2910 * - spec_uarch_flush_msr which will issue an L1D flush and if microcode to 2911 * mitigate MDS is present, also perform the equivalent of the MDS flush; 2912 * however, it only flushes the MDS related micro-architectural state on the 2913 * current hyperthread, it does not do anything for the twin. 2914 * 2915 * - x86_md_clear which will flush the MDS related state. This is done when we 2916 * have a processor that is vulnerable to MDS, but is not vulnerable to L1TF 2917 * (RDCL_NO is set); or if the CPU is vulnerable to RFDS and indicates VERW 2918 * can clear it (RFDS_CLEAR is set). 2919 */ 2920 void (*spec_uarch_flush)(void) = spec_uarch_flush_noop; 2921 2922 static void 2923 cpuid_update_md_clear(cpu_t *cpu, uchar_t *featureset) 2924 { 2925 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 2926 2927 /* Non-Intel doesn't concern us here. */ 2928 if (cpi->cpi_vendor != X86_VENDOR_Intel) 2929 return; 2930 2931 /* 2932 * While RDCL_NO indicates that one of the MDS vulnerabilities (MSBDS) 2933 * has been fixed in hardware, it doesn't cover everything related to 2934 * MDS. Therefore we can only rely on MDS_NO to determine that we don't 2935 * need to mitigate this. 2936 * 2937 * We must ALSO check the case of RFDS_NO and if RFDS_CLEAR is set, 2938 * because of the small cases of RFDS. 2939 */ 2940 2941 if ((!is_x86_feature(featureset, X86FSET_MDS_NO) && 2942 is_x86_feature(featureset, X86FSET_MD_CLEAR)) || 2943 (!is_x86_feature(featureset, X86FSET_RFDS_NO) && 2944 is_x86_feature(featureset, X86FSET_RFDS_CLEAR))) { 2945 const uint8_t nop = NOP_INSTR; 2946 uint8_t *md = (uint8_t *)x86_md_clear; 2947 2948 *md = nop; 2949 } 2950 2951 membar_producer(); 2952 } 2953 2954 static void 2955 cpuid_update_l1d_flush(cpu_t *cpu, uchar_t *featureset) 2956 { 2957 boolean_t need_l1d, need_mds, need_rfds; 2958 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 2959 2960 /* 2961 * If we're not on Intel or we've mitigated all of RDCL, MDS, and RFDS 2962 * in hardware, then there's nothing left for us to do for enabling 2963 * the flush. We can also go ahead and say that SMT exclusion is 2964 * unnecessary. 2965 */ 2966 if (cpi->cpi_vendor != X86_VENDOR_Intel || 2967 (is_x86_feature(featureset, X86FSET_RDCL_NO) && 2968 is_x86_feature(featureset, X86FSET_MDS_NO) && 2969 is_x86_feature(featureset, X86FSET_RFDS_NO))) { 2970 extern int smt_exclusion; 2971 smt_exclusion = 0; 2972 spec_uarch_flush = spec_uarch_flush_noop; 2973 membar_producer(); 2974 return; 2975 } 2976 2977 /* 2978 * The locations where we need to perform an L1D flush are required both 2979 * for mitigating L1TF and MDS. When verw support is present in 2980 * microcode, then the L1D flush will take care of doing that as well. 2981 * However, if we have a system where RDCL_NO is present, but we don't 2982 * have MDS_NO, then we need to do a verw (x86_md_clear) and not a full 2983 * L1D flush. 
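/*
 * Illustrative sketch, not part of the actual implementation: the selection
 * the code just below performs, pulled out as a helper so the three possible
 * outcomes are easy to see. The need_* inputs correspond to the variables
 * computed below; the example_* names are hypothetical.
 */
typedef void (*example_flush_fn_t)(void);

static example_flush_fn_t
example_pick_uarch_flush(boolean_t need_l1d, boolean_t need_mds,
    boolean_t need_rfds)
{
	if (need_l1d)
		return (spec_uarch_flush_msr);	/* L1D flush; covers MDS too */
	if (need_mds || need_rfds)
		return (x86_md_clear);		/* VERW-based flush only */
	return (spec_uarch_flush_noop);		/* nothing to do, or no help */
}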
2984 */ 2985 if (!is_x86_feature(featureset, X86FSET_RDCL_NO) && 2986 is_x86_feature(featureset, X86FSET_FLUSH_CMD) && 2987 !is_x86_feature(featureset, X86FSET_L1D_VM_NO)) { 2988 need_l1d = B_TRUE; 2989 } else { 2990 need_l1d = B_FALSE; 2991 } 2992 2993 if (!is_x86_feature(featureset, X86FSET_MDS_NO) && 2994 is_x86_feature(featureset, X86FSET_MD_CLEAR)) { 2995 need_mds = B_TRUE; 2996 } else { 2997 need_mds = B_FALSE; 2998 } 2999 3000 if (!is_x86_feature(featureset, X86FSET_RFDS_NO) && 3001 is_x86_feature(featureset, X86FSET_RFDS_CLEAR)) { 3002 need_rfds = B_TRUE; 3003 } else { 3004 need_rfds = B_FALSE; 3005 } 3006 3007 if (need_l1d) { 3008 /* 3009 * As of February 2024, no CPU needs L1D *and* RFDS mitigation 3010 * together. If the following VERIFY trips, we need to add 3011 * further fixes here. 3012 */ 3013 VERIFY(!need_rfds); 3014 spec_uarch_flush = spec_uarch_flush_msr; 3015 } else if (need_mds || need_rfds) { 3016 spec_uarch_flush = x86_md_clear; 3017 } else { 3018 /* 3019 * We have no hardware mitigations available to us. 3020 */ 3021 spec_uarch_flush = spec_uarch_flush_noop; 3022 } 3023 membar_producer(); 3024 } 3025 3026 /* 3027 * Branch History Injection (BHI) mitigations. 3028 * 3029 * Intel has provided a software sequence that will scrub the BHB. Like RSB 3030 * (below) we can scribble a return at the beginning to avoid it if the CPU 3031 * is modern enough. We can also scribble a return if the CPU is old enough 3032 * to not have an RSB (pre-eIBRS). 3033 */ 3034 typedef enum { 3035 X86_BHI_TOO_OLD_OR_DISABLED, /* Pre-eIBRS or disabled */ 3036 X86_BHI_NEW_ENOUGH, /* AMD, or Intel with BHI_NO set */ 3037 X86_BHI_DIS_S, /* BHI_NO == 0, but BHI_DIS_S avail. */ 3038 /* NOTE: BHI_DIS_S above will still need the software sequence. */ 3039 X86_BHI_SOFTWARE_SEQUENCE, /* Use software sequence */ 3040 } x86_native_bhi_mitigation_t; 3041 3042 x86_native_bhi_mitigation_t x86_bhi_mitigation = X86_BHI_SOFTWARE_SEQUENCE; 3043 3044 static void 3045 cpuid_enable_bhi_dis_s(void) 3046 { 3047 uint64_t val; 3048 3049 val = rdmsr(MSR_IA32_SPEC_CTRL); 3050 val |= IA32_SPEC_CTRL_BHI_DIS_S; 3051 wrmsr(MSR_IA32_SPEC_CTRL, val); 3052 } 3053 3054 /* 3055 * This function scribbles RET into the first instruction of x86_bhb_clear() 3056 * if SPECTREV2 mitigations are disabled, the CPU is too old, the CPU is new 3057 * enough to not be vulnerable (which includes non-Intel CPUs), or the CPU has 3058 * an explicit disable-Branch-History control. 3059 */ 3060 static x86_native_bhi_mitigation_t 3061 cpuid_learn_and_patch_bhi(x86_spectrev2_mitigation_t v2mit, cpu_t *cpu, 3062 uchar_t *featureset) 3063 { 3064 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3065 const uint8_t ret = RET_INSTR; 3066 uint8_t *bhb_clear = (uint8_t *)x86_bhb_clear; 3067 3068 ASSERT0(cpu->cpu_id); 3069 3070 /* First check for explicitly disabled... */ 3071 if (v2mit == X86_SPECTREV2_DISABLED) { 3072 *bhb_clear = ret; 3073 return (X86_BHI_TOO_OLD_OR_DISABLED); 3074 } 3075 3076 /* 3077 * Then check for BHI_NO, which means the CPU doesn't have this bug, 3078 * or if it's non-Intel, in which case this mitigation mechanism 3079 * doesn't apply. 3080 */ 3081 if (cpi->cpi_vendor != X86_VENDOR_Intel || 3082 is_x86_feature(featureset, X86FSET_BHI_NO)) { 3083 *bhb_clear = ret; 3084 return (X86_BHI_NEW_ENOUGH); 3085 } 3086 3087 /* 3088 * Now check for the BHI_CTRL MSR, and then set it if available. 3089 * We will still need to use the software sequence, however.
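/*
 * Illustrative sketch, not part of the actual implementation: the decision
 * ladder that cpuid_learn_and_patch_bhi() walks, condensed into one helper
 * with the patching side effects left out. The example_* name is hypothetical.
 */
static x86_native_bhi_mitigation_t
example_bhi_choice(x86_spectrev2_mitigation_t v2mit, boolean_t is_intel,
    boolean_t bhi_no, boolean_t bhi_ctrl)
{
	if (v2mit == X86_SPECTREV2_DISABLED)
		return (X86_BHI_TOO_OLD_OR_DISABLED);
	if (!is_intel || bhi_no)
		return (X86_BHI_NEW_ENOUGH);
	if (bhi_ctrl)
		return (X86_BHI_DIS_S);		/* still runs the sw sequence */
	if (v2mit == X86_SPECTREV2_RETPOLINE)
		return (X86_BHI_TOO_OLD_OR_DISABLED);
	return (X86_BHI_SOFTWARE_SEQUENCE);
}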
3090 */ 3091 if (is_x86_feature(featureset, X86FSET_BHI_CTRL)) { 3092 cpuid_enable_bhi_dis_s(); 3093 return (X86_BHI_DIS_S); 3094 } 3095 3096 /* 3097 * Finally, check if we are too old to bother with RSB: 3098 */ 3099 if (v2mit == X86_SPECTREV2_RETPOLINE) { 3100 *bhb_clear = ret; 3101 return (X86_BHI_TOO_OLD_OR_DISABLED); 3102 } 3103 3104 ASSERT(*bhb_clear != ret); 3105 return (X86_BHI_SOFTWARE_SEQUENCE); 3106 } 3107 3108 /* 3109 * We default to enabling Return Stack Buffer (RSB) mitigations. 3110 * 3111 * We used to skip RSB mitigations with Intel eIBRS, but developments around 3112 * post-barrier RSB (PBRSB) guessing suggest we should enable Intel RSB 3113 * mitigations always unless explicitly bypassed, or unless hardware indicates 3114 * the bug has been fixed. 3115 * 3116 * The current decisions for using, or ignoring, an RSB software stuffing 3117 * sequence are expressed by the following table: 3118 * 3119 * +-------+------------+-----------------+--------+ 3120 * | eIBRS | PBRSB_NO | context switch | vmexit | 3121 * +-------+------------+-----------------+--------+ 3122 * | Yes | No | stuff | stuff | 3123 * | Yes | Yes | ignore | ignore | 3124 * | No | No | stuff | ignore | 3125 * +-------+------------+-----------------+--------+ 3126 * 3127 * Note that if an Intel CPU has no eIBRS, it will never enumerate PBRSB_NO, 3128 * because machines with no eIBRS do not have a problem with PBRSB overflow. 3129 * See the Intel document cited below for details. 3130 * 3131 * Also note that AMD AUTO_IBRS has no PBRSB problem, so it is not included in 3132 * the table above, and that there is no situation where vmexit stuffing is 3133 * needed, but context-switch stuffing isn't. 3134 */ 3135 3136 /* BEGIN CSTYLED */ 3137 /* 3138 * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html 3139 */ 3140 /* END CSTYLED */ 3141 3142 /* 3143 * AMD indicates that when Automatic IBRS is enabled we do not need to implement 3144 * return stack buffer clearing for VMEXIT as it takes care of it. The manual 3145 * also states that as long as SMEP is enabled and we maintain at least one 3146 * page between the kernel and user space (we have much more of a red zone), 3147 * then we do not need to clear the RSB. We constrain this to only when 3148 * Automatic IBRS is present. 3149 */ 3150 static void 3151 cpuid_patch_rsb(x86_spectrev2_mitigation_t mit, bool intel_pbrsb_no) 3152 { 3153 const uint8_t ret = RET_INSTR; 3154 uint8_t *stuff = (uint8_t *)x86_rsb_stuff; 3155 uint8_t *vmx_stuff = (uint8_t *)x86_rsb_stuff_vmexit; 3156 3157 switch (mit) { 3158 case X86_SPECTREV2_AUTO_IBRS: 3159 case X86_SPECTREV2_DISABLED: 3160 /* Don't bother with any RSB stuffing! */ 3161 *stuff = ret; 3162 *vmx_stuff = ret; 3163 break; 3164 case X86_SPECTREV2_RETPOLINE: 3165 /* 3166 * The Intel document on Post-Barrier RSB says that processors 3167 * without eIBRS do not have PBRSB problems upon VMEXIT. 3168 */ 3169 VERIFY(!intel_pbrsb_no); 3170 VERIFY3U(*stuff, !=, ret); 3171 *vmx_stuff = ret; 3172 break; 3173 default: 3174 /* 3175 * eIBRS is all that's left. If CPU claims PBRSB is fixed, 3176 * don't use the RSB mitigation in either case. Otherwise 3177 * both vmexit and context-switching require the software 3178 * mitigation. 3179 */ 3180 if (intel_pbrsb_no) { 3181 /* CPU claims PBRSB problems are fixed.
*/ 3182 *stuff = ret; 3183 *vmx_stuff = ret; 3184 } 3185 VERIFY3U(*stuff, ==, *vmx_stuff); 3186 break; 3187 } 3188 } 3189 3190 static void 3191 cpuid_patch_retpolines(x86_spectrev2_mitigation_t mit) 3192 { 3193 const char *thunks[] = { "_rax", "_rbx", "_rcx", "_rdx", "_rdi", 3194 "_rsi", "_rbp", "_r8", "_r9", "_r10", "_r11", "_r12", "_r13", 3195 "_r14", "_r15" }; 3196 const uint_t nthunks = ARRAY_SIZE(thunks); 3197 const char *type; 3198 uint_t i; 3199 3200 if (mit == x86_spectrev2_mitigation) 3201 return; 3202 3203 switch (mit) { 3204 case X86_SPECTREV2_RETPOLINE: 3205 type = "gen"; 3206 break; 3207 case X86_SPECTREV2_AUTO_IBRS: 3208 case X86_SPECTREV2_ENHANCED_IBRS: 3209 case X86_SPECTREV2_DISABLED: 3210 type = "jmp"; 3211 break; 3212 default: 3213 panic("asked to update retpoline state with unknown state!"); 3214 } 3215 3216 for (i = 0; i < nthunks; i++) { 3217 uintptr_t source, dest; 3218 int ssize, dsize; 3219 char sourcebuf[64], destbuf[64]; 3220 3221 (void) snprintf(destbuf, sizeof (destbuf), 3222 "__x86_indirect_thunk%s", thunks[i]); 3223 (void) snprintf(sourcebuf, sizeof (sourcebuf), 3224 "__x86_indirect_thunk_%s%s", type, thunks[i]); 3225 3226 source = kobj_getelfsym(sourcebuf, NULL, &ssize); 3227 dest = kobj_getelfsym(destbuf, NULL, &dsize); 3228 VERIFY3U(source, !=, 0); 3229 VERIFY3U(dest, !=, 0); 3230 VERIFY3S(dsize, >=, ssize); 3231 bcopy((void *)source, (void *)dest, ssize); 3232 } 3233 } 3234 3235 static void 3236 cpuid_enable_enhanced_ibrs(void) 3237 { 3238 uint64_t val; 3239 3240 val = rdmsr(MSR_IA32_SPEC_CTRL); 3241 val |= IA32_SPEC_CTRL_IBRS; 3242 wrmsr(MSR_IA32_SPEC_CTRL, val); 3243 } 3244 3245 static void 3246 cpuid_enable_auto_ibrs(void) 3247 { 3248 uint64_t val; 3249 3250 val = rdmsr(MSR_AMD_EFER); 3251 val |= AMD_EFER_AIBRSE; 3252 wrmsr(MSR_AMD_EFER, val); 3253 } 3254 3255 /* 3256 * AMD Zen 5 processors have a bug where the 16- and 32-bit forms of the 3257 * RDSEED instruction can frequently return 0 despite indicating success 3258 * (CF=1) - See AMD-SB-7055 / CVE-2025-62626. 3259 */ 3260 static void 3261 cpuid_evaluate_amd_rdseed(cpu_t *cpu, uchar_t *featureset) 3262 { 3263 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3264 struct cpuid_regs *ecp = &cpi->cpi_std[7]; 3265 uint32_t rev = cpu->cpu_m.mcpu_ucode_info->cui_rev; 3266 uint64_t val; 3267 3268 ASSERT3U(cpi->cpi_vendor, ==, X86_VENDOR_AMD); 3269 ASSERT(ecp->cp_ebx & CPUID_INTC_EBX_7_0_RDSEED); 3270 3271 /* This erratum only applies to the Zen5 uarch */ 3272 if (uarchrev_uarch(cpi->cpi_uarchrev) != X86_UARCH_AMD_ZEN5) 3273 return; 3274 3275 /* 3276 * AMD-SB-7055 specifies microcode versions that mitigate this issue on 3277 * BRH-C1 and BRHD-B0. If we're on one of those chips and the microcode 3278 * version is new enough we can leave RDSEED enabled. 3279 */ 3280 if (chiprev_matches(cpi->cpi_chiprev, X86_CHIPREV_AMD_TURIN_C1) && 3281 rev >= 0x0b00215a) { 3282 return; 3283 } 3284 if (chiprev_matches(cpi->cpi_chiprev, X86_CHIPREV_AMD_DENSE_TURIN_B0) && 3285 rev >= 0x0b101054) { 3286 return; 3287 } 3288 3289 /* 3290 * Go ahead and disable RDSEED on this boot. 3291 * In addition to removing it from the feature set and cached value, we 3292 * also need to remove it from the features returned by CPUID7 so that 3293 * userland programs performing their own feature detection will 3294 * determine it is not available. 
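/*
 * Illustrative sketch, not part of the actual implementation: the userland
 * side of the comment above. A program probing for RDSEED on its own would
 * typically use the compiler's <cpuid.h> helpers and test
 * CPUID.(EAX=7,ECX=0):%ebx bit 18; once the masking below has taken effect
 * that bit reads as zero, so such a program concludes RDSEED is unavailable.
 */
#if 0	/* userland illustration only */
#include <cpuid.h>

static int
example_have_rdseed(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) == 0)
		return (0);
	return ((ebx & (1U << 18)) != 0);
}
#endif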
3295 */ 3296 if (cpu->cpu_id == 0) 3297 cmn_err(CE_WARN, "Masking unreliable RDSEED on this hardware"); 3298 3299 remove_x86_feature(featureset, X86FSET_RDSEED); 3300 ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_RDSEED; 3301 3302 val = rdmsr(MSR_AMD_CPUID7_FEATURES); 3303 val &= ~MSR_AMD_CPUID7_FEATURES_RDSEED; 3304 wrmsr(MSR_AMD_CPUID7_FEATURES, val); 3305 } 3306 3307 /* 3308 * Determine how we should mitigate TAA or if we need to. Regardless of TAA, if 3309 * we can disable TSX, we do so. 3310 * 3311 * This determination is done only on the boot CPU, potentially after loading 3312 * updated microcode. 3313 */ 3314 static void 3315 cpuid_update_tsx(cpu_t *cpu, uchar_t *featureset) 3316 { 3317 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3318 3319 VERIFY(cpu->cpu_id == 0); 3320 3321 if (cpi->cpi_vendor != X86_VENDOR_Intel) { 3322 x86_taa_mitigation = X86_TAA_HW_MITIGATED; 3323 return; 3324 } 3325 3326 if (x86_disable_taa) { 3327 x86_taa_mitigation = X86_TAA_DISABLED; 3328 return; 3329 } 3330 3331 /* 3332 * If we do not have the ability to disable TSX, then our only 3333 * mitigation options are in hardware (TAA_NO), or by using our existing 3334 * MDS mitigation as described above. The latter relies upon us having 3335 * configured MDS mitigations correctly! This includes disabling SMT if 3336 * we want to cross-CPU-thread protection. 3337 */ 3338 if (!is_x86_feature(featureset, X86FSET_TSX_CTRL)) { 3339 /* 3340 * It's not clear whether any parts will enumerate TAA_NO 3341 * *without* TSX_CTRL, but let's mark it as such if we see this. 3342 */ 3343 if (is_x86_feature(featureset, X86FSET_TAA_NO)) { 3344 x86_taa_mitigation = X86_TAA_HW_MITIGATED; 3345 return; 3346 } 3347 3348 if (is_x86_feature(featureset, X86FSET_MD_CLEAR) && 3349 !is_x86_feature(featureset, X86FSET_MDS_NO)) { 3350 x86_taa_mitigation = X86_TAA_MD_CLEAR; 3351 } else { 3352 x86_taa_mitigation = X86_TAA_NOTHING; 3353 } 3354 return; 3355 } 3356 3357 /* 3358 * We have TSX_CTRL, but we can only fully disable TSX if we're early 3359 * enough in boot. 3360 * 3361 * Otherwise, we'll fall back to causing transactions to abort as our 3362 * mitigation. TSX-using code will always take the fallback path. 3363 */ 3364 if (cpi->cpi_pass < 4) { 3365 x86_taa_mitigation = X86_TAA_TSX_DISABLE; 3366 } else { 3367 x86_taa_mitigation = X86_TAA_TSX_FORCE_ABORT; 3368 } 3369 } 3370 3371 /* 3372 * As mentioned, we should only touch the MSR when we've got a suitable 3373 * microcode loaded on this CPU. 
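/*
 * Illustrative sketch, not part of the actual implementation: the TAA policy
 * chosen by cpuid_update_tsx() above, condensed into a single helper. "pass"
 * stands in for cpi_pass; the example_* name is hypothetical.
 */
static x86_taa_mitigation_t
example_taa_choice(boolean_t is_intel, boolean_t disable_taa,
    boolean_t tsx_ctrl, boolean_t taa_no, boolean_t md_clear, boolean_t mds_no,
    uint_t pass)
{
	if (!is_intel)
		return (X86_TAA_HW_MITIGATED);
	if (disable_taa)
		return (X86_TAA_DISABLED);
	if (!tsx_ctrl) {
		if (taa_no)
			return (X86_TAA_HW_MITIGATED);
		return ((md_clear && !mds_no) ?
		    X86_TAA_MD_CLEAR : X86_TAA_NOTHING);
	}
	return (pass < 4 ? X86_TAA_TSX_DISABLE : X86_TAA_TSX_FORCE_ABORT);
}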
3374 */ 3375 static void 3376 cpuid_apply_tsx(x86_taa_mitigation_t taa, uchar_t *featureset) 3377 { 3378 uint64_t val; 3379 3380 switch (taa) { 3381 case X86_TAA_TSX_DISABLE: 3382 if (!is_x86_feature(featureset, X86FSET_TSX_CTRL)) 3383 return; 3384 val = rdmsr(MSR_IA32_TSX_CTRL); 3385 val |= IA32_TSX_CTRL_CPUID_CLEAR | IA32_TSX_CTRL_RTM_DISABLE; 3386 wrmsr(MSR_IA32_TSX_CTRL, val); 3387 break; 3388 case X86_TAA_TSX_FORCE_ABORT: 3389 if (!is_x86_feature(featureset, X86FSET_TSX_CTRL)) 3390 return; 3391 val = rdmsr(MSR_IA32_TSX_CTRL); 3392 val |= IA32_TSX_CTRL_RTM_DISABLE; 3393 wrmsr(MSR_IA32_TSX_CTRL, val); 3394 break; 3395 case X86_TAA_HW_MITIGATED: 3396 case X86_TAA_MD_CLEAR: 3397 case X86_TAA_DISABLED: 3398 case X86_TAA_NOTHING: 3399 break; 3400 } 3401 } 3402 3403 static void 3404 cpuid_scan_security(cpu_t *cpu, uchar_t *featureset) 3405 { 3406 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3407 x86_spectrev2_mitigation_t v2mit; 3408 3409 if ((cpi->cpi_vendor == X86_VENDOR_AMD || 3410 cpi->cpi_vendor == X86_VENDOR_HYGON) && 3411 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) { 3412 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_IBPB) 3413 add_x86_feature(featureset, X86FSET_IBPB); 3414 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_IBRS) 3415 add_x86_feature(featureset, X86FSET_IBRS); 3416 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_STIBP) 3417 add_x86_feature(featureset, X86FSET_STIBP); 3418 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_STIBP_ALL) 3419 add_x86_feature(featureset, X86FSET_STIBP_ALL); 3420 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_SSBD) 3421 add_x86_feature(featureset, X86FSET_SSBD); 3422 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_VIRT_SSBD) 3423 add_x86_feature(featureset, X86FSET_SSBD_VIRT); 3424 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_SSB_NO) 3425 add_x86_feature(featureset, X86FSET_SSB_NO); 3426 3427 /* 3428 * Rather than Enhanced IBRS, AMD has a different feature that 3429 * is a bit in EFER that can be enabled and will basically do 3430 * the right thing while executing in the kernel. 3431 */ 3432 if (cpi->cpi_vendor == X86_VENDOR_AMD && 3433 (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_PREFER_IBRS) && 3434 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_21 && 3435 (cpi->cpi_extd[0x21].cp_eax & CPUID_AMD_8X21_EAX_AIBRS)) { 3436 add_x86_feature(featureset, X86FSET_AUTO_IBRS); 3437 } 3438 3439 } else if (cpi->cpi_vendor == X86_VENDOR_Intel && 3440 cpi->cpi_maxeax >= 7) { 3441 struct cpuid_regs *ecp; 3442 ecp = &cpi->cpi_std[7]; 3443 3444 if (ecp->cp_edx & CPUID_INTC_EDX_7_0_MD_CLEAR) { 3445 add_x86_feature(featureset, X86FSET_MD_CLEAR); 3446 } 3447 3448 if (ecp->cp_edx & CPUID_INTC_EDX_7_0_SPEC_CTRL) { 3449 add_x86_feature(featureset, X86FSET_IBRS); 3450 add_x86_feature(featureset, X86FSET_IBPB); 3451 } 3452 3453 if (ecp->cp_edx & CPUID_INTC_EDX_7_0_STIBP) { 3454 add_x86_feature(featureset, X86FSET_STIBP); 3455 } 3456 3457 /* 3458 * Some prediction controls are enumerated by subleaf 2 of 3459 * leaf 7. 3460 */ 3461 if (CPI_FEATURES_7_2_EDX(cpi) & CPUID_INTC_EDX_7_2_BHI_CTRL) { 3462 add_x86_feature(featureset, X86FSET_BHI_CTRL); 3463 } 3464 3465 /* 3466 * Don't read the arch caps MSR on xpv where we lack the 3467 * on_trap(). 3468 */ 3469 #ifndef __xpv 3470 if (ecp->cp_edx & CPUID_INTC_EDX_7_0_ARCH_CAPS) { 3471 on_trap_data_t otd; 3472 3473 /* 3474 * Be paranoid and assume we'll get a #GP. 
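/*
 * Illustrative sketch, not part of the actual implementation: the guarded
 * rdmsr pattern used just below. on_trap() arms a handler so that a #GP from
 * an unimplemented MSR unwinds back here (on_trap() then appears to return
 * non-zero) instead of panicking, and no_trap() disarms it. The example_*
 * name is hypothetical.
 */
static boolean_t
example_read_msr_safely(uint_t msr, uint64_t *valp)
{
	on_trap_data_t otd;
	boolean_t ok = B_FALSE;

	if (!on_trap(&otd, OT_DATA_ACCESS)) {
		*valp = rdmsr(msr);
		ok = B_TRUE;
	}
	no_trap();
	return (ok);
}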
3475 */ 3476 if (!on_trap(&otd, OT_DATA_ACCESS)) { 3477 uint64_t reg; 3478 3479 reg = rdmsr(MSR_IA32_ARCH_CAPABILITIES); 3480 if (reg & IA32_ARCH_CAP_RDCL_NO) { 3481 add_x86_feature(featureset, 3482 X86FSET_RDCL_NO); 3483 } 3484 if (reg & IA32_ARCH_CAP_IBRS_ALL) { 3485 add_x86_feature(featureset, 3486 X86FSET_IBRS_ALL); 3487 } 3488 if (reg & IA32_ARCH_CAP_RSBA) { 3489 add_x86_feature(featureset, 3490 X86FSET_RSBA); 3491 } 3492 if (reg & IA32_ARCH_CAP_SKIP_L1DFL_VMENTRY) { 3493 add_x86_feature(featureset, 3494 X86FSET_L1D_VM_NO); 3495 } 3496 if (reg & IA32_ARCH_CAP_SSB_NO) { 3497 add_x86_feature(featureset, 3498 X86FSET_SSB_NO); 3499 } 3500 if (reg & IA32_ARCH_CAP_MDS_NO) { 3501 add_x86_feature(featureset, 3502 X86FSET_MDS_NO); 3503 } 3504 if (reg & IA32_ARCH_CAP_TSX_CTRL) { 3505 add_x86_feature(featureset, 3506 X86FSET_TSX_CTRL); 3507 } 3508 if (reg & IA32_ARCH_CAP_TAA_NO) { 3509 add_x86_feature(featureset, 3510 X86FSET_TAA_NO); 3511 } 3512 if (reg & IA32_ARCH_CAP_RFDS_NO) { 3513 add_x86_feature(featureset, 3514 X86FSET_RFDS_NO); 3515 } 3516 if (reg & IA32_ARCH_CAP_RFDS_CLEAR) { 3517 add_x86_feature(featureset, 3518 X86FSET_RFDS_CLEAR); 3519 } 3520 if (reg & IA32_ARCH_CAP_PBRSB_NO) { 3521 add_x86_feature(featureset, 3522 X86FSET_PBRSB_NO); 3523 } 3524 if (reg & IA32_ARCH_CAP_BHI_NO) { 3525 add_x86_feature(featureset, 3526 X86FSET_BHI_NO); 3527 } 3528 } 3529 no_trap(); 3530 } 3531 #endif /* !__xpv */ 3532 3533 if (ecp->cp_edx & CPUID_INTC_EDX_7_0_SSBD) 3534 add_x86_feature(featureset, X86FSET_SSBD); 3535 3536 if (ecp->cp_edx & CPUID_INTC_EDX_7_0_FLUSH_CMD) 3537 add_x86_feature(featureset, X86FSET_FLUSH_CMD); 3538 } 3539 3540 /* 3541 * Take care of certain mitigations on the non-boot CPU. The boot CPU 3542 * will have already run this function and determined what we need to 3543 * do. This gives us a hook for per-HW thread mitigations such as 3544 * enhanced IBRS, or disabling TSX. 3545 */ 3546 if (cpu->cpu_id != 0) { 3547 switch (x86_spectrev2_mitigation) { 3548 case X86_SPECTREV2_ENHANCED_IBRS: 3549 cpuid_enable_enhanced_ibrs(); 3550 break; 3551 case X86_SPECTREV2_AUTO_IBRS: 3552 cpuid_enable_auto_ibrs(); 3553 break; 3554 default: 3555 break; 3556 } 3557 3558 /* If we're committed to BHI_DIS_S, set it for this core. */ 3559 if (x86_bhi_mitigation == X86_BHI_DIS_S) 3560 cpuid_enable_bhi_dis_s(); 3561 3562 cpuid_apply_tsx(x86_taa_mitigation, featureset); 3563 return; 3564 } 3565 3566 /* 3567 * Go through and initialize various security mechanisms that we should 3568 * only do on a single CPU. This includes Spectre V2, L1TF, MDS, and 3569 * TAA. 3570 */ 3571 3572 /* 3573 * By default we've come in with retpolines enabled. Check whether we 3574 * should disable them or enable enhanced or automatic IBRS. 3575 * 3576 * Note, we do not allow the use of AMD optimized retpolines as it was 3577 * disclosed by AMD in March 2022 that they were still 3578 * vulnerable. Prior to that point, we used them. 
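/*
 * Illustrative sketch, not part of the actual implementation: the
 * IA32_ARCH_CAPABILITIES decoding earlier in this function expressed as a
 * table instead of an if-chain. The bit and feature names are the same ones
 * used above; the example_* names are hypothetical.
 */
static const struct {
	uint64_t ecm_bit;
	uint_t ecm_feature;
} example_arch_caps_map[] = {
	{ IA32_ARCH_CAP_RDCL_NO,		X86FSET_RDCL_NO },
	{ IA32_ARCH_CAP_IBRS_ALL,		X86FSET_IBRS_ALL },
	{ IA32_ARCH_CAP_RSBA,			X86FSET_RSBA },
	{ IA32_ARCH_CAP_SKIP_L1DFL_VMENTRY,	X86FSET_L1D_VM_NO },
	{ IA32_ARCH_CAP_SSB_NO,			X86FSET_SSB_NO },
	{ IA32_ARCH_CAP_MDS_NO,			X86FSET_MDS_NO },
	{ IA32_ARCH_CAP_TSX_CTRL,		X86FSET_TSX_CTRL },
	{ IA32_ARCH_CAP_TAA_NO,			X86FSET_TAA_NO },
	{ IA32_ARCH_CAP_RFDS_NO,		X86FSET_RFDS_NO },
	{ IA32_ARCH_CAP_RFDS_CLEAR,		X86FSET_RFDS_CLEAR },
	{ IA32_ARCH_CAP_PBRSB_NO,		X86FSET_PBRSB_NO },
	{ IA32_ARCH_CAP_BHI_NO,			X86FSET_BHI_NO },
};

static void
example_decode_arch_caps(uint64_t reg, uchar_t *featureset)
{
	uint_t i;

	for (i = 0; i < ARRAY_SIZE(example_arch_caps_map); i++) {
		if (reg & example_arch_caps_map[i].ecm_bit) {
			add_x86_feature(featureset,
			    example_arch_caps_map[i].ecm_feature);
		}
	}
}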
3579 */ 3580 if (x86_disable_spectrev2 != 0) { 3581 v2mit = X86_SPECTREV2_DISABLED; 3582 } else if (is_x86_feature(featureset, X86FSET_AUTO_IBRS)) { 3583 cpuid_enable_auto_ibrs(); 3584 v2mit = X86_SPECTREV2_AUTO_IBRS; 3585 } else if (is_x86_feature(featureset, X86FSET_IBRS_ALL)) { 3586 cpuid_enable_enhanced_ibrs(); 3587 v2mit = X86_SPECTREV2_ENHANCED_IBRS; 3588 } else { 3589 v2mit = X86_SPECTREV2_RETPOLINE; 3590 } 3591 3592 cpuid_patch_retpolines(v2mit); 3593 cpuid_patch_rsb(v2mit, is_x86_feature(featureset, X86FSET_PBRSB_NO)); 3594 x86_bhi_mitigation = cpuid_learn_and_patch_bhi(v2mit, cpu, featureset); 3595 x86_spectrev2_mitigation = v2mit; 3596 membar_producer(); 3597 3598 /* 3599 * We need to determine what changes are required for mitigating L1TF 3600 * and MDS. If the CPU suffers from either of them, then SMT exclusion 3601 * is required. 3602 * 3603 * If any of these are present, then we need to flush u-arch state at 3604 * various points. For MDS, we need to do so whenever we change to a 3605 * lesser privilege level or we are halting the CPU. For L1TF we need to 3606 * flush the L1D cache at VM entry. When we have microcode that handles 3607 * MDS, the L1D flush also clears the other u-arch state that the 3608 * md_clear does. 3609 */ 3610 3611 /* 3612 * Update whether or not we need to be taking explicit action against 3613 * MDS or RFDS. 3614 */ 3615 cpuid_update_md_clear(cpu, featureset); 3616 3617 /* 3618 * Determine whether SMT exclusion is required and whether or not we 3619 * need to perform an l1d flush. 3620 */ 3621 cpuid_update_l1d_flush(cpu, featureset); 3622 3623 /* 3624 * Determine what our mitigation strategy should be for TAA and then 3625 * also apply TAA mitigations. 3626 */ 3627 cpuid_update_tsx(cpu, featureset); 3628 cpuid_apply_tsx(x86_taa_mitigation, featureset); 3629 } 3630 3631 /* 3632 * Setup XFeature_Enabled_Mask register. Required by xsave feature. 3633 */ 3634 void 3635 setup_xfem(void) 3636 { 3637 uint64_t flags = XFEATURE_LEGACY_FP; 3638 3639 ASSERT(is_x86_feature(x86_featureset, X86FSET_XSAVE)); 3640 3641 if (is_x86_feature(x86_featureset, X86FSET_SSE)) 3642 flags |= XFEATURE_SSE; 3643 3644 if (is_x86_feature(x86_featureset, X86FSET_AVX)) 3645 flags |= XFEATURE_AVX; 3646 3647 if (is_x86_feature(x86_featureset, X86FSET_AVX512F)) 3648 flags |= XFEATURE_AVX512; 3649 3650 set_xcr(XFEATURE_ENABLED_MASK, flags); 3651 3652 xsave_bv_all = flags; 3653 } 3654 3655 static void 3656 cpuid_basic_topology(cpu_t *cpu, uchar_t *featureset) 3657 { 3658 struct cpuid_info *cpi; 3659 3660 cpi = cpu->cpu_m.mcpu_cpi; 3661 3662 if (cpi->cpi_vendor == X86_VENDOR_AMD || 3663 cpi->cpi_vendor == X86_VENDOR_HYGON) { 3664 cpuid_gather_amd_topology_leaves(cpu); 3665 } 3666 3667 cpi->cpi_apicid = cpuid_gather_apicid(cpi); 3668 3669 /* 3670 * Before we can calculate the IDs that we should assign to this 3671 * processor, we need to understand how many cores and threads it has. 3672 */ 3673 switch (cpi->cpi_vendor) { 3674 case X86_VENDOR_Intel: 3675 cpuid_intel_ncores(cpi, &cpi->cpi_ncpu_per_chip, 3676 &cpi->cpi_ncore_per_chip); 3677 break; 3678 case X86_VENDOR_AMD: 3679 case X86_VENDOR_HYGON: 3680 cpuid_amd_ncores(cpi, &cpi->cpi_ncpu_per_chip, 3681 &cpi->cpi_ncore_per_chip); 3682 break; 3683 default: 3684 /* 3685 * If we have some other x86 compatible chip, it's not clear how 3686 * they would behave. The most common case is virtualization 3687 * today, though there are also 64-bit VIA chips. Assume that 3688 * all we can get is the basic Leaf 1 HTT information. 
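/*
 * Illustrative sketch, not part of the actual implementation: how the per-chip
 * counts computed above translate into the CMP and HTT flags assigned just
 * below. For example, 1 core / 1 thread sets neither, 1 core / 2 threads sets
 * only HTT, 4 cores / 4 threads sets only CMP, and 4 cores / 8 threads sets
 * both. The example_* name is hypothetical.
 */
static void
example_cmp_htt(uint_t ncores, uint_t ncpus, boolean_t *cmp, boolean_t *htt)
{
	*cmp = (ncores > 1);
	*htt = (ncpus > 1 && ncpus != ncores);
}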
3689 */ 3690 if ((cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_HTT) != 0) { 3691 cpi->cpi_ncore_per_chip = 1; 3692 cpi->cpi_ncpu_per_chip = CPI_CPU_COUNT(cpi); 3693 } 3694 break; 3695 } 3696 3697 /* 3698 * Based on the calculated number of threads and cores, potentially 3699 * assign the HTT and CMT features. 3700 */ 3701 if (cpi->cpi_ncore_per_chip > 1) { 3702 add_x86_feature(featureset, X86FSET_CMP); 3703 } 3704 3705 if (cpi->cpi_ncpu_per_chip > 1 && 3706 cpi->cpi_ncpu_per_chip != cpi->cpi_ncore_per_chip) { 3707 add_x86_feature(featureset, X86FSET_HTT); 3708 } 3709 3710 /* 3711 * Now that has been set up, we need to go through and calculate all of 3712 * the rest of the parameters that exist. If we think the CPU doesn't 3713 * have either SMT (HTT) or CMP, then we basically go through and fake 3714 * up information in some way. The most likely case for this is 3715 * virtualization where we have a lot of partial topology information. 3716 */ 3717 if (!is_x86_feature(featureset, X86FSET_HTT) && 3718 !is_x86_feature(featureset, X86FSET_CMP)) { 3719 /* 3720 * This is a single core, single-threaded processor. 3721 */ 3722 cpi->cpi_procnodes_per_pkg = 1; 3723 cpi->cpi_cores_per_compunit = 1; 3724 cpi->cpi_compunitid = 0; 3725 cpi->cpi_chipid = -1; 3726 cpi->cpi_clogid = 0; 3727 cpi->cpi_coreid = cpu->cpu_id; 3728 cpi->cpi_pkgcoreid = 0; 3729 if (cpi->cpi_vendor == X86_VENDOR_AMD || 3730 cpi->cpi_vendor == X86_VENDOR_HYGON) { 3731 cpi->cpi_procnodeid = BITX(cpi->cpi_apicid, 3, 0); 3732 } else { 3733 cpi->cpi_procnodeid = cpi->cpi_chipid; 3734 } 3735 } else { 3736 switch (cpi->cpi_vendor) { 3737 case X86_VENDOR_Intel: 3738 cpuid_intel_getids(cpu, featureset); 3739 break; 3740 case X86_VENDOR_AMD: 3741 case X86_VENDOR_HYGON: 3742 cpuid_amd_getids(cpu, featureset); 3743 break; 3744 default: 3745 /* 3746 * In this case, it's hard to say what we should do. 3747 * We're going to model them to the OS as single core 3748 * threads. We don't have a good identifier for them, so 3749 * we're just going to use the cpu id all on a single 3750 * chip. 3751 * 3752 * This case has historically been different from the 3753 * case above where we don't have HTT or CMP. While they 3754 * could be combined, we've opted to keep it separate to 3755 * minimize the risk of topology changes in weird cases. 3756 */ 3757 cpi->cpi_procnodes_per_pkg = 1; 3758 cpi->cpi_cores_per_compunit = 1; 3759 cpi->cpi_chipid = 0; 3760 cpi->cpi_coreid = cpu->cpu_id; 3761 cpi->cpi_clogid = cpu->cpu_id; 3762 cpi->cpi_pkgcoreid = cpu->cpu_id; 3763 cpi->cpi_procnodeid = cpi->cpi_chipid; 3764 cpi->cpi_compunitid = cpi->cpi_coreid; 3765 break; 3766 } 3767 } 3768 } 3769 3770 /* 3771 * Gather relevant CPU features from leaf 6 which covers thermal information. We 3772 * always gather leaf 6 if it's supported; however, we only look for features on 3773 * Intel systems as AMD does not currently define any of the features we look 3774 * for below. 
3775 */ 3776 static void 3777 cpuid_basic_thermal(cpu_t *cpu, uchar_t *featureset) 3778 { 3779 struct cpuid_regs *cp; 3780 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3781 3782 if (cpi->cpi_maxeax < 6) { 3783 return; 3784 } 3785 3786 cp = &cpi->cpi_std[6]; 3787 cp->cp_eax = 6; 3788 cp->cp_ebx = cp->cp_ecx = cp->cp_edx = 0; 3789 (void) __cpuid_insn(cp); 3790 platform_cpuid_mangle(cpi->cpi_vendor, 6, cp); 3791 3792 if (cpi->cpi_vendor != X86_VENDOR_Intel) { 3793 return; 3794 } 3795 3796 if ((cp->cp_eax & CPUID_INTC_EAX_DTS) != 0) { 3797 add_x86_feature(featureset, X86FSET_CORE_THERMAL); 3798 } 3799 3800 if ((cp->cp_eax & CPUID_INTC_EAX_PTM) != 0) { 3801 add_x86_feature(featureset, X86FSET_PKG_THERMAL); 3802 } 3803 } 3804 3805 /* 3806 * This is used when we discover that we have AVX support in cpuid. This 3807 * proceeds to scan for the rest of the AVX derived features. 3808 */ 3809 static void 3810 cpuid_basic_avx(cpu_t *cpu, uchar_t *featureset) 3811 { 3812 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3813 3814 /* 3815 * If we don't have AVX, don't bother with most of this. 3816 */ 3817 if ((cpi->cpi_std[1].cp_ecx & CPUID_INTC_ECX_AVX) == 0) 3818 return; 3819 3820 add_x86_feature(featureset, X86FSET_AVX); 3821 3822 /* 3823 * Intel says we can't check these without also 3824 * checking AVX. 3825 */ 3826 if (cpi->cpi_std[1].cp_ecx & CPUID_INTC_ECX_F16C) 3827 add_x86_feature(featureset, X86FSET_F16C); 3828 3829 if (cpi->cpi_std[1].cp_ecx & CPUID_INTC_ECX_FMA) 3830 add_x86_feature(featureset, X86FSET_FMA); 3831 3832 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_BMI1) 3833 add_x86_feature(featureset, X86FSET_BMI1); 3834 3835 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_BMI2) 3836 add_x86_feature(featureset, X86FSET_BMI2); 3837 3838 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX2) 3839 add_x86_feature(featureset, X86FSET_AVX2); 3840 3841 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_VAES) 3842 add_x86_feature(featureset, X86FSET_VAES); 3843 3844 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_VPCLMULQDQ) 3845 add_x86_feature(featureset, X86FSET_VPCLMULQDQ); 3846 3847 /* 3848 * The rest of the AVX features require AVX512. Do not check them unless 3849 * it is present. 
3850 */ 3851 if ((cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512F) == 0) 3852 return; 3853 add_x86_feature(featureset, X86FSET_AVX512F); 3854 3855 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512DQ) 3856 add_x86_feature(featureset, X86FSET_AVX512DQ); 3857 3858 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512IFMA) 3859 add_x86_feature(featureset, X86FSET_AVX512FMA); 3860 3861 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512PF) 3862 add_x86_feature(featureset, X86FSET_AVX512PF); 3863 3864 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512ER) 3865 add_x86_feature(featureset, X86FSET_AVX512ER); 3866 3867 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512CD) 3868 add_x86_feature(featureset, X86FSET_AVX512CD); 3869 3870 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512BW) 3871 add_x86_feature(featureset, X86FSET_AVX512BW); 3872 3873 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512VL) 3874 add_x86_feature(featureset, X86FSET_AVX512VL); 3875 3876 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VBMI) 3877 add_x86_feature(featureset, X86FSET_AVX512VBMI); 3878 3879 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VBMI2) 3880 add_x86_feature(featureset, X86FSET_AVX512_VBMI2); 3881 3882 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VNNI) 3883 add_x86_feature(featureset, X86FSET_AVX512VNNI); 3884 3885 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512BITALG) 3886 add_x86_feature(featureset, X86FSET_AVX512_BITALG); 3887 3888 if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VPOPCDQ) 3889 add_x86_feature(featureset, X86FSET_AVX512VPOPCDQ); 3890 3891 if (cpi->cpi_std[7].cp_edx & CPUID_INTC_EDX_7_0_AVX5124NNIW) 3892 add_x86_feature(featureset, X86FSET_AVX512NNIW); 3893 3894 if (cpi->cpi_std[7].cp_edx & CPUID_INTC_EDX_7_0_AVX5124FMAPS) 3895 add_x86_feature(featureset, X86FSET_AVX512FMAPS); 3896 3897 /* 3898 * More features here are in Leaf 7, subleaf 1. Don't bother checking if 3899 * we don't need to. 3900 */ 3901 if (cpi->cpi_std[7].cp_eax < 1) 3902 return; 3903 3904 if (cpi->cpi_sub7[0].cp_eax & CPUID_INTC_EAX_7_1_AVX512_BF16) 3905 add_x86_feature(featureset, X86FSET_AVX512_BF16); 3906 } 3907 3908 /* 3909 * PPIN is the protected processor inventory number. On AMD this is an actual 3910 * feature bit. However, on Intel systems we need to read the platform 3911 * information MSR if we're on a specific model. 3912 */ 3913 #if !defined(__xpv) 3914 static void 3915 cpuid_basic_ppin(cpu_t *cpu, uchar_t *featureset) 3916 { 3917 on_trap_data_t otd; 3918 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 3919 3920 switch (cpi->cpi_vendor) { 3921 case X86_VENDOR_AMD: 3922 /* 3923 * This leaf will have already been gathered in the topology 3924 * functions. 
3925 */ 3926 if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) { 3927 if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_PPIN) { 3928 add_x86_feature(featureset, X86FSET_PPIN); 3929 } 3930 } 3931 break; 3932 case X86_VENDOR_Intel: 3933 if (cpi->cpi_family != 6) 3934 break; 3935 switch (cpi->cpi_model) { 3936 case INTC_MODEL_IVYBRIDGE_XEON: 3937 case INTC_MODEL_HASWELL_XEON: 3938 case INTC_MODEL_BROADWELL_XEON: 3939 case INTC_MODEL_BROADWELL_XEON_D: 3940 case INTC_MODEL_SKYLAKE_XEON: 3941 case INTC_MODEL_ICELAKE_XEON: 3942 if (!on_trap(&otd, OT_DATA_ACCESS)) { 3943 uint64_t value; 3944 3945 value = rdmsr(MSR_PLATFORM_INFO); 3946 if ((value & MSR_PLATFORM_INFO_PPIN) != 0) { 3947 add_x86_feature(featureset, 3948 X86FSET_PPIN); 3949 } 3950 } 3951 no_trap(); 3952 break; 3953 default: 3954 break; 3955 } 3956 break; 3957 default: 3958 break; 3959 } 3960 } 3961 #endif /* ! __xpv */ 3962 3963 static void 3964 cpuid_pass_prelude(cpu_t *cpu, void *arg) 3965 { 3966 uchar_t *featureset = (uchar_t *)arg; 3967 3968 /* 3969 * We don't run on any processor that doesn't have cpuid, and could not 3970 * possibly have arrived here. 3971 */ 3972 add_x86_feature(featureset, X86FSET_CPUID); 3973 } 3974 3975 static void 3976 cpuid_pass_ident(cpu_t *cpu, void *arg __unused) 3977 { 3978 struct cpuid_info *cpi; 3979 struct cpuid_regs *cp; 3980 3981 /* 3982 * We require that virtual/native detection be complete and that PCI 3983 * config space access has been set up; at present there is no reliable 3984 * way to determine the latter. 3985 */ 3986 #if !defined(__xpv) 3987 ASSERT3S(platform_type, !=, -1); 3988 #endif /* !__xpv */ 3989 3990 cpi = cpu->cpu_m.mcpu_cpi; 3991 ASSERT(cpi != NULL); 3992 3993 cp = &cpi->cpi_std[0]; 3994 cp->cp_eax = 0; 3995 cpi->cpi_maxeax = __cpuid_insn(cp); 3996 { 3997 uint32_t *iptr = (uint32_t *)cpi->cpi_vendorstr; 3998 *iptr++ = cp->cp_ebx; 3999 *iptr++ = cp->cp_edx; 4000 *iptr++ = cp->cp_ecx; 4001 *(char *)&cpi->cpi_vendorstr[12] = '\0'; 4002 } 4003 4004 cpi->cpi_vendor = _cpuid_vendorstr_to_vendorcode(cpi->cpi_vendorstr); 4005 x86_vendor = cpi->cpi_vendor; /* for compatibility */ 4006 4007 /* 4008 * Limit the range in case of weird hardware 4009 */ 4010 if (cpi->cpi_maxeax > CPI_MAXEAX_MAX) 4011 cpi->cpi_maxeax = CPI_MAXEAX_MAX; 4012 if (cpi->cpi_maxeax < 1) 4013 return; 4014 4015 cp = &cpi->cpi_std[1]; 4016 cp->cp_eax = 1; 4017 (void) __cpuid_insn(cp); 4018 4019 /* 4020 * Extract identifying constants for easy access. 4021 */ 4022 cpi->cpi_model = CPI_MODEL(cpi); 4023 cpi->cpi_family = CPI_FAMILY(cpi); 4024 4025 if (cpi->cpi_family == 0xf) 4026 cpi->cpi_family += CPI_FAMILY_XTD(cpi); 4027 4028 /* 4029 * Beware: AMD uses "extended model" iff base *FAMILY* == 0xf. 4030 * Intel, and presumably everyone else, uses model == 0xf, as 4031 * one would expect (max value means possible overflow). Sigh. 
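/*
 * Illustrative sketch, not part of the actual implementation: the display
 * family/model computation above and in the vendor switch below, following the
 * AMD rule, applied to a hypothetical leaf 1 %eax of 0x00800F82. Base family
 * is 0xf, so the extended family (0x08) is added, giving family 0x17; the
 * extended model (0x0) is shifted in, giving model 0x08; the stepping is 0x2.
 * The example_* name is hypothetical.
 */
static void
example_decode_fms_amd(uint32_t eax, uint_t *family, uint_t *model,
    uint_t *step)
{
	uint_t base_family = (eax >> 8) & 0xf;

	*family = base_family;
	if (base_family == 0xf)
		*family += (eax >> 20) & 0xff;		/* extended family */

	*model = (eax >> 4) & 0xf;
	if (base_family == 0xf)
		*model += ((eax >> 16) & 0xf) << 4;	/* extended model */

	*step = eax & 0xf;
}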
4032 */ 4033 4034 switch (cpi->cpi_vendor) { 4035 case X86_VENDOR_Intel: 4036 if (IS_EXTENDED_MODEL_INTEL(cpi)) 4037 cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4; 4038 break; 4039 case X86_VENDOR_AMD: 4040 if (CPI_FAMILY(cpi) == 0xf) 4041 cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4; 4042 break; 4043 case X86_VENDOR_HYGON: 4044 cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4; 4045 break; 4046 default: 4047 if (cpi->cpi_model == 0xf) 4048 cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4; 4049 break; 4050 } 4051 4052 cpi->cpi_step = CPI_STEP(cpi); 4053 cpi->cpi_brandid = CPI_BRANDID(cpi); 4054 4055 /* 4056 * Synthesize chip "revision" and socket type 4057 */ 4058 cpi->cpi_chiprev = _cpuid_chiprev(cpi->cpi_vendor, cpi->cpi_family, 4059 cpi->cpi_model, cpi->cpi_step); 4060 cpi->cpi_chiprevstr = _cpuid_chiprevstr(cpi->cpi_vendor, 4061 cpi->cpi_family, cpi->cpi_model, cpi->cpi_step); 4062 cpi->cpi_socket = _cpuid_skt(cpi->cpi_vendor, cpi->cpi_family, 4063 cpi->cpi_model, cpi->cpi_step); 4064 cpi->cpi_uarchrev = _cpuid_uarchrev(cpi->cpi_vendor, cpi->cpi_family, 4065 cpi->cpi_model, cpi->cpi_step); 4066 } 4067 4068 static void 4069 cpuid_pass_basic(cpu_t *cpu, void *arg) 4070 { 4071 uchar_t *featureset = (uchar_t *)arg; 4072 uint32_t mask_ecx, mask_edx; 4073 struct cpuid_info *cpi; 4074 struct cpuid_regs *cp; 4075 int xcpuid; 4076 #if !defined(__xpv) 4077 extern int idle_cpu_prefer_mwait; 4078 #endif 4079 4080 cpi = cpu->cpu_m.mcpu_cpi; 4081 ASSERT(cpi != NULL); 4082 4083 if (cpi->cpi_maxeax < 1) 4084 return; 4085 4086 /* 4087 * This was filled during the identification pass. 4088 */ 4089 cp = &cpi->cpi_std[1]; 4090 4091 /* 4092 * *default* assumptions: 4093 * - believe %edx feature word 4094 * - ignore %ecx feature word 4095 * - 32-bit virtual and physical addressing 4096 */ 4097 mask_edx = 0xffffffff; 4098 mask_ecx = 0; 4099 4100 cpi->cpi_pabits = cpi->cpi_vabits = 32; 4101 4102 switch (cpi->cpi_vendor) { 4103 case X86_VENDOR_Intel: 4104 if (cpi->cpi_family == 5) 4105 x86_type = X86_TYPE_P5; 4106 else if (IS_LEGACY_P6(cpi)) { 4107 x86_type = X86_TYPE_P6; 4108 pentiumpro_bug4046376 = 1; 4109 /* 4110 * Clear the SEP bit when it was set erroneously 4111 */ 4112 if (cpi->cpi_model < 3 && cpi->cpi_step < 3) 4113 cp->cp_edx &= ~CPUID_INTC_EDX_SEP; 4114 } else if (IS_NEW_F6(cpi) || cpi->cpi_family == 0xf) { 4115 x86_type = X86_TYPE_P4; 4116 /* 4117 * We don't currently depend on any of the %ecx 4118 * features until Prescott, so we'll only check 4119 * this from P4 onwards. We might want to revisit 4120 * that idea later. 4121 */ 4122 mask_ecx = 0xffffffff; 4123 } else if (cpi->cpi_family > 0xf) 4124 mask_ecx = 0xffffffff; 4125 /* 4126 * We don't support MONITOR/MWAIT if leaf 5 is not available 4127 * to obtain the monitor linesize. 4128 */ 4129 if (cpi->cpi_maxeax < 5) 4130 mask_ecx &= ~CPUID_INTC_ECX_MON; 4131 break; 4132 case X86_VENDOR_IntelClone: 4133 default: 4134 break; 4135 case X86_VENDOR_AMD: 4136 #if defined(OPTERON_ERRATUM_108) 4137 if (cpi->cpi_family == 0xf && cpi->cpi_model == 0xe) { 4138 cp->cp_eax = (0xf0f & cp->cp_eax) | 0xc0; 4139 cpi->cpi_model = 0xc; 4140 } else 4141 #endif 4142 if (cpi->cpi_family == 5) { 4143 /* 4144 * AMD K5 and K6 4145 * 4146 * These CPUs have an incomplete implementation 4147 * of MCA/MCE which we mask away. 4148 */ 4149 mask_edx &= ~(CPUID_INTC_EDX_MCE | CPUID_INTC_EDX_MCA); 4150 4151 /* 4152 * Model 0 uses the wrong (APIC) bit 4153 * to indicate PGE. Fix it here. 
4154 */ 4155 if (cpi->cpi_model == 0) { 4156 if (cp->cp_edx & 0x200) { 4157 cp->cp_edx &= ~0x200; 4158 cp->cp_edx |= CPUID_INTC_EDX_PGE; 4159 } 4160 } 4161 4162 /* 4163 * Early models had problems w/ MMX; disable. 4164 */ 4165 if (cpi->cpi_model < 6) 4166 mask_edx &= ~CPUID_INTC_EDX_MMX; 4167 } 4168 4169 /* 4170 * For newer families, SSE3 and CX16, at least, are valid; 4171 * enable all 4172 */ 4173 if (cpi->cpi_family >= 0xf) 4174 mask_ecx = 0xffffffff; 4175 /* 4176 * We don't support MONITOR/MWAIT if leaf 5 is not available 4177 * to obtain the monitor linesize. 4178 */ 4179 if (cpi->cpi_maxeax < 5) 4180 mask_ecx &= ~CPUID_INTC_ECX_MON; 4181 4182 #if !defined(__xpv) 4183 /* 4184 * AMD has not historically used MWAIT in the CPU's idle loop. 4185 * Pre-family-10h Opterons do not have the MWAIT instruction. We 4186 * know for certain that in at least family 17h, per AMD, mwait 4187 * is preferred. Families in-between are less certain. 4188 */ 4189 if (cpi->cpi_family < 0x17) { 4190 idle_cpu_prefer_mwait = 0; 4191 } 4192 #endif 4193 4194 break; 4195 case X86_VENDOR_HYGON: 4196 /* Enable all for Hygon Dhyana CPU */ 4197 mask_ecx = 0xffffffff; 4198 break; 4199 case X86_VENDOR_TM: 4200 /* 4201 * workaround the NT workaround in CMS 4.1 4202 */ 4203 if (cpi->cpi_family == 5 && cpi->cpi_model == 4 && 4204 (cpi->cpi_step == 2 || cpi->cpi_step == 3)) 4205 cp->cp_edx |= CPUID_INTC_EDX_CX8; 4206 break; 4207 case X86_VENDOR_Centaur: 4208 /* 4209 * workaround the NT workarounds again 4210 */ 4211 if (cpi->cpi_family == 6) 4212 cp->cp_edx |= CPUID_INTC_EDX_CX8; 4213 break; 4214 case X86_VENDOR_Cyrix: 4215 /* 4216 * We rely heavily on the probing in locore 4217 * to actually figure out what parts, if any, 4218 * of the Cyrix cpuid instruction to believe. 4219 */ 4220 switch (x86_type) { 4221 case X86_TYPE_CYRIX_486: 4222 mask_edx = 0; 4223 break; 4224 case X86_TYPE_CYRIX_6x86: 4225 mask_edx = 0; 4226 break; 4227 case X86_TYPE_CYRIX_6x86L: 4228 mask_edx = 4229 CPUID_INTC_EDX_DE | 4230 CPUID_INTC_EDX_CX8; 4231 break; 4232 case X86_TYPE_CYRIX_6x86MX: 4233 mask_edx = 4234 CPUID_INTC_EDX_DE | 4235 CPUID_INTC_EDX_MSR | 4236 CPUID_INTC_EDX_CX8 | 4237 CPUID_INTC_EDX_PGE | 4238 CPUID_INTC_EDX_CMOV | 4239 CPUID_INTC_EDX_MMX; 4240 break; 4241 case X86_TYPE_CYRIX_GXm: 4242 mask_edx = 4243 CPUID_INTC_EDX_MSR | 4244 CPUID_INTC_EDX_CX8 | 4245 CPUID_INTC_EDX_CMOV | 4246 CPUID_INTC_EDX_MMX; 4247 break; 4248 case X86_TYPE_CYRIX_MediaGX: 4249 break; 4250 case X86_TYPE_CYRIX_MII: 4251 case X86_TYPE_VIA_CYRIX_III: 4252 mask_edx = 4253 CPUID_INTC_EDX_DE | 4254 CPUID_INTC_EDX_TSC | 4255 CPUID_INTC_EDX_MSR | 4256 CPUID_INTC_EDX_CX8 | 4257 CPUID_INTC_EDX_PGE | 4258 CPUID_INTC_EDX_CMOV | 4259 CPUID_INTC_EDX_MMX; 4260 break; 4261 default: 4262 break; 4263 } 4264 break; 4265 } 4266 4267 #if defined(__xpv) 4268 /* 4269 * Do not support MONITOR/MWAIT under a hypervisor 4270 */ 4271 mask_ecx &= ~CPUID_INTC_ECX_MON; 4272 /* 4273 * Do not support XSAVE under a hypervisor for now 4274 */ 4275 xsave_force_disable = B_TRUE; 4276 4277 #endif /* __xpv */ 4278 4279 if (xsave_force_disable) { 4280 mask_ecx &= ~CPUID_INTC_ECX_XSAVE; 4281 mask_ecx &= ~CPUID_INTC_ECX_AVX; 4282 mask_ecx &= ~CPUID_INTC_ECX_F16C; 4283 mask_ecx &= ~CPUID_INTC_ECX_FMA; 4284 } 4285 4286 /* 4287 * Now we've figured out the masks that determine 4288 * which bits we choose to believe, apply the masks 4289 * to the feature words, then map the kernel's view 4290 * of these feature words into its feature word. 
4291 */ 4292 cp->cp_edx &= mask_edx; 4293 cp->cp_ecx &= mask_ecx; 4294 4295 /* 4296 * apply any platform restrictions (we don't call this 4297 * immediately after __cpuid_insn here, because we need the 4298 * workarounds applied above first) 4299 */ 4300 platform_cpuid_mangle(cpi->cpi_vendor, 1, cp); 4301 4302 /* 4303 * In addition to ecx and edx, Intel and AMD are storing a bunch of 4304 * instruction set extensions in leaf 7's ebx, ecx, and edx. Note, leaf 4305 * 7 has sub-leaves determined by ecx. 4306 */ 4307 if (cpi->cpi_maxeax >= 7) { 4308 struct cpuid_regs *ecp; 4309 ecp = &cpi->cpi_std[7]; 4310 ecp->cp_eax = 7; 4311 ecp->cp_ecx = 0; 4312 (void) __cpuid_insn(ecp); 4313 4314 /* 4315 * If XSAVE has been disabled, just ignore all of the 4316 * extended-save-area dependent flags here. By removing most of 4317 * the leaf 7, sub-leaf 0 flags, that will ensure that we don't 4318 * end up looking at additional xsave dependent leaves right 4319 * now. 4320 */ 4321 if (xsave_force_disable) { 4322 ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_BMI1; 4323 ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_BMI2; 4324 ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_AVX2; 4325 ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_MPX; 4326 ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_ALL_AVX512; 4327 ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_ALL_AVX512; 4328 ecp->cp_edx &= ~CPUID_INTC_EDX_7_0_ALL_AVX512; 4329 ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_VAES; 4330 ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_VPCLMULQDQ; 4331 ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_GFNI; 4332 } 4333 4334 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_SMEP) 4335 add_x86_feature(featureset, X86FSET_SMEP); 4336 4337 /* 4338 * We check disable_smap here in addition to in startup_smap() 4339 * to ensure CPUs that aren't the boot CPU don't accidentally 4340 * include it in the feature set and thus generate a mismatched 4341 * x86 feature set across CPUs. 4342 */ 4343 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_SMAP && 4344 disable_smap == 0) 4345 add_x86_feature(featureset, X86FSET_SMAP); 4346 4347 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_RDSEED) { 4348 add_x86_feature(featureset, X86FSET_RDSEED); 4349 if (cpi->cpi_vendor == X86_VENDOR_AMD) 4350 cpuid_evaluate_amd_rdseed(cpu, featureset); 4351 } 4352 4353 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_ADX) 4354 add_x86_feature(featureset, X86FSET_ADX); 4355 4356 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_FSGSBASE) 4357 add_x86_feature(featureset, X86FSET_FSGSBASE); 4358 4359 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_CLFLUSHOPT) 4360 add_x86_feature(featureset, X86FSET_CLFLUSHOPT); 4361 4362 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_INVPCID) 4363 add_x86_feature(featureset, X86FSET_INVPCID); 4364 4365 if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_UMIP) 4366 add_x86_feature(featureset, X86FSET_UMIP); 4367 if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_PKU) 4368 add_x86_feature(featureset, X86FSET_PKU); 4369 if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_OSPKE) 4370 add_x86_feature(featureset, X86FSET_OSPKE); 4371 if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_GFNI) 4372 add_x86_feature(featureset, X86FSET_GFNI); 4373 4374 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_CLWB) 4375 add_x86_feature(featureset, X86FSET_CLWB); 4376 4377 if (cpi->cpi_vendor == X86_VENDOR_Intel) { 4378 if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_MPX) 4379 add_x86_feature(featureset, X86FSET_MPX); 4380 } 4381 4382 /* 4383 * If we have subleaf 1 or 2 available, grab and store 4384 * that. This is used for more AVX and related features. 
4385 */ 4386 if (ecp->cp_eax >= 1) { 4387 struct cpuid_regs *c71; 4388 c71 = &cpi->cpi_sub7[0]; 4389 c71->cp_eax = 7; 4390 c71->cp_ecx = 1; 4391 (void) __cpuid_insn(c71); 4392 } 4393 4394 /* Subleaf 2 has certain security indicators in it. */ 4395 if (ecp->cp_eax >= 2) { 4396 struct cpuid_regs *c72; 4397 c72 = &cpi->cpi_sub7[1]; 4398 c72->cp_eax = 7; 4399 c72->cp_ecx = 2; 4400 (void) __cpuid_insn(c72); 4401 } 4402 } 4403 4404 /* 4405 * fold in overrides from the "eeprom" mechanism 4406 */ 4407 cp->cp_edx |= cpuid_feature_edx_include; 4408 cp->cp_edx &= ~cpuid_feature_edx_exclude; 4409 4410 cp->cp_ecx |= cpuid_feature_ecx_include; 4411 cp->cp_ecx &= ~cpuid_feature_ecx_exclude; 4412 4413 if (cp->cp_edx & CPUID_INTC_EDX_PSE) { 4414 add_x86_feature(featureset, X86FSET_LARGEPAGE); 4415 } 4416 if (cp->cp_edx & CPUID_INTC_EDX_TSC) { 4417 add_x86_feature(featureset, X86FSET_TSC); 4418 } 4419 if (cp->cp_edx & CPUID_INTC_EDX_MSR) { 4420 add_x86_feature(featureset, X86FSET_MSR); 4421 } 4422 if (cp->cp_edx & CPUID_INTC_EDX_MTRR) { 4423 add_x86_feature(featureset, X86FSET_MTRR); 4424 } 4425 if (cp->cp_edx & CPUID_INTC_EDX_PGE) { 4426 add_x86_feature(featureset, X86FSET_PGE); 4427 } 4428 if (cp->cp_edx & CPUID_INTC_EDX_CMOV) { 4429 add_x86_feature(featureset, X86FSET_CMOV); 4430 } 4431 if (cp->cp_edx & CPUID_INTC_EDX_MMX) { 4432 add_x86_feature(featureset, X86FSET_MMX); 4433 } 4434 if ((cp->cp_edx & CPUID_INTC_EDX_MCE) != 0 && 4435 (cp->cp_edx & CPUID_INTC_EDX_MCA) != 0) { 4436 add_x86_feature(featureset, X86FSET_MCA); 4437 } 4438 if (cp->cp_edx & CPUID_INTC_EDX_PAE) { 4439 add_x86_feature(featureset, X86FSET_PAE); 4440 } 4441 if (cp->cp_edx & CPUID_INTC_EDX_CX8) { 4442 add_x86_feature(featureset, X86FSET_CX8); 4443 } 4444 if (cp->cp_ecx & CPUID_INTC_ECX_CX16) { 4445 add_x86_feature(featureset, X86FSET_CX16); 4446 } 4447 if (cp->cp_edx & CPUID_INTC_EDX_PAT) { 4448 add_x86_feature(featureset, X86FSET_PAT); 4449 } 4450 if (cp->cp_edx & CPUID_INTC_EDX_SEP) { 4451 add_x86_feature(featureset, X86FSET_SEP); 4452 } 4453 if (cp->cp_edx & CPUID_INTC_EDX_FXSR) { 4454 /* 4455 * In our implementation, fxsave/fxrstor 4456 * are prerequisites before we'll even 4457 * try and do SSE things. 
4458 */ 4459 if (cp->cp_edx & CPUID_INTC_EDX_SSE) { 4460 add_x86_feature(featureset, X86FSET_SSE); 4461 } 4462 if (cp->cp_edx & CPUID_INTC_EDX_SSE2) { 4463 add_x86_feature(featureset, X86FSET_SSE2); 4464 } 4465 if (cp->cp_ecx & CPUID_INTC_ECX_SSE3) { 4466 add_x86_feature(featureset, X86FSET_SSE3); 4467 } 4468 if (cp->cp_ecx & CPUID_INTC_ECX_SSSE3) { 4469 add_x86_feature(featureset, X86FSET_SSSE3); 4470 } 4471 if (cp->cp_ecx & CPUID_INTC_ECX_SSE4_1) { 4472 add_x86_feature(featureset, X86FSET_SSE4_1); 4473 } 4474 if (cp->cp_ecx & CPUID_INTC_ECX_SSE4_2) { 4475 add_x86_feature(featureset, X86FSET_SSE4_2); 4476 } 4477 if (cp->cp_ecx & CPUID_INTC_ECX_AES) { 4478 add_x86_feature(featureset, X86FSET_AES); 4479 } 4480 if (cp->cp_ecx & CPUID_INTC_ECX_PCLMULQDQ) { 4481 add_x86_feature(featureset, X86FSET_PCLMULQDQ); 4482 } 4483 4484 if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_SHA) 4485 add_x86_feature(featureset, X86FSET_SHA); 4486 4487 if (cp->cp_ecx & CPUID_INTC_ECX_XSAVE) { 4488 add_x86_feature(featureset, X86FSET_XSAVE); 4489 4490 /* We only test AVX & AVX512 when there is XSAVE */ 4491 cpuid_basic_avx(cpu, featureset); 4492 } 4493 } 4494 4495 if (cp->cp_ecx & CPUID_INTC_ECX_PCID) { 4496 add_x86_feature(featureset, X86FSET_PCID); 4497 } 4498 4499 if (cp->cp_ecx & CPUID_INTC_ECX_X2APIC) { 4500 add_x86_feature(featureset, X86FSET_X2APIC); 4501 } 4502 if (cp->cp_edx & CPUID_INTC_EDX_DE) { 4503 add_x86_feature(featureset, X86FSET_DE); 4504 } 4505 #if !defined(__xpv) 4506 if (cp->cp_ecx & CPUID_INTC_ECX_MON) { 4507 4508 /* 4509 * We require the CLFLUSH instruction for erratum workaround 4510 * to use MONITOR/MWAIT. 4511 */ 4512 if (cp->cp_edx & CPUID_INTC_EDX_CLFSH) { 4513 cpi->cpi_mwait.support |= MWAIT_SUPPORT; 4514 add_x86_feature(featureset, X86FSET_MWAIT); 4515 } else { 4516 extern int idle_cpu_assert_cflush_monitor; 4517 4518 /* 4519 * All processors we are aware of which have 4520 * MONITOR/MWAIT also have CLFLUSH. 4521 */ 4522 if (idle_cpu_assert_cflush_monitor) { 4523 ASSERT((cp->cp_ecx & CPUID_INTC_ECX_MON) && 4524 (cp->cp_edx & CPUID_INTC_EDX_CLFSH)); 4525 } 4526 } 4527 } 4528 #endif /* __xpv */ 4529 4530 if (cp->cp_ecx & CPUID_INTC_ECX_VMX) { 4531 add_x86_feature(featureset, X86FSET_VMX); 4532 } 4533 4534 if (cp->cp_ecx & CPUID_INTC_ECX_RDRAND) 4535 add_x86_feature(featureset, X86FSET_RDRAND); 4536 4537 /* 4538 * Only need it first time, rest of the cpus would follow suit. 4539 * we only capture this for the bootcpu. 4540 */ 4541 if (cp->cp_edx & CPUID_INTC_EDX_CLFSH) { 4542 add_x86_feature(featureset, X86FSET_CLFSH); 4543 x86_clflush_size = (BITX(cp->cp_ebx, 15, 8) * 8); 4544 } 4545 if (is_x86_feature(featureset, X86FSET_PAE)) 4546 cpi->cpi_pabits = 36; 4547 4548 if (cpi->cpi_maxeax >= 0xD && !xsave_force_disable) { 4549 struct cpuid_regs r, *ecp; 4550 4551 ecp = &r; 4552 ecp->cp_eax = 0xD; 4553 ecp->cp_ecx = 1; 4554 ecp->cp_edx = ecp->cp_ebx = 0; 4555 (void) __cpuid_insn(ecp); 4556 4557 if (ecp->cp_eax & CPUID_INTC_EAX_D_1_XSAVEOPT) 4558 add_x86_feature(featureset, X86FSET_XSAVEOPT); 4559 if (ecp->cp_eax & CPUID_INTC_EAX_D_1_XSAVEC) 4560 add_x86_feature(featureset, X86FSET_XSAVEC); 4561 if (ecp->cp_eax & CPUID_INTC_EAX_D_1_XSAVES) 4562 add_x86_feature(featureset, X86FSET_XSAVES); 4563 4564 /* 4565 * Zen 2 family processors suffer from erratum 1386 that causes 4566 * xsaves to not function correctly in some circumstances. There 4567 * are no supervisor states in Zen 2 and earlier. 
Practically 4568 * speaking this has no impact for us as we currently do not 4569 * leverage compressed xsave formats. To safeguard against 4570 * issues in the future where we may opt to using it, we remove 4571 * it from the feature set now. While Matisse has a microcode 4572 * update available with a fix, not all Zen 2 CPUs do so it's 4573 * simpler for the moment to unconditionally remove it. 4574 */ 4575 if (cpi->cpi_vendor == X86_VENDOR_AMD && 4576 uarchrev_uarch(cpi->cpi_uarchrev) <= X86_UARCH_AMD_ZEN2) { 4577 remove_x86_feature(featureset, X86FSET_XSAVES); 4578 } 4579 } 4580 4581 /* 4582 * Work on the "extended" feature information, doing 4583 * some basic initialization to be used in the extended pass. 4584 */ 4585 xcpuid = 0; 4586 switch (cpi->cpi_vendor) { 4587 case X86_VENDOR_Intel: 4588 /* 4589 * On KVM we know we will have proper support for extended 4590 * cpuid. 4591 */ 4592 if (IS_NEW_F6(cpi) || cpi->cpi_family >= 0xf || 4593 (get_hwenv() == HW_KVM && cpi->cpi_family == 6 && 4594 (cpi->cpi_model == 6 || cpi->cpi_model == 2))) 4595 xcpuid++; 4596 break; 4597 case X86_VENDOR_AMD: 4598 if (cpi->cpi_family > 5 || 4599 (cpi->cpi_family == 5 && cpi->cpi_model >= 1)) 4600 xcpuid++; 4601 break; 4602 case X86_VENDOR_Cyrix: 4603 /* 4604 * Only these Cyrix CPUs are -known- to support 4605 * extended cpuid operations. 4606 */ 4607 if (x86_type == X86_TYPE_VIA_CYRIX_III || 4608 x86_type == X86_TYPE_CYRIX_GXm) 4609 xcpuid++; 4610 break; 4611 case X86_VENDOR_HYGON: 4612 case X86_VENDOR_Centaur: 4613 case X86_VENDOR_TM: 4614 default: 4615 xcpuid++; 4616 break; 4617 } 4618 4619 if (xcpuid) { 4620 cp = &cpi->cpi_extd[0]; 4621 cp->cp_eax = CPUID_LEAF_EXT_0; 4622 cpi->cpi_xmaxeax = __cpuid_insn(cp); 4623 } 4624 4625 if (cpi->cpi_xmaxeax & CPUID_LEAF_EXT_0) { 4626 4627 if (cpi->cpi_xmaxeax > CPI_XMAXEAX_MAX) 4628 cpi->cpi_xmaxeax = CPI_XMAXEAX_MAX; 4629 4630 switch (cpi->cpi_vendor) { 4631 case X86_VENDOR_Intel: 4632 case X86_VENDOR_AMD: 4633 case X86_VENDOR_HYGON: 4634 if (cpi->cpi_xmaxeax < 0x80000001) 4635 break; 4636 cp = &cpi->cpi_extd[1]; 4637 cp->cp_eax = 0x80000001; 4638 (void) __cpuid_insn(cp); 4639 4640 if (cpi->cpi_vendor == X86_VENDOR_AMD && 4641 cpi->cpi_family == 5 && 4642 cpi->cpi_model == 6 && 4643 cpi->cpi_step == 6) { 4644 /* 4645 * K6 model 6 uses bit 10 to indicate SYSC 4646 * Later models use bit 11. Fix it here. 4647 */ 4648 if (cp->cp_edx & 0x400) { 4649 cp->cp_edx &= ~0x400; 4650 cp->cp_edx |= CPUID_AMD_EDX_SYSC; 4651 } 4652 } 4653 4654 platform_cpuid_mangle(cpi->cpi_vendor, 0x80000001, cp); 4655 4656 /* 4657 * Compute the additions to the kernel's feature word. 4658 */ 4659 if (cp->cp_edx & CPUID_AMD_EDX_NX) { 4660 add_x86_feature(featureset, X86FSET_NX); 4661 } 4662 4663 /* 4664 * Regardless whether or not we boot 64-bit, 4665 * we should have a way to identify whether 4666 * the CPU is capable of running 64-bit. 4667 */ 4668 if (cp->cp_edx & CPUID_AMD_EDX_LM) { 4669 add_x86_feature(featureset, X86FSET_64); 4670 } 4671 4672 /* 1 GB large page - enable only for 64 bit kernel */ 4673 if (cp->cp_edx & CPUID_AMD_EDX_1GPG) { 4674 add_x86_feature(featureset, X86FSET_1GPG); 4675 } 4676 4677 if ((cpi->cpi_vendor == X86_VENDOR_AMD || 4678 cpi->cpi_vendor == X86_VENDOR_HYGON) && 4679 (cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_FXSR) && 4680 (cp->cp_ecx & CPUID_AMD_ECX_SSE4A)) { 4681 add_x86_feature(featureset, X86FSET_SSE4A); 4682 } 4683 4684 /* 4685 * It's really tricky to support syscall/sysret in 4686 * the i386 kernel; we rely on sysenter/sysexit 4687 * instead. 
In the amd64 kernel, things are -way- 4688 * better. 4689 */ 4690 if (cp->cp_edx & CPUID_AMD_EDX_SYSC) { 4691 add_x86_feature(featureset, X86FSET_ASYSC); 4692 } 4693 4694 /* 4695 * While we're thinking about system calls, note 4696 * that AMD processors don't support sysenter 4697 * in long mode at all, so don't try to program them. 4698 */ 4699 if (x86_vendor == X86_VENDOR_AMD || 4700 x86_vendor == X86_VENDOR_HYGON) { 4701 remove_x86_feature(featureset, X86FSET_SEP); 4702 } 4703 4704 if (cp->cp_edx & CPUID_AMD_EDX_TSCP) { 4705 add_x86_feature(featureset, X86FSET_TSCP); 4706 } 4707 4708 if (cp->cp_ecx & CPUID_AMD_ECX_SVM) { 4709 add_x86_feature(featureset, X86FSET_SVM); 4710 } 4711 4712 if (cp->cp_ecx & CPUID_AMD_ECX_TOPOEXT) { 4713 add_x86_feature(featureset, X86FSET_TOPOEXT); 4714 } 4715 4716 if (cp->cp_ecx & CPUID_AMD_ECX_PCEC) { 4717 add_x86_feature(featureset, X86FSET_AMD_PCEC); 4718 } 4719 4720 if (cp->cp_ecx & CPUID_AMD_ECX_XOP) { 4721 add_x86_feature(featureset, X86FSET_XOP); 4722 } 4723 4724 if (cp->cp_ecx & CPUID_AMD_ECX_FMA4) { 4725 add_x86_feature(featureset, X86FSET_FMA4); 4726 } 4727 4728 if (cp->cp_ecx & CPUID_AMD_ECX_TBM) { 4729 add_x86_feature(featureset, X86FSET_TBM); 4730 } 4731 4732 if (cp->cp_ecx & CPUID_AMD_ECX_MONITORX) { 4733 add_x86_feature(featureset, X86FSET_MONITORX); 4734 } 4735 break; 4736 default: 4737 break; 4738 } 4739 4740 /* 4741 * Get CPUID data about processor cores and hyperthreads. 4742 */ 4743 switch (cpi->cpi_vendor) { 4744 case X86_VENDOR_Intel: 4745 if (cpi->cpi_maxeax >= 4) { 4746 cp = &cpi->cpi_std[4]; 4747 cp->cp_eax = 4; 4748 cp->cp_ecx = 0; 4749 (void) __cpuid_insn(cp); 4750 platform_cpuid_mangle(cpi->cpi_vendor, 4, cp); 4751 } 4752 /*FALLTHROUGH*/ 4753 case X86_VENDOR_AMD: 4754 case X86_VENDOR_HYGON: 4755 if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_8) 4756 break; 4757 cp = &cpi->cpi_extd[8]; 4758 cp->cp_eax = CPUID_LEAF_EXT_8; 4759 (void) __cpuid_insn(cp); 4760 platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_8, 4761 cp); 4762 4763 /* 4764 * AMD uses ebx for some extended functions. 4765 */ 4766 if (cpi->cpi_vendor == X86_VENDOR_AMD || 4767 cpi->cpi_vendor == X86_VENDOR_HYGON) { 4768 /* 4769 * While we're here, check for the AMD "Error 4770 * Pointer Zero/Restore" feature. This can be 4771 * used to setup the FP save handlers 4772 * appropriately. 4773 */ 4774 if (cp->cp_ebx & CPUID_AMD_EBX_ERR_PTR_ZERO) { 4775 cpi->cpi_fp_amd_save = 0; 4776 } else { 4777 cpi->cpi_fp_amd_save = 1; 4778 } 4779 4780 if (cp->cp_ebx & CPUID_AMD_EBX_CLZERO) { 4781 add_x86_feature(featureset, 4782 X86FSET_CLZERO); 4783 } 4784 } 4785 4786 /* 4787 * Virtual and physical address limits from 4788 * cpuid override previously guessed values. 4789 */ 4790 cpi->cpi_pabits = BITX(cp->cp_eax, 7, 0); 4791 cpi->cpi_vabits = BITX(cp->cp_eax, 15, 8); 4792 break; 4793 default: 4794 break; 4795 } 4796 4797 /* 4798 * Get CPUID data about TSC Invariance in Deep C-State. 4799 */ 4800 switch (cpi->cpi_vendor) { 4801 case X86_VENDOR_Intel: 4802 case X86_VENDOR_AMD: 4803 case X86_VENDOR_HYGON: 4804 if (cpi->cpi_maxeax >= 7) { 4805 cp = &cpi->cpi_extd[7]; 4806 cp->cp_eax = 0x80000007; 4807 cp->cp_ecx = 0; 4808 (void) __cpuid_insn(cp); 4809 } 4810 break; 4811 default: 4812 break; 4813 } 4814 } 4815 4816 /* 4817 * cpuid_basic_ppin assumes that cpuid_basic_topology has already been 4818 * run and thus gathered some of its dependent leaves. 
4819 */ 4820 cpuid_basic_topology(cpu, featureset); 4821 cpuid_basic_thermal(cpu, featureset); 4822 #if !defined(__xpv) 4823 cpuid_basic_ppin(cpu, featureset); 4824 #endif 4825 4826 if (cpi->cpi_vendor == X86_VENDOR_AMD || 4827 cpi->cpi_vendor == X86_VENDOR_HYGON) { 4828 if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8 && 4829 cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_ERR_PTR_ZERO) { 4830 /* Special handling for AMD FP not necessary. */ 4831 cpi->cpi_fp_amd_save = 0; 4832 } else { 4833 cpi->cpi_fp_amd_save = 1; 4834 } 4835 } 4836 4837 /* 4838 * Check (and potentially set) if lfence is serializing. 4839 * This is useful for accurate rdtsc measurements and AMD retpolines. 4840 */ 4841 if ((cpi->cpi_vendor == X86_VENDOR_AMD || 4842 cpi->cpi_vendor == X86_VENDOR_HYGON) && 4843 is_x86_feature(featureset, X86FSET_SSE2)) { 4844 /* 4845 * The AMD white paper Software Techniques For Managing 4846 * Speculation on AMD Processors details circumstances for when 4847 * lfence instructions are serializing. 4848 * 4849 * On family 0xf and 0x11, it is inherently so. On family 0x10 4850 * and later (excluding 0x11), a bit in the DE_CFG MSR 4851 * determines the lfence behavior. Per that whitepaper, AMD has 4852 * committed to supporting that MSR on all later CPUs. 4853 */ 4854 if (cpi->cpi_family == 0xf || cpi->cpi_family == 0x11) { 4855 add_x86_feature(featureset, X86FSET_LFENCE_SER); 4856 } else if (cpi->cpi_family >= 0x10) { 4857 #if !defined(__xpv) 4858 uint64_t val; 4859 4860 /* 4861 * Be careful when attempting to enable the bit, and 4862 * verify that it was actually set in case we are 4863 * running in a hypervisor which is less than faithful 4864 * about its emulation of this feature. 4865 */ 4866 on_trap_data_t otd; 4867 if (!on_trap(&otd, OT_DATA_ACCESS)) { 4868 val = rdmsr(MSR_AMD_DE_CFG); 4869 val |= AMD_DE_CFG_LFENCE_DISPATCH; 4870 wrmsr(MSR_AMD_DE_CFG, val); 4871 val = rdmsr(MSR_AMD_DE_CFG); 4872 } else { 4873 val = 0; 4874 } 4875 no_trap(); 4876 4877 if ((val & AMD_DE_CFG_LFENCE_DISPATCH) != 0) { 4878 add_x86_feature(featureset, X86FSET_LFENCE_SER); 4879 } 4880 #endif 4881 } 4882 } else if (cpi->cpi_vendor == X86_VENDOR_Intel && 4883 is_x86_feature(featureset, X86FSET_SSE2)) { 4884 /* 4885 * Documentation and other OSes indicate that lfence is always 4886 * serializing on Intel CPUs. 4887 */ 4888 add_x86_feature(featureset, X86FSET_LFENCE_SER); 4889 } 4890 4891 4892 /* 4893 * Check the processor leaves that are used for security features. Grab 4894 * any additional processor-specific leaves that we may not have yet. 4895 */ 4896 switch (cpi->cpi_vendor) { 4897 case X86_VENDOR_AMD: 4898 case X86_VENDOR_HYGON: 4899 if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_21) { 4900 cp = &cpi->cpi_extd[0x21]; 4901 cp->cp_eax = CPUID_LEAF_EXT_21; 4902 cp->cp_ecx = 0; 4903 (void) __cpuid_insn(cp); 4904 } 4905 break; 4906 default: 4907 break; 4908 } 4909 4910 cpuid_scan_security(cpu, featureset); 4911 } 4912 4913 /* 4914 * Make copies of the cpuid table entries we depend on, in 4915 * part for ease of parsing now, in part so that we have only 4916 * one place to correct any of it, in part for ease of 4917 * later export to userland, and in part so we can look at 4918 * this stuff in a crash dump. 
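 *
 * A hedged sketch of how the cached copies are consumed later (see
 * cpuid_insn() further down in this file): a caller fills in the leaf it
 * wants and gets the stored register values back without cpuid being
 * re-executed, e.g. to pull the family and model fields back out of
 * cached leaf 1 (family and model are hypothetical locals):
 *
 *	struct cpuid_regs r = { 0 };
 *	r.cp_eax = 1;
 *	(void) cpuid_insn(NULL, &r);
 *	family = BITX(r.cp_eax, 11, 8);
 *	model = BITX(r.cp_eax, 7, 4);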
4919 */ 4920 4921 static void 4922 cpuid_pass_extended(cpu_t *cpu, void *_arg __unused) 4923 { 4924 uint_t n, nmax; 4925 int i; 4926 struct cpuid_regs *cp; 4927 uint8_t *dp; 4928 uint32_t *iptr; 4929 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 4930 4931 if (cpi->cpi_maxeax < 1) 4932 return; 4933 4934 if ((nmax = cpi->cpi_maxeax + 1) > NMAX_CPI_STD) 4935 nmax = NMAX_CPI_STD; 4936 /* 4937 * (We already handled n == 0 and n == 1 in the basic pass) 4938 */ 4939 for (n = 2, cp = &cpi->cpi_std[2]; n < nmax; n++, cp++) { 4940 /* 4941 * leaves 6 and 7 were handled in the basic pass 4942 */ 4943 if (n == 6 || n == 7) 4944 continue; 4945 4946 cp->cp_eax = n; 4947 4948 /* 4949 * CPUID function 4 expects %ecx to be initialized 4950 * with an index which indicates which cache to return 4951 * information about. The OS is expected to call function 4 4952 * with %ecx set to 0, 1, 2, ... until it returns with 4953 * EAX[4:0] set to 0, which indicates there are no more 4954 * caches. 4955 * 4956 * Here, populate cpi_std[4] with the information returned by 4957 * function 4 when %ecx == 0, and do the rest in a later pass 4958 * when dynamic memory allocation becomes available. 4959 * 4960 * Note: we need to explicitly initialize %ecx here, since 4961 * function 4 may have been previously invoked. 4962 */ 4963 if (n == 4) 4964 cp->cp_ecx = 0; 4965 4966 (void) __cpuid_insn(cp); 4967 platform_cpuid_mangle(cpi->cpi_vendor, n, cp); 4968 switch (n) { 4969 case 2: 4970 /* 4971 * "the lower 8 bits of the %eax register 4972 * contain a value that identifies the number 4973 * of times the cpuid [instruction] has to be 4974 * executed to obtain a complete image of the 4975 * processor's caching systems." 4976 * 4977 * How *do* they make this stuff up? 4978 */ 4979 cpi->cpi_ncache = sizeof (*cp) * 4980 BITX(cp->cp_eax, 7, 0); 4981 if (cpi->cpi_ncache == 0) 4982 break; 4983 cpi->cpi_ncache--; /* skip count byte */ 4984 4985 /* 4986 * Well, for now, rather than attempt to implement 4987 * this slightly dubious algorithm, we just look 4988 * at the first 15 .. 4989 */ 4990 if (cpi->cpi_ncache > (sizeof (*cp) - 1)) 4991 cpi->cpi_ncache = sizeof (*cp) - 1; 4992 4993 dp = cpi->cpi_cacheinfo; 4994 if (BITX(cp->cp_eax, 31, 31) == 0) { 4995 uint8_t *p = (void *)&cp->cp_eax; 4996 for (i = 1; i < 4; i++) 4997 if (p[i] != 0) 4998 *dp++ = p[i]; 4999 } 5000 if (BITX(cp->cp_ebx, 31, 31) == 0) { 5001 uint8_t *p = (void *)&cp->cp_ebx; 5002 for (i = 0; i < 4; i++) 5003 if (p[i] != 0) 5004 *dp++ = p[i]; 5005 } 5006 if (BITX(cp->cp_ecx, 31, 31) == 0) { 5007 uint8_t *p = (void *)&cp->cp_ecx; 5008 for (i = 0; i < 4; i++) 5009 if (p[i] != 0) 5010 *dp++ = p[i]; 5011 } 5012 if (BITX(cp->cp_edx, 31, 31) == 0) { 5013 uint8_t *p = (void *)&cp->cp_edx; 5014 for (i = 0; i < 4; i++) 5015 if (p[i] != 0) 5016 *dp++ = p[i]; 5017 } 5018 break; 5019 5020 case 3: /* Processor serial number, if PSN supported */ 5021 break; 5022 5023 case 4: /* Deterministic cache parameters */ 5024 break; 5025 5026 case 5: /* Monitor/Mwait parameters */ 5027 { 5028 size_t mwait_size; 5029 5030 /* 5031 * check cpi_mwait.support which was set in 5032 * cpuid_pass_basic() 5033 */ 5034 if (!(cpi->cpi_mwait.support & MWAIT_SUPPORT)) 5035 break; 5036 5037 /* 5038 * Protect ourself from insane mwait line size. 5039 * Workaround for incomplete hardware emulator(s). 
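 *
 * For instance (hypothetical values from a buggy emulator): a reported
 * monitor line size of 0 or of 24 fails the check below, since 0 is
 * smaller than sizeof (uint32_t) and 24 is not a power of two; in that
 * case we leave the monitor/mwait limits unpopulated rather than trust
 * the value.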
5040 */ 5041 mwait_size = (size_t)MWAIT_SIZE_MAX(cpi); 5042 if (mwait_size < sizeof (uint32_t) || 5043 !ISP2(mwait_size)) { 5044 #if DEBUG 5045 cmn_err(CE_NOTE, "Cannot handle cpu %d mwait " 5046 "size %ld", cpu->cpu_id, (long)mwait_size); 5047 #endif 5048 break; 5049 } 5050 5051 cpi->cpi_mwait.mon_min = (size_t)MWAIT_SIZE_MIN(cpi); 5052 cpi->cpi_mwait.mon_max = mwait_size; 5053 if (MWAIT_EXTENSION(cpi)) { 5054 cpi->cpi_mwait.support |= MWAIT_EXTENSIONS; 5055 if (MWAIT_INT_ENABLE(cpi)) 5056 cpi->cpi_mwait.support |= 5057 MWAIT_ECX_INT_ENABLE; 5058 } 5059 break; 5060 } 5061 default: 5062 break; 5063 } 5064 } 5065 5066 /* 5067 * XSAVE enumeration 5068 */ 5069 if (cpi->cpi_maxeax >= 0xD) { 5070 struct cpuid_regs regs; 5071 boolean_t cpuid_d_valid = B_TRUE; 5072 5073 cp = &regs; 5074 cp->cp_eax = 0xD; 5075 cp->cp_edx = cp->cp_ebx = cp->cp_ecx = 0; 5076 5077 (void) __cpuid_insn(cp); 5078 5079 /* 5080 * Sanity checks for debug 5081 */ 5082 if ((cp->cp_eax & XFEATURE_LEGACY_FP) == 0 || 5083 (cp->cp_eax & XFEATURE_SSE) == 0) { 5084 cpuid_d_valid = B_FALSE; 5085 } 5086 5087 cpi->cpi_xsave.xsav_hw_features_low = cp->cp_eax; 5088 cpi->cpi_xsave.xsav_hw_features_high = cp->cp_edx; 5089 cpi->cpi_xsave.xsav_max_size = cp->cp_ecx; 5090 5091 /* 5092 * If the hw supports AVX, get the size and offset in the save 5093 * area for the ymm state. 5094 */ 5095 if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_AVX) { 5096 cp->cp_eax = 0xD; 5097 cp->cp_ecx = 2; 5098 cp->cp_edx = cp->cp_ebx = 0; 5099 5100 (void) __cpuid_insn(cp); 5101 5102 if (cp->cp_ebx != CPUID_LEAFD_2_YMM_OFFSET || 5103 cp->cp_eax != CPUID_LEAFD_2_YMM_SIZE) { 5104 cpuid_d_valid = B_FALSE; 5105 } 5106 5107 cpi->cpi_xsave.ymm_size = cp->cp_eax; 5108 cpi->cpi_xsave.ymm_offset = cp->cp_ebx; 5109 } 5110 5111 /* 5112 * If the hw supports MPX, get the size and offset in the 5113 * save area for BNDREGS and BNDCSR. 5114 */ 5115 if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_MPX) { 5116 cp->cp_eax = 0xD; 5117 cp->cp_ecx = 3; 5118 cp->cp_edx = cp->cp_ebx = 0; 5119 5120 (void) __cpuid_insn(cp); 5121 5122 cpi->cpi_xsave.bndregs_size = cp->cp_eax; 5123 cpi->cpi_xsave.bndregs_offset = cp->cp_ebx; 5124 5125 cp->cp_eax = 0xD; 5126 cp->cp_ecx = 4; 5127 cp->cp_edx = cp->cp_ebx = 0; 5128 5129 (void) __cpuid_insn(cp); 5130 5131 cpi->cpi_xsave.bndcsr_size = cp->cp_eax; 5132 cpi->cpi_xsave.bndcsr_offset = cp->cp_ebx; 5133 } 5134 5135 /* 5136 * If the hw supports AVX512, get the size and offset in the 5137 * save area for the opmask registers and zmm state.
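 *
 * The queries below all follow the same pattern (a sketch; regs is a
 * scratch cpuid_regs just like the cp used in the code that follows):
 * XSAVE state component 'n' reports its save-area size in %eax and its
 * offset in %ebx of leaf 0xD, sub-leaf 'n', so for example the opmask
 * component (sub-leaf 5):
 *
 *	regs.cp_eax = 0xD;
 *	regs.cp_ecx = 5;
 *	regs.cp_ebx = regs.cp_edx = 0;
 *	(void) __cpuid_insn(&regs);
 *	opmask_size = regs.cp_eax;
 *	opmask_offset = regs.cp_ebx;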
5138 */ 5139 if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_AVX512) { 5140 cp->cp_eax = 0xD; 5141 cp->cp_ecx = 5; 5142 cp->cp_edx = cp->cp_ebx = 0; 5143 5144 (void) __cpuid_insn(cp); 5145 5146 cpi->cpi_xsave.opmask_size = cp->cp_eax; 5147 cpi->cpi_xsave.opmask_offset = cp->cp_ebx; 5148 5149 cp->cp_eax = 0xD; 5150 cp->cp_ecx = 6; 5151 cp->cp_edx = cp->cp_ebx = 0; 5152 5153 (void) __cpuid_insn(cp); 5154 5155 cpi->cpi_xsave.zmmlo_size = cp->cp_eax; 5156 cpi->cpi_xsave.zmmlo_offset = cp->cp_ebx; 5157 5158 cp->cp_eax = 0xD; 5159 cp->cp_ecx = 7; 5160 cp->cp_edx = cp->cp_ebx = 0; 5161 5162 (void) __cpuid_insn(cp); 5163 5164 cpi->cpi_xsave.zmmhi_size = cp->cp_eax; 5165 cpi->cpi_xsave.zmmhi_offset = cp->cp_ebx; 5166 } 5167 5168 if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_PKRU) { 5169 cp->cp_eax = 0xD; 5170 cp->cp_ecx = 9; 5171 cp->cp_edx = cp->cp_ebx = 0; 5172 5173 (void) __cpuid_insn(cp); 5174 5175 cpi->cpi_xsave.pkru_size = cp->cp_eax; 5176 cpi->cpi_xsave.pkru_offset = cp->cp_ebx; 5177 } 5178 5179 if (!is_x86_feature(x86_featureset, X86FSET_XSAVE)) { 5180 xsave_state_size = 0; 5181 } else if (cpuid_d_valid) { 5182 xsave_state_size = cpi->cpi_xsave.xsav_max_size; 5183 } else { 5184 /* Broken CPUID 0xD, probably in HVM */ 5185 cmn_err(CE_WARN, "cpu%d: CPUID.0xD returns invalid " 5186 "value: hw_low = %d, hw_high = %d, xsave_size = %d" 5187 ", ymm_size = %d, ymm_offset = %d\n", 5188 cpu->cpu_id, cpi->cpi_xsave.xsav_hw_features_low, 5189 cpi->cpi_xsave.xsav_hw_features_high, 5190 (int)cpi->cpi_xsave.xsav_max_size, 5191 (int)cpi->cpi_xsave.ymm_size, 5192 (int)cpi->cpi_xsave.ymm_offset); 5193 5194 if (xsave_state_size != 0) { 5195 /* 5196 * This must be a non-boot CPU. We cannot 5197 * continue, because boot cpu has already 5198 * enabled XSAVE. 5199 */ 5200 ASSERT(cpu->cpu_id != 0); 5201 cmn_err(CE_PANIC, "cpu%d: we have already " 5202 "enabled XSAVE on boot cpu, cannot " 5203 "continue.", cpu->cpu_id); 5204 } else { 5205 /* 5206 * If we reached here on the boot CPU, it's also 5207 * almost certain that we'll reach here on the 5208 * non-boot CPUs. When we're here on a boot CPU 5209 * we should disable the feature, on a non-boot 5210 * CPU we need to confirm that we have.
5211 */ 5212 if (cpu->cpu_id == 0) { 5213 remove_x86_feature(x86_featureset, 5214 X86FSET_XSAVE); 5215 remove_x86_feature(x86_featureset, 5216 X86FSET_AVX); 5217 remove_x86_feature(x86_featureset, 5218 X86FSET_F16C); 5219 remove_x86_feature(x86_featureset, 5220 X86FSET_BMI1); 5221 remove_x86_feature(x86_featureset, 5222 X86FSET_BMI2); 5223 remove_x86_feature(x86_featureset, 5224 X86FSET_FMA); 5225 remove_x86_feature(x86_featureset, 5226 X86FSET_AVX2); 5227 remove_x86_feature(x86_featureset, 5228 X86FSET_MPX); 5229 remove_x86_feature(x86_featureset, 5230 X86FSET_AVX512F); 5231 remove_x86_feature(x86_featureset, 5232 X86FSET_AVX512DQ); 5233 remove_x86_feature(x86_featureset, 5234 X86FSET_AVX512PF); 5235 remove_x86_feature(x86_featureset, 5236 X86FSET_AVX512ER); 5237 remove_x86_feature(x86_featureset, 5238 X86FSET_AVX512CD); 5239 remove_x86_feature(x86_featureset, 5240 X86FSET_AVX512BW); 5241 remove_x86_feature(x86_featureset, 5242 X86FSET_AVX512VL); 5243 remove_x86_feature(x86_featureset, 5244 X86FSET_AVX512FMA); 5245 remove_x86_feature(x86_featureset, 5246 X86FSET_AVX512VBMI); 5247 remove_x86_feature(x86_featureset, 5248 X86FSET_AVX512VNNI); 5249 remove_x86_feature(x86_featureset, 5250 X86FSET_AVX512VPOPCDQ); 5251 remove_x86_feature(x86_featureset, 5252 X86FSET_AVX512NNIW); 5253 remove_x86_feature(x86_featureset, 5254 X86FSET_AVX512FMAPS); 5255 remove_x86_feature(x86_featureset, 5256 X86FSET_VAES); 5257 remove_x86_feature(x86_featureset, 5258 X86FSET_VPCLMULQDQ); 5259 remove_x86_feature(x86_featureset, 5260 X86FSET_GFNI); 5261 remove_x86_feature(x86_featureset, 5262 X86FSET_AVX512_VP2INT); 5263 remove_x86_feature(x86_featureset, 5264 X86FSET_AVX512_BITALG); 5265 remove_x86_feature(x86_featureset, 5266 X86FSET_AVX512_VBMI2); 5267 remove_x86_feature(x86_featureset, 5268 X86FSET_AVX512_BF16); 5269 5270 xsave_force_disable = B_TRUE; 5271 } else { 5272 VERIFY(is_x86_feature(x86_featureset, 5273 X86FSET_XSAVE) == B_FALSE); 5274 } 5275 } 5276 } 5277 } 5278 5279 5280 if ((cpi->cpi_xmaxeax & CPUID_LEAF_EXT_0) == 0) 5281 return; 5282 5283 if ((nmax = cpi->cpi_xmaxeax - CPUID_LEAF_EXT_0 + 1) > NMAX_CPI_EXTD) 5284 nmax = NMAX_CPI_EXTD; 5285 /* 5286 * Copy the extended properties, fixing them as we go. While we start at 5287 * 2 because we've already handled a few cases in the basic pass, the 5288 * rest we let ourselves just grab again (e.g. 0x8, 0x21). 5289 */ 5290 iptr = (void *)cpi->cpi_brandstr; 5291 for (n = 2, cp = &cpi->cpi_extd[2]; n < nmax; cp++, n++) { 5292 cp->cp_eax = CPUID_LEAF_EXT_0 + n; 5293 (void) __cpuid_insn(cp); 5294 platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_0 + n, 5295 cp); 5296 switch (n) { 5297 case 2: 5298 case 3: 5299 case 4: 5300 /* 5301 * Extract the brand string 5302 */ 5303 *iptr++ = cp->cp_eax; 5304 *iptr++ = cp->cp_ebx; 5305 *iptr++ = cp->cp_ecx; 5306 *iptr++ = cp->cp_edx; 5307 break; 5308 case 5: 5309 switch (cpi->cpi_vendor) { 5310 case X86_VENDOR_AMD: 5311 /* 5312 * The Athlon and Duron were the first 5313 * parts to report the sizes of the 5314 * TLB for large pages. Before then, 5315 * we don't trust the data. 5316 */ 5317 if (cpi->cpi_family < 6 || 5318 (cpi->cpi_family == 6 && 5319 cpi->cpi_model < 1)) 5320 cp->cp_eax = 0; 5321 break; 5322 default: 5323 break; 5324 } 5325 break; 5326 case 6: 5327 switch (cpi->cpi_vendor) { 5328 case X86_VENDOR_AMD: 5329 /* 5330 * The Athlon and Duron were the first 5331 * AMD parts with L2 TLB's. 5332 * Before then, don't trust the data. 
5333 */ 5334 if (cpi->cpi_family < 6 || 5335 (cpi->cpi_family == 6 && 5336 cpi->cpi_model < 1)) 5337 cp->cp_eax = cp->cp_ebx = 0; 5338 /* 5339 * AMD Duron rev A0 reports L2 5340 * cache size incorrectly as 1K 5341 * when it is really 64K 5342 */ 5343 if (cpi->cpi_family == 6 && 5344 cpi->cpi_model == 3 && 5345 cpi->cpi_step == 0) { 5346 cp->cp_ecx &= 0xffff; 5347 cp->cp_ecx |= 0x400000; 5348 } 5349 break; 5350 case X86_VENDOR_Cyrix: /* VIA C3 */ 5351 /* 5352 * VIA C3 processors are a bit messed 5353 * up w.r.t. encoding cache sizes in %ecx 5354 */ 5355 if (cpi->cpi_family != 6) 5356 break; 5357 /* 5358 * model 7 and 8 were incorrectly encoded 5359 * 5360 * xxx is model 8 really broken? 5361 */ 5362 if (cpi->cpi_model == 7 || 5363 cpi->cpi_model == 8) 5364 cp->cp_ecx = 5365 BITX(cp->cp_ecx, 31, 24) << 16 | 5366 BITX(cp->cp_ecx, 23, 16) << 12 | 5367 BITX(cp->cp_ecx, 15, 8) << 8 | 5368 BITX(cp->cp_ecx, 7, 0); 5369 /* 5370 * model 9 stepping 1 has wrong associativity 5371 */ 5372 if (cpi->cpi_model == 9 && cpi->cpi_step == 1) 5373 cp->cp_ecx |= 8 << 12; 5374 break; 5375 case X86_VENDOR_Intel: 5376 /* 5377 * Extended L2 Cache features function. 5378 * First appeared on Prescott. 5379 */ 5380 default: 5381 break; 5382 } 5383 break; 5384 default: 5385 break; 5386 } 5387 } 5388 } 5389 5390 static const char * 5391 intel_cpubrand(const struct cpuid_info *cpi) 5392 { 5393 int i; 5394 5395 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 5396 5397 switch (cpi->cpi_family) { 5398 case 5: 5399 return ("Intel Pentium(r)"); 5400 case 6: 5401 switch (cpi->cpi_model) { 5402 uint_t celeron, xeon; 5403 const struct cpuid_regs *cp; 5404 case 0: 5405 case 1: 5406 case 2: 5407 return ("Intel Pentium(r) Pro"); 5408 case 3: 5409 case 4: 5410 return ("Intel Pentium(r) II"); 5411 case 6: 5412 return ("Intel Celeron(r)"); 5413 case 5: 5414 case 7: 5415 celeron = xeon = 0; 5416 cp = &cpi->cpi_std[2]; /* cache info */ 5417 5418 for (i = 1; i < 4; i++) { 5419 uint_t tmp; 5420 5421 tmp = (cp->cp_eax >> (8 * i)) & 0xff; 5422 if (tmp == 0x40) 5423 celeron++; 5424 if (tmp >= 0x44 && tmp <= 0x45) 5425 xeon++; 5426 } 5427 5428 for (i = 0; i < 2; i++) { 5429 uint_t tmp; 5430 5431 tmp = (cp->cp_ebx >> (8 * i)) & 0xff; 5432 if (tmp == 0x40) 5433 celeron++; 5434 else if (tmp >= 0x44 && tmp <= 0x45) 5435 xeon++; 5436 } 5437 5438 for (i = 0; i < 4; i++) { 5439 uint_t tmp; 5440 5441 tmp = (cp->cp_ecx >> (8 * i)) & 0xff; 5442 if (tmp == 0x40) 5443 celeron++; 5444 else if (tmp >= 0x44 && tmp <= 0x45) 5445 xeon++; 5446 } 5447 5448 for (i = 0; i < 4; i++) { 5449 uint_t tmp; 5450 5451 tmp = (cp->cp_edx >> (8 * i)) & 0xff; 5452 if (tmp == 0x40) 5453 celeron++; 5454 else if (tmp >= 0x44 && tmp <= 0x45) 5455 xeon++; 5456 } 5457 5458 if (celeron) 5459 return ("Intel Celeron(r)"); 5460 if (xeon) 5461 return (cpi->cpi_model == 5 ? 5462 "Intel Pentium(r) II Xeon(tm)" : 5463 "Intel Pentium(r) III Xeon(tm)"); 5464 return (cpi->cpi_model == 5 ? 
5465 "Intel Pentium(r) II or Pentium(r) II Xeon(tm)" : 5466 "Intel Pentium(r) III or Pentium(r) III Xeon(tm)"); 5467 default: 5468 break; 5469 } 5470 default: 5471 break; 5472 } 5473 5474 /* BrandID is present if the field is nonzero */ 5475 if (cpi->cpi_brandid != 0) { 5476 static const struct { 5477 uint_t bt_bid; 5478 const char *bt_str; 5479 } brand_tbl[] = { 5480 { 0x1, "Intel(r) Celeron(r)" }, 5481 { 0x2, "Intel(r) Pentium(r) III" }, 5482 { 0x3, "Intel(r) Pentium(r) III Xeon(tm)" }, 5483 { 0x4, "Intel(r) Pentium(r) III" }, 5484 { 0x6, "Mobile Intel(r) Pentium(r) III" }, 5485 { 0x7, "Mobile Intel(r) Celeron(r)" }, 5486 { 0x8, "Intel(r) Pentium(r) 4" }, 5487 { 0x9, "Intel(r) Pentium(r) 4" }, 5488 { 0xa, "Intel(r) Celeron(r)" }, 5489 { 0xb, "Intel(r) Xeon(tm)" }, 5490 { 0xc, "Intel(r) Xeon(tm) MP" }, 5491 { 0xe, "Mobile Intel(r) Pentium(r) 4" }, 5492 { 0xf, "Mobile Intel(r) Celeron(r)" }, 5493 { 0x11, "Mobile Genuine Intel(r)" }, 5494 { 0x12, "Intel(r) Celeron(r) M" }, 5495 { 0x13, "Mobile Intel(r) Celeron(r)" }, 5496 { 0x14, "Intel(r) Celeron(r)" }, 5497 { 0x15, "Mobile Genuine Intel(r)" }, 5498 { 0x16, "Intel(r) Pentium(r) M" }, 5499 { 0x17, "Mobile Intel(r) Celeron(r)" } 5500 }; 5501 uint_t btblmax = sizeof (brand_tbl) / sizeof (brand_tbl[0]); 5502 uint_t sgn; 5503 5504 sgn = (cpi->cpi_family << 8) | 5505 (cpi->cpi_model << 4) | cpi->cpi_step; 5506 5507 for (i = 0; i < btblmax; i++) 5508 if (brand_tbl[i].bt_bid == cpi->cpi_brandid) 5509 break; 5510 if (i < btblmax) { 5511 if (sgn == 0x6b1 && cpi->cpi_brandid == 3) 5512 return ("Intel(r) Celeron(r)"); 5513 if (sgn < 0xf13 && cpi->cpi_brandid == 0xb) 5514 return ("Intel(r) Xeon(tm) MP"); 5515 if (sgn < 0xf13 && cpi->cpi_brandid == 0xe) 5516 return ("Intel(r) Xeon(tm)"); 5517 return (brand_tbl[i].bt_str); 5518 } 5519 } 5520 5521 return (NULL); 5522 } 5523 5524 static const char * 5525 amd_cpubrand(const struct cpuid_info *cpi) 5526 { 5527 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 5528 5529 switch (cpi->cpi_family) { 5530 case 5: 5531 switch (cpi->cpi_model) { 5532 case 0: 5533 case 1: 5534 case 2: 5535 case 3: 5536 case 4: 5537 case 5: 5538 return ("AMD-K5(r)"); 5539 case 6: 5540 case 7: 5541 return ("AMD-K6(r)"); 5542 case 8: 5543 return ("AMD-K6(r)-2"); 5544 case 9: 5545 return ("AMD-K6(r)-III"); 5546 default: 5547 return ("AMD (family 5)"); 5548 } 5549 case 6: 5550 switch (cpi->cpi_model) { 5551 case 1: 5552 return ("AMD-K7(tm)"); 5553 case 0: 5554 case 2: 5555 case 4: 5556 return ("AMD Athlon(tm)"); 5557 case 3: 5558 case 7: 5559 return ("AMD Duron(tm)"); 5560 case 6: 5561 case 8: 5562 case 10: 5563 /* 5564 * Use the L2 cache size to distinguish 5565 */ 5566 return ((cpi->cpi_extd[6].cp_ecx >> 16) >= 256 ? 
5567 "AMD Athlon(tm)" : "AMD Duron(tm)"); 5568 default: 5569 return ("AMD (family 6)"); 5570 } 5571 default: 5572 break; 5573 } 5574 5575 if (cpi->cpi_family == 0xf && cpi->cpi_model == 5 && 5576 cpi->cpi_brandid != 0) { 5577 switch (BITX(cpi->cpi_brandid, 7, 5)) { 5578 case 3: 5579 return ("AMD Opteron(tm) UP 1xx"); 5580 case 4: 5581 return ("AMD Opteron(tm) DP 2xx"); 5582 case 5: 5583 return ("AMD Opteron(tm) MP 8xx"); 5584 default: 5585 return ("AMD Opteron(tm)"); 5586 } 5587 } 5588 5589 return (NULL); 5590 } 5591 5592 static const char * 5593 cyrix_cpubrand(struct cpuid_info *cpi, uint_t type) 5594 { 5595 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 5596 5597 switch (type) { 5598 case X86_TYPE_CYRIX_6x86: 5599 return ("Cyrix 6x86"); 5600 case X86_TYPE_CYRIX_6x86L: 5601 return ("Cyrix 6x86L"); 5602 case X86_TYPE_CYRIX_6x86MX: 5603 return ("Cyrix 6x86MX"); 5604 case X86_TYPE_CYRIX_GXm: 5605 return ("Cyrix GXm"); 5606 case X86_TYPE_CYRIX_MediaGX: 5607 return ("Cyrix MediaGX"); 5608 case X86_TYPE_CYRIX_MII: 5609 return ("Cyrix M2"); 5610 case X86_TYPE_VIA_CYRIX_III: 5611 return ("VIA Cyrix M3"); 5612 default: 5613 /* 5614 * Have another wild guess .. 5615 */ 5616 if (cpi->cpi_family == 4 && cpi->cpi_model == 9) 5617 return ("Cyrix 5x86"); 5618 else if (cpi->cpi_family == 5) { 5619 switch (cpi->cpi_model) { 5620 case 2: 5621 return ("Cyrix 6x86"); /* Cyrix M1 */ 5622 case 4: 5623 return ("Cyrix MediaGX"); 5624 default: 5625 break; 5626 } 5627 } else if (cpi->cpi_family == 6) { 5628 switch (cpi->cpi_model) { 5629 case 0: 5630 return ("Cyrix 6x86MX"); /* Cyrix M2? */ 5631 case 5: 5632 case 6: 5633 case 7: 5634 case 8: 5635 case 9: 5636 return ("VIA C3"); 5637 default: 5638 break; 5639 } 5640 } 5641 break; 5642 } 5643 return (NULL); 5644 } 5645 5646 /* 5647 * This only gets called in the case that the CPU extended 5648 * feature brand string (0x80000002, 0x80000003, 0x80000004) 5649 * aren't available, or contain null bytes for some reason. 5650 */ 5651 static void 5652 fabricate_brandstr(struct cpuid_info *cpi) 5653 { 5654 const char *brand = NULL; 5655 5656 switch (cpi->cpi_vendor) { 5657 case X86_VENDOR_Intel: 5658 brand = intel_cpubrand(cpi); 5659 break; 5660 case X86_VENDOR_AMD: 5661 brand = amd_cpubrand(cpi); 5662 break; 5663 case X86_VENDOR_Cyrix: 5664 brand = cyrix_cpubrand(cpi, x86_type); 5665 break; 5666 case X86_VENDOR_NexGen: 5667 if (cpi->cpi_family == 5 && cpi->cpi_model == 0) 5668 brand = "NexGen Nx586"; 5669 break; 5670 case X86_VENDOR_Centaur: 5671 if (cpi->cpi_family == 5) 5672 switch (cpi->cpi_model) { 5673 case 4: 5674 brand = "Centaur C6"; 5675 break; 5676 case 8: 5677 brand = "Centaur C2"; 5678 break; 5679 case 9: 5680 brand = "Centaur C3"; 5681 break; 5682 default: 5683 break; 5684 } 5685 break; 5686 case X86_VENDOR_Rise: 5687 if (cpi->cpi_family == 5 && 5688 (cpi->cpi_model == 0 || cpi->cpi_model == 2)) 5689 brand = "Rise mP6"; 5690 break; 5691 case X86_VENDOR_SiS: 5692 if (cpi->cpi_family == 5 && cpi->cpi_model == 0) 5693 brand = "SiS 55x"; 5694 break; 5695 case X86_VENDOR_TM: 5696 if (cpi->cpi_family == 5 && cpi->cpi_model == 4) 5697 brand = "Transmeta Crusoe TM3x00 or TM5x00"; 5698 break; 5699 case X86_VENDOR_NSC: 5700 case X86_VENDOR_UMC: 5701 default: 5702 break; 5703 } 5704 if (brand) { 5705 (void) strcpy((char *)cpi->cpi_brandstr, brand); 5706 return; 5707 } 5708 5709 /* 5710 * If all else fails ... 
5711 */ 5712 (void) snprintf(cpi->cpi_brandstr, sizeof (cpi->cpi_brandstr), 5713 "%s %d.%d.%d", cpi->cpi_vendorstr, cpi->cpi_family, 5714 cpi->cpi_model, cpi->cpi_step); 5715 } 5716 5717 /* 5718 * This routine is called just after kernel memory allocation 5719 * becomes available on cpu0, and as part of mp_startup() on 5720 * the other cpus. 5721 * 5722 * Fixup the brand string, and collect any information from cpuid 5723 * that requires dynamically allocated storage to represent. 5724 */ 5725 5726 static void 5727 cpuid_pass_dynamic(cpu_t *cpu, void *_arg __unused) 5728 { 5729 int i, max, shft, level, size; 5730 struct cpuid_regs regs; 5731 struct cpuid_regs *cp; 5732 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 5733 5734 /* 5735 * Deterministic cache parameters 5736 * 5737 * Intel uses leaf 0x4 for this, while AMD uses leaf 0x8000001d. The 5738 * values that are present are currently defined to be the same. This 5739 * means we can use the same logic to parse it as long as we use the 5740 * appropriate leaf to get the data. If you're updating this, make sure 5741 * you're careful about which vendor supports which aspect. 5742 * 5743 * Take this opportunity to detect the number of threads sharing the 5744 * last level cache, and construct a corresponding cache id. The 5745 * respective cpuid_info members are initialized to the default case of 5746 * "no last level cache sharing". 5747 */ 5748 cpi->cpi_ncpu_shr_last_cache = 1; 5749 cpi->cpi_last_lvl_cacheid = cpu->cpu_id; 5750 5751 if ((cpi->cpi_maxeax >= 4 && cpi->cpi_vendor == X86_VENDOR_Intel) || 5752 ((cpi->cpi_vendor == X86_VENDOR_AMD || 5753 cpi->cpi_vendor == X86_VENDOR_HYGON) && 5754 cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1d && 5755 is_x86_feature(x86_featureset, X86FSET_TOPOEXT))) { 5756 uint32_t leaf; 5757 5758 if (cpi->cpi_vendor == X86_VENDOR_Intel) { 5759 leaf = 4; 5760 } else { 5761 leaf = CPUID_LEAF_EXT_1d; 5762 } 5763 5764 /* 5765 * Find the # of elements (size) returned by the leaf and along 5766 * the way detect last level cache sharing details. 5767 */ 5768 bzero(&regs, sizeof (regs)); 5769 cp = &regs; 5770 for (i = 0, max = 0; i < CPI_FN4_ECX_MAX; i++) { 5771 cp->cp_eax = leaf; 5772 cp->cp_ecx = i; 5773 5774 (void) __cpuid_insn(cp); 5775 5776 if (CPI_CACHE_TYPE(cp) == 0) 5777 break; 5778 level = CPI_CACHE_LVL(cp); 5779 if (level > max) { 5780 max = level; 5781 cpi->cpi_ncpu_shr_last_cache = 5782 CPI_NTHR_SHR_CACHE(cp) + 1; 5783 } 5784 } 5785 cpi->cpi_cache_leaf_size = size = i; 5786 5787 /* 5788 * Allocate the cpi_cache_leaves array. The first element 5789 * references the regs for the corresponding leaf with %ecx set 5790 * to 0. This was gathered in cpuid_pass_extended(). 5791 */ 5792 if (size > 0) { 5793 cpi->cpi_cache_leaves = 5794 kmem_alloc(size * sizeof (cp), KM_SLEEP); 5795 if (cpi->cpi_vendor == X86_VENDOR_Intel) { 5796 cpi->cpi_cache_leaves[0] = &cpi->cpi_std[4]; 5797 } else { 5798 cpi->cpi_cache_leaves[0] = &cpi->cpi_extd[0x1d]; 5799 } 5800 5801 /* 5802 * Allocate storage to hold the additional regs 5803 * for the leaf, %ecx == 1 .. cpi_cache_leaf_size. 5804 * 5805 * The regs for the leaf, %ecx == 0 has already 5806 * been allocated as indicated above. 5807 */ 5808 for (i = 1; i < size; i++) { 5809 cp = cpi->cpi_cache_leaves[i] = 5810 kmem_zalloc(sizeof (regs), KM_SLEEP); 5811 cp->cp_eax = leaf; 5812 cp->cp_ecx = i; 5813 5814 (void) __cpuid_insn(cp); 5815 } 5816 } 5817 /* 5818 * Determine the number of bits needed to represent 5819 * the number of CPUs sharing the last level cache.
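 * For example (hypothetical counts), eight sharing CPUs need three bits
 * and sixteen need four; the loop below computes this by doubling a
 * counter until it reaches the sharer count.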
5820 * 5821 * Shift off that number of bits from the APIC id to 5822 * derive the cache id. 5823 */ 5824 shft = 0; 5825 for (i = 1; i < cpi->cpi_ncpu_shr_last_cache; i <<= 1) 5826 shft++; 5827 cpi->cpi_last_lvl_cacheid = cpi->cpi_apicid >> shft; 5828 } 5829 5830 /* 5831 * Now fixup the brand string 5832 */ 5833 if ((cpi->cpi_xmaxeax & CPUID_LEAF_EXT_0) == 0) { 5834 fabricate_brandstr(cpi); 5835 } else { 5836 5837 /* 5838 * If we successfully extracted a brand string from the cpuid 5839 * instruction, clean it up by removing leading spaces and 5840 * similar junk. 5841 */ 5842 if (cpi->cpi_brandstr[0]) { 5843 size_t maxlen = sizeof (cpi->cpi_brandstr); 5844 char *src, *dst; 5845 5846 dst = src = (char *)cpi->cpi_brandstr; 5847 src[maxlen - 1] = '\0'; 5848 /* 5849 * strip leading spaces 5850 */ 5851 while (*src == ' ') 5852 src++; 5853 /* 5854 * Remove any 'Genuine' or "Authentic" prefixes 5855 */ 5856 if (strncmp(src, "Genuine ", 8) == 0) 5857 src += 8; 5858 if (strncmp(src, "Authentic ", 10) == 0) 5859 src += 10; 5860 5861 /* 5862 * Now do an in-place copy. 5863 * Map (R) to (r) and (TM) to (tm). 5864 * The era of teletypes is long gone, and there's 5865 * -really- no need to shout. 5866 */ 5867 while (*src != '\0') { 5868 if (src[0] == '(') { 5869 if (strncmp(src + 1, "R)", 2) == 0) { 5870 (void) strncpy(dst, "(r)", 3); 5871 src += 3; 5872 dst += 3; 5873 continue; 5874 } 5875 if (strncmp(src + 1, "TM)", 3) == 0) { 5876 (void) strncpy(dst, "(tm)", 4); 5877 src += 4; 5878 dst += 4; 5879 continue; 5880 } 5881 } 5882 *dst++ = *src++; 5883 } 5884 *dst = '\0'; 5885 5886 /* 5887 * Finally, remove any trailing spaces 5888 */ 5889 while (--dst > cpi->cpi_brandstr) 5890 if (*dst == ' ') 5891 *dst = '\0'; 5892 else 5893 break; 5894 } else 5895 fabricate_brandstr(cpi); 5896 } 5897 } 5898 5899 typedef struct { 5900 uint32_t avm_av; 5901 uint32_t avm_feat; 5902 } av_feat_map_t; 5903 5904 /* 5905 * These arrays are used to map features that we should add based on x86 5906 * features that are present. As a large number depend on kernel features, 5907 * rather than rechecking and clearing CPUID everywhere, we simply map these. 5908 * There is an array of these for each hwcap word. Some features aren't tracked 5909 * in the kernel x86 featureset and that's ok. They will not show up in here. 
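 *
 * Each table is walked by a simple loop in cpuid_pass_resolve() below;
 * roughly (a sketch of the first of those loops):
 *
 *	for (i = 0; i < ARRAY_SIZE(x86fset_to_av1); i++) {
 *		if (is_x86_feature(x86_featureset,
 *		    x86fset_to_av1[i].avm_feat))
 *			hwcap_flags |= x86fset_to_av1[i].avm_av;
 *	}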
5910 */ 5911 static const av_feat_map_t x86fset_to_av1[] = { 5912 { AV_386_CX8, X86FSET_CX8 }, 5913 { AV_386_SEP, X86FSET_SEP }, 5914 { AV_386_AMD_SYSC, X86FSET_ASYSC }, 5915 { AV_386_CMOV, X86FSET_CMOV }, 5916 { AV_386_FXSR, X86FSET_SSE }, 5917 { AV_386_SSE, X86FSET_SSE }, 5918 { AV_386_SSE2, X86FSET_SSE2 }, 5919 { AV_386_SSE3, X86FSET_SSE3 }, 5920 { AV_386_CX16, X86FSET_CX16 }, 5921 { AV_386_TSCP, X86FSET_TSCP }, 5922 { AV_386_AMD_SSE4A, X86FSET_SSE4A }, 5923 { AV_386_SSSE3, X86FSET_SSSE3 }, 5924 { AV_386_SSE4_1, X86FSET_SSE4_1 }, 5925 { AV_386_SSE4_2, X86FSET_SSE4_2 }, 5926 { AV_386_AES, X86FSET_AES }, 5927 { AV_386_PCLMULQDQ, X86FSET_PCLMULQDQ }, 5928 { AV_386_XSAVE, X86FSET_XSAVE }, 5929 { AV_386_AVX, X86FSET_AVX }, 5930 { AV_386_VMX, X86FSET_VMX }, 5931 { AV_386_AMD_SVM, X86FSET_SVM } 5932 }; 5933 5934 static const av_feat_map_t x86fset_to_av2[] = { 5935 { AV_386_2_F16C, X86FSET_F16C }, 5936 { AV_386_2_RDRAND, X86FSET_RDRAND }, 5937 { AV_386_2_BMI1, X86FSET_BMI1 }, 5938 { AV_386_2_BMI2, X86FSET_BMI2 }, 5939 { AV_386_2_FMA, X86FSET_FMA }, 5940 { AV_386_2_AVX2, X86FSET_AVX2 }, 5941 { AV_386_2_ADX, X86FSET_ADX }, 5942 { AV_386_2_RDSEED, X86FSET_RDSEED }, 5943 { AV_386_2_AVX512F, X86FSET_AVX512F }, 5944 { AV_386_2_AVX512DQ, X86FSET_AVX512DQ }, 5945 { AV_386_2_AVX512IFMA, X86FSET_AVX512FMA }, 5946 { AV_386_2_AVX512PF, X86FSET_AVX512PF }, 5947 { AV_386_2_AVX512ER, X86FSET_AVX512ER }, 5948 { AV_386_2_AVX512CD, X86FSET_AVX512CD }, 5949 { AV_386_2_AVX512BW, X86FSET_AVX512BW }, 5950 { AV_386_2_AVX512VL, X86FSET_AVX512VL }, 5951 { AV_386_2_AVX512VBMI, X86FSET_AVX512VBMI }, 5952 { AV_386_2_AVX512VPOPCDQ, X86FSET_AVX512VPOPCDQ }, 5953 { AV_386_2_SHA, X86FSET_SHA }, 5954 { AV_386_2_FSGSBASE, X86FSET_FSGSBASE }, 5955 { AV_386_2_CLFLUSHOPT, X86FSET_CLFLUSHOPT }, 5956 { AV_386_2_CLWB, X86FSET_CLWB }, 5957 { AV_386_2_MONITORX, X86FSET_MONITORX }, 5958 { AV_386_2_CLZERO, X86FSET_CLZERO }, 5959 { AV_386_2_AVX512_VNNI, X86FSET_AVX512VNNI }, 5960 { AV_386_2_VPCLMULQDQ, X86FSET_VPCLMULQDQ }, 5961 { AV_386_2_VAES, X86FSET_VAES }, 5962 { AV_386_2_GFNI, X86FSET_GFNI }, 5963 { AV_386_2_AVX512_VP2INT, X86FSET_AVX512_VP2INT }, 5964 { AV_386_2_AVX512_BITALG, X86FSET_AVX512_BITALG } 5965 }; 5966 5967 static const av_feat_map_t x86fset_to_av3[] = { 5968 { AV_386_3_AVX512_VBMI2, X86FSET_AVX512_VBMI2 }, 5969 { AV_386_3_AVX512_BF16, X86FSET_AVX512_BF16 } 5970 }; 5971 5972 /* 5973 * This routine is called out of bind_hwcap() much later in the life 5974 * of the kernel (post_startup()). The job of this routine is to resolve 5975 * the hardware feature support and kernel support for those features into 5976 * what we're actually going to tell applications via the aux vector. 5977 * 5978 * Most of the aux vector is derived from the x86_featureset array vector where 5979 * a given feature indicates that an aux vector should be plumbed through. This 5980 * allows the kernel to use one tracking mechanism for these based on whether or 5981 * not it has the required hardware support (most often xsave). Most newer 5982 * features are added there in case we need them in the kernel. Otherwise, 5983 * features are evaluated based on looking at the cpuid features that remain. If 5984 * you find yourself wanting to clear out cpuid features for some reason, they 5985 * should instead be driven by the feature set so we have a consistent view. 
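 *
 * For context, userland ultimately observes the results of this
 * resolution through the aux vector. A hedged sketch of a consumer,
 * assuming the getisax(2) interface and the AV_386_* definitions from
 * <sys/auxv_386.h> (use_aesni is a hypothetical local):
 *
 *	uint32_t av[1] = { 0 };
 *	(void) getisax(av, 1);
 *	use_aesni = (av[0] & AV_386_AES) != 0;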
5986 */ 5987 5988 static void 5989 cpuid_pass_resolve(cpu_t *cpu, void *arg) 5990 { 5991 uint_t *hwcap_out = (uint_t *)arg; 5992 struct cpuid_info *cpi; 5993 uint_t hwcap_flags = 0, hwcap_flags_2 = 0, hwcap_flags_3 = 0; 5994 5995 cpi = cpu->cpu_m.mcpu_cpi; 5996 5997 for (uint_t i = 0; i < ARRAY_SIZE(x86fset_to_av1); i++) { 5998 if (is_x86_feature(x86_featureset, 5999 x86fset_to_av1[i].avm_feat)) { 6000 hwcap_flags |= x86fset_to_av1[i].avm_av; 6001 } 6002 } 6003 6004 for (uint_t i = 0; i < ARRAY_SIZE(x86fset_to_av2); i++) { 6005 if (is_x86_feature(x86_featureset, 6006 x86fset_to_av2[i].avm_feat)) { 6007 hwcap_flags_2 |= x86fset_to_av2[i].avm_av; 6008 } 6009 } 6010 6011 for (uint_t i = 0; i < ARRAY_SIZE(x86fset_to_av3); i++) { 6012 if (is_x86_feature(x86_featureset, 6013 x86fset_to_av3[i].avm_feat)) { 6014 hwcap_flags_3 |= x86fset_to_av3[i].avm_av; 6015 } 6016 } 6017 6018 /* 6019 * From here on out we're working through features that don't have 6020 * corresponding kernel feature flags for various reasons that are 6021 * mostly just due to the historical implementation. 6022 */ 6023 if (cpi->cpi_maxeax >= 1) { 6024 uint32_t *edx = &cpi->cpi_support[STD_EDX_FEATURES]; 6025 uint32_t *ecx = &cpi->cpi_support[STD_ECX_FEATURES]; 6026 6027 *edx = CPI_FEATURES_EDX(cpi); 6028 *ecx = CPI_FEATURES_ECX(cpi); 6029 6030 /* 6031 * [no explicit support required beyond x87 fp context] 6032 */ 6033 if (!fpu_exists) 6034 *edx &= ~(CPUID_INTC_EDX_FPU | CPUID_INTC_EDX_MMX); 6035 6036 /* 6037 * Now map the supported feature vector to things that we 6038 * think userland will care about. 6039 */ 6040 if (*ecx & CPUID_INTC_ECX_MOVBE) 6041 hwcap_flags |= AV_386_MOVBE; 6042 6043 if (*ecx & CPUID_INTC_ECX_POPCNT) 6044 hwcap_flags |= AV_386_POPCNT; 6045 if (*edx & CPUID_INTC_EDX_FPU) 6046 hwcap_flags |= AV_386_FPU; 6047 if (*edx & CPUID_INTC_EDX_MMX) 6048 hwcap_flags |= AV_386_MMX; 6049 if (*edx & CPUID_INTC_EDX_TSC) 6050 hwcap_flags |= AV_386_TSC; 6051 } 6052 6053 /* 6054 * Check a few miscellaneous features. 6055 */ 6056 if (cpi->cpi_xmaxeax < 0x80000001) 6057 goto resolve_done; 6058 6059 switch (cpi->cpi_vendor) { 6060 uint32_t *edx, *ecx; 6061 6062 case X86_VENDOR_Intel: 6063 /* 6064 * Seems like Intel duplicated what we necessary 6065 * here to make the initial crop of 64-bit OS's work. 6066 * Hopefully, those are the only "extended" bits 6067 * they'll add. 6068 */ 6069 /*FALLTHROUGH*/ 6070 6071 case X86_VENDOR_AMD: 6072 case X86_VENDOR_HYGON: 6073 edx = &cpi->cpi_support[AMD_EDX_FEATURES]; 6074 ecx = &cpi->cpi_support[AMD_ECX_FEATURES]; 6075 6076 *edx = CPI_FEATURES_XTD_EDX(cpi); 6077 *ecx = CPI_FEATURES_XTD_ECX(cpi); 6078 6079 /* 6080 * [no explicit support required beyond 6081 * x87 fp context and exception handlers] 6082 */ 6083 if (!fpu_exists) 6084 *edx &= ~(CPUID_AMD_EDX_MMXamd | 6085 CPUID_AMD_EDX_3DNow | CPUID_AMD_EDX_3DNowx); 6086 6087 /* 6088 * Now map the supported feature vector to 6089 * things that we think userland will care about. 
6090 */ 6091 if (*edx & CPUID_AMD_EDX_MMXamd) 6092 hwcap_flags |= AV_386_AMD_MMX; 6093 if (*edx & CPUID_AMD_EDX_3DNow) 6094 hwcap_flags |= AV_386_AMD_3DNow; 6095 if (*edx & CPUID_AMD_EDX_3DNowx) 6096 hwcap_flags |= AV_386_AMD_3DNowx; 6097 6098 switch (cpi->cpi_vendor) { 6099 case X86_VENDOR_AMD: 6100 case X86_VENDOR_HYGON: 6101 if (*ecx & CPUID_AMD_ECX_AHF64) 6102 hwcap_flags |= AV_386_AHF; 6103 if (*ecx & CPUID_AMD_ECX_LZCNT) 6104 hwcap_flags |= AV_386_AMD_LZCNT; 6105 break; 6106 6107 case X86_VENDOR_Intel: 6108 if (*ecx & CPUID_AMD_ECX_LZCNT) 6109 hwcap_flags |= AV_386_AMD_LZCNT; 6110 /* 6111 * Aarrgh. 6112 * Intel uses a different bit in the same word. 6113 */ 6114 if (*ecx & CPUID_INTC_ECX_AHF64) 6115 hwcap_flags |= AV_386_AHF; 6116 break; 6117 default: 6118 break; 6119 } 6120 break; 6121 6122 default: 6123 break; 6124 } 6125 6126 resolve_done: 6127 if (hwcap_out != NULL) { 6128 hwcap_out[0] = hwcap_flags; 6129 hwcap_out[1] = hwcap_flags_2; 6130 hwcap_out[2] = hwcap_flags_3; 6131 } 6132 } 6133 6134 6135 /* 6136 * Simulate the cpuid instruction using the data we previously 6137 * captured about this CPU. We try our best to return the truth 6138 * about the hardware, independently of kernel support. 6139 */ 6140 uint32_t 6141 cpuid_insn(cpu_t *cpu, struct cpuid_regs *cp) 6142 { 6143 struct cpuid_info *cpi; 6144 struct cpuid_regs *xcp; 6145 6146 if (cpu == NULL) 6147 cpu = CPU; 6148 cpi = cpu->cpu_m.mcpu_cpi; 6149 6150 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC)); 6151 6152 /* 6153 * CPUID data is cached in two separate places: cpi_std for standard 6154 * CPUID leaves , and cpi_extd for extended CPUID leaves. 6155 */ 6156 if (cp->cp_eax <= cpi->cpi_maxeax && cp->cp_eax < NMAX_CPI_STD) { 6157 xcp = &cpi->cpi_std[cp->cp_eax]; 6158 } else if (cp->cp_eax >= CPUID_LEAF_EXT_0 && 6159 cp->cp_eax <= cpi->cpi_xmaxeax && 6160 cp->cp_eax < CPUID_LEAF_EXT_0 + NMAX_CPI_EXTD) { 6161 xcp = &cpi->cpi_extd[cp->cp_eax - CPUID_LEAF_EXT_0]; 6162 } else { 6163 /* 6164 * The caller is asking for data from an input parameter which 6165 * the kernel has not cached. In this case we go fetch from 6166 * the hardware and return the data directly to the user. 6167 */ 6168 return (__cpuid_insn(cp)); 6169 } 6170 6171 cp->cp_eax = xcp->cp_eax; 6172 cp->cp_ebx = xcp->cp_ebx; 6173 cp->cp_ecx = xcp->cp_ecx; 6174 cp->cp_edx = xcp->cp_edx; 6175 return (cp->cp_eax); 6176 } 6177 6178 boolean_t 6179 cpuid_checkpass(const cpu_t *const cpu, const cpuid_pass_t pass) 6180 { 6181 return (cpu != NULL && cpu->cpu_m.mcpu_cpi != NULL && 6182 cpu->cpu_m.mcpu_cpi->cpi_pass >= pass); 6183 } 6184 6185 int 6186 cpuid_getbrandstr(cpu_t *cpu, char *s, size_t n) 6187 { 6188 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC)); 6189 6190 return (snprintf(s, n, "%s", cpu->cpu_m.mcpu_cpi->cpi_brandstr)); 6191 } 6192 6193 int 6194 cpuid_is_cmt(cpu_t *cpu) 6195 { 6196 if (cpu == NULL) 6197 cpu = CPU; 6198 6199 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6200 6201 return (cpu->cpu_m.mcpu_cpi->cpi_chipid >= 0); 6202 } 6203 6204 /* 6205 * AMD and Intel both implement the 64-bit variant of the syscall 6206 * instruction (syscallq), so if there's -any- support for syscall, 6207 * cpuid currently says "yes, we support this". 6208 * 6209 * However, Intel decided to -not- implement the 32-bit variant of the 6210 * syscall instruction, so we provide a predicate to allow our caller 6211 * to test that subtlety here. 
6212 * 6213 * XXPV Currently, 32-bit syscall instructions don't work via the hypervisor, 6214 * even in the case where the hardware would in fact support it. 6215 */ 6216 /*ARGSUSED*/ 6217 int 6218 cpuid_syscall32_insn(cpu_t *cpu) 6219 { 6220 ASSERT(cpuid_checkpass((cpu == NULL ? CPU : cpu), CPUID_PASS_BASIC)); 6221 6222 #if !defined(__xpv) 6223 if (cpu == NULL) 6224 cpu = CPU; 6225 6226 /*CSTYLED*/ 6227 { 6228 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 6229 6230 if ((cpi->cpi_vendor == X86_VENDOR_AMD || 6231 cpi->cpi_vendor == X86_VENDOR_HYGON) && 6232 cpi->cpi_xmaxeax >= 0x80000001 && 6233 (CPI_FEATURES_XTD_EDX(cpi) & CPUID_AMD_EDX_SYSC)) 6234 return (1); 6235 } 6236 #endif 6237 return (0); 6238 } 6239 6240 int 6241 cpuid_getidstr(cpu_t *cpu, char *s, size_t n) 6242 { 6243 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 6244 6245 static const char fmt[] = 6246 "x86 (%s %X family %d model %d step %d clock %d MHz)"; 6247 static const char fmt_ht[] = 6248 "x86 (chipid 0x%x %s %X family %d model %d step %d clock %d MHz)"; 6249 6250 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6251 6252 if (cpuid_is_cmt(cpu)) 6253 return (snprintf(s, n, fmt_ht, cpi->cpi_chipid, 6254 cpi->cpi_vendorstr, cpi->cpi_std[1].cp_eax, 6255 cpi->cpi_family, cpi->cpi_model, 6256 cpi->cpi_step, cpu->cpu_type_info.pi_clock)); 6257 return (snprintf(s, n, fmt, 6258 cpi->cpi_vendorstr, cpi->cpi_std[1].cp_eax, 6259 cpi->cpi_family, cpi->cpi_model, 6260 cpi->cpi_step, cpu->cpu_type_info.pi_clock)); 6261 } 6262 6263 const char * 6264 cpuid_getvendorstr(cpu_t *cpu) 6265 { 6266 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6267 return ((const char *)cpu->cpu_m.mcpu_cpi->cpi_vendorstr); 6268 } 6269 6270 uint_t 6271 cpuid_getvendor(cpu_t *cpu) 6272 { 6273 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6274 return (cpu->cpu_m.mcpu_cpi->cpi_vendor); 6275 } 6276 6277 uint_t 6278 cpuid_getfamily(cpu_t *cpu) 6279 { 6280 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6281 return (cpu->cpu_m.mcpu_cpi->cpi_family); 6282 } 6283 6284 uint_t 6285 cpuid_getmodel(cpu_t *cpu) 6286 { 6287 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6288 return (cpu->cpu_m.mcpu_cpi->cpi_model); 6289 } 6290 6291 uint_t 6292 cpuid_get_ncpu_per_chip(cpu_t *cpu) 6293 { 6294 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6295 return (cpu->cpu_m.mcpu_cpi->cpi_ncpu_per_chip); 6296 } 6297 6298 uint_t 6299 cpuid_get_ncore_per_chip(cpu_t *cpu) 6300 { 6301 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6302 return (cpu->cpu_m.mcpu_cpi->cpi_ncore_per_chip); 6303 } 6304 6305 uint_t 6306 cpuid_get_ncpu_sharing_last_cache(cpu_t *cpu) 6307 { 6308 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_EXTENDED)); 6309 return (cpu->cpu_m.mcpu_cpi->cpi_ncpu_shr_last_cache); 6310 } 6311 6312 id_t 6313 cpuid_get_last_lvl_cacheid(cpu_t *cpu) 6314 { 6315 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_EXTENDED)); 6316 return (cpu->cpu_m.mcpu_cpi->cpi_last_lvl_cacheid); 6317 } 6318 6319 uint_t 6320 cpuid_getstep(cpu_t *cpu) 6321 { 6322 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6323 return (cpu->cpu_m.mcpu_cpi->cpi_step); 6324 } 6325 6326 uint_t 6327 cpuid_getsig(struct cpu *cpu) 6328 { 6329 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6330 return (cpu->cpu_m.mcpu_cpi->cpi_std[1].cp_eax); 6331 } 6332 6333 x86_chiprev_t 6334 cpuid_getchiprev(struct cpu *cpu) 6335 { 6336 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6337 return (cpu->cpu_m.mcpu_cpi->cpi_chiprev); 6338 } 6339 6340 const char * 6341 cpuid_getchiprevstr(struct cpu *cpu) 6342 { 6343 ASSERT(cpuid_checkpass(cpu, 
CPUID_PASS_IDENT)); 6344 return (cpu->cpu_m.mcpu_cpi->cpi_chiprevstr); 6345 } 6346 6347 uint32_t 6348 cpuid_getsockettype(struct cpu *cpu) 6349 { 6350 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6351 return (cpu->cpu_m.mcpu_cpi->cpi_socket); 6352 } 6353 6354 const char * 6355 cpuid_getsocketstr(cpu_t *cpu) 6356 { 6357 static const char *socketstr = NULL; 6358 struct cpuid_info *cpi; 6359 6360 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT)); 6361 cpi = cpu->cpu_m.mcpu_cpi; 6362 6363 /* Assume that socket types are the same across the system */ 6364 if (socketstr == NULL) 6365 socketstr = _cpuid_sktstr(cpi->cpi_vendor, cpi->cpi_family, 6366 cpi->cpi_model, cpi->cpi_step); 6367 6368 6369 return (socketstr); 6370 } 6371 6372 x86_uarchrev_t 6373 cpuid_getuarchrev(cpu_t *cpu) 6374 { 6375 return (cpu->cpu_m.mcpu_cpi->cpi_uarchrev); 6376 } 6377 6378 int 6379 cpuid_get_chipid(cpu_t *cpu) 6380 { 6381 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6382 6383 if (cpuid_is_cmt(cpu)) 6384 return (cpu->cpu_m.mcpu_cpi->cpi_chipid); 6385 return (cpu->cpu_id); 6386 } 6387 6388 id_t 6389 cpuid_get_coreid(cpu_t *cpu) 6390 { 6391 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6392 return (cpu->cpu_m.mcpu_cpi->cpi_coreid); 6393 } 6394 6395 int 6396 cpuid_get_pkgcoreid(cpu_t *cpu) 6397 { 6398 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6399 return (cpu->cpu_m.mcpu_cpi->cpi_pkgcoreid); 6400 } 6401 6402 int 6403 cpuid_get_clogid(cpu_t *cpu) 6404 { 6405 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6406 return (cpu->cpu_m.mcpu_cpi->cpi_clogid); 6407 } 6408 6409 int 6410 cpuid_get_cacheid(cpu_t *cpu) 6411 { 6412 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6413 return (cpu->cpu_m.mcpu_cpi->cpi_last_lvl_cacheid); 6414 } 6415 6416 uint_t 6417 cpuid_get_procnodeid(cpu_t *cpu) 6418 { 6419 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6420 return (cpu->cpu_m.mcpu_cpi->cpi_procnodeid); 6421 } 6422 6423 uint_t 6424 cpuid_get_procnodes_per_pkg(cpu_t *cpu) 6425 { 6426 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6427 return (cpu->cpu_m.mcpu_cpi->cpi_procnodes_per_pkg); 6428 } 6429 6430 uint_t 6431 cpuid_get_compunitid(cpu_t *cpu) 6432 { 6433 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6434 return (cpu->cpu_m.mcpu_cpi->cpi_compunitid); 6435 } 6436 6437 uint_t 6438 cpuid_get_cores_per_compunit(cpu_t *cpu) 6439 { 6440 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6441 return (cpu->cpu_m.mcpu_cpi->cpi_cores_per_compunit); 6442 } 6443 6444 uint32_t 6445 cpuid_get_apicid(cpu_t *cpu) 6446 { 6447 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6448 if (cpu->cpu_m.mcpu_cpi->cpi_maxeax < 1) { 6449 return (UINT32_MAX); 6450 } else { 6451 return (cpu->cpu_m.mcpu_cpi->cpi_apicid); 6452 } 6453 } 6454 6455 void 6456 cpuid_get_addrsize(cpu_t *cpu, uint_t *pabits, uint_t *vabits) 6457 { 6458 struct cpuid_info *cpi; 6459 6460 if (cpu == NULL) 6461 cpu = CPU; 6462 cpi = cpu->cpu_m.mcpu_cpi; 6463 6464 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6465 6466 if (pabits) 6467 *pabits = cpi->cpi_pabits; 6468 if (vabits) 6469 *vabits = cpi->cpi_vabits; 6470 } 6471 6472 /* 6473 * Export information about known offsets to the kernel. We only care about 6474 * things we have actually enabled support for in %xcr0. 
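 *
 * A hedged usage sketch for a state component that is actually enabled
 * in xsave_bv_all (xsave_area and dst are hypothetical pointers supplied
 * by the caller): look up where the AVX ymm state lives within an xsave
 * area so it can be copied out:
 *
 *	size_t ymm_size, ymm_off;
 *	cpuid_get_xsave_info(XFEATURE_AVX, &ymm_size, &ymm_off);
 *	bcopy((char *)xsave_area + ymm_off, dst, ymm_size);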
6475 */ 6476 void 6477 cpuid_get_xsave_info(uint64_t bit, size_t *sizep, size_t *offp) 6478 { 6479 size_t size, off; 6480 6481 VERIFY3U(bit & xsave_bv_all, !=, 0); 6482 6483 if (sizep == NULL) 6484 sizep = &size; 6485 if (offp == NULL) 6486 offp = &off; 6487 6488 switch (bit) { 6489 case XFEATURE_LEGACY_FP: 6490 case XFEATURE_SSE: 6491 *sizep = sizeof (struct fxsave_state); 6492 *offp = 0; 6493 break; 6494 case XFEATURE_AVX: 6495 *sizep = cpuid_info0.cpi_xsave.ymm_size; 6496 *offp = cpuid_info0.cpi_xsave.ymm_offset; 6497 break; 6498 case XFEATURE_AVX512_OPMASK: 6499 *sizep = cpuid_info0.cpi_xsave.opmask_size; 6500 *offp = cpuid_info0.cpi_xsave.opmask_offset; 6501 break; 6502 case XFEATURE_AVX512_ZMM: 6503 *sizep = cpuid_info0.cpi_xsave.zmmlo_size; 6504 *offp = cpuid_info0.cpi_xsave.zmmlo_offset; 6505 break; 6506 case XFEATURE_AVX512_HI_ZMM: 6507 *sizep = cpuid_info0.cpi_xsave.zmmhi_size; 6508 *offp = cpuid_info0.cpi_xsave.zmmhi_offset; 6509 break; 6510 default: 6511 panic("asked for unsupported xsave feature: 0x%lx", bit); 6512 } 6513 } 6514 6515 /* 6516 * Use our supported-features indicators (xsave_bv_all) to return the XSAVE 6517 * size of our supported-features that need saving. Some CPUs' maximum save 6518 * size (stored in cpuid_info0.cpi_xsave.xsav_max_size) includes 6519 * unsupported-by-us features (e.g. Intel AMX) which we MAY be able to safely 6520 * dismiss if the supported XSAVE data's offset + length are before the 6521 * unsupported feature. 6522 */ 6523 size_t 6524 cpuid_get_xsave_size(void) 6525 { 6526 size_t furthest_out = sizeof (struct xsave_state); 6527 uint_t shift = 0; 6528 6529 VERIFY(xsave_bv_all != 0); 6530 6531 for (uint64_t current = xsave_bv_all; current != 0; 6532 current >>= 1, shift++) { 6533 uint64_t testbit = 1UL << shift; 6534 size_t size, offset; 6535 6536 if ((testbit & xsave_bv_all) == 0) 6537 continue; 6538 6539 cpuid_get_xsave_info(testbit, &size, &offset); 6540 furthest_out = MAX(furthest_out, offset + size); 6541 } 6542 6543 return (furthest_out); 6544 } 6545 6546 /* 6547 * Return true if the CPUs on this system require 'pointer clearing' for the 6548 * floating point error pointer exception handling. In the past, this has been 6549 * true for all AMD K7 & K8 CPUs, although newer AMD CPUs have been changed to 6550 * behave the same as Intel. This is checked via the CPUID_AMD_EBX_ERR_PTR_ZERO 6551 * feature bit and is reflected in the cpi_fp_amd_save member. 6552 */ 6553 boolean_t 6554 cpuid_need_fp_excp_handling(void) 6555 { 6556 return (cpuid_info0.cpi_vendor == X86_VENDOR_AMD && 6557 cpuid_info0.cpi_fp_amd_save != 0); 6558 } 6559 6560 /* 6561 * Returns the number of data TLB entries for a corresponding 6562 * pagesize. If it can't be computed, or isn't known, the 6563 * routine returns zero. If you ask about an architecturally 6564 * impossible pagesize, the routine will panic (so that the 6565 * hat implementor knows that things are inconsistent.) 6566 */ 6567 uint_t 6568 cpuid_get_dtlb_nent(cpu_t *cpu, size_t pagesize) 6569 { 6570 struct cpuid_info *cpi; 6571 uint_t dtlb_nent = 0; 6572 6573 if (cpu == NULL) 6574 cpu = CPU; 6575 cpi = cpu->cpu_m.mcpu_cpi; 6576 6577 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 6578 6579 /* 6580 * Check the L2 TLB info 6581 */ 6582 if (cpi->cpi_xmaxeax >= 0x80000006) { 6583 struct cpuid_regs *cp = &cpi->cpi_extd[6]; 6584 6585 switch (pagesize) { 6586 6587 case 4 * 1024: 6588 /* 6589 * All zero in the top 16 bits of the register 6590 * indicates a unified TLB. Size is in low 16 bits. 
6591 */ 6592 if ((cp->cp_ebx & 0xffff0000) == 0) 6593 dtlb_nent = cp->cp_ebx & 0x0000ffff; 6594 else 6595 dtlb_nent = BITX(cp->cp_ebx, 27, 16); 6596 break; 6597 6598 case 2 * 1024 * 1024: 6599 if ((cp->cp_eax & 0xffff0000) == 0) 6600 dtlb_nent = cp->cp_eax & 0x0000ffff; 6601 else 6602 dtlb_nent = BITX(cp->cp_eax, 27, 16); 6603 break; 6604 6605 default: 6606 panic("unknown L2 pagesize"); 6607 /*NOTREACHED*/ 6608 } 6609 } 6610 6611 if (dtlb_nent != 0) 6612 return (dtlb_nent); 6613 6614 /* 6615 * No L2 TLB support for this size, try L1. 6616 */ 6617 if (cpi->cpi_xmaxeax >= 0x80000005) { 6618 struct cpuid_regs *cp = &cpi->cpi_extd[5]; 6619 6620 switch (pagesize) { 6621 case 4 * 1024: 6622 dtlb_nent = BITX(cp->cp_ebx, 23, 16); 6623 break; 6624 case 2 * 1024 * 1024: 6625 dtlb_nent = BITX(cp->cp_eax, 23, 16); 6626 break; 6627 default: 6628 panic("unknown L1 d-TLB pagesize"); 6629 /*NOTREACHED*/ 6630 } 6631 } 6632 6633 return (dtlb_nent); 6634 } 6635 6636 /* 6637 * Return 0 if the erratum is not present or not applicable, positive 6638 * if it is, and negative if the status of the erratum is unknown. 6639 * 6640 * See "Revision Guide for AMD Athlon(tm) 64 and AMD Opteron(tm) 6641 * Processors" #25759, Rev 3.57, August 2005 6642 */ 6643 int 6644 cpuid_opteron_erratum(cpu_t *cpu, uint_t erratum) 6645 { 6646 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 6647 uint_t eax; 6648 6649 /* 6650 * Bail out if this CPU isn't an AMD CPU, or if it's 6651 * a legacy (32-bit) AMD CPU. 6652 */ 6653 if (cpi->cpi_vendor != X86_VENDOR_AMD || 6654 cpi->cpi_family == 4 || cpi->cpi_family == 5 || 6655 cpi->cpi_family == 6) { 6656 return (0); 6657 } 6658 6659 eax = cpi->cpi_std[1].cp_eax; 6660 6661 #define SH_B0(eax) (eax == 0xf40 || eax == 0xf50) 6662 #define SH_B3(eax) (eax == 0xf51) 6663 #define B(eax) (SH_B0(eax) || SH_B3(eax)) 6664 6665 #define SH_C0(eax) (eax == 0xf48 || eax == 0xf58) 6666 6667 #define SH_CG(eax) (eax == 0xf4a || eax == 0xf5a || eax == 0xf7a) 6668 #define DH_CG(eax) (eax == 0xfc0 || eax == 0xfe0 || eax == 0xff0) 6669 #define CH_CG(eax) (eax == 0xf82 || eax == 0xfb2) 6670 #define CG(eax) (SH_CG(eax) || DH_CG(eax) || CH_CG(eax)) 6671 6672 #define SH_D0(eax) (eax == 0x10f40 || eax == 0x10f50 || eax == 0x10f70) 6673 #define DH_D0(eax) (eax == 0x10fc0 || eax == 0x10ff0) 6674 #define CH_D0(eax) (eax == 0x10f80 || eax == 0x10fb0) 6675 #define D0(eax) (SH_D0(eax) || DH_D0(eax) || CH_D0(eax)) 6676 6677 #define SH_E0(eax) (eax == 0x20f50 || eax == 0x20f40 || eax == 0x20f70) 6678 #define JH_E1(eax) (eax == 0x20f10) /* JH8_E0 had 0x20f30 */ 6679 #define DH_E3(eax) (eax == 0x20fc0 || eax == 0x20ff0) 6680 #define SH_E4(eax) (eax == 0x20f51 || eax == 0x20f71) 6681 #define BH_E4(eax) (eax == 0x20fb1) 6682 #define SH_E5(eax) (eax == 0x20f42) 6683 #define DH_E6(eax) (eax == 0x20ff2 || eax == 0x20fc2) 6684 #define JH_E6(eax) (eax == 0x20f12 || eax == 0x20f32) 6685 #define EX(eax) (SH_E0(eax) || JH_E1(eax) || DH_E3(eax) || \ 6686 SH_E4(eax) || BH_E4(eax) || SH_E5(eax) || \ 6687 DH_E6(eax) || JH_E6(eax)) 6688 6689 #define DR_AX(eax) (eax == 0x100f00 || eax == 0x100f01 || eax == 0x100f02) 6690 #define DR_B0(eax) (eax == 0x100f20) 6691 #define DR_B1(eax) (eax == 0x100f21) 6692 #define DR_BA(eax) (eax == 0x100f2a) 6693 #define DR_B2(eax) (eax == 0x100f22) 6694 #define DR_B3(eax) (eax == 0x100f23) 6695 #define RB_C0(eax) (eax == 0x100f40) 6696 6697 switch (erratum) { 6698 case 1: 6699 return (cpi->cpi_family < 0x10); 6700 case 51: /* what does the asterisk mean? 
*/ 6701 return (B(eax) || SH_C0(eax) || CG(eax)); 6702 case 52: 6703 return (B(eax)); 6704 case 57: 6705 return (cpi->cpi_family <= 0x11); 6706 case 58: 6707 return (B(eax)); 6708 case 60: 6709 return (cpi->cpi_family <= 0x11); 6710 case 61: 6711 case 62: 6712 case 63: 6713 case 64: 6714 case 65: 6715 case 66: 6716 case 68: 6717 case 69: 6718 case 70: 6719 case 71: 6720 return (B(eax)); 6721 case 72: 6722 return (SH_B0(eax)); 6723 case 74: 6724 return (B(eax)); 6725 case 75: 6726 return (cpi->cpi_family < 0x10); 6727 case 76: 6728 return (B(eax)); 6729 case 77: 6730 return (cpi->cpi_family <= 0x11); 6731 case 78: 6732 return (B(eax) || SH_C0(eax)); 6733 case 79: 6734 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax) || EX(eax)); 6735 case 80: 6736 case 81: 6737 case 82: 6738 return (B(eax)); 6739 case 83: 6740 return (B(eax) || SH_C0(eax) || CG(eax)); 6741 case 85: 6742 return (cpi->cpi_family < 0x10); 6743 case 86: 6744 return (SH_C0(eax) || CG(eax)); 6745 case 88: 6746 return (B(eax) || SH_C0(eax)); 6747 case 89: 6748 return (cpi->cpi_family < 0x10); 6749 case 90: 6750 return (B(eax) || SH_C0(eax) || CG(eax)); 6751 case 91: 6752 case 92: 6753 return (B(eax) || SH_C0(eax)); 6754 case 93: 6755 return (SH_C0(eax)); 6756 case 94: 6757 return (B(eax) || SH_C0(eax) || CG(eax)); 6758 case 95: 6759 return (B(eax) || SH_C0(eax)); 6760 case 96: 6761 return (B(eax) || SH_C0(eax) || CG(eax)); 6762 case 97: 6763 case 98: 6764 return (SH_C0(eax) || CG(eax)); 6765 case 99: 6766 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax)); 6767 case 100: 6768 return (B(eax) || SH_C0(eax)); 6769 case 101: 6770 case 103: 6771 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax)); 6772 case 104: 6773 return (SH_C0(eax) || CG(eax) || D0(eax)); 6774 case 105: 6775 case 106: 6776 case 107: 6777 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax)); 6778 case 108: 6779 return (DH_CG(eax)); 6780 case 109: 6781 return (SH_C0(eax) || CG(eax) || D0(eax)); 6782 case 110: 6783 return (D0(eax) || EX(eax)); 6784 case 111: 6785 return (CG(eax)); 6786 case 112: 6787 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax) || EX(eax)); 6788 case 113: 6789 return (eax == 0x20fc0); 6790 case 114: 6791 return (SH_E0(eax) || JH_E1(eax) || DH_E3(eax)); 6792 case 115: 6793 return (SH_E0(eax) || JH_E1(eax)); 6794 case 116: 6795 return (SH_E0(eax) || JH_E1(eax) || DH_E3(eax)); 6796 case 117: 6797 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax)); 6798 case 118: 6799 return (SH_E0(eax) || JH_E1(eax) || SH_E4(eax) || BH_E4(eax) || 6800 JH_E6(eax)); 6801 case 121: 6802 return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax) || EX(eax)); 6803 case 122: 6804 return (cpi->cpi_family < 0x10 || cpi->cpi_family == 0x11); 6805 case 123: 6806 return (JH_E1(eax) || BH_E4(eax) || JH_E6(eax)); 6807 case 131: 6808 return (cpi->cpi_family < 0x10); 6809 case 6336786: 6810 6811 /* 6812 * Test for AdvPowerMgmtInfo.TscPStateInvariant 6813 * if this is a K8 family or newer processor. We're testing for 6814 * this 'erratum' to determine whether or not we have a constant 6815 * TSC. 6816 * 6817 * Our current fix for this is to disable the C1-Clock ramping. 6818 * However, this doesn't work on newer processor families nor 6819 * does it work when virtualized as those devices don't exist. 
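 * (In extended leaf 0x80000007, %edx bit 8 is the TSC invariance indication;
 * that is the 0x100 bit tested below.)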
*/ 6821 if (cpi->cpi_family >= 0x12 || get_hwenv() != HW_NATIVE) { 6822 return (0); 6823 } 6824 6825 if (CPI_FAMILY(cpi) == 0xf) { 6826 struct cpuid_regs regs; 6827 regs.cp_eax = 0x80000007; 6828 (void) __cpuid_insn(&regs); 6829 return (!(regs.cp_edx & 0x100)); 6830 } 6831 return (0); 6832 case 147: 6833 /* 6834 * This erratum (K8 #147) is not present on family 10 and newer. 6835 */ 6836 if (cpi->cpi_family >= 0x10) { 6837 return (0); 6838 } 6839 return (((((eax >> 12) & 0xff00) + (eax & 0xf00)) | 6840 (((eax >> 4) & 0xf) | ((eax >> 12) & 0xf0))) < 0xf40); 6841 6842 case 6671130: 6843 /* 6844 * check for processors (pre-Shanghai) that do not provide 6845 * optimal management of 1gb ptes in its tlb. 6846 */ 6847 return (cpi->cpi_family == 0x10 && cpi->cpi_model < 4); 6848 6849 case 298: 6850 return (DR_AX(eax) || DR_B0(eax) || DR_B1(eax) || DR_BA(eax) || 6851 DR_B2(eax) || RB_C0(eax)); 6852 6853 case 721: 6854 return (cpi->cpi_family == 0x10 || cpi->cpi_family == 0x12); 6855 6856 default: 6857 return (-1); 6858 6859 } 6860 } 6861 6862 /* 6863 * Determine if specified erratum is present via OSVW (OS Visible Workaround). 6864 * Return 1 if erratum is present, 0 if not present and -1 if indeterminate. 6865 */ 6866 int 6867 osvw_opteron_erratum(cpu_t *cpu, uint_t erratum) 6868 { 6869 struct cpuid_info *cpi; 6870 uint_t osvwid; 6871 static int osvwfeature = -1; 6872 uint64_t osvwlength; 6873 6874 6875 cpi = cpu->cpu_m.mcpu_cpi; 6876 6877 /* confirm OSVW supported */ 6878 if (osvwfeature == -1) { 6879 osvwfeature = cpi->cpi_extd[1].cp_ecx & CPUID_AMD_ECX_OSVW; 6880 } else { 6881 /* assert that osvw feature setting is consistent on all cpus */ 6882 ASSERT(osvwfeature == 6883 (cpi->cpi_extd[1].cp_ecx & CPUID_AMD_ECX_OSVW)); 6884 } 6885 if (!osvwfeature) 6886 return (-1); 6887 6888 osvwlength = rdmsr(MSR_AMD_OSVW_ID_LEN) & OSVW_ID_LEN_MASK; 6889 6890 switch (erratum) { 6891 case 298: /* osvwid is 0 */ 6892 osvwid = 0; 6893 if (osvwlength <= (uint64_t)osvwid) { 6894 /* osvwid 0 is unknown */ 6895 return (-1); 6896 } 6897 6898 /* 6899 * Check the OSVW STATUS MSR to determine the state 6900 * of the erratum where: 6901 * 0 - fixed by HW 6902 * 1 - BIOS has applied the workaround when BIOS 6903 * workaround is available. (Or for other errata, 6904 * OS workaround is required.) 6905 * For a value of 1, caller will confirm that the 6906 * erratum 298 workaround has indeed been applied by BIOS. 6907 * 6908 * A 1 may be set in cpus that have a HW fix 6909 * in a mixed cpu system. Regarding erratum 298: 6910 * In a multiprocessor platform, the workaround above 6911 * should be applied to all processors regardless of 6912 * silicon revision when an affected processor is 6913 * present. 6914 */ 6915 6916 return (rdmsr(MSR_AMD_OSVW_STATUS + 6917 (osvwid / OSVW_ID_CNT_PER_MSR)) & 6918 (1ULL << (osvwid % OSVW_ID_CNT_PER_MSR))); 6919 6920 default: 6921 return (-1); 6922 } 6923 } 6924 6925 static const char assoc_str[] = "associativity"; 6926 static const char line_str[] = "line-size"; 6927 static const char size_str[] = "size"; 6928 6929 static void 6930 add_cache_prop(dev_info_t *devi, const char *label, const char *type, 6931 uint32_t val) 6932 { 6933 char buf[128]; 6934 6935 /* 6936 * ndi_prop_update_int() is used because it is desirable for 6937 * DDI_PROP_HW_DEF and DDI_PROP_DONTSLEEP to be set.
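 *
 * The resulting property name is simply "<label>-<type>". Purely as an
 * illustration (this exact call does not appear in this file), a call such
 * as
 *
 *	add_cache_prop(devi, l2_cache_str, size_str, 512 * 1024);
 *
 * would create an integer property named "l2-cache-size" with the value
 * 524288 on the devinfo node.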
6938 */ 6939 if (snprintf(buf, sizeof (buf), "%s-%s", label, type) < sizeof (buf)) 6940 (void) ndi_prop_update_int(DDI_DEV_T_NONE, devi, buf, val); 6941 } 6942 6943 /* 6944 * Intel-style cache/tlb description 6945 * 6946 * Standard cpuid level 2 gives a randomly ordered 6947 * selection of tags that index into a table that describes 6948 * cache and tlb properties. 6949 */ 6950 6951 static const char l1_icache_str[] = "l1-icache"; 6952 static const char l1_dcache_str[] = "l1-dcache"; 6953 static const char l2_cache_str[] = "l2-cache"; 6954 static const char l3_cache_str[] = "l3-cache"; 6955 static const char itlb4k_str[] = "itlb-4K"; 6956 static const char dtlb4k_str[] = "dtlb-4K"; 6957 static const char itlb2M_str[] = "itlb-2M"; 6958 static const char itlb4M_str[] = "itlb-4M"; 6959 static const char dtlb4M_str[] = "dtlb-4M"; 6960 static const char dtlb24_str[] = "dtlb0-2M-4M"; 6961 static const char itlb424_str[] = "itlb-4K-2M-4M"; 6962 static const char itlb24_str[] = "itlb-2M-4M"; 6963 static const char dtlb44_str[] = "dtlb-4K-4M"; 6964 static const char sl1_dcache_str[] = "sectored-l1-dcache"; 6965 static const char sl2_cache_str[] = "sectored-l2-cache"; 6966 static const char itrace_str[] = "itrace-cache"; 6967 static const char sl3_cache_str[] = "sectored-l3-cache"; 6968 static const char sh_l2_tlb4k_str[] = "shared-l2-tlb-4k"; 6969 6970 static const struct cachetab { 6971 uint8_t ct_code; 6972 uint8_t ct_assoc; 6973 uint16_t ct_line_size; 6974 size_t ct_size; 6975 const char *ct_label; 6976 } intel_ctab[] = { 6977 /* 6978 * maintain descending order! 6979 * 6980 * Codes ignored - Reason 6981 * ---------------------- 6982 * 40H - intel_cpuid_4_cache_info() disambiguates l2/l3 cache 6983 * f0H/f1H - Currently we do not interpret prefetch size by design 6984 */ 6985 { 0xe4, 16, 64, 8*1024*1024, l3_cache_str}, 6986 { 0xe3, 16, 64, 4*1024*1024, l3_cache_str}, 6987 { 0xe2, 16, 64, 2*1024*1024, l3_cache_str}, 6988 { 0xde, 12, 64, 6*1024*1024, l3_cache_str}, 6989 { 0xdd, 12, 64, 3*1024*1024, l3_cache_str}, 6990 { 0xdc, 12, 64, ((1*1024*1024)+(512*1024)), l3_cache_str}, 6991 { 0xd8, 8, 64, 4*1024*1024, l3_cache_str}, 6992 { 0xd7, 8, 64, 2*1024*1024, l3_cache_str}, 6993 { 0xd6, 8, 64, 1*1024*1024, l3_cache_str}, 6994 { 0xd2, 4, 64, 2*1024*1024, l3_cache_str}, 6995 { 0xd1, 4, 64, 1*1024*1024, l3_cache_str}, 6996 { 0xd0, 4, 64, 512*1024, l3_cache_str}, 6997 { 0xca, 4, 0, 512, sh_l2_tlb4k_str}, 6998 { 0xc0, 4, 0, 8, dtlb44_str }, 6999 { 0xba, 4, 0, 64, dtlb4k_str }, 7000 { 0xb4, 4, 0, 256, dtlb4k_str }, 7001 { 0xb3, 4, 0, 128, dtlb4k_str }, 7002 { 0xb2, 4, 0, 64, itlb4k_str }, 7003 { 0xb0, 4, 0, 128, itlb4k_str }, 7004 { 0x87, 8, 64, 1024*1024, l2_cache_str}, 7005 { 0x86, 4, 64, 512*1024, l2_cache_str}, 7006 { 0x85, 8, 32, 2*1024*1024, l2_cache_str}, 7007 { 0x84, 8, 32, 1024*1024, l2_cache_str}, 7008 { 0x83, 8, 32, 512*1024, l2_cache_str}, 7009 { 0x82, 8, 32, 256*1024, l2_cache_str}, 7010 { 0x80, 8, 64, 512*1024, l2_cache_str}, 7011 { 0x7f, 2, 64, 512*1024, l2_cache_str}, 7012 { 0x7d, 8, 64, 2*1024*1024, sl2_cache_str}, 7013 { 0x7c, 8, 64, 1024*1024, sl2_cache_str}, 7014 { 0x7b, 8, 64, 512*1024, sl2_cache_str}, 7015 { 0x7a, 8, 64, 256*1024, sl2_cache_str}, 7016 { 0x79, 8, 64, 128*1024, sl2_cache_str}, 7017 { 0x78, 8, 64, 1024*1024, l2_cache_str}, 7018 { 0x73, 8, 0, 64*1024, itrace_str}, 7019 { 0x72, 8, 0, 32*1024, itrace_str}, 7020 { 0x71, 8, 0, 16*1024, itrace_str}, 7021 { 0x70, 8, 0, 12*1024, itrace_str}, 7022 { 0x68, 4, 64, 32*1024, sl1_dcache_str}, 7023 { 0x67, 4, 64, 16*1024, 
sl1_dcache_str}, 7024 { 0x66, 4, 64, 8*1024, sl1_dcache_str}, 7025 { 0x60, 8, 64, 16*1024, sl1_dcache_str}, 7026 { 0x5d, 0, 0, 256, dtlb44_str}, 7027 { 0x5c, 0, 0, 128, dtlb44_str}, 7028 { 0x5b, 0, 0, 64, dtlb44_str}, 7029 { 0x5a, 4, 0, 32, dtlb24_str}, 7030 { 0x59, 0, 0, 16, dtlb4k_str}, 7031 { 0x57, 4, 0, 16, dtlb4k_str}, 7032 { 0x56, 4, 0, 16, dtlb4M_str}, 7033 { 0x55, 0, 0, 7, itlb24_str}, 7034 { 0x52, 0, 0, 256, itlb424_str}, 7035 { 0x51, 0, 0, 128, itlb424_str}, 7036 { 0x50, 0, 0, 64, itlb424_str}, 7037 { 0x4f, 0, 0, 32, itlb4k_str}, 7038 { 0x4e, 24, 64, 6*1024*1024, l2_cache_str}, 7039 { 0x4d, 16, 64, 16*1024*1024, l3_cache_str}, 7040 { 0x4c, 12, 64, 12*1024*1024, l3_cache_str}, 7041 { 0x4b, 16, 64, 8*1024*1024, l3_cache_str}, 7042 { 0x4a, 12, 64, 6*1024*1024, l3_cache_str}, 7043 { 0x49, 16, 64, 4*1024*1024, l3_cache_str}, 7044 { 0x48, 12, 64, 3*1024*1024, l2_cache_str}, 7045 { 0x47, 8, 64, 8*1024*1024, l3_cache_str}, 7046 { 0x46, 4, 64, 4*1024*1024, l3_cache_str}, 7047 { 0x45, 4, 32, 2*1024*1024, l2_cache_str}, 7048 { 0x44, 4, 32, 1024*1024, l2_cache_str}, 7049 { 0x43, 4, 32, 512*1024, l2_cache_str}, 7050 { 0x42, 4, 32, 256*1024, l2_cache_str}, 7051 { 0x41, 4, 32, 128*1024, l2_cache_str}, 7052 { 0x3e, 4, 64, 512*1024, sl2_cache_str}, 7053 { 0x3d, 6, 64, 384*1024, sl2_cache_str}, 7054 { 0x3c, 4, 64, 256*1024, sl2_cache_str}, 7055 { 0x3b, 2, 64, 128*1024, sl2_cache_str}, 7056 { 0x3a, 6, 64, 192*1024, sl2_cache_str}, 7057 { 0x39, 4, 64, 128*1024, sl2_cache_str}, 7058 { 0x30, 8, 64, 32*1024, l1_icache_str}, 7059 { 0x2c, 8, 64, 32*1024, l1_dcache_str}, 7060 { 0x29, 8, 64, 4096*1024, sl3_cache_str}, 7061 { 0x25, 8, 64, 2048*1024, sl3_cache_str}, 7062 { 0x23, 8, 64, 1024*1024, sl3_cache_str}, 7063 { 0x22, 4, 64, 512*1024, sl3_cache_str}, 7064 { 0x0e, 6, 64, 24*1024, l1_dcache_str}, 7065 { 0x0d, 4, 32, 16*1024, l1_dcache_str}, 7066 { 0x0c, 4, 32, 16*1024, l1_dcache_str}, 7067 { 0x0b, 4, 0, 4, itlb4M_str}, 7068 { 0x0a, 2, 32, 8*1024, l1_dcache_str}, 7069 { 0x08, 4, 32, 16*1024, l1_icache_str}, 7070 { 0x06, 4, 32, 8*1024, l1_icache_str}, 7071 { 0x05, 4, 0, 32, dtlb4M_str}, 7072 { 0x04, 4, 0, 8, dtlb4M_str}, 7073 { 0x03, 4, 0, 64, dtlb4k_str}, 7074 { 0x02, 4, 0, 2, itlb4M_str}, 7075 { 0x01, 4, 0, 32, itlb4k_str}, 7076 { 0 } 7077 }; 7078 7079 static const struct cachetab cyrix_ctab[] = { 7080 { 0x70, 4, 0, 32, "tlb-4K" }, 7081 { 0x80, 4, 16, 16*1024, "l1-cache" }, 7082 { 0 } 7083 }; 7084 7085 /* 7086 * Search a cache table for a matching entry 7087 */ 7088 static const struct cachetab * 7089 find_cacheent(const struct cachetab *ct, uint_t code) 7090 { 7091 if (code != 0) { 7092 for (; ct->ct_code != 0; ct++) 7093 if (ct->ct_code <= code) 7094 break; 7095 if (ct->ct_code == code) 7096 return (ct); 7097 } 7098 return (NULL); 7099 } 7100 7101 /* 7102 * Populate cachetab entry with L2 or L3 cache-information using 7103 * cpuid function 4. This function is called from intel_walk_cacheinfo() 7104 * when descriptor 0x49 is encountered. It returns 0 if no such cache 7105 * information is found. 
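 *
 * The size computed below follows the usual leaf 4 formula:
 *
 *	size = ways * partitions * line size * sets
 *
 * so, purely as a worked example, a leaf reporting 16 ways, 1 partition,
 * 64-byte lines, and 8192 sets describes a 16 * 1 * 64 * 8192 = 8388608
 * byte (8 MB) cache.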
7106 */ 7107 static int 7108 intel_cpuid_4_cache_info(struct cachetab *ct, struct cpuid_info *cpi) 7109 { 7110 uint32_t level, i; 7111 int ret = 0; 7112 7113 for (i = 0; i < cpi->cpi_cache_leaf_size; i++) { 7114 level = CPI_CACHE_LVL(cpi->cpi_cache_leaves[i]); 7115 7116 if (level == 2 || level == 3) { 7117 ct->ct_assoc = 7118 CPI_CACHE_WAYS(cpi->cpi_cache_leaves[i]) + 1; 7119 ct->ct_line_size = 7120 CPI_CACHE_COH_LN_SZ(cpi->cpi_cache_leaves[i]) + 1; 7121 ct->ct_size = ct->ct_assoc * 7122 (CPI_CACHE_PARTS(cpi->cpi_cache_leaves[i]) + 1) * 7123 ct->ct_line_size * 7124 (cpi->cpi_cache_leaves[i]->cp_ecx + 1); 7125 7126 if (level == 2) { 7127 ct->ct_label = l2_cache_str; 7128 } else if (level == 3) { 7129 ct->ct_label = l3_cache_str; 7130 } 7131 ret = 1; 7132 } 7133 } 7134 7135 return (ret); 7136 } 7137 7138 /* 7139 * Walk the cacheinfo descriptor, applying 'func' to every valid element 7140 * The walk is terminated if the walker returns non-zero. 7141 */ 7142 static void 7143 intel_walk_cacheinfo(struct cpuid_info *cpi, 7144 void *arg, int (*func)(void *, const struct cachetab *)) 7145 { 7146 const struct cachetab *ct; 7147 struct cachetab des_49_ct, des_b1_ct; 7148 uint8_t *dp; 7149 int i; 7150 7151 if ((dp = cpi->cpi_cacheinfo) == NULL) 7152 return; 7153 for (i = 0; i < cpi->cpi_ncache; i++, dp++) { 7154 /* 7155 * For overloaded descriptor 0x49 we use cpuid function 4 7156 * if supported by the current processor, to create 7157 * cache information. 7158 * For overloaded descriptor 0xb1 we use X86_PAE flag 7159 * to disambiguate the cache information. 7160 */ 7161 if (*dp == 0x49 && cpi->cpi_maxeax >= 0x4 && 7162 intel_cpuid_4_cache_info(&des_49_ct, cpi) == 1) { 7163 ct = &des_49_ct; 7164 } else if (*dp == 0xb1) { 7165 des_b1_ct.ct_code = 0xb1; 7166 des_b1_ct.ct_assoc = 4; 7167 des_b1_ct.ct_line_size = 0; 7168 if (is_x86_feature(x86_featureset, X86FSET_PAE)) { 7169 des_b1_ct.ct_size = 8; 7170 des_b1_ct.ct_label = itlb2M_str; 7171 } else { 7172 des_b1_ct.ct_size = 4; 7173 des_b1_ct.ct_label = itlb4M_str; 7174 } 7175 ct = &des_b1_ct; 7176 } else { 7177 if ((ct = find_cacheent(intel_ctab, *dp)) == NULL) { 7178 continue; 7179 } 7180 } 7181 7182 if (func(arg, ct) != 0) { 7183 break; 7184 } 7185 } 7186 } 7187 7188 /* 7189 * (Like the Intel one, except for Cyrix CPUs) 7190 */ 7191 static void 7192 cyrix_walk_cacheinfo(struct cpuid_info *cpi, 7193 void *arg, int (*func)(void *, const struct cachetab *)) 7194 { 7195 const struct cachetab *ct; 7196 uint8_t *dp; 7197 int i; 7198 7199 if ((dp = cpi->cpi_cacheinfo) == NULL) 7200 return; 7201 for (i = 0; i < cpi->cpi_ncache; i++, dp++) { 7202 /* 7203 * Search Cyrix-specific descriptor table first .. 7204 */ 7205 if ((ct = find_cacheent(cyrix_ctab, *dp)) != NULL) { 7206 if (func(arg, ct) != 0) 7207 break; 7208 continue; 7209 } 7210 /* 7211 * .. else fall back to the Intel one 7212 */ 7213 if ((ct = find_cacheent(intel_ctab, *dp)) != NULL) { 7214 if (func(arg, ct) != 0) 7215 break; 7216 continue; 7217 } 7218 } 7219 } 7220 7221 /* 7222 * A cacheinfo walker that adds associativity, line-size, and size properties 7223 * to the devinfo node it is passed as an argument. 
7224 */ 7225 static int 7226 add_cacheent_props(void *arg, const struct cachetab *ct) 7227 { 7228 dev_info_t *devi = arg; 7229 7230 add_cache_prop(devi, ct->ct_label, assoc_str, ct->ct_assoc); 7231 if (ct->ct_line_size != 0) 7232 add_cache_prop(devi, ct->ct_label, line_str, 7233 ct->ct_line_size); 7234 add_cache_prop(devi, ct->ct_label, size_str, ct->ct_size); 7235 return (0); 7236 } 7237 7238 7239 static const char fully_assoc[] = "fully-associative?"; 7240 7241 /* 7242 * AMD style cache/tlb description 7243 * 7244 * Extended functions 5 and 6 directly describe properties of 7245 * tlbs and various cache levels. 7246 */ 7247 static void 7248 add_amd_assoc(dev_info_t *devi, const char *label, uint_t assoc) 7249 { 7250 switch (assoc) { 7251 case 0: /* reserved; ignore */ 7252 break; 7253 default: 7254 add_cache_prop(devi, label, assoc_str, assoc); 7255 break; 7256 case 0xff: 7257 add_cache_prop(devi, label, fully_assoc, 1); 7258 break; 7259 } 7260 } 7261 7262 static void 7263 add_amd_tlb(dev_info_t *devi, const char *label, uint_t assoc, uint_t size) 7264 { 7265 if (size == 0) 7266 return; 7267 add_cache_prop(devi, label, size_str, size); 7268 add_amd_assoc(devi, label, assoc); 7269 } 7270 7271 static void 7272 add_amd_cache(dev_info_t *devi, const char *label, 7273 uint_t size, uint_t assoc, uint_t lines_per_tag, uint_t line_size) 7274 { 7275 if (size == 0 || line_size == 0) 7276 return; 7277 add_amd_assoc(devi, label, assoc); 7278 /* 7279 * Most AMD parts have a sectored cache. Multiple cache lines are 7280 * associated with each tag. A sector consists of all cache lines 7281 * associated with a tag. For example, the AMD K6-III has a sector 7282 * size of 2 cache lines per tag. 7283 */ 7284 if (lines_per_tag != 0) 7285 add_cache_prop(devi, label, "lines-per-tag", lines_per_tag); 7286 add_cache_prop(devi, label, line_str, line_size); 7287 add_cache_prop(devi, label, size_str, size * 1024); 7288 } 7289 7290 static void 7291 add_amd_l2_assoc(dev_info_t *devi, const char *label, uint_t assoc) 7292 { 7293 switch (assoc) { 7294 case 0: /* off */ 7295 break; 7296 case 1: 7297 case 2: 7298 case 4: 7299 add_cache_prop(devi, label, assoc_str, assoc); 7300 break; 7301 case 6: 7302 add_cache_prop(devi, label, assoc_str, 8); 7303 break; 7304 case 8: 7305 add_cache_prop(devi, label, assoc_str, 16); 7306 break; 7307 case 0xf: 7308 add_cache_prop(devi, label, fully_assoc, 1); 7309 break; 7310 default: /* reserved; ignore */ 7311 break; 7312 } 7313 } 7314 7315 static void 7316 add_amd_l2_tlb(dev_info_t *devi, const char *label, uint_t assoc, uint_t size) 7317 { 7318 if (size == 0 || assoc == 0) 7319 return; 7320 add_amd_l2_assoc(devi, label, assoc); 7321 add_cache_prop(devi, label, size_str, size); 7322 } 7323 7324 static void 7325 add_amd_l2_cache(dev_info_t *devi, const char *label, 7326 uint_t size, uint_t assoc, uint_t lines_per_tag, uint_t line_size) 7327 { 7328 if (size == 0 || assoc == 0 || line_size == 0) 7329 return; 7330 add_amd_l2_assoc(devi, label, assoc); 7331 if (lines_per_tag != 0) 7332 add_cache_prop(devi, label, "lines-per-tag", lines_per_tag); 7333 add_cache_prop(devi, label, line_str, line_size); 7334 add_cache_prop(devi, label, size_str, size * 1024); 7335 } 7336 7337 static void 7338 amd_cache_info(struct cpuid_info *cpi, dev_info_t *devi) 7339 { 7340 struct cpuid_regs *cp; 7341 7342 if (cpi->cpi_xmaxeax < 0x80000005) 7343 return; 7344 cp = &cpi->cpi_extd[5]; 7345 7346 /* 7347 * 4M/2M L1 TLB configuration 7348 * 7349 * We report the size for 2M pages because AMD uses two 7350 * TLB 
entries for one 4M page. 7351 */ 7352 add_amd_tlb(devi, "dtlb-2M", 7353 BITX(cp->cp_eax, 31, 24), BITX(cp->cp_eax, 23, 16)); 7354 add_amd_tlb(devi, "itlb-2M", 7355 BITX(cp->cp_eax, 15, 8), BITX(cp->cp_eax, 7, 0)); 7356 7357 /* 7358 * 4K L1 TLB configuration 7359 */ 7360 7361 switch (cpi->cpi_vendor) { 7362 uint_t nentries; 7363 case X86_VENDOR_TM: 7364 if (cpi->cpi_family >= 5) { 7365 /* 7366 * Crusoe processors have 256 TLB entries, but 7367 * cpuid data format constrains them to only 7368 * reporting 255 of them. 7369 */ 7370 if ((nentries = BITX(cp->cp_ebx, 23, 16)) == 255) 7371 nentries = 256; 7372 /* 7373 * Crusoe processors also have a unified TLB 7374 */ 7375 add_amd_tlb(devi, "tlb-4K", BITX(cp->cp_ebx, 31, 24), 7376 nentries); 7377 break; 7378 } 7379 /*FALLTHROUGH*/ 7380 default: 7381 add_amd_tlb(devi, itlb4k_str, 7382 BITX(cp->cp_ebx, 31, 24), BITX(cp->cp_ebx, 23, 16)); 7383 add_amd_tlb(devi, dtlb4k_str, 7384 BITX(cp->cp_ebx, 15, 8), BITX(cp->cp_ebx, 7, 0)); 7385 break; 7386 } 7387 7388 /* 7389 * data L1 cache configuration 7390 */ 7391 7392 add_amd_cache(devi, l1_dcache_str, 7393 BITX(cp->cp_ecx, 31, 24), BITX(cp->cp_ecx, 23, 16), 7394 BITX(cp->cp_ecx, 15, 8), BITX(cp->cp_ecx, 7, 0)); 7395 7396 /* 7397 * code L1 cache configuration 7398 */ 7399 7400 add_amd_cache(devi, l1_icache_str, 7401 BITX(cp->cp_edx, 31, 24), BITX(cp->cp_edx, 23, 16), 7402 BITX(cp->cp_edx, 15, 8), BITX(cp->cp_edx, 7, 0)); 7403 7404 if (cpi->cpi_xmaxeax < 0x80000006) 7405 return; 7406 cp = &cpi->cpi_extd[6]; 7407 7408 /* Check for a unified L2 TLB for large pages */ 7409 7410 if (BITX(cp->cp_eax, 31, 16) == 0) 7411 add_amd_l2_tlb(devi, "l2-tlb-2M", 7412 BITX(cp->cp_eax, 15, 12), BITX(cp->cp_eax, 11, 0)); 7413 else { 7414 add_amd_l2_tlb(devi, "l2-dtlb-2M", 7415 BITX(cp->cp_eax, 31, 28), BITX(cp->cp_eax, 27, 16)); 7416 add_amd_l2_tlb(devi, "l2-itlb-2M", 7417 BITX(cp->cp_eax, 15, 12), BITX(cp->cp_eax, 11, 0)); 7418 } 7419 7420 /* Check for a unified L2 TLB for 4K pages */ 7421 7422 if (BITX(cp->cp_ebx, 31, 16) == 0) { 7423 add_amd_l2_tlb(devi, "l2-tlb-4K", 7424 BITX(cp->cp_ebx, 15, 12), BITX(cp->cp_ebx, 11, 0)); 7425 } else { 7426 add_amd_l2_tlb(devi, "l2-dtlb-4K", 7427 BITX(cp->cp_ebx, 31, 28), BITX(cp->cp_ebx, 27, 16)); 7428 add_amd_l2_tlb(devi, "l2-itlb-4K", 7429 BITX(cp->cp_ebx, 15, 12), BITX(cp->cp_ebx, 11, 0)); 7430 } 7431 7432 add_amd_l2_cache(devi, l2_cache_str, 7433 BITX(cp->cp_ecx, 31, 16), BITX(cp->cp_ecx, 15, 12), 7434 BITX(cp->cp_ecx, 11, 8), BITX(cp->cp_ecx, 7, 0)); 7435 } 7436 7437 /* 7438 * There are two basic ways that the x86 world describes its cache 7439 * and tlb architecture - Intel's way and AMD's way. 7440 * 7441 * Return which flavor of cache architecture we should use 7442 */ 7443 static int 7444 x86_which_cacheinfo(struct cpuid_info *cpi) 7445 { 7446 switch (cpi->cpi_vendor) { 7447 case X86_VENDOR_Intel: 7448 if (cpi->cpi_maxeax >= 2) 7449 return (X86_VENDOR_Intel); 7450 break; 7451 case X86_VENDOR_AMD: 7452 /* 7453 * The K5 model 1 was the first part from AMD that reported 7454 * cache sizes via extended cpuid functions. 7455 */ 7456 if (cpi->cpi_family > 5 || 7457 (cpi->cpi_family == 5 && cpi->cpi_model >= 1)) 7458 return (X86_VENDOR_AMD); 7459 break; 7460 case X86_VENDOR_HYGON: 7461 return (X86_VENDOR_AMD); 7462 case X86_VENDOR_TM: 7463 if (cpi->cpi_family >= 5) 7464 return (X86_VENDOR_AMD); 7465 /*FALLTHROUGH*/ 7466 default: 7467 /* 7468 * If they have extended CPU data for 0x80000005 7469 * then we assume they have AMD-format cache 7470 * information.
7471 * 7472 * If not, and the vendor happens to be Cyrix, 7473 * then try our-Cyrix specific handler. 7474 * 7475 * If we're not Cyrix, then assume we're using Intel's 7476 * table-driven format instead. 7477 */ 7478 if (cpi->cpi_xmaxeax >= 0x80000005) 7479 return (X86_VENDOR_AMD); 7480 else if (cpi->cpi_vendor == X86_VENDOR_Cyrix) 7481 return (X86_VENDOR_Cyrix); 7482 else if (cpi->cpi_maxeax >= 2) 7483 return (X86_VENDOR_Intel); 7484 break; 7485 } 7486 return (-1); 7487 } 7488 7489 void 7490 cpuid_set_cpu_properties(void *dip, processorid_t cpu_id, 7491 struct cpuid_info *cpi) 7492 { 7493 dev_info_t *cpu_devi; 7494 int create; 7495 7496 cpu_devi = (dev_info_t *)dip; 7497 7498 /* device_type */ 7499 (void) ndi_prop_update_string(DDI_DEV_T_NONE, cpu_devi, 7500 "device_type", "cpu"); 7501 7502 /* reg */ 7503 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7504 "reg", cpu_id); 7505 7506 /* cpu-mhz, and clock-frequency */ 7507 if (cpu_freq > 0) { 7508 long long mul; 7509 7510 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7511 "cpu-mhz", cpu_freq); 7512 if ((mul = cpu_freq * 1000000LL) <= INT_MAX) 7513 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7514 "clock-frequency", (int)mul); 7515 } 7516 7517 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 7518 7519 /* vendor-id */ 7520 (void) ndi_prop_update_string(DDI_DEV_T_NONE, cpu_devi, 7521 "vendor-id", cpi->cpi_vendorstr); 7522 7523 if (cpi->cpi_maxeax == 0) { 7524 return; 7525 } 7526 7527 /* 7528 * family, model, and step 7529 */ 7530 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7531 "family", CPI_FAMILY(cpi)); 7532 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7533 "cpu-model", CPI_MODEL(cpi)); 7534 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7535 "stepping-id", CPI_STEP(cpi)); 7536 7537 /* type */ 7538 switch (cpi->cpi_vendor) { 7539 case X86_VENDOR_Intel: 7540 create = 1; 7541 break; 7542 default: 7543 create = 0; 7544 break; 7545 } 7546 if (create) 7547 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7548 "type", CPI_TYPE(cpi)); 7549 7550 /* ext-family */ 7551 switch (cpi->cpi_vendor) { 7552 case X86_VENDOR_Intel: 7553 case X86_VENDOR_AMD: 7554 create = cpi->cpi_family >= 0xf; 7555 break; 7556 case X86_VENDOR_HYGON: 7557 create = 1; 7558 break; 7559 default: 7560 create = 0; 7561 break; 7562 } 7563 if (create) 7564 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7565 "ext-family", CPI_FAMILY_XTD(cpi)); 7566 7567 /* ext-model */ 7568 switch (cpi->cpi_vendor) { 7569 case X86_VENDOR_Intel: 7570 create = IS_EXTENDED_MODEL_INTEL(cpi); 7571 break; 7572 case X86_VENDOR_AMD: 7573 create = CPI_FAMILY(cpi) == 0xf; 7574 break; 7575 case X86_VENDOR_HYGON: 7576 create = 1; 7577 break; 7578 default: 7579 create = 0; 7580 break; 7581 } 7582 if (create) 7583 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7584 "ext-model", CPI_MODEL_XTD(cpi)); 7585 7586 /* generation */ 7587 switch (cpi->cpi_vendor) { 7588 case X86_VENDOR_AMD: 7589 case X86_VENDOR_HYGON: 7590 /* 7591 * AMD K5 model 1 was the first part to support this 7592 */ 7593 create = cpi->cpi_xmaxeax >= 0x80000001; 7594 break; 7595 default: 7596 create = 0; 7597 break; 7598 } 7599 if (create) 7600 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7601 "generation", BITX((cpi)->cpi_extd[1].cp_eax, 11, 8)); 7602 7603 /* brand-id */ 7604 switch (cpi->cpi_vendor) { 7605 case X86_VENDOR_Intel: 7606 /* 7607 * brand id first appeared on Pentium III Xeon model 8, 7608 * and Celeron model 8 processors and Opteron 7609 */ 7610 create = 
cpi->cpi_family > 6 || 7611 (cpi->cpi_family == 6 && cpi->cpi_model >= 8); 7612 break; 7613 case X86_VENDOR_AMD: 7614 create = cpi->cpi_family >= 0xf; 7615 break; 7616 case X86_VENDOR_HYGON: 7617 create = 1; 7618 break; 7619 default: 7620 create = 0; 7621 break; 7622 } 7623 if (create && cpi->cpi_brandid != 0) { 7624 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7625 "brand-id", cpi->cpi_brandid); 7626 } 7627 7628 /* chunks, and apic-id */ 7629 switch (cpi->cpi_vendor) { 7630 /* 7631 * first available on Pentium IV and Opteron (K8) 7632 */ 7633 case X86_VENDOR_Intel: 7634 create = IS_NEW_F6(cpi) || cpi->cpi_family >= 0xf; 7635 break; 7636 case X86_VENDOR_AMD: 7637 create = cpi->cpi_family >= 0xf; 7638 break; 7639 case X86_VENDOR_HYGON: 7640 create = 1; 7641 break; 7642 default: 7643 create = 0; 7644 break; 7645 } 7646 if (create) { 7647 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7648 "chunks", CPI_CHUNKS(cpi)); 7649 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7650 "apic-id", cpi->cpi_apicid); 7651 if (cpi->cpi_chipid >= 0) { 7652 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7653 "chip#", cpi->cpi_chipid); 7654 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7655 "clog#", cpi->cpi_clogid); 7656 } 7657 } 7658 7659 /* cpuid-features */ 7660 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7661 "cpuid-features", CPI_FEATURES_EDX(cpi)); 7662 7663 7664 /* cpuid-features-ecx */ 7665 switch (cpi->cpi_vendor) { 7666 case X86_VENDOR_Intel: 7667 create = IS_NEW_F6(cpi) || cpi->cpi_family >= 0xf; 7668 break; 7669 case X86_VENDOR_AMD: 7670 create = cpi->cpi_family >= 0xf; 7671 break; 7672 case X86_VENDOR_HYGON: 7673 create = 1; 7674 break; 7675 default: 7676 create = 0; 7677 break; 7678 } 7679 if (create) 7680 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7681 "cpuid-features-ecx", CPI_FEATURES_ECX(cpi)); 7682 7683 /* ext-cpuid-features */ 7684 switch (cpi->cpi_vendor) { 7685 case X86_VENDOR_Intel: 7686 case X86_VENDOR_AMD: 7687 case X86_VENDOR_HYGON: 7688 case X86_VENDOR_Cyrix: 7689 case X86_VENDOR_TM: 7690 case X86_VENDOR_Centaur: 7691 create = cpi->cpi_xmaxeax >= 0x80000001; 7692 break; 7693 default: 7694 create = 0; 7695 break; 7696 } 7697 if (create) { 7698 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7699 "ext-cpuid-features", CPI_FEATURES_XTD_EDX(cpi)); 7700 (void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi, 7701 "ext-cpuid-features-ecx", CPI_FEATURES_XTD_ECX(cpi)); 7702 } 7703 7704 /* 7705 * Brand String first appeared in Intel Pentium IV, AMD K5 7706 * model 1, and Cyrix GXm. On earlier models we try and 7707 * simulate something similar .. so this string should always 7708 * say -something- about the processor, however lame.
7709 */ 7710 (void) ndi_prop_update_string(DDI_DEV_T_NONE, cpu_devi, 7711 "brand-string", cpi->cpi_brandstr); 7712 7713 /* 7714 * Finally, cache and tlb information 7715 */ 7716 switch (x86_which_cacheinfo(cpi)) { 7717 case X86_VENDOR_Intel: 7718 intel_walk_cacheinfo(cpi, cpu_devi, add_cacheent_props); 7719 break; 7720 case X86_VENDOR_Cyrix: 7721 cyrix_walk_cacheinfo(cpi, cpu_devi, add_cacheent_props); 7722 break; 7723 case X86_VENDOR_AMD: 7724 amd_cache_info(cpi, cpu_devi); 7725 break; 7726 default: 7727 break; 7728 } 7729 } 7730 7731 struct l2info { 7732 int *l2i_csz; 7733 int *l2i_lsz; 7734 int *l2i_assoc; 7735 int l2i_ret; 7736 }; 7737 7738 /* 7739 * A cacheinfo walker that fetches the size, line-size and associativity 7740 * of the L2 cache 7741 */ 7742 static int 7743 intel_l2cinfo(void *arg, const struct cachetab *ct) 7744 { 7745 struct l2info *l2i = arg; 7746 int *ip; 7747 7748 if (ct->ct_label != l2_cache_str && 7749 ct->ct_label != sl2_cache_str) 7750 return (0); /* not an L2 -- keep walking */ 7751 7752 if ((ip = l2i->l2i_csz) != NULL) 7753 *ip = ct->ct_size; 7754 if ((ip = l2i->l2i_lsz) != NULL) 7755 *ip = ct->ct_line_size; 7756 if ((ip = l2i->l2i_assoc) != NULL) 7757 *ip = ct->ct_assoc; 7758 l2i->l2i_ret = ct->ct_size; 7759 return (1); /* was an L2 -- terminate walk */ 7760 } 7761 7762 /* 7763 * AMD L2/L3 Cache and TLB Associativity Field Definition: 7764 * 7765 * Unlike the associativity for the L1 cache and tlb where the 8 bit 7766 * value is the associativity, the associativity for the L2 cache and 7767 * tlb is encoded in the following table. The 4 bit L2 value serves as 7768 * an index into the amd_afd[] array to determine the associativity. 7769 * -1 is undefined. 0 is fully associative. 7770 */ 7771 7772 static int amd_afd[] = 7773 {-1, 1, 2, -1, 4, -1, 8, -1, 16, -1, 32, 48, 64, 96, 128, 0}; 7774 7775 static void 7776 amd_l2cacheinfo(struct cpuid_info *cpi, struct l2info *l2i) 7777 { 7778 struct cpuid_regs *cp; 7779 uint_t size, assoc; 7780 int i; 7781 int *ip; 7782 7783 if (cpi->cpi_xmaxeax < 0x80000006) 7784 return; 7785 cp = &cpi->cpi_extd[6]; 7786 7787 if ((i = BITX(cp->cp_ecx, 15, 12)) != 0 && 7788 (size = BITX(cp->cp_ecx, 31, 16)) != 0) { 7789 uint_t cachesz = size * 1024; 7790 assoc = amd_afd[i]; 7791 7792 ASSERT(assoc != -1); 7793 7794 if ((ip = l2i->l2i_csz) != NULL) 7795 *ip = cachesz; 7796 if ((ip = l2i->l2i_lsz) != NULL) 7797 *ip = BITX(cp->cp_ecx, 7, 0); 7798 if ((ip = l2i->l2i_assoc) != NULL) 7799 *ip = assoc; 7800 l2i->l2i_ret = cachesz; 7801 } 7802 } 7803 7804 int 7805 getl2cacheinfo(cpu_t *cpu, int *csz, int *lsz, int *assoc) 7806 { 7807 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 7808 struct l2info __l2info, *l2i = &__l2info; 7809 7810 l2i->l2i_csz = csz; 7811 l2i->l2i_lsz = lsz; 7812 l2i->l2i_assoc = assoc; 7813 l2i->l2i_ret = -1; 7814 7815 switch (x86_which_cacheinfo(cpi)) { 7816 case X86_VENDOR_Intel: 7817 intel_walk_cacheinfo(cpi, l2i, intel_l2cinfo); 7818 break; 7819 case X86_VENDOR_Cyrix: 7820 cyrix_walk_cacheinfo(cpi, l2i, intel_l2cinfo); 7821 break; 7822 case X86_VENDOR_AMD: 7823 amd_l2cacheinfo(cpi, l2i); 7824 break; 7825 default: 7826 break; 7827 } 7828 return (l2i->l2i_ret); 7829 } 7830 7831 #if !defined(__xpv) 7832 7833 uint32_t * 7834 cpuid_mwait_alloc(cpu_t *cpu) 7835 { 7836 uint32_t *ret; 7837 size_t mwait_size; 7838 7839 ASSERT(cpuid_checkpass(CPU, CPUID_PASS_EXTENDED)); 7840 7841 mwait_size = CPU->cpu_m.mcpu_cpi->cpi_mwait.mon_max; 7842 if (mwait_size == 0) 7843 return (NULL); 7844 7845 /* 7846 * kmem_alloc() returns cache line size 
aligned data for mwait_size 7847 * allocations. mwait_size is currently cache line sized. Neither 7848 * of these implementation details is guaranteed to be true in the 7849 * future. 7850 * 7851 * First try allocating mwait_size as kmem_alloc() currently returns 7852 * correctly aligned memory. If kmem_alloc() does not return 7853 * mwait_size aligned memory, then use mwait_size ROUNDUP. 7854 * 7855 * Set cpi_mwait.buf_actual and cpi_mwait.size_actual in case we 7856 * decide to free this memory. 7857 */ 7858 ret = kmem_zalloc(mwait_size, KM_SLEEP); 7859 if (ret == (uint32_t *)P2ROUNDUP((uintptr_t)ret, mwait_size)) { 7860 cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual = ret; 7861 cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual = mwait_size; 7862 *ret = MWAIT_RUNNING; 7863 return (ret); 7864 } else { 7865 kmem_free(ret, mwait_size); 7866 ret = kmem_zalloc(mwait_size * 2, KM_SLEEP); 7867 cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual = ret; 7868 cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual = mwait_size * 2; 7869 ret = (uint32_t *)P2ROUNDUP((uintptr_t)ret, mwait_size); 7870 *ret = MWAIT_RUNNING; 7871 return (ret); 7872 } 7873 } 7874 7875 void 7876 cpuid_mwait_free(cpu_t *cpu) 7877 { 7878 if (cpu->cpu_m.mcpu_cpi == NULL) { 7879 return; 7880 } 7881 7882 if (cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual != NULL && 7883 cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual > 0) { 7884 kmem_free(cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual, 7885 cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual); 7886 } 7887 7888 cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual = NULL; 7889 cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual = 0; 7890 } 7891 7892 void 7893 patch_tsc_read(int flag) 7894 { 7895 size_t cnt; 7896 7897 switch (flag) { 7898 case TSC_NONE: 7899 cnt = &_no_rdtsc_end - &_no_rdtsc_start; 7900 (void) memcpy((void *)tsc_read, (void *)&_no_rdtsc_start, cnt); 7901 break; 7902 case TSC_RDTSC_LFENCE: 7903 cnt = &_tsc_lfence_end - &_tsc_lfence_start; 7904 (void) memcpy((void *)tsc_read, 7905 (void *)&_tsc_lfence_start, cnt); 7906 break; 7907 case TSC_TSCP: 7908 cnt = &_tscp_end - &_tscp_start; 7909 (void) memcpy((void *)tsc_read, (void *)&_tscp_start, cnt); 7910 break; 7911 default: 7912 /* Bail for unexpected TSC types. (TSC_NONE covers 0) */ 7913 cmn_err(CE_PANIC, "Unrecognized TSC type: %d", flag); 7914 break; 7915 } 7916 tsc_type = flag; 7917 } 7918 7919 int 7920 cpuid_deep_cstates_supported(void) 7921 { 7922 struct cpuid_info *cpi; 7923 struct cpuid_regs regs; 7924 7925 ASSERT(cpuid_checkpass(CPU, CPUID_PASS_BASIC)); 7926 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 7927 7928 cpi = CPU->cpu_m.mcpu_cpi; 7929 7930 switch (cpi->cpi_vendor) { 7931 case X86_VENDOR_Intel: 7932 case X86_VENDOR_AMD: 7933 case X86_VENDOR_HYGON: 7934 if (cpi->cpi_xmaxeax < 0x80000007) 7935 return (0); 7936 7937 /* 7938 * Does TSC run at a constant rate in all C-states? 7939 */ 7940 regs.cp_eax = 0x80000007; 7941 (void) __cpuid_insn(&regs); 7942 return (regs.cp_edx & CPUID_TSC_CSTATE_INVARIANCE); 7943 7944 default: 7945 return (0); 7946 } 7947 } 7948 7949 #endif /* !__xpv */ 7950 7951 void 7952 post_startup_cpu_fixups(void) 7953 { 7954 #ifndef __xpv 7955 /* 7956 * Some AMD processors support C1E state. Entering this state will 7957 * cause the local APIC timer to stop, which we can't deal with at 7958 * this time.
7959 */ 7960 if (cpuid_getvendor(CPU) == X86_VENDOR_AMD) { 7961 on_trap_data_t otd; 7962 uint64_t reg; 7963 7964 if (!on_trap(&otd, OT_DATA_ACCESS)) { 7965 reg = rdmsr(MSR_AMD_INT_PENDING_CMP_HALT); 7966 /* Disable C1E state if it is enabled by BIOS */ 7967 if ((reg >> AMD_ACTONCMPHALT_SHIFT) & 7968 AMD_ACTONCMPHALT_MASK) { 7969 reg &= ~(AMD_ACTONCMPHALT_MASK << 7970 AMD_ACTONCMPHALT_SHIFT); 7971 wrmsr(MSR_AMD_INT_PENDING_CMP_HALT, reg); 7972 } 7973 } 7974 no_trap(); 7975 } 7976 #endif /* !__xpv */ 7977 } 7978 7979 void 7980 enable_pcid(void) 7981 { 7982 if (x86_use_pcid == -1) 7983 x86_use_pcid = is_x86_feature(x86_featureset, X86FSET_PCID); 7984 7985 if (x86_use_invpcid == -1) { 7986 x86_use_invpcid = is_x86_feature(x86_featureset, 7987 X86FSET_INVPCID); 7988 } 7989 7990 if (!x86_use_pcid) 7991 return; 7992 7993 /* 7994 * Intel say that on setting PCIDE, it immediately starts using the PCID 7995 * bits; better make sure there's nothing there. 7996 */ 7997 ASSERT((getcr3() & MMU_PAGEOFFSET) == PCID_NONE); 7998 7999 setcr4(getcr4() | CR4_PCIDE); 8000 } 8001 8002 /* 8003 * Setup necessary registers to enable XSAVE feature on this processor. 8004 * This function needs to be called early enough, so that no xsave/xrstor 8005 * ops will execute on the processor before the MSRs are properly set up. 8006 * 8007 * Current implementation has the following assumption: 8008 * - cpuid_pass_basic() is done, so that X86 features are known. 8009 * - fpu_probe() is done, so that fp_save_mech is chosen. 8010 */ 8011 void 8012 xsave_setup_msr(cpu_t *cpu) 8013 { 8014 ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC)); 8015 ASSERT(fp_save_mech == FP_XSAVE); 8016 ASSERT(is_x86_feature(x86_featureset, X86FSET_XSAVE)); 8017 8018 /* Enable OSXSAVE in CR4. */ 8019 setcr4(getcr4() | CR4_OSXSAVE); 8020 /* 8021 * Update SW copy of ECX, so that /dev/cpu/self/cpuid will report 8022 * correct value. 8023 */ 8024 cpu->cpu_m.mcpu_cpi->cpi_std[1].cp_ecx |= CPUID_INTC_ECX_OSXSAVE; 8025 setup_xfem(); 8026 } 8027 8028 /* 8029 * Starting with the Westmere processor the local 8030 * APIC timer will continue running in all C-states, 8031 * including the deepest C-states. 8032 */ 8033 int 8034 cpuid_arat_supported(void) 8035 { 8036 struct cpuid_info *cpi; 8037 struct cpuid_regs regs; 8038 8039 ASSERT(cpuid_checkpass(CPU, CPUID_PASS_BASIC)); 8040 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 8041 8042 cpi = CPU->cpu_m.mcpu_cpi; 8043 8044 switch (cpi->cpi_vendor) { 8045 case X86_VENDOR_Intel: 8046 case X86_VENDOR_AMD: 8047 case X86_VENDOR_HYGON: 8048 /* 8049 * Always-running Local APIC Timer is 8050 * indicated by CPUID.6.EAX[2]. 
8051 */ 8052 if (cpi->cpi_maxeax >= 6) { 8053 regs.cp_eax = 6; 8054 (void) cpuid_insn(NULL, &regs); 8055 return (regs.cp_eax & CPUID_INTC_EAX_ARAT); 8056 } else { 8057 return (0); 8058 } 8059 default: 8060 return (0); 8061 } 8062 } 8063 8064 /* 8065 * Check support for Intel ENERGY_PERF_BIAS feature 8066 */ 8067 int 8068 cpuid_iepb_supported(struct cpu *cp) 8069 { 8070 struct cpuid_info *cpi = cp->cpu_m.mcpu_cpi; 8071 struct cpuid_regs regs; 8072 8073 ASSERT(cpuid_checkpass(cp, CPUID_PASS_BASIC)); 8074 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 8075 8076 if (!(is_x86_feature(x86_featureset, X86FSET_MSR))) { 8077 return (0); 8078 } 8079 8080 /* 8081 * Intel ENERGY_PERF_BIAS MSR is indicated by 8082 * capability bit CPUID.6.ECX.3 8083 */ 8084 if ((cpi->cpi_vendor != X86_VENDOR_Intel) || (cpi->cpi_maxeax < 6)) 8085 return (0); 8086 8087 regs.cp_eax = 0x6; 8088 (void) cpuid_insn(NULL, &regs); 8089 return (regs.cp_ecx & CPUID_INTC_ECX_PERFBIAS); 8090 } 8091 8092 /* 8093 * Check support for TSC deadline timer 8094 * 8095 * TSC deadline timer provides a superior software programming 8096 * model over local APIC timer that eliminates "time drifts". 8097 * Instead of specifying a relative time, software specifies an 8098 * absolute time as the target at which the processor should 8099 * generate a timer event. 8100 */ 8101 int 8102 cpuid_deadline_tsc_supported(void) 8103 { 8104 struct cpuid_info *cpi = CPU->cpu_m.mcpu_cpi; 8105 struct cpuid_regs regs; 8106 8107 ASSERT(cpuid_checkpass(CPU, CPUID_PASS_BASIC)); 8108 ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID)); 8109 8110 switch (cpi->cpi_vendor) { 8111 case X86_VENDOR_Intel: 8112 if (cpi->cpi_maxeax >= 1) { 8113 regs.cp_eax = 1; 8114 (void) cpuid_insn(NULL, &regs); 8115 return (regs.cp_ecx & CPUID_DEADLINE_TSC); 8116 } else { 8117 return (0); 8118 } 8119 default: 8120 return (0); 8121 } 8122 } 8123 8124 #if !defined(__xpv) 8125 /* 8126 * Patch in versions of bcopy for high performance Intel Nhm processors 8127 * and later... 8128 */ 8129 void 8130 patch_memops(uint_t vendor) 8131 { 8132 size_t cnt, i; 8133 caddr_t to, from; 8134 8135 if ((vendor == X86_VENDOR_Intel) && 8136 is_x86_feature(x86_featureset, X86FSET_SSE4_2)) { 8137 cnt = &bcopy_patch_end - &bcopy_patch_start; 8138 to = &bcopy_ck_size; 8139 from = &bcopy_patch_start; 8140 for (i = 0; i < cnt; i++) { 8141 *to++ = *from++; 8142 } 8143 } 8144 } 8145 #endif /* !__xpv */ 8146 8147 /* 8148 * We're being asked to tell the system how many bits are required to represent 8149 * the various core and strand IDs. While it's tempting to derive this based 8150 * on the values in cpi_ncore_per_chip and cpi_ncpu_per_chip, that isn't quite 8151 * correct. Instead, this needs to be based on the number of bits that the APIC 8152 * allows for these different configurations. We only update these to a larger 8153 * value if we find one.
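 *
 * As a hedged sketch of how a caller might consume these values (the local
 * variables here are invented, and this assumes the conventional layout in
 * which the strand bits occupy the low part of the APIC ID with the core
 * bits immediately above them):
 *
 *	uint_t core_nbits = 0, strand_nbits = 0;
 *	uint32_t apicid, strand, core, pkg;
 *
 *	cpuid_get_ext_topo(cpu, &core_nbits, &strand_nbits);
 *	apicid = cpuid_get_apicid(cpu);
 *	strand = apicid & ((1U << strand_nbits) - 1);
 *	core = (apicid >> strand_nbits) & ((1U << core_nbits) - 1);
 *	pkg = apicid >> (core_nbits + strand_nbits);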
8154 */ 8155 void 8156 cpuid_get_ext_topo(cpu_t *cpu, uint_t *core_nbits, uint_t *strand_nbits) 8157 { 8158 struct cpuid_info *cpi; 8159 8160 VERIFY(cpuid_checkpass(CPU, CPUID_PASS_BASIC)); 8161 cpi = cpu->cpu_m.mcpu_cpi; 8162 8163 if (cpi->cpi_ncore_bits > *core_nbits) { 8164 *core_nbits = cpi->cpi_ncore_bits; 8165 } 8166 8167 if (cpi->cpi_nthread_bits > *strand_nbits) { 8168 *strand_nbits = cpi->cpi_nthread_bits; 8169 } 8170 } 8171 8172 void 8173 cpuid_pass_ucode(cpu_t *cpu, uchar_t *fset) 8174 { 8175 struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi; 8176 struct cpuid_regs cp; 8177 8178 /* 8179 * Reread the CPUID portions that we need for various security 8180 * information. 8181 */ 8182 switch (cpi->cpi_vendor) { 8183 case X86_VENDOR_Intel: 8184 /* 8185 * Check if we now have leaf 7 available to us. 8186 */ 8187 if (cpi->cpi_maxeax < 7) { 8188 bzero(&cp, sizeof (cp)); 8189 cp.cp_eax = 0; 8190 cpi->cpi_maxeax = __cpuid_insn(&cp); 8191 if (cpi->cpi_maxeax < 7) 8192 break; 8193 } 8194 8195 bzero(&cp, sizeof (cp)); 8196 cp.cp_eax = 7; 8197 cp.cp_ecx = 0; 8198 (void) __cpuid_insn(&cp); 8199 cpi->cpi_std[7] = cp; 8200 break; 8201 8202 case X86_VENDOR_AMD: 8203 case X86_VENDOR_HYGON: 8204 /* No xcpuid support */ 8205 if (cpi->cpi_family < 5 || 8206 (cpi->cpi_family == 5 && cpi->cpi_model < 1)) 8207 break; 8208 8209 if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_8) { 8210 bzero(&cp, sizeof (cp)); 8211 cp.cp_eax = CPUID_LEAF_EXT_0; 8212 cpi->cpi_xmaxeax = __cpuid_insn(&cp); 8213 if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_8) 8214 break; 8215 } 8216 8217 /* 8218 * Most AMD features are in leaf 8. Automatic IBRS was added in 8219 * leaf 0x21. So we also check that. 8220 */ 8221 bzero(&cp, sizeof (cp)); 8222 cp.cp_eax = CPUID_LEAF_EXT_8; 8223 (void) __cpuid_insn(&cp); 8224 platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_8, &cp); 8225 cpi->cpi_extd[8] = cp; 8226 8227 if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_21) 8228 break; 8229 8230 bzero(&cp, sizeof (cp)); 8231 cp.cp_eax = CPUID_LEAF_EXT_21; 8232 (void) __cpuid_insn(&cp); 8233 platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_21, &cp); 8234 cpi->cpi_extd[0x21] = cp; 8235 break; 8236 8237 default: 8238 /* 8239 * Nothing to do here. Return an empty set which has already 8240 * been zeroed for us. 8241 */ 8242 return; 8243 } 8244 8245 cpuid_scan_security(cpu, fset); 8246 } 8247 8248 /* ARGSUSED */ 8249 static int 8250 cpuid_post_ucodeadm_xc(xc_arg_t arg0, xc_arg_t arg1, xc_arg_t arg2) 8251 { 8252 uchar_t *fset; 8253 boolean_t first_pass = (boolean_t)arg1; 8254 8255 fset = (uchar_t *)(arg0 + sizeof (x86_featureset) * CPU->cpu_id); 8256 if (first_pass && CPU->cpu_id != 0) 8257 return (0); 8258 if (!first_pass && CPU->cpu_id == 0) 8259 return (0); 8260 cpuid_pass_ucode(CPU, fset); 8261 8262 return (0); 8263 } 8264 8265 /* 8266 * After a microcode update where the version has changed, then we need to 8267 * rescan CPUID. To do this we check every CPU to make sure that they have the 8268 * same microcode. Then we perform a cross call to all such CPUs. It's the 8269 * caller's job to make sure that no one else can end up doing an update while 8270 * this is going on. 8271 * 8272 * We assume that the system is microcode capable if we're called. 
8273 */ 8274 void 8275 cpuid_post_ucodeadm(void) 8276 { 8277 uint32_t rev; 8278 int i; 8279 struct cpu *cpu; 8280 cpuset_t cpuset; 8281 void *argdata; 8282 uchar_t *f0; 8283 8284 argdata = kmem_zalloc(sizeof (x86_featureset) * NCPU, KM_SLEEP); 8285 8286 mutex_enter(&cpu_lock); 8287 cpu = cpu_get(0); 8288 rev = cpu->cpu_m.mcpu_ucode_info->cui_rev; 8289 CPUSET_ONLY(cpuset, 0); 8290 for (i = 1; i < max_ncpus; i++) { 8291 if ((cpu = cpu_get(i)) == NULL) 8292 continue; 8293 8294 if (cpu->cpu_m.mcpu_ucode_info->cui_rev != rev) { 8295 panic("post microcode update CPU %d has differing " 8296 "microcode revision (%u) from CPU 0 (%u)", 8297 i, cpu->cpu_m.mcpu_ucode_info->cui_rev, rev); 8298 } 8299 CPUSET_ADD(cpuset, i); 8300 } 8301 8302 /* 8303 * We do the cross calls in two passes. The first pass is only for the 8304 * boot CPU. The second pass is for all of the other CPUs. This allows 8305 * the boot CPU to go through and change behavior related to patching or 8306 * whether or not Enhanced IBRS needs to be enabled and then allow all 8307 * other CPUs to follow suit. 8308 */ 8309 kpreempt_disable(); 8310 xc_sync((xc_arg_t)argdata, B_TRUE, 0, CPUSET2BV(cpuset), 8311 cpuid_post_ucodeadm_xc); 8312 xc_sync((xc_arg_t)argdata, B_FALSE, 0, CPUSET2BV(cpuset), 8313 cpuid_post_ucodeadm_xc); 8314 kpreempt_enable(); 8315 8316 /* 8317 * OK, now look at each CPU and see if their feature sets are equal. 8318 */ 8319 f0 = argdata; 8320 for (i = 1; i < max_ncpus; i++) { 8321 uchar_t *fset; 8322 if (!CPU_IN_SET(cpuset, i)) 8323 continue; 8324 8325 fset = (uchar_t *)((uintptr_t)argdata + 8326 sizeof (x86_featureset) * i); 8327 8328 if (!compare_x86_featureset(f0, fset)) { 8329 panic("Post microcode update CPU %d has " 8330 "differing security feature (%p) set from CPU 0 " 8331 "(%p), not appending to feature set", i, 8332 (void *)fset, (void *)f0); 8333 } 8334 } 8335 8336 mutex_exit(&cpu_lock); 8337 8338 for (i = 0; i < NUM_X86_FEATURES; i++) { 8339 cmn_err(CE_CONT, "?post-ucode x86_feature: %s\n", 8340 x86_feature_names[i]); 8341 if (is_x86_feature(f0, i)) { 8342 add_x86_feature(x86_featureset, i); 8343 } 8344 } 8345 kmem_free(argdata, sizeof (x86_featureset) * NCPU); 8346 } 8347 8348 typedef void (*cpuid_pass_f)(cpu_t *, void *); 8349 8350 typedef struct cpuid_pass_def { 8351 cpuid_pass_t cpd_pass; 8352 cpuid_pass_f cpd_func; 8353 } cpuid_pass_def_t; 8354 8355 /* 8356 * See block comment at the top; note that cpuid_pass_ucode is not a pass in the 8357 * normal sense and should not appear here. 
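 *
 * Each pass is run through cpuid_execpass() below; as a rough illustration,
 * advancing a CPU to its next pass looks like
 *
 *	cpuid_execpass(cp, CPUID_PASS_BASIC, arg);
 *
 * where arg is whatever that particular pass expects and is simply handed
 * through to the pass function.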
8358 */ 8359 static const cpuid_pass_def_t cpuid_pass_defs[] = { 8360 { CPUID_PASS_PRELUDE, cpuid_pass_prelude }, 8361 { CPUID_PASS_IDENT, cpuid_pass_ident }, 8362 { CPUID_PASS_BASIC, cpuid_pass_basic }, 8363 { CPUID_PASS_EXTENDED, cpuid_pass_extended }, 8364 { CPUID_PASS_DYNAMIC, cpuid_pass_dynamic }, 8365 { CPUID_PASS_RESOLVE, cpuid_pass_resolve }, 8366 }; 8367 8368 void 8369 cpuid_execpass(cpu_t *cp, cpuid_pass_t pass, void *arg) 8370 { 8371 VERIFY3S(pass, !=, CPUID_PASS_NONE); 8372 8373 if (cp == NULL) 8374 cp = CPU; 8375 8376 /* 8377 * Space statically allocated for BSP, ensure pointer is set 8378 */ 8379 if (cp->cpu_id == 0 && cp->cpu_m.mcpu_cpi == NULL) 8380 cp->cpu_m.mcpu_cpi = &cpuid_info0; 8381 8382 ASSERT(cpuid_checkpass(cp, pass - 1)); 8383 8384 for (uint_t i = 0; i < ARRAY_SIZE(cpuid_pass_defs); i++) { 8385 if (cpuid_pass_defs[i].cpd_pass == pass) { 8386 cpuid_pass_defs[i].cpd_func(cp, arg); 8387 cp->cpu_m.mcpu_cpi->cpi_pass = pass; 8388 return; 8389 } 8390 } 8391 8392 panic("unable to execute invalid cpuid pass %d on cpu%d\n", 8393 pass, cp->cpu_id); 8394 } 8395 8396 /* 8397 * Extract the processor family from a chiprev. Processor families are not the 8398 * same as cpuid families; see comments above and in x86_archext.h. 8399 */ 8400 x86_processor_family_t 8401 chiprev_family(const x86_chiprev_t cr) 8402 { 8403 return ((x86_processor_family_t)_X86_CHIPREV_FAMILY(cr)); 8404 } 8405 8406 /* 8407 * A chiprev matches its template if the vendor and family are identical and the 8408 * revision of the chiprev matches one of the bits set in the template. Callers 8409 * may bitwise-OR together chiprevs of the same vendor and family to form the 8410 * template, or use the _ANY variant. It is not possible to match chiprevs of 8411 * multiple vendors or processor families with a single call. Note that this 8412 * function operates on processor families, not cpuid families. 8413 */ 8414 boolean_t 8415 chiprev_matches(const x86_chiprev_t cr, const x86_chiprev_t template) 8416 { 8417 return (_X86_CHIPREV_VENDOR(cr) == _X86_CHIPREV_VENDOR(template) && 8418 _X86_CHIPREV_FAMILY(cr) == _X86_CHIPREV_FAMILY(template) && 8419 (_X86_CHIPREV_REV(cr) & _X86_CHIPREV_REV(template)) != 0); 8420 } 8421 8422 /* 8423 * A chiprev is at least min if the vendor and family are identical and the 8424 * revision of the chiprev is at least as recent as that of min. Processor 8425 * families are considered unordered and cannot be compared using this function. 8426 * Note that this function operates on processor families, not cpuid families. 8427 * Use of the _ANY chiprev variant with this function is not useful; it will 8428 * always return B_FALSE if the _ANY variant is supplied as the minimum 8429 * revision. To determine only whether a chiprev is of a given processor 8430 * family, test the return value of chiprev_family() instead. 8431 */ 8432 boolean_t 8433 chiprev_at_least(const x86_chiprev_t cr, const x86_chiprev_t min) 8434 { 8435 return (_X86_CHIPREV_VENDOR(cr) == _X86_CHIPREV_VENDOR(min) && 8436 _X86_CHIPREV_FAMILY(cr) == _X86_CHIPREV_FAMILY(min) && 8437 _X86_CHIPREV_REV(cr) >= _X86_CHIPREV_REV(min)); 8438 } 8439 8440 /* 8441 * The uarch functions operate in a manner similar to the chiprev functions 8442 * above. 
/*
 * The uarch functions operate in a manner similar to the chiprev functions
 * above. While it is tempting to allow these to operate on
 * microarchitectures produced by a specific vendor in an ordered fashion
 * (e.g., ZEN3 is "newer" than ZEN2), we elect not to do so because a
 * manufacturer may supply processors of multiple different microarchitecture
 * families, each of which may be internally ordered but unordered with
 * respect to those of other families.
 */
x86_uarch_t
uarchrev_uarch(const x86_uarchrev_t ur)
{
	return ((x86_uarch_t)_X86_UARCHREV_UARCH(ur));
}

boolean_t
uarchrev_matches(const x86_uarchrev_t ur, const x86_uarchrev_t template)
{
	return (_X86_UARCHREV_VENDOR(ur) == _X86_UARCHREV_VENDOR(template) &&
	    _X86_UARCHREV_UARCH(ur) == _X86_UARCHREV_UARCH(template) &&
	    (_X86_UARCHREV_REV(ur) & _X86_UARCHREV_REV(template)) != 0);
}

boolean_t
uarchrev_at_least(const x86_uarchrev_t ur, const x86_uarchrev_t min)
{
	return (_X86_UARCHREV_VENDOR(ur) == _X86_UARCHREV_VENDOR(min) &&
	    _X86_UARCHREV_UARCH(ur) == _X86_UARCHREV_UARCH(min) &&
	    _X86_UARCHREV_REV(ur) >= _X86_UARCHREV_REV(min));
}

/*
 * Cache topology related information. This is yet another cache interface
 * that we're exposing; it is intended to be used when we have either Intel
 * Leaf 4 or AMD Leaf 0x8000001D (introduced with Zen 1).
 */
static boolean_t
cpuid_cache_topo_sup(const struct cpuid_info *cpi)
{
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (cpi->cpi_maxeax >= 4) {
			return (B_TRUE);
		}
		break;
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1d &&
		    is_x86_feature(x86_featureset, X86FSET_TOPOEXT)) {
			return (B_TRUE);
		}
		break;
	default:
		break;
	}

	return (B_FALSE);
}

int
cpuid_getncaches(struct cpu *cpu, uint32_t *ncache)
{
	const struct cpuid_info *cpi;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC));
	cpi = cpu->cpu_m.mcpu_cpi;

	if (!cpuid_cache_topo_sup(cpi)) {
		return (ENOTSUP);
	}

	*ncache = cpi->cpi_cache_leaf_size;
	return (0);
}

int
cpuid_getcache(struct cpu *cpu, uint32_t cno, x86_cache_t *cache)
{
	const struct cpuid_info *cpi;
	const struct cpuid_regs *cp;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC));
	cpi = cpu->cpu_m.mcpu_cpi;

	if (!cpuid_cache_topo_sup(cpi)) {
		return (ENOTSUP);
	}

	if (cno >= cpi->cpi_cache_leaf_size) {
		return (EINVAL);
	}

	bzero(cache, sizeof (x86_cache_t));
	cp = cpi->cpi_cache_leaves[cno];
	switch (CPI_CACHE_TYPE(cp)) {
	case CPI_CACHE_TYPE_DATA:
		cache->xc_type = X86_CACHE_TYPE_DATA;
		break;
	case CPI_CACHE_TYPE_INSTR:
		cache->xc_type = X86_CACHE_TYPE_INST;
		break;
	case CPI_CACHE_TYPE_UNIFIED:
		cache->xc_type = X86_CACHE_TYPE_UNIFIED;
		break;
	case CPI_CACHE_TYPE_DONE:
	default:
		return (EINVAL);
	}
	cache->xc_level = CPI_CACHE_LVL(cp);
	if (CPI_FULL_ASSOC_CACHE(cp) != 0) {
		cache->xc_flags |= X86_CACHE_F_FULL_ASSOC;
	}
	cache->xc_nparts = CPI_CACHE_PARTS(cp) + 1;
	/*
	 * The number of sets is reserved on AMD if the CPU is tagged as fully
	 * associative, whereas it is considered valid on Intel.
	 */
	if (cpi->cpi_vendor == X86_VENDOR_AMD &&
	    CPI_FULL_ASSOC_CACHE(cp) != 0) {
		cache->xc_nsets = 1;
	} else {
		cache->xc_nsets = CPI_CACHE_SETS(cp) + 1;
	}
	cache->xc_nways = CPI_CACHE_WAYS(cp) + 1;
	cache->xc_line_size = CPI_CACHE_COH_LN_SZ(cp) + 1;
	cache->xc_size = cache->xc_nparts * cache->xc_nsets * cache->xc_nways *
	    cache->xc_line_size;
	/*
	 * We're looking for the number of bits needed to cover the number of
	 * logical CPUs that share this cache. Normally we would subtract one
	 * from that count before taking highbit(); however, the CPUID field
	 * is already encoded as the count minus one, so we can use it as is.
	 */
	cache->xc_apic_shift = highbit(CPI_NTHR_SHR_CACHE(cp));

	/*
	 * To construct a unique ID we construct a uint64_t that looks as
	 * follows:
	 *
	 *	[47:40]	cache level
	 *	[39:32]	CPUID cache type
	 *	[31:00]	shifted APIC ID
	 *
	 * The shifted APIC ID guarantees that a given cache entry is unique
	 * among its peers. The cache level and type keep the different caches
	 * within a single CPU distinct from one another: if we used only the
	 * shifted APIC ID, the L1I, L1D, L2, and L3 would all end up with an
	 * ID of zero.
	 *
	 * The format of this ID is private to the system and can change
	 * across a reboot for the time being.
	 */
	cache->xc_id = (uint64_t)cache->xc_level << 40;
	cache->xc_id |= (uint64_t)cache->xc_type << 32;
	cache->xc_id |= (uint64_t)cpi->cpi_apicid >> cache->xc_apic_shift;

	return (0);
}
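/*
 * Illustrative sketch, not part of the interface above: a consumer of the
 * cache topology interface sizes its loop with cpuid_getncaches() and then
 * fetches each entry with cpuid_getcache(). Both the function itself and the
 * example_visit callback are hypothetical; the callback stands in for
 * whatever a real caller would do with each x86_cache_t.
 */
static int
example_walk_caches(struct cpu *cp,
    void (*example_visit)(const x86_cache_t *))
{
	uint32_t ncache, i;
	int ret;

	/* ENOTSUP here means the CPU lacks the required cache leaves. */
	if ((ret = cpuid_getncaches(cp, &ncache)) != 0)
		return (ret);

	for (i = 0; i < ncache; i++) {
		x86_cache_t cache;

		/* EINVAL here would indicate a leaf of an unexpected type. */
		if ((ret = cpuid_getcache(cp, i, &cache)) != 0)
			return (ret);
		example_visit(&cache);
	}

	return (0);
}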