/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2011, 2016 by Delphix. All rights reserved.
 * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
 * Copyright 2014 Josef "Jeff" Sipek <jeffpc@josefsipek.net>
 * Copyright 2020 Joyent, Inc.
 * Copyright 2023 Oxide Computer Company
 * Copyright 2024 MNX Cloud, Inc.
 */
/*
 * Copyright (c) 2010, Intel Corporation.
 * All rights reserved.
 */
/*
 * Portions Copyright 2009 Advanced Micro Devices, Inc.
 */

/*
 * CPU Identification logic
 *
 * The purpose of this file and its companion, cpuid_subr.c, is to help deal
 * with the identification of CPUs, their features, and their topologies. More
 * specifically, this file helps drive the following:
 *
 * 1. Enumeration of features of the processor which are used by the kernel to
 *    determine what features to enable or disable. These may be instruction
 *    set enhancements or features that we use.
 *
 * 2. Enumeration of instruction set architecture (ISA) additions that userland
 *    will be told about through the auxiliary vector.
 *
 * 3. Understanding the physical topology of the CPU such as the number of
 *    caches, how many cores it has, whether or not it supports simultaneous
 *    multi-threading (SMT), etc.
 *
 * ------------------------
 * CPUID History and Basics
 * ------------------------
 *
 * The cpuid instruction was added by Intel roughly around the time that the
 * original Pentium was introduced. The purpose of cpuid was to provide, in a
 * programmatic fashion, information about the CPU that previously had to be
 * guessed at. For example, an important part of cpuid is that we can know what
 * extensions to the ISA exist. If you use an invalid opcode you would get a
 * #UD, so this method allows a program (whether a user program or the kernel)
 * to determine what exists without crashing or getting a SIGILL. Of course,
 * this was also during the era of the clones and the AMD Am5x86. The vendor
 * name shows up first in cpuid for a reason.
 *
 * cpuid information is broken down into groups, each of which is called a
 * 'leaf'. Each leaf puts unique values into the registers %eax, %ebx, %ecx,
 * and %edx and each leaf has its own meaning. The different leaves are broken
 * down into different regions:
 *
 *      [ 0, 7fffffff ]                 This region is called the 'basic'
 *                                      region. This region is generally defined
 *                                      by Intel, though some of the original
 *                                      portions have different meanings based
 *                                      on the manufacturer. These days, Intel
 *                                      adds most new features to this region.
 *                                      AMD adds non-Intel compatible
 *                                      information in the third, extended
 *                                      region. Intel uses this for everything
 *                                      including ISA extensions, CPU
 *                                      features, cache information, topology,
 *                                      and more.
 *
 *                                      There is a hole carved out of this
 *                                      region which is reserved for
 *                                      hypervisors.
 *
 *      [ 40000000, 4fffffff ]          This region, which is found in the
 *                                      middle of the previous region, is
 *                                      explicitly promised to never be used by
 *                                      CPUs. Instead, it is used by hypervisors
 *                                      to communicate information about
 *                                      themselves to the operating system. The
 *                                      values and details are unique for each
 *                                      hypervisor.
 *
 *      [ 80000000, ffffffff ]          This region is called the 'extended'
 *                                      region. Some of the low leaves mirror
 *                                      parts of the basic leaves. This region
 *                                      has generally been used by AMD for
 *                                      various extensions. For example, AMD-
 *                                      specific information about caches,
 *                                      features, and topology are found in this
 *                                      region.
 *
 * To query a leaf, you place the desired leaf into %eax, zero %ebx, %ecx, and
 * %edx, and then issue the cpuid instruction. At the first leaf in each of
 * the ranges, one of the primary things returned is the maximum valid leaf in
 * that range. This allows for discovery of what range of CPUID is valid.
 *
 * The CPUs have potentially surprising behavior when using an invalid leaf or
 * unimplemented leaf. If the requested leaf is within the valid basic or
 * extended range, but is unimplemented, then %eax, %ebx, %ecx, and %edx will be
 * set to zero. However, if you specify a leaf that is outside of a valid range,
 * then the registers will instead be filled with the contents of the last
 * valid _basic_ leaf. For example, if the maximum basic value is on leaf 0x3,
 * then issuing a cpuid for leaf 4 or an invalid extended leaf will return the
 * information for leaf 3.
 *
 * Some leaves are broken down into sub-leaves. This means that the value
 * depends on both the leaf asked for in %eax and a secondary register. For
 * example, Intel uses the value in %ecx on leaf 7 to indicate a sub-leaf to get
 * additional information. Or when getting topology information in leaf 0xb, the
 * initial value in %ecx changes which level of the topology you are getting
 * information about.
 *
 * cpuid values are always kept to 32 bits regardless of whether or not the
 * program is in 64-bit mode. When executing in 64-bit mode, the upper
 * 32 bits of the register are always set to zero so that the values are the
 * same regardless of execution mode.
 *
 * ----------------------
 * Identifying Processors
 * ----------------------
 *
 * We can identify a processor in two steps. The first step looks at cpuid leaf
 * 0. Leaf 0 contains the processor's vendor information. This is done by
 * putting a 12 character string in %ebx, %edx, and %ecx, in that order.
 * On AMD, it is 'AuthenticAMD' and on Intel it is 'GenuineIntel'.
 *
 * From there, a processor is identified by a combination of three different
 * values:
 *
 *  1. Family
 *  2. Model
 *  3. Stepping
 *
 * Each vendor uses the family and model to uniquely identify a processor. The
 * way that family and model are changed depends on the vendor. For example,
 * Intel has been using family 0x6 for almost all of their processors since the
 * Pentium Pro/Pentium II era, often called the P6. The model is used to
 * identify the exact processor. Different models are often used for the client
 * (consumer) and server parts. Even though each processor often has major
 * architectural differences, they still are considered the same family by
 * Intel.
 *
 * On the other hand, each major AMD architecture generally has its own family.
 * For example, the K8 is family 0xf, Bulldozer 0x15, and Zen 0x17. Within it
 * the model number is used to help identify specific processors. As AMD's
 * product lines have expanded, they have started putting a mixed bag of
 * processors into the same family, with each processor under a single
 * identifying banner (e.g., Milan, Cezanne) using a range of model numbers. We
 * refer to each such collection as a processor family, distinct from cpuid
 * family. Importantly, each processor family has a BIOS and Kernel Developer's
 * Guide (BKDG, older parts) or Processor Programming Reference (PPR) that
 * defines the processor family's non-architectural features. In general, we'll
 * use "family" here to mean the family number reported by the cpuid instruction
 * and distinguish the processor family from it where appropriate.
 *
 * The stepping is used to refer to a revision of a specific microprocessor. The
 * term comes from equipment used to produce masks that are used to create
 * integrated circuits.
 *
 * The information is present in leaf 1, %eax. In technical documentation you
 * will see the terms extended model and extended family. The original family,
 * model, and stepping fields were each 4 bits wide. If the base family is 0xf,
 * then one is to consult the extended family and extended model fields, which
 * occupy previously reserved bits. The extended family is added to the base
 * family, while the extended model supplies the upper four bits of the model.
 * Intel also uses the extended model when the base family is 0x6.
 *
 * When we process this information, we store the full family, model, and
 * stepping in the struct cpuid_info members cpi_family, cpi_model, and
 * cpi_step, respectively. Whenever you are performing comparisons with the
 * family, model, and stepping, you should use these members and not the raw
 * values from cpuid. If you must use the raw values from cpuid directly, you
 * must make sure that you combine the extended model and family with the base
 * model and family.
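 *
 * The two identification steps above can be sketched as follows. This is a
 * minimal, illustrative sketch and not the code used in this file: the
 * sketch_* names are hypothetical, the register values are passed in rather
 * than read with the cpuid instruction, and the extended model rule follows
 * Intel's documentation (AMD only defines the extended model for base family
 * 0xf).
 *
```c
#include <stdint.h>

// Hypothetical helpers for illustration only; none of these names exist in
// this file. Layout of cpuid leaf 1 %eax:
//   [3:0] stepping, [7:4] base model, [11:8] base family,
//   [19:16] extended model, [27:20] extended family.

// Step 1: assemble the 12 character vendor string from leaf 0. The bytes are
// stored in %ebx, then %edx, then %ecx, least significant byte first.
static void
sketch_cpuid_vendor(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
	const uint32_t regs[3] = { ebx, edx, ecx };

	for (int r = 0; r < 3; r++) {
		for (int b = 0; b < 4; b++)
			out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
	}
	out[12] = '\0';
}

// Step 2: the full family only includes the extended family when the base
// family field is saturated at 0xf.
static uint32_t
sketch_cpuid_family(uint32_t eax)
{
	uint32_t base = (eax >> 8) & 0xf;
	uint32_t ext = (eax >> 20) & 0xff;

	return (base == 0xf ? base + ext : base);
}

// Assumption: use the extended model for base family 0x6 or 0xf, which is
// the Intel rule. It supplies the upper four bits of the model.
static uint32_t
sketch_cpuid_model(uint32_t eax)
{
	uint32_t family = (eax >> 8) & 0xf;
	uint32_t base = (eax >> 4) & 0xf;
	uint32_t ext = (eax >> 16) & 0xf;

	if (family == 0x6 || family == 0xf)
		return ((ext << 4) | base);
	return (base);
}

static uint32_t
sketch_cpuid_stepping(uint32_t eax)
{
	return (eax & 0xf);
}
```
 *
 * For instance, a leaf 1 %eax value of 0x00800f11 decodes with these helpers
 * to family 0x17, model 0x1, stepping 0x1.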
 *
 * In general, we do not use information about the family, model, and stepping
 * to determine whether or not a feature is present; that is generally driven by
 * specific leaves. However, when something we care about on the processor is
 * not considered 'architectural', meaning that it is specific to a set of
 * processors and not promised in the architecture model to be consistent from
 * generation to generation, then we will fall back on this information. The
 * most common cases where this comes up are when we have to work around errata
 * in the processor, are dealing with processor-specific features such as CPU
 * performance counters, or want to provide additional information for things
 * such as fault management.
 *
 * While processors also have a brand string, which is the name that people
 * are familiar with when buying the processor, it is not meant for
 * programmatic consumption. That is what the family, model, and stepping are
 * for.
 *
 * We use the x86_chiprev_t to encode a combination of vendor, processor family,
 * and stepping(s) that refer to a single or very closely related set of silicon
 * implementations. While there are sometimes more specific ways to learn of the
 * presence or absence of a particular erratum or workaround, one may generally
 * assume that all processors of the same chiprev have the same errata. We have
 * chosen to represent them this way precisely because that is how AMD groups
 * them in their revision guides (errata documentation). The processor family
 * (x86_processor_family_t) may be extracted from the chiprev if that level of
 * detail is not needed. Processor families are considered unordered, but
 * revisions within a family may be compared either for an exact match or for
 * being at least as recent as a reference revision. See the chiprev_xxx()
 * functions below.
 *
 * Similarly, each processor family implements a particular microarchitecture,
 * which itself may have multiple revisions. In general, non-architectural
 * features are specific to a processor family, but some may exist across
 * families containing cores that implement the same microarchitectural revision
 * (and such cores share common bugs, too). We provide utility routines
 * analogous to those for extracting and comparing chiprevs for
 * microarchitectures as well; see the uarch_xxx() functions.
 *
 * Both chiprevs and uarchrevs are defined in x86_archext.h and both are at
 * present used and available only for AMD and AMD-like processors.
 *
 * ------------
 * CPUID Passes
 * ------------
 *
 * As part of performing feature detection, we break this into several different
 * passes. There used to be a pass 0 that was done from assembly in locore.s to
 * support processors that have a missing or broken cpuid instruction (notably
 * certain Cyrix processors) but those were all 32-bit processors which are no
 * longer supported. Passes are no longer numbered explicitly to make it easier
 * to break them up or move them around as needed; however, they still have a
 * well-defined execution ordering enforced by the definition of cpuid_pass_t in
 * x86_archext.h. The external interface to execute a cpuid pass or determine
 * whether a pass has been completed consists of cpuid_execpass() and
 * cpuid_checkpass() respectively. The passes now, in that execution order,
 * are as follows:
 *
 *      PRELUDE         This pass does not have any dependencies on system
 *                      setup; in particular, unlike all subsequent passes it is
 *                      guaranteed not to require PCI config space access. It
 *                      sets the flag indicating that the processor we are
 *                      running on supports the cpuid instruction, which all
 *                      64-bit processors do. This would also be the place to
 *                      add any other basic state that is required later on and
 *                      can be learned without dependencies.
 *
 *      IDENT           Determine which vendor manufactured the CPU, the family,
 *                      model, and stepping information, and compute basic
 *                      identifying tags from those values. This is done first
 *                      so that machine-dependent code can control the features
 *                      the cpuid instruction will report during subsequent
 *                      passes if needed, and so that any intervening
 *                      machine-dependent code that needs basic identity will
 *                      have it available. This includes synthesised
 *                      identifiers such as chiprev and uarchrev as well as the
 *                      values obtained directly from cpuid. Prior to executing
 *                      this pass, machine-dependent boot code is responsible
 *                      for ensuring that the PCI configuration space access
 *                      functions have been set up and, if necessary, that
 *                      determine_platform() has been called.
 *
 *      BASIC           This is the primary pass and is responsible for doing a
 *                      large number of different things:
 *
 *                      1. Gathering a large number of feature flags to
 *                      determine which features the CPU supports and which
 *                      indicate things that we need to do other work in the OS
 *                      to enable. Features detected this way are added to the
 *                      x86_featureset which can be queried to
 *                      determine what we should do. This includes processing
 *                      all of the basic and extended CPU features that we care
 *                      about.
 *
 *                      2. Determining the CPU's topology. This includes
 *                      information about how many cores and threads are present
 *                      in the package. It also is responsible for figuring out
 *                      which logical CPUs are potentially part of the same core
 *                      and what other resources they might share. For more
 *                      information see the 'Topology' section.
 *
 *                      3. Determining the set of CPU security-specific features
 *                      that we need to worry about and determine the
 *                      appropriate set of workarounds.
 *
 *                      On the boot CPU, this pass occurs before KMDB is started.
 *
 *      EXTENDED        This pass is done after startup(). Here, we check
 *                      other miscellaneous features. Most of this is gathering
 *                      additional basic and extended features that we'll use in
 *                      later passes or for debugging support.
 *
 *      DYNAMIC         This pass occurs after the kernel memory allocator
 *                      has been fully initialized. This gathers information
 *                      where we might need dynamic memory available for our
 *                      uses. This includes several varying width leaves that
 *                      have cache information and the processor's brand string.
 *
 *      RESOLVE         The final normal pass is performed after the kernel has
 *                      brought most everything online. This is invoked from
 *                      post_startup(). In this pass, we go through the set of
 *                      features that we have enabled and turn that into the
 *                      hardware auxiliary vector features that userland
 *                      receives. This is used by userland, primarily by the
 *                      run-time link-editor (RTLD), though userland software
 *                      could also refer to it directly.
 *
 * The function that performs a pass is currently assumed to be infallible, and
 * all existing implementations are. This simplifies callers by allowing
 * cpuid_execpass() to return void. Similarly, implementers do not need to check
 * for a NULL CPU argument; the current CPU's cpu_t is substituted if necessary.
 * Both of these assumptions can be relaxed if needed by future developments.
 * Tracking of completed states is handled by cpuid_execpass(). It is programmer
 * error to attempt to execute a pass before all previous passes have been
 * completed on the specified CPU, or to request cpuid information before the
 * pass that captures it has been executed. These conditions can be tested
 * using cpuid_checkpass().
 *
 * The Microcode Pass
 *
 * After a microcode update, we do a selective rescan of the cpuid leaves to
 * determine what features have changed. Microcode updates can provide more
 * details about security related features to deal with issues like Spectre and
 * L1TF. On occasion, vendors have violated their contract and removed bits.
 * However, we don't try to detect that because that puts us in a situation that
 * we really can't deal with. As such, the only things we rescan today are
 * security related features. See cpuid_pass_ucode(). This pass may be run in a
 * different sequence on APs and therefore is not part of the sequential order;
 * it is invoked directly instead of by cpuid_execpass() and its completion
 * status cannot be checked by cpuid_checkpass(). This could be integrated with
 * a more complex dependency mechanism if warranted by future developments.
 *
 * All of the passes are run on all CPUs. However, for the most part we only
 * care about what the boot CPU says about this information and use the other
 * CPUs as a rough guide to sanity check that we have the same feature set.
 *
 * We do not support running multiple logical CPUs with different, let alone
 * disjoint, feature sets.
 *
 * ------------------
 * Processor Topology
 * ------------------
 *
 * One of the important things that we need to do is to understand the topology
 * of the underlying processor. When we say topology in this case, we're trying
 * to understand the relationship between the logical CPUs that the operating
 * system sees and the underlying physical layout. Different logical CPUs may
 * share different resources which can have important consequences for the
 * performance of the system. For example, they may share caches, execution
 * units, and more.
 *
 * The topology of the processor changes from generation to generation and
 * vendor to vendor. Along with that, different vendors use different
 * terminology, and the operating system itself uses occasionally overlapping
 * terminology. It's important to understand what this topology looks like so
 * one can understand the different things that we try to calculate and
 * determine.
 *
 * To get started, let's talk about a little bit of terminology that we've used
 * so far, is used throughout this file, and is fairly generic across multiple
 * vendors:
 *
 * CPU
 *
 *      A central processing unit (CPU) refers to a logical and/or virtual
 *      entity that the operating system can execute instructions on. The
 *      underlying resources for this CPU may be shared between multiple
 *      entities; however, to the operating system it is a discrete unit.
 *
 * PROCESSOR and PACKAGE
 *
 *      Generally, when we use the term 'processor' on its own, we are referring
 *      to the physical entity that one buys and plugs into a board. However,
 *      because processor has been overloaded and one might see it used to mean
 *      multiple different levels, we will instead use the term 'package' for
 *      the rest of this file. The term package comes from the electrical
 *      engineering side and refers to the physical entity that encloses the
 *      electronics inside. Strictly speaking the package can contain more than
 *      just the CPU, for example, on many processors it may also have what's
 *      called an 'integrated graphical processing unit (GPU)'. Because the
 *      package can encapsulate multiple units, it is the largest physical unit
 *      that we refer to.
 *
 * SOCKET
 *
 *      A socket refers to a unit on a system board (generally the motherboard)
 *      that can receive a package. A single package, or processor, is plugged
 *      into a single socket. A system may have multiple sockets. Oftentimes,
 *      the term socket is used interchangeably with package and refers to the
 *      electrical component that has been plugged in, and not the receptacle
 *      itself.
 *
 * CORE
 *
 *      A core refers to the physical instantiation of a CPU, generally, with a
 *      full set of hardware resources available to it. A package may contain
 *      multiple cores inside of it or it may just have a single one. A
 *      processor with more than one core is often referred to as 'multi-core'.
 *      In illumos, we will use the feature X86FSET_CMP to refer to a system
 *      that has 'multi-core' processors.
 *
 *      A core may expose a single logical CPU to the operating system, or it
 *      may expose multiple CPUs, which we call threads, defined below.
 *
 *      Some resources may still be shared by cores in the same package. For
 *      example, many processors will share the level 3 cache between cores.
 *      Some AMD generations share hardware resources between cores. For more
 *      information on that see the section 'AMD Topology'.
 *
 * THREAD and STRAND
 *
 *      In this file, generally a thread refers to a hardware resource and not
 *      the operating system's logical abstraction. A thread is always exposed
 *      as an independent logical CPU to the operating system. A thread belongs
 *      to a specific core. A core may have more than one thread. When that is
 *      the case, the threads that are part of the same core are often referred
 *      to as 'siblings'.
 *
 *      When multiple threads exist, this is generally referred to as
 *      simultaneous multi-threading (SMT). When Intel introduced this in their
 *      processors they called it hyper-threading (HT). When multiple threads
 *      are active in a core, they split the resources of the core. For example,
 *      two threads may share the same set of hardware execution units.
 *
 *      The operating system often uses the term 'strand' to refer to a thread.
 *      This helps disambiguate it from the software concept.
 *
 * CHIP
 *
 *      Unfortunately, the term 'chip' is dramatically overloaded. At its most
 *      base meaning, it is used to refer to a single integrated circuit, which
 *      may or may not be the only thing in the package. In illumos, when you
 *      see the term 'chip' it is almost always referring to the same thing as
 *      the 'package'. However, many vendors may use chip to refer to one of
 *      many integrated circuits that have been placed in the package. As an
 *      example, see the subsequent definition.
 *
 *      To try and keep things consistent, we will only use chip when referring
 *      to the entire integrated circuit package, with the exception of the
 *      definition of multi-chip module (because it is in the name) and use the
 *      term 'die' when we want the more general, potential sub-component
 *      definition.
 *
 * DIE
 *
 *      A die refers to an integrated circuit. Inside of the package there may
 *      be a single die or multiple dies. This is sometimes called a 'chip' in
 *      vendor's parlance, but in this file, we use the term die to refer to a
 *      subcomponent.
 *
 * MULTI-CHIP MODULE
 *
 *      A multi-chip module (MCM) refers to putting multiple distinct chips that
 *      are connected together in the same package. When a multi-chip design is
 *      used, generally each chip is manufactured independently and then joined
 *      together in the package. For example, on AMD's Zen microarchitecture
 *      (family 0x17), the package contains several dies (the second meaning of
 *      chip from above) that are connected together.
 *
 * CACHE
 *
 *      A cache is a part of the processor that maintains copies of recently
 *      accessed memory. Caches are split into levels and then into types.
 *      Commonly there are one to three levels, called level one, two, and
 *      three. The lower the level, the smaller it is, the closer it is to the
 *      execution units of the CPU, and the faster it is to access. The layout
 *      and design of the cache come in many different flavors; consult other
 *      resources for a discussion of those.
 *
 *      Caches are generally split into two types, the instruction and data
 *      cache. The caches contain what their names suggest: the instruction
 *      cache has executable program text, while the data cache has all other
 *      memory that the processor accesses. As of this writing, data is kept
 *      coherent between all of the caches on x86, so if one modifies program
 *      text before it is executed, that will be in the data cache, and the
 *      instruction cache will be synchronized with that change when the
 *      processor actually executes those instructions. This coherency also
 *      covers the fact that data could show up in multiple caches.
 *
 *      Generally, the lowest level caches are specific to a core. However, the
 *      last level cache is shared between some number of cores. The number of
 *      CPUs sharing this last level cache is important. This has implications
 *      for the choices that the scheduler makes, as accessing memory that might
 *      be in a remote cache after thread migration can be quite expensive.
 *
 *      Sometimes, the word cache is abbreviated with a '$', because in US
 *      English the word cache is pronounced the same as cash. So L1D$ refers to
 *      the L1 data cache, and L2$ would be the L2 cache. This will not be used
 *      in the rest of this theory statement for clarity.
 *
 * MEMORY CONTROLLER
 *
 *      The memory controller is a component that provides access to DRAM. Each
 *      memory controller can access a set number of DRAM channels. Each channel
 *      can have a number of DIMMs (sticks of memory) associated with it. A
 *      given package may have more than one memory controller. The association
 *      of the memory controller to a group of cores is important as it is
 *      cheaper to access memory on the controller that you are associated with.
 *
 * NUMA
 *
 *      NUMA, or non-uniform memory access, describes a way that systems are
 *      built. On x86, any processor core can address all of the memory in the
 *      system. However, when using multiple sockets or possibly within a
 *      multi-chip module, some of that memory is physically closer and some of
 *      it is further. Memory that is further away is more expensive to access.
 *      Consider the following image of multiple sockets with memory:
 *
 *      +--------+                                                +--------+
 *      | DIMM A |         +----------+      +----------+         | DIMM D |
 *      +--------+-+       |          |      |          |       +-+------+-+
 *        | DIMM B |=======| Socket 0 |======| Socket 1 |=======| DIMM E |
 *      +--------+-+       |          |      |          |       +-+------+-+
 *      | DIMM C |         +----------+      +----------+         | DIMM F |
 *      +--------+                                                +--------+
 *
 *      In this example, Socket 0 is closer to DIMMs A-C while Socket 1 is
 *      closer to DIMMs D-F. This means that it is cheaper for socket 0 to
 *      access DIMMs A-C and more expensive to access D-F as it has to go
 *      through Socket 1 to get there. The inverse is true for Socket 1. DIMMs
 *      D-F are cheaper than A-C. While the socket form is the most common, when
 *      using multi-chip modules, this can also sometimes occur. For another
 *      example of this that's more involved, see the AMD topology section.
 *
 *
 * Intel Topology
 * --------------
 *
 * Most Intel processors since Nehalem (as of this writing, the current
 * generation is Skylake / Cannon Lake) follow a fairly similar pattern. The CPU
 * portion of the package is a single monolithic die. MCMs currently aren't
 * used. Most parts have three levels of caches, with the L3 cache being shared
 * between all of the cores on the package. The L1/L2 cache is generally
 * specific to an individual core. The following image shows at a simplified
 * level what this looks like. The memory controller is commonly part of
 * something called the 'Uncore', whose functions used to be provided by
 * separate physical chips outside of the package, but are now part of the same
 * chip.
 *
 * +-----------------------------------------------------------------------+
 * | Package                                                               |
 * |  +-------------------+  +-------------------+  +-------------------+  |
 * |  | Core              |  | Core              |  | Core              |  |
 * |  |  +--------+ +---+ |  |  +--------+ +---+ |  |  +--------+ +---+ |  |
 * |  |  | Thread | | L | |  |  | Thread | | L | |  |  | Thread | | L | |  |
 * |  |  +--------+ | 1 | |  |  +--------+ | 1 | |  |  +--------+ | 1 | |  |
 * |  |  +--------+ |   | |  |  +--------+ |   | |  |  +--------+ |   | |  |
 * |  |  | Thread | |   | |  |  | Thread | |   | |  |  | Thread | |   | |  |
 * |  |  +--------+ +---+ |  |  +--------+ +---+ |  |  +--------+ +---+ |  |
 * |  |  +--------------+ |  |  +--------------+ |  |  +--------------+ |  |
 * |  |  | L2 Cache     | |  |  | L2 Cache     | |  |  | L2 Cache     | |  |
 * |  |  +--------------+ |  |  +--------------+ |  |  +--------------+ |  |
 * |  +-------------------+  +-------------------+  +-------------------+  |
 * | +-------------------------------------------------------------------+ |
 * | |                         Shared L3 Cache                           | |
 * | +-------------------------------------------------------------------+ |
 * | +-------------------------------------------------------------------+ |
 * | |                        Memory Controller                          | |
 * | +-------------------------------------------------------------------+ |
 * +-----------------------------------------------------------------------+
 *
 * A side effect of this current architecture is that what we care about from a
 * scheduling and topology perspective is simplified. In general we care about
 * understanding which logical CPUs are part of the same core and socket.
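 *
 * One common way to work out which logical CPUs share a core or a package is
 * to split a per-CPU hardware identifier, such as the APIC ID, into bit
 * fields. The following is a minimal sketch under stated assumptions and not
 * the logic used in this file: the sketch_* names are hypothetical, and the
 * two shift widths are assumed to have been obtained elsewhere (for example,
 * from a cpuid topology leaf) and to be less than 32.
 *
```c
#include <stdint.h>

// Hypothetical type for illustration only; it does not exist in this file.
typedef struct sketch_topo {
	uint32_t st_pkgid;	// identical for all logical CPUs in a package
	uint32_t st_coreid;	// core number within the package
	uint32_t st_threadid;	// thread number within the core
} sketch_topo_t;

// smt_shift:  number of low ID bits that select a thread within a core.
// core_shift: number of low ID bits that select a logical CPU within a
//             package; assumed to be >= smt_shift and < 32.
static sketch_topo_t
sketch_apicid_split(uint32_t apicid, uint32_t smt_shift, uint32_t core_shift)
{
	sketch_topo_t t;
	uint32_t smt_mask = (1u << smt_shift) - 1;
	uint32_t pkg_mask = (1u << core_shift) - 1;

	t.st_threadid = apicid & smt_mask;
	t.st_coreid = (apicid & pkg_mask) >> smt_shift;
	t.st_pkgid = apicid >> core_shift;
	return (t);
}

// Two logical CPUs are siblings on the same core when everything above the
// thread bits of their IDs matches.
static int
sketch_same_core(uint32_t a, uint32_t b, uint32_t smt_shift)
{
	return ((a >> smt_shift) == (b >> smt_shift));
}
```
 *
 * With one thread bit and four bits per package, an ID of 0x1b would split
 * into package 1, core 5, thread 1 under this sketch.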
 *
 * To determine the relationship between threads and cores, Intel initially used
 * the identifier in the advanced programmable interrupt controller (APIC). They
 * also added cpuid leaf 4 to give additional information about the number of
 * threads and CPUs in the processor. With the addition of x2apic (which
 * increased the width of a logical CPU's identifier from 8 bits to 32 bits),
 * an additional cpuid topology leaf 0xB was added.
 *
 * AMD Topology
 * ------------
 *
 * When discussing AMD topology, we want to break this into three distinct
 * generations of topology. There's the basic topology that has been used in
 * family 0xf+ (Opteron, Athlon64), there's the topology that was introduced
 * with family 0x15 (Bulldozer), and there's the topology that was introduced
 * with family 0x17 (Zen), evolved more dramatically in Zen 2 (still family
 * 0x17), and tweaked slightly in Zen 3 (family 0x19). AMD also has some
 * additional terminology that's worth talking about.
 *
 * Until the introduction of family 0x17 (Zen), AMD did not implement something
 * that they considered SMT. Whether or not the AMD processors have SMT
 * influences many things including scheduling and reliability, availability,
 * and serviceability (RAS) features.
 *
 * NODE
 *
 *      AMD uses the term node to refer to a die that contains a number of cores
 *      and I/O resources. Depending on the processor family and model, more
 *      than one node can be present in the package. When there is more than one
 *      node this indicates a multi-chip module. Usually each node has its own
 *      access to memory and I/O devices. This is important and generally
 *      different from the corresponding Intel Nehalem-Skylake+ processors. As a
 *      result, we track this relationship in the operating system.
603 * 604 * In processors with an L3 cache, the L3 cache is generally shared across 605 * the entire node, though the way this is carved up varies from generation 606 * to generation. 607 * 608 * BULLDOZER 609 * 610 * Starting with the Bulldozer family (0x15) and continuing until the 611 * introduction of the Zen microarchitecture, AMD introduced the idea of a 612 * compute unit. In a compute unit, two traditional cores share a number of 613 * hardware resources. Critically, they share the FPU, L1 instruction 614 * cache, and the L2 cache. Several compute units were then combined inside 615 * of a single node. Because the integer execution units, L1 data cache, 616 * and some other resources were not shared between the cores, AMD never 617 * considered this to be SMT. 618 * 619 * ZEN 620 * 621 * The Zen family (0x17) uses a multi-chip module (MCM) design, the module 622 * is called Zeppelin. These modules are similar to the idea of nodes used 623 * previously. Each of these nodes has two DRAM channels which all of the 624 * cores in the node can access uniformly. These nodes are linked together 625 * in the package, creating a NUMA environment. 626 * 627 * The Zeppelin die itself contains two different 'core complexes'. Each 628 * core complex consists of four cores which each have two threads, for a 629 * total of 8 logical CPUs per complex. Unlike other generations, 630 * where all the logical CPUs in a given node share the L3 cache, here each 631 * core complex has its own shared L3 cache. 632 * 633 * A further thing that we need to consider is that in some configurations, 634 * particularly with the Threadripper line of processors, not every die 635 * actually has its memory controllers wired up to actual memory channels. 636 * This means that some cores have memory attached to them and others 637 * don't. 
638 * 639 * To put Zen in perspective, consider the following images: 640 * 641 * +--------------------------------------------------------+ 642 * | Core Complex | 643 * | +-------------------+ +-------------------+ +---+ | 644 * | | Core +----+ | | Core +----+ | | | | 645 * | | +--------+ | L2 | | | +--------+ | L2 | | | | | 646 * | | | Thread | +----+ | | | Thread | +----+ | | | | 647 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | L | | 648 * | | | Thread | |L1| | | | Thread | |L1| | | 3 | | 649 * | | +--------+ +--+ | | +--------+ +--+ | | | | 650 * | +-------------------+ +-------------------+ | C | | 651 * | +-------------------+ +-------------------+ | a | | 652 * | | Core +----+ | | Core +----+ | | c | | 653 * | | +--------+ | L2 | | | +--------+ | L2 | | | h | | 654 * | | | Thread | +----+ | | | Thread | +----+ | | e | | 655 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | | | 656 * | | | Thread | |L1| | | | Thread | |L1| | | | | 657 * | | +--------+ +--+ | | +--------+ +--+ | | | | 658 * | +-------------------+ +-------------------+ +---+ | 659 * | | 660 * +--------------------------------------------------------+ 661 * 662 * This first image represents a single Zen core complex that consists of four 663 * cores. 664 * 665 * 666 * +--------------------------------------------------------+ 667 * | Zeppelin Die | 668 * | +--------------------------------------------------+ | 669 * | | I/O Units (PCIe, SATA, USB, etc.) 
| | 670 * | +--------------------------------------------------+ | 671 * | HH | 672 * | +-----------+ HH +-----------+ | 673 * | | | HH | | | 674 * | | Core |==========| Core | | 675 * | | Complex |==========| Complex | | 676 * | | | HH | | | 677 * | +-----------+ HH +-----------+ | 678 * | HH | 679 * | +--------------------------------------------------+ | 680 * | | Memory Controller | | 681 * | +--------------------------------------------------+ | 682 * | | 683 * +--------------------------------------------------------+ 684 * 685 * This image represents a single Zeppelin Die. Note how both cores are 686 * connected to the same memory controller and I/O units. While each core 687 * complex has its own L3 cache as seen in the first image, they both have 688 * uniform access to memory. 689 * 690 * 691 * PP PP 692 * PP PP 693 * +----------PP---------------------PP---------+ 694 * | PP PP | 695 * | +-----------+ +-----------+ | 696 * | | | | | | 697 * MMMMMMMMM| Zeppelin |==========| Zeppelin |MMMMMMMMM 698 * MMMMMMMMM| Die |==========| Die |MMMMMMMMM 699 * | | | | | | 700 * | +-----------+ooo ...+-----------+ | 701 * | HH ooo ... HH | 702 * | HH oo.. HH | 703 * | HH ..oo HH | 704 * | HH ... ooo HH | 705 * | +-----------+... ooo+-----------+ | 706 * | | | | | | 707 * MMMMMMMMM| Zeppelin |==========| Zeppelin |MMMMMMMMM 708 * MMMMMMMMM| Die |==========| Die |MMMMMMMMM 709 * | | | | | | 710 * | +-----------+ +-----------+ | 711 * | PP PP | 712 * +----------PP---------------------PP---------+ 713 * PP PP 714 * PP PP 715 * 716 * This image represents a single Zen package. In this example, it has four 717 * Zeppelin dies, though some configurations only have a single one. In this 718 * example, each die is directly connected to the next. Also, each die is 719 * represented as being connected to memory by the 'M' character and connected 720 * to PCIe devices and other I/O, by the 'P' character. 
Because each Zeppelin 721 * die is made up of two core complexes, we have multiple different NUMA 722 * domains that we care about for these systems. 723 * 724 * ZEN 2 725 * 726 * Zen 2 changes things in a dramatic way from Zen 1. Whereas in Zen 1 727 * each Zeppelin Die had its own I/O die, that has been moved out of the 728 * core complex in Zen 2. The actual core complex looks pretty similar, but 729 * now the die actually looks much simpler: 730 * 731 * +--------------------------------------------------------+ 732 * | Zen 2 Core Complex Die HH | 733 * | HH | 734 * | +-----------+ HH +-----------+ | 735 * | | | HH | | | 736 * | | Core |==========| Core | | 737 * | | Complex |==========| Complex | | 738 * | | | HH | | | 739 * | +-----------+ HH +-----------+ | 740 * | HH | 741 * | HH | 742 * +--------------------------------------------------------+ 743 * 744 * From here, when we add the central I/O die, this changes things a bit. 745 * Each die is connected to the I/O die, rather than trying to interconnect 746 * them directly. 
The following image takes the same Zen 1 image that we 747 * had earlier and shows what it looks like with the I/O die instead: 748 * 749 * PP PP 750 * PP PP 751 * +---------------------PP----PP---------------------+ 752 * | PP PP | 753 * | +-----------+ PP PP +-----------+ | 754 * | | | PP PP | | | 755 * | | Zen 2 | +-PP----PP-+ | Zen 2 | | 756 * | | Die _| | PP PP | |_ Die | | 757 * | | |o|oooo| |oooo|o| | | 758 * | +-----------+ | | +-----------+ | 759 * | | I/O | | 760 * MMMMMMMMMMMMMMMMMMMMMMMMMM Die MMMMMMMMMMMMMMMMMMMMMMMMMM 761 * MMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMM 762 * | | | | 763 * MMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMM 764 * MMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMM 765 * | | | | 766 * | +-----------+ | | +-----------+ | 767 * | | |o|oooo| PP PP |oooo|o| | | 768 * | | Zen 2 -| +-PP----PP-+ |- Zen 2 | | 769 * | | Die | PP PP | Die | | 770 * | | | PP PP | | | 771 * | +-----------+ PP PP +-----------+ | 772 * | PP PP | 773 * +---------------------PP----PP---------------------+ 774 * PP PP 775 * PP PP 776 * 777 * The above has four core complex dies installed, though the Zen 2 EPYC 778 * and ThreadRipper parts allow for up to eight, while the Ryzen parts 779 * generally only have one to two. The more notable difference here is how 780 * everything communicates. Note that memory and PCIe come out of the 781 * central die. This changes the way that one die accesses a resource. It 782 * basically always has to go to the I/O die, where as in Zen 1 it may have 783 * satisfied it locally. In general, this ends up being a better strategy 784 * for most things, though it is possible to still treat everything in four 785 * distinct NUMA domains with each Zen 2 die slightly closer to some memory 786 * and PCIe than otherwise. This also impacts the 'amdzen' nexus driver as 787 * now there is only one 'node' present. 
788 * 789 * ZEN 3 790 * 791 * From an architectural perspective, Zen 3 is a much smaller change from 792 * Zen 2 than Zen 2 was from Zen 1, though it makes up for most of that in 793 * its microarchitectural changes. The biggest thing for us is how the die 794 * changes. In Zen 1 and Zen 2, each core complex still had its own L3 795 * cache. However, in Zen 3, the L3 is now shared between the entire core 796 * complex die and is no longer partitioned between each core complex. This 797 * means that all cores on the die can share the same L3 cache. Otherwise, 798 * the general layout of the overall package with various core complexes 799 * and an I/O die stays the same. Here's what the Core Complex Die looks 800 * like in a bit more detail: 801 * 802 * +-------------------------------------------------+ 803 * | Zen 3 Core Complex Die | 804 * | +-------------------+ +-------------------+ | 805 * | | Core +----+ | | Core +----+ | | 806 * | | +--------+ | L2 | | | +--------+ | L2 | | | 807 * | | | Thread | +----+ | | | Thread | +----+ | | 808 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 809 * | | | Thread | |L1| | | | Thread | |L1| | | 810 * | | +--------+ +--+ | | +--------+ +--+ | | 811 * | +-------------------+ +-------------------+ | 812 * | +-------------------+ +-------------------+ | 813 * | | Core +----+ | | Core +----+ | | 814 * | | +--------+ | L2 | | | +--------+ | L2 | | | 815 * | | | Thread | +----+ | | | Thread | +----+ | | 816 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 817 * | | | Thread | |L1| | | | Thread | |L1| | | 818 * | | +--------+ +--+ | | +--------+ +--+ | | 819 * | +-------------------+ +-------------------+ | 820 * | | 821 * | +--------------------------------------------+ | 822 * | | L3 Cache | | 823 * | +--------------------------------------------+ | 824 * | | 825 * | +-------------------+ +-------------------+ | 826 * | | Core +----+ | | Core +----+ | | 827 * | | +--------+ | L2 | | | +--------+ | L2 | | | 828 * | | | Thread | 
+----+ | | | Thread | +----+ | | 829 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 830 * | | | Thread | |L1| | | | Thread | |L1| | | 831 * | | +--------+ +--+ | | +--------+ +--+ | | 832 * | +-------------------+ +-------------------+ | 833 * | +-------------------+ +-------------------+ | 834 * | | Core +----+ | | Core +----+ | | 835 * | | +--------+ | L2 | | | +--------+ | L2 | | | 836 * | | | Thread | +----+ | | | Thread | +----+ | | 837 * | | +--------+-+ +--+ | | +--------+-+ +--+ | | 838 * | | | Thread | |L1| | | | Thread | |L1| | | 839 * | | +--------+ +--+ | | +--------+ +--+ | | 840 * | +-------------------+ +-------------------+ | 841 * +-------------------------------------------------+ 842 * 843 * While it is not pictured, there are connections from the die to the 844 * broader data fabric and additional functional blocks to support that 845 * communication and coherency. 846 * 847 * CPUID LEAVES 848 * 849 * There are a few different CPUID leaves that we can use to try and understand 850 * the actual state of the world. As part of the introduction of family 0xf, AMD 851 * added CPUID leaf 0x80000008. This leaf tells us the number of logical 852 * processors that are in the system. Because families before Zen didn't have 853 * SMT, this was always the number of cores that were in the system. However, it 854 * should always be thought of as the number of logical threads to be consistent 855 * between generations. In addition we also get the size of the APIC ID that is 856 * used to represent the number of logical processors. This is important for 857 * deriving topology information. 858 * 859 * In the Bulldozer family, AMD added leaf 0x8000001E. The information varies a 860 * bit between Bulldozer and later families, but it is quite useful in 861 * determining the topology information. Because this information has changed 862 * across family generations, it's worth calling out what these mean 863 * explicitly. 
The registers have the following meanings: 864 * 865 * %eax The APIC ID. The entire register is defined to have a 32-bit 866 * APIC ID, even though on systems without x2apic support, it will 867 * be limited to 8 bits. 868 * 869 * %ebx On Bulldozer-era systems this contains information about the 870 * number of cores that are in a compute unit (cores that share 871 * resources). It also contains a per-package compute unit ID that 872 * identifies which compute unit the logical CPU is a part of. 873 * 874 * On Zen-era systems this instead contains the number of threads 875 * per core and the ID of the core that the logical CPU is a part 876 * of. Note, this ID is unique only to the package, it is not 877 * globally unique across the entire system. 878 * 879 * %ecx This contains the number of nodes that exist in the package. It 880 * also contains an ID that identifies which node the logical CPU 881 * is a part of. 882 * 883 * Finally, we also use cpuid leaf 0x8000001D to determine information about the 884 * cache layout to determine which logical CPUs are sharing which caches. 885 * 886 * illumos Topology 887 * ---------------- 888 * 889 * Based on the above we synthesize the information into several different 890 * variables that we store in the 'struct cpuid_info'. We'll go into the details 891 * of what each member is supposed to represent and their uniqueness. In 892 * general, there are two levels of uniqueness that we care about. We care about 893 * an ID that is globally unique. That means that it will be unique across all 894 * entities in the system. For example, the default logical CPU ID is globally 895 * unique. On the other hand, there is some information that we only care about 896 * being unique within the context of a single package / socket. Here are the 897 * variables that we keep track of and their meaning. 898 * 899 * Several of the values that are asking for an identifier, with the exception 900 * of cpi_apicid, are allowed to be synthetic. 
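As a rough illustration of how the leaf 0x8000001E registers described above might be unpacked before being synthesized into these members, the following sketch decodes raw %eax, %ebx, and %ecx values. The structure, the function names, and the exact bit positions (an ID in %ebx[7:0], a count-minus-one in %ebx[15:8], a node ID in %ecx[7:0], and a node-count-minus-one in %ecx[10:8]) are assumptions made for illustration and should be checked against the AMD documentation for the family in question. This is a sketch, not the kernel's implementation.

```c
#include <stdint.h>

// Hypothetical container for the decoded fields; this is not the kernel's
// struct cpuid_info.
typedef struct sketch_topo_1e {
	uint32_t st_apicid;	// %eax: APIC ID
	uint32_t st_unit_id;	// core ID (Zen) or compute unit ID (Bulldozer)
	uint32_t st_per_unit;	// threads per core, or cores per compute unit
	uint32_t st_nodeid;	// node that this logical CPU belongs to
	uint32_t st_nnodes;	// number of nodes in the package
} sketch_topo_1e_t;

// Extract bits [hi:lo] of val, where 0 <= lo <= hi <= 31.
static uint32_t
sketch_bitx(uint32_t val, unsigned hi, unsigned lo)
{
	unsigned width = hi - lo + 1;
	uint32_t mask = (width >= 32) ? 0xffffffffu : ((1u << width) - 1u);

	return ((val >> lo) & mask);
}

// Decode the three registers. The bit positions are assumptions, see above.
static sketch_topo_1e_t
sketch_decode_1e(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	sketch_topo_1e_t t;

	t.st_apicid = eax;
	t.st_unit_id = sketch_bitx(ebx, 7, 0);
	// The count fields are assumed to be encoded as "count minus one".
	t.st_per_unit = sketch_bitx(ebx, 15, 8) + 1;
	t.st_nodeid = sketch_bitx(ecx, 7, 0);
	t.st_nnodes = sketch_bitx(ecx, 10, 8) + 1;
	return (t);
}
```

Under these assumptions a raw count field of zero decodes to one thread per core (or one node per package). Note that the resulting core or compute unit ID is only unique within the package, as described above.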
901 * 902 * 903 * cpi_apicid 904 * 905 * This is the value of the CPU's APIC ID. This should be the full 32-bit 906 * ID if the CPU is using the x2apic. Otherwise, it should be the 8-bit 907 * APIC ID. This value is globally unique between all logical CPUs across 908 * all packages. This is usually required by the APIC. 909 * 910 * cpi_chipid 911 * 912 * This value indicates the ID of the package that the logical CPU is a 913 * part of. This value is allowed to be synthetic. It is usually derived by 914 * taking the CPU's APIC ID and determining how many bits are used to 915 * represent CPU cores in the package. All logical CPUs that are part of 916 * the same package must have the same value. 917 * 918 * cpi_coreid 919 * 920 * This represents the ID of a CPU core. Two logical CPUs should only have 921 * the same cpi_coreid value if they are part of the same core. These 922 * values may be synthetic. On systems that support SMT, this value is 923 * usually derived from the APIC ID, otherwise it is often synthetic and 924 * just set to the value of the cpu_id in the cpu_t. 925 * 926 * cpi_pkgcoreid 927 * 928 * This is similar to the cpi_coreid in that logical CPUs that are part of 929 * the same core should have the same ID. The main difference is that these 930 * values are only required to be unique to a given socket. 931 * 932 * cpi_clogid 933 * 934 * This represents the logical ID of a logical CPU. This value should be 935 * unique within a given socket for each logical CPU. This is allowed to be 936 * synthetic, though it is usually based off of the CPU's APIC ID. The 937 * broader system expects that logical CPUs that are part of the same 938 * core have contiguous numbers. For example, if there were two threads per 939 * core, then the two logical IDs divided by two should be equal, the first 940 * ID modulo two should be zero, and the second ID modulo two should be one. For example, IDs 4 and 5 941 * indicate two logical CPUs that are part of the same core.
But IDs 5 and 942 * 6 represent two logical CPUs that are part of different cores. 943 * 944 * While it is common for the cpi_coreid and the cpi_clogid to be derived 945 * from the same source, strictly speaking, they don't have to be and the 946 * two values should be considered logically independent. One should not 947 * try to compare a logical CPU's cpi_coreid and cpi_clogid to determine 948 * some kind of relationship. While this is tempting, we've seen cases on 949 * AMD family 0xf where the system's CPU ID is not related to its APIC ID. 950 * 951 * cpi_ncpu_per_chip 952 * 953 * This value indicates the total number of logical CPUs that exist in the 954 * physical package. Critically, this is not the number of logical CPUs 955 * that exist for just the single core. 956 * 957 * This value should be the same for all logical CPUs in the same package. 958 * 959 * cpi_ncore_per_chip 960 * 961 * This value indicates the total number of physical CPU cores that exist 962 * in the package. The system compares this value with cpi_ncpu_per_chip to 963 * determine if simultaneous multi-threading (SMT) is enabled. When 964 * cpi_ncpu_per_chip equals cpi_ncore_per_chip, then there is no SMT and 965 * the X86FSET_HTT feature is not set. If cpi_ncore_per_chip is greater than one, 966 * then we consider the processor to have the feature X86FSET_CMP, to 967 * indicate that there is support for more than one core. 968 * 969 * This value should be the same for all logical CPUs in the same package. 970 * 971 * cpi_procnodes_per_pkg 972 * 973 * This value indicates the number of 'nodes' that exist in the package. 974 * When processors are actually a multi-chip module, this represents the 975 * number of such modules that exist in the package. Currently, on Intel 976 * based systems this member is always set to 1. 977 * 978 * This value should be the same for all logical CPUs in the same package.
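The comparison between cpi_ncpu_per_chip and cpi_ncore_per_chip described above, along with the contiguity expectation for cpi_clogid, can be sketched as follows. The function names are hypothetical and the checks are a simplification of the prose; the kernel's actual feature determination also depends on what cpuid itself reports.

```c
#include <stdbool.h>
#include <stdint.h>

// SMT is considered present when the package has more logical CPUs than
// physical cores. When the two counts are equal there is no SMT.
static bool
sketch_pkg_has_smt(uint32_t ncpu_per_chip, uint32_t ncore_per_chip)
{
	return (ncpu_per_chip > ncore_per_chip);
}

// More than one core in the package corresponds to the X86FSET_CMP feature.
static bool
sketch_pkg_has_cmp(uint32_t ncore_per_chip)
{
	return (ncore_per_chip > 1);
}

// Given contiguous per-socket logical IDs, two logical CPUs are on the same
// core when their IDs fall into the same group of threads_per_core IDs.
static bool
sketch_clogid_same_core(uint32_t a, uint32_t b, uint32_t threads_per_core)
{
	if (threads_per_core == 0)
		return (false);
	return (a / threads_per_core == b / threads_per_core);
}
```

With two threads per core, logical IDs 4 and 5 land in the same group while 5 and 6 do not, matching the cpi_clogid example above.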
979 * 980 * cpi_procnodeid 981 * 982 * This value indicates the ID of the node that the logical CPU is a part 983 * of. All logical CPUs that are in the same node must have the same value 984 * here. This value must be unique across all of the packages in the 985 * system. On Intel based systems, this is currently set to the value in 986 * cpi_chipid because there is only one node. 987 * 988 * cpi_cores_per_compunit 989 * 990 * This value indicates the number of cores that are part of a compute 991 * unit. See the AMD topology section for this. This member only has real 992 * meaning currently for AMD Bulldozer family processors. For all other 993 * processors, this should currently be set to 1. 994 * 995 * cpi_compunitid 996 * 997 * This indicates the compute unit that the logical CPU belongs to. For 998 * processors without AMD Bulldozer-style compute units this should be set 999 * to the value of cpi_coreid. 1000 * 1001 * cpi_ncpu_shr_last_cache 1002 * 1003 * This indicates the number of logical CPUs that are sharing the same last 1004 * level cache. This value should be the same for all CPUs that are sharing 1005 * that cache. The last cache refers to the cache that is closest to memory 1006 * and furthest away from the CPU. 1007 * 1008 * cpi_last_lvl_cacheid 1009 * 1010 * This indicates the ID of the last cache that the logical CPU uses. This 1011 * cache is often shared between multiple logical CPUs and is the cache 1012 * that is closest to memory and furthest away from the CPU. This value 1013 * should be the same for a group of logical CPUs only if they actually 1014 * share the same last level cache. IDs should not overlap between 1015 * packages. 1016 * 1017 * cpi_ncore_bits 1018 * 1019 * This indicates the number of bits that are required to represent all of 1020 * the cores in the system. As cores are derived based on their APIC IDs, 1021 * we aren't guaranteed a run of APIC IDs starting from zero. 
It's OK for 1022 * this value to be larger than the actual number of IDs that are present 1023 * in the system. This is used to size tables by the CMI framework. It is 1024 * only filled in for Intel and AMD CPUs. 1025 * 1026 * cpi_nthread_bits 1027 * 1028 * This indicates the number of bits required to represent all of the IDs 1029 * that cover the logical CPUs that exist on a given core. It's OK for this 1030 * value to be larger than the actual number of IDs that are present in the 1031 * system. This is used to size tables by the CMI framework. It is 1032 * only filled in for Intel and AMD CPUs. 1033 * 1034 * ----------- 1035 * Hypervisors 1036 * ----------- 1037 * 1038 * If trying to manage the differences between vendors wasn't bad enough, it can 1039 * get worse thanks to our friend hardware virtualization. Hypervisors are given 1040 * the ability to interpose on all cpuid instructions and change them to suit 1041 * their purposes. In general, this is necessary as the hypervisor wants to be 1042 * able to present a more uniform set of features or not necessarily give the 1043 * guest operating system kernel knowledge of all features so it can be 1044 * more easily migrated between systems. 1045 * 1046 * When it comes to trying to determine topology information, this can be a 1047 * double edged sword. When a hypervisor doesn't actually implement a cpuid 1048 * leaf, it'll often return all zeros. Because of that, you'll often see various 1049 * checks scattered about fields being non-zero before we assume we can use 1050 * them. 1051 * 1052 * When it comes to topology information, the hypervisor is often incentivized 1053 * to lie to you about topology. This is because it doesn't always actually 1054 * guarantee that topology at all. The topology path we take in the system 1055 * depends on how the CPU advertises itself. If it advertises itself as an Intel 1056 * or AMD CPU, then we basically do our normal path. 
However, when they don't 1057 * use an actual vendor, then that usually turns into multiple one-core CPUs 1058 * that we enumerate that are often on different sockets. The actual behavior 1059 * depends greatly on what the hypervisor actually exposes to us. 1060 * 1061 * -------------------- 1062 * Exposing Information 1063 * -------------------- 1064 * 1065 * We expose CPUID information in three different forms in the system. 1066 * 1067 * The first is through the x86_featureset variable. This is used in conjunction 1068 * with the is_x86_feature() function. This is queried by x86-specific functions 1069 * to determine which features are or aren't present in the system and to make 1070 * decisions based upon them. For example, users of this include everything from 1071 * parts of the system dedicated to reliability, availability, and 1072 * serviceability (RAS), to making decisions about how to handle security 1073 * mitigations, to various x86-specific drivers. General purpose or 1074 * architecture independent drivers should never be calling this function. 1075 * 1076 * The second means is through the auxiliary vector. The auxiliary vector is a 1077 * series of tagged data that the kernel passes down to a user program when it 1078 * begins executing. This information is used to indicate to programs what 1079 * instruction set extensions are present. For example, information about the 1080 * CPU supporting the machine check architecture (MCA) wouldn't be passed down 1081 * since user programs cannot make use of it. However, things like the AVX 1082 * instruction sets are. Programs use this information to make run-time 1083 * decisions about what features they should use. As an example, the run-time 1084 * link-editor (rtld) can relocate different functions depending on the hardware 1085 * support available. 1086 * 1087 * The final form is through a series of accessor functions that all have the 1088 * form cpuid_get*. 
This is used by a number of different subsystems in the 1089 * kernel to determine more detailed information about what we're running on, 1090 * topology information, etc. Some of these subsystems include processor groups 1091 * (uts/common/os/pg.c), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI, 1092 * microcode, and performance monitoring. These functions all ASSERT that the 1093 * CPU they're being called on has reached a certain cpuid pass. If the passes 1094 * are rearranged, then this needs to be adjusted. 1095 * 1096 * ----------------------------------------------- 1097 * Speculative Execution CPU Side Channel Security 1098 * ----------------------------------------------- 1099 * 1100 * With the advent of the Spectre and Meltdown attacks, which exploit speculative 1101 * execution in the CPU to create side channels, there have been a number of 1102 * different attacks and corresponding issues that the operating system needs to 1103 * mitigate against. The following is a common, but not 1104 * exhaustive, list of the issues that we know about and have either done some work 1105 * to mitigate or still need to do more work on: 1106 * 1107 * - Spectre v1 1108 * - swapgs (Spectre v1 variant) 1109 * - Spectre v2 1110 * - Branch History Injection (BHI) 1111 * - Meltdown (Spectre v3) 1112 * - Rogue Register Read (Spectre v3a) 1113 * - Speculative Store Bypass (Spectre v4) 1114 * - ret2spec, SpectreRSB 1115 * - L1 Terminal Fault (L1TF) 1116 * - Microarchitectural Data Sampling (MDS) 1117 * - Register File Data Sampling (RFDS) 1118 * 1119 * Each of these requires different sets of mitigations and has different attack 1120 * surfaces. For the most part, this discussion is about protecting the kernel 1121 * from non-kernel executing environments such as user processes and hardware 1122 * virtual machines. Unfortunately, there are a number of user vs. user 1123 * scenarios that exist with these.
The rest of this section will describe the 1124 * overall approach that the system has taken to address these as well as their 1125 * shortcomings. Unfortunately, not all of the above have been handled today. 1126 * 1127 * SPECTRE v2, ret2spec, SpectreRSB 1128 * 1129 * The second variant of the spectre attack focuses on performing branch target 1130 * injection. This generally impacts indirect call instructions in the system. 1131 * There are four different ways to mitigate this issue that are commonly 1132 * described today: 1133 * 1134 * 1. Using Indirect Branch Restricted Speculation (IBRS). 1135 * 2. Using Retpolines and RSB Stuffing 1136 * 3. Using Enhanced Indirect Branch Restricted Speculation (eIBRS) 1137 * 4. Using Automated Indirect Branch Restricted Speculation (AIBRS) 1138 * 1139 * IBRS uses a feature added to microcode to restrict speculation, among other 1140 * things. This form of mitigation has not been used as it has been generally 1141 * seen as too expensive and requires reactivation upon various transitions in 1142 * the system. 1143 * 1144 * As a less impactful alternative to IBRS, retpolines were developed by 1145 * Google. These basically require one to replace indirect calls with a specific 1146 * trampoline that will cause speculation to fail and break the attack. 1147 * Retpolines require compiler support. We always build with retpolines in the 1148 * external thunk mode. This means that a traditional indirect call is replaced 1149 * with a call to one of the __x86_indirect_thunk_<reg> functions. A side effect 1150 * of this is that all indirect function calls are performed through a register. 1151 * 1152 * We have to use a common external location of the thunk and not inline it into 1153 * the callsite so that way we can have a single place to patch these functions. 1154 * As it turns out, we currently have two different forms of retpolines that 1155 * exist in the system: 1156 * 1157 * 1. A full retpoline 1158 * 2. 
A no-op version 1159 * 1160 * The first one is used in the general case. Historically, there was an 1161 * AMD-specific optimized retpoline variant that was based around using a 1162 * serializing lfence instruction; however, in March 2022 it was announced that 1163 * this was actually still vulnerable to Spectre v2 and therefore we no longer 1164 * use it and it is no longer available in the system. 1165 * 1166 * The third mitigation listed above, eIBRS, is the most curious. It turns out that the way 1167 * that retpolines are implemented is that they rely on how speculation is 1168 * performed on a 'ret' instruction. Intel has continued to optimize this 1169 * process (which is partly why we need to have return stack buffer stuffing, 1170 * but more on that in a bit) and in processors starting with Cascade Lake 1171 * on the server side, it's dangerous to rely on retpolines. Instead, a new 1172 * mechanism has been introduced called Enhanced IBRS (eIBRS). 1173 * 1174 * Unlike IBRS, eIBRS is designed to be enabled once at boot on each 1175 * physical core and left on. However, if this is the case, we don't want to use retpolines 1176 * any more. Therefore if eIBRS is present, we end up turning each retpoline 1177 * function (called a thunk) into a jmp instruction. This means that we're still 1178 * paying the cost of an extra jump to the external thunk, but it gives us 1179 * flexibility and the ability to have a single kernel image that works across a 1180 * wide variety of systems and hardware features. 1181 * 1182 * Unfortunately, this alone is insufficient. First, Skylake systems have 1183 * additional speculation for the Return Stack Buffer (RSB) which is used to 1184 * return from call instructions which retpolines take advantage of. However, 1185 * this problem is not just limited to Skylake and is actually more pernicious. 1186 * The SpectreRSB paper introduces several more problems that can arise with 1187 * dealing with this.
The RSB can be poisoned just like the indirect branch 1188 * predictor. This means that one needs to clear the RSB when transitioning 1189 * between two different privilege domains. Some examples include: 1190 * 1191 * - Switching between two different user processes 1192 * - Going between user land and the kernel 1193 * - Returning to the kernel from a hardware virtual machine 1194 * 1195 * Mitigating this involves combining a couple of different things. The first is 1196 * SMEP (supervisor mode execution prevention) which was introduced in Ivy 1197 * Bridge. When an RSB entry refers to a user address and we're executing in the 1198 * kernel, speculation through it will be stopped when SMEP is enabled. This 1199 * protects against a number of the different cases that we would normally be 1200 * worried about such as when we enter the kernel from user land. 1201 * 1202 * To protect against additional manipulation of the RSB from other contexts, 1203 * such as a non-root VMX context attacking the kernel, we first look to 1204 * enhanced IBRS. When eIBRS is present and enabled, then there should be 1205 * nothing else that we need to do to protect the kernel at this time. 1206 * 1207 * Unfortunately, not all eIBRS implementations are sufficient to guard 1208 * against RSB manipulations, so we still need to manually overwrite the 1209 * contents of the return stack buffer unless the hardware specifies we are 1210 * covered. We do this through the x86_rsb_stuff() function. Currently this 1211 * is employed on context switch and vmx_exit. The x86_rsb_stuff() function is 1212 * disabled only when mitigations in general are, or if we have hardware 1213 * indicating no need for post-barrier RSB protections, either in one place 1214 * (old hardware) or in both (newer hardware). 1215 * 1216 * If SMEP is not present, then we would have to stuff the RSB every time we 1217 * transitioned from user mode to the kernel, which isn't very practical right 1218 * now.
1219 * 1220 * To fully protect against user-to-user and vmx-to-vmx attacks of these classes, 1221 * we would also need to allow them to opt into performing an Indirect 1222 * Branch Prediction Barrier (IBPB) on switch. This is not currently wired up. 1223 * 1224 * The fourth form of mitigation here is specific to AMD and is called Automated 1225 * IBRS (AIBRS). This is similar in spirit to eIBRS; however, rather than set the 1226 * IBRS bit in MSR_IA32_SPEC_CTRL (0x48) we instead set a bit in the EFER 1227 * (extended feature enable register) MSR. This bit basically says that IBRS 1228 * acts as though it is always active when executing at CPL0 and when executing 1229 * in the 'host' context when SEV-SNP is enabled. 1230 * 1231 * When this is active, AMD states that the RSB is cleared on VMEXIT and 1232 * therefore stuffing the RSB on that path is unnecessary. While this handles RSB stuffing attacks from SVM 1233 * to the kernel, we must still consider the remaining cases that exist, just 1234 * like above. While traditionally AMD employed a 32 entry RSB allowing the 1235 * traditional technique to work, this is not true on all CPUs. While a write to 1236 * IBRS would clear the RSB if the processor supports more than 32 entries (but 1237 * not otherwise), AMD states that as long as at least a single 4 KiB unmapped 1238 * guard page is present between user and kernel address spaces and SMEP is 1239 * enabled, then there is no need to clear the RSB at all. 1240 * 1241 * By default, the system will enable RSB stuffing and the required variant of 1242 * retpolines and store that information in the x86_spectrev2_mitigation value. 1243 * This will be evaluated after a microcode update as well, though it is 1244 * expected that microcode updates will not take away features. This may mean 1245 * that a late loaded microcode may not end up in the optimal configuration 1246 * (though this should be rare).
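The choice among the mitigations described above can be sketched roughly as follows. Everything here is hypothetical: the enum, the capability structure, and the function names are invented for illustration and are not the kernel's types, and the real selection logic considers more inputs (for example, whether RSB stuffing is still required). The handling of AIBRS in sketch_v2_full_retpoline() is an assumption; the text above only states that the thunks are replaced with a jmp when eIBRS is present.

```c
#include <stdbool.h>

// Hypothetical stand-in for the value stored in x86_spectrev2_mitigation.
typedef enum sketch_v2_mitigation {
	SKETCH_V2_RETPOLINE,
	SKETCH_V2_EIBRS,
	SKETCH_V2_AIBRS,
	SKETCH_V2_DISABLED
} sketch_v2_mitigation_t;

// Hypothetical summary of what the administrator and the CPU report.
typedef struct sketch_v2_caps {
	bool sc_disabled;	// mitigations turned off entirely
	bool sc_eibrs;		// Intel enhanced IBRS is available
	bool sc_aibrs;		// AMD automated IBRS is available
} sketch_v2_caps_t;

// Prefer a hardware mode when one exists, else fall back to retpolines.
// eIBRS and AIBRS are vendor specific, so the order between them is moot.
static sketch_v2_mitigation_t
sketch_v2_select(const sketch_v2_caps_t *caps)
{
	if (caps->sc_disabled)
		return (SKETCH_V2_DISABLED);
	if (caps->sc_aibrs)
		return (SKETCH_V2_AIBRS);
	if (caps->sc_eibrs)
		return (SKETCH_V2_EIBRS);
	return (SKETCH_V2_RETPOLINE);
}

// Whether the external thunks keep the full retpoline sequence or are
// patched down to a plain jmp. Treating AIBRS like eIBRS is an assumption.
static bool
sketch_v2_full_retpoline(sketch_v2_mitigation_t m)
{
	return (m == SKETCH_V2_RETPOLINE);
}
```

Because the decision is re-run after a microcode update, a function shaped like sketch_v2_select() would need to be safe to call more than once, with the thunks patched again to match the new result.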
 *
 * Currently we do not build kmdb with retpolines or perform any additional side
 * channel security mitigations for it. One complication with kmdb is that it
 * requires its own retpoline thunks and it would need to adjust itself based on
 * what the kernel does. The threat model of kmdb is more limited and therefore
 * it may make more sense to investigate using prediction barriers as the whole
 * system is only executing a single instruction at a time while in kmdb.
 *
 * Branch History Injection (BHI)
 *
 * BHI is a specific form of SPECTREv2 where an attacker may manipulate branch
 * history before transitioning from user to supervisor mode (or from VMX
 * non-root/guest to root mode). The attacker can then exploit certain
 * compiler-generated code-sequences ("gadgets") to disclose information from
 * other contexts or domains. Recent (late-2023/early-2024) research in
 * object code analysis discovered many more potential gadgets than what was
 * initially reported (which previously was confined to Linux use of
 * unprivileged eBPF).
 *
 * The BHI threat doesn't exist in processors that predate eIBRS, or in AMD
 * ones. Some eIBRS processors have the ability to disable branch history in
 * certain (but not all) cases using an MSR write. eIBRS processors that don't
 * have the ability to disable must use a software sequence to scrub the
 * branch history buffer.
 *
 * BHI_DIS_S (the control set by the aforementioned MSR write) protects ring 0
 * from ring 3 (whether VMX guest or VMX root). It does not protect different
 * user processes from each other, or ring 3 VMX guest from ring 3 VMX root or
 * vice versa.
 *
 * The BHI clearing sequence prevents user code from exploiting kernel gadgets,
 * and user A's use of user B's gadgets.
 *
 * SMEP and eIBRS are a continuing defense-in-depth measure protecting the
 * kernel.
 *
 * SPECTRE v1, v4
 *
 * The v1 and v4 variants of spectre are not currently mitigated in the
 * system and require other classes of changes to occur in the code.
 *
 * SPECTRE v1 (SWAPGS VARIANT)
 *
 * The class of Spectre v1 vulnerabilities isn't all about bounds checks, but
 * can generally affect any branch-dependent code. The swapgs issue is one
 * variant of this. If we are coming in from userspace, we can have code like
 * this:
 *
 *	cmpw	$KCS_SEL, REGOFF_CS(%rsp)
 *	je	1f
 *	movq	$0, REGOFF_SAVFP(%rsp)
 *	swapgs
 * 1:
 *	movq	%gs:CPU_THREAD, %rax
 *
 * If an attacker can cause a mis-speculation of the branch here, we could skip
 * the needed swapgs, and use the /user/ %gsbase as the base of the %gs-based
 * load. If subsequent code can act as the usual Spectre cache gadget, this
 * would potentially allow KPTI bypass. To fix this, we need an lfence prior to
 * any use of the %gs override.
 *
 * The other case is also an issue: if we're coming into a trap from kernel
 * space, we could mis-speculate and swapgs the user %gsbase back in prior to
 * using it. AMD systems are not vulnerable to this version, as a swapgs is
 * serializing with respect to subsequent uses. But as AMD /does/ need the other
 * case, and the fix is the same in both cases (an lfence at the branch target
 * 1: in this example), we'll just do it unconditionally.
 *
 * Note that we don't enable user-space "wrgsbase" via CR4_FSGSBASE, making it
 * harder for user-space to actually set a useful %gsbase value: although it's
 * not clear, it might still be feasible via lwp_setprivate(), though, so we
 * mitigate anyway.
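 *
 * As an illustration only (a sketch that reuses the labels from the example
 * in this section; it is not a copy of the actual trap handlers), the
 * mitigated sequence places the lfence at the shared branch target so that
 * both the swapgs path and the skip path pass through it before %gs is used:
 *
 *	cmpw	$KCS_SEL, REGOFF_CS(%rsp)
 *	je	1f
 *	movq	$0, REGOFF_SAVFP(%rsp)
 *	swapgs
 * 1:
 *	lfence
 *	movq	%gs:CPU_THREAD, %rax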
 *
 * MELTDOWN
 *
 * Meltdown, or spectre v3, allowed a user process to read any data in their
 * address space regardless of whether or not the page tables in question
 * allowed the user to have the ability to read them. The solution to meltdown
 * is kernel page table isolation. In this world, there are two page tables that
 * are used for a process, one in user land and one in the kernel. To implement
 * this we use per-CPU page tables and switch between the user and kernel
 * variants when entering and exiting the kernel. For more information about
 * this process and how the trampolines work, please see the big theory
 * statements and additional comments in:
 *
 *  - uts/i86pc/ml/kpti_trampolines.s
 *  - uts/i86pc/vm/hat_i86.c
 *
 * While Meltdown (which Intel calls Rogue Data Cache Load) only impacted Intel
 * systems, and there are also Intel systems that have it fixed, we always have
 * kernel page table isolation enabled. While this may at first seem weird, an
 * important thing to remember is that you can't speculatively read an address
 * if it's never in your page table at all. Having user processes without kernel
 * pages present provides us with an important layer of defense in the kernel
 * against any other side channel attacks that exist and have yet to be
 * discovered. As such, kernel page table isolation (KPTI) is always enabled by
 * default, no matter the x86 system.
 *
 * L1 TERMINAL FAULT
 *
 * L1 Terminal Fault (L1TF) takes advantage of an issue in how speculative
 * execution uses page table entries. Effectively, it is two different problems.
 * The first is that it ignores the not present bit in the page table entries
 * when performing speculative execution.
 * This means that something can
 * speculatively read the listed physical address if it's present in the L1
 * cache under certain conditions (see Intel's documentation for the full set of
 * conditions). Secondly, this can be used to bypass hardware virtualization
 * extended page tables (EPT) that are part of Intel's hardware virtual machine
 * instructions.
 *
 * For the non-hardware virtualized case, this is relatively easy to deal with.
 * We must make sure that all unmapped pages have an address of zero. This means
 * that they could read the first 4k of physical memory; however, we never use
 * that first page in the operating system and always skip putting it in our
 * memory map, even if firmware tells us we can use it in our memory map. While
 * other systems try to put extra metadata in the address and reserved bits,
 * which led to this being problematic in those cases, we do not.
 *
 * For hardware virtual machines things are more complicated. Because they can
 * construct their own page tables, it isn't hard for them to perform this
 * attack against any physical address. The one wrinkle is that this physical
 * address must be in the L1 data cache. Thus Intel added an MSR that we can use
 * to flush the L1 data cache. We wrap this up in the function
 * spec_uarch_flush(). This function is also used in the mitigation of
 * microarchitectural data sampling (MDS) discussed later on. Kernel based
 * hypervisors such as KVM or bhyve are responsible for performing this before
 * entering the guest.
 *
 * Because this attack takes place in the L1 cache, there's another wrinkle
 * here. The L1 cache is shared between all logical CPUs in a core in most Intel
 * designs.
 * This means that when a thread enters a hardware virtualized context
 * and flushes the L1 data cache, the other thread on the processor may then go
 * ahead and put new data in it that can be potentially attacked. While one
 * solution is to disable SMT on the system, another option that is available is
 * to use a feature for hardware virtualization called 'SMT exclusion'. This
 * goes through and makes sure that if an HVM is being scheduled on one thread,
 * then the thing on the other thread is from the same hardware virtual machine.
 * If an interrupt comes in or the guest exits to the broader system, then the
 * other SMT thread will be kicked out.
 *
 * L1TF can be fully mitigated by hardware. If the RDCL_NO feature is set in the
 * architecture capabilities MSR (MSR_IA32_ARCH_CAPABILITIES), then we will not
 * perform L1TF related mitigations.
 *
 * MICROARCHITECTURAL DATA SAMPLING
 *
 * Microarchitectural data sampling (MDS) is a combination of four discrete
 * vulnerabilities that are similar issues affecting various parts of the CPU's
 * microarchitectural implementation around load, store, and fill buffers.
 * Specifically it is made up of the following subcomponents:
 *
 *  1. Microarchitectural Store Buffer Data Sampling (MSBDS)
 *  2. Microarchitectural Fill Buffer Data Sampling (MFBDS)
 *  3. Microarchitectural Load Port Data Sampling (MLPDS)
 *  4. Microarchitectural Data Sampling Uncacheable Memory (MDSUM)
 *
 * To begin addressing these, Intel has introduced another feature in microcode
 * called MD_CLEAR. This changes the verw instruction to operate in a different
 * way. This allows us to execute the verw instruction in a particular way to
 * flush the state of the affected parts. The L1TF L1D flush mechanism is also
 * updated when this microcode is present to flush this state.
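 *
 * As a rough sketch only (assuming a 64-bit kernel and following the general
 * shape of Intel's published guidance; the real flush routine may well differ
 * in detail), the flush stores the %ds selector on the stack and then executes
 * the memory-operand form of verw against it:
 *
 *	subq	$8, %rsp
 *	movw	%ds, (%rsp)
 *	verw	(%rsp)
 *	addq	$8, %rsp
 *
 * The result of the verw permission check itself is not used here; the point
 * of the sequence is the buffer clearing that MD_CLEAR microcode attaches to
 * the instruction.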
 *
 * Primarily we need to flush this state whenever we transition from the kernel
 * to a less privileged context such as user mode or an HVM guest. MSBDS is a
 * little bit different. Here the structures are statically sized when a logical
 * CPU is in use and resized when it goes to sleep. Therefore, we also need to
 * flush the microarchitectural state before the CPU goes idle by calling hlt,
 * mwait, or another ACPI method. To perform these flushes, we call
 * x86_md_clear() at all of these transition points.
 *
 * If hardware enumerates RDCL_NO, indicating that it is not vulnerable to L1TF,
 * then we change the spec_uarch_flush() function to point to x86_md_clear(). If
 * MDS_NO has been set, then this is fully mitigated and x86_md_clear() becomes
 * a no-op.
 *
 * Unfortunately, with this issue hyperthreading rears its ugly head. In
 * particular, everything we've discussed above is only valid for a single
 * thread executing on a core. In the case where you have hyper-threading
 * present, this attack can be performed between threads. The theoretical fix
 * for this is to ensure that both threads are always in the same security
 * domain. This means that they are executing in the same ring and mutually
 * trust each other. Practically speaking, this would mean that a system call
 * would have to issue an inter-processor interrupt (IPI) to the other thread.
 * Rather than implement this, we recommend disabling hyper-threading
 * through the use of psradm -aS.
 *
 * TSX ASYNCHRONOUS ABORT
 *
 * TSX Asynchronous Abort (TAA) is another side-channel vulnerability that
 * behaves like MDS, but leverages Intel's transactional instructions as another
 * vector. Effectively, when a transaction hits one of these cases (unmapped
 * page, various cache snoop activity, etc.) then the same data can be exposed
 * as in the case of MDS.
 * This means that you can attack your twin.
 *
 * Intel has described that there are two different ways that we can mitigate
 * this problem on affected processors:
 *
 *   1) We can use the same techniques used to deal with MDS. Flushing the
 *      microarchitectural buffers and disabling hyperthreading will mitigate
 *      this in the same way.
 *
 *   2) Using microcode to disable TSX.
 *
 * Now, most processors that are subject to MDS (as in they don't have MDS_NO in
 * the IA32_ARCH_CAPABILITIES MSR) will not receive microcode to disable TSX.
 * That's OK as we're already doing all such mitigations. On the other hand,
 * processors with MDS_NO are all supposed to receive microcode updates that
 * enumerate support for disabling TSX. In general, we'd rather use this method
 * when available as it doesn't require disabling hyperthreading to be
 * effective. Currently we basically are relying on microcode for processors
 * that enumerate MDS_NO.
 *
 * Another MDS-variant in a few select Intel Atom CPUs is Register File Data
 * Sampling: RFDS. This allows an attacker to sample values that were in any
 * of the integer, floating point, or vector registers. This was discovered by
 * Intel during internal validation work. The existence of the RFDS_NO
 * capability, or the LACK of a RFDS_CLEAR capability, means we do not have to
 * act. Intel has said some CPU models immune to RFDS MAY NOT enumerate
 * RFDS_NO. If RFDS_NO is not set, but RFDS_CLEAR is, we must set x86_md_clear,
 * and make sure it's using VERW. Unlike MDS, RFDS can't be helped by the
 * MSR that L1D uses.
 *
 * The microcode features are enumerated as part of the IA32_ARCH_CAPABILITIES.
 * When bit 7 (IA32_ARCH_CAP_TSX_CTRL) is present, then we are given two
 * different powers. The first allows us to cause all transactions to
 * immediately abort.
 * The second gives us a means of disabling TSX completely,
 * which includes removing it from cpuid. If we have support for this in
 * microcode during the first cpuid pass, then we'll disable TSX completely such
 * that user land never has a chance to observe the bit. However, if we are late
 * loading the microcode, then we must use the functionality to cause
 * transactions to automatically abort. This is necessary for user land's sake.
 * Once a program sees a cpuid bit, it must not be taken away.
 *
 * We track whether or not we should do this based on what cpuid pass we're in.
 * Whenever we hit cpuid_scan_security() on the boot CPU and we're still on pass
 * 1 of the cpuid logic, then we can completely turn off TSX. Notably this
 * should happen twice. Once in the normal cpuid_pass_basic() code and then a
 * second time after we do the initial microcode update. As a result we need to
 * be careful in cpuid_apply_tsx() to only use the MSR if we've loaded a
 * suitable microcode on the current CPU (which happens prior to
 * cpuid_pass_ucode()).
 *
 * If TAA has been fixed, then it will be enumerated in IA32_ARCH_CAPABILITIES
 * as TAA_NO. In such a case, we will still disable TSX: it's proven to be an
 * unfortunate feature in a number of ways, and taking the opportunity to
 * finally be able to turn it off is likely to be of benefit in the future.
 *
 * SUMMARY
 *
 * The following table attempts to summarize the mitigations for various issues
 * and what's done in various places:
 *
 *  - Spectre v1: Not currently mitigated
 *  - swapgs: lfences after swapgs paths
 *  - Spectre v2: Retpolines/RSB Stuffing or eIBRS/AIBRS if HW support
 *  - Meltdown: Kernel Page Table Isolation
 *  - Spectre v3a: Updated CPU microcode
 *  - Spectre v4: Not currently mitigated
 *  - SpectreRSB: SMEP and RSB Stuffing
 *  - L1TF: spec_uarch_flush, SMT exclusion, requires microcode
 *  - MDS: x86_md_clear, requires microcode, disabling SMT
 *  - TAA: x86_md_clear and disabling SMT OR microcode and disabling TSX
 *  - RFDS: microcode with x86_md_clear if RFDS_CLEAR set and RFDS_NO not.
 *  - BHI: software sequence, and use of BHI_DIS_S if microcode has it.
 *
 * The following table indicates the x86 feature set bits that indicate that a
 * given problem has been solved or a notable feature is present:
 *
 *  - RDCL_NO: Meltdown, L1TF, MSBDS subset of MDS
 *  - MDS_NO: All forms of MDS
 *  - TAA_NO: TAA
 *  - RFDS_NO: RFDS
 *  - BHI_NO: BHI
 */

#include <sys/types.h>
#include <sys/archsystm.h>
#include <sys/x86_archext.h>
#include <sys/kmem.h>
#include <sys/systm.h>
#include <sys/cmn_err.h>
#include <sys/sunddi.h>
#include <sys/sunndi.h>
#include <sys/cpuvar.h>
#include <sys/processor.h>
#include <sys/sysmacros.h>
#include <sys/pg.h>
#include <sys/fp.h>
#include <sys/controlregs.h>
#include <sys/bitmap.h>
#include <sys/auxv_386.h>
#include <sys/memnode.h>
#include <sys/pci_cfgspace.h>
#include <sys/comm_page.h>
#include <sys/mach_mmu.h>
#include <sys/ucode.h>
#include <sys/tsc.h>
#include <sys/kobj.h>
#include <sys/asm_misc.h>
#include <sys/bitmap.h>

#ifdef __xpv
#include <sys/hypervisor.h>
#else
#include <sys/ontrap.h>
#endif

uint_t x86_vendor = X86_VENDOR_IntelClone;
uint_t x86_type = X86_TYPE_OTHER;
uint_t x86_clflush_size = 0;

#if defined(__xpv)
int x86_use_pcid = 0;
int x86_use_invpcid = 0;
#else
int x86_use_pcid = -1;
int x86_use_invpcid = -1;
#endif

typedef enum {
	X86_SPECTREV2_RETPOLINE,
	X86_SPECTREV2_ENHANCED_IBRS,
	X86_SPECTREV2_AUTO_IBRS,
	X86_SPECTREV2_DISABLED
} x86_spectrev2_mitigation_t;

uint_t x86_disable_spectrev2 = 0;
static x86_spectrev2_mitigation_t x86_spectrev2_mitigation =
    X86_SPECTREV2_RETPOLINE;

/*
 * The mitigation status for TAA:
 * X86_TAA_NOTHING -- no mitigation available for TAA side-channels
 * X86_TAA_DISABLED -- mitigation disabled via x86_disable_taa
 * X86_TAA_MD_CLEAR -- MDS mitigation also suffices for TAA
 * X86_TAA_TSX_FORCE_ABORT -- transactions are forced to abort
 * X86_TAA_TSX_DISABLE -- force abort transactions and hide from CPUID
 * X86_TAA_HW_MITIGATED -- TSX potentially active but H/W not TAA-vulnerable
 */
typedef enum {
	X86_TAA_NOTHING,
	X86_TAA_DISABLED,
	X86_TAA_MD_CLEAR,
	X86_TAA_TSX_FORCE_ABORT,
	X86_TAA_TSX_DISABLE,
	X86_TAA_HW_MITIGATED
} x86_taa_mitigation_t;

uint_t x86_disable_taa = 0;
static x86_taa_mitigation_t x86_taa_mitigation = X86_TAA_NOTHING;

uint_t pentiumpro_bug4046376;

uchar_t x86_featureset[BT_SIZEOFMAP(NUM_X86_FEATURES)];

static char *x86_feature_names[NUM_X86_FEATURES] = {
	"lgpg",
	"tsc",
	"msr",
	"mtrr",
	"pge",
	"de",
	"cmov",
	"mmx",
	"mca",
	"pae",
	"cv8",
	"pat",
	"sep",
	"sse",
	"sse2",
	"htt",
	"asysc",
	"nx",
	"sse3",
	"cx16",
	"cmp",
	"tscp",
	"mwait",
	"sse4a",
	"cpuid",
	"ssse3",
	"sse4_1",
	"sse4_2",
	"1gpg",
	"clfsh",
	"64",
	"aes",
	"pclmulqdq",
	"xsave",
	"avx",
	"vmx",
	"svm",
	"topoext",
	"f16c",
	"rdrand",
	"x2apic",
	"avx2",
	"bmi1",
	"bmi2",
	"fma",
	"smep",
	"smap",
	"adx",
	"rdseed",
	"mpx",
	"avx512f",
	"avx512dq",
	"avx512pf",
	"avx512er",
	"avx512cd",
	"avx512bw",
	"avx512vl",
	"avx512fma",
	"avx512vbmi",
	"avx512_vpopcntdq",
	"avx512_4vnniw",
	"avx512_4fmaps",
	"xsaveopt",
	"xsavec",
	"xsaves",
	"sha",
	"umip",
	"pku",
	"ospke",
	"pcid",
	"invpcid",
	"ibrs",
	"ibpb",
	"stibp",
	"ssbd",
	"ssbd_virt",
	"rdcl_no",
	"ibrs_all",
	"rsba",
	"ssb_no",
	"stibp_all",
	"flush_cmd",
	"l1d_vmentry_no",
	"fsgsbase",
	"clflushopt",
	"clwb",
	"monitorx",
	"clzero",
	"xop",
	"fma4",
	"tbm",
	"avx512_vnni",
	"amd_pcec",
	"md_clear",
	"mds_no",
	"core_thermal",
	"pkg_thermal",
	"tsx_ctrl",
	"taa_no",
	"ppin",
	"vaes",
	"vpclmulqdq",
	"lfence_serializing",
	"gfni",
	"avx512_vp2intersect",
	"avx512_bitalg",
	"avx512_vbmi2",
	"avx512_bf16",
	"auto_ibrs",
	"rfds_no",
	"rfds_clear",
	"pbrsb_no",
	"bhi_no",
	"bhi_clear"
};

boolean_t
is_x86_feature(void *featureset, uint_t feature)
{
	ASSERT(feature < NUM_X86_FEATURES);
	return (BT_TEST((ulong_t *)featureset, feature));
}

void
add_x86_feature(void *featureset, uint_t feature)
{
	ASSERT(feature < NUM_X86_FEATURES);
	BT_SET((ulong_t *)featureset, feature);
}

void
remove_x86_feature(void *featureset, uint_t feature)
{
	ASSERT(feature < NUM_X86_FEATURES);
	BT_CLEAR((ulong_t *)featureset, feature);
}

boolean_t
compare_x86_featureset(void *setA, void *setB)
{
	/*
	 * We assume
	 * that the unused bits of the bitmap are always zero.
	 */
	if (memcmp(setA, setB, BT_SIZEOFMAP(NUM_X86_FEATURES)) == 0) {
		return (B_TRUE);
	} else {
		return (B_FALSE);
	}
}

void
print_x86_featureset(void *featureset)
{
	uint_t i;

	for (i = 0; i < NUM_X86_FEATURES; i++) {
		if (is_x86_feature(featureset, i)) {
			cmn_err(CE_CONT, "?x86_feature: %s\n",
			    x86_feature_names[i]);
		}
	}
}

/* Note: This is the maximum size for the CPU, not the size of the structure. */
static size_t xsave_state_size = 0;
uint64_t xsave_bv_all = (XFEATURE_LEGACY_FP | XFEATURE_SSE);
boolean_t xsave_force_disable = B_FALSE;
extern int disable_smap;

/*
 * This is set to the platform type we are running on.
 */
static int platform_type = -1;

#if !defined(__xpv)
/*
 * Variable to patch if hypervisor platform detection needs to be
 * disabled (e.g. platform_type will always be HW_NATIVE if this is 0).
 */
int enable_platform_detection = 1;
#endif

/*
 * monitor/mwait info.
 *
 * size_actual and buf_actual are the real address and size allocated to get
 * proper mwait_buf alignment. buf_actual and size_actual should be passed
 * to kmem_free(). Currently kmem_alloc() and mwait happen to both use
 * processor cache-line alignment, but this is not guaranteed in the future.
 */
struct mwait_info {
	size_t		mon_min;	/* min size to avoid missed wakeups */
	size_t		mon_max;	/* size to avoid false wakeups */
	size_t		size_actual;	/* size actually allocated */
	void		*buf_actual;	/* memory actually allocated */
	uint32_t	support;	/* processor support of monitor/mwait */
};

/*
 * xsave/xrstor info.
 *
 * This structure contains HW feature bits and the size of the xsave save area.
 * Note: the kernel declares a fixed size (AVX_XSAVE_SIZE) structure
 * (xsave_state) to describe the xsave layout. However, at runtime the
 * per-lwp xsave area is dynamically allocated based on xsav_max_size. The
 * xsave_state structure simply represents the legacy layout of the beginning
 * of the xsave area.
 */
struct xsave_info {
	uint32_t	xsav_hw_features_low;	/* Supported HW features */
	uint32_t	xsav_hw_features_high;	/* Supported HW features */
	size_t		xsav_max_size;	/* max size save area for HW features */
	size_t		ymm_size;	/* AVX: size of ymm save area */
	size_t		ymm_offset;	/* AVX: offset for ymm save area */
	size_t		bndregs_size;	/* MPX: size of bndregs save area */
	size_t		bndregs_offset;	/* MPX: offset for bndregs save area */
	size_t		bndcsr_size;	/* MPX: size of bndcsr save area */
	size_t		bndcsr_offset;	/* MPX: offset for bndcsr save area */
	size_t		opmask_size;	/* AVX512: size of opmask save */
	size_t		opmask_offset;	/* AVX512: offset for opmask save */
	size_t		zmmlo_size;	/* AVX512: size of zmm 256 save */
	size_t		zmmlo_offset;	/* AVX512: offset for zmm 256 save */
	size_t		zmmhi_size;	/* AVX512: size of zmm hi reg save */
	size_t		zmmhi_offset;	/* AVX512: offset for zmm hi reg save */
};


/*
 * These constants determine how many of the elements of the
 * cpuid we cache in the cpuid_info data structure; the
 * remaining elements are accessible via the cpuid instruction.
 */

#define	NMAX_CPI_STD	8		/* eax = 0 .. 7 */
#define	NMAX_CPI_EXTD	0x22		/* eax = 0x80000000 .. 0x80000021 */
#define	NMAX_CPI_TOPO	0x10		/* Sanity check on leaf 8X26, 1F */

/*
 * See the big theory statement for a more detailed explanation of what some of
 * these members mean.
 */
struct cpuid_info {
	uint_t cpi_pass;		/* last pass completed */
	/*
	 * standard function information
	 */
	uint_t cpi_maxeax;		/* fn 0: %eax */
	char cpi_vendorstr[13];		/* fn 0: %ebx:%ecx:%edx */
	uint_t cpi_vendor;		/* enum of cpi_vendorstr */

	uint_t cpi_family;		/* fn 1: extended family */
	uint_t cpi_model;		/* fn 1: extended model */
	uint_t cpi_step;		/* fn 1: stepping */
	chipid_t cpi_chipid;		/* fn 1: %ebx: Intel: chip # */
					/*		AMD: package/socket # */
	uint_t cpi_brandid;		/* fn 1: %ebx: brand ID */
	int cpi_clogid;			/* fn 1: %ebx: thread # */
	uint_t cpi_ncpu_per_chip;	/* fn 1: %ebx: logical cpu count */
	uint8_t cpi_cacheinfo[16];	/* fn 2: intel-style cache desc */
	uint_t cpi_ncache;		/* fn 2: number of elements */
	uint_t cpi_ncpu_shr_last_cache;	/* fn 4: %eax: ncpus sharing cache */
	id_t cpi_last_lvl_cacheid;	/* fn 4: %eax: derived cache id */
	uint_t cpi_cache_leaf_size;	/* Number of cache elements */
					/* Intel fn: 4, AMD fn: 8000001d */
	struct cpuid_regs **cpi_cache_leaves;	/* Actual leaves from above */
	struct cpuid_regs cpi_std[NMAX_CPI_STD];	/* 0 .. 7 */
	struct cpuid_regs cpi_sub7[2];	/* Leaf 7, sub-leaves 1-2 */
	/*
	 * extended function information
	 */
	uint_t cpi_xmaxeax;		/* fn 0x80000000: %eax */
	char cpi_brandstr[49];		/* fn 0x8000000[234] */
	uint8_t cpi_pabits;		/* fn 0x80000006: %eax */
	uint8_t cpi_vabits;		/* fn 0x80000006: %eax */
	uint8_t cpi_fp_amd_save;	/* AMD: FP error pointer save rqd.
					*/
	struct cpuid_regs cpi_extd[NMAX_CPI_EXTD];	/* 0x800000XX */

	id_t cpi_coreid;		/* same coreid => strands share core */
	int cpi_pkgcoreid;		/* core number within single package */
	uint_t cpi_ncore_per_chip;	/* AMD: fn 0x80000008: %ecx[7-0] */
					/* Intel: fn 4: %eax[31-26] */

	/*
	 * These values represent the number of bits that are required to store
	 * information about the number of cores and threads.
	 */
	uint_t cpi_ncore_bits;
	uint_t cpi_nthread_bits;
	/*
	 * supported feature information
	 */
	uint32_t cpi_support[6];
#define	STD_EDX_FEATURES	0
#define	AMD_EDX_FEATURES	1
#define	TM_EDX_FEATURES		2
#define	STD_ECX_FEATURES	3
#define	AMD_ECX_FEATURES	4
#define	STD_EBX_FEATURES	5
	/*
	 * Synthesized information, where known.
	 */
	x86_chiprev_t cpi_chiprev;	/* See X86_CHIPREV_* in x86_archext.h */
	const char *cpi_chiprevstr;	/* May be NULL if chiprev unknown */
	uint32_t cpi_socket;		/* Chip package/socket type */
	x86_uarchrev_t cpi_uarchrev;	/* Microarchitecture and revision */

	struct mwait_info cpi_mwait;	/* fn 5: monitor/mwait info */
	uint32_t cpi_apicid;
	uint_t cpi_procnodeid;		/* AMD: nodeID on HT, Intel: chipid */
	uint_t cpi_procnodes_per_pkg;	/* AMD: # of nodes in the package */
					/* Intel: 1 */
	uint_t cpi_compunitid;		/* AMD: ComputeUnit ID, Intel: coreid */
	uint_t cpi_cores_per_compunit;	/* AMD: # of cores in the ComputeUnit */

	struct xsave_info cpi_xsave;	/* fn D: xsave/xrstor info */

	/*
	 * AMD and Intel extended topology information. Leaf 8X26 (AMD) and
	 * eventually leaf 0x1F (Intel).
	 */
	uint_t cpi_topo_nleaves;
	struct cpuid_regs cpi_topo[NMAX_CPI_TOPO];
};


static struct cpuid_info cpuid_info0;

/*
 * These bit fields are defined by the Intel Application Note AP-485
 * "Intel Processor Identification and the CPUID Instruction"
 */
#define	CPI_FAMILY_XTD(cpi)	BITX((cpi)->cpi_std[1].cp_eax, 27, 20)
#define	CPI_MODEL_XTD(cpi)	BITX((cpi)->cpi_std[1].cp_eax, 19, 16)
#define	CPI_TYPE(cpi)		BITX((cpi)->cpi_std[1].cp_eax, 13, 12)
#define	CPI_FAMILY(cpi)		BITX((cpi)->cpi_std[1].cp_eax, 11, 8)
#define	CPI_STEP(cpi)		BITX((cpi)->cpi_std[1].cp_eax, 3, 0)
#define	CPI_MODEL(cpi)		BITX((cpi)->cpi_std[1].cp_eax, 7, 4)

#define	CPI_FEATURES_EDX(cpi)		((cpi)->cpi_std[1].cp_edx)
#define	CPI_FEATURES_ECX(cpi)		((cpi)->cpi_std[1].cp_ecx)
#define	CPI_FEATURES_XTD_EDX(cpi)	((cpi)->cpi_extd[1].cp_edx)
#define	CPI_FEATURES_XTD_ECX(cpi)	((cpi)->cpi_extd[1].cp_ecx)
#define	CPI_FEATURES_7_0_EBX(cpi)	((cpi)->cpi_std[7].cp_ebx)
#define	CPI_FEATURES_7_0_ECX(cpi)	((cpi)->cpi_std[7].cp_ecx)
#define	CPI_FEATURES_7_0_EDX(cpi)	((cpi)->cpi_std[7].cp_edx)
#define	CPI_FEATURES_7_1_EAX(cpi)	((cpi)->cpi_sub7[0].cp_eax)
#define	CPI_FEATURES_7_2_EDX(cpi)	((cpi)->cpi_sub7[1].cp_edx)

/* fn 1 %ebx: [7:0] brand ID, [15:8] clflush chunks, [23:16] count, [31:24] APIC */
#define	CPI_BRANDID(cpi)	BITX((cpi)->cpi_std[1].cp_ebx, 7, 0)
#define	CPI_CHUNKS(cpi)		BITX((cpi)->cpi_std[1].cp_ebx, 15, 8)
#define	CPI_CPU_COUNT(cpi)	BITX((cpi)->cpi_std[1].cp_ebx, 23, 16)
#define	CPI_APIC_ID(cpi)	BITX((cpi)->cpi_std[1].cp_ebx, 31, 24)

#define	CPI_MAXEAX_MAX		0x100		/* sanity control */
#define	CPI_XMAXEAX_MAX		0x80000100
#define	CPI_FN4_ECX_MAX		0x20		/* sanity: max fn 4 levels */
#define	CPI_FNB_ECX_MAX		0x20		/* sanity: max fn B levels */

/*
 * Function 4 (Deterministic Cache Parameters) macros
 * Defined by Intel Application Note AP-485
 */
#define	CPI_NUM_CORES(regs)		BITX((regs)->cp_eax, 31, \
	26)
#define	CPI_NTHR_SHR_CACHE(regs)	BITX((regs)->cp_eax, 25, 14)
#define	CPI_FULL_ASSOC_CACHE(regs)	BITX((regs)->cp_eax, 9, 9)
#define	CPI_SELF_INIT_CACHE(regs)	BITX((regs)->cp_eax, 8, 8)
#define	CPI_CACHE_LVL(regs)		BITX((regs)->cp_eax, 7, 5)
#define	CPI_CACHE_TYPE(regs)		BITX((regs)->cp_eax, 4, 0)
#define	CPI_CACHE_TYPE_DONE	0
#define	CPI_CACHE_TYPE_DATA	1
#define	CPI_CACHE_TYPE_INSTR	2
#define	CPI_CACHE_TYPE_UNIFIED	3
#define	CPI_CPU_LEVEL_TYPE(regs)	BITX((regs)->cp_ecx, 15, 8)

#define	CPI_CACHE_WAYS(regs)		BITX((regs)->cp_ebx, 31, 22)
#define	CPI_CACHE_PARTS(regs)		BITX((regs)->cp_ebx, 21, 12)
#define	CPI_CACHE_COH_LN_SZ(regs)	BITX((regs)->cp_ebx, 11, 0)

#define	CPI_CACHE_SETS(regs)		BITX((regs)->cp_ecx, 31, 0)

#define	CPI_PREFCH_STRIDE(regs)		BITX((regs)->cp_edx, 9, 0)


/*
 * A couple of shorthand macros to identify "later" P6-family chips
 * like the Pentium M and Core. First, the "older" P6-based stuff
 * (loosely defined as "pre-Pentium-4"):
 * P6, PII, Mobile PII, PII Xeon, PIII, Mobile PIII, PIII Xeon
 */
#define	IS_LEGACY_P6(cpi) (			\
	cpi->cpi_family == 6 &&			\
		(cpi->cpi_model == 1 ||		\
		cpi->cpi_model == 3 ||		\
		cpi->cpi_model == 5 ||		\
		cpi->cpi_model == 6 ||		\
		cpi->cpi_model == 7 ||		\
		cpi->cpi_model == 8 ||		\
		cpi->cpi_model == 0xA ||	\
		cpi->cpi_model == 0xB)		\
)

/* A "new F6" is everything with family 6 that's not the above */
#define	IS_NEW_F6(cpi) ((cpi->cpi_family == 6) && !IS_LEGACY_P6(cpi))

/* Extended family/model support */
#define	IS_EXTENDED_MODEL_INTEL(cpi) (cpi->cpi_family == 0x6 || \
	cpi->cpi_family >= 0xf)

/*
 * Info for monitor/mwait idle loop.
 *
 * See cpuid section of "Intel 64 and IA-32 Architectures Software Developer's
 * Manual Volume 2A: Instruction Set Reference, A-M" #25366-022US, November
 * 2006.
 * See MONITOR/MWAIT section of "AMD64 Architecture Programmer's Manual
 * Documentation Updates" #33633, Rev 2.05, December 2006.
 */
#define	MWAIT_SUPPORT		(0x00000001)	/* mwait supported */
#define	MWAIT_EXTENSIONS	(0x00000002)	/* extension supported */
#define	MWAIT_ECX_INT_ENABLE	(0x00000004)	/* ecx 1 extension supported */
#define	MWAIT_SUPPORTED(cpi)	((cpi)->cpi_std[1].cp_ecx & CPUID_INTC_ECX_MON)
#define	MWAIT_INT_ENABLE(cpi)	((cpi)->cpi_std[5].cp_ecx & 0x2)
#define	MWAIT_EXTENSION(cpi)	((cpi)->cpi_std[5].cp_ecx & 0x1)
#define	MWAIT_SIZE_MIN(cpi)	BITX((cpi)->cpi_std[5].cp_eax, 15, 0)
#define	MWAIT_SIZE_MAX(cpi)	BITX((cpi)->cpi_std[5].cp_ebx, 15, 0)
/*
 * Number of sub-cstates for a given c-state.
 */
#define	MWAIT_NUM_SUBC_STATES(cpi, c_state)			\
	BITX((cpi)->cpi_std[5].cp_edx, (c_state) + 3, (c_state))

/*
 * XSAVE leaf 0xD enumeration
 */
#define	CPUID_LEAFD_2_YMM_OFFSET	576
#define	CPUID_LEAFD_2_YMM_SIZE		256

/*
 * Common extended leaf names to cut down on typos.
 */
#define	CPUID_LEAF_EXT_0		0x80000000
#define	CPUID_LEAF_EXT_8		0x80000008
#define	CPUID_LEAF_EXT_1d		0x8000001d
#define	CPUID_LEAF_EXT_1e		0x8000001e
#define	CPUID_LEAF_EXT_21		0x80000021
#define	CPUID_LEAF_EXT_26		0x80000026

/*
 * Functions we consume from cpuid_subr.c; don't publish these in a header
 * file to try and keep people using the expected cpuid_* interfaces.
 */
extern uint32_t _cpuid_skt(uint_t, uint_t, uint_t, uint_t);
extern const char *_cpuid_sktstr(uint_t, uint_t, uint_t, uint_t);
extern x86_chiprev_t _cpuid_chiprev(uint_t, uint_t, uint_t, uint_t);
extern const char *_cpuid_chiprevstr(uint_t, uint_t, uint_t, uint_t);
extern x86_uarchrev_t _cpuid_uarchrev(uint_t, uint_t, uint_t, uint_t);
extern uint_t _cpuid_vendorstr_to_vendorcode(char *);

/*
 * Apply various platform-dependent restrictions where the
 * underlying platform restrictions mean the CPU can be marked
 * as less capable than its cpuid instruction would imply.
 */
#if defined(__xpv)
static void
platform_cpuid_mangle(uint_t vendor, uint32_t eax, struct cpuid_regs *cp)
{
	switch (eax) {
	case 1: {
		uint32_t mcamask = DOMAIN_IS_INITDOMAIN(xen_info) ?
		    0 : CPUID_INTC_EDX_MCA;
		cp->cp_edx &=
		    ~(mcamask |
		    CPUID_INTC_EDX_PSE |
		    CPUID_INTC_EDX_VME | CPUID_INTC_EDX_DE |
		    CPUID_INTC_EDX_SEP | CPUID_INTC_EDX_MTRR |
		    CPUID_INTC_EDX_PGE | CPUID_INTC_EDX_PAT |
		    CPUID_AMD_EDX_SYSC | CPUID_INTC_EDX_SEP |
		    CPUID_INTC_EDX_PSE36 | CPUID_INTC_EDX_HTT);
		break;
	}

	case 0x80000001:
		cp->cp_edx &=
		    ~(CPUID_AMD_EDX_PSE |
		    CPUID_INTC_EDX_VME | CPUID_INTC_EDX_DE |
		    CPUID_AMD_EDX_MTRR | CPUID_AMD_EDX_PGE |
		    CPUID_AMD_EDX_PAT | CPUID_AMD_EDX_PSE36 |
		    CPUID_AMD_EDX_SYSC | CPUID_INTC_EDX_SEP |
		    CPUID_AMD_EDX_TSCP);
		cp->cp_ecx &= ~CPUID_AMD_ECX_CMP_LGCY;
		break;
	default:
		break;
	}

	switch (vendor) {
	case X86_VENDOR_Intel:
		switch (eax) {
		case 4:
			/*
			 * Zero out the (ncores-per-chip - 1) field
			 */
			cp->cp_eax &= 0x03fffffff;
			break;
		default:
			break;
		}
		break;
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		switch (eax) {

		case 0x80000001:
			cp->cp_ecx &= ~CPUID_AMD_ECX_CR8D;
			break;

		case CPUID_LEAF_EXT_8:
			/*
			 * Zero out the (ncores-per-chip - 1) field
			 */
			cp->cp_ecx &= 0xffffff00;
			break;
		default:
			break;
		}
		break;
	default:
		break;
	}
}
#else
#define	platform_cpuid_mangle(vendor, eax, cp)	/* nothing */
#endif

/*
 * Some undocumented ways of patching the results of the cpuid
 * instruction to permit running Solaris 10 on future cpus that
 * we don't currently support.  Could be set to non-zero values
 * via settings in eeprom.
 */

uint32_t cpuid_feature_ecx_include;
uint32_t cpuid_feature_ecx_exclude;
uint32_t cpuid_feature_edx_include;
uint32_t cpuid_feature_edx_exclude;

/*
 * Allocate space for mcpu_cpi in the machcpu structure for all non-boot CPUs.
 */
void
cpuid_alloc_space(cpu_t *cpu)
{
	/*
	 * By convention, cpu0 is the boot cpu, which is set up
	 * before memory allocation is available.  All other cpus get
	 * their cpuid_info struct allocated here.
	 */
	ASSERT(cpu->cpu_id != 0);
	ASSERT(cpu->cpu_m.mcpu_cpi == NULL);
	cpu->cpu_m.mcpu_cpi =
	    kmem_zalloc(sizeof (*cpu->cpu_m.mcpu_cpi), KM_SLEEP);
}

void
cpuid_free_space(cpu_t *cpu)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	int i;

	ASSERT(cpi != NULL);
	ASSERT(cpi != &cpuid_info0);

	/*
	 * Free up any cache leaf related dynamic storage. The first entry was
	 * cached from the standard cpuid storage, so we should not free it.
	 */
	for (i = 1; i < cpi->cpi_cache_leaf_size; i++)
		kmem_free(cpi->cpi_cache_leaves[i], sizeof (struct cpuid_regs));
	if (cpi->cpi_cache_leaf_size > 0)
		kmem_free(cpi->cpi_cache_leaves,
		    cpi->cpi_cache_leaf_size * sizeof (struct cpuid_regs *));

	kmem_free(cpi, sizeof (*cpi));
	cpu->cpu_m.mcpu_cpi = NULL;
}

#if !defined(__xpv)
/*
 * Determine the type of the underlying platform. This is used to customize
 * initialization of various subsystems (e.g. TSC). determine_platform() must
 * only ever be called once to prevent two processors from seeing different
 * values of platform_type. Must be called before cpuid_pass_ident(), the
 * earliest consumer to execute; the identification pass will call
 * synth_amd_info() to compute the chiprev, which in turn calls get_hwenv().
 */
void
determine_platform(void)
{
	struct cpuid_regs cp;
	uint32_t base;
	uint32_t regs[4];
	char *hvstr = (char *)regs;

	ASSERT(platform_type == -1);

	platform_type = HW_NATIVE;

	if (!enable_platform_detection)
		return;

	/*
	 * If Hypervisor CPUID bit is set, try to determine hypervisor
	 * vendor signature, and set platform type accordingly.
	 *
	 * References:
	 * http://lkml.org/lkml/2008/10/1/246
	 * http://kb.vmware.com/kb/1009458
	 */
	cp.cp_eax = 0x1;
	(void) __cpuid_insn(&cp);
	if ((cp.cp_ecx & CPUID_INTC_ECX_HV) != 0) {
		cp.cp_eax = 0x40000000;
		(void) __cpuid_insn(&cp);
		regs[0] = cp.cp_ebx;
		regs[1] = cp.cp_ecx;
		regs[2] = cp.cp_edx;
		regs[3] = 0;
		if (strcmp(hvstr, HVSIG_XEN_HVM) == 0) {
			platform_type = HW_XEN_HVM;
			return;
		}
		if (strcmp(hvstr, HVSIG_VMWARE) == 0) {
			platform_type = HW_VMWARE;
			return;
		}
		if (strcmp(hvstr, HVSIG_KVM) == 0) {
			platform_type = HW_KVM;
			return;
		}
		if (strcmp(hvstr, HVSIG_BHYVE) == 0) {
			platform_type = HW_BHYVE;
			return;
		}
		if (strcmp(hvstr, HVSIG_MICROSOFT) == 0) {
			platform_type = HW_MICROSOFT;
			return;
		}
		if (strcmp(hvstr, HVSIG_QEMU_TCG) == 0) {
			platform_type = HW_QEMU_TCG;
			return;
		}
		if (strcmp(hvstr, HVSIG_VIRTUALBOX) == 0) {
			platform_type = HW_VIRTUALBOX;
			return;
		}
		if (strcmp(hvstr, HVSIG_ACRN) == 0) {
			platform_type = HW_ACRN;
			return;
		}
	} else {
		/*
		 * Check older VMware hardware versions. VMware hypervisor is
		 * detected by performing an IN operation to VMware hypervisor
		 * port and checking that value returned in %ebx is VMware
		 * hypervisor magic value.
		 *
		 * References: http://kb.vmware.com/kb/1009458
		 */
		vmware_port(VMWARE_HVCMD_GETVERSION, regs);
		if (regs[1] == VMWARE_HVMAGIC) {
			platform_type = HW_VMWARE;
			return;
		}
	}

	/*
	 * Check Xen hypervisor. In a fully virtualized domain,
	 * Xen's pseudo-cpuid function returns a string representing the
	 * Xen signature in %ebx, %ecx, and %edx. %eax contains the maximum
	 * supported cpuid function. We need at least a (base + 2) leaf value
	 * to do what we want to do.
	 * Try different base values, since the
	 * hypervisor might use a different one depending on whether Hyper-V
	 * emulation is switched on by default or not.
	 */
	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
		cp.cp_eax = base;
		(void) __cpuid_insn(&cp);
		regs[0] = cp.cp_ebx;
		regs[1] = cp.cp_ecx;
		regs[2] = cp.cp_edx;
		regs[3] = 0;
		if (strcmp(hvstr, HVSIG_XEN_HVM) == 0 &&
		    cp.cp_eax >= (base + 2)) {
			platform_type &= ~HW_NATIVE;
			platform_type |= HW_XEN_HVM;
			return;
		}
	}
}

int
get_hwenv(void)
{
	ASSERT(platform_type != -1);
	return (platform_type);
}

int
is_controldom(void)
{
	return (0);
}

#else

int
get_hwenv(void)
{
	return (HW_XEN_PV);
}

int
is_controldom(void)
{
	return (DOMAIN_IS_INITDOMAIN(xen_info));
}

#endif	/* __xpv */

/*
 * Gather the extended topology information. This should be the same for both
 * AMD leaf 8X26 and Intel leaf 0x1F (though the data interpretation varies).
 */
static void
cpuid_gather_ext_topo_leaf(struct cpuid_info *cpi, uint32_t leaf)
{
	uint_t i;

	for (i = 0; i < ARRAY_SIZE(cpi->cpi_topo); i++) {
		struct cpuid_regs *regs = &cpi->cpi_topo[i];

		bzero(regs, sizeof (struct cpuid_regs));
		regs->cp_eax = leaf;
		regs->cp_ecx = i;

		(void) __cpuid_insn(regs);
		if (CPUID_AMD_8X26_ECX_TYPE(regs->cp_ecx) ==
		    CPUID_AMD_8X26_TYPE_DONE) {
			break;
		}
	}

	cpi->cpi_topo_nleaves = i;
}

/*
 * Make sure that we have gathered all of the CPUID leaves that we might need to
 * determine topology. We assume that the standard leaf 1 has already been done
 * and that xmaxeax has already been calculated.
 */
static void
cpuid_gather_amd_topology_leaves(cpu_t *cpu)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) {
		struct cpuid_regs *cp;

		cp = &cpi->cpi_extd[8];
		cp->cp_eax = CPUID_LEAF_EXT_8;
		(void) __cpuid_insn(cp);
		platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_8, cp);
	}

	if (is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
		struct cpuid_regs *cp;

		cp = &cpi->cpi_extd[0x1e];
		cp->cp_eax = CPUID_LEAF_EXT_1e;
		(void) __cpuid_insn(cp);
	}

	if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_26) {
		cpuid_gather_ext_topo_leaf(cpi, CPUID_LEAF_EXT_26);
	}
}

/*
 * Get the APIC ID for this processor. If Leaf B is present and valid, we prefer
 * it to everything else. If not, and we're on an AMD system where 8000001e is
 * valid, then we use that. Otherwise, we fall back to the default value for the
 * APIC ID in leaf 1.
 */
static uint32_t
cpuid_gather_apicid(struct cpuid_info *cpi)
{
	/*
	 * Leaf B changes based on the arguments to it. Because we don't cache
	 * it, we need to gather it again.
	 */
	if (cpi->cpi_maxeax >= 0xB) {
		struct cpuid_regs regs;
		struct cpuid_regs *cp;

		cp = &regs;
		cp->cp_eax = 0xB;
		cp->cp_edx = cp->cp_ebx = cp->cp_ecx = 0;
		(void) __cpuid_insn(cp);

		if (cp->cp_ebx != 0) {
			return (cp->cp_edx);
		}
	}

	if ((cpi->cpi_vendor == X86_VENDOR_AMD ||
	    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
		return (cpi->cpi_extd[0x1e].cp_eax);
	}

	return (CPI_APIC_ID(cpi));
}

/*
 * For AMD processors, attempt to calculate the number of chips and cores that
 * exist.
 * The way that we do this varies based on the generation, because the
 * generations themselves have changed dramatically.
 *
 * If cpuid leaf 0x80000008 exists, that generally tells us the number of cores.
 * However, with the advent of family 17h (Zen) it actually tells us the number
 * of threads, so we need to look at leaf 0x8000001e if available to determine
 * its value. Otherwise, for all prior families, the number of enabled cores is
 * the same as threads.
 *
 * If we do not have leaf 0x80000008, then we assume that this processor does
 * not have anything. AMD's older CPUID specification says there's no reason to
 * fall back to leaf 1.
 *
 * In some virtualization cases we will not have leaf 8000001e or it will be
 * zero. When that happens we assume the number of threads is one.
 */
static void
cpuid_amd_ncores(struct cpuid_info *cpi, uint_t *ncpus, uint_t *ncores)
{
	uint_t nthreads, nthread_per_core;

	nthreads = nthread_per_core = 1;

	if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) {
		nthreads = BITX(cpi->cpi_extd[8].cp_ecx, 7, 0) + 1;
	} else if ((cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_HTT) != 0) {
		nthreads = CPI_CPU_COUNT(cpi);
	}

	/*
	 * For us to have threads, and know about it, we have to be at least at
	 * family 17h and have the cpuid bit that says we have extended
	 * topology.
	 */
	if (cpi->cpi_family >= 0x17 &&
	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
		nthread_per_core = BITX(cpi->cpi_extd[0x1e].cp_ebx, 15, 8) + 1;
	}

	*ncpus = nthreads;
	*ncores = nthreads / nthread_per_core;
}

/*
 * Seed the initial values for the cores and threads for an Intel based
 * processor. These values will be overwritten if we detect that the processor
 * supports CPUID leaf 0xb.
 */
static void
cpuid_intel_ncores(struct cpuid_info *cpi, uint_t *ncpus, uint_t *ncores)
{
	/*
	 * Only seed the number of physical cores from the first level leaf 4
	 * information. The number of threads there indicate how many share the
	 * L1 cache, which may or may not have anything to do with the number of
	 * logical CPUs per core.
	 */
	if (cpi->cpi_maxeax >= 4) {
		*ncores = BITX(cpi->cpi_std[4].cp_eax, 31, 26) + 1;
	} else {
		*ncores = 1;
	}

	if ((cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_HTT) != 0) {
		*ncpus = CPI_CPU_COUNT(cpi);
	} else {
		*ncpus = *ncores;
	}
}

static boolean_t
cpuid_leafB_getids(cpu_t *cpu)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	struct cpuid_regs regs;
	struct cpuid_regs *cp;

	if (cpi->cpi_maxeax < 0xB)
		return (B_FALSE);

	cp = &regs;
	cp->cp_eax = 0xB;
	cp->cp_edx = cp->cp_ebx = cp->cp_ecx = 0;

	(void) __cpuid_insn(cp);

	/*
	 * Check CPUID.EAX=0BH, ECX=0H:EBX is non-zero, which
	 * indicates that the extended topology enumeration leaf is
	 * available.
	 */
	if (cp->cp_ebx != 0) {
		uint32_t x2apic_id = 0;
		uint_t coreid_shift = 0;
		uint_t ncpu_per_core = 1;
		uint_t chipid_shift = 0;
		uint_t ncpu_per_chip = 1;
		uint_t i;
		uint_t level;

		for (i = 0; i < CPI_FNB_ECX_MAX; i++) {
			cp->cp_eax = 0xB;
			cp->cp_ecx = i;

			(void) __cpuid_insn(cp);
			level = CPI_CPU_LEVEL_TYPE(cp);

			if (level == 1) {
				x2apic_id = cp->cp_edx;
				coreid_shift = BITX(cp->cp_eax, 4, 0);
				ncpu_per_core = BITX(cp->cp_ebx, 15, 0);
			} else if (level == 2) {
				x2apic_id = cp->cp_edx;
				chipid_shift = BITX(cp->cp_eax, 4, 0);
				ncpu_per_chip = BITX(cp->cp_ebx, 15, 0);
			}
		}

		/*
		 * cpi_apicid is taken care of in cpuid_gather_apicid.
		 */
		cpi->cpi_ncpu_per_chip = ncpu_per_chip;
		cpi->cpi_ncore_per_chip = ncpu_per_chip /
		    ncpu_per_core;
		cpi->cpi_chipid = x2apic_id >> chipid_shift;
		cpi->cpi_clogid = x2apic_id & ((1 << chipid_shift) - 1);
		cpi->cpi_coreid = x2apic_id >> coreid_shift;
		cpi->cpi_pkgcoreid = cpi->cpi_clogid >> coreid_shift;
		cpi->cpi_procnodeid = cpi->cpi_chipid;
		cpi->cpi_compunitid = cpi->cpi_coreid;

		if (coreid_shift > 0 && chipid_shift > coreid_shift) {
			cpi->cpi_nthread_bits = coreid_shift;
			cpi->cpi_ncore_bits = chipid_shift - coreid_shift;
		}

		return (B_TRUE);
	} else {
		return (B_FALSE);
	}
}

static void
cpuid_intel_getids(cpu_t *cpu, void *feature)
{
	uint_t i;
	uint_t chipid_shift = 0;
	uint_t coreid_shift = 0;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	/*
	 * There are no compute units or processor nodes currently on Intel.
	 * Always set these to one.
	 */
	cpi->cpi_procnodes_per_pkg = 1;
	cpi->cpi_cores_per_compunit = 1;

	/*
	 * If cpuid Leaf B is present, use that to try and get this information.
	 * It will be the most accurate for Intel CPUs.
	 */
	if (cpuid_leafB_getids(cpu))
		return;

	/*
	 * In this case, we have the leaf 1 and leaf 4 values for ncpu_per_chip
	 * and ncore_per_chip. These represent the largest power of two values
	 * that we need to cover all of the IDs in the system. Therefore, we use
	 * those values to seed the number of bits needed to cover information
	 * in the case when leaf B is not available. These values will probably
	 * be larger than required, but that's OK.
	 */
	cpi->cpi_nthread_bits = ddi_fls(cpi->cpi_ncpu_per_chip);
	cpi->cpi_ncore_bits = ddi_fls(cpi->cpi_ncore_per_chip);

	for (i = 1; i < cpi->cpi_ncpu_per_chip; i <<= 1)
		chipid_shift++;

	cpi->cpi_chipid = cpi->cpi_apicid >> chipid_shift;
	cpi->cpi_clogid = cpi->cpi_apicid & ((1 << chipid_shift) - 1);

	if (is_x86_feature(feature, X86FSET_CMP)) {
		/*
		 * Multi-core (and possibly multi-threaded)
		 * processors.
		 */
		uint_t ncpu_per_core = 0;

		if (cpi->cpi_ncore_per_chip == 1)
			ncpu_per_core = cpi->cpi_ncpu_per_chip;
		else if (cpi->cpi_ncore_per_chip > 1)
			ncpu_per_core = cpi->cpi_ncpu_per_chip /
			    cpi->cpi_ncore_per_chip;
		/*
		 * 8bit APIC IDs on dual core Pentiums
		 * look like this:
		 *
		 * +-----------------------+------+------+
		 * | Physical Package ID   |  MC  |  HT  |
		 * +-----------------------+------+------+
		 * <------- chipid -------->
		 * <------- coreid --------------->
		 *                         <--- clogid -->
		 *                         <------>
		 *                         pkgcoreid
		 *
		 * Where the number of bits necessary to
		 * represent MC and HT fields together equals
		 * to the minimum number of bits necessary to
		 * store the value of cpi->cpi_ncpu_per_chip.
		 * Of those bits, the MC part uses the number
		 * of bits necessary to store the value of
		 * cpi->cpi_ncore_per_chip.
		 */
		for (i = 1; i < ncpu_per_core; i <<= 1)
			coreid_shift++;
		cpi->cpi_coreid = cpi->cpi_apicid >> coreid_shift;
		cpi->cpi_pkgcoreid = cpi->cpi_clogid >> coreid_shift;
	} else if (is_x86_feature(feature, X86FSET_HTT)) {
		/*
		 * Single-core multi-threaded processors.
		 */
		cpi->cpi_coreid = cpi->cpi_chipid;
		cpi->cpi_pkgcoreid = 0;
	} else {
		/*
		 * Single-core single-thread processors.
		 */
		cpi->cpi_coreid = cpu->cpu_id;
		cpi->cpi_pkgcoreid = 0;
	}
	cpi->cpi_procnodeid = cpi->cpi_chipid;
	cpi->cpi_compunitid = cpi->cpi_coreid;
}

/*
 * Historically, AMD has had CMP chips with only a single thread per core.
 * However, starting in family 17h (Zen), this has changed and they now have
 * multiple threads. Our internal core id needs to be a unique value.
 *
 * To determine the core id of an AMD system, if we're from a family before 17h,
 * then we just use the cpu id, as that gives us a good value that will be
 * unique for each core. If instead, we're on family 17h or later, then we need
 * to do something more complicated. CPUID leaf 0x8000001e can tell us
 * how many threads are in the system. Based on that, we'll shift the APIC ID.
 * We can't use the normal core id in that leaf as it's only unique within the
 * socket, which is perfect for cpi_pkgcoreid, but not us.
 */
static id_t
cpuid_amd_get_coreid(cpu_t *cpu)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	if (cpi->cpi_family >= 0x17 &&
	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
		uint_t nthreads = BITX(cpi->cpi_extd[0x1e].cp_ebx, 15, 8) + 1;
		if (nthreads > 1) {
			VERIFY3U(nthreads, ==, 2);
			return (cpi->cpi_apicid >> 1);
		}
	}

	return (cpu->cpu_id);
}

/*
 * Determining IDs on AMD is a more challenging task. This is notable because
 * of the following two facts:
 *
 *  1. Before family 0x17 (Zen), there was no support for SMT and there was
 *     also no way to get an actual unique core id from the system. As such, we
 *     synthesize this case by using cpu->cpu_id.  This scheme does not,
 *     however, guarantee that sibling cores of a chip will have sequential
 *     coreids starting at a multiple of the number of cores per chip - that is
 *     usually the case, but if the APIC IDs have been set up in a different
 *     order then we need to perform a few more gymnastics for the pkgcoreid.
 *
 *  2. In families 0x15 and 0x16 (Bulldozer and co.) the cores came in groups
 *     called compute units. These compute units share the L1I cache, L2 cache,
 *     and the FPU. To deal with this, a new topology leaf was added in
 *     0x8000001e. However, parts of this leaf have different meanings
 *     once we get to family 0x17.
 */

static void
cpuid_amd_getids(cpu_t *cpu, uchar_t *features)
{
	int i, first_half, coreidsz;
	uint32_t nb_caps_reg;
	uint_t node2_1;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	struct cpuid_regs *cp;

	/*
	 * Calculate the core id (this comes from hardware in family 0x17 if it
	 * hasn't been stripped by virtualization). We always set the compute
	 * unit id to the same value. Also, initialize the default number of
	 * cores per compute unit and nodes per package. This will be
	 * overwritten when we know information about a particular family.
	 */
	cpi->cpi_coreid = cpuid_amd_get_coreid(cpu);
	cpi->cpi_compunitid = cpi->cpi_coreid;
	cpi->cpi_cores_per_compunit = 1;
	cpi->cpi_procnodes_per_pkg = 1;

	/*
	 * To construct the logical ID, we need to determine how many APIC IDs
	 * are dedicated to the cores and threads. This is provided for us in
	 * 0x80000008. However, if it's not present (say due to virtualization),
	 * then we assume it's one. This should be present on all 64-bit AMD
	 * processors.  It was added in family 0xf (Hammer).
	 */
	if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) {
		coreidsz = BITX((cpi)->cpi_extd[8].cp_ecx, 15, 12);

		/*
		 * In AMD parlance chip is really a node while illumos
		 * uses chip as equivalent to socket/package.
		 */
		if (coreidsz == 0) {
			/* Use legacy method */
			for (i = 1; i < cpi->cpi_ncore_per_chip; i <<= 1)
				coreidsz++;
			if (coreidsz == 0)
				coreidsz = 1;
		}
	} else {
		/* Assume single-core part */
		coreidsz = 1;
	}
	cpi->cpi_clogid = cpi->cpi_apicid & ((1 << coreidsz) - 1);

	/*
	 * The package core ID varies depending on the family. While it may be
	 * tempting to use the CPUID_LEAF_EXT_1e %ebx core id, unfortunately,
	 * this value is the core id in the given node. For non-virtualized
	 * family 17h, we need to take the logical core id and shift off the
	 * threads like we do when getting the core id.  Otherwise, we can use
	 * the clogid as is. When family 17h is virtualized, the clogid should
	 * be sufficient as if we don't have valid data in the leaf, then we
	 * won't think we have SMT, in which case the cpi_clogid should be
	 * sufficient.
	 */
	if (cpi->cpi_family >= 0x17 &&
	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e &&
	    cpi->cpi_extd[0x1e].cp_ebx != 0) {
		uint_t nthreads = BITX(cpi->cpi_extd[0x1e].cp_ebx, 15, 8) + 1;
		if (nthreads > 1) {
			VERIFY3U(nthreads, ==, 2);
			cpi->cpi_pkgcoreid = cpi->cpi_clogid >> 1;
		} else {
			cpi->cpi_pkgcoreid = cpi->cpi_clogid;
		}
	} else {
		cpi->cpi_pkgcoreid = cpi->cpi_clogid;
	}

	/*
	 * Obtain the node ID and compute unit IDs. If we're on family 0x15
	 * (bulldozer) or newer, then we can derive all of this from leaf
	 * CPUID_LEAF_EXT_1e. Otherwise, the method varies by family.
	 */
	if (is_x86_feature(x86_featureset, X86FSET_TOPOEXT) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1e) {
		cp = &cpi->cpi_extd[0x1e];

		cpi->cpi_procnodes_per_pkg = BITX(cp->cp_ecx, 10, 8) + 1;
		cpi->cpi_procnodeid = BITX(cp->cp_ecx, 7, 0);

		/*
		 * For Bulldozer-era CPUs, recalculate the compute unit
		 * information.
		 */
		if (cpi->cpi_family >= 0x15 && cpi->cpi_family < 0x17) {
			cpi->cpi_cores_per_compunit =
			    BITX(cp->cp_ebx, 15, 8) + 1;
			cpi->cpi_compunitid = BITX(cp->cp_ebx, 7, 0) +
			    (cpi->cpi_ncore_per_chip /
			    cpi->cpi_cores_per_compunit) *
			    (cpi->cpi_procnodeid /
			    cpi->cpi_procnodes_per_pkg);
		}
	} else if (cpi->cpi_family == 0xf || cpi->cpi_family >= 0x11) {
		cpi->cpi_procnodeid = (cpi->cpi_apicid >> coreidsz) & 7;
	} else if (cpi->cpi_family == 0x10) {
		/*
		 * See if we are a multi-node processor.
		 * All processors in the system have the same number of nodes
		 */
		nb_caps_reg = pci_getl_func(0, 24, 3, 0xe8);
		if ((cpi->cpi_model < 8) || BITX(nb_caps_reg, 29, 29) == 0) {
			/* Single-node */
			cpi->cpi_procnodeid = BITX(cpi->cpi_apicid, 5,
			    coreidsz);
		} else {

			/*
			 * Multi-node revision D (2 nodes per package
			 * are supported)
			 */
			cpi->cpi_procnodes_per_pkg = 2;

			first_half = (cpi->cpi_pkgcoreid <=
			    (cpi->cpi_ncore_per_chip/2 - 1));

			if (cpi->cpi_apicid == cpi->cpi_pkgcoreid) {
				/* We are BSP */
				cpi->cpi_procnodeid = (first_half ? 0 : 1);
			} else {

				/* We are AP */
				/* NodeId[2:1] bits to use for reading F3xe8 */
				node2_1 = BITX(cpi->cpi_apicid, 5, 4) << 1;

				nb_caps_reg =
				    pci_getl_func(0, 24 + node2_1, 3, 0xe8);

				/*
				 * Check IntNodeNum bit (31:30, but bit 31 is
				 * always 0 on dual-node processors)
				 */
				if (BITX(nb_caps_reg, 30, 30) == 0)
					cpi->cpi_procnodeid = node2_1 +
					    !first_half;
				else
					cpi->cpi_procnodeid = node2_1 +
					    first_half;
			}
		}
	} else {
		cpi->cpi_procnodeid = 0;
	}

	cpi->cpi_chipid =
	    cpi->cpi_procnodeid / cpi->cpi_procnodes_per_pkg;

	cpi->cpi_ncore_bits = coreidsz;
	cpi->cpi_nthread_bits = ddi_fls(cpi->cpi_ncpu_per_chip /
	    cpi->cpi_ncore_per_chip);
}

static void
spec_uarch_flush_noop(void)
{
}

/*
 * When microcode is present that mitigates MDS, this wrmsr will also flush the
 * MDS-related micro-architectural state that would normally happen by calling
 * x86_md_clear().
 */
static void
spec_uarch_flush_msr(void)
{
	wrmsr(MSR_IA32_FLUSH_CMD, IA32_FLUSH_CMD_L1D);
}

/*
 * This function points to a function that will flush certain
 * micro-architectural state on the processor. This flush is used to mitigate
 * three different classes of Intel CPU vulnerabilities: L1TF, MDS, and RFDS.
 * This function can point to one of three functions:
 *
 * - A noop which is done because we either are vulnerable, but do not have
 *   microcode available to help deal with a fix, or because we aren't
 *   vulnerable.
 *
 * - spec_uarch_flush_msr which will issue an L1D flush and if microcode to
 *   mitigate MDS is present, also perform the equivalent of the MDS flush;
 *   however, it only flushes the MDS related micro-architectural state on the
 *   current hyperthread, it does not do anything for the twin.
 *
 * - x86_md_clear which will flush the MDS related state. This is done when we
 *   have a processor that is vulnerable to MDS, but is not vulnerable to L1TF
 *   (RDCL_NO is set); or if the CPU is vulnerable to RFDS and indicates VERW
 *   can clear it (RFDS_CLEAR is set).
 */
void (*spec_uarch_flush)(void) = spec_uarch_flush_noop;

static void
cpuid_update_md_clear(cpu_t *cpu, uchar_t *featureset)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	/* Non-Intel doesn't concern us here. */
	if (cpi->cpi_vendor != X86_VENDOR_Intel)
		return;

	/*
	 * While RDCL_NO indicates that one of the MDS vulnerabilities (MSBDS)
	 * has been fixed in hardware, it doesn't cover everything related to
	 * MDS. Therefore we can only rely on MDS_NO to determine that we don't
	 * need to mitigate this.
	 *
	 * We must ALSO check the case of RFDS_NO and if RFDS_CLEAR is set,
	 * because of the small cases of RFDS.
	 */

	if ((!is_x86_feature(featureset, X86FSET_MDS_NO) &&
	    is_x86_feature(featureset, X86FSET_MD_CLEAR)) ||
	    (!is_x86_feature(featureset, X86FSET_RFDS_NO) &&
	    is_x86_feature(featureset, X86FSET_RFDS_CLEAR))) {
		const uint8_t nop = NOP_INSTR;
		uint8_t *md = (uint8_t *)x86_md_clear;

		*md = nop;
	}

	membar_producer();
}

static void
cpuid_update_l1d_flush(cpu_t *cpu, uchar_t *featureset)
{
	boolean_t need_l1d, need_mds, need_rfds;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	/*
	 * If we're not on Intel or we've mitigated all of RDCL, MDS, and RFDS
	 * in hardware, then there's nothing left for us to do for enabling
	 * the flush. We can also go ahead and say that SMT exclusion is
	 * unnecessary.
	 */
	if (cpi->cpi_vendor != X86_VENDOR_Intel ||
	    (is_x86_feature(featureset, X86FSET_RDCL_NO) &&
	    is_x86_feature(featureset, X86FSET_MDS_NO) &&
	    is_x86_feature(featureset, X86FSET_RFDS_NO))) {
		extern int smt_exclusion;
		smt_exclusion = 0;
		spec_uarch_flush = spec_uarch_flush_noop;
		membar_producer();
		return;
	}

	/*
	 * The locations where we need to perform an L1D flush are required both
	 * for mitigating L1TF and MDS. When verw support is present in
	 * microcode, then the L1D flush will take care of doing that as well.
	 * However, if we have a system where RDCL_NO is present, but we don't
	 * have MDS_NO, then we need to do a verw (x86_md_clear) and not a full
	 * L1D flush.
	 */
	if (!is_x86_feature(featureset, X86FSET_RDCL_NO) &&
	    is_x86_feature(featureset, X86FSET_FLUSH_CMD) &&
	    !is_x86_feature(featureset, X86FSET_L1D_VM_NO)) {
		need_l1d = B_TRUE;
	} else {
		need_l1d = B_FALSE;
	}

	if (!is_x86_feature(featureset, X86FSET_MDS_NO) &&
	    is_x86_feature(featureset, X86FSET_MD_CLEAR)) {
		need_mds = B_TRUE;
	} else {
		need_mds = B_FALSE;
	}

	if (!is_x86_feature(featureset, X86FSET_RFDS_NO) &&
	    is_x86_feature(featureset, X86FSET_RFDS_CLEAR)) {
		need_rfds = B_TRUE;
	} else {
		need_rfds = B_FALSE;
	}

	if (need_l1d) {
		/*
		 * As of Feb, 2024, no CPU needs L1D *and* RFDS mitigation
		 * together. If the following VERIFY trips, we need to add
		 * further fixes here.
		 */
		VERIFY(!need_rfds);
		spec_uarch_flush = spec_uarch_flush_msr;
	} else if (need_mds || need_rfds) {
		spec_uarch_flush = x86_md_clear;
	} else {
		/*
		 * We have no hardware mitigations available to us.
		 */
		spec_uarch_flush = spec_uarch_flush_noop;
	}
	membar_producer();
}

/*
 * Branch History Injection (BHI) mitigations.
 *
 * Intel has provided a software sequence that will scrub the BHB. Like RSB
 * (below) we can scribble a return at the beginning to avoid it if the CPU
 * is modern enough. We can also scribble a return if the CPU is old enough
 * to not have an RSB (pre-eIBRS).
 */
typedef enum {
	X86_BHI_TOO_OLD_OR_DISABLED,	/* Pre-eIBRS or disabled */
	X86_BHI_NEW_ENOUGH,		/* AMD, or Intel with BHI_NO set */
	X86_BHI_DIS_S,			/* BHI_NO == 0, but BHI_DIS_S avail. */
	/* NOTE: BHI_DIS_S above will still need the software sequence. */
	X86_BHI_SOFTWARE_SEQUENCE,	/* Use software sequence */
} x86_native_bhi_mitigation_t;

x86_native_bhi_mitigation_t x86_bhi_mitigation = X86_BHI_SOFTWARE_SEQUENCE;

static void
cpuid_enable_bhi_dis_s(void)
{
	uint64_t val;

	val = rdmsr(MSR_IA32_SPEC_CTRL);
	val |= IA32_SPEC_CTRL_BHI_DIS_S;
	wrmsr(MSR_IA32_SPEC_CTRL, val);
}

/*
 * This function scribbles RET into the first instruction of x86_bhb_clear()
 * if SPECTREV2 mitigations are disabled, the CPU is too old, the CPU is new
 * enough to fix (which includes non-Intel CPUs), or the CPU has an explicit
 * disable-Branch-History control.
 */
static x86_native_bhi_mitigation_t
cpuid_learn_and_patch_bhi(x86_spectrev2_mitigation_t v2mit, cpu_t *cpu,
    uchar_t *featureset)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	const uint8_t ret = RET_INSTR;
	uint8_t *bhb_clear = (uint8_t *)x86_bhb_clear;

	ASSERT0(cpu->cpu_id);

	/* First check for explicitly disabled... */
	if (v2mit == X86_SPECTREV2_DISABLED) {
		*bhb_clear = ret;
		return (X86_BHI_TOO_OLD_OR_DISABLED);
	}

	/*
	 * Then check for BHI_NO, which means the CPU doesn't have this bug,
	 * or if it's non-Intel, in which case this mitigation mechanism
	 * doesn't apply.
	 */
	if (cpi->cpi_vendor != X86_VENDOR_Intel ||
	    is_x86_feature(featureset, X86FSET_BHI_NO)) {
		*bhb_clear = ret;
		return (X86_BHI_NEW_ENOUGH);
	}

	/*
	 * Now check for the BHI_CTRL MSR, and then set it if available.
	 * We will still need to use the software sequence, however.
	 */
	if (is_x86_feature(featureset, X86FSET_BHI_CTRL)) {
		cpuid_enable_bhi_dis_s();
		return (X86_BHI_DIS_S);
	}

	/*
	 * Finally, check if we are too old to bother with RSB:
	 */
	if (v2mit == X86_SPECTREV2_RETPOLINE) {
		*bhb_clear = ret;
		return (X86_BHI_TOO_OLD_OR_DISABLED);
	}

	ASSERT(*bhb_clear != ret);
	return (X86_BHI_SOFTWARE_SEQUENCE);
}

/*
 * We default to enabling Return Stack Buffer (RSB) mitigations.
 *
 * We used to skip RSB mitigations with Intel eIBRS, but developments around
 * post-barrier RSB (PBRSB) guessing suggest we should enable Intel RSB
 * mitigations always unless explicitly bypassed, or unless hardware indicates
 * the bug has been fixed.
 *
 * The current decisions for using, or ignoring, a RSB software stuffing
 * sequence are expressed by the following table:
 *
 * +-------+------------+-----------------+--------+
 * | eIBRS | PBRSB_NO   | context switch  | vmexit |
 * +-------+------------+-----------------+--------+
 * | Yes   | No         | stuff           | stuff  |
 * | Yes   | Yes        | ignore          | ignore |
 * | No    | No         | stuff           | ignore |
 * +-------+------------+-----------------+--------+
 *
 * Note that if an Intel CPU has no eIBRS, it will never enumerate PBRSB_NO,
 * because machines with no eIBRS do not have a problem with PBRSB overflow.
 * See the Intel document cited below for details.
 *
 * Also note that AMD AUTO_IBRS has no PBRSB problem, so it is not included in
 * the table above, and that there is no situation where vmexit stuffing is
 * needed, but context-switch stuffing isn't.
 */

/* BEGIN CSTYLED */
/*
 * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html
 */
/* END CSTYLED */

/*
 * AMD indicates that when Automatic IBRS is enabled we do not need to implement
 * return stack buffer clearing for VMEXIT as it takes care of it. The manual
 * also states that as long as SMEP is enabled and we maintain at least one page
 * between the kernel and user space (we have much more of a red zone), then we
 * do not need to clear the RSB. We constrain this to only when Automatic IBRS
 * is present.
 */
static void
cpuid_patch_rsb(x86_spectrev2_mitigation_t mit, bool intel_pbrsb_no)
{
	const uint8_t ret = RET_INSTR;
	uint8_t *stuff = (uint8_t *)x86_rsb_stuff;
	uint8_t *vmx_stuff = (uint8_t *)x86_rsb_stuff_vmexit;

	switch (mit) {
	case X86_SPECTREV2_AUTO_IBRS:
	case X86_SPECTREV2_DISABLED:
		/* Don't bother with any RSB stuffing! */
		*stuff = ret;
		*vmx_stuff = ret;
		break;
	case X86_SPECTREV2_RETPOLINE:
		/*
		 * The Intel document on Post-Barrier RSB says that processors
		 * without eIBRS do not have PBRSB problems upon VMEXIT.
		 */
		VERIFY(!intel_pbrsb_no);
		VERIFY3U(*stuff, !=, ret);
		*vmx_stuff = ret;
		break;
	default:
		/*
		 * eIBRS is all that's left. If CPU claims PBRSB is fixed,
		 * don't use the RSB mitigation in either case. Otherwise
		 * both vmexit and context-switching require the software
		 * mitigation.
		 */
		if (intel_pbrsb_no) {
			/* CPU claims PBRSB problems are fixed.
			 */
			*stuff = ret;
			*vmx_stuff = ret;
		}
		VERIFY3U(*stuff, ==, *vmx_stuff);
		break;
	}
}

static void
cpuid_patch_retpolines(x86_spectrev2_mitigation_t mit)
{
	const char *thunks[] = { "_rax", "_rbx", "_rcx", "_rdx", "_rdi",
	    "_rsi", "_rbp", "_r8", "_r9", "_r10", "_r11", "_r12", "_r13",
	    "_r14", "_r15" };
	const uint_t nthunks = ARRAY_SIZE(thunks);
	const char *type;
	uint_t i;

	if (mit == x86_spectrev2_mitigation)
		return;

	switch (mit) {
	case X86_SPECTREV2_RETPOLINE:
		type = "gen";
		break;
	case X86_SPECTREV2_AUTO_IBRS:
	case X86_SPECTREV2_ENHANCED_IBRS:
	case X86_SPECTREV2_DISABLED:
		type = "jmp";
		break;
	default:
		panic("asked to update retpoline state with unknown state!");
	}

	for (i = 0; i < nthunks; i++) {
		uintptr_t source, dest;
		int ssize, dsize;
		char sourcebuf[64], destbuf[64];

		(void) snprintf(destbuf, sizeof (destbuf),
		    "__x86_indirect_thunk%s", thunks[i]);
		(void) snprintf(sourcebuf, sizeof (sourcebuf),
		    "__x86_indirect_thunk_%s%s", type, thunks[i]);

		source = kobj_getelfsym(sourcebuf, NULL, &ssize);
		dest = kobj_getelfsym(destbuf, NULL, &dsize);
		VERIFY3U(source, !=, 0);
		VERIFY3U(dest, !=, 0);
		VERIFY3S(dsize, >=, ssize);
		bcopy((void *)source, (void *)dest, ssize);
	}
}

static void
cpuid_enable_enhanced_ibrs(void)
{
	uint64_t val;

	val = rdmsr(MSR_IA32_SPEC_CTRL);
	val |= IA32_SPEC_CTRL_IBRS;
	wrmsr(MSR_IA32_SPEC_CTRL, val);
}

static void
cpuid_enable_auto_ibrs(void)
{
	uint64_t val;

	val = rdmsr(MSR_AMD_EFER);
	val |= AMD_EFER_AIBRSE;
	wrmsr(MSR_AMD_EFER, val);
}

/*
 * Determine how we should mitigate TAA or if we need to. Regardless of TAA, if
 * we can disable TSX, we do so.
 *
 * This determination is done only on the boot CPU, potentially after loading
 * updated microcode.
 */
static void
cpuid_update_tsx(cpu_t *cpu, uchar_t *featureset)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	VERIFY(cpu->cpu_id == 0);

	if (cpi->cpi_vendor != X86_VENDOR_Intel) {
		x86_taa_mitigation = X86_TAA_HW_MITIGATED;
		return;
	}

	if (x86_disable_taa) {
		x86_taa_mitigation = X86_TAA_DISABLED;
		return;
	}

	/*
	 * If we do not have the ability to disable TSX, then our only
	 * mitigation options are in hardware (TAA_NO), or by using our existing
	 * MDS mitigation as described above. The latter relies upon us having
	 * configured MDS mitigations correctly! This includes disabling SMT if
	 * we want cross-CPU-thread protection.
	 */
	if (!is_x86_feature(featureset, X86FSET_TSX_CTRL)) {
		/*
		 * It's not clear whether any parts will enumerate TAA_NO
		 * *without* TSX_CTRL, but let's mark it as such if we see this.
		 */
		if (is_x86_feature(featureset, X86FSET_TAA_NO)) {
			x86_taa_mitigation = X86_TAA_HW_MITIGATED;
			return;
		}

		if (is_x86_feature(featureset, X86FSET_MD_CLEAR) &&
		    !is_x86_feature(featureset, X86FSET_MDS_NO)) {
			x86_taa_mitigation = X86_TAA_MD_CLEAR;
		} else {
			x86_taa_mitigation = X86_TAA_NOTHING;
		}
		return;
	}

	/*
	 * We have TSX_CTRL, but we can only fully disable TSX if we're early
	 * enough in boot.
	 *
	 * Otherwise, we'll fall back to causing transactions to abort as our
	 * mitigation. TSX-using code will always take the fallback path.
	 */
	if (cpi->cpi_pass < 4) {
		x86_taa_mitigation = X86_TAA_TSX_DISABLE;
	} else {
		x86_taa_mitigation = X86_TAA_TSX_FORCE_ABORT;
	}
}

/*
 * As mentioned, we should only touch the MSR when we've got a suitable
 * microcode loaded on this CPU.
 */
static void
cpuid_apply_tsx(x86_taa_mitigation_t taa, uchar_t *featureset)
{
	uint64_t val;

	switch (taa) {
	case X86_TAA_TSX_DISABLE:
		if (!is_x86_feature(featureset, X86FSET_TSX_CTRL))
			return;
		val = rdmsr(MSR_IA32_TSX_CTRL);
		val |= IA32_TSX_CTRL_CPUID_CLEAR | IA32_TSX_CTRL_RTM_DISABLE;
		wrmsr(MSR_IA32_TSX_CTRL, val);
		break;
	case X86_TAA_TSX_FORCE_ABORT:
		if (!is_x86_feature(featureset, X86FSET_TSX_CTRL))
			return;
		val = rdmsr(MSR_IA32_TSX_CTRL);
		val |= IA32_TSX_CTRL_RTM_DISABLE;
		wrmsr(MSR_IA32_TSX_CTRL, val);
		break;
	case X86_TAA_HW_MITIGATED:
	case X86_TAA_MD_CLEAR:
	case X86_TAA_DISABLED:
	case X86_TAA_NOTHING:
		break;
	}
}

static void
cpuid_scan_security(cpu_t *cpu, uchar_t *featureset)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	x86_spectrev2_mitigation_t v2mit;

	if ((cpi->cpi_vendor == X86_VENDOR_AMD ||
	    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) {
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_IBPB)
			add_x86_feature(featureset, X86FSET_IBPB);
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_IBRS)
			add_x86_feature(featureset, X86FSET_IBRS);
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_STIBP)
			add_x86_feature(featureset, X86FSET_STIBP);
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_STIBP_ALL)
			add_x86_feature(featureset, X86FSET_STIBP_ALL);
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_SSBD)
			add_x86_feature(featureset, X86FSET_SSBD);
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_VIRT_SSBD)
			add_x86_feature(featureset, X86FSET_SSBD_VIRT);
		if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_SSB_NO)
			add_x86_feature(featureset, X86FSET_SSB_NO);

		/*
		 * Rather than Enhanced IBRS, AMD has a different feature that
		 * is a bit in EFER that can be enabled and will basically do
		 * the right thing
		 * while executing in the kernel.
		 */
		if (cpi->cpi_vendor == X86_VENDOR_AMD &&
		    (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_PREFER_IBRS) &&
		    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_21 &&
		    (cpi->cpi_extd[0x21].cp_eax & CPUID_AMD_8X21_EAX_AIBRS)) {
			add_x86_feature(featureset, X86FSET_AUTO_IBRS);
		}

	} else if (cpi->cpi_vendor == X86_VENDOR_Intel &&
	    cpi->cpi_maxeax >= 7) {
		struct cpuid_regs *ecp;
		ecp = &cpi->cpi_std[7];

		if (ecp->cp_edx & CPUID_INTC_EDX_7_0_MD_CLEAR) {
			add_x86_feature(featureset, X86FSET_MD_CLEAR);
		}

		if (ecp->cp_edx & CPUID_INTC_EDX_7_0_SPEC_CTRL) {
			add_x86_feature(featureset, X86FSET_IBRS);
			add_x86_feature(featureset, X86FSET_IBPB);
		}

		if (ecp->cp_edx & CPUID_INTC_EDX_7_0_STIBP) {
			add_x86_feature(featureset, X86FSET_STIBP);
		}

		/*
		 * Some prediction controls are enumerated by subleaf 2 of
		 * leaf 7.
		 */
		if (CPI_FEATURES_7_2_EDX(cpi) & CPUID_INTC_EDX_7_2_BHI_CTRL) {
			add_x86_feature(featureset, X86FSET_BHI_CTRL);
		}

		/*
		 * Don't read the arch caps MSR on xpv where we lack the
		 * on_trap().
		 */
#ifndef __xpv
		if (ecp->cp_edx & CPUID_INTC_EDX_7_0_ARCH_CAPS) {
			on_trap_data_t otd;

			/*
			 * Be paranoid and assume we'll get a #GP.
			 */
			if (!on_trap(&otd, OT_DATA_ACCESS)) {
				uint64_t reg;

				reg = rdmsr(MSR_IA32_ARCH_CAPABILITIES);
				if (reg & IA32_ARCH_CAP_RDCL_NO) {
					add_x86_feature(featureset,
					    X86FSET_RDCL_NO);
				}
				if (reg & IA32_ARCH_CAP_IBRS_ALL) {
					add_x86_feature(featureset,
					    X86FSET_IBRS_ALL);
				}
				if (reg & IA32_ARCH_CAP_RSBA) {
					add_x86_feature(featureset,
					    X86FSET_RSBA);
				}
				if (reg & IA32_ARCH_CAP_SKIP_L1DFL_VMENTRY) {
					add_x86_feature(featureset,
					    X86FSET_L1D_VM_NO);
				}
				if (reg & IA32_ARCH_CAP_SSB_NO) {
					add_x86_feature(featureset,
					    X86FSET_SSB_NO);
				}
				if (reg & IA32_ARCH_CAP_MDS_NO) {
					add_x86_feature(featureset,
					    X86FSET_MDS_NO);
				}
				if (reg & IA32_ARCH_CAP_TSX_CTRL) {
					add_x86_feature(featureset,
					    X86FSET_TSX_CTRL);
				}
				if (reg & IA32_ARCH_CAP_TAA_NO) {
					add_x86_feature(featureset,
					    X86FSET_TAA_NO);
				}
				if (reg & IA32_ARCH_CAP_RFDS_NO) {
					add_x86_feature(featureset,
					    X86FSET_RFDS_NO);
				}
				if (reg & IA32_ARCH_CAP_RFDS_CLEAR) {
					add_x86_feature(featureset,
					    X86FSET_RFDS_CLEAR);
				}
				if (reg & IA32_ARCH_CAP_PBRSB_NO) {
					add_x86_feature(featureset,
					    X86FSET_PBRSB_NO);
				}
				if (reg & IA32_ARCH_CAP_BHI_NO) {
					add_x86_feature(featureset,
					    X86FSET_BHI_NO);
				}
			}
			no_trap();
		}
#endif	/* !__xpv */

		if (ecp->cp_edx & CPUID_INTC_EDX_7_0_SSBD)
			add_x86_feature(featureset, X86FSET_SSBD);

		if (ecp->cp_edx & CPUID_INTC_EDX_7_0_FLUSH_CMD)
			add_x86_feature(featureset, X86FSET_FLUSH_CMD);
	}

	/*
	 * Take care of certain mitigations on the non-boot CPU. The boot CPU
	 * will have already run this function and determined what we need to
	 * do. This gives us a hook for per-HW thread mitigations such as
	 * enhanced IBRS, or disabling TSX.
	 */
	if (cpu->cpu_id != 0) {
		switch (x86_spectrev2_mitigation) {
		case X86_SPECTREV2_ENHANCED_IBRS:
			cpuid_enable_enhanced_ibrs();
			break;
		case X86_SPECTREV2_AUTO_IBRS:
			cpuid_enable_auto_ibrs();
			break;
		default:
			break;
		}

		/* If we're committed to BHI_DIS_S, set it for this core. */
		if (x86_bhi_mitigation == X86_BHI_DIS_S)
			cpuid_enable_bhi_dis_s();

		cpuid_apply_tsx(x86_taa_mitigation, featureset);
		return;
	}

	/*
	 * Go through and initialize various security mechanisms that we should
	 * only do on a single CPU. This includes Spectre V2, L1TF, MDS, and
	 * TAA.
	 */

	/*
	 * By default we've come in with retpolines enabled. Check whether we
	 * should disable them or enable enhanced or automatic IBRS.
	 *
	 * Note, we do not allow the use of AMD optimized retpolines as it was
	 * disclosed by AMD in March 2022 that they were still
	 * vulnerable. Prior to that point, we used them.
	 */
	if (x86_disable_spectrev2 != 0) {
		v2mit = X86_SPECTREV2_DISABLED;
	} else if (is_x86_feature(featureset, X86FSET_AUTO_IBRS)) {
		cpuid_enable_auto_ibrs();
		v2mit = X86_SPECTREV2_AUTO_IBRS;
	} else if (is_x86_feature(featureset, X86FSET_IBRS_ALL)) {
		cpuid_enable_enhanced_ibrs();
		v2mit = X86_SPECTREV2_ENHANCED_IBRS;
	} else {
		v2mit = X86_SPECTREV2_RETPOLINE;
	}

	cpuid_patch_retpolines(v2mit);
	cpuid_patch_rsb(v2mit, is_x86_feature(featureset, X86FSET_PBRSB_NO));
	x86_bhi_mitigation = cpuid_learn_and_patch_bhi(v2mit, cpu, featureset);
	x86_spectrev2_mitigation = v2mit;
	membar_producer();

	/*
	 * We need to determine what changes are required for mitigating L1TF
	 * and MDS. If the CPU suffers from either of them, then SMT exclusion
	 * is required.
	 *
	 * If any of these are present, then we need to flush u-arch state at
	 * various points.
	 * For MDS, we need to do so whenever we change to a
	 * lesser privilege level or we are halting the CPU. For L1TF we need to
	 * flush the L1D cache at VM entry. When we have microcode that handles
	 * MDS, the L1D flush also clears the other u-arch state that the
	 * md_clear does.
	 */

	/*
	 * Update whether or not we need to be taking explicit action against
	 * MDS or RFDS.
	 */
	cpuid_update_md_clear(cpu, featureset);

	/*
	 * Determine whether SMT exclusion is required and whether or not we
	 * need to perform an l1d flush.
	 */
	cpuid_update_l1d_flush(cpu, featureset);

	/*
	 * Determine what our mitigation strategy should be for TAA and then
	 * also apply TAA mitigations.
	 */
	cpuid_update_tsx(cpu, featureset);
	cpuid_apply_tsx(x86_taa_mitigation, featureset);
}

/*
 * Setup XFeature_Enabled_Mask register. Required by xsave feature.
 */
void
setup_xfem(void)
{
	uint64_t flags = XFEATURE_LEGACY_FP;

	ASSERT(is_x86_feature(x86_featureset, X86FSET_XSAVE));

	if (is_x86_feature(x86_featureset, X86FSET_SSE))
		flags |= XFEATURE_SSE;

	if (is_x86_feature(x86_featureset, X86FSET_AVX))
		flags |= XFEATURE_AVX;

	if (is_x86_feature(x86_featureset, X86FSET_AVX512F))
		flags |= XFEATURE_AVX512;

	set_xcr(XFEATURE_ENABLED_MASK, flags);

	xsave_bv_all = flags;
}

static void
cpuid_basic_topology(cpu_t *cpu, uchar_t *featureset)
{
	struct cpuid_info *cpi;

	cpi = cpu->cpu_m.mcpu_cpi;

	if (cpi->cpi_vendor == X86_VENDOR_AMD ||
	    cpi->cpi_vendor == X86_VENDOR_HYGON) {
		cpuid_gather_amd_topology_leaves(cpu);
	}

	cpi->cpi_apicid = cpuid_gather_apicid(cpi);

	/*
	 * Before we can calculate the IDs that we should assign to this
	 * processor, we need to understand how many cores and threads it has.
	 */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		cpuid_intel_ncores(cpi, &cpi->cpi_ncpu_per_chip,
		    &cpi->cpi_ncore_per_chip);
		break;
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		cpuid_amd_ncores(cpi, &cpi->cpi_ncpu_per_chip,
		    &cpi->cpi_ncore_per_chip);
		break;
	default:
		/*
		 * If we have some other x86 compatible chip, it's not clear how
		 * they would behave. The most common case is virtualization
		 * today, though there are also 64-bit VIA chips. Assume that
		 * all we can get is the basic Leaf 1 HTT information.
		 */
		if ((cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_HTT) != 0) {
			cpi->cpi_ncore_per_chip = 1;
			cpi->cpi_ncpu_per_chip = CPI_CPU_COUNT(cpi);
		}
		break;
	}

	/*
	 * Based on the calculated number of threads and cores, potentially
	 * assign the HTT and CMP features.
	 */
	if (cpi->cpi_ncore_per_chip > 1) {
		add_x86_feature(featureset, X86FSET_CMP);
	}

	if (cpi->cpi_ncpu_per_chip > 1 &&
	    cpi->cpi_ncpu_per_chip != cpi->cpi_ncore_per_chip) {
		add_x86_feature(featureset, X86FSET_HTT);
	}

	/*
	 * Now that has been set up, we need to go through and calculate all of
	 * the rest of the parameters that exist. If we think the CPU doesn't
	 * have either SMT (HTT) or CMP, then we basically go through and fake
	 * up information in some way. The most likely case for this is
	 * virtualization where we have a lot of partial topology information.
	 */
	if (!is_x86_feature(featureset, X86FSET_HTT) &&
	    !is_x86_feature(featureset, X86FSET_CMP)) {
		/*
		 * This is a single core, single-threaded processor.
		 */
		cpi->cpi_procnodes_per_pkg = 1;
		cpi->cpi_cores_per_compunit = 1;
		cpi->cpi_compunitid = 0;
		cpi->cpi_chipid = -1;
		cpi->cpi_clogid = 0;
		cpi->cpi_coreid = cpu->cpu_id;
		cpi->cpi_pkgcoreid = 0;
		if (cpi->cpi_vendor == X86_VENDOR_AMD ||
		    cpi->cpi_vendor == X86_VENDOR_HYGON) {
			cpi->cpi_procnodeid = BITX(cpi->cpi_apicid, 3, 0);
		} else {
			cpi->cpi_procnodeid = cpi->cpi_chipid;
		}
	} else {
		switch (cpi->cpi_vendor) {
		case X86_VENDOR_Intel:
			cpuid_intel_getids(cpu, featureset);
			break;
		case X86_VENDOR_AMD:
		case X86_VENDOR_HYGON:
			cpuid_amd_getids(cpu, featureset);
			break;
		default:
			/*
			 * In this case, it's hard to say what we should do.
			 * We're going to model them to the OS as single core
			 * threads. We don't have a good identifier for them, so
			 * we're just going to use the cpu id all on a single
			 * chip.
			 *
			 * This case has historically been different from the
			 * case above where we don't have HTT or CMP. While they
			 * could be combined, we've opted to keep it separate to
			 * minimize the risk of topology changes in weird cases.
			 */
			cpi->cpi_procnodes_per_pkg = 1;
			cpi->cpi_cores_per_compunit = 1;
			cpi->cpi_chipid = 0;
			cpi->cpi_coreid = cpu->cpu_id;
			cpi->cpi_clogid = cpu->cpu_id;
			cpi->cpi_pkgcoreid = cpu->cpu_id;
			cpi->cpi_procnodeid = cpi->cpi_chipid;
			cpi->cpi_compunitid = cpi->cpi_coreid;
			break;
		}
	}
}

/*
 * Gather relevant CPU features from leaf 6 which covers thermal information. We
 * always gather leaf 6 if it's supported; however, we only look for features on
 * Intel systems as AMD does not currently define any of the features we look
 * for below.
 */
static void
cpuid_basic_thermal(cpu_t *cpu, uchar_t *featureset)
{
	struct cpuid_regs *cp;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	if (cpi->cpi_maxeax < 6) {
		return;
	}

	cp = &cpi->cpi_std[6];
	cp->cp_eax = 6;
	cp->cp_ebx = cp->cp_ecx = cp->cp_edx = 0;
	(void) __cpuid_insn(cp);
	platform_cpuid_mangle(cpi->cpi_vendor, 6, cp);

	if (cpi->cpi_vendor != X86_VENDOR_Intel) {
		return;
	}

	if ((cp->cp_eax & CPUID_INTC_EAX_DTS) != 0) {
		add_x86_feature(featureset, X86FSET_CORE_THERMAL);
	}

	if ((cp->cp_eax & CPUID_INTC_EAX_PTM) != 0) {
		add_x86_feature(featureset, X86FSET_PKG_THERMAL);
	}
}

/*
 * This is used when we discover that we have AVX support in cpuid. This
 * proceeds to scan for the rest of the AVX derived features.
 */
static void
cpuid_basic_avx(cpu_t *cpu, uchar_t *featureset)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	/*
	 * If we don't have AVX, don't bother with most of this.
	 */
	if ((cpi->cpi_std[1].cp_ecx & CPUID_INTC_ECX_AVX) == 0)
		return;

	add_x86_feature(featureset, X86FSET_AVX);

	/*
	 * Intel says we can't check these without also
	 * checking AVX.
	 */
	if (cpi->cpi_std[1].cp_ecx & CPUID_INTC_ECX_F16C)
		add_x86_feature(featureset, X86FSET_F16C);

	if (cpi->cpi_std[1].cp_ecx & CPUID_INTC_ECX_FMA)
		add_x86_feature(featureset, X86FSET_FMA);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_BMI1)
		add_x86_feature(featureset, X86FSET_BMI1);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_BMI2)
		add_x86_feature(featureset, X86FSET_BMI2);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX2)
		add_x86_feature(featureset, X86FSET_AVX2);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_VAES)
		add_x86_feature(featureset, X86FSET_VAES);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_VPCLMULQDQ)
		add_x86_feature(featureset, X86FSET_VPCLMULQDQ);

	/*
	 * The rest of the AVX features require AVX512. Do not check them unless
	 * it is present.
	 */
	if ((cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512F) == 0)
		return;
	add_x86_feature(featureset, X86FSET_AVX512F);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512DQ)
		add_x86_feature(featureset, X86FSET_AVX512DQ);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512IFMA)
		add_x86_feature(featureset, X86FSET_AVX512FMA);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512PF)
		add_x86_feature(featureset, X86FSET_AVX512PF);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512ER)
		add_x86_feature(featureset, X86FSET_AVX512ER);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512CD)
		add_x86_feature(featureset, X86FSET_AVX512CD);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512BW)
		add_x86_feature(featureset, X86FSET_AVX512BW);

	if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_AVX512VL)
		add_x86_feature(featureset, X86FSET_AVX512VL);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VBMI)
		add_x86_feature(featureset,
		    X86FSET_AVX512VBMI);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VBMI2)
		add_x86_feature(featureset, X86FSET_AVX512_VBMI2);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VNNI)
		add_x86_feature(featureset, X86FSET_AVX512VNNI);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512BITALG)
		add_x86_feature(featureset, X86FSET_AVX512_BITALG);

	if (cpi->cpi_std[7].cp_ecx & CPUID_INTC_ECX_7_0_AVX512VPOPCDQ)
		add_x86_feature(featureset, X86FSET_AVX512VPOPCDQ);

	if (cpi->cpi_std[7].cp_edx & CPUID_INTC_EDX_7_0_AVX5124NNIW)
		add_x86_feature(featureset, X86FSET_AVX512NNIW);

	if (cpi->cpi_std[7].cp_edx & CPUID_INTC_EDX_7_0_AVX5124FMAPS)
		add_x86_feature(featureset, X86FSET_AVX512FMAPS);

	/*
	 * More features here are in Leaf 7, subleaf 1. Don't bother checking if
	 * we don't need to.
	 */
	if (cpi->cpi_std[7].cp_eax < 1)
		return;

	if (cpi->cpi_sub7[0].cp_eax & CPUID_INTC_EAX_7_1_AVX512_BF16)
		add_x86_feature(featureset, X86FSET_AVX512_BF16);
}

/*
 * PPIN is the protected processor inventory number. On AMD this is an actual
 * feature bit. However, on Intel systems we need to read the platform
 * information MSR if we're on a specific model.
 */
#if !defined(__xpv)
static void
cpuid_basic_ppin(cpu_t *cpu, uchar_t *featureset)
{
	on_trap_data_t otd;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_AMD:
		/*
		 * This leaf will have already been gathered in the topology
		 * functions.
		 */
		if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8) {
			if (cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_PPIN) {
				add_x86_feature(featureset, X86FSET_PPIN);
			}
		}
		break;
	case X86_VENDOR_Intel:
		if (cpi->cpi_family != 6)
			break;
		switch (cpi->cpi_model) {
		case INTC_MODEL_IVYBRIDGE_XEON:
		case INTC_MODEL_HASWELL_XEON:
		case INTC_MODEL_BROADWELL_XEON:
		case INTC_MODEL_BROADWELL_XEON_D:
		case INTC_MODEL_SKYLAKE_XEON:
		case INTC_MODEL_ICELAKE_XEON:
			if (!on_trap(&otd, OT_DATA_ACCESS)) {
				uint64_t value;

				value = rdmsr(MSR_PLATFORM_INFO);
				if ((value & MSR_PLATFORM_INFO_PPIN) != 0) {
					add_x86_feature(featureset,
					    X86FSET_PPIN);
				}
			}
			no_trap();
			break;
		default:
			break;
		}
		break;
	default:
		break;
	}
}
#endif	/* ! __xpv */

static void
cpuid_pass_prelude(cpu_t *cpu, void *arg)
{
	uchar_t *featureset = (uchar_t *)arg;

	/*
	 * We don't run on any processor that doesn't have cpuid, and could not
	 * possibly have arrived here.
	 */
	add_x86_feature(featureset, X86FSET_CPUID);
}

static void
cpuid_pass_ident(cpu_t *cpu, void *arg __unused)
{
	struct cpuid_info *cpi;
	struct cpuid_regs *cp;

	/*
	 * We require that virtual/native detection be complete and that PCI
	 * config space access has been set up; at present there is no reliable
	 * way to determine the latter.
	 */
#if !defined(__xpv)
	ASSERT3S(platform_type, !=, -1);
#endif	/* !__xpv */

	cpi = cpu->cpu_m.mcpu_cpi;
	ASSERT(cpi != NULL);

	cp = &cpi->cpi_std[0];
	cp->cp_eax = 0;
	cpi->cpi_maxeax = __cpuid_insn(cp);
	{
		uint32_t *iptr = (uint32_t *)cpi->cpi_vendorstr;
		*iptr++ = cp->cp_ebx;
		*iptr++ = cp->cp_edx;
		*iptr++ = cp->cp_ecx;
		*(char *)&cpi->cpi_vendorstr[12] = '\0';
	}

	cpi->cpi_vendor = _cpuid_vendorstr_to_vendorcode(cpi->cpi_vendorstr);
	x86_vendor = cpi->cpi_vendor; /* for compatibility */

	/*
	 * Limit the range in case of weird hardware
	 */
	if (cpi->cpi_maxeax > CPI_MAXEAX_MAX)
		cpi->cpi_maxeax = CPI_MAXEAX_MAX;
	if (cpi->cpi_maxeax < 1)
		return;

	cp = &cpi->cpi_std[1];
	cp->cp_eax = 1;
	(void) __cpuid_insn(cp);

	/*
	 * Extract identifying constants for easy access.
	 */
	cpi->cpi_model = CPI_MODEL(cpi);
	cpi->cpi_family = CPI_FAMILY(cpi);

	if (cpi->cpi_family == 0xf)
		cpi->cpi_family += CPI_FAMILY_XTD(cpi);

	/*
	 * Beware: AMD uses "extended model" iff base *FAMILY* == 0xf.
	 * Intel, and presumably everyone else, uses model == 0xf, as
	 * one would expect (max value means possible overflow). Sigh.
	 */

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (IS_EXTENDED_MODEL_INTEL(cpi))
			cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4;
		break;
	case X86_VENDOR_AMD:
		if (CPI_FAMILY(cpi) == 0xf)
			cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4;
		break;
	case X86_VENDOR_HYGON:
		cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4;
		break;
	default:
		if (cpi->cpi_model == 0xf)
			cpi->cpi_model += CPI_MODEL_XTD(cpi) << 4;
		break;
	}

	cpi->cpi_step = CPI_STEP(cpi);
	cpi->cpi_brandid = CPI_BRANDID(cpi);

	/*
	 * Synthesize chip "revision" and socket type
	 */
	cpi->cpi_chiprev = _cpuid_chiprev(cpi->cpi_vendor, cpi->cpi_family,
	    cpi->cpi_model, cpi->cpi_step);
	cpi->cpi_chiprevstr = _cpuid_chiprevstr(cpi->cpi_vendor,
	    cpi->cpi_family, cpi->cpi_model, cpi->cpi_step);
	cpi->cpi_socket = _cpuid_skt(cpi->cpi_vendor, cpi->cpi_family,
	    cpi->cpi_model, cpi->cpi_step);
	cpi->cpi_uarchrev = _cpuid_uarchrev(cpi->cpi_vendor, cpi->cpi_family,
	    cpi->cpi_model, cpi->cpi_step);
}

static void
cpuid_pass_basic(cpu_t *cpu, void *arg)
{
	uchar_t *featureset = (uchar_t *)arg;
	uint32_t mask_ecx, mask_edx;
	struct cpuid_info *cpi;
	struct cpuid_regs *cp;
	int xcpuid;
#if !defined(__xpv)
	extern int idle_cpu_prefer_mwait;
#endif

	cpi = cpu->cpu_m.mcpu_cpi;
	ASSERT(cpi != NULL);

	if (cpi->cpi_maxeax < 1)
		return;

	/*
	 * This was filled during the identification pass.
	 */
	cp = &cpi->cpi_std[1];

	/*
	 * *default* assumptions:
	 * - believe %edx feature word
	 * - ignore %ecx feature word
	 * - 32-bit virtual and physical addressing
	 */
	mask_edx = 0xffffffff;
	mask_ecx = 0;

	cpi->cpi_pabits = cpi->cpi_vabits = 32;

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (cpi->cpi_family == 5)
			x86_type = X86_TYPE_P5;
		else if (IS_LEGACY_P6(cpi)) {
			x86_type = X86_TYPE_P6;
			pentiumpro_bug4046376 = 1;
			/*
			 * Clear the SEP bit when it was set erroneously
			 */
			if (cpi->cpi_model < 3 && cpi->cpi_step < 3)
				cp->cp_edx &= ~CPUID_INTC_EDX_SEP;
		} else if (IS_NEW_F6(cpi) || cpi->cpi_family == 0xf) {
			x86_type = X86_TYPE_P4;
			/*
			 * We don't currently depend on any of the %ecx
			 * features until Prescott, so we'll only check
			 * this from P4 onwards. We might want to revisit
			 * that idea later.
			 */
			mask_ecx = 0xffffffff;
		} else if (cpi->cpi_family > 0xf)
			mask_ecx = 0xffffffff;
		/*
		 * We don't support MONITOR/MWAIT if leaf 5 is not available
		 * to obtain the monitor linesize.
		 */
		if (cpi->cpi_maxeax < 5)
			mask_ecx &= ~CPUID_INTC_ECX_MON;
		break;
	case X86_VENDOR_IntelClone:
	default:
		break;
	case X86_VENDOR_AMD:
#if defined(OPTERON_ERRATUM_108)
		if (cpi->cpi_family == 0xf && cpi->cpi_model == 0xe) {
			cp->cp_eax = (0xf0f & cp->cp_eax) | 0xc0;
			cpi->cpi_model = 0xc;
		} else
#endif
		if (cpi->cpi_family == 5) {
			/*
			 * AMD K5 and K6
			 *
			 * These CPUs have an incomplete implementation
			 * of MCA/MCE which we mask away.
			 */
			mask_edx &= ~(CPUID_INTC_EDX_MCE | CPUID_INTC_EDX_MCA);

			/*
			 * Model 0 uses the wrong (APIC) bit
			 * to indicate PGE. Fix it here.
			 */
			if (cpi->cpi_model == 0) {
				if (cp->cp_edx & 0x200) {
					cp->cp_edx &= ~0x200;
					cp->cp_edx |= CPUID_INTC_EDX_PGE;
				}
			}

			/*
			 * Early models had problems w/ MMX; disable.
			 */
			if (cpi->cpi_model < 6)
				mask_edx &= ~CPUID_INTC_EDX_MMX;
		}

		/*
		 * For newer families, SSE3 and CX16, at least, are valid;
		 * enable all
		 */
		if (cpi->cpi_family >= 0xf)
			mask_ecx = 0xffffffff;
		/*
		 * We don't support MONITOR/MWAIT if leaf 5 is not available
		 * to obtain the monitor linesize.
		 */
		if (cpi->cpi_maxeax < 5)
			mask_ecx &= ~CPUID_INTC_ECX_MON;

#if !defined(__xpv)
		/*
		 * AMD has not historically used MWAIT in the CPU's idle loop.
		 * Pre-family-10h Opterons do not have the MWAIT instruction. We
		 * know for certain that in at least family 17h, per AMD, mwait
		 * is preferred. Families in-between are less certain.
		 */
		if (cpi->cpi_family < 0x17) {
			idle_cpu_prefer_mwait = 0;
		}
#endif

		break;
	case X86_VENDOR_HYGON:
		/* Enable all for Hygon Dhyana CPU */
		mask_ecx = 0xffffffff;
		break;
	case X86_VENDOR_TM:
		/*
		 * workaround the NT workaround in CMS 4.1
		 */
		if (cpi->cpi_family == 5 && cpi->cpi_model == 4 &&
		    (cpi->cpi_step == 2 || cpi->cpi_step == 3))
			cp->cp_edx |= CPUID_INTC_EDX_CX8;
		break;
	case X86_VENDOR_Centaur:
		/*
		 * workaround the NT workarounds again
		 */
		if (cpi->cpi_family == 6)
			cp->cp_edx |= CPUID_INTC_EDX_CX8;
		break;
	case X86_VENDOR_Cyrix:
		/*
		 * We rely heavily on the probing in locore
		 * to actually figure out what parts, if any,
		 * of the Cyrix cpuid instruction to believe.
		 */
		switch (x86_type) {
		case X86_TYPE_CYRIX_486:
			mask_edx = 0;
			break;
		case X86_TYPE_CYRIX_6x86:
			mask_edx = 0;
			break;
		case X86_TYPE_CYRIX_6x86L:
			mask_edx =
			    CPUID_INTC_EDX_DE |
			    CPUID_INTC_EDX_CX8;
			break;
		case X86_TYPE_CYRIX_6x86MX:
			mask_edx =
			    CPUID_INTC_EDX_DE |
			    CPUID_INTC_EDX_MSR |
			    CPUID_INTC_EDX_CX8 |
			    CPUID_INTC_EDX_PGE |
			    CPUID_INTC_EDX_CMOV |
			    CPUID_INTC_EDX_MMX;
			break;
		case X86_TYPE_CYRIX_GXm:
			mask_edx =
			    CPUID_INTC_EDX_MSR |
			    CPUID_INTC_EDX_CX8 |
			    CPUID_INTC_EDX_CMOV |
			    CPUID_INTC_EDX_MMX;
			break;
		case X86_TYPE_CYRIX_MediaGX:
			break;
		case X86_TYPE_CYRIX_MII:
		case X86_TYPE_VIA_CYRIX_III:
			mask_edx =
			    CPUID_INTC_EDX_DE |
			    CPUID_INTC_EDX_TSC |
			    CPUID_INTC_EDX_MSR |
			    CPUID_INTC_EDX_CX8 |
			    CPUID_INTC_EDX_PGE |
			    CPUID_INTC_EDX_CMOV |
			    CPUID_INTC_EDX_MMX;
			break;
		default:
			break;
		}
		break;
	}

#if defined(__xpv)
	/*
	 * Do not support MONITOR/MWAIT under a hypervisor
	 */
	mask_ecx &= ~CPUID_INTC_ECX_MON;
	/*
	 * Do not support XSAVE under a hypervisor for now
	 */
	xsave_force_disable = B_TRUE;

#endif	/* __xpv */

	if (xsave_force_disable) {
		mask_ecx &= ~CPUID_INTC_ECX_XSAVE;
		mask_ecx &= ~CPUID_INTC_ECX_AVX;
		mask_ecx &= ~CPUID_INTC_ECX_F16C;
		mask_ecx &= ~CPUID_INTC_ECX_FMA;
	}

	/*
	 * Now we've figured out the masks that determine
	 * which bits we choose to believe, apply the masks
	 * to the feature words, then map the kernel's view
	 * of these feature words into its feature word.
	 */
	cp->cp_edx &= mask_edx;
	cp->cp_ecx &= mask_ecx;

	/*
	 * apply any platform restrictions (we don't call this
	 * immediately after __cpuid_insn here, because we need the
	 * workarounds applied above first)
	 */
	platform_cpuid_mangle(cpi->cpi_vendor, 1, cp);

	/*
	 * In addition to ecx and edx, Intel and AMD are storing a bunch of
	 * instruction set extensions in leaf 7's ebx, ecx, and edx. Note, leaf
	 * 7 has sub-leaves determined by ecx.
	 */
	if (cpi->cpi_maxeax >= 7) {
		struct cpuid_regs *ecp;
		ecp = &cpi->cpi_std[7];
		ecp->cp_eax = 7;
		ecp->cp_ecx = 0;
		(void) __cpuid_insn(ecp);

		/*
		 * If XSAVE has been disabled, just ignore all of the
		 * extended-save-area dependent flags here. By removing most of
		 * the leaf 7, sub-leaf 0 flags, that will ensure that we don't
		 * end up looking at additional xsave dependent leaves right
		 * now.
		 */
		if (xsave_force_disable) {
			ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_BMI1;
			ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_BMI2;
			ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_AVX2;
			ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_MPX;
			ecp->cp_ebx &= ~CPUID_INTC_EBX_7_0_ALL_AVX512;
			ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_ALL_AVX512;
			ecp->cp_edx &= ~CPUID_INTC_EDX_7_0_ALL_AVX512;
			ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_VAES;
			ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_VPCLMULQDQ;
			ecp->cp_ecx &= ~CPUID_INTC_ECX_7_0_GFNI;
		}

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_SMEP)
			add_x86_feature(featureset, X86FSET_SMEP);

		/*
		 * We check disable_smap here in addition to in startup_smap()
		 * to ensure CPUs that aren't the boot CPU don't accidentally
		 * include it in the feature set and thus generate a mismatched
		 * x86 feature set across CPUs.
		 */
		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_SMAP &&
		    disable_smap == 0)
			add_x86_feature(featureset, X86FSET_SMAP);

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_RDSEED)
			add_x86_feature(featureset, X86FSET_RDSEED);

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_ADX)
			add_x86_feature(featureset, X86FSET_ADX);

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_FSGSBASE)
			add_x86_feature(featureset, X86FSET_FSGSBASE);

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_CLFLUSHOPT)
			add_x86_feature(featureset, X86FSET_CLFLUSHOPT);

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_INVPCID)
			add_x86_feature(featureset, X86FSET_INVPCID);

		if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_UMIP)
			add_x86_feature(featureset, X86FSET_UMIP);
		if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_PKU)
			add_x86_feature(featureset, X86FSET_PKU);
		if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_OSPKE)
			add_x86_feature(featureset, X86FSET_OSPKE);
		if (ecp->cp_ecx & CPUID_INTC_ECX_7_0_GFNI)
			add_x86_feature(featureset, X86FSET_GFNI);

		if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_CLWB)
			add_x86_feature(featureset, X86FSET_CLWB);

		if (cpi->cpi_vendor == X86_VENDOR_Intel) {
			if (ecp->cp_ebx & CPUID_INTC_EBX_7_0_MPX)
				add_x86_feature(featureset, X86FSET_MPX);
		}

		/*
		 * If we have subleaf 1 or 2 available, grab and store
		 * that. This is used for more AVX and related features.
		 */
		if (ecp->cp_eax >= 1) {
			struct cpuid_regs *c71;
			c71 = &cpi->cpi_sub7[0];
			c71->cp_eax = 7;
			c71->cp_ecx = 1;
			(void) __cpuid_insn(c71);
		}

		/* Subleaf 2 has certain security indicators in it. */
		if (ecp->cp_eax >= 2) {
			struct cpuid_regs *c72;
			c72 = &cpi->cpi_sub7[1];
			c72->cp_eax = 7;
			c72->cp_ecx = 2;
			(void) __cpuid_insn(c72);
		}
	}

	/*
	 * fold in overrides from the "eeprom" mechanism
	 */
	cp->cp_edx |= cpuid_feature_edx_include;
	cp->cp_edx &= ~cpuid_feature_edx_exclude;

	cp->cp_ecx |= cpuid_feature_ecx_include;
	cp->cp_ecx &= ~cpuid_feature_ecx_exclude;

	if (cp->cp_edx & CPUID_INTC_EDX_PSE) {
		add_x86_feature(featureset, X86FSET_LARGEPAGE);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_TSC) {
		add_x86_feature(featureset, X86FSET_TSC);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_MSR) {
		add_x86_feature(featureset, X86FSET_MSR);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_MTRR) {
		add_x86_feature(featureset, X86FSET_MTRR);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_PGE) {
		add_x86_feature(featureset, X86FSET_PGE);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_CMOV) {
		add_x86_feature(featureset, X86FSET_CMOV);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_MMX) {
		add_x86_feature(featureset, X86FSET_MMX);
	}
	if ((cp->cp_edx & CPUID_INTC_EDX_MCE) != 0 &&
	    (cp->cp_edx & CPUID_INTC_EDX_MCA) != 0) {
		add_x86_feature(featureset, X86FSET_MCA);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_PAE) {
		add_x86_feature(featureset, X86FSET_PAE);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_CX8) {
		add_x86_feature(featureset, X86FSET_CX8);
	}
	if (cp->cp_ecx & CPUID_INTC_ECX_CX16) {
		add_x86_feature(featureset, X86FSET_CX16);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_PAT) {
		add_x86_feature(featureset, X86FSET_PAT);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_SEP) {
		add_x86_feature(featureset, X86FSET_SEP);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_FXSR) {
		/*
		 * In our implementation, fxsave/fxrstor
		 * are prerequisites before we'll even
		 * try and do SSE things.
		 */
		if (cp->cp_edx & CPUID_INTC_EDX_SSE) {
			add_x86_feature(featureset, X86FSET_SSE);
		}
		if (cp->cp_edx & CPUID_INTC_EDX_SSE2) {
			add_x86_feature(featureset, X86FSET_SSE2);
		}
		if (cp->cp_ecx & CPUID_INTC_ECX_SSE3) {
			add_x86_feature(featureset, X86FSET_SSE3);
		}
		if (cp->cp_ecx & CPUID_INTC_ECX_SSSE3) {
			add_x86_feature(featureset, X86FSET_SSSE3);
		}
		if (cp->cp_ecx & CPUID_INTC_ECX_SSE4_1) {
			add_x86_feature(featureset, X86FSET_SSE4_1);
		}
		if (cp->cp_ecx & CPUID_INTC_ECX_SSE4_2) {
			add_x86_feature(featureset, X86FSET_SSE4_2);
		}
		if (cp->cp_ecx & CPUID_INTC_ECX_AES) {
			add_x86_feature(featureset, X86FSET_AES);
		}
		if (cp->cp_ecx & CPUID_INTC_ECX_PCLMULQDQ) {
			add_x86_feature(featureset, X86FSET_PCLMULQDQ);
		}

		if (cpi->cpi_std[7].cp_ebx & CPUID_INTC_EBX_7_0_SHA)
			add_x86_feature(featureset, X86FSET_SHA);

		if (cp->cp_ecx & CPUID_INTC_ECX_XSAVE) {
			add_x86_feature(featureset, X86FSET_XSAVE);

			/* We only test AVX & AVX512 when there is XSAVE */
			cpuid_basic_avx(cpu, featureset);
		}
	}

	if (cp->cp_ecx & CPUID_INTC_ECX_PCID) {
		add_x86_feature(featureset, X86FSET_PCID);
	}

	if (cp->cp_ecx & CPUID_INTC_ECX_X2APIC) {
		add_x86_feature(featureset, X86FSET_X2APIC);
	}
	if (cp->cp_edx & CPUID_INTC_EDX_DE) {
		add_x86_feature(featureset, X86FSET_DE);
	}
#if !defined(__xpv)
	if (cp->cp_ecx & CPUID_INTC_ECX_MON) {

		/*
		 * We require the CLFLUSH instruction for erratum workaround
		 * to use MONITOR/MWAIT.
		 */
		if (cp->cp_edx & CPUID_INTC_EDX_CLFSH) {
			cpi->cpi_mwait.support |= MWAIT_SUPPORT;
			add_x86_feature(featureset, X86FSET_MWAIT);
		} else {
			extern int idle_cpu_assert_cflush_monitor;

			/*
			 * All processors we are aware of which have
			 * MONITOR/MWAIT also have CLFLUSH.
			 */
			if (idle_cpu_assert_cflush_monitor) {
				ASSERT((cp->cp_ecx & CPUID_INTC_ECX_MON) &&
				    (cp->cp_edx & CPUID_INTC_EDX_CLFSH));
			}
		}
	}
#endif	/* __xpv */

	if (cp->cp_ecx & CPUID_INTC_ECX_VMX) {
		add_x86_feature(featureset, X86FSET_VMX);
	}

	if (cp->cp_ecx & CPUID_INTC_ECX_RDRAND)
		add_x86_feature(featureset, X86FSET_RDRAND);

	/*
	 * Only need it first time, rest of the cpus would follow suit.
	 * we only capture this for the bootcpu.
	 */
	if (cp->cp_edx & CPUID_INTC_EDX_CLFSH) {
		add_x86_feature(featureset, X86FSET_CLFSH);
		x86_clflush_size = (BITX(cp->cp_ebx, 15, 8) * 8);
	}
	if (is_x86_feature(featureset, X86FSET_PAE))
		cpi->cpi_pabits = 36;

	if (cpi->cpi_maxeax >= 0xD && !xsave_force_disable) {
		struct cpuid_regs r, *ecp;

		ecp = &r;
		ecp->cp_eax = 0xD;
		ecp->cp_ecx = 1;
		ecp->cp_edx = ecp->cp_ebx = 0;
		(void) __cpuid_insn(ecp);

		if (ecp->cp_eax & CPUID_INTC_EAX_D_1_XSAVEOPT)
			add_x86_feature(featureset, X86FSET_XSAVEOPT);
		if (ecp->cp_eax & CPUID_INTC_EAX_D_1_XSAVEC)
			add_x86_feature(featureset, X86FSET_XSAVEC);
		if (ecp->cp_eax & CPUID_INTC_EAX_D_1_XSAVES)
			add_x86_feature(featureset, X86FSET_XSAVES);

		/*
		 * Zen 2 family processors suffer from erratum 1386 that causes
		 * xsaves to not function correctly in some circumstances. There
		 * are no supervisor states in Zen 2 and earlier. Practically
		 * speaking this has no impact for us as we currently do not
		 * leverage compressed xsave formats. To safeguard against
		 * issues in the future where we may opt to using it, we remove
		 * it from the feature set now. While Matisse has a microcode
		 * update available with a fix, not all Zen 2 CPUs do so it's
		 * simpler for the moment to unconditionally remove it.
		 */
		if (cpi->cpi_vendor == X86_VENDOR_AMD &&
		    uarchrev_uarch(cpi->cpi_uarchrev) <= X86_UARCH_AMD_ZEN2) {
			remove_x86_feature(featureset, X86FSET_XSAVES);
		}
	}

	/*
	 * Work on the "extended" feature information, doing
	 * some basic initialization to be used in the extended pass.
	 */
	xcpuid = 0;
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		/*
		 * On KVM we know we will have proper support for extended
		 * cpuid.
		 */
		if (IS_NEW_F6(cpi) || cpi->cpi_family >= 0xf ||
		    (get_hwenv() == HW_KVM && cpi->cpi_family == 6 &&
		    (cpi->cpi_model == 6 || cpi->cpi_model == 2)))
			xcpuid++;
		break;
	case X86_VENDOR_AMD:
		if (cpi->cpi_family > 5 ||
		    (cpi->cpi_family == 5 && cpi->cpi_model >= 1))
			xcpuid++;
		break;
	case X86_VENDOR_Cyrix:
		/*
		 * Only these Cyrix CPUs are -known- to support
		 * extended cpuid operations.
		 */
		if (x86_type == X86_TYPE_VIA_CYRIX_III ||
		    x86_type == X86_TYPE_CYRIX_GXm)
			xcpuid++;
		break;
	case X86_VENDOR_HYGON:
	case X86_VENDOR_Centaur:
	case X86_VENDOR_TM:
	default:
		xcpuid++;
		break;
	}

	if (xcpuid) {
		cp = &cpi->cpi_extd[0];
		cp->cp_eax = CPUID_LEAF_EXT_0;
		cpi->cpi_xmaxeax = __cpuid_insn(cp);
	}

	if (cpi->cpi_xmaxeax & CPUID_LEAF_EXT_0) {

		if (cpi->cpi_xmaxeax > CPI_XMAXEAX_MAX)
			cpi->cpi_xmaxeax = CPI_XMAXEAX_MAX;

		switch (cpi->cpi_vendor) {
		case X86_VENDOR_Intel:
		case X86_VENDOR_AMD:
		case X86_VENDOR_HYGON:
			if (cpi->cpi_xmaxeax < 0x80000001)
				break;
			cp = &cpi->cpi_extd[1];
			cp->cp_eax = 0x80000001;
			(void) __cpuid_insn(cp);

			if (cpi->cpi_vendor == X86_VENDOR_AMD &&
			    cpi->cpi_family == 5 &&
			    cpi->cpi_model == 6 &&
			    cpi->cpi_step == 6) {
				/*
				 * K6 model 6 uses bit 10 to indicate SYSC
				 * Later models use bit 11. Fix it here.
				 */
				if (cp->cp_edx & 0x400) {
					cp->cp_edx &= ~0x400;
					cp->cp_edx |= CPUID_AMD_EDX_SYSC;
				}
			}

			platform_cpuid_mangle(cpi->cpi_vendor, 0x80000001, cp);

			/*
			 * Compute the additions to the kernel's feature word.
			 */
			if (cp->cp_edx & CPUID_AMD_EDX_NX) {
				add_x86_feature(featureset, X86FSET_NX);
			}

			/*
			 * Regardless whether or not we boot 64-bit,
			 * we should have a way to identify whether
			 * the CPU is capable of running 64-bit.
			 */
			if (cp->cp_edx & CPUID_AMD_EDX_LM) {
				add_x86_feature(featureset, X86FSET_64);
			}

			/* 1 GB large page - enable only for 64 bit kernel */
			if (cp->cp_edx & CPUID_AMD_EDX_1GPG) {
				add_x86_feature(featureset, X86FSET_1GPG);
			}

			if ((cpi->cpi_vendor == X86_VENDOR_AMD ||
			    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
			    (cpi->cpi_std[1].cp_edx & CPUID_INTC_EDX_FXSR) &&
			    (cp->cp_ecx & CPUID_AMD_ECX_SSE4A)) {
				add_x86_feature(featureset, X86FSET_SSE4A);
			}

			/*
			 * It's really tricky to support syscall/sysret in
			 * the i386 kernel; we rely on sysenter/sysexit
			 * instead. In the amd64 kernel, things are -way-
			 * better.
			 */
			if (cp->cp_edx & CPUID_AMD_EDX_SYSC) {
				add_x86_feature(featureset, X86FSET_ASYSC);
			}

			/*
			 * While we're thinking about system calls, note
			 * that AMD processors don't support sysenter
			 * in long mode at all, so don't try to program them.
			 */
			if (x86_vendor == X86_VENDOR_AMD ||
			    x86_vendor == X86_VENDOR_HYGON) {
				remove_x86_feature(featureset, X86FSET_SEP);
			}

			if (cp->cp_edx & CPUID_AMD_EDX_TSCP) {
				add_x86_feature(featureset, X86FSET_TSCP);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_SVM) {
				add_x86_feature(featureset, X86FSET_SVM);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_TOPOEXT) {
				add_x86_feature(featureset, X86FSET_TOPOEXT);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_PCEC) {
				add_x86_feature(featureset, X86FSET_AMD_PCEC);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_XOP) {
				add_x86_feature(featureset, X86FSET_XOP);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_FMA4) {
				add_x86_feature(featureset, X86FSET_FMA4);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_TBM) {
				add_x86_feature(featureset, X86FSET_TBM);
			}

			if (cp->cp_ecx & CPUID_AMD_ECX_MONITORX) {
				add_x86_feature(featureset, X86FSET_MONITORX);
			}
			break;
		default:
			break;
		}

		/*
		 * Get CPUID data about processor cores and hyperthreads.
		 */
		switch (cpi->cpi_vendor) {
		case X86_VENDOR_Intel:
			if (cpi->cpi_maxeax >= 4) {
				cp = &cpi->cpi_std[4];
				cp->cp_eax = 4;
				cp->cp_ecx = 0;
				(void) __cpuid_insn(cp);
				platform_cpuid_mangle(cpi->cpi_vendor, 4, cp);
			}
			/*FALLTHROUGH*/
		case X86_VENDOR_AMD:
		case X86_VENDOR_HYGON:
			if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_8)
				break;
			cp = &cpi->cpi_extd[8];
			cp->cp_eax = CPUID_LEAF_EXT_8;
			(void) __cpuid_insn(cp);
			platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_8,
			    cp);

			/*
			 * AMD uses ebx for some extended functions.
			 */
			if (cpi->cpi_vendor == X86_VENDOR_AMD ||
			    cpi->cpi_vendor == X86_VENDOR_HYGON) {
				/*
				 * While we're here, check for the AMD "Error
				 * Pointer Zero/Restore" feature. This can be
				 * used to setup the FP save handlers
				 * appropriately.
				 */
				if (cp->cp_ebx & CPUID_AMD_EBX_ERR_PTR_ZERO) {
					cpi->cpi_fp_amd_save = 0;
				} else {
					cpi->cpi_fp_amd_save = 1;
				}

				if (cp->cp_ebx & CPUID_AMD_EBX_CLZERO) {
					add_x86_feature(featureset,
					    X86FSET_CLZERO);
				}
			}

			/*
			 * Virtual and physical address limits from
			 * cpuid override previously guessed values.
			 */
			cpi->cpi_pabits = BITX(cp->cp_eax, 7, 0);
			cpi->cpi_vabits = BITX(cp->cp_eax, 15, 8);
			break;
		default:
			break;
		}

		/*
		 * Get CPUID data about TSC Invariance in Deep C-State.
		 */
		switch (cpi->cpi_vendor) {
		case X86_VENDOR_Intel:
		case X86_VENDOR_AMD:
		case X86_VENDOR_HYGON:
			if (cpi->cpi_maxeax >= 7) {
				cp = &cpi->cpi_extd[7];
				cp->cp_eax = 0x80000007;
				cp->cp_ecx = 0;
				(void) __cpuid_insn(cp);
			}
			break;
		default:
			break;
		}
	}

	/*
	 * cpuid_basic_ppin assumes that cpuid_basic_topology has already been
	 * run and thus gathered some of its dependent leaves.
	 */
	cpuid_basic_topology(cpu, featureset);
	cpuid_basic_thermal(cpu, featureset);
#if !defined(__xpv)
	cpuid_basic_ppin(cpu, featureset);
#endif

	if (cpi->cpi_vendor == X86_VENDOR_AMD ||
	    cpi->cpi_vendor == X86_VENDOR_HYGON) {
		if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_8 &&
		    cpi->cpi_extd[8].cp_ebx & CPUID_AMD_EBX_ERR_PTR_ZERO) {
			/* Special handling for AMD FP not necessary. */
			cpi->cpi_fp_amd_save = 0;
		} else {
			cpi->cpi_fp_amd_save = 1;
		}
	}

	/*
	 * Check (and potentially set) if lfence is serializing.
	 * This is useful for accurate rdtsc measurements and AMD retpolines.
	 */
	if ((cpi->cpi_vendor == X86_VENDOR_AMD ||
	    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
	    is_x86_feature(featureset, X86FSET_SSE2)) {
		/*
		 * The AMD white paper Software Techniques For Managing
		 * Speculation on AMD Processors details circumstances for when
		 * lfence instructions are serializing.
		 *
		 * On family 0xf and 0x11, it is inherently so. On family 0x10
		 * and later (excluding 0x11), a bit in the DE_CFG MSR
		 * determines the lfence behavior. Per that whitepaper, AMD has
		 * committed to supporting that MSR on all later CPUs.
		 */
		if (cpi->cpi_family == 0xf || cpi->cpi_family == 0x11) {
			add_x86_feature(featureset, X86FSET_LFENCE_SER);
		} else if (cpi->cpi_family >= 0x10) {
#if !defined(__xpv)
			uint64_t val;

			/*
			 * Be careful when attempting to enable the bit, and
			 * verify that it was actually set in case we are
			 * running in a hypervisor which is less than faithful
			 * about its emulation of this feature.
			 */
			on_trap_data_t otd;
			if (!on_trap(&otd, OT_DATA_ACCESS)) {
				val = rdmsr(MSR_AMD_DE_CFG);
				val |= AMD_DE_CFG_LFENCE_DISPATCH;
				wrmsr(MSR_AMD_DE_CFG, val);
				val = rdmsr(MSR_AMD_DE_CFG);
			} else {
				val = 0;
			}
			no_trap();

			if ((val & AMD_DE_CFG_LFENCE_DISPATCH) != 0) {
				add_x86_feature(featureset, X86FSET_LFENCE_SER);
			}
#endif
		}
	} else if (cpi->cpi_vendor == X86_VENDOR_Intel &&
	    is_x86_feature(featureset, X86FSET_SSE2)) {
		/*
		 * Documentation and other OSes indicate that lfence is always
		 * serializing on Intel CPUs.
		 */
		add_x86_feature(featureset, X86FSET_LFENCE_SER);
	}


	/*
	 * Check the processor leaves that are used for security features. Grab
	 * any additional processor-specific leaves that we may not have yet.
	 */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_21) {
			cp = &cpi->cpi_extd[7];
			cp->cp_eax = CPUID_LEAF_EXT_21;
			cp->cp_ecx = 0;
			(void) __cpuid_insn(cp);
		}
		break;
	default:
		break;
	}

	cpuid_scan_security(cpu, featureset);
}

/*
 * Make copies of the cpuid table entries we depend on, in
 * part for ease of parsing now, in part so that we have only
 * one place to correct any of it, in part for ease of
 * later export to userland, and in part so we can look at
 * this stuff in a crash dump.
 */

static void
cpuid_pass_extended(cpu_t *cpu, void *_arg __unused)
{
	uint_t n, nmax;
	int i;
	struct cpuid_regs *cp;
	uint8_t *dp;
	uint32_t *iptr;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	if (cpi->cpi_maxeax < 1)
		return;

	if ((nmax = cpi->cpi_maxeax + 1) > NMAX_CPI_STD)
		nmax = NMAX_CPI_STD;
	/*
	 * (We already handled n == 0 and n == 1 in the basic pass)
	 */
	for (n = 2, cp = &cpi->cpi_std[2]; n < nmax; n++, cp++) {
		/*
		 * leaves 6 and 7 were handled in the basic pass
		 */
		if (n == 6 || n == 7)
			continue;

		cp->cp_eax = n;

		/*
		 * CPUID function 4 expects %ecx to be initialized
		 * with an index which indicates which cache to return
		 * information about. The OS is expected to call function 4
		 * with %ecx set to 0, 1, 2, ... until it returns with
		 * EAX[4:0] set to 0, which indicates there are no more
		 * caches.
		 *
		 * Here, populate cpi_std[4] with the information returned by
		 * function 4 when %ecx == 0, and do the rest in a later pass
		 * when dynamic memory allocation becomes available.
		 *
		 * Note: we need to explicitly initialize %ecx here, since
		 * function 4 may have been previously invoked.
		 */
		if (n == 4)
			cp->cp_ecx = 0;

		(void) __cpuid_insn(cp);
		platform_cpuid_mangle(cpi->cpi_vendor, n, cp);
		switch (n) {
		case 2:
			/*
			 * "the lower 8 bits of the %eax register
			 * contain a value that identifies the number
			 * of times the cpuid [instruction] has to be
			 * executed to obtain a complete image of the
			 * processor's caching systems."
			 *
			 * How *do* they make this stuff up?
			 */
			cpi->cpi_ncache = sizeof (*cp) *
			    BITX(cp->cp_eax, 7, 0);
			if (cpi->cpi_ncache == 0)
				break;
			cpi->cpi_ncache--;	/* skip count byte */

			/*
			 * Well, for now, rather than attempt to implement
			 * this slightly dubious algorithm, we just look
			 * at the first 15 ..
			 */
			if (cpi->cpi_ncache > (sizeof (*cp) - 1))
				cpi->cpi_ncache = sizeof (*cp) - 1;

			dp = cpi->cpi_cacheinfo;
			if (BITX(cp->cp_eax, 31, 31) == 0) {
				uint8_t *p = (void *)&cp->cp_eax;
				for (i = 1; i < 4; i++)
					if (p[i] != 0)
						*dp++ = p[i];
			}
			if (BITX(cp->cp_ebx, 31, 31) == 0) {
				uint8_t *p = (void *)&cp->cp_ebx;
				for (i = 0; i < 4; i++)
					if (p[i] != 0)
						*dp++ = p[i];
			}
			if (BITX(cp->cp_ecx, 31, 31) == 0) {
				uint8_t *p = (void *)&cp->cp_ecx;
				for (i = 0; i < 4; i++)
					if (p[i] != 0)
						*dp++ = p[i];
			}
			if (BITX(cp->cp_edx, 31, 31) == 0) {
				uint8_t *p = (void *)&cp->cp_edx;
				for (i = 0; i < 4; i++)
					if (p[i] != 0)
						*dp++ = p[i];
			}
			break;

		case 3:	/* Processor serial number, if PSN supported */
			break;

		case 4:	/* Deterministic cache parameters */
			break;

		case 5:	/* Monitor/Mwait parameters */
		{
			size_t mwait_size;

			/*
			 * check cpi_mwait.support which was set in
			 * cpuid_pass_basic()
			 */
			if (!(cpi->cpi_mwait.support & MWAIT_SUPPORT))
				break;

			/*
			 * Protect ourself from insane mwait line size.
			 * Workaround for incomplete hardware emulator(s).
			 */
			mwait_size = (size_t)MWAIT_SIZE_MAX(cpi);
			if (mwait_size < sizeof (uint32_t) ||
			    !ISP2(mwait_size)) {
#if DEBUG
				cmn_err(CE_NOTE, "Cannot handle cpu %d mwait "
				    "size %ld", cpu->cpu_id, (long)mwait_size);
#endif
				break;
			}

			cpi->cpi_mwait.mon_min = (size_t)MWAIT_SIZE_MIN(cpi);
			cpi->cpi_mwait.mon_max = mwait_size;
			if (MWAIT_EXTENSION(cpi)) {
				cpi->cpi_mwait.support |= MWAIT_EXTENSIONS;
				if (MWAIT_INT_ENABLE(cpi))
					cpi->cpi_mwait.support |=
					    MWAIT_ECX_INT_ENABLE;
			}
			break;
		}
		default:
			break;
		}
	}

	/*
	 * XSAVE enumeration
	 */
	if (cpi->cpi_maxeax >= 0xD) {
		struct cpuid_regs regs;
		boolean_t cpuid_d_valid = B_TRUE;

		cp = &regs;
		cp->cp_eax = 0xD;
		cp->cp_edx = cp->cp_ebx = cp->cp_ecx = 0;

		(void) __cpuid_insn(cp);

		/*
		 * Sanity checks for debug
		 */
		if ((cp->cp_eax & XFEATURE_LEGACY_FP) == 0 ||
		    (cp->cp_eax & XFEATURE_SSE) == 0) {
			cpuid_d_valid = B_FALSE;
		}

		cpi->cpi_xsave.xsav_hw_features_low = cp->cp_eax;
		cpi->cpi_xsave.xsav_hw_features_high = cp->cp_edx;
		cpi->cpi_xsave.xsav_max_size = cp->cp_ecx;

		/*
		 * If the hw supports AVX, get the size and offset in the save
		 * area for the ymm state.
		 */
		if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_AVX) {
			cp->cp_eax = 0xD;
			cp->cp_ecx = 2;
			cp->cp_edx = cp->cp_ebx = 0;

			(void) __cpuid_insn(cp);

			if (cp->cp_ebx != CPUID_LEAFD_2_YMM_OFFSET ||
			    cp->cp_eax != CPUID_LEAFD_2_YMM_SIZE) {
				cpuid_d_valid = B_FALSE;
			}

			cpi->cpi_xsave.ymm_size = cp->cp_eax;
			cpi->cpi_xsave.ymm_offset = cp->cp_ebx;
		}

		/*
		 * If the hw supports MPX, get the size and offset in the
		 * save area for BNDREGS and BNDCSR.
		 */
		if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_MPX) {
			cp->cp_eax = 0xD;
			cp->cp_ecx = 3;
			cp->cp_edx = cp->cp_ebx = 0;

			(void) __cpuid_insn(cp);

			cpi->cpi_xsave.bndregs_size = cp->cp_eax;
			cpi->cpi_xsave.bndregs_offset = cp->cp_ebx;

			cp->cp_eax = 0xD;
			cp->cp_ecx = 4;
			cp->cp_edx = cp->cp_ebx = 0;

			(void) __cpuid_insn(cp);

			cpi->cpi_xsave.bndcsr_size = cp->cp_eax;
			cpi->cpi_xsave.bndcsr_offset = cp->cp_ebx;
		}

		/*
		 * If the hw supports AVX512, get the size and offset in the
		 * save area for the opmask registers and zmm state.
		 */
		if (cpi->cpi_xsave.xsav_hw_features_low & XFEATURE_AVX512) {
			cp->cp_eax = 0xD;
			cp->cp_ecx = 5;
			cp->cp_edx = cp->cp_ebx = 0;

			(void) __cpuid_insn(cp);

			cpi->cpi_xsave.opmask_size = cp->cp_eax;
			cpi->cpi_xsave.opmask_offset = cp->cp_ebx;

			cp->cp_eax = 0xD;
			cp->cp_ecx = 6;
			cp->cp_edx = cp->cp_ebx = 0;

			(void) __cpuid_insn(cp);

			cpi->cpi_xsave.zmmlo_size = cp->cp_eax;
			cpi->cpi_xsave.zmmlo_offset = cp->cp_ebx;

			cp->cp_eax = 0xD;
			cp->cp_ecx = 7;
			cp->cp_edx = cp->cp_ebx = 0;

			(void) __cpuid_insn(cp);

			cpi->cpi_xsave.zmmhi_size = cp->cp_eax;
			cpi->cpi_xsave.zmmhi_offset = cp->cp_ebx;
		}

		if (is_x86_feature(x86_featureset, X86FSET_XSAVE) == B_FALSE) {
			xsave_state_size = 0;
		} else if (cpuid_d_valid) {
			xsave_state_size = cpi->cpi_xsave.xsav_max_size;
		} else {
			/* Broken CPUID 0xD, probably in HVM */
			cmn_err(CE_WARN, "cpu%d: CPUID.0xD returns invalid "
			    "value: hw_low = %d, hw_high = %d, xsave_size = %d"
			    ", ymm_size = %d, ymm_offset = %d\n",
			    cpu->cpu_id, cpi->cpi_xsave.xsav_hw_features_low,
			    cpi->cpi_xsave.xsav_hw_features_high,
			    (int)cpi->cpi_xsave.xsav_max_size,
			    (int)cpi->cpi_xsave.ymm_size,
			    (int)cpi->cpi_xsave.ymm_offset);

			if (xsave_state_size != 0) {
				/*
				 * This must be a non-boot CPU. We cannot
				 * continue, because boot cpu has already
				 * enabled XSAVE.
				 */
				ASSERT(cpu->cpu_id != 0);
				cmn_err(CE_PANIC, "cpu%d: we have already "
				    "enabled XSAVE on boot cpu, cannot "
				    "continue.", cpu->cpu_id);
			} else {
				/*
				 * If we reached here on the boot CPU, it's also
				 * almost certain that we'll reach here on the
				 * non-boot CPUs. When we're here on a boot CPU
				 * we should disable the feature, on a non-boot
				 * CPU we need to confirm that we have.
				 */
				if (cpu->cpu_id == 0) {
					remove_x86_feature(x86_featureset,
					    X86FSET_XSAVE);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX);
					remove_x86_feature(x86_featureset,
					    X86FSET_F16C);
					remove_x86_feature(x86_featureset,
					    X86FSET_BMI1);
					remove_x86_feature(x86_featureset,
					    X86FSET_BMI2);
					remove_x86_feature(x86_featureset,
					    X86FSET_FMA);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX2);
					remove_x86_feature(x86_featureset,
					    X86FSET_MPX);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512F);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512DQ);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512PF);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512ER);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512CD);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512BW);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512VL);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512FMA);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512VBMI);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512VNNI);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512VPOPCDQ);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512NNIW);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512FMAPS);
					remove_x86_feature(x86_featureset,
					    X86FSET_VAES);
					remove_x86_feature(x86_featureset,
					    X86FSET_VPCLMULQDQ);
					remove_x86_feature(x86_featureset,
					    X86FSET_GFNI);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512_VP2INT);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512_BITALG);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512_VBMI2);
					remove_x86_feature(x86_featureset,
					    X86FSET_AVX512_BF16);

					xsave_force_disable = B_TRUE;
				} else {
					VERIFY(is_x86_feature(x86_featureset,
					    X86FSET_XSAVE) == B_FALSE);
				}
			}
		}
	}


	if ((cpi->cpi_xmaxeax & CPUID_LEAF_EXT_0) == 0)
		return;

	if ((nmax = cpi->cpi_xmaxeax - CPUID_LEAF_EXT_0 + 1) > NMAX_CPI_EXTD)
		nmax = NMAX_CPI_EXTD;
	/*
	 * Copy the extended properties, fixing them as we go. While we start at
	 * 2 because we've already handled a few cases in the basic pass, the
	 * rest we let ourselves just grab again (e.g. 0x8, 0x21).
	 */
	iptr = (void *)cpi->cpi_brandstr;
	for (n = 2, cp = &cpi->cpi_extd[2]; n < nmax; cp++, n++) {
		cp->cp_eax = CPUID_LEAF_EXT_0 + n;
		(void) __cpuid_insn(cp);
		platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_0 + n,
		    cp);
		switch (n) {
		case 2:
		case 3:
		case 4:
			/*
			 * Extract the brand string
			 */
			*iptr++ = cp->cp_eax;
			*iptr++ = cp->cp_ebx;
			*iptr++ = cp->cp_ecx;
			*iptr++ = cp->cp_edx;
			break;
		case 5:
			switch (cpi->cpi_vendor) {
			case X86_VENDOR_AMD:
				/*
				 * The Athlon and Duron were the first
				 * parts to report the sizes of the
				 * TLB for large pages. Before then,
				 * we don't trust the data.
				 */
				if (cpi->cpi_family < 6 ||
				    (cpi->cpi_family == 6 &&
				    cpi->cpi_model < 1))
					cp->cp_eax = 0;
				break;
			default:
				break;
			}
			break;
		case 6:
			switch (cpi->cpi_vendor) {
			case X86_VENDOR_AMD:
				/*
				 * The Athlon and Duron were the first
				 * AMD parts with L2 TLB's.
				 * Before then, don't trust the data.
				 */
				if (cpi->cpi_family < 6 ||
				    (cpi->cpi_family == 6 &&
				    cpi->cpi_model < 1))
					cp->cp_eax = cp->cp_ebx = 0;
				/*
				 * AMD Duron rev A0 reports L2
				 * cache size incorrectly as 1K
				 * when it is really 64K
				 */
				if (cpi->cpi_family == 6 &&
				    cpi->cpi_model == 3 &&
				    cpi->cpi_step == 0) {
					cp->cp_ecx &= 0xffff;
					cp->cp_ecx |= 0x400000;
				}
				break;
			case X86_VENDOR_Cyrix:	/* VIA C3 */
				/*
				 * VIA C3 processors are a bit messed
				 * up w.r.t. encoding cache sizes in %ecx
				 */
				if (cpi->cpi_family != 6)
					break;
				/*
				 * model 7 and 8 were incorrectly encoded
				 *
				 * xxx is model 8 really broken?
				 */
				if (cpi->cpi_model == 7 ||
				    cpi->cpi_model == 8)
					cp->cp_ecx =
					    BITX(cp->cp_ecx, 31, 24) << 16 |
					    BITX(cp->cp_ecx, 23, 16) << 12 |
					    BITX(cp->cp_ecx, 15, 8) << 8 |
					    BITX(cp->cp_ecx, 7, 0);
				/*
				 * model 9 stepping 1 has wrong associativity
				 */
				if (cpi->cpi_model == 9 && cpi->cpi_step == 1)
					cp->cp_ecx |= 8 << 12;
				break;
			case X86_VENDOR_Intel:
				/*
				 * Extended L2 Cache features function.
				 * First appeared on Prescott.
				 */
			default:
				break;
			}
			break;
		default:
			break;
		}
	}
}

static const char *
intel_cpubrand(const struct cpuid_info *cpi)
{
	int i;

	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	switch (cpi->cpi_family) {
	case 5:
		return ("Intel Pentium(r)");
	case 6:
		switch (cpi->cpi_model) {
			uint_t celeron, xeon;
			const struct cpuid_regs *cp;
		case 0:
		case 1:
		case 2:
			return ("Intel Pentium(r) Pro");
		case 3:
		case 4:
			return ("Intel Pentium(r) II");
		case 6:
			return ("Intel Celeron(r)");
		case 5:
		case 7:
			celeron = xeon = 0;
			cp = &cpi->cpi_std[2];	/* cache info */

			for (i = 1; i < 4; i++) {
				uint_t tmp;

				tmp = (cp->cp_eax >> (8 * i)) & 0xff;
				if (tmp == 0x40)
					celeron++;
				if (tmp >= 0x44 && tmp <= 0x45)
					xeon++;
			}

			for (i = 0; i < 2; i++) {
				uint_t tmp;

				tmp = (cp->cp_ebx >> (8 * i)) & 0xff;
				if (tmp == 0x40)
					celeron++;
				else if (tmp >= 0x44 && tmp <= 0x45)
					xeon++;
			}

			for (i = 0; i < 4; i++) {
				uint_t tmp;

				tmp = (cp->cp_ecx >> (8 * i)) & 0xff;
				if (tmp == 0x40)
					celeron++;
				else if (tmp >= 0x44 && tmp <= 0x45)
					xeon++;
			}

			for (i = 0; i < 4; i++) {
				uint_t tmp;

				tmp = (cp->cp_edx >> (8 * i)) & 0xff;
				if (tmp == 0x40)
					celeron++;
				else if (tmp >= 0x44 && tmp <= 0x45)
					xeon++;
			}

			if (celeron)
				return ("Intel Celeron(r)");
			if (xeon)
				return (cpi->cpi_model == 5 ?
				    "Intel Pentium(r) II Xeon(tm)" :
				    "Intel Pentium(r) III Xeon(tm)");
			return (cpi->cpi_model == 5 ?
			    "Intel Pentium(r) II or Pentium(r) II Xeon(tm)" :
			    "Intel Pentium(r) III or Pentium(r) III Xeon(tm)");
		default:
			break;
		}
	default:
		break;
	}

	/* BrandID is present if the field is nonzero */
	if (cpi->cpi_brandid != 0) {
		static const struct {
			uint_t bt_bid;
			const char *bt_str;
		} brand_tbl[] = {
			{ 0x1,	"Intel(r) Celeron(r)" },
			{ 0x2,	"Intel(r) Pentium(r) III" },
			{ 0x3,	"Intel(r) Pentium(r) III Xeon(tm)" },
			{ 0x4,	"Intel(r) Pentium(r) III" },
			{ 0x6,	"Mobile Intel(r) Pentium(r) III" },
			{ 0x7,	"Mobile Intel(r) Celeron(r)" },
			{ 0x8,	"Intel(r) Pentium(r) 4" },
			{ 0x9,	"Intel(r) Pentium(r) 4" },
			{ 0xa,	"Intel(r) Celeron(r)" },
			{ 0xb,	"Intel(r) Xeon(tm)" },
			{ 0xc,	"Intel(r) Xeon(tm) MP" },
			{ 0xe,	"Mobile Intel(r) Pentium(r) 4" },
			{ 0xf,	"Mobile Intel(r) Celeron(r)" },
			{ 0x11, "Mobile Genuine Intel(r)" },
			{ 0x12, "Intel(r) Celeron(r) M" },
			{ 0x13, "Mobile Intel(r) Celeron(r)" },
			{ 0x14, "Intel(r) Celeron(r)" },
			{ 0x15, "Mobile Genuine Intel(r)" },
			{ 0x16,	"Intel(r) Pentium(r) M" },
			{ 0x17, "Mobile Intel(r) Celeron(r)" }
		};
		uint_t btblmax = sizeof (brand_tbl) / sizeof (brand_tbl[0]);
		uint_t sgn;

		sgn = (cpi->cpi_family << 8) |
		    (cpi->cpi_model << 4) | cpi->cpi_step;

		for (i = 0; i < btblmax; i++)
			if (brand_tbl[i].bt_bid == cpi->cpi_brandid)
				break;
		if (i < btblmax) {
			if (sgn == 0x6b1 && cpi->cpi_brandid == 3)
				return ("Intel(r) Celeron(r)");
			if (sgn < 0xf13 && cpi->cpi_brandid == 0xb)
				return ("Intel(r) Xeon(tm) MP");
			if (sgn < 0xf13 && cpi->cpi_brandid == 0xe)
				return ("Intel(r) Xeon(tm)");
			return (brand_tbl[i].bt_str);
		}
	}

	return (NULL);
}

static const char *
amd_cpubrand(const struct cpuid_info *cpi)
{
	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	switch (cpi->cpi_family) {
	case 5:
		switch (cpi->cpi_model) {
		case 0:
		case 1:
		case 2:
		case 3:
		case 4:
		case 5:
			return ("AMD-K5(r)");
		case 6:
		case 7:
			return ("AMD-K6(r)");
		case 8:
			return ("AMD-K6(r)-2");
		case 9:
			return ("AMD-K6(r)-III");
		default:
			return ("AMD (family 5)");
		}
	case 6:
		switch (cpi->cpi_model) {
		case 1:
			return ("AMD-K7(tm)");
		case 0:
		case 2:
		case 4:
			return ("AMD Athlon(tm)");
		case 3:
		case 7:
			return ("AMD Duron(tm)");
		case 6:
		case 8:
		case 10:
			/*
			 * Use the L2 cache size to distinguish
			 */
			return ((cpi->cpi_extd[6].cp_ecx >> 16) >= 256 ?
			    "AMD Athlon(tm)" : "AMD Duron(tm)");
		default:
			return ("AMD (family 6)");
		}
	default:
		break;
	}

	if (cpi->cpi_family == 0xf && cpi->cpi_model == 5 &&
	    cpi->cpi_brandid != 0) {
		switch (BITX(cpi->cpi_brandid, 7, 5)) {
		case 3:
			return ("AMD Opteron(tm) UP 1xx");
		case 4:
			return ("AMD Opteron(tm) DP 2xx");
		case 5:
			return ("AMD Opteron(tm) MP 8xx");
		default:
			return ("AMD Opteron(tm)");
		}
	}

	return (NULL);
}

static const char *
cyrix_cpubrand(struct cpuid_info *cpi, uint_t type)
{
	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	switch (type) {
	case X86_TYPE_CYRIX_6x86:
		return ("Cyrix 6x86");
	case X86_TYPE_CYRIX_6x86L:
		return ("Cyrix 6x86L");
	case X86_TYPE_CYRIX_6x86MX:
		return ("Cyrix 6x86MX");
	case X86_TYPE_CYRIX_GXm:
		return ("Cyrix GXm");
	case X86_TYPE_CYRIX_MediaGX:
		return ("Cyrix MediaGX");
	case X86_TYPE_CYRIX_MII:
		return ("Cyrix M2");
	case X86_TYPE_VIA_CYRIX_III:
		return ("VIA Cyrix M3");
	default:
		/*
		 * Have another wild guess ..
		 */
		if (cpi->cpi_family == 4 && cpi->cpi_model == 9)
			return ("Cyrix 5x86");
		else if (cpi->cpi_family == 5) {
			switch (cpi->cpi_model) {
			case 2:
				return ("Cyrix 6x86");	/* Cyrix M1 */
			case 4:
				return ("Cyrix MediaGX");
			default:
				break;
			}
		} else if (cpi->cpi_family == 6) {
			switch (cpi->cpi_model) {
			case 0:
				return ("Cyrix 6x86MX"); /* Cyrix M2? */
			case 5:
			case 6:
			case 7:
			case 8:
			case 9:
				return ("VIA C3");
			default:
				break;
			}
		}
		break;
	}
	return (NULL);
}

/*
 * This only gets called when the CPU extended feature brand string
 * leaves (0x80000002, 0x80000003, 0x80000004) aren't available, or
 * contain null bytes for some reason.
 */
static void
fabricate_brandstr(struct cpuid_info *cpi)
{
	const char *brand = NULL;

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		brand = intel_cpubrand(cpi);
		break;
	case X86_VENDOR_AMD:
		brand = amd_cpubrand(cpi);
		break;
	case X86_VENDOR_Cyrix:
		brand = cyrix_cpubrand(cpi, x86_type);
		break;
	case X86_VENDOR_NexGen:
		if (cpi->cpi_family == 5 && cpi->cpi_model == 0)
			brand = "NexGen Nx586";
		break;
	case X86_VENDOR_Centaur:
		if (cpi->cpi_family == 5)
			switch (cpi->cpi_model) {
			case 4:
				brand = "Centaur C6";
				break;
			case 8:
				brand = "Centaur C2";
				break;
			case 9:
				brand = "Centaur C3";
				break;
			default:
				break;
			}
		break;
	case X86_VENDOR_Rise:
		if (cpi->cpi_family == 5 &&
		    (cpi->cpi_model == 0 || cpi->cpi_model == 2))
			brand = "Rise mP6";
		break;
	case X86_VENDOR_SiS:
		if (cpi->cpi_family == 5 && cpi->cpi_model == 0)
			brand = "SiS 55x";
		break;
	case X86_VENDOR_TM:
		if (cpi->cpi_family == 5 && cpi->cpi_model == 4)
			brand = "Transmeta Crusoe TM3x00 or TM5x00";
		break;
	case X86_VENDOR_NSC:
	case X86_VENDOR_UMC:
	default:
		break;
	}
	if (brand) {
		(void) strcpy((char *)cpi->cpi_brandstr, brand);
		return;
	}

	/*
	 * If all else fails ...
	 */
	(void) snprintf(cpi->cpi_brandstr, sizeof (cpi->cpi_brandstr),
	    "%s %d.%d.%d", cpi->cpi_vendorstr, cpi->cpi_family,
	    cpi->cpi_model, cpi->cpi_step);
}

/*
 * This routine is called just after kernel memory allocation
 * becomes available on cpu0, and as part of mp_startup() on
 * the other cpus.
 *
 * Fixup the brand string, and collect any information from cpuid
 * that requires dynamically allocated storage to represent.
 */

static void
cpuid_pass_dynamic(cpu_t *cpu, void *_arg __unused)
{
	int	i, max, shft, level, size;
	struct cpuid_regs regs;
	struct cpuid_regs *cp;
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	/*
	 * Deterministic cache parameters
	 *
	 * Intel uses leaf 0x4 for this, while AMD uses leaf 0x8000001d. The
	 * values that are present are currently defined to be the same. This
	 * means we can use the same logic to parse it as long as we use the
	 * appropriate leaf to get the data. If you're updating this, make sure
	 * you're careful about which vendor supports which aspect.
	 *
	 * Take this opportunity to detect the number of threads sharing the
	 * last level cache, and construct a corresponding cache id. The
	 * respective cpuid_info members are initialized to the default case of
	 * "no last level cache sharing".
	 */
	cpi->cpi_ncpu_shr_last_cache = 1;
	cpi->cpi_last_lvl_cacheid = cpu->cpu_id;

	if ((cpi->cpi_maxeax >= 4 && cpi->cpi_vendor == X86_VENDOR_Intel) ||
	    ((cpi->cpi_vendor == X86_VENDOR_AMD ||
	    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
	    cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1d &&
	    is_x86_feature(x86_featureset, X86FSET_TOPOEXT))) {
		uint32_t leaf;

		if (cpi->cpi_vendor == X86_VENDOR_Intel) {
			leaf = 4;
		} else {
			leaf = CPUID_LEAF_EXT_1d;
		}

		/*
		 * Find the # of elements (size) returned by the leaf and along
		 * the way detect last level cache sharing details.
		 */
		bzero(&regs, sizeof (regs));
		cp = &regs;
		for (i = 0, max = 0; i < CPI_FN4_ECX_MAX; i++) {
			cp->cp_eax = leaf;
			cp->cp_ecx = i;

			(void) __cpuid_insn(cp);

			if (CPI_CACHE_TYPE(cp) == 0)
				break;
			level = CPI_CACHE_LVL(cp);
			if (level > max) {
				max = level;
				cpi->cpi_ncpu_shr_last_cache =
				    CPI_NTHR_SHR_CACHE(cp) + 1;
			}
		}
		cpi->cpi_cache_leaf_size = size = i;

		/*
		 * Allocate the cpi_cache_leaves array. The first element
		 * references the regs for the corresponding leaf with %ecx set
		 * to 0. This was gathered in cpuid_pass_extended().
		 */
		if (size > 0) {
			cpi->cpi_cache_leaves =
			    kmem_alloc(size * sizeof (cp), KM_SLEEP);
			if (cpi->cpi_vendor == X86_VENDOR_Intel) {
				cpi->cpi_cache_leaves[0] = &cpi->cpi_std[4];
			} else {
				cpi->cpi_cache_leaves[0] = &cpi->cpi_extd[0x1d];
			}

			/*
			 * Allocate storage to hold the additional regs
			 * for the leaf, %ecx == 1 .. cpi_cache_leaf_size.
			 *
			 * The regs for the leaf, %ecx == 0 has already
			 * been allocated as indicated above.
			 */
			for (i = 1; i < size; i++) {
				cp = cpi->cpi_cache_leaves[i] =
				    kmem_zalloc(sizeof (regs), KM_SLEEP);
				cp->cp_eax = leaf;
				cp->cp_ecx = i;

				(void) __cpuid_insn(cp);
			}
		}
		/*
		 * Determine the number of bits needed to represent
		 * the number of CPUs sharing the last level cache.
		 *
		 * Shift off that number of bits from the APIC id to
		 * derive the cache id.
		 */
		shft = 0;
		for (i = 1; i < cpi->cpi_ncpu_shr_last_cache; i <<= 1)
			shft++;
		cpi->cpi_last_lvl_cacheid = cpi->cpi_apicid >> shft;
	}

	/*
	 * Now fixup the brand string
	 */
	if ((cpi->cpi_xmaxeax & CPUID_LEAF_EXT_0) == 0) {
		fabricate_brandstr(cpi);
	} else {

		/*
		 * If we successfully extracted a brand string from the cpuid
		 * instruction, clean it up by removing leading spaces and
		 * similar junk.
		 */
		if (cpi->cpi_brandstr[0]) {
			size_t maxlen = sizeof (cpi->cpi_brandstr);
			char *src, *dst;

			dst = src = (char *)cpi->cpi_brandstr;
			src[maxlen - 1] = '\0';
			/*
			 * strip leading spaces
			 */
			while (*src == ' ')
				src++;
			/*
			 * Remove any 'Genuine' or "Authentic" prefixes
			 */
			if (strncmp(src, "Genuine ", 8) == 0)
				src += 8;
			if (strncmp(src, "Authentic ", 10) == 0)
				src += 10;

			/*
			 * Now do an in-place copy.
			 * Map (R) to (r) and (TM) to (tm).
			 * The era of teletypes is long gone, and there's
			 * -really- no need to shout.
			 */
			while (*src != '\0') {
				if (src[0] == '(') {
					if (strncmp(src + 1, "R)", 2) == 0) {
						(void) strncpy(dst, "(r)", 3);
						src += 3;
						dst += 3;
						continue;
					}
					if (strncmp(src + 1, "TM)", 3) == 0) {
						(void) strncpy(dst, "(tm)", 4);
						src += 4;
						dst += 4;
						continue;
					}
				}
				*dst++ = *src++;
			}
			*dst = '\0';

			/*
			 * Finally, remove any trailing spaces
			 */
			while (--dst > cpi->cpi_brandstr)
				if (*dst == ' ')
					*dst = '\0';
				else
					break;
		} else
			fabricate_brandstr(cpi);
	}
}

typedef struct {
	uint32_t avm_av;
	uint32_t avm_feat;
} av_feat_map_t;

/*
 * These arrays are used to map features that we should add based on x86
 * features that are present. As a large number depend on kernel features,
 * rather than rechecking and clearing CPUID everywhere, we simply map these.
 * There is an array of these for each hwcap word. Some features aren't tracked
 * in the kernel x86 featureset and that's ok. They will not show up in here.
 */
static const av_feat_map_t x86fset_to_av1[] = {
	{ AV_386_CX8, X86FSET_CX8 },
	{ AV_386_SEP, X86FSET_SEP },
	{ AV_386_AMD_SYSC, X86FSET_ASYSC },
	{ AV_386_CMOV, X86FSET_CMOV },
	{ AV_386_FXSR, X86FSET_SSE },
	{ AV_386_SSE, X86FSET_SSE },
	{ AV_386_SSE2, X86FSET_SSE2 },
	{ AV_386_SSE3, X86FSET_SSE3 },
	{ AV_386_CX16, X86FSET_CX16 },
	{ AV_386_TSCP, X86FSET_TSCP },
	{ AV_386_AMD_SSE4A, X86FSET_SSE4A },
	{ AV_386_SSSE3, X86FSET_SSSE3 },
	{ AV_386_SSE4_1, X86FSET_SSE4_1 },
	{ AV_386_SSE4_2, X86FSET_SSE4_2 },
	{ AV_386_AES, X86FSET_AES },
	{ AV_386_PCLMULQDQ, X86FSET_PCLMULQDQ },
	{ AV_386_XSAVE, X86FSET_XSAVE },
	{ AV_386_AVX, X86FSET_AVX },
	{ AV_386_VMX, X86FSET_VMX },
	{ AV_386_AMD_SVM, X86FSET_SVM }
};

static const av_feat_map_t x86fset_to_av2[] = {
	{ AV_386_2_F16C, X86FSET_F16C },
	{ AV_386_2_RDRAND, X86FSET_RDRAND },
	{ AV_386_2_BMI1, X86FSET_BMI1 },
	{ AV_386_2_BMI2, X86FSET_BMI2 },
	{ AV_386_2_FMA, X86FSET_FMA },
	{ AV_386_2_AVX2, X86FSET_AVX2 },
	{ AV_386_2_ADX, X86FSET_ADX },
	{ AV_386_2_RDSEED, X86FSET_RDSEED },
	{ AV_386_2_AVX512F, X86FSET_AVX512F },
	{ AV_386_2_AVX512DQ, X86FSET_AVX512DQ },
	{ AV_386_2_AVX512IFMA, X86FSET_AVX512FMA },
	{ AV_386_2_AVX512PF, X86FSET_AVX512PF },
	{ AV_386_2_AVX512ER, X86FSET_AVX512ER },
	{ AV_386_2_AVX512CD, X86FSET_AVX512CD },
	{ AV_386_2_AVX512BW, X86FSET_AVX512BW },
	{ AV_386_2_AVX512VL, X86FSET_AVX512VL },
	{ AV_386_2_AVX512VBMI, X86FSET_AVX512VBMI },
	{ AV_386_2_AVX512VPOPCDQ, X86FSET_AVX512VPOPCDQ },
	{ AV_386_2_SHA, X86FSET_SHA },
	{ AV_386_2_FSGSBASE, X86FSET_FSGSBASE },
	{ AV_386_2_CLFLUSHOPT, X86FSET_CLFLUSHOPT },
	{ AV_386_2_CLWB, X86FSET_CLWB },
	{ AV_386_2_MONITORX, X86FSET_MONITORX },
	{ AV_386_2_CLZERO, X86FSET_CLZERO },
	{ AV_386_2_AVX512_VNNI, X86FSET_AVX512VNNI },
	{ AV_386_2_VPCLMULQDQ, X86FSET_VPCLMULQDQ },
	{ AV_386_2_VAES, X86FSET_VAES },
	{ AV_386_2_GFNI, X86FSET_GFNI },
	{ AV_386_2_AVX512_VP2INT, X86FSET_AVX512_VP2INT },
	{ AV_386_2_AVX512_BITALG, X86FSET_AVX512_BITALG }
};

static const av_feat_map_t x86fset_to_av3[] = {
	{ AV_386_3_AVX512_VBMI2, X86FSET_AVX512_VBMI2 },
	{ AV_386_3_AVX512_BF16, X86FSET_AVX512_BF16 }
};

/*
 * This routine is called out of bind_hwcap() much later in the life
 * of the kernel (post_startup()). The job of this routine is to resolve
 * the hardware feature support and kernel support for those features into
 * what we're actually going to tell applications via the aux vector.
 *
 * Most of the aux vector is derived from the x86_featureset array vector where
 * a given feature indicates that an aux vector should be plumbed through. This
 * allows the kernel to use one tracking mechanism for these based on whether or
 * not it has the required hardware support (most often xsave). Most newer
 * features are added there in case we need them in the kernel. Otherwise,
 * features are evaluated based on looking at the cpuid features that remain. If
 * you find yourself wanting to clear out cpuid features for some reason, they
 * should instead be driven by the feature set so we have a consistent view.
 */

static void
cpuid_pass_resolve(cpu_t *cpu, void *arg)
{
	uint_t *hwcap_out = (uint_t *)arg;
	struct cpuid_info *cpi;
	uint_t hwcap_flags = 0, hwcap_flags_2 = 0, hwcap_flags_3 = 0;

	cpi = cpu->cpu_m.mcpu_cpi;

	for (uint_t i = 0; i < ARRAY_SIZE(x86fset_to_av1); i++) {
		if (is_x86_feature(x86_featureset,
		    x86fset_to_av1[i].avm_feat)) {
			hwcap_flags |= x86fset_to_av1[i].avm_av;
		}
	}

	for (uint_t i = 0; i < ARRAY_SIZE(x86fset_to_av2); i++) {
		if (is_x86_feature(x86_featureset,
		    x86fset_to_av2[i].avm_feat)) {
			hwcap_flags_2 |= x86fset_to_av2[i].avm_av;
		}
	}

	for (uint_t i = 0; i < ARRAY_SIZE(x86fset_to_av3); i++) {
		if (is_x86_feature(x86_featureset,
		    x86fset_to_av3[i].avm_feat)) {
			hwcap_flags_3 |= x86fset_to_av3[i].avm_av;
		}
	}

	/*
	 * From here on out we're working through features that don't have
	 * corresponding kernel feature flags for various reasons that are
	 * mostly just due to the historical implementation.
	 */
	if (cpi->cpi_maxeax >= 1) {
		uint32_t *edx = &cpi->cpi_support[STD_EDX_FEATURES];
		uint32_t *ecx = &cpi->cpi_support[STD_ECX_FEATURES];

		*edx = CPI_FEATURES_EDX(cpi);
		*ecx = CPI_FEATURES_ECX(cpi);

		/*
		 * [no explicit support required beyond x87 fp context]
		 */
		if (!fpu_exists)
			*edx &= ~(CPUID_INTC_EDX_FPU | CPUID_INTC_EDX_MMX);

		/*
		 * Now map the supported feature vector to things that we
		 * think userland will care about.
		 */
		if (*ecx & CPUID_INTC_ECX_MOVBE)
			hwcap_flags |= AV_386_MOVBE;

		if (*ecx & CPUID_INTC_ECX_POPCNT)
			hwcap_flags |= AV_386_POPCNT;
		if (*edx & CPUID_INTC_EDX_FPU)
			hwcap_flags |= AV_386_FPU;
		if (*edx & CPUID_INTC_EDX_MMX)
			hwcap_flags |= AV_386_MMX;
		if (*edx & CPUID_INTC_EDX_TSC)
			hwcap_flags |= AV_386_TSC;
	}

	/*
	 * Check a few miscellaneous features.
	 */
	if (cpi->cpi_xmaxeax < 0x80000001)
		goto resolve_done;

	switch (cpi->cpi_vendor) {
		uint32_t *edx, *ecx;

	case X86_VENDOR_Intel:
		/*
		 * Seems like Intel duplicated what was necessary
		 * here to make the initial crop of 64-bit OS's work.
		 * Hopefully, those are the only "extended" bits
		 * they'll add.
		 */
		/*FALLTHROUGH*/

	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		edx = &cpi->cpi_support[AMD_EDX_FEATURES];
		ecx = &cpi->cpi_support[AMD_ECX_FEATURES];

		*edx = CPI_FEATURES_XTD_EDX(cpi);
		*ecx = CPI_FEATURES_XTD_ECX(cpi);

		/*
		 * [no explicit support required beyond
		 * x87 fp context and exception handlers]
		 */
		if (!fpu_exists)
			*edx &= ~(CPUID_AMD_EDX_MMXamd |
			    CPUID_AMD_EDX_3DNow | CPUID_AMD_EDX_3DNowx);

		/*
		 * Now map the supported feature vector to
		 * things that we think userland will care about.
		 */
		if (*edx & CPUID_AMD_EDX_MMXamd)
			hwcap_flags |= AV_386_AMD_MMX;
		if (*edx & CPUID_AMD_EDX_3DNow)
			hwcap_flags |= AV_386_AMD_3DNow;
		if (*edx & CPUID_AMD_EDX_3DNowx)
			hwcap_flags |= AV_386_AMD_3DNowx;

		switch (cpi->cpi_vendor) {
		case X86_VENDOR_AMD:
		case X86_VENDOR_HYGON:
			if (*ecx & CPUID_AMD_ECX_AHF64)
				hwcap_flags |= AV_386_AHF;
			if (*ecx & CPUID_AMD_ECX_LZCNT)
				hwcap_flags |= AV_386_AMD_LZCNT;
			break;

		case X86_VENDOR_Intel:
			if (*ecx & CPUID_AMD_ECX_LZCNT)
				hwcap_flags |= AV_386_AMD_LZCNT;
			/*
			 * Aarrgh.
			 * Intel uses a different bit in the same word.
			 */
			if (*ecx & CPUID_INTC_ECX_AHF64)
				hwcap_flags |= AV_386_AHF;
			break;
		default:
			break;
		}
		break;

	default:
		break;
	}

resolve_done:
	if (hwcap_out != NULL) {
		hwcap_out[0] = hwcap_flags;
		hwcap_out[1] = hwcap_flags_2;
		hwcap_out[2] = hwcap_flags_3;
	}
}


/*
 * Simulate the cpuid instruction using the data we previously
 * captured about this CPU. We try our best to return the truth
 * about the hardware, independently of kernel support.
 */
uint32_t
cpuid_insn(cpu_t *cpu, struct cpuid_regs *cp)
{
	struct cpuid_info *cpi;
	struct cpuid_regs *xcp;

	if (cpu == NULL)
		cpu = CPU;
	cpi = cpu->cpu_m.mcpu_cpi;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC));

	/*
	 * CPUID data is cached in two separate places: cpi_std for standard
	 * CPUID leaves, and cpi_extd for extended CPUID leaves.
	 */
	if (cp->cp_eax <= cpi->cpi_maxeax && cp->cp_eax < NMAX_CPI_STD) {
		xcp = &cpi->cpi_std[cp->cp_eax];
	} else if (cp->cp_eax >= CPUID_LEAF_EXT_0 &&
	    cp->cp_eax <= cpi->cpi_xmaxeax &&
	    cp->cp_eax < CPUID_LEAF_EXT_0 + NMAX_CPI_EXTD) {
		xcp = &cpi->cpi_extd[cp->cp_eax - CPUID_LEAF_EXT_0];
	} else {
		/*
		 * The caller is asking for data from an input parameter which
		 * the kernel has not cached. In this case we go fetch from
		 * the hardware and return the data directly to the user.
		 */
		return (__cpuid_insn(cp));
	}

	cp->cp_eax = xcp->cp_eax;
	cp->cp_ebx = xcp->cp_ebx;
	cp->cp_ecx = xcp->cp_ecx;
	cp->cp_edx = xcp->cp_edx;
	return (cp->cp_eax);
}

boolean_t
cpuid_checkpass(const cpu_t *const cpu, const cpuid_pass_t pass)
{
	return (cpu != NULL && cpu->cpu_m.mcpu_cpi != NULL &&
	    cpu->cpu_m.mcpu_cpi->cpi_pass >= pass);
}

int
cpuid_getbrandstr(cpu_t *cpu, char *s, size_t n)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC));

	return (snprintf(s, n, "%s", cpu->cpu_m.mcpu_cpi->cpi_brandstr));
}

int
cpuid_is_cmt(cpu_t *cpu)
{
	if (cpu == NULL)
		cpu = CPU;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));

	return (cpu->cpu_m.mcpu_cpi->cpi_chipid >= 0);
}

/*
 * AMD and Intel both implement the 64-bit variant of the syscall
 * instruction (syscallq), so if there's -any- support for syscall,
 * cpuid currently says "yes, we support this".
 *
 * However, Intel decided to -not- implement the 32-bit variant of the
 * syscall instruction, so we provide a predicate to allow our caller
 * to test that subtlety here.
 *
 * XXPV	Currently, 32-bit syscall instructions don't work via the hypervisor,
 *	even in the case where the hardware would in fact support it.
 */
/*ARGSUSED*/
int
cpuid_syscall32_insn(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass((cpu == NULL ? CPU : cpu), CPUID_PASS_BASIC));

#if !defined(__xpv)
	if (cpu == NULL)
		cpu = CPU;

	/*CSTYLED*/
	{
		struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

		if ((cpi->cpi_vendor == X86_VENDOR_AMD ||
		    cpi->cpi_vendor == X86_VENDOR_HYGON) &&
		    cpi->cpi_xmaxeax >= 0x80000001 &&
		    (CPI_FEATURES_XTD_EDX(cpi) & CPUID_AMD_EDX_SYSC))
			return (1);
	}
#endif
	return (0);
}

int
cpuid_getidstr(cpu_t *cpu, char *s, size_t n)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;

	static const char fmt[] =
	    "x86 (%s %X family %d model %d step %d clock %d MHz)";
	static const char fmt_ht[] =
	    "x86 (chipid 0x%x %s %X family %d model %d step %d clock %d MHz)";

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));

	if (cpuid_is_cmt(cpu))
		return (snprintf(s, n, fmt_ht, cpi->cpi_chipid,
		    cpi->cpi_vendorstr, cpi->cpi_std[1].cp_eax,
		    cpi->cpi_family, cpi->cpi_model,
		    cpi->cpi_step, cpu->cpu_type_info.pi_clock));
	return (snprintf(s, n, fmt,
	    cpi->cpi_vendorstr, cpi->cpi_std[1].cp_eax,
	    cpi->cpi_family, cpi->cpi_model,
	    cpi->cpi_step, cpu->cpu_type_info.pi_clock));
}

const char *
cpuid_getvendorstr(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return ((const char *)cpu->cpu_m.mcpu_cpi->cpi_vendorstr);
}

uint_t
cpuid_getvendor(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_vendor);
}

uint_t
cpuid_getfamily(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_family);
}

uint_t
cpuid_getmodel(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_model);
}

uint_t
cpuid_get_ncpu_per_chip(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_ncpu_per_chip);
}

uint_t
cpuid_get_ncore_per_chip(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_ncore_per_chip);
}

uint_t
cpuid_get_ncpu_sharing_last_cache(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_EXTENDED));
	return (cpu->cpu_m.mcpu_cpi->cpi_ncpu_shr_last_cache);
}

id_t
cpuid_get_last_lvl_cacheid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_EXTENDED));
	return (cpu->cpu_m.mcpu_cpi->cpi_last_lvl_cacheid);
}

uint_t
cpuid_getstep(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_step);
}

uint_t
cpuid_getsig(struct cpu *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_std[1].cp_eax);
}

x86_chiprev_t
cpuid_getchiprev(struct cpu *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_chiprev);
}

const char *
cpuid_getchiprevstr(struct cpu *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_chiprevstr);
}

uint32_t
cpuid_getsockettype(struct cpu *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	return (cpu->cpu_m.mcpu_cpi->cpi_socket);
}

const char *
cpuid_getsocketstr(cpu_t *cpu)
{
	static const char *socketstr = NULL;
	struct cpuid_info *cpi;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_IDENT));
	cpi = cpu->cpu_m.mcpu_cpi;

	/* Assume that socket types are the same across the system */
	if (socketstr == NULL)
		socketstr = _cpuid_sktstr(cpi->cpi_vendor, cpi->cpi_family,
		    cpi->cpi_model, cpi->cpi_step);


	return (socketstr);
}

x86_uarchrev_t
cpuid_getuarchrev(cpu_t *cpu)
{
	return (cpu->cpu_m.mcpu_cpi->cpi_uarchrev);
}

int
cpuid_get_chipid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));

	if (cpuid_is_cmt(cpu))
		return (cpu->cpu_m.mcpu_cpi->cpi_chipid);
	return (cpu->cpu_id);
}

id_t
cpuid_get_coreid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_coreid);
}

int
cpuid_get_pkgcoreid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_pkgcoreid);
}

int
cpuid_get_clogid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_clogid);
}

int
cpuid_get_cacheid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_last_lvl_cacheid);
}

uint_t
cpuid_get_procnodeid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_procnodeid);
}

uint_t
cpuid_get_procnodes_per_pkg(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_procnodes_per_pkg);
}

uint_t
cpuid_get_compunitid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_compunitid);
}

uint_t
cpuid_get_cores_per_compunit(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	return (cpu->cpu_m.mcpu_cpi->cpi_cores_per_compunit);
}

uint32_t
cpuid_get_apicid(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	if (cpu->cpu_m.mcpu_cpi->cpi_maxeax < 1) {
		return (UINT32_MAX);
	} else {
		return (cpu->cpu_m.mcpu_cpi->cpi_apicid);
	}
}

void
cpuid_get_addrsize(cpu_t *cpu, uint_t *pabits, uint_t *vabits)
{
	struct cpuid_info *cpi;

	if (cpu == NULL)
		cpu = CPU;
	cpi = cpu->cpu_m.mcpu_cpi;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));

	if (pabits)
		*pabits = cpi->cpi_pabits;
	if (vabits)
		*vabits = cpi->cpi_vabits;
}

size_t
cpuid_get_xsave_size(void)
{
	return (MAX(cpuid_info0.cpi_xsave.xsav_max_size,
	    sizeof (struct xsave_state)));
}

/*
 * Export information about known offsets to the kernel. We only care about
 * things we have actually enabled support for in %xcr0.
 */
void
cpuid_get_xsave_info(uint64_t bit, size_t *sizep, size_t *offp)
{
	size_t size, off;

	VERIFY3U(bit & xsave_bv_all, !=, 0);

	if (sizep == NULL)
		sizep = &size;
	if (offp == NULL)
		offp = &off;

	switch (bit) {
	case XFEATURE_LEGACY_FP:
	case XFEATURE_SSE:
		*sizep = sizeof (struct fxsave_state);
		*offp = 0;
		break;
	case XFEATURE_AVX:
		*sizep = cpuid_info0.cpi_xsave.ymm_size;
		*offp = cpuid_info0.cpi_xsave.ymm_offset;
		break;
	case XFEATURE_AVX512_OPMASK:
		*sizep = cpuid_info0.cpi_xsave.opmask_size;
		*offp = cpuid_info0.cpi_xsave.opmask_offset;
		break;
	case XFEATURE_AVX512_ZMM:
		*sizep = cpuid_info0.cpi_xsave.zmmlo_size;
		*offp = cpuid_info0.cpi_xsave.zmmlo_offset;
		break;
	case XFEATURE_AVX512_HI_ZMM:
		*sizep = cpuid_info0.cpi_xsave.zmmhi_size;
		*offp = cpuid_info0.cpi_xsave.zmmhi_offset;
		break;
	default:
		panic("asked for unsupported xsave feature: 0x%lx", bit);
	}
}

/*
 * Return true if the CPUs on this system require 'pointer clearing' for the
 * floating point error pointer exception handling. In the past, this has been
 * true for all AMD K7 & K8 CPUs, although newer AMD CPUs have been changed to
 * behave the same as Intel. This is checked via the CPUID_AMD_EBX_ERR_PTR_ZERO
 * feature bit and is reflected in the cpi_fp_amd_save member.
 */
boolean_t
cpuid_need_fp_excp_handling(void)
{
	return (cpuid_info0.cpi_vendor == X86_VENDOR_AMD &&
	    cpuid_info0.cpi_fp_amd_save != 0);
}

/*
 * Returns the number of data TLB entries for a corresponding
 * pagesize. If it can't be computed, or isn't known, the
 * routine returns zero. If you ask about an architecturally
 * impossible pagesize, the routine will panic (so that the
 * hat implementor knows that things are inconsistent.)
 */
uint_t
cpuid_get_dtlb_nent(cpu_t *cpu, size_t pagesize)
{
	struct cpuid_info *cpi;
	uint_t dtlb_nent = 0;

	if (cpu == NULL)
		cpu = CPU;
	cpi = cpu->cpu_m.mcpu_cpi;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));

	/*
	 * Check the L2 TLB info
	 */
	if (cpi->cpi_xmaxeax >= 0x80000006) {
		struct cpuid_regs *cp = &cpi->cpi_extd[6];

		switch (pagesize) {

		case 4 * 1024:
			/*
			 * All zero in the top 16 bits of the register
			 * indicates a unified TLB. Size is in low 16 bits.
			 */
			if ((cp->cp_ebx & 0xffff0000) == 0)
				dtlb_nent = cp->cp_ebx & 0x0000ffff;
			else
				dtlb_nent = BITX(cp->cp_ebx, 27, 16);
			break;

		case 2 * 1024 * 1024:
			if ((cp->cp_eax & 0xffff0000) == 0)
				dtlb_nent = cp->cp_eax & 0x0000ffff;
			else
				dtlb_nent = BITX(cp->cp_eax, 27, 16);
			break;

		default:
			panic("unknown L2 pagesize");
			/*NOTREACHED*/
		}
	}

	if (dtlb_nent != 0)
		return (dtlb_nent);

	/*
	 * No L2 TLB support for this size, try L1.
	 */
	if (cpi->cpi_xmaxeax >= 0x80000005) {
		struct cpuid_regs *cp = &cpi->cpi_extd[5];

		switch (pagesize) {
		case 4 * 1024:
			dtlb_nent = BITX(cp->cp_ebx, 23, 16);
			break;
		case 2 * 1024 * 1024:
			dtlb_nent = BITX(cp->cp_eax, 23, 16);
			break;
		default:
			panic("unknown L1 d-TLB pagesize");
			/*NOTREACHED*/
		}
	}

	return (dtlb_nent);
}

/*
 * Return 0 if the erratum is not present or not applicable, positive
 * if it is, and negative if the status of the erratum is unknown.
 *
 * See "Revision Guide for AMD Athlon(tm) 64 and AMD Opteron(tm)
 * Processors" #25759, Rev 3.57, August 2005
 */
int
cpuid_opteron_erratum(cpu_t *cpu, uint_t erratum)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	uint_t eax;

	/*
	 * Bail out if this CPU isn't an AMD CPU, or if it's
	 * a legacy (32-bit) AMD CPU.
	 */
	if (cpi->cpi_vendor != X86_VENDOR_AMD ||
	    cpi->cpi_family == 4 || cpi->cpi_family == 5 ||
	    cpi->cpi_family == 6) {
		return (0);
	}

	eax = cpi->cpi_std[1].cp_eax;

#define	SH_B0(eax)	(eax == 0xf40 || eax == 0xf50)
#define	SH_B3(eax)	(eax == 0xf51)
#define	B(eax)		(SH_B0(eax) || SH_B3(eax))

#define	SH_C0(eax)	(eax == 0xf48 || eax == 0xf58)

#define	SH_CG(eax)	(eax == 0xf4a || eax == 0xf5a || eax == 0xf7a)
#define	DH_CG(eax)	(eax == 0xfc0 || eax == 0xfe0 || eax == 0xff0)
#define	CH_CG(eax)	(eax == 0xf82 || eax == 0xfb2)
#define	CG(eax)		(SH_CG(eax) || DH_CG(eax) || CH_CG(eax))

#define	SH_D0(eax)	(eax == 0x10f40 || eax == 0x10f50 || eax == 0x10f70)
#define	DH_D0(eax)	(eax == 0x10fc0 || eax == 0x10ff0)
#define	CH_D0(eax)	(eax == 0x10f80 || eax == 0x10fb0)
#define	D0(eax)		(SH_D0(eax) || DH_D0(eax) || CH_D0(eax))

#define	SH_E0(eax)	(eax == 0x20f50 || eax == 0x20f40 || eax == 0x20f70)
#define	JH_E1(eax)	(eax == 0x20f10)	/* JH8_E0 had 0x20f30 */
#define	DH_E3(eax)	(eax == 0x20fc0 || eax == 0x20ff0)
#define	SH_E4(eax)	(eax == 0x20f51 || eax == 0x20f71)
#define	BH_E4(eax)	(eax == 0x20fb1)
#define	SH_E5(eax)	(eax == 0x20f42)
#define	DH_E6(eax)	(eax == 0x20ff2 || eax == 0x20fc2)
#define	JH_E6(eax)	(eax == 0x20f12 || eax == 0x20f32)
#define	EX(eax)		(SH_E0(eax) || JH_E1(eax) || DH_E3(eax) || \
			    SH_E4(eax) || BH_E4(eax) || SH_E5(eax) || \
			    DH_E6(eax) || JH_E6(eax))

#define	DR_AX(eax)	(eax == 0x100f00 || eax == 0x100f01 || eax == 0x100f02)
#define	DR_B0(eax)	(eax == 0x100f20)
#define	DR_B1(eax)	(eax == 0x100f21)
#define	DR_BA(eax)	(eax == 0x100f2a)
#define	DR_B2(eax)	(eax == 0x100f22)
#define	DR_B3(eax)	(eax == 0x100f23)
#define	RB_C0(eax)	(eax == 0x100f40)

	switch (erratum) {
	case 1:
		return (cpi->cpi_family < 0x10);
	case 51:	/* what does the asterisk mean? */
		return (B(eax) || SH_C0(eax) || CG(eax));
	case 52:
		return (B(eax));
	case 57:
		return (cpi->cpi_family <= 0x11);
	case 58:
		return (B(eax));
	case 60:
		return (cpi->cpi_family <= 0x11);
	case 61:
	case 62:
	case 63:
	case 64:
	case 65:
	case 66:
	case 68:
	case 69:
	case 70:
	case 71:
		return (B(eax));
	case 72:
		return (SH_B0(eax));
	case 74:
		return (B(eax));
	case 75:
		return (cpi->cpi_family < 0x10);
	case 76:
		return (B(eax));
	case 77:
		return (cpi->cpi_family <= 0x11);
	case 78:
		return (B(eax) || SH_C0(eax));
	case 79:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax) || EX(eax));
	case 80:
	case 81:
	case 82:
		return (B(eax));
	case 83:
		return (B(eax) || SH_C0(eax) || CG(eax));
	case 85:
		return (cpi->cpi_family < 0x10);
	case 86:
		return (SH_C0(eax) || CG(eax));
	case 88:
		return (B(eax) || SH_C0(eax));
	case 89:
		return (cpi->cpi_family < 0x10);
	case 90:
		return (B(eax) || SH_C0(eax) || CG(eax));
	case 91:
	case 92:
		return (B(eax) || SH_C0(eax));
	case 93:
		return (SH_C0(eax));
	case 94:
		return (B(eax) || SH_C0(eax) || CG(eax));
	case 95:
		return (B(eax) || SH_C0(eax));
	case 96:
		return (B(eax) || SH_C0(eax) || CG(eax));
	case 97:
	case 98:
		return (SH_C0(eax) || CG(eax));
	case 99:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax));
	case 100:
		return (B(eax) || SH_C0(eax));
	case 101:
	case 103:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax));
	case 104:
		return (SH_C0(eax) || CG(eax) || D0(eax));
	case 105:
	case 106:
	case 107:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax));
	case 108:
		return (DH_CG(eax));
	case 109:
		return (SH_C0(eax) || CG(eax) || D0(eax));
	case 110:
		return (D0(eax) || EX(eax));
	case 111:
		return (CG(eax));
	case 112:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax) || EX(eax));
	case 113:
		return (eax == 0x20fc0);
	case 114:
		return (SH_E0(eax) || JH_E1(eax) || DH_E3(eax));
	case 115:
		return (SH_E0(eax) || JH_E1(eax));
	case 116:
		return (SH_E0(eax) || JH_E1(eax) || DH_E3(eax));
	case 117:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax));
	case 118:
		return (SH_E0(eax) || JH_E1(eax) || SH_E4(eax) || BH_E4(eax) ||
		    JH_E6(eax));
	case 121:
		return (B(eax) || SH_C0(eax) || CG(eax) || D0(eax) || EX(eax));
	case 122:
		return (cpi->cpi_family < 0x10 || cpi->cpi_family == 0x11);
	case 123:
		return (JH_E1(eax) || BH_E4(eax) || JH_E6(eax));
	case 131:
		return (cpi->cpi_family < 0x10);
	case 6336786:

		/*
		 * Test for AdvPowerMgmtInfo.TscPStateInvariant
		 * if this is a K8 family or newer processor.
		 * We're testing for
		 * this 'erratum' to determine whether or not we have a constant
		 * TSC.
		 *
		 * Our current fix for this is to disable the C1-Clock ramping.
		 * However, this doesn't work on newer processor families nor
		 * does it work when virtualized as those devices don't exist.
		 */
		if (cpi->cpi_family >= 0x12 || get_hwenv() != HW_NATIVE) {
			return (0);
		}

		if (CPI_FAMILY(cpi) == 0xf) {
			struct cpuid_regs regs;
			regs.cp_eax = 0x80000007;
			(void) __cpuid_insn(&regs);
			return (!(regs.cp_edx & 0x100));
		}
		return (0);
	case 147:
		/*
		 * This erratum (K8 #147) is not present on family 0x10 and
		 * newer.
		 */
		if (cpi->cpi_family >= 0x10) {
			return (0);
		}
		return (((((eax >> 12) & 0xff00) + (eax & 0xf00)) |
		    (((eax >> 4) & 0xf) | ((eax >> 12) & 0xf0))) < 0xf40);

	case 6671130:
		/*
		 * Check for processors (pre-Shanghai) that do not provide
		 * optimal management of 1GB PTEs in their TLB.
		 */
		return (cpi->cpi_family == 0x10 && cpi->cpi_model < 4);

	case 298:
		return (DR_AX(eax) || DR_B0(eax) || DR_B1(eax) || DR_BA(eax) ||
		    DR_B2(eax) || RB_C0(eax));

	case 721:
		return (cpi->cpi_family == 0x10 || cpi->cpi_family == 0x12);

	default:
		return (-1);

	}
}

/*
 * Determine if specified erratum is present via OSVW (OS Visible Workaround).
 * Return 1 if erratum is present, 0 if not present and -1 if indeterminate.
 */
int
osvw_opteron_erratum(cpu_t *cpu, uint_t erratum)
{
	struct cpuid_info *cpi;
	uint_t osvwid;
	static int osvwfeature = -1;
	uint64_t osvwlength;


	cpi = cpu->cpu_m.mcpu_cpi;

	/* confirm OSVW supported */
	if (osvwfeature == -1) {
		osvwfeature = cpi->cpi_extd[1].cp_ecx & CPUID_AMD_ECX_OSVW;
	} else {
		/* assert that osvw feature setting is consistent on all cpus */
		ASSERT(osvwfeature ==
		    (cpi->cpi_extd[1].cp_ecx & CPUID_AMD_ECX_OSVW));
	}
	if (!osvwfeature)
		return (-1);

	osvwlength = rdmsr(MSR_AMD_OSVW_ID_LEN) & OSVW_ID_LEN_MASK;

	switch (erratum) {
	case 298:	/* osvwid is 0 */
		osvwid = 0;
		if (osvwlength <= (uint64_t)osvwid) {
			/* osvwid 0 is unknown */
			return (-1);
		}

		/*
		 * Check the OSVW STATUS MSR to determine the state
		 * of the erratum where:
		 *   0 - fixed by HW
		 *   1 - BIOS has applied the workaround when BIOS
		 *   workaround is available. (Or for other errata,
		 *   OS workaround is required.)
		 * For a value of 1, caller will confirm that the
		 * erratum 298 workaround has indeed been applied by BIOS.
		 *
		 * A 1 may be set in cpus that have a HW fix
		 * in a mixed cpu system. Regarding erratum 298:
		 *   In a multiprocessor platform, the workaround above
		 *   should be applied to all processors regardless of
		 *   silicon revision when an affected processor is
		 *   present.
		 */

		return (rdmsr(MSR_AMD_OSVW_STATUS +
		    (osvwid / OSVW_ID_CNT_PER_MSR)) &
		    (1ULL << (osvwid % OSVW_ID_CNT_PER_MSR)));

	default:
		return (-1);
	}
}

static const char assoc_str[] = "associativity";
static const char line_str[] = "line-size";
static const char size_str[] = "size";

static void
add_cache_prop(dev_info_t *devi, const char *label, const char *type,
    uint32_t val)
{
	char buf[128];

	/*
	 * ndi_prop_update_int() is used because it is desirable for
	 * DDI_PROP_HW_DEF and DDI_PROP_DONTSLEEP to be set.
	 */
	if (snprintf(buf, sizeof (buf), "%s-%s", label, type) < sizeof (buf))
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, devi, buf, val);
}

/*
 * Intel-style cache/tlb description
 *
 * Standard cpuid level 2 gives a randomly ordered
 * selection of tags that index into a table that describes
 * cache and tlb properties.
 */

static const char l1_icache_str[] = "l1-icache";
static const char l1_dcache_str[] = "l1-dcache";
static const char l2_cache_str[] = "l2-cache";
static const char l3_cache_str[] = "l3-cache";
static const char itlb4k_str[] = "itlb-4K";
static const char dtlb4k_str[] = "dtlb-4K";
static const char itlb2M_str[] = "itlb-2M";
static const char itlb4M_str[] = "itlb-4M";
static const char dtlb4M_str[] = "dtlb-4M";
static const char dtlb24_str[] = "dtlb0-2M-4M";
static const char itlb424_str[] = "itlb-4K-2M-4M";
static const char itlb24_str[] = "itlb-2M-4M";
static const char dtlb44_str[] = "dtlb-4K-4M";
static const char sl1_dcache_str[] = "sectored-l1-dcache";
static const char sl2_cache_str[] = "sectored-l2-cache";
static const char itrace_str[] = "itrace-cache";
static const char sl3_cache_str[] = "sectored-l3-cache";
static const char sh_l2_tlb4k_str[] = "shared-l2-tlb-4k";

static
const struct cachetab {
	uint8_t		ct_code;
	uint8_t		ct_assoc;
	uint16_t	ct_line_size;
	size_t		ct_size;
	const char	*ct_label;
} intel_ctab[] = {
	/*
	 * maintain descending order!
	 *
	 * Codes ignored - Reason
	 * ----------------------
	 * 40H - intel_cpuid_4_cache_info() disambiguates l2/l3 cache
	 * f0H/f1H - Currently we do not interpret prefetch size by design
	 */
	{ 0xe4, 16, 64, 8*1024*1024, l3_cache_str},
	{ 0xe3, 16, 64, 4*1024*1024, l3_cache_str},
	{ 0xe2, 16, 64, 2*1024*1024, l3_cache_str},
	{ 0xde, 12, 64, 6*1024*1024, l3_cache_str},
	{ 0xdd, 12, 64, 3*1024*1024, l3_cache_str},
	{ 0xdc, 12, 64, ((1*1024*1024)+(512*1024)), l3_cache_str},
	{ 0xd8, 8, 64, 4*1024*1024, l3_cache_str},
	{ 0xd7, 8, 64, 2*1024*1024, l3_cache_str},
	{ 0xd6, 8, 64, 1*1024*1024, l3_cache_str},
	{ 0xd2, 4, 64, 2*1024*1024, l3_cache_str},
	{ 0xd1, 4, 64, 1*1024*1024, l3_cache_str},
	{ 0xd0, 4, 64, 512*1024, l3_cache_str},
	{ 0xca, 4, 0, 512, sh_l2_tlb4k_str},
	{ 0xc0, 4, 0, 8, dtlb44_str },
	{ 0xba, 4, 0, 64, dtlb4k_str },
	{ 0xb4, 4, 0, 256, dtlb4k_str },
	{ 0xb3, 4, 0, 128, dtlb4k_str },
	{ 0xb2, 4, 0, 64, itlb4k_str },
	{ 0xb0, 4, 0, 128, itlb4k_str },
	{ 0x87, 8, 64, 1024*1024, l2_cache_str},
	{ 0x86, 4, 64, 512*1024, l2_cache_str},
	{ 0x85, 8, 32, 2*1024*1024, l2_cache_str},
	{ 0x84, 8, 32, 1024*1024, l2_cache_str},
	{ 0x83, 8, 32, 512*1024, l2_cache_str},
	{ 0x82, 8, 32, 256*1024, l2_cache_str},
	{ 0x80, 8, 64, 512*1024, l2_cache_str},
	{ 0x7f, 2, 64, 512*1024, l2_cache_str},
	{ 0x7d, 8, 64, 2*1024*1024, sl2_cache_str},
	{ 0x7c, 8, 64, 1024*1024, sl2_cache_str},
	{ 0x7b, 8, 64, 512*1024, sl2_cache_str},
	{ 0x7a, 8, 64, 256*1024, sl2_cache_str},
	{ 0x79, 8, 64, 128*1024, sl2_cache_str},
	{ 0x78, 8, 64, 1024*1024, l2_cache_str},
	{ 0x73, 8, 0, 64*1024, itrace_str},
	{ 0x72, 8, 0, 32*1024,
	    itrace_str},
	{ 0x71, 8, 0, 16*1024, itrace_str},
	{ 0x70, 8, 0, 12*1024, itrace_str},
	{ 0x68, 4, 64, 32*1024, sl1_dcache_str},
	{ 0x67, 4, 64, 16*1024, sl1_dcache_str},
	{ 0x66, 4, 64, 8*1024, sl1_dcache_str},
	{ 0x60, 8, 64, 16*1024, sl1_dcache_str},
	{ 0x5d, 0, 0, 256, dtlb44_str},
	{ 0x5c, 0, 0, 128, dtlb44_str},
	{ 0x5b, 0, 0, 64, dtlb44_str},
	{ 0x5a, 4, 0, 32, dtlb24_str},
	{ 0x59, 0, 0, 16, dtlb4k_str},
	{ 0x57, 4, 0, 16, dtlb4k_str},
	{ 0x56, 4, 0, 16, dtlb4M_str},
	{ 0x55, 0, 0, 7, itlb24_str},
	{ 0x52, 0, 0, 256, itlb424_str},
	{ 0x51, 0, 0, 128, itlb424_str},
	{ 0x50, 0, 0, 64, itlb424_str},
	{ 0x4f, 0, 0, 32, itlb4k_str},
	{ 0x4e, 24, 64, 6*1024*1024, l2_cache_str},
	{ 0x4d, 16, 64, 16*1024*1024, l3_cache_str},
	{ 0x4c, 12, 64, 12*1024*1024, l3_cache_str},
	{ 0x4b, 16, 64, 8*1024*1024, l3_cache_str},
	{ 0x4a, 12, 64, 6*1024*1024, l3_cache_str},
	{ 0x49, 16, 64, 4*1024*1024, l3_cache_str},
	{ 0x48, 12, 64, 3*1024*1024, l2_cache_str},
	{ 0x47, 8, 64, 8*1024*1024, l3_cache_str},
	{ 0x46, 4, 64, 4*1024*1024, l3_cache_str},
	{ 0x45, 4, 32, 2*1024*1024, l2_cache_str},
	{ 0x44, 4, 32, 1024*1024, l2_cache_str},
	{ 0x43, 4, 32, 512*1024, l2_cache_str},
	{ 0x42, 4, 32, 256*1024, l2_cache_str},
	{ 0x41, 4, 32, 128*1024, l2_cache_str},
	{ 0x3e, 4, 64, 512*1024, sl2_cache_str},
	{ 0x3d, 6, 64, 384*1024, sl2_cache_str},
	{ 0x3c, 4, 64, 256*1024, sl2_cache_str},
	{ 0x3b, 2, 64, 128*1024, sl2_cache_str},
	{ 0x3a, 6, 64, 192*1024, sl2_cache_str},
	{ 0x39, 4, 64, 128*1024, sl2_cache_str},
	{ 0x30, 8, 64, 32*1024, l1_icache_str},
	{ 0x2c, 8, 64, 32*1024, l1_dcache_str},
	{ 0x29, 8, 64, 4096*1024, sl3_cache_str},
	{ 0x25, 8, 64, 2048*1024, sl3_cache_str},
	{ 0x23, 8, 64, 1024*1024, sl3_cache_str},
	{ 0x22, 4, 64, 512*1024, sl3_cache_str},
	{ 0x0e, 6, 64, 24*1024, l1_dcache_str},
	{ 0x0d, 4,
	    32, 16*1024, l1_dcache_str},
	{ 0x0c, 4, 32, 16*1024, l1_dcache_str},
	{ 0x0b, 4, 0, 4, itlb4M_str},
	{ 0x0a, 2, 32, 8*1024, l1_dcache_str},
	{ 0x08, 4, 32, 16*1024, l1_icache_str},
	{ 0x06, 4, 32, 8*1024, l1_icache_str},
	{ 0x05, 4, 0, 32, dtlb4M_str},
	{ 0x04, 4, 0, 8, dtlb4M_str},
	{ 0x03, 4, 0, 64, dtlb4k_str},
	{ 0x02, 4, 0, 2, itlb4M_str},
	{ 0x01, 4, 0, 32, itlb4k_str},
	{ 0 }
};

static const struct cachetab cyrix_ctab[] = {
	{ 0x70, 4, 0, 32, "tlb-4K" },
	{ 0x80, 4, 16, 16*1024, "l1-cache" },
	{ 0 }
};

/*
 * Search a cache table for a matching entry
 */
static const struct cachetab *
find_cacheent(const struct cachetab *ct, uint_t code)
{
	if (code != 0) {
		for (; ct->ct_code != 0; ct++)
			if (ct->ct_code <= code)
				break;
		if (ct->ct_code == code)
			return (ct);
	}
	return (NULL);
}

/*
 * Populate cachetab entry with L2 or L3 cache-information using
 * cpuid function 4. This function is called from intel_walk_cacheinfo()
 * when descriptor 0x49 is encountered. It returns 0 if no such cache
 * information is found.
 */
static int
intel_cpuid_4_cache_info(struct cachetab *ct, struct cpuid_info *cpi)
{
	uint32_t level, i;
	int ret = 0;

	for (i = 0; i < cpi->cpi_cache_leaf_size; i++) {
		level = CPI_CACHE_LVL(cpi->cpi_cache_leaves[i]);

		if (level == 2 || level == 3) {
			ct->ct_assoc =
			    CPI_CACHE_WAYS(cpi->cpi_cache_leaves[i]) + 1;
			ct->ct_line_size =
			    CPI_CACHE_COH_LN_SZ(cpi->cpi_cache_leaves[i]) + 1;
			ct->ct_size = ct->ct_assoc *
			    (CPI_CACHE_PARTS(cpi->cpi_cache_leaves[i]) + 1) *
			    ct->ct_line_size *
			    (cpi->cpi_cache_leaves[i]->cp_ecx + 1);

			if (level == 2) {
				ct->ct_label = l2_cache_str;
			} else if (level == 3) {
				ct->ct_label = l3_cache_str;
			}
			ret = 1;
		}
	}

	return (ret);
}

/*
 * Walk the cacheinfo descriptor, applying 'func' to every valid element.
 * The walk is terminated if the walker returns non-zero.
 */
static void
intel_walk_cacheinfo(struct cpuid_info *cpi,
    void *arg, int (*func)(void *, const struct cachetab *))
{
	const struct cachetab *ct;
	struct cachetab des_49_ct, des_b1_ct;
	uint8_t *dp;
	int i;

	if ((dp = cpi->cpi_cacheinfo) == NULL)
		return;
	for (i = 0; i < cpi->cpi_ncache; i++, dp++) {
		/*
		 * For overloaded descriptor 0x49 we use cpuid function 4
		 * if supported by the current processor, to create
		 * cache information.
		 * For overloaded descriptor 0xb1 we use X86_PAE flag
		 * to disambiguate the cache information.
		 */
		if (*dp == 0x49 && cpi->cpi_maxeax >= 0x4 &&
		    intel_cpuid_4_cache_info(&des_49_ct, cpi) == 1) {
			ct = &des_49_ct;
		} else if (*dp == 0xb1) {
			des_b1_ct.ct_code = 0xb1;
			des_b1_ct.ct_assoc = 4;
			des_b1_ct.ct_line_size = 0;
			if (is_x86_feature(x86_featureset, X86FSET_PAE)) {
				des_b1_ct.ct_size = 8;
				des_b1_ct.ct_label = itlb2M_str;
			} else {
				des_b1_ct.ct_size = 4;
				des_b1_ct.ct_label = itlb4M_str;
			}
			ct = &des_b1_ct;
		} else {
			if ((ct = find_cacheent(intel_ctab, *dp)) == NULL) {
				continue;
			}
		}

		if (func(arg, ct) != 0) {
			break;
		}
	}
}

/*
 * (Like the Intel one, except for Cyrix CPUs)
 */
static void
cyrix_walk_cacheinfo(struct cpuid_info *cpi,
    void *arg, int (*func)(void *, const struct cachetab *))
{
	const struct cachetab *ct;
	uint8_t *dp;
	int i;

	if ((dp = cpi->cpi_cacheinfo) == NULL)
		return;
	for (i = 0; i < cpi->cpi_ncache; i++, dp++) {
		/*
		 * Search Cyrix-specific descriptor table first ..
		 */
		if ((ct = find_cacheent(cyrix_ctab, *dp)) != NULL) {
			if (func(arg, ct) != 0)
				break;
			continue;
		}
		/*
		 * .. else fall back to the Intel one
		 */
		if ((ct = find_cacheent(intel_ctab, *dp)) != NULL) {
			if (func(arg, ct) != 0)
				break;
			continue;
		}
	}
}

/*
 * A cacheinfo walker that adds associativity, line-size, and size properties
 * to the devinfo node it is passed as an argument.
 */
static int
add_cacheent_props(void *arg, const struct cachetab *ct)
{
	dev_info_t *devi = arg;

	add_cache_prop(devi, ct->ct_label, assoc_str, ct->ct_assoc);
	if (ct->ct_line_size != 0)
		add_cache_prop(devi, ct->ct_label, line_str,
		    ct->ct_line_size);
	add_cache_prop(devi, ct->ct_label, size_str, ct->ct_size);
	return (0);
}


static const char fully_assoc[] = "fully-associative?";

/*
 * AMD style cache/tlb description
 *
 * Extended functions 5 and 6 directly describe properties of
 * tlbs and various cache levels.
 */
static void
add_amd_assoc(dev_info_t *devi, const char *label, uint_t assoc)
{
	switch (assoc) {
	case 0:	/* reserved; ignore */
		break;
	default:
		add_cache_prop(devi, label, assoc_str, assoc);
		break;
	case 0xff:
		add_cache_prop(devi, label, fully_assoc, 1);
		break;
	}
}

static void
add_amd_tlb(dev_info_t *devi, const char *label, uint_t assoc, uint_t size)
{
	if (size == 0)
		return;
	add_cache_prop(devi, label, size_str, size);
	add_amd_assoc(devi, label, assoc);
}

static void
add_amd_cache(dev_info_t *devi, const char *label,
    uint_t size, uint_t assoc, uint_t lines_per_tag, uint_t line_size)
{
	if (size == 0 || line_size == 0)
		return;
	add_amd_assoc(devi, label, assoc);
	/*
	 * Most AMD parts have a sectored cache. Multiple cache lines are
	 * associated with each tag. A sector consists of all cache lines
	 * associated with a tag. For example, the AMD K6-III has a sector
	 * size of 2 cache lines per tag.
	 */
	if (lines_per_tag != 0)
		add_cache_prop(devi, label, "lines-per-tag", lines_per_tag);
	add_cache_prop(devi, label, line_str, line_size);
	add_cache_prop(devi, label, size_str, size * 1024);
}

static void
add_amd_l2_assoc(dev_info_t *devi, const char *label, uint_t assoc)
{
	switch (assoc) {
	case 0:	/* off */
		break;
	case 1:
	case 2:
	case 4:
		add_cache_prop(devi, label, assoc_str, assoc);
		break;
	case 6:
		add_cache_prop(devi, label, assoc_str, 8);
		break;
	case 8:
		add_cache_prop(devi, label, assoc_str, 16);
		break;
	case 0xf:
		add_cache_prop(devi, label, fully_assoc, 1);
		break;
	default: /* reserved; ignore */
		break;
	}
}

static void
add_amd_l2_tlb(dev_info_t *devi, const char *label, uint_t assoc, uint_t size)
{
	if (size == 0 || assoc == 0)
		return;
	add_amd_l2_assoc(devi, label, assoc);
	add_cache_prop(devi, label, size_str, size);
}

static void
add_amd_l2_cache(dev_info_t *devi, const char *label,
    uint_t size, uint_t assoc, uint_t lines_per_tag, uint_t line_size)
{
	if (size == 0 || assoc == 0 || line_size == 0)
		return;
	add_amd_l2_assoc(devi, label, assoc);
	if (lines_per_tag != 0)
		add_cache_prop(devi, label, "lines-per-tag", lines_per_tag);
	add_cache_prop(devi, label, line_str, line_size);
	add_cache_prop(devi, label, size_str, size * 1024);
}

static void
amd_cache_info(struct cpuid_info *cpi, dev_info_t *devi)
{
	struct cpuid_regs *cp;

	if (cpi->cpi_xmaxeax < 0x80000005)
		return;
	cp = &cpi->cpi_extd[5];

	/*
	 * 4M/2M L1 TLB configuration
	 *
	 * We report the size for 2M pages because AMD uses two
	 * TLB entries for one 4M page.
	 */
	add_amd_tlb(devi, "dtlb-2M",
	    BITX(cp->cp_eax, 31, 24), BITX(cp->cp_eax, 23, 16));
	add_amd_tlb(devi, "itlb-2M",
	    BITX(cp->cp_eax, 15, 8), BITX(cp->cp_eax, 7, 0));

	/*
	 * 4K L1 TLB configuration
	 */

	switch (cpi->cpi_vendor) {
		uint_t nentries;
	case X86_VENDOR_TM:
		if (cpi->cpi_family >= 5) {
			/*
			 * Crusoe processors have 256 TLB entries, but
			 * cpuid data format constrains them to only
			 * reporting 255 of them.
			 */
			if ((nentries = BITX(cp->cp_ebx, 23, 16)) == 255)
				nentries = 256;
			/*
			 * Crusoe processors also have a unified TLB
			 */
			add_amd_tlb(devi, "tlb-4K", BITX(cp->cp_ebx, 31, 24),
			    nentries);
			break;
		}
		/*FALLTHROUGH*/
	default:
		add_amd_tlb(devi, itlb4k_str,
		    BITX(cp->cp_ebx, 31, 24), BITX(cp->cp_ebx, 23, 16));
		add_amd_tlb(devi, dtlb4k_str,
		    BITX(cp->cp_ebx, 15, 8), BITX(cp->cp_ebx, 7, 0));
		break;
	}

	/*
	 * data L1 cache configuration
	 */

	add_amd_cache(devi, l1_dcache_str,
	    BITX(cp->cp_ecx, 31, 24), BITX(cp->cp_ecx, 23, 16),
	    BITX(cp->cp_ecx, 15, 8), BITX(cp->cp_ecx, 7, 0));

	/*
	 * code L1 cache configuration
	 */

	add_amd_cache(devi, l1_icache_str,
	    BITX(cp->cp_edx, 31, 24), BITX(cp->cp_edx, 23, 16),
	    BITX(cp->cp_edx, 15, 8), BITX(cp->cp_edx, 7, 0));

	if (cpi->cpi_xmaxeax < 0x80000006)
		return;
	cp = &cpi->cpi_extd[6];

	/* Check for a unified L2 TLB for large pages */

	if (BITX(cp->cp_eax, 31, 16) == 0)
		add_amd_l2_tlb(devi, "l2-tlb-2M",
		    BITX(cp->cp_eax, 15, 12), BITX(cp->cp_eax, 11, 0));
	else {
		add_amd_l2_tlb(devi, "l2-dtlb-2M",
		    BITX(cp->cp_eax, 31, 28), BITX(cp->cp_eax, 27, 16));
		add_amd_l2_tlb(devi, "l2-itlb-2M",
		    BITX(cp->cp_eax, 15, 12), BITX(cp->cp_eax, 11, 0));
	}

	/* Check for a unified L2 TLB for 4K pages */

	if (BITX(cp->cp_ebx, 31, 16)
	    == 0) {
		add_amd_l2_tlb(devi, "l2-tlb-4K",
		    BITX(cp->cp_ebx, 15, 12), BITX(cp->cp_ebx, 11, 0));
	} else {
		add_amd_l2_tlb(devi, "l2-dtlb-4K",
		    BITX(cp->cp_ebx, 31, 28), BITX(cp->cp_ebx, 27, 16));
		add_amd_l2_tlb(devi, "l2-itlb-4K",
		    BITX(cp->cp_ebx, 15, 12), BITX(cp->cp_ebx, 11, 0));
	}

	add_amd_l2_cache(devi, l2_cache_str,
	    BITX(cp->cp_ecx, 31, 16), BITX(cp->cp_ecx, 15, 12),
	    BITX(cp->cp_ecx, 11, 8), BITX(cp->cp_ecx, 7, 0));
}

/*
 * There are two basic ways that the x86 world describes its cache
 * and tlb architecture - Intel's way and AMD's way.
 *
 * Return which flavor of cache architecture we should use
 */
static int
x86_which_cacheinfo(struct cpuid_info *cpi)
{
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (cpi->cpi_maxeax >= 2)
			return (X86_VENDOR_Intel);
		break;
	case X86_VENDOR_AMD:
		/*
		 * The K5 model 1 was the first part from AMD that reported
		 * cache sizes via extended cpuid functions.
		 */
		if (cpi->cpi_family > 5 ||
		    (cpi->cpi_family == 5 && cpi->cpi_model >= 1))
			return (X86_VENDOR_AMD);
		break;
	case X86_VENDOR_HYGON:
		return (X86_VENDOR_AMD);
	case X86_VENDOR_TM:
		if (cpi->cpi_family >= 5)
			return (X86_VENDOR_AMD);
		/*FALLTHROUGH*/
	default:
		/*
		 * If they have extended CPU data for 0x80000005
		 * then we assume they have AMD-format cache
		 * information.
		 *
		 * If not, and the vendor happens to be Cyrix,
		 * then try our Cyrix-specific handler.
		 *
		 * If we're not Cyrix, then assume we're using Intel's
		 * table-driven format instead.
		 */
		if (cpi->cpi_xmaxeax >= 0x80000005)
			return (X86_VENDOR_AMD);
		else if (cpi->cpi_vendor == X86_VENDOR_Cyrix)
			return (X86_VENDOR_Cyrix);
		else if (cpi->cpi_maxeax >= 2)
			return (X86_VENDOR_Intel);
		break;
	}
	return (-1);
}

void
cpuid_set_cpu_properties(void *dip, processorid_t cpu_id,
    struct cpuid_info *cpi)
{
	dev_info_t *cpu_devi;
	int create;

	cpu_devi = (dev_info_t *)dip;

	/* device_type */
	(void) ndi_prop_update_string(DDI_DEV_T_NONE, cpu_devi,
	    "device_type", "cpu");

	/* reg */
	(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
	    "reg", cpu_id);

	/* cpu-mhz, and clock-frequency */
	if (cpu_freq > 0) {
		long long mul;

		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "cpu-mhz", cpu_freq);
		if ((mul = cpu_freq * 1000000LL) <= INT_MAX)
			(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
			    "clock-frequency", (int)mul);
	}

	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	/* vendor-id */
	(void) ndi_prop_update_string(DDI_DEV_T_NONE, cpu_devi,
	    "vendor-id", cpi->cpi_vendorstr);

	if (cpi->cpi_maxeax == 0) {
		return;
	}

	/*
	 * family, model, and step
	 */
	(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
	    "family", CPI_FAMILY(cpi));
	(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
	    "cpu-model", CPI_MODEL(cpi));
	(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
	    "stepping-id", CPI_STEP(cpi));

	/* type */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		create = 1;
		break;
	default:
		create = 0;
		break;
	}
	if (create)
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "type", CPI_TYPE(cpi));

	/* ext-family */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
	case X86_VENDOR_AMD:
		create =
		    cpi->cpi_family >= 0xf;
		break;
	case X86_VENDOR_HYGON:
		create = 1;
		break;
	default:
		create = 0;
		break;
	}
	if (create)
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "ext-family", CPI_FAMILY_XTD(cpi));

	/* ext-model */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		create = IS_EXTENDED_MODEL_INTEL(cpi);
		break;
	case X86_VENDOR_AMD:
		create = CPI_FAMILY(cpi) == 0xf;
		break;
	case X86_VENDOR_HYGON:
		create = 1;
		break;
	default:
		create = 0;
		break;
	}
	if (create)
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "ext-model", CPI_MODEL_XTD(cpi));

	/* generation */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		/*
		 * AMD K5 model 1 was the first part to support this
		 */
		create = cpi->cpi_xmaxeax >= 0x80000001;
		break;
	default:
		create = 0;
		break;
	}
	if (create)
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "generation", BITX((cpi)->cpi_extd[1].cp_eax, 11, 8));

	/* brand-id */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		/*
		 * brand id first appeared on Pentium III Xeon model 8,
		 * and Celeron model 8 processors and Opteron
		 */
		create = cpi->cpi_family > 6 ||
		    (cpi->cpi_family == 6 && cpi->cpi_model >= 8);
		break;
	case X86_VENDOR_AMD:
		create = cpi->cpi_family >= 0xf;
		break;
	case X86_VENDOR_HYGON:
		create = 1;
		break;
	default:
		create = 0;
		break;
	}
	if (create && cpi->cpi_brandid != 0) {
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "brand-id", cpi->cpi_brandid);
	}

	/* chunks, and apic-id */
	switch (cpi->cpi_vendor) {
		/*
		 * first available on Pentium IV and Opteron (K8)
		 */
	case X86_VENDOR_Intel:
		create = IS_NEW_F6(cpi) || cpi->cpi_family >= 0xf;
		break;
	case X86_VENDOR_AMD:
		create = cpi->cpi_family >= 0xf;
		break;
	case X86_VENDOR_HYGON:
		create = 1;
		break;
	default:
		create = 0;
		break;
	}
	if (create) {
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "chunks", CPI_CHUNKS(cpi));
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "apic-id", cpi->cpi_apicid);
		if (cpi->cpi_chipid >= 0) {
			(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
			    "chip#", cpi->cpi_chipid);
			(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
			    "clog#", cpi->cpi_clogid);
		}
	}

	/* cpuid-features */
	(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
	    "cpuid-features", CPI_FEATURES_EDX(cpi));


	/* cpuid-features-ecx */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		create = IS_NEW_F6(cpi) || cpi->cpi_family >= 0xf;
		break;
	case X86_VENDOR_AMD:
		create = cpi->cpi_family >= 0xf;
		break;
	case X86_VENDOR_HYGON:
		create = 1;
		break;
	default:
		create = 0;
		break;
	}
	if (create)
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "cpuid-features-ecx", CPI_FEATURES_ECX(cpi));

	/* ext-cpuid-features */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
	case X86_VENDOR_Cyrix:
	case X86_VENDOR_TM:
	case X86_VENDOR_Centaur:
		create = cpi->cpi_xmaxeax >= 0x80000001;
		break;
	default:
		create = 0;
		break;
	}
	if (create) {
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "ext-cpuid-features", CPI_FEATURES_XTD_EDX(cpi));
		(void) ndi_prop_update_int(DDI_DEV_T_NONE, cpu_devi,
		    "ext-cpuid-features-ecx", CPI_FEATURES_XTD_ECX(cpi));
	}

	/*
	 * Brand String first appeared in Intel Pentium IV, AMD K5
	 * model 1, and Cyrix GXm.
	 * On earlier models we try to
	 * simulate something similar .. so this string should always
	 * say -something- about the processor, however lame.
	 */
	(void) ndi_prop_update_string(DDI_DEV_T_NONE, cpu_devi,
	    "brand-string", cpi->cpi_brandstr);

	/*
	 * Finally, cache and tlb information
	 */
	switch (x86_which_cacheinfo(cpi)) {
	case X86_VENDOR_Intel:
		intel_walk_cacheinfo(cpi, cpu_devi, add_cacheent_props);
		break;
	case X86_VENDOR_Cyrix:
		cyrix_walk_cacheinfo(cpi, cpu_devi, add_cacheent_props);
		break;
	case X86_VENDOR_AMD:
		amd_cache_info(cpi, cpu_devi);
		break;
	default:
		break;
	}
}

struct l2info {
	int *l2i_csz;
	int *l2i_lsz;
	int *l2i_assoc;
	int l2i_ret;
};

/*
 * A cacheinfo walker that fetches the size, line-size and associativity
 * of the L2 cache
 */
static int
intel_l2cinfo(void *arg, const struct cachetab *ct)
{
	struct l2info *l2i = arg;
	int *ip;

	if (ct->ct_label != l2_cache_str &&
	    ct->ct_label != sl2_cache_str)
		return (0);	/* not an L2 -- keep walking */

	if ((ip = l2i->l2i_csz) != NULL)
		*ip = ct->ct_size;
	if ((ip = l2i->l2i_lsz) != NULL)
		*ip = ct->ct_line_size;
	if ((ip = l2i->l2i_assoc) != NULL)
		*ip = ct->ct_assoc;
	l2i->l2i_ret = ct->ct_size;
	return (1);		/* was an L2 -- terminate walk */
}

/*
 * AMD L2/L3 Cache and TLB Associativity Field Definition:
 *
 * Unlike the associativity for the L1 cache and tlb where the 8 bit
 * value is the associativity, the associativity for the L2 cache and
 * tlb is encoded in the following table. The 4 bit L2 value serves as
 * an index into the amd_afd[] array to determine the associativity.
 * -1 is undefined. 0 is fully associative.
 */

static int amd_afd[] =
	{-1, 1, 2, -1, 4, -1, 8, -1, 16, -1, 32, 48, 64, 96, 128, 0};

static void
amd_l2cacheinfo(struct cpuid_info *cpi, struct l2info *l2i)
{
	struct cpuid_regs *cp;
	uint_t size, assoc;
	int i;
	int *ip;

	if (cpi->cpi_xmaxeax < 0x80000006)
		return;
	cp = &cpi->cpi_extd[6];

	if ((i = BITX(cp->cp_ecx, 15, 12)) != 0 &&
	    (size = BITX(cp->cp_ecx, 31, 16)) != 0) {
		uint_t cachesz = size * 1024;
		assoc = amd_afd[i];

		ASSERT(assoc != -1);

		if ((ip = l2i->l2i_csz) != NULL)
			*ip = cachesz;
		if ((ip = l2i->l2i_lsz) != NULL)
			*ip = BITX(cp->cp_ecx, 7, 0);
		if ((ip = l2i->l2i_assoc) != NULL)
			*ip = assoc;
		l2i->l2i_ret = cachesz;
	}
}

int
getl2cacheinfo(cpu_t *cpu, int *csz, int *lsz, int *assoc)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	struct l2info __l2info, *l2i = &__l2info;

	l2i->l2i_csz = csz;
	l2i->l2i_lsz = lsz;
	l2i->l2i_assoc = assoc;
	l2i->l2i_ret = -1;

	switch (x86_which_cacheinfo(cpi)) {
	case X86_VENDOR_Intel:
		intel_walk_cacheinfo(cpi, l2i, intel_l2cinfo);
		break;
	case X86_VENDOR_Cyrix:
		cyrix_walk_cacheinfo(cpi, l2i, intel_l2cinfo);
		break;
	case X86_VENDOR_AMD:
		amd_l2cacheinfo(cpi, l2i);
		break;
	default:
		break;
	}
	return (l2i->l2i_ret);
}

#if !defined(__xpv)

uint32_t *
cpuid_mwait_alloc(cpu_t *cpu)
{
	uint32_t *ret;
	size_t mwait_size;

	ASSERT(cpuid_checkpass(CPU, CPUID_PASS_EXTENDED));

	mwait_size = CPU->cpu_m.mcpu_cpi->cpi_mwait.mon_max;
	if (mwait_size == 0)
		return (NULL);

	/*
	 * kmem_alloc() returns cache line size aligned data for mwait_size
	 * allocations. mwait_size is currently cache line sized.
	 * Neither of these implementation details is guaranteed to be true in
	 * the future.
	 *
	 * First try allocating mwait_size as kmem_alloc() currently returns
	 * correctly aligned memory. If kmem_alloc() does not return
	 * mwait_size aligned memory, then use mwait_size ROUNDUP.
	 *
	 * Set cpi_mwait.buf_actual and cpi_mwait.size_actual in case we
	 * decide to free this memory.
	 */
	ret = kmem_zalloc(mwait_size, KM_SLEEP);
	if (ret == (uint32_t *)P2ROUNDUP((uintptr_t)ret, mwait_size)) {
		cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual = ret;
		cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual = mwait_size;
		*ret = MWAIT_RUNNING;
		return (ret);
	} else {
		kmem_free(ret, mwait_size);
		ret = kmem_zalloc(mwait_size * 2, KM_SLEEP);
		cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual = ret;
		cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual = mwait_size * 2;
		ret = (uint32_t *)P2ROUNDUP((uintptr_t)ret, mwait_size);
		*ret = MWAIT_RUNNING;
		return (ret);
	}
}

void
cpuid_mwait_free(cpu_t *cpu)
{
	if (cpu->cpu_m.mcpu_cpi == NULL) {
		return;
	}

	if (cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual != NULL &&
	    cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual > 0) {
		kmem_free(cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual,
		    cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual);
	}

	cpu->cpu_m.mcpu_cpi->cpi_mwait.buf_actual = NULL;
	cpu->cpu_m.mcpu_cpi->cpi_mwait.size_actual = 0;
}

void
patch_tsc_read(int flag)
{
	size_t cnt;

	switch (flag) {
	case TSC_NONE:
		cnt = &_no_rdtsc_end - &_no_rdtsc_start;
		(void) memcpy((void *)tsc_read, (void *)&_no_rdtsc_start, cnt);
		break;
	case TSC_RDTSC_LFENCE:
		cnt = &_tsc_lfence_end - &_tsc_lfence_start;
		(void) memcpy((void *)tsc_read,
		    (void *)&_tsc_lfence_start, cnt);
		break;
	case TSC_TSCP:
		cnt = &_tscp_end - &_tscp_start;
		(void)
		    memcpy((void *)tsc_read, (void *)&_tscp_start, cnt);
		break;
	default:
		/* Bail for unexpected TSC types. (TSC_NONE covers 0) */
		cmn_err(CE_PANIC, "Unrecognized TSC type: %d", flag);
		break;
	}
	tsc_type = flag;
}

int
cpuid_deep_cstates_supported(void)
{
	struct cpuid_info *cpi;
	struct cpuid_regs regs;

	ASSERT(cpuid_checkpass(CPU, CPUID_PASS_BASIC));
	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	cpi = CPU->cpu_m.mcpu_cpi;

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (cpi->cpi_xmaxeax < 0x80000007)
			return (0);

		/*
		 * Does TSC run at a constant rate in all C-states?
		 */
		regs.cp_eax = 0x80000007;
		(void) __cpuid_insn(&regs);
		return (regs.cp_edx & CPUID_TSC_CSTATE_INVARIANCE);

	default:
		return (0);
	}
}

#endif	/* !__xpv */

void
post_startup_cpu_fixups(void)
{
#ifndef __xpv
	/*
	 * Some AMD processors support C1E state. Entering this state will
	 * cause the local APIC timer to stop, which we can't deal with at
	 * this time.
	 */
	if (cpuid_getvendor(CPU) == X86_VENDOR_AMD) {
		on_trap_data_t otd;
		uint64_t reg;

		if (!on_trap(&otd, OT_DATA_ACCESS)) {
			reg = rdmsr(MSR_AMD_INT_PENDING_CMP_HALT);
			/* Disable C1E state if it is enabled by BIOS */
			if ((reg >> AMD_ACTONCMPHALT_SHIFT) &
			    AMD_ACTONCMPHALT_MASK) {
				reg &= ~(AMD_ACTONCMPHALT_MASK <<
				    AMD_ACTONCMPHALT_SHIFT);
				wrmsr(MSR_AMD_INT_PENDING_CMP_HALT, reg);
			}
		}
		no_trap();
	}
#endif	/* !__xpv */
}

void
enable_pcid(void)
{
	if (x86_use_pcid == -1)
		x86_use_pcid = is_x86_feature(x86_featureset, X86FSET_PCID);

	if (x86_use_invpcid == -1) {
		x86_use_invpcid = is_x86_feature(x86_featureset,
		    X86FSET_INVPCID);
	}

	if (!x86_use_pcid)
		return;

	/*
	 * Intel say that on setting PCIDE, it immediately starts using the PCID
	 * bits; better make sure there's nothing there.
	 */
	ASSERT((getcr3() & MMU_PAGEOFFSET) == PCID_NONE);

	setcr4(getcr4() | CR4_PCIDE);
}

/*
 * Set up the registers necessary to enable the XSAVE feature on this
 * processor. This function needs to be called early enough that no
 * xsave/xrstor ops will execute on the processor before the MSRs are
 * properly set up.
 *
 * The current implementation makes the following assumptions:
 * - cpuid_pass_basic() is done, so that X86 features are known.
 * - fpu_probe() is done, so that fp_save_mech is chosen.
 */
void
xsave_setup_msr(cpu_t *cpu)
{
	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_BASIC));
	ASSERT(fp_save_mech == FP_XSAVE);
	ASSERT(is_x86_feature(x86_featureset, X86FSET_XSAVE));

	/* Enable OSXSAVE in CR4. */
	setcr4(getcr4() | CR4_OSXSAVE);
	/*
	 * Update SW copy of ECX, so that /dev/cpu/self/cpuid will report
	 * correct value.
	 */
	cpu->cpu_m.mcpu_cpi->cpi_std[1].cp_ecx |= CPUID_INTC_ECX_OSXSAVE;
	setup_xfem();
}

/*
 * Starting with the Westmere processor the local
 * APIC timer will continue running in all C-states,
 * including the deepest C-states.
 */
int
cpuid_arat_supported(void)
{
	struct cpuid_info *cpi;
	struct cpuid_regs regs;

	ASSERT(cpuid_checkpass(CPU, CPUID_PASS_BASIC));
	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	cpi = CPU->cpu_m.mcpu_cpi;

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		/*
		 * Always-running Local APIC Timer is
		 * indicated by CPUID.6.EAX[2].
		 */
		if (cpi->cpi_maxeax >= 6) {
			regs.cp_eax = 6;
			(void) cpuid_insn(NULL, &regs);
			return (regs.cp_eax & CPUID_INTC_EAX_ARAT);
		} else {
			return (0);
		}
	default:
		return (0);
	}
}

/*
 * Check support for Intel ENERGY_PERF_BIAS feature
 */
int
cpuid_iepb_supported(struct cpu *cp)
{
	struct cpuid_info *cpi = cp->cpu_m.mcpu_cpi;
	struct cpuid_regs regs;

	ASSERT(cpuid_checkpass(cp, CPUID_PASS_BASIC));
	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	if (!(is_x86_feature(x86_featureset, X86FSET_MSR))) {
		return (0);
	}

	/*
	 * Intel ENERGY_PERF_BIAS MSR is indicated by
	 * capability bit CPUID.6.ECX.3
	 */
	if ((cpi->cpi_vendor != X86_VENDOR_Intel) || (cpi->cpi_maxeax < 6))
		return (0);

	regs.cp_eax = 0x6;
	(void) cpuid_insn(NULL, &regs);
	return (regs.cp_ecx & CPUID_INTC_ECX_PERFBIAS);
}

/*
 * Check support for TSC deadline timer
 *
 * TSC deadline timer provides a superior software programming
 * model over local APIC timer that eliminates "time drifts".
 * Instead of specifying a relative time, software specifies an
 * absolute time as the target at which the processor should
 * generate a timer event.
 */
int
cpuid_deadline_tsc_supported(void)
{
	struct cpuid_info *cpi = CPU->cpu_m.mcpu_cpi;
	struct cpuid_regs regs;

	ASSERT(cpuid_checkpass(CPU, CPUID_PASS_BASIC));
	ASSERT(is_x86_feature(x86_featureset, X86FSET_CPUID));

	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (cpi->cpi_maxeax >= 1) {
			regs.cp_eax = 1;
			(void) cpuid_insn(NULL, &regs);
			return (regs.cp_ecx & CPUID_DEADLINE_TSC);
		} else {
			return (0);
		}
	default:
		return (0);
	}
}

#if !defined(__xpv)
/*
 * Patch in versions of bcopy for high performance Intel Nhm processors
 * and later...
 */
void
patch_memops(uint_t vendor)
{
	size_t cnt, i;
	caddr_t to, from;

	if ((vendor == X86_VENDOR_Intel) &&
	    is_x86_feature(x86_featureset, X86FSET_SSE4_2)) {
		cnt = &bcopy_patch_end - &bcopy_patch_start;
		to = &bcopy_ck_size;
		from = &bcopy_patch_start;
		for (i = 0; i < cnt; i++) {
			*to++ = *from++;
		}
	}
}
#endif	/* !__xpv */

/*
 * We're being asked to tell the system how many bits are required to represent
 * the various core and strand IDs. While it's tempting to derive this based
 * on the values in cpi_ncore_per_chip and cpi_ncpu_per_chip, that isn't quite
 * correct. Instead, this needs to be based on the number of bits that the APIC
 * allows for these different configurations. We only update these to a larger
 * value if we find one.
 */
void
cpuid_get_ext_topo(cpu_t *cpu, uint_t *core_nbits, uint_t *strand_nbits)
{
	struct cpuid_info *cpi;

	VERIFY(cpuid_checkpass(CPU, CPUID_PASS_BASIC));
	cpi = cpu->cpu_m.mcpu_cpi;

	if (cpi->cpi_ncore_bits > *core_nbits) {
		*core_nbits = cpi->cpi_ncore_bits;
	}

	if (cpi->cpi_nthread_bits > *strand_nbits) {
		*strand_nbits = cpi->cpi_nthread_bits;
	}
}

void
cpuid_pass_ucode(cpu_t *cpu, uchar_t *fset)
{
	struct cpuid_info *cpi = cpu->cpu_m.mcpu_cpi;
	struct cpuid_regs cp;

	/*
	 * Reread the CPUID portions that we need for various security
	 * information.
	 */
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		/*
		 * Check if we now have leaf 7 available to us.
		 */
		if (cpi->cpi_maxeax < 7) {
			bzero(&cp, sizeof (cp));
			cp.cp_eax = 0;
			cpi->cpi_maxeax = __cpuid_insn(&cp);
			if (cpi->cpi_maxeax < 7)
				break;
		}

		bzero(&cp, sizeof (cp));
		cp.cp_eax = 7;
		cp.cp_ecx = 0;
		(void) __cpuid_insn(&cp);
		cpi->cpi_std[7] = cp;
		break;

	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		/* No xcpuid support */
		if (cpi->cpi_family < 5 ||
		    (cpi->cpi_family == 5 && cpi->cpi_model < 1))
			break;

		if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_8) {
			bzero(&cp, sizeof (cp));
			cp.cp_eax = CPUID_LEAF_EXT_0;
			cpi->cpi_xmaxeax = __cpuid_insn(&cp);
			if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_8)
				break;
		}

		/*
		 * Most AMD features are in leaf 8. Automatic IBRS was added in
		 * leaf 0x21. So we also check that.
		 */
		bzero(&cp, sizeof (cp));
		cp.cp_eax = CPUID_LEAF_EXT_8;
		(void) __cpuid_insn(&cp);
		platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_8, &cp);
		cpi->cpi_extd[8] = cp;

		if (cpi->cpi_xmaxeax < CPUID_LEAF_EXT_21)
			break;

		bzero(&cp, sizeof (cp));
		cp.cp_eax = CPUID_LEAF_EXT_21;
		(void) __cpuid_insn(&cp);
		platform_cpuid_mangle(cpi->cpi_vendor, CPUID_LEAF_EXT_21, &cp);
		cpi->cpi_extd[0x21] = cp;
		break;

	default:
		/*
		 * Nothing to do here. Return an empty set which has already
		 * been zeroed for us.
		 */
		return;
	}

	cpuid_scan_security(cpu, fset);
}

/* ARGSUSED */
static int
cpuid_post_ucodeadm_xc(xc_arg_t arg0, xc_arg_t arg1, xc_arg_t arg2)
{
	uchar_t *fset;
	boolean_t first_pass = (boolean_t)arg1;

	fset = (uchar_t *)(arg0 + sizeof (x86_featureset) * CPU->cpu_id);
	if (first_pass && CPU->cpu_id != 0)
		return (0);
	if (!first_pass && CPU->cpu_id == 0)
		return (0);
	cpuid_pass_ucode(CPU, fset);

	return (0);
}

/*
 * After a microcode update where the version has changed, we need to rescan
 * CPUID. To do this we check every CPU to make sure that they have the
 * same microcode. Then we perform a cross call to all such CPUs. It's the
 * caller's job to make sure that no one else can end up doing an update while
 * this is going on.
 *
 * We assume that the system is microcode capable if we're called.
 */
void
cpuid_post_ucodeadm(void)
{
	uint32_t rev;
	int i;
	struct cpu *cpu;
	cpuset_t cpuset;
	void *argdata;
	uchar_t *f0;

	argdata = kmem_zalloc(sizeof (x86_featureset) * NCPU, KM_SLEEP);

	mutex_enter(&cpu_lock);
	cpu = cpu_get(0);
	rev = cpu->cpu_m.mcpu_ucode_info->cui_rev;
	CPUSET_ONLY(cpuset, 0);
	for (i = 1; i < max_ncpus; i++) {
		if ((cpu = cpu_get(i)) == NULL)
			continue;

		if (cpu->cpu_m.mcpu_ucode_info->cui_rev != rev) {
			panic("post microcode update CPU %d has differing "
			    "microcode revision (%u) from CPU 0 (%u)",
			    i, cpu->cpu_m.mcpu_ucode_info->cui_rev, rev);
		}
		CPUSET_ADD(cpuset, i);
	}

	/*
	 * We do the cross calls in two passes. The first pass is only for the
	 * boot CPU. The second pass is for all of the other CPUs. This allows
	 * the boot CPU to go through and change behavior related to patching or
	 * whether or not Enhanced IBRS needs to be enabled and then allow all
	 * other CPUs to follow suit.
	 */
	kpreempt_disable();
	xc_sync((xc_arg_t)argdata, B_TRUE, 0, CPUSET2BV(cpuset),
	    cpuid_post_ucodeadm_xc);
	xc_sync((xc_arg_t)argdata, B_FALSE, 0, CPUSET2BV(cpuset),
	    cpuid_post_ucodeadm_xc);
	kpreempt_enable();

	/*
	 * OK, now look at each CPU and see if their feature sets are equal.
	 */
	f0 = argdata;
	for (i = 1; i < max_ncpus; i++) {
		uchar_t *fset;
		if (!CPU_IN_SET(cpuset, i))
			continue;

		fset = (uchar_t *)((uintptr_t)argdata +
		    sizeof (x86_featureset) * i);

		if (!compare_x86_featureset(f0, fset)) {
			panic("Post microcode update CPU %d has "
			    "differing security feature (%p) set from CPU 0 "
			    "(%p), not appending to feature set", i,
			    (void *)fset, (void *)f0);
		}
	}

	mutex_exit(&cpu_lock);

	for (i = 0; i < NUM_X86_FEATURES; i++) {
		cmn_err(CE_CONT, "?post-ucode x86_feature: %s\n",
		    x86_feature_names[i]);
		if (is_x86_feature(f0, i)) {
			add_x86_feature(x86_featureset, i);
		}
	}
	kmem_free(argdata, sizeof (x86_featureset) * NCPU);
}

typedef void (*cpuid_pass_f)(cpu_t *, void *);

typedef struct cpuid_pass_def {
	cpuid_pass_t cpd_pass;
	cpuid_pass_f cpd_func;
} cpuid_pass_def_t;

/*
 * See block comment at the top; note that cpuid_pass_ucode is not a pass in the
 * normal sense and should not appear here.
 */
static const cpuid_pass_def_t cpuid_pass_defs[] = {
	{ CPUID_PASS_PRELUDE, cpuid_pass_prelude },
	{ CPUID_PASS_IDENT, cpuid_pass_ident },
	{ CPUID_PASS_BASIC, cpuid_pass_basic },
	{ CPUID_PASS_EXTENDED, cpuid_pass_extended },
	{ CPUID_PASS_DYNAMIC, cpuid_pass_dynamic },
	{ CPUID_PASS_RESOLVE, cpuid_pass_resolve },
};

void
cpuid_execpass(cpu_t *cp, cpuid_pass_t pass, void *arg)
{
	VERIFY3S(pass, !=, CPUID_PASS_NONE);

	if (cp == NULL)
		cp = CPU;

	/*
	 * Space statically allocated for BSP, ensure pointer is set
	 */
	if (cp->cpu_id == 0 && cp->cpu_m.mcpu_cpi == NULL)
		cp->cpu_m.mcpu_cpi = &cpuid_info0;

	ASSERT(cpuid_checkpass(cp, pass - 1));

	for (uint_t i = 0; i < ARRAY_SIZE(cpuid_pass_defs); i++) {
		if (cpuid_pass_defs[i].cpd_pass == pass) {
			cpuid_pass_defs[i].cpd_func(cp, arg);
			cp->cpu_m.mcpu_cpi->cpi_pass = pass;
			return;
		}
	}

	panic("unable to execute invalid cpuid pass %d on cpu%d\n",
	    pass, cp->cpu_id);
}

/*
 * Extract the processor family from a chiprev. Processor families are not the
 * same as cpuid families; see comments above and in x86_archext.h.
 */
x86_processor_family_t
chiprev_family(const x86_chiprev_t cr)
{
	return ((x86_processor_family_t)_X86_CHIPREV_FAMILY(cr));
}

/*
 * A chiprev matches its template if the vendor and family are identical and the
 * revision of the chiprev matches one of the bits set in the template. Callers
 * may bitwise-OR together chiprevs of the same vendor and family to form the
 * template, or use the _ANY variant. It is not possible to match chiprevs of
 * multiple vendors or processor families with a single call. Note that this
 * function operates on processor families, not cpuid families.
 */
boolean_t
chiprev_matches(const x86_chiprev_t cr, const x86_chiprev_t template)
{
	return (_X86_CHIPREV_VENDOR(cr) == _X86_CHIPREV_VENDOR(template) &&
	    _X86_CHIPREV_FAMILY(cr) == _X86_CHIPREV_FAMILY(template) &&
	    (_X86_CHIPREV_REV(cr) & _X86_CHIPREV_REV(template)) != 0);
}

/*
 * A chiprev is at least min if the vendor and family are identical and the
 * revision of the chiprev is at least as recent as that of min. Processor
 * families are considered unordered and cannot be compared using this function.
 * Note that this function operates on processor families, not cpuid families.
 * Use of the _ANY chiprev variant with this function is not useful; it will
 * always return B_FALSE if the _ANY variant is supplied as the minimum
 * revision. To determine only whether a chiprev is of a given processor
 * family, test the return value of chiprev_family() instead.
 */
boolean_t
chiprev_at_least(const x86_chiprev_t cr, const x86_chiprev_t min)
{
	return (_X86_CHIPREV_VENDOR(cr) == _X86_CHIPREV_VENDOR(min) &&
	    _X86_CHIPREV_FAMILY(cr) == _X86_CHIPREV_FAMILY(min) &&
	    _X86_CHIPREV_REV(cr) >= _X86_CHIPREV_REV(min));
}

/*
 * The uarch functions operate in a manner similar to the chiprev functions
 * above. While it is tempting to allow these to operate on microarchitectures
 * produced by a specific vendor in an ordered fashion (e.g., ZEN3 is "newer"
 * than ZEN2), we elect not to do so because a manufacturer may supply
 * processors of multiple different microarchitecture families each of which may
 * be internally ordered but unordered with respect to those of other families.
 */
x86_uarch_t
uarchrev_uarch(const x86_uarchrev_t ur)
{
	return ((x86_uarch_t)_X86_UARCHREV_UARCH(ur));
}

boolean_t
uarchrev_matches(const x86_uarchrev_t ur, const x86_uarchrev_t template)
{
	return (_X86_UARCHREV_VENDOR(ur) == _X86_UARCHREV_VENDOR(template) &&
	    _X86_UARCHREV_UARCH(ur) == _X86_UARCHREV_UARCH(template) &&
	    (_X86_UARCHREV_REV(ur) & _X86_UARCHREV_REV(template)) != 0);
}

boolean_t
uarchrev_at_least(const x86_uarchrev_t ur, const x86_uarchrev_t min)
{
	return (_X86_UARCHREV_VENDOR(ur) == _X86_UARCHREV_VENDOR(min) &&
	    _X86_UARCHREV_UARCH(ur) == _X86_UARCHREV_UARCH(min) &&
	    _X86_UARCHREV_REV(ur) >= _X86_UARCHREV_REV(min));
}

/*
 * Topology cache related information. This is yet another cache interface that
 * we're exposing, intended to be used when we have either Intel Leaf 4 or
 * AMD Leaf 8x1D (introduced with Zen 1).
 */
static boolean_t
cpuid_cache_topo_sup(const struct cpuid_info *cpi)
{
	switch (cpi->cpi_vendor) {
	case X86_VENDOR_Intel:
		if (cpi->cpi_maxeax >= 4) {
			return (B_TRUE);
		}
		break;
	case X86_VENDOR_AMD:
	case X86_VENDOR_HYGON:
		if (cpi->cpi_xmaxeax >= CPUID_LEAF_EXT_1d &&
		    is_x86_feature(x86_featureset, X86FSET_TOPOEXT)) {
			return (B_TRUE);
		}
		break;
	default:
		break;
	}

	return (B_FALSE);
}

int
cpuid_getncaches(struct cpu *cpu, uint32_t *ncache)
{
	const struct cpuid_info *cpi;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC));
	cpi = cpu->cpu_m.mcpu_cpi;

	if (!cpuid_cache_topo_sup(cpi)) {
		return (ENOTSUP);
	}

	*ncache = cpi->cpi_cache_leaf_size;
	return (0);
}

int
cpuid_getcache(struct cpu *cpu, uint32_t cno, x86_cache_t *cache)
{
	const struct cpuid_info *cpi;
	const struct cpuid_regs *cp;

	ASSERT(cpuid_checkpass(cpu, CPUID_PASS_DYNAMIC));
	cpi = cpu->cpu_m.mcpu_cpi;

	if (!cpuid_cache_topo_sup(cpi)) {
		return (ENOTSUP);
	}

	if (cno >= cpi->cpi_cache_leaf_size) {
		return (EINVAL);
	}

	bzero(cache, sizeof (*cache));
	cp = cpi->cpi_cache_leaves[cno];
	switch (CPI_CACHE_TYPE(cp)) {
	case CPI_CACHE_TYPE_DATA:
		cache->xc_type = X86_CACHE_TYPE_DATA;
		break;
	case CPI_CACHE_TYPE_INSTR:
		cache->xc_type = X86_CACHE_TYPE_INST;
		break;
	case CPI_CACHE_TYPE_UNIFIED:
		cache->xc_type = X86_CACHE_TYPE_UNIFIED;
		break;
	case CPI_CACHE_TYPE_DONE:
	default:
		return (EINVAL);
	}
	cache->xc_level = CPI_CACHE_LVL(cp);
	if (CPI_FULL_ASSOC_CACHE(cp) != 0) {
		cache->xc_flags |= X86_CACHE_F_FULL_ASSOC;
	}
	cache->xc_nparts = CPI_CACHE_PARTS(cp) + 1;
	/*
	 * The number of sets is reserved on AMD if the cache is tagged as fully
	 * associative, whereas it is considered valid on Intel.
	 */
	if (cpi->cpi_vendor == X86_VENDOR_AMD &&
	    CPI_FULL_ASSOC_CACHE(cp) != 0) {
		cache->xc_nsets = 1;
	} else {
		cache->xc_nsets = CPI_CACHE_SETS(cp) + 1;
	}
	cache->xc_nways = CPI_CACHE_WAYS(cp) + 1;
	cache->xc_line_size = CPI_CACHE_COH_LN_SZ(cp) + 1;
	cache->xc_size = cache->xc_nparts * cache->xc_nsets * cache->xc_nways *
	    cache->xc_line_size;
	/*
	 * We're looking for the number of bits to cover the number of CPUs that
	 * are being shared. Normally this would be the value - 1, but the CPUID
	 * value is encoded as the actual value minus one, so we don't modify
	 * this at all.
	 */
	cache->xc_apic_shift = highbit(CPI_NTHR_SHR_CACHE(cp));

	/*
	 * To construct a unique ID we construct a uint64_t that looks as
	 * follows:
	 *
	 * [47:40] cache level
	 * [39:32] CPUID cache type
	 * [31:00] shifted APIC ID
	 *
	 * The shifted APIC ID gives us a guarantee that a given cache entry is
	 * unique within its peers. The other two numbers give us something that
	 * ensures that something is unique within the CPU. If we just had the
	 * APIC ID shifted over by the indicated number of bits we'd end up with
	 * an ID of zero for the L1I, L1D, L2, and L3.
	 *
	 * The format of this ID is private to the system and can change across
	 * a reboot for the time being.
	 */
	cache->xc_id = (uint64_t)cache->xc_level << 40;
	cache->xc_id |= (uint64_t)cache->xc_type << 32;
	cache->xc_id |= (uint64_t)cpi->cpi_apicid >> cache->xc_apic_shift;

	return (0);
}