.. SPDX-License-Identifier: GPL-2.0

======================================================================
Hardware Feedback Interface For Hetero Core Scheduling On AMD Platform
======================================================================

:Copyright: 2025 Advanced Micro Devices, Inc. All Rights Reserved.

:Author: Perry Yuan <perry.yuan@amd.com>
:Author: Mario Limonciello <mario.limonciello@amd.com>

Overview
--------

AMD heterogeneous core implementations comprise more than one architectural
class, and their CPUs are composed of cores with differing efficiency and
power capabilities: performance-oriented *classic cores* and power-efficient
*dense cores*. As such, power management strategies must be designed to
accommodate the complexities introduced by incorporating different core types.
Heterogeneous systems can also extend to more than two architectural classes.
The purpose of the scheduling feedback mechanism is to provide
information to the operating system scheduler in real time such that the
scheduler can direct threads to the optimal core.

The goal of AMD's heterogeneous architecture is to attain a power benefit by
sending background threads to the dense cores while sending high-priority
threads to the classic cores.
From a performance perspective, sending
background threads to dense cores can free up power headroom and allow the
classic cores to optimally service demanding threads. Furthermore, the
area-optimized nature of the dense cores allows for a larger number of
physical cores. This improved core density has a positive impact on
multithreaded performance.

AMD Heterogeneous Core Driver
-----------------------------

The ``amd_hfi`` driver provides the operating system with performance and
energy efficiency capability data for each CPU in the system. The scheduler
can use the ranking data from the HFI driver to make task placement decisions.

Thread Classification and Ranking Table Interaction
----------------------------------------------------

The thread classification is used to index into a ranking table that
describes an efficiency and performance ranking for each classification.

Threads are classified during runtime into enumerated classes. The classes
represent thread performance/power characteristics that may benefit from
special scheduling behaviors. The table below depicts an example of thread
classification and a preference for where a given thread should be scheduled
based on its thread class. The real-time thread classification is consumed
by the operating system and is used to inform the scheduler of where the
thread should be placed.

Thread Classification Example Table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+----------+----------------+-------------------------------+---------------------+---------+
| class ID | Classification | Preferred scheduling behavior | Preemption priority | Counter |
+----------+----------------+-------------------------------+---------------------+---------+
| 0        | Default        | Performant                    | Highest             |         |
+----------+----------------+-------------------------------+---------------------+---------+
| 1        | Non-scalable   | Efficient                     | Lowest              | PMCx1A1 |
+----------+----------------+-------------------------------+---------------------+---------+
| 2        | I/O bound      | Efficient                     | Lowest              | PMCx044 |
+----------+----------------+-------------------------------+---------------------+---------+

Thread classification is performed by the hardware each time that the thread is switched out.
Threads that don't meet any hardware specified criteria are classified as "default".

AMD Hardware Feedback Interface
--------------------------------

The Hardware Feedback Interface provides to the operating system information
about the performance and energy efficiency of each CPU in the system. Each
capability is given as a unit-less quantity in the range [0-255]. A higher
performance value indicates higher performance capability, and a higher
efficiency value indicates more efficiency.
Energy efficiency and performance
are reported in separate capabilities in the shared memory based ranking table.

These capabilities may change at runtime as a result of changes in the
operating conditions of the system or the action of external factors.
Power Management firmware is responsible for detecting events that require
a reordering of the performance and efficiency ranking. Table updates happen
relatively infrequently and occur on the time scale of seconds or more.

The following events trigger a table update:

* Thermal Stress Events
* Silent Compute
* Extreme Low Battery Scenarios

The kernel or a userspace policy daemon can use these capabilities to modify
task placement decisions. For instance, if either the performance or energy
efficiency capability of a given logical processor becomes zero, it is an
indication that the hardware recommends that the operating system not schedule
any tasks on that processor for performance or energy efficiency reasons,
respectively.

Implementation details for Linux
--------------------------------

The implementation of thread scheduling consists of the following steps:

1. A thread is spawned and scheduled to the ideal core using the default
   heterogeneous scheduling policy.
2. The processor profiles thread execution and assigns an enumerated
   classification ID.
   This classification is communicated to the OS via a logical processor
   scope MSR.
3. During the thread context switch out, the operating system consumes the
   workload (WL) classification which resides in a logical processor scope MSR.
4. The OS triggers the hardware to clear its history by writing to an MSR,
   after consuming the WL classification and before switching in the new thread.
5. If due to the classification, ranking table, and processor availability,
   the thread is not on its ideal processor, the OS will then consider
   scheduling the thread on its ideal processor (if available).

Ranking Table
-------------

The ranking table is a shared memory region that is used to communicate the
performance and energy efficiency capabilities of each CPU in the system.

The ranking table design includes rankings for each APIC ID in the system and
rankings both for performance and efficiency for each workload classification.

.. kernel-doc:: drivers/platform/x86/amd/hfi/hfi.c
   :doc: amd_shmem_info

Ranking Table update
--------------------

The power management firmware issues a platform interrupt after it has updated
the ranking table and the table is ready for the operating system to consume.
CPUs receive
this interrupt and read the new ranking table from the shared memory provided
by the PCCT table. The ``amd_hfi`` driver then parses the new table to provide
updated data for scheduling decisions.
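
To illustrate how per-class rankings of the kind described in this document
could feed a placement decision, the following is a minimal sketch in plain C.
It is not the ``amd_hfi`` implementation and does not reflect the real shared
memory layout: ``struct sketch_cpu_rank``, ``SKETCH_NR_CLASSES`` and
``sketch_pick_cpu()`` are hypothetical names introduced only for this example.
It assumes one performance and one efficiency value in the range [0-255] per
CPU and per workload class, and it treats a ranking of zero as "do not place
tasks here", following the recommendation described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: none of these names come from the amd_hfi driver. */
#define SKETCH_NR_CLASSES 3 /* assumed: the three classes of the example table */

/* Per-CPU rankings, one entry per workload class, each in the range 0..255. */
struct sketch_cpu_rank {
	uint8_t perf[SKETCH_NR_CLASSES];
	uint8_t eff[SKETCH_NR_CLASSES];
};

/*
 * Return the index of the CPU with the highest ranking for class_id, using
 * the efficiency ranking when prefer_efficiency is non-zero and the
 * performance ranking otherwise. CPUs whose ranking is zero are never
 * selected. Returns -1 if class_id is out of range or no CPU qualifies.
 * On a tie, the lowest CPU index wins.
 */
static int sketch_pick_cpu(const struct sketch_cpu_rank *table, size_t nr_cpus,
			   unsigned int class_id, int prefer_efficiency)
{
	uint8_t best_rank = 0;
	int best = -1;
	size_t i;

	if (class_id >= SKETCH_NR_CLASSES)
		return -1;

	for (i = 0; i < nr_cpus; i++) {
		uint8_t rank = prefer_efficiency ? table[i].eff[class_id]
						 : table[i].perf[class_id];

		if (rank > best_rank) {
			best_rank = rank;
			best = (int)i;
		}
	}

	return best;
}
```

With the example classification table, a caller might pass
``prefer_efficiency = 0`` for class 0 ("Performant") and
``prefer_efficiency = 1`` for classes 1 and 2 ("Efficient"). How the real
scheduler weighs these rankings against load and processor availability is
outside the scope of this sketch.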