Using TopDown metrics in user space
-----------------------------------

Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
methodology to break down CPU pipeline execution into 4 bottlenecks:
frontend bound, backend bound, bad speculation, retiring.

For more details on TopDown see [1][5]

Traditionally this was implemented by events in generic counters
and specific formulas to compute the bottlenecks.

perf stat --topdown implements this.

Full TopDown includes more levels that can break down the
bottlenecks further. This is not directly implemented in perf,
but available in other tools that can run on top of perf,
such as toplev[2] or vtune[3]

New TopDown features in Ice Lake
================================

With Ice Lake CPUs the TopDown metrics are directly available as
fixed counters and do not require generic counters. This allows
TopDown to be collected at all times, in addition to other events.

% perf stat -a --topdown -I1000
#           time             retiring      bad speculation       frontend bound        backend bound
     1.001281330                23.0%                15.3%                29.6%                32.1%
     2.003009005                 5.0%                 6.8%                46.6%                41.6%
     3.004646182                 6.7%                 6.7%                46.0%                40.6%
     4.006326375                 5.0%                 6.4%                47.6%                41.0%
     5.007991804                 5.1%                 6.3%                46.3%                42.3%
     6.009626773                 6.2%                 7.1%                47.3%                39.3%
     7.011296356                 4.7%                 6.7%                46.2%                42.4%
     8.012951831                 4.7%                 6.7%                47.5%                41.1%
...

This also enables measuring TopDown per thread/process instead
of only per core.

Using TopDown through RDPMC in applications on Ice Lake
=======================================================

For more fine grained measurements it can be useful to
access the new metrics directly from user space. This is more complicated,
but drastically lowers overhead.

On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
"pipeline SLOTS" (cycles multiplied by core issue width) and a
metric register that reports slots ratios for the different bottleneck
categories.

The metrics counter is CPU model specific and is not available on older
CPUs.

Example code
============

Library functions that implement the functionality described below
are also available in libjevents [4]

The application opens a group with fixed counter 3 (SLOTS) and any
metric event, and allows user programs to read the performance counters.

Fixed counter 3 is mapped to a pseudo event event=0x00, umask=0x04,
so the perf_event_attr structure should be initialized with
{ .config = 0x0400, .type = PERF_TYPE_RAW }
The metric events are mapped to the pseudo event event=0x00, umask=0x8X.
For example, the perf_event_attr structure can be initialized with
{ .config = 0x8000, .type = PERF_TYPE_RAW } for the Retiring metric event.
Fixed counter 3 must be the leader of the group.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Provide own perf_event_open stub because glibc doesn't */
__attribute__((weak))
int perf_event_open(struct perf_event_attr *attr, pid_t pid,
		    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open slots counter file descriptor for current task. */
struct perf_event_attr slots = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x400,
	.exclude_kernel = 1,
};

int slots_fd = perf_event_open(&slots, 0, -1, -1, 0);
if (slots_fd < 0)
	... error ...

/*
 * Open metrics event file descriptor for current task.
 * Set slots event as the leader of the group.
 */
struct perf_event_attr metrics = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x8000,
	.exclude_kernel = 1,
};

int metrics_fd = perf_event_open(&metrics, 0, -1, slots_fd, 0);
if (metrics_fd < 0)
	... error ...
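
The config values used above follow from the pseudo-event encoding, where
the raw config is (umask << 8) | event. The following is only a sketch of
that arithmetic. The helper and macro names are hypothetical, and it assumes
that the X in umask=0x8X is the index of the metric field (0 for Retiring,
which matches the 0x8000 example), something this text does not spell out.

```c
#include <stdint.h>

/* Pseudo-event encoding from the text: config = (umask << 8) | event. */
#define TOPDOWN_EVENT		0x00
#define TOPDOWN_UMASK_SLOTS	0x04
/* Assumption: umask=0x8X, where X is the metric field index. */
#define TOPDOWN_UMASK_METRIC	0x80

/* Hypothetical helper: raw config for the SLOTS pseudo event. */
static inline uint64_t topdown_slots_config(void)
{
	return ((uint64_t)TOPDOWN_UMASK_SLOTS << 8) | TOPDOWN_EVENT;
}

/* Hypothetical helper: raw config for metric field idx (0 = Retiring). */
static inline uint64_t topdown_metric_config(unsigned int idx)
{
	return ((uint64_t)(TOPDOWN_UMASK_METRIC + idx) << 8) | TOPDOWN_EVENT;
}
```

With these helpers, topdown_slots_config() yields 0x0400 and
topdown_metric_config(0) yields 0x8000, the two values used in the
perf_event_attr initializers.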


The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
to read slots and the topdown metrics at different points of the program:

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
#define RDPMC_METRIC	(1 << 29)	/* return metric counters */

#define FIXED_COUNTER_SLOTS		3
#define METRIC_COUNTER_TOPDOWN_L1_L2	0

static inline uint64_t read_slots(void)
{
	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
}

static inline uint64_t read_metrics(void)
{
	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1_L2);
}

Then the program can be instrumented to read these metrics at different
points.

It's not a good idea to do this with too short code regions,
as the parallelism and overlap in the CPU program execution will
cause too much measurement inaccuracy. For example instrumenting
individual basic blocks is definitely too fine grained.

Decoding metrics values
=======================

The value reported by read_metrics() contains four 8-bit fields,
each holding a scaled ratio for one of the Level 1 bottlenecks.
All four fields add up to 0xff (= 100%)

The binary ratios in the metric value can be converted to float ratios:

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* L1 Topdown metric events */
#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)

/*
 * L2 Topdown metric events.
 * Available on Sapphire Rapids and later platforms.
 */
#define TOPDOWN_HEAVY_OPS(val)		((float)GET_METRIC(val, 4) / 0xff)
#define TOPDOWN_BR_MISPREDICT(val)	((float)GET_METRIC(val, 5) / 0xff)
#define TOPDOWN_FETCH_LAT(val)		((float)GET_METRIC(val, 6) / 0xff)
#define TOPDOWN_MEM_BOUND(val)		((float)GET_METRIC(val, 7) / 0xff)

and then converted to percent for printing.

The ratios in the metric accumulate for the time when the counter
is enabled. For measuring programs it is often useful to measure
specific sections. This requires taking deltas of the metrics.

This can be done by scaling the metrics with the slots counter
read at the same time.
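
As a minimal sketch of this scaling step, the helper below turns one
(metrics, slots) reading into estimated slot counts per Level 1 category.
The struct and function names are hypothetical and not part of any library.
It divides out the 0xff scale immediately, so the results are plain slot
counts.

```c
#include <stdint.h>

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* Hypothetical container: estimated slots per Level 1 category. */
struct topdown_l1 {
	double retiring;
	double bad_spec;
	double fe_bound;
	double be_bound;
};

/* Scale each 8-bit ratio (0..0xff) by the slots count read alongside it. */
static inline struct topdown_l1 topdown_scale(uint64_t metrics, uint64_t slots)
{
	struct topdown_l1 s = {
		.retiring = (double)GET_METRIC(metrics, 0) * slots / 0xff,
		.bad_spec = (double)GET_METRIC(metrics, 1) * slots / 0xff,
		.fe_bound = (double)GET_METRIC(metrics, 2) * slots / 0xff,
		.be_bound = (double)GET_METRIC(metrics, 3) * slots / 0xff,
	};

	return s;
}
```

For example, a metrics value of 0x66333333 read together with 2550 slots
gives 510 slots each for retiring, bad speculation and frontend bound,
and 1020 slots for backend bound.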

Then it's possible to take deltas of these slots counts
measured at different points, and determine the metrics
for that time period.

	slots_a = read_slots();
	metric_a = read_metrics();

	... larger code region ...

	slots_b = read_slots();
	metric_b = read_metrics();

	/* compute scaled metrics for measurement a */
	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a;
	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a;
	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a;
	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a;

	/* compute delta scaled metrics between b and a */
	retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a;
	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a;
	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a;
	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a;

Later the individual ratios of L1 metric events for the measurement period can
be recreated from these counts. GET_METRIC() returns the raw 8-bit field, so
the scaled counts still carry a factor of 0xff, which is divided out here.

	slots_delta = slots_b - slots_a;
	retiring_ratio = (float)retiring_slots / slots_delta / 0xff;
	bad_spec_ratio = (float)bad_spec_slots / slots_delta / 0xff;
	fe_bound_ratio = (float)fe_bound_slots / slots_delta / 0xff;
	be_bound_ratio = (float)be_bound_slots / slots_delta / 0xff;

	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
		retiring_ratio * 100.,
		bad_spec_ratio * 100.,
		fe_bound_ratio * 100.,
		be_bound_ratio * 100.);

The individual ratios of L2 metric events for the measurement period can be
recreated from L1 and L2 metric counters. (Available on Sapphire Rapids and
later platforms)

	/* compute scaled metrics for measurement a */
	heavy_ops_slots_a = GET_METRIC(metric_a, 4) * slots_a;
	br_mispredict_slots_a = GET_METRIC(metric_a, 5) * slots_a;
	fetch_lat_slots_a = GET_METRIC(metric_a, 6) * slots_a;
	mem_bound_slots_a = GET_METRIC(metric_a, 7) * slots_a;

	/* compute delta scaled metrics between b and a */
	heavy_ops_slots = GET_METRIC(metric_b, 4) * slots_b - heavy_ops_slots_a;
	br_mispredict_slots = GET_METRIC(metric_b, 5) * slots_b - br_mispredict_slots_a;
	fetch_lat_slots = GET_METRIC(metric_b, 6) * slots_b - fetch_lat_slots_a;
	mem_bound_slots = GET_METRIC(metric_b, 7) * slots_b - mem_bound_slots_a;

	/* as for L1, divide out the 0xff scale of the raw fields */
	slots_delta = slots_b - slots_a;
	heavy_ops_ratio = (float)heavy_ops_slots / slots_delta / 0xff;
	light_ops_ratio = retiring_ratio - heavy_ops_ratio;

	br_mispredict_ratio = (float)br_mispredict_slots / slots_delta / 0xff;
	machine_clears_ratio = bad_spec_ratio - br_mispredict_ratio;

	fetch_lat_ratio = (float)fetch_lat_slots / slots_delta / 0xff;
	fetch_bw_ratio = fe_bound_ratio - fetch_lat_ratio;

	mem_bound_ratio = (float)mem_bound_slots / slots_delta / 0xff;
	core_bound_ratio = be_bound_ratio - mem_bound_ratio;

	printf("Heavy Operations %.2f%% Light Operations %.2f%% "
	       "Branch Mispredict %.2f%% Machine Clears %.2f%% "
	       "Fetch Latency %.2f%% Fetch Bandwidth %.2f%% "
	       "Mem Bound %.2f%% Core Bound %.2f%%\n",
		heavy_ops_ratio * 100.,
		light_ops_ratio * 100.,
		br_mispredict_ratio * 100.,
		machine_clears_ratio * 100.,
		fetch_lat_ratio * 100.,
		fetch_bw_ratio * 100.,
		mem_bound_ratio * 100.,
		core_bound_ratio * 100.);

Resetting metrics counters
==========================

Since the individual metrics are only 8 bits wide, they lose precision for
short regions over time, because a fixed 8-bit fraction has to cover an
ever larger number of slots. So the counters need to be reset regularly.

When using the kernel perf API the kernel resets on every read.
So as long as the reading is at reasonable intervals (every few
seconds) the precision is good.
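
The precision loss described above can be estimated with a little
arithmetic: one step of an 8-bit field stands for 1/0xff of all slots
counted since the last reset. The sketch below uses hypothetical helper
names and only models this granularity. It makes no claim about how the
hardware rounds the fields.

```c
#include <stdint.h>

/* Slots represented by one step of an 8-bit metric field. */
static inline double topdown_step_slots(uint64_t slots_since_reset)
{
	return (double)slots_since_reset / 0xff;
}

/* The same step, expressed as a fraction of a region of interest. */
static inline double topdown_step_fraction(uint64_t slots_since_reset,
					   uint64_t region_slots)
{
	return topdown_step_slots(slots_since_reset) / (double)region_slots;
}
```

With 255000 slots accumulated since the last reset, one step is 1000 slots,
so a region of 8000 slots can only be resolved in steps of 12.5%.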

When using perf stat it is recommended to always use the -I option,
with an interval no longer than a few seconds

	perf stat -I 1000 --topdown ...

For user programs using RDPMC directly the counter can
be reset explicitly using ioctl:

	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);

This "opens" a new measurement period.

A program using RDPMC for TopDown should schedule such a reset
regularly, as in every few seconds.

Limits on Ice Lake
==================

Four pseudo TopDown metric events are exposed for the end-users,
topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
They can be used to collect the TopDown value under the following
rules:
- All the TopDown metric events must be in a group with the SLOTS event.
- The SLOTS event must be the leader of the group.
- The PERF_FORMAT_GROUP flag must be applied for each TopDown metric
  event

The SLOTS event and the TopDown metric events can be counting members of
a sampling read group. Since the SLOTS event must be the leader of a TopDown
group, the second event of the group is the sampling event.
For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'

Extension on Sapphire Rapids Server
===================================
The metrics counter is extended to support TMA method level 2 metrics.
The lower half of the register is the TMA level 1 metrics (legacy).
The upper half is also divided into four 8-bit fields for the new level 2
metrics. Four more TopDown metric events are exposed for the end-users,
topdown-heavy-ops, topdown-br-mispredict, topdown-fetch-lat and
topdown-mem-bound.

Each of the new level 2 metrics in the upper half is a subset of the
corresponding level 1 metric in the lower half. Software can deduce the
other four level 2 metrics by subtracting corresponding metrics as below.

	Light_Operations = Retiring - Heavy_Operations
	Machine_Clears = Bad_Speculation - Branch_Mispredicts
	Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
	Core_Bound = Backend_Bound - Memory_Bound


[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
[5] https://sites.google.com/site/analysismethods/yasin-pubs