Using TopDown metrics
---------------------

TopDown metrics break apart performance bottlenecks. Starting at level
1 it is typical to get metrics on retiring, bad speculation, frontend
bound, and backend bound. Higher levels provide more detail into the
level 1 bottlenecks, such as at level 2: core bound, memory bound,
heavy operations, light operations, branch mispredicts, machine
clears, fetch latency and fetch bandwidth. For more details see [1][2][3].

perf stat --topdown implements this using available metrics that vary
per architecture.

% perf stat -a --topdown -I1000
#           time %  tma_retiring %  tma_backend_bound %  tma_frontend_bound %  tma_bad_speculation
     1.001141351            11.5                 34.9                  46.9                    6.7
     2.006141972            13.4                 28.1                  50.4                    8.1
     3.010162040            12.9                 28.1                  51.1                    8.0
     4.014009311            12.5                 28.6                  51.8                    7.2
     5.017838554            11.8                 33.0                  48.0                    7.2
     5.704818971            14.0                 27.5                  51.3                    7.3
...

New TopDown features in Intel Ice Lake
======================================

With Ice Lake CPUs the TopDown metrics are directly available as
fixed counters and do not require generic counters. This allows
TopDown to always be collected in addition to other events.

Using TopDown through RDPMC in applications on Intel Ice Lake
=============================================================

For more fine grained measurements it can be useful to
access the new counters directly from user space. This is more complicated,
but drastically lowers overhead.

On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
"pipeline SLOTS" (cycles multiplied by core issue width) and a
metric register that reports slots ratios for the different bottleneck
categories.

The metrics counter is CPU model specific and is not available on older
CPUs.

Example code
============

Library functions that implement the functionality described below
are also available in libjevents [4].

The application opens a group with fixed counter 3 (SLOTS) and any
metric event, and allows user programs to read the performance counters.

Fixed counter 3 is mapped to a pseudo event event=0x00, umask=0x04,
so the perf_event_attr structure should be initialized with
{ .config = 0x0400, .type = PERF_TYPE_RAW }
The metric events are mapped to the pseudo event event=0x00, umask=0x8X.
For example, the perf_event_attr structure can be initialized with
{ .config = 0x8000, .type = PERF_TYPE_RAW } for the Retiring metric event.
Fixed counter 3 must be the leader of the group.
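
The config values above follow the usual Intel raw event layout, with the
event select in bits 0-7 and the umask in bits 8-15. A minimal sketch of that
encoding is shown here. RAW_CONFIG and topdown_metric_config() are
hypothetical helper names, and reading X in umask=0x8X as the index of the
metric field (0 for Retiring) is an assumption extrapolated from the Retiring
example.

```c
#include <assert.h>
#include <stdint.h>

/* Intel raw encoding: event select in bits 0-7, umask in bits 8-15. */
#define RAW_CONFIG(event, umask) \
	((uint64_t)(event) | ((uint64_t)(umask) << 8))

/* The two values quoted in the text, checked at compile time. */
static_assert(RAW_CONFIG(0x00, 0x04) == 0x0400, "SLOTS pseudo event");
static_assert(RAW_CONFIG(0x00, 0x80) == 0x8000, "Retiring pseudo event");

/*
 * Hypothetical helper: config for metric field idx, ASSUMING that X in
 * umask=0x8X is the field index (0 = Retiring).
 */
static inline uint64_t topdown_metric_config(unsigned int idx)
{
	assert(idx < 16);	/* X is a single hex digit */
	return RAW_CONFIG(0x00, 0x80 | idx);
}
```

With such a macro the SLOTS initializer could be written as
.config = RAW_CONFIG(0x00, 0x04) instead of the literal 0x0400.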

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Provide own perf_event_open stub because glibc doesn't */
__attribute__((weak))
int perf_event_open(struct perf_event_attr *attr, pid_t pid,
		    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open slots counter file descriptor for current task. */
struct perf_event_attr slots = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x400,
	.exclude_kernel = 1,
};

int slots_fd = perf_event_open(&slots, 0, -1, -1, 0);
if (slots_fd < 0)
	... error ...

/* Memory mapping the fd permits _rdpmc calls from userspace */
void *slots_p = mmap(0, getpagesize(), PROT_READ, MAP_SHARED, slots_fd, 0);
if (slots_p == MAP_FAILED)	/* mmap reports failure as MAP_FAILED, not NULL */
	... error ...

/*
 * Open metrics event file descriptor for current task.
 * Set slots event as the leader of the group.
 */
struct perf_event_attr metrics = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x8000,
	.exclude_kernel = 1,
};

int metrics_fd = perf_event_open(&metrics, 0, -1, slots_fd, 0);
if (metrics_fd < 0)
	... error ...

/* Memory mapping the fd permits _rdpmc calls from userspace */
void *metrics_p = mmap(0, getpagesize(), PROT_READ, MAP_SHARED, metrics_fd, 0);
if (metrics_p == MAP_FAILED)
	... error ...

Note: the file descriptors returned by the perf_event_open calls must be memory
mapped to permit use of the RDPMC instruction. Permission may also be granted
by writing the /sys/devices/cpu/rdpmc sysfs node.

The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
to read slots and the topdown metrics at different points of the program:

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
#define RDPMC_METRIC	(1 << 29)	/* return metric counters */

#define FIXED_COUNTER_SLOTS		3
#define METRIC_COUNTER_TOPDOWN_L1_L2	0

static inline uint64_t read_slots(void)
{
	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
}

static inline uint64_t read_metrics(void)
{
	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1_L2);
}

Then the program can be instrumented to read these metrics at different
points.

It's not a good idea to do this with too short code regions,
as the parallelism and overlap in the CPU program execution will
cause too much measurement inaccuracy. For example, instrumenting
individual basic blocks is definitely too fine grained.

_rdpmc calls should not be mixed with reading the metrics and slots counters
through system calls, as the kernel will reset these counters after each system
call.

Decoding metrics values
=======================

The value reported by read_metrics() contains four 8-bit fields,
each a scaled ratio for one of the Level 1 bottleneck categories.
All four fields add up to 0xff (= 100%).

The binary ratios in the metric value can be converted to float ratios:

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* L1 Topdown metric events */
#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)

/*
 * L2 Topdown metric events.
 * Available on Sapphire Rapids and later platforms.
 */
#define TOPDOWN_HEAVY_OPS(val)		((float)GET_METRIC(val, 4) / 0xff)
#define TOPDOWN_BR_MISPREDICT(val)	((float)GET_METRIC(val, 5) / 0xff)
#define TOPDOWN_FETCH_LAT(val)		((float)GET_METRIC(val, 6) / 0xff)
#define TOPDOWN_MEM_BOUND(val)		((float)GET_METRIC(val, 7) / 0xff)

and then converted to percent for printing.

The ratios in the metric accumulate for the time when the counter
is enabled. For measuring programs it is often useful to measure
specific sections. This requires taking deltas of the metrics.
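
A small worked example may help show why the 8-bit fields cannot simply be
subtracted. Suppose Retiring reads 0x66 (40%) after 1000 slots and 0x99 (60%)
after 2000 slots. Neither 60% nor 60% - 40% = 20% describes the second 1000
slots: 400 of the first 1000 slots and 1200 of all 2000 slots retired, so the
region itself retired 800 of its 1000 slots, or 80%. The sketch below
reproduces that arithmetic; region_ratio() is a hypothetical helper name and
the numbers are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/*
 * Ratio of one metric field over the region between snapshots a and b.
 * Each 8-bit field is first weighted by the slots count read with it,
 * then the difference is normalized by 0xff and by the region's slots.
 */
static double region_ratio(uint64_t metric_a, uint64_t slots_a,
			   uint64_t metric_b, uint64_t slots_b,
			   unsigned int field)
{
	assert(field < 8);
	assert(slots_b > slots_a);

	double scaled_a = (double)GET_METRIC(metric_a, field) * (double)slots_a;
	double scaled_b = (double)GET_METRIC(metric_b, field) * (double)slots_b;

	return (scaled_b - scaled_a) / 0xff / (double)(slots_b - slots_a);
}
```

Calling region_ratio(0x66, 1000, 0x99, 2000, 0) gives the 80% derived above,
so each field has to be weighted by its slots count before the difference is
taken.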

This can be done by scaling the metrics with the slots counter
read at the same time.

Then it's possible to take deltas of these slots counts
measured at different points, and determine the metrics
for that time period.

	slots_a = read_slots();
	metric_a = read_metrics();

	... larger code region ...

	slots_b = read_slots();
	metric_b = read_metrics();

	/* compute scaled metrics for measurement a */
	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a;
	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a;
	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a;
	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a;

	/* compute delta scaled metrics between b and a */
	retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a;
	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a;
	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a;
	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a;

Later the individual ratios of L1 metric events for the measurement period can
be recreated from these counts.

	/*
	 * The scaled counts still carry the 0xff factor of the 8-bit
	 * fields, so divide it out together with the slots delta.
	 */
	slots_delta = slots_b - slots_a;
	retiring_ratio = (float)retiring_slots / 0xff / slots_delta;
	bad_spec_ratio = (float)bad_spec_slots / 0xff / slots_delta;
	fe_bound_ratio = (float)fe_bound_slots / 0xff / slots_delta;
	be_bound_ratio = (float)be_bound_slots / 0xff / slots_delta;

	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
		retiring_ratio * 100.,
		bad_spec_ratio * 100.,
		fe_bound_ratio * 100.,
		be_bound_ratio * 100.);

The individual ratios of L2 metric events for the measurement period can be
recreated from L1 and L2 metric counters. (Available on Sapphire Rapids and
later platforms)

	/* compute scaled metrics for measurement a */
	heavy_ops_slots_a = GET_METRIC(metric_a, 4) * slots_a;
	br_mispredict_slots_a = GET_METRIC(metric_a, 5) * slots_a;
	fetch_lat_slots_a = GET_METRIC(metric_a, 6) * slots_a;
	mem_bound_slots_a = GET_METRIC(metric_a, 7) * slots_a;

	/* compute delta scaled metrics between b and a */
	heavy_ops_slots = GET_METRIC(metric_b, 4) * slots_b - heavy_ops_slots_a;
	br_mispredict_slots = GET_METRIC(metric_b, 5) * slots_b - br_mispredict_slots_a;
	fetch_lat_slots = GET_METRIC(metric_b, 6) * slots_b - fetch_lat_slots_a;
	mem_bound_slots = GET_METRIC(metric_b, 7) * slots_b - mem_bound_slots_a;

	/* each remaining L2 metric is the L1 parent minus the measured L2 part */
	slots_delta = slots_b - slots_a;
	heavy_ops_ratio = (float)heavy_ops_slots / 0xff / slots_delta;
	light_ops_ratio = retiring_ratio - heavy_ops_ratio;

	br_mispredict_ratio = (float)br_mispredict_slots / 0xff / slots_delta;
	machine_clears_ratio = bad_spec_ratio - br_mispredict_ratio;

	fetch_lat_ratio = (float)fetch_lat_slots / 0xff / slots_delta;
	fetch_bw_ratio = fe_bound_ratio - fetch_lat_ratio;

	mem_bound_ratio = (float)mem_bound_slots / 0xff / slots_delta;
	core_bound_ratio = be_bound_ratio - mem_bound_ratio;

	printf("Heavy Operations %.2f%% Light Operations %.2f%% "
	       "Branch Mispredict %.2f%% Machine Clears %.2f%% "
	       "Fetch Latency %.2f%% Fetch Bandwidth %.2f%% "
	       "Mem Bound %.2f%% Core Bound %.2f%%\n",
		heavy_ops_ratio * 100.,
		light_ops_ratio * 100.,
		br_mispredict_ratio * 100.,
		machine_clears_ratio * 100.,
		fetch_lat_ratio * 100.,
		fetch_bw_ratio * 100.,
		mem_bound_ratio * 100.,
		core_bound_ratio * 100.);

Resetting metrics counters
==========================

Since the individual metrics are only 8 bits wide, they lose precision for
short regions over time, because each step of the fraction covers more and
more cycles as the counters accumulate. So the counters need to be reset
regularly.

When using the kernel perf API the kernel resets on every read.
So as long as the reading is at reasonable intervals (every few
seconds) the precision is good.
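
The effect of the accumulation time can be estimated with simple arithmetic:
one step of an 8-bit field (0xff = 100%) stands for slots / 0xff accumulated
slots. The sketch below computes that step size. The helper names are
hypothetical, and the 3 GHz clock with 4 slots per cycle used in the note
after the code are illustrative assumptions, not properties of a particular
CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Slots represented by one step of an 8-bit metric field (0xff = 100%). */
static inline uint64_t slots_per_metric_step(uint64_t slots)
{
	return slots / 0xff;
}

/* Shortest region, in seconds, that can move a field by one full step. */
static inline double min_resolvable_seconds(uint64_t slots,
					    double slots_per_sec)
{
	assert(slots_per_sec > 0.0);
	return (double)slots_per_metric_step(slots) / slots_per_sec;
}
```

Under those assumptions the core produces 1.2e10 slots per second. After 10
seconds without a reset, 1.2e11 slots have accumulated and one step
corresponds to about 4.7e8 slots, or roughly 39 ms of execution, so shorter
regions can no longer be resolved. After 1 second the step is only about
3.9 ms.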

When using perf stat it is recommended to always use the -I option,
with an interval no longer than a few seconds:

	perf stat -I 1000 --topdown ...

For user programs using RDPMC directly the counter can
be reset explicitly using ioctl:

	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);

This "opens" a new measurement period.

A program using RDPMC for TopDown should schedule such a reset
regularly, for example every few seconds.

Limits on Intel Ice Lake
========================

Four pseudo TopDown metric events are exposed to end users:
topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
They can be used to collect the TopDown value under the following
rules:
- All the TopDown metric events must be in a group with the SLOTS event.
- The SLOTS event must be the leader of the group.
- The PERF_FORMAT_GROUP flag must be applied to each TopDown metric
  event.

The SLOTS event and the TopDown metric events can be counting members of
a sampling read group. Since the SLOTS event must be the leader of a TopDown
group, the second event of the group is the sampling event.
For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'

Extension on Intel Sapphire Rapids Server
=========================================

The metrics counter is extended to support TMA method level 2 metrics.
The lower half of the register is the TMA level 1 metrics (legacy).
The upper half is also divided into four 8-bit fields for the new level 2
metrics. Four more TopDown metric events are exposed to end users:
topdown-heavy-ops, topdown-br-mispredict, topdown-fetch-lat and
topdown-mem-bound.

Each of the new level 2 metrics in the upper half is a subset of the
corresponding level 1 metric in the lower half. Software can deduce the
other four level 2 metrics by subtracting corresponding metrics as below.

	Light_Operations = Retiring - Heavy_Operations
	Machine_Clears = Bad_Speculation - Branch_Mispredicts
	Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
	Core_Bound = Backend_Bound - Memory_Bound

TPEBS in TopDown
================

TPEBS (Timed PEBS) is one of the new Intel PMU features provided since the
Granite Rapids microarchitecture. The TPEBS feature adds a 16-bit
retire_latency field to the Basic Info group of the PEBS record. It records
the core cycles from the retirement of the previous instruction to the
retirement of the current instruction.
Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
Programming Reference" for more details about this feature. Because this feature
extends the PEBS record, sampling with the weight option is required to get the
retire_latency value.

	perf record -e event_name -W ...

In the most recent release of TMA, some metric formulas use event
retire_latency values on processors that support the TPEBS feature.
For previous generations that do not support TPEBS, the values are static and
predefined per processor family by the hardware architects. Due to the diversity
of workloads in execution environments, retire_latency values measured at run
time are more accurate. Therefore, new TMA metrics that use TPEBS provide
more accurate performance analysis results.

To support TPEBS in TMA metrics, a new event modifier :R is added. perf
captures the retire_latency value of the required events (events with :R in
the metric formula) with perf record, and uses that value in the metric
calculation. Currently, this feature is supported through perf stat:

	perf stat -M metric_name --record-tpebs ...


[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
[2] https://sites.google.com/site/analysismethods/yasin-pubs
[3] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis
[4] https://github.com/andikleen/pmu-tools/tree/master/jevents