Using TopDown metrics in user space
-----------------------------------

Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
methodology to break down CPU pipeline execution into 4 bottlenecks:
frontend bound, backend bound, bad speculation, retiring.

For more details on TopDown see [1][5].

Traditionally this was implemented by events in generic counters
and specific formulas to compute the bottlenecks.

perf stat --topdown implements this.

Full TopDown includes more levels that can break down the
bottlenecks further. This is not directly implemented in perf,
but is available in other tools that can run on top of perf,
such as toplev[2] or vtune[3].

New TopDown features in Ice Lake
================================

With Ice Lake CPUs the TopDown metrics are directly available as
fixed counters and do not require generic counters. This allows
TopDown to be collected at all times, in addition to other events.

% perf stat -a --topdown -I1000
#           time             retiring      bad speculation       frontend bound        backend bound
     1.001281330                23.0%                15.3%                29.6%                32.1%
     2.003009005                 5.0%                 6.8%                46.6%                41.6%
     3.004646182                 6.7%                 6.7%                46.0%                40.6%
     4.006326375                 5.0%                 6.4%                47.6%                41.0%
     5.007991804                 5.1%                 6.3%                46.3%                42.3%
     6.009626773                 6.2%                 7.1%                47.3%                39.3%
     7.011296356                 4.7%                 6.7%                46.2%                42.4%
     8.012951831                 4.7%                 6.7%                47.5%                41.1%
...

This also enables measuring TopDown per thread/process instead
of only per core.

Using TopDown through RDPMC in applications on Ice Lake
=======================================================

For more fine-grained measurements it can be useful to
access the new counters directly from user space. This is more complicated,
but drastically lowers overhead.

On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
"pipeline SLOTS" (cycles multiplied by core issue width) and a
metric register that reports slots ratios for the different bottleneck
categories.

The metrics counter is CPU model specific and is not available on older
CPUs.

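Because the metrics counter is model specific, a program may want to
check for it before opening the events. One possible check, sketched
below, looks for the "slots" event that the kernel exports in sysfs on
CPUs with this support. The helper name topdown_sysfs_present() and the
default path are assumptions made for illustration; the PMU directory
name can differ between systems (for example on hybrid CPUs).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Assumed default location of the "slots" pseudo event.
 * The PMU directory name can differ between systems.
 */
#define SLOTS_EVENT_PATH "/sys/bus/event_source/devices/cpu/events/slots"

/* Return 1 if the given sysfs event file is readable, 0 otherwise. */
static int topdown_sysfs_present(const char *path)
{
	assert(path != NULL);
	return access(path, R_OK) == 0;
}
```

A caller could fall back to the generic counter formulas when
topdown_sysfs_present(SLOTS_EVENT_PATH) returns 0.
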
Example code
============

Library functions that implement the functionality described below
are also available in libjevents [4].

The application opens a group with fixed counter 3 (SLOTS) and any
metric event, and allows user programs to read the performance counters.

Fixed counter 3 is mapped to a pseudo event event=0x00, umask=0x04,
so the perf_event_attr structure should be initialized with
{ .config = 0x0400, .type = PERF_TYPE_RAW }
The metric events are mapped to the pseudo event event=0x00, umask=0x8X.
For example, the perf_event_attr structure can be initialized with
{ .config = 0x8000, .type = PERF_TYPE_RAW } for the Retiring metric event.
Fixed counter 3 must be the leader of the group.

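The raw encodings described above place the umask in bits 8-15 of the
config value, with the event code 0x00 in the low byte. A minimal sketch
of helpers that build these values follows. The names
TOPDOWN_SLOTS_CONFIG and topdown_metric_config() are hypothetical, and
only the Retiring encoding (X = 0) is taken from the text; other values
of X are assumed to follow the same pattern.

```c
#include <assert.h>
#include <stdint.h>

#define TOPDOWN_SLOTS_CONFIG	0x0400ULL	/* event=0x00, umask=0x04 */

/*
 * Build the raw config for a metric pseudo event (event=0x00, umask=0x8X).
 * idx is the X digit; 0 gives the Retiring encoding 0x8000 from the text.
 * Values of idx above 0 follow the same pattern by assumption.
 */
static inline uint64_t topdown_metric_config(unsigned int idx)
{
	assert(idx < 16);
	return (uint64_t)(0x80 + idx) << 8;
}
```

The result can be assigned to the .config member of a perf_event_attr
whose .type is PERF_TYPE_RAW.
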
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Provide own perf_event_open stub because glibc doesn't */
__attribute__((weak))
int perf_event_open(struct perf_event_attr *attr, pid_t pid,
		    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open slots counter file descriptor for current task. */
struct perf_event_attr slots = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x400,
	.exclude_kernel = 1,
};

int slots_fd = perf_event_open(&slots, 0, -1, -1, 0);
if (slots_fd < 0)
	... error ...

/*
 * Open metrics event file descriptor for current task.
 * Set slots event as the leader of the group.
 */
struct perf_event_attr metrics = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x8000,
	.exclude_kernel = 1,
};

int metrics_fd = perf_event_open(&metrics, 0, -1, slots_fd, 0);
if (metrics_fd < 0)
	... error ...

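On x86, the kernel's default policy generally lets a task execute RDPMC
only while it has a perf event memory mapped, so programs typically map
the first page of the event after opening it. The following is a minimal
sketch under that assumption. map_event_page() is a hypothetical helper
name, and applying it to the slots file descriptor is an assumption
rather than something stated above.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first page (struct perf_event_mmap_page) of a perf event
 * read-only. Returns the mapping, or NULL on failure.
 */
static void *map_event_page(int fd)
{
	void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ,
		       MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

A caller would check the result for NULL and keep the mapping alive for
as long as it intends to use RDPMC.
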
The RDPMC instruction (or the _rdpmc compiler intrinsic) can now be used
to read slots and the TopDown metrics at different points of the program:

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
#define RDPMC_METRIC	(1 << 29)	/* return metric counters */

#define FIXED_COUNTER_SLOTS		3
#define METRIC_COUNTER_TOPDOWN_L1_L2	0

static inline uint64_t read_slots(void)
{
	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
}

static inline uint64_t read_metrics(void)
{
	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1_L2);
}

Then the program can be instrumented to read these metrics at different
points.

Avoid doing this for code regions that are too short, as the
parallelism and overlap in the CPU program execution will
cause too much measurement inaccuracy. For example, instrumenting
individual basic blocks is definitely too fine-grained.

Decoding metrics values
=======================

The value reported by read_metrics() contains four 8-bit fields,
each holding a scaled ratio for one Level 1 bottleneck.
All four fields add up to 0xff (= 100%).

The binary ratios in the metric value can be converted to float ratios:

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* L1 Topdown metric events */
#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)

/*
 * L2 Topdown metric events.
 * Available on Sapphire Rapids and later platforms.
 */
#define TOPDOWN_HEAVY_OPS(val)		((float)GET_METRIC(val, 4) / 0xff)
#define TOPDOWN_BR_MISPREDICT(val)	((float)GET_METRIC(val, 5) / 0xff)
#define TOPDOWN_FETCH_LAT(val)		((float)GET_METRIC(val, 6) / 0xff)
#define TOPDOWN_MEM_BOUND(val)		((float)GET_METRIC(val, 7) / 0xff)

and then converted to percent for printing.

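As a worked illustration of the decoding, the sketch below extracts one
field and scales it to percent, then prints the four Level 1 fields. The
function names topdown_field_percent() and print_l1() are hypothetical
and the input value used in the note after the code is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Extract 8-bit field i from the metrics value and scale it to percent. */
static double topdown_field_percent(uint64_t metrics, unsigned int i)
{
	assert(i < 8);
	return (double)((metrics >> (i * 8)) & 0xff) * 100.0 / 0xff;
}

/* Print the four Level 1 fields of a raw metrics value. */
static void print_l1(uint64_t metrics)
{
	printf("Retiring %.1f%% Bad Speculation %.1f%% "
	       "FE Bound %.1f%% BE Bound %.1f%%\n",
	       topdown_field_percent(metrics, 0),
	       topdown_field_percent(metrics, 1),
	       topdown_field_percent(metrics, 2),
	       topdown_field_percent(metrics, 3));
}
```

For the made-up value 0x66333333 the fields are 0x33, 0x33, 0x33 and
0x66, which add up to 0xff, so print_l1() reports 20.0% for the first
three categories and 40.0% for BE Bound.
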
The ratios in the metric accumulate for the time when the counter
is enabled. For measuring programs it is often useful to measure
specific sections. This requires computing deltas of the metrics.

This can be done by scaling the metrics with the slots counter
read at the same time.

Then it's possible to take deltas of these slots counts
measured at different points, and determine the metrics
for that time period.

	slots_a = read_slots();
	metric_a = read_metrics();

	... larger code region ...

	slots_b = read_slots();
	metric_b = read_metrics();

	/* compute scaled metrics for measurement a (fields are fractions of 0xff) */
	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a / 0xff;
	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a / 0xff;
	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a / 0xff;
	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a / 0xff;

	/* compute delta scaled metrics between b and a */
	retiring_slots = GET_METRIC(metric_b, 0) * slots_b / 0xff - retiring_slots_a;
	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b / 0xff - bad_spec_slots_a;
	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b / 0xff - fe_bound_slots_a;
	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b / 0xff - be_bound_slots_a;

Later the individual ratios of L1 metric events for the measurement period can
be recreated from these counts.

	slots_delta = slots_b - slots_a;
	retiring_ratio = (float)retiring_slots / slots_delta;
	bad_spec_ratio = (float)bad_spec_slots / slots_delta;
	fe_bound_ratio = (float)fe_bound_slots / slots_delta;
	be_bound_ratio = (float)be_bound_slots / slots_delta;

	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
		retiring_ratio * 100.,
		bad_spec_ratio * 100.,
		fe_bound_ratio * 100.,
		be_bound_ratio * 100.);

The individual ratios of L2 metric events for the measurement period can be
recreated from L1 and L2 metric counters. (Available on Sapphire Rapids and
later platforms)

	/* compute scaled metrics for measurement a (fields are fractions of 0xff) */
	heavy_ops_slots_a = GET_METRIC(metric_a, 4) * slots_a / 0xff;
	br_mispredict_slots_a = GET_METRIC(metric_a, 5) * slots_a / 0xff;
	fetch_lat_slots_a = GET_METRIC(metric_a, 6) * slots_a / 0xff;
	mem_bound_slots_a = GET_METRIC(metric_a, 7) * slots_a / 0xff;

	/* compute delta scaled metrics between b and a */
	heavy_ops_slots = GET_METRIC(metric_b, 4) * slots_b / 0xff - heavy_ops_slots_a;
	br_mispredict_slots = GET_METRIC(metric_b, 5) * slots_b / 0xff - br_mispredict_slots_a;
	fetch_lat_slots = GET_METRIC(metric_b, 6) * slots_b / 0xff - fetch_lat_slots_a;
	mem_bound_slots = GET_METRIC(metric_b, 7) * slots_b / 0xff - mem_bound_slots_a;

	slots_delta = slots_b - slots_a;
	heavy_ops_ratio = (float)heavy_ops_slots / slots_delta;
	light_ops_ratio = retiring_ratio - heavy_ops_ratio;

	br_mispredict_ratio = (float)br_mispredict_slots / slots_delta;
	machine_clears_ratio = bad_spec_ratio - br_mispredict_ratio;

	fetch_lat_ratio = (float)fetch_lat_slots / slots_delta;
	fetch_bw_ratio = fe_bound_ratio - fetch_lat_ratio;

	mem_bound_ratio = (float)mem_bound_slots / slots_delta;
	core_bound_ratio = be_bound_ratio - mem_bound_ratio;

	printf("Heavy Operations %.2f%% Light Operations %.2f%% "
	       "Branch Mispredict %.2f%% Machine Clears %.2f%% "
	       "Fetch Latency %.2f%% Fetch Bandwidth %.2f%% "
	       "Mem Bound %.2f%% Core Bound %.2f%%\n",
		heavy_ops_ratio * 100.,
		light_ops_ratio * 100.,
		br_mispredict_ratio * 100.,
		machine_clears_ratio * 100.,
		fetch_lat_ratio * 100.,
		fetch_bw_ratio * 100.,
		mem_bound_ratio * 100.,
		core_bound_ratio * 100.);

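The delta computation can also be packaged as a single function. The
following is a minimal sketch, not the libjevents implementation:
struct topdown_sample and topdown_delta_ratio() are hypothetical names,
and the code assumes that both samples come from the same measurement
period and that the field index follows the layout used by GET_METRIC.

```c
#include <assert.h>
#include <stdint.h>

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

struct topdown_sample {
	uint64_t slots;		/* value of read_slots() */
	uint64_t metrics;	/* value of read_metrics() */
};

/*
 * Ratio (0.0 .. 1.0) of metric field i over the interval between
 * samples a and b, where b was taken after a.
 */
static double topdown_delta_ratio(const struct topdown_sample *a,
				  const struct topdown_sample *b,
				  unsigned int i)
{
	double slots_a, slots_b;

	assert(i < 8);
	assert(b->slots > a->slots);

	/* Scale each 8-bit fraction by the slots count read with it. */
	slots_a = (double)GET_METRIC(a->metrics, i) / 0xff * (double)a->slots;
	slots_b = (double)GET_METRIC(b->metrics, i) / 0xff * (double)b->slots;

	return (slots_b - slots_a) / (double)(b->slots - a->slots);
}
```

With made-up samples { 1000, 0x000000ff } and { 3000, 0xaa000055 }, all
1000 slots of the first sample were retiring, and the second sample still
attributes about 1000 of its 3000 slots to retiring. The retiring ratio
for the interval is therefore about 0, and the backend bound ratio
(field 3) is about 1.
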
Resetting metrics counters
==========================

Since the individual metrics are only 8 bits wide, they lose precision for
short regions over time, because each step of the 8-bit fraction covers
more and more slots as the counters accumulate. So the counters need to
be reset regularly.

When using the kernel perf API the kernel resets on every read.
So as long as the reading is at reasonable intervals (every few
seconds) the precision is good.

When using perf stat it is recommended to always use the -I option,
with an interval no longer than a few seconds:

	perf stat -I 1000 --topdown ...

For user programs using RDPMC directly the counter can
be reset explicitly using ioctl:

	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);

This "opens" a new measurement period.

A program using RDPMC for TopDown should schedule such a reset
regularly, for example every few seconds.

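One way to schedule the reset is to check the elapsed time at each
measurement point. The sketch below wraps the ioctl shown above and adds
a small timing check. topdown_reset() and topdown_reset_due() are
hypothetical names, and which file descriptor to pass (for example the
group leader) is left as an assumption for the caller.

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/*
 * Reset the counter behind perf_fd to start a new measurement period.
 * Returns 0 on success and -1 on failure, as ioctl() does.
 */
static int topdown_reset(int perf_fd)
{
	return ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
}

/*
 * Return 1 when at least interval_sec seconds have passed since
 * last_reset_sec, meaning a reset is due. Both times are expected to
 * come from the same monotonic clock.
 */
static int topdown_reset_due(long now_sec, long last_reset_sec,
			     long interval_sec)
{
	return now_sec - last_reset_sec >= interval_sec;
}
```

Samples taken before a reset should not be subtracted from samples taken
after it, since the counts restart.
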
Limits on Ice Lake
==================

Four pseudo TopDown metric events are exposed to end users:
topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
They can be used to collect the TopDown value under the following
rules:
- All the TopDown metric events must be in a group with the SLOTS event.
- The SLOTS event must be the leader of the group.
- The PERF_FORMAT_GROUP flag must be applied to each TopDown metric
  event.

The SLOTS event and the TopDown metric events can be counting members of
a sampling read group. Since the SLOTS event must be the leader of a TopDown
group, the second event of the group is the sampling event.
For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'

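When a group is opened with read_format set to PERF_FORMAT_GROUP and no
other format flags, read() on the leader returns a count followed by one
value per event, leader first. The sketch below interprets such a buffer.
struct group_read and group_member_ratio() are hypothetical names, and
the code assumes that the kernel reports each metric event as a number
of slots, so that dividing by the SLOTS value gives the ratio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Layout returned by read() on a group leader opened with
 * read_format = PERF_FORMAT_GROUP and no other format flags:
 * a count followed by one value per event, leader first.
 */
struct group_read {
	uint64_t nr;
	uint64_t values[5];	/* slots + up to four L1 metric events */
};

/*
 * Ratio of group member idx (1-based, 0 is the SLOTS leader) to SLOTS.
 * Assumes the kernel reports each metric event as a slots count.
 */
static double group_member_ratio(const struct group_read *g, size_t idx)
{
	assert(g->nr <= 5);
	assert(idx >= 1 && idx < g->nr);
	assert(g->values[0] != 0);
	return (double)g->values[idx] / (double)g->values[0];
}
```

The buffer would be filled with a call such as
read(slots_fd, &g, sizeof(g)) on the group leader.
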
Extension on Sapphire Rapids Server
===================================

The metrics counter is extended to support TMA method level 2 metrics.
The lower half of the register is the TMA level 1 metrics (legacy).
The upper half is also divided into four 8-bit fields for the new level 2
metrics. Four more TopDown metric events are exposed to end users:
topdown-heavy-ops, topdown-br-mispredict, topdown-fetch-lat and
topdown-mem-bound.

Each of the new level 2 metrics in the upper half is a subset of the
corresponding level 1 metric in the lower half. Software can deduce the
other four level 2 metrics by subtracting corresponding metrics as below.

    Light_Operations = Retiring - Heavy_Operations
    Machine_Clears = Bad_Speculation - Branch_Mispredicts
    Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
    Core_Bound = Backend_Bound - Memory_Bound

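The subtractions above pair each level 1 field with the level 2 field
four positions higher in the register. A minimal sketch of that pairing
on a raw metrics value follows; metric_fraction() and
derived_l2_fraction() are hypothetical names, and the result describes
the whole period since the last reset rather than a delta.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction (0.0 .. 1.0) held in 8-bit field i of the metrics value. */
static double metric_fraction(uint64_t metrics, unsigned int i)
{
	assert(i < 8);
	return (double)((metrics >> (i * 8)) & 0xff) / 0xff;
}

/*
 * Derived level 2 metric: level 1 field l1 (0..3) minus its measured
 * level 2 subset, which sits in field l1 + 4 of the upper half.
 */
static double derived_l2_fraction(uint64_t metrics, unsigned int l1)
{
	assert(l1 < 4);
	return metric_fraction(metrics, l1) - metric_fraction(metrics, l1 + 4);
}
```

For example, derived_l2_fraction(val, 0) corresponds to Light_Operations
and derived_l2_fraction(val, 3) to Core_Bound.
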
[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
[5] https://sites.google.com/site/analysismethods/yasin-pubs