Using TopDown metrics in user space
-----------------------------------

Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
methodology to break down CPU pipeline execution into 4 bottlenecks:
frontend bound, backend bound, bad speculation, retiring.

For more details on TopDown see [1][5].

Traditionally this was implemented by events in generic counters
and specific formulas to compute the bottlenecks.

perf stat --topdown implements this.

Full TopDown includes more levels that can break down the
bottlenecks further. This is not directly implemented in perf,
but is available in other tools that can run on top of perf,
such as toplev[2] or VTune[3].

New TopDown features in Ice Lake
================================

With Ice Lake CPUs the TopDown metrics are directly available as
fixed counters and do not require generic counters. This allows
TopDown to be collected at all times, in addition to other events.

% perf stat -a --topdown -I1000
#           time             retiring      bad speculation       frontend bound        backend bound
     1.001281330                23.0%                15.3%                29.6%                32.1%
     2.003009005                 5.0%                 6.8%                46.6%                41.6%
     3.004646182                 6.7%                 6.7%                46.0%                40.6%
     4.006326375                 5.0%                 6.4%                47.6%                41.0%
     5.007991804                 5.1%                 6.3%                46.3%                42.3%
     6.009626773                 6.2%                 7.1%                47.3%                39.3%
     7.011296356                 4.7%                 6.7%                46.2%                42.4%
     8.012951831                 4.7%                 6.7%                47.5%                41.1%
...

This also enables measuring TopDown per thread/process instead
of only per core.

Using TopDown through RDPMC in applications on Ice Lake
=======================================================

For more fine-grained measurements it can be useful to
access the new counters directly from user space. This is more
complicated, but drastically lowers overhead.

On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
"pipeline SLOTS" (cycles multiplied by core issue width) and a
metric register that reports slots ratios for the different bottleneck
categories.

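As a rough illustration of the SLOTS definition above, the sketch below
converts a cycle count into pipeline slots. The helper name
slots_from_cycles() and its issue_width parameter are hypothetical; the
issue width is CPU model specific and has to be supplied by the caller.

```c
#include <stdint.h>

/* Hypothetical helper: pipeline slots = cycles * core issue width. */
static inline uint64_t slots_from_cycles(uint64_t cycles,
					 unsigned int issue_width)
{
	return cycles * issue_width;
}
```

For example, 1000 cycles on a core that issues 4 slots per cycle
correspond to 4000 slots.
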
The metrics counter is CPU model specific and is not available on older
CPUs.

Example code
============

Library functions that implement the functionality described below
are also available in libjevents [4].

The application opens a group with fixed counter 3 (SLOTS) and any
metric event, and allows user programs to read the performance counters.

Fixed counter 3 is mapped to the pseudo event event=0x00, umask=0x04,
so the perf_event_attr structure should be initialized with
{ .config = 0x0400, .type = PERF_TYPE_RAW }.
The metric events are mapped to the pseudo event event=0x00, umask=0x8X.
For example, the perf_event_attr structure can be initialized with
{ .config = 0x8000, .type = PERF_TYPE_RAW } for the Retiring metric event.
Fixed counter 3 must be the leader of the group.

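The two .config values quoted above follow from placing the event code
in the low byte and the umask in the next byte. The sketch below only
spells out that arithmetic; raw_config() and the three macro names are
hypothetical and are not part of any kernel header.

```c
#include <stdint.h>

/* Hypothetical names for the pseudo-event encodings quoted in the text. */
#define TOPDOWN_PSEUDO_EVENT	0x00
#define UMASK_SLOTS		0x04	/* fixed counter 3 (SLOTS) */
#define UMASK_RETIRING		0x80	/* umask=0x8X with X = 0 */

/* Raw config layout: event code in bits 0-7, umask in bits 8-15. */
static inline uint64_t raw_config(uint8_t event, uint8_t umask)
{
	return ((uint64_t)umask << 8) | event;
}
```

With these definitions, raw_config(TOPDOWN_PSEUDO_EVENT, UMASK_SLOTS)
evaluates to 0x0400 and raw_config(TOPDOWN_PSEUDO_EVENT, UMASK_RETIRING)
to 0x8000.
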
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Provide own perf_event_open stub because glibc doesn't */
__attribute__((weak))
int perf_event_open(struct perf_event_attr *attr, pid_t pid,
		    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open slots counter file descriptor for current task. */
struct perf_event_attr slots = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x400,
	.exclude_kernel = 1,
};

int slots_fd = perf_event_open(&slots, 0, -1, -1, 0);
if (slots_fd < 0)
	... error ...

/*
 * Open metrics event file descriptor for current task.
 * Set slots event as the leader of the group.
 */
struct perf_event_attr metrics = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x8000,
	.exclude_kernel = 1,
};

int metrics_fd = perf_event_open(&metrics, 0, -1, slots_fd, 0);
if (metrics_fd < 0)
	... error ...

The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
to read slots and the TopDown metrics at different points of the program:

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
#define RDPMC_METRIC	(1 << 29)	/* return metric counters */

#define FIXED_COUNTER_SLOTS		3
#define METRIC_COUNTER_TOPDOWN_L1_L2	0

static inline uint64_t read_slots(void)
{
	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
}

static inline uint64_t read_metrics(void)
{
	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1_L2);
}

Then the program can be instrumented to read these metrics at different
points.

It's not a good idea to do this with code regions that are too short,
as the parallelism and overlap in the CPU program execution will
cause too much measurement inaccuracy. For example, instrumenting
individual basic blocks is definitely too fine-grained.

Decoding metrics values
=======================

The value reported by read_metrics() contains four 8-bit fields,
each a scaled ratio for one of the Level 1 bottleneck categories.
All four fields add up to 0xff (= 100%).

The binary ratios in the metric value can be converted to float ratios:

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* L1 Topdown metric events */
#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)

/*
 * L2 Topdown metric events.
 * Available on Sapphire Rapids and later platforms.
 */
#define TOPDOWN_HEAVY_OPS(val)		((float)GET_METRIC(val, 4) / 0xff)
#define TOPDOWN_BR_MISPREDICT(val)	((float)GET_METRIC(val, 5) / 0xff)
#define TOPDOWN_FETCH_LAT(val)		((float)GET_METRIC(val, 6) / 0xff)
#define TOPDOWN_MEM_BOUND(val)		((float)GET_METRIC(val, 7) / 0xff)

and then converted to percent for printing.

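The decoding described above can be sketched as a small function that
unpacks all four Level 1 fields at once. The struct topdown_l1 type and
the decode_l1() helper are hypothetical names used for illustration; the
field order is the one given by the macros above.

```c
#include <stdint.h>

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* Level 1 ratios from one metrics value, each in the range 0.0 to 1.0. */
struct topdown_l1 {
	float retiring;
	float bad_spec;
	float fe_bound;
	float be_bound;
};

/* Hypothetical helper: split the low 32 bits into four float ratios. */
static inline struct topdown_l1 decode_l1(uint64_t metrics)
{
	struct topdown_l1 r = {
		.retiring = (float)GET_METRIC(metrics, 0) / 0xff,
		.bad_spec = (float)GET_METRIC(metrics, 1) / 0xff,
		.fe_bound = (float)GET_METRIC(metrics, 2) / 0xff,
		.be_bound = (float)GET_METRIC(metrics, 3) / 0xff,
	};

	return r;
}
```

For a metrics value of 0x66333333 this gives 20% each for retiring, bad
speculation and frontend bound (0x33 / 0xff) and 40% for backend bound
(0x66 / 0xff).
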
The ratios in the metric accumulate for the time when the counter
is enabled. For measuring programs it is often useful to measure
specific sections. This requires computing deltas on the metrics.

This can be done by scaling the metrics with the slots counter
read at the same time.

Then it's possible to take deltas of these slots counts
measured at different points, and determine the metrics
for that time period.

	slots_a = read_slots();
	metric_a = read_metrics();

	... larger code region ...

	slots_b = read_slots();
	metric_b = read_metrics();

	/* compute scaled metrics for measurement a (units of 1/0xff slot) */
	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a;
	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a;
	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a;
	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a;

	/* compute delta scaled metrics between b and a */
	retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a;
	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a;
	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a;
	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a;

Later the individual ratios of L1 metric events for the measurement period can
be recreated from these counts. The scaled counts still carry the 0xff factor
of the 8-bit fields, so it is divided out here.

	slots_delta = slots_b - slots_a;
	retiring_ratio = (float)retiring_slots / slots_delta / 0xff;
	bad_spec_ratio = (float)bad_spec_slots / slots_delta / 0xff;
	fe_bound_ratio = (float)fe_bound_slots / slots_delta / 0xff;
	be_bound_ratio = (float)be_bound_slots / slots_delta / 0xff;

	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
		retiring_ratio * 100.,
		bad_spec_ratio * 100.,
		fe_bound_ratio * 100.,
		be_bound_ratio * 100.);

The individual ratios of L2 metric events for the measurement period can be
recreated from L1 and L2 metric counters. (Available on Sapphire Rapids and
later platforms.)

	/* compute scaled metrics for measurement a (units of 1/0xff slot) */
	heavy_ops_slots_a = GET_METRIC(metric_a, 4) * slots_a;
	br_mispredict_slots_a = GET_METRIC(metric_a, 5) * slots_a;
	fetch_lat_slots_a = GET_METRIC(metric_a, 6) * slots_a;
	mem_bound_slots_a = GET_METRIC(metric_a, 7) * slots_a;

	/* compute delta scaled metrics between b and a */
	heavy_ops_slots = GET_METRIC(metric_b, 4) * slots_b - heavy_ops_slots_a;
	br_mispredict_slots = GET_METRIC(metric_b, 5) * slots_b - br_mispredict_slots_a;
	fetch_lat_slots = GET_METRIC(metric_b, 6) * slots_b - fetch_lat_slots_a;
	mem_bound_slots = GET_METRIC(metric_b, 7) * slots_b - mem_bound_slots_a;

	slots_delta = slots_b - slots_a;
	heavy_ops_ratio = (float)heavy_ops_slots / slots_delta / 0xff;
	light_ops_ratio = retiring_ratio - heavy_ops_ratio;

	br_mispredict_ratio = (float)br_mispredict_slots / slots_delta / 0xff;
	machine_clears_ratio = bad_spec_ratio - br_mispredict_ratio;

	fetch_lat_ratio = (float)fetch_lat_slots / slots_delta / 0xff;
	fetch_bw_ratio = fe_bound_ratio - fetch_lat_ratio;

	mem_bound_ratio = (float)mem_bound_slots / slots_delta / 0xff;
	core_bound_ratio = be_bound_ratio - mem_bound_ratio;

	printf("Heavy Operations %.2f%% Light Operations %.2f%% "
	       "Branch Mispredict %.2f%% Machine Clears %.2f%% "
	       "Fetch Latency %.2f%% Fetch Bandwidth %.2f%% "
	       "Mem Bound %.2f%% Core Bound %.2f%%\n",
		heavy_ops_ratio * 100.,
		light_ops_ratio * 100.,
		br_mispredict_ratio * 100.,
		machine_clears_ratio * 100.,
		fetch_lat_ratio * 100.,
		fetch_bw_ratio * 100.,
		mem_bound_ratio * 100.,
		core_bound_ratio * 100.);

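The delta computation above can be condensed into one helper that works
for any of the eight fields. This is a minimal sketch: the struct
topdown_sample type and both function names are hypothetical, and the
guard for a counter reset between the two samples is an added
assumption, not something the text prescribes.

```c
#include <stdint.h>

#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)

/* One sample: SLOTS and the metrics register read back to back. */
struct topdown_sample {
	uint64_t slots;
	uint64_t metrics;
};

/*
 * Ratio (0.0 to 1.0) of metric field i for the period between a and b.
 * Each field is scaled by the slots count of its own sample, so the
 * difference is in units of 1/0xff slot and 0xff is divided out.
 */
static double topdown_delta_ratio(const struct topdown_sample *a,
				  const struct topdown_sample *b, int i)
{
	uint64_t scaled_a = GET_METRIC(a->metrics, i) * a->slots;
	uint64_t scaled_b = GET_METRIC(b->metrics, i) * b->slots;
	uint64_t slots_delta = b->slots - a->slots;

	/* Empty period, or the counters were reset between the samples. */
	if (slots_delta == 0 || scaled_b < scaled_a)
		return 0.0;

	return (double)(scaled_b - scaled_a) / ((double)slots_delta * 0xff);
}

/*
 * Derived L2 ratio: the part of L1 field l1_field (0..3) that is not
 * covered by its L2 field (l1_field + 4), e.g. Light Operations =
 * Retiring - Heavy Operations for l1_field == 0.
 */
static double topdown_l2_remainder(const struct topdown_sample *a,
				   const struct topdown_sample *b,
				   int l1_field)
{
	return topdown_delta_ratio(a, b, l1_field) -
	       topdown_delta_ratio(a, b, l1_field + 4);
}
```

As a worked case, take sample a with 1000 slots and a retiring field of
0x33 (20%, 200 slots) and sample b with 3000 slots and a retiring field
of 0x66 (40%, 1200 slots). The period between them retired 1000 of 2000
slots, so topdown_delta_ratio() returns 0.5 for field 0.
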
Resetting metrics counters
==========================

Since the individual metrics are only 8 bits wide, they lose precision
for short regions over time: as slots accumulate, each step of a field
covers a growing number of cycles, so a short region no longer moves
the field. So the counters need to be reset regularly.

When using the kernel perf API the kernel resets on every read.
So as long as the reading is at reasonable intervals (every few
seconds) the precision is good.

When using perf stat it is recommended to always use the -I option,
with an interval no longer than a few seconds:

	perf stat -I 1000 --topdown ...

For user programs using RDPMC directly the counter can
be reset explicitly using ioctl:

	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);

This "opens" a new measurement period.

A program using RDPMC for TopDown should schedule such a reset
regularly, as in every few seconds.

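To get a feeling for why regular resets matter, the sketch below
estimates how many slots a single step of an 8-bit field stands for
after a given number of accumulated slots. The helper name is
hypothetical, and the calculation is only the arithmetic implied by the
0xff scaling of the fields as read back; it assumes nothing about any
extra precision the hardware may keep internally.

```c
#include <stdint.h>

/* Hypothetical helper: slots represented by one 1/0xff step of a field. */
static inline uint64_t slots_per_metric_step(uint64_t accumulated_slots)
{
	return accumulated_slots / 0xff;
}
```

After 255000 accumulated slots one step corresponds to 1000 slots, so a
region much shorter than that is unlikely to change the value read back.
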
Limits on Ice Lake
==================

Four pseudo TopDown metric events are exposed for the end-users,
topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
They can be used to collect the TopDown value under the following
rules:
- All the TopDown metric events must be in a group with the SLOTS event.
- The SLOTS event must be the leader of the group.
- The PERF_FORMAT_GROUP flag must be applied to each TopDown metric
  event.

The SLOTS event and the TopDown metric events can be counting members of
a sampling read group. Since the SLOTS event must be the leader of a TopDown
group, the second event of the group is the sampling event.
For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'

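When the group is read through the perf API instead of RDPMC, the
values of all members arrive in one buffer. The sketch below assumes
PERF_FORMAT_GROUP is the only read_format flag, so that the buffer is a
u64 count followed by one u64 value per event in group order, as
described in perf_event_open(2). It further assumes that each metric
event's value is reported as a slots count, so that dividing by the
SLOTS leader gives the ratio; this should be checked against the kernel
in use. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout of a group read with read_format == PERF_FORMAT_GROUP:
 * the number of events, then one value per event in group order.
 */
struct group_read {
	uint64_t nr;
	uint64_t values[];
};

/*
 * Ratio of group member idx relative to the leader (values[0], SLOTS).
 * Returns -1.0 if idx is not a valid member or the leader count is zero.
 */
static double member_ratio(const struct group_read *g, size_t idx)
{
	if (idx == 0 || idx >= g->nr || g->values[0] == 0)
		return -1.0;

	return (double)g->values[idx] / (double)g->values[0];
}
```

In a program the buffer would be filled by read() on the file
descriptor of the group leader, for example slots_fd from the example
code above.
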
Extension on Sapphire Rapids Server
===================================

The metrics counter is extended to support TMA method level 2 metrics.
The lower half of the register is the TMA level 1 metrics (legacy).
The upper half is also divided into four 8-bit fields for the new level 2
metrics. Four more TopDown metric events are exposed for the end-users,
topdown-heavy-ops, topdown-br-mispredict, topdown-fetch-lat and
topdown-mem-bound.

Each of the new level 2 metrics in the upper half is a subset of the
corresponding level 1 metric in the lower half. Software can deduce the
other four level 2 metrics by subtracting corresponding metrics as below.

    Light_Operations = Retiring - Heavy_Operations
    Machine_Clears = Bad_Speculation - Branch_Mispredicts
    Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
    Core_Bound = Backend_Bound - Memory_Bound

[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
[5] https://sites.google.com/site/analysismethods/yasin-pubs