perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per SMT thread, i.e. each SMT hardware thread contains standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used with
the Linux perf utility. The following directories are created at boot time
if IBS is supported by both the hardware and the kernel.

  /sys/bus/event_source/devices/ibs_op/
  /sys/bus/event_source/devices/ibs_fetch/
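
Whether both PMUs were registered can be checked from a shell before
recording. The following is a minimal sketch, not part of perf itself: the
helper name `ibs_pmus_present` and its optional root-directory argument are
hypothetical, introduced only so the check can be pointed at a different
sysfs root.

```shell
# Hypothetical helper: succeeds only if both IBS PMU directories exist.
# $1 is an optional sysfs root (defaults to /sys); it is an assumption made
# for illustration so the check can be tried against a test directory.
ibs_pmus_present() {
    root="${1:-/sys}"
    for pmu in ibs_op ibs_fetch; do
        [ -d "$root/bus/event_source/devices/$pmu" ] || return 1
    done
    return 0
}

if ibs_pmus_present; then
    echo "IBS Op and IBS Fetch PMUs are available"
else
    echo "IBS PMUs not found (unsupported hardware or kernel)"
fi
```

If the check fails on AMD hardware, the running kernel was most likely built
without IBS support.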

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
no skid, whereas the IP recorded by the regular core PMU will have some skid
(the sample was generated at IP X but perf would record it at IP X+n). Hence,
the regular core PMU might not help for profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages, such
as a plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to IBS Op PMU when
precise_ip attribute is set:

	-e cpu-cycles:p becomes -e ibs_op//
	-e r076:p becomes -e ibs_op//
	-e r0C1:p becomes -e ibs_op/cnt_ctl=1/

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

	# perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per process (upstream v6.2 onward), attach to an existing PID (1234 here),
uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), launch and profile a command (ls here),
uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse recorded profile in aggregate mode

	# perf report
	/* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

	# perf script

Raw dump of IBS registers when profiled with --raw-samples

	# perf report -D
	/* Look for PERF_RECORD_SAMPLE */

	Example register raw dump:

	ibs_op_ctl:     000002c30006186a MaxCnt    100000 L3MissOnly 0 En 1
		Val 1 CntCtl 0=cycles CurCnt       707
	IbsOpRip:       ffffffff8204aea7
	ibs_op_data:    0000010002550001 CompToRetCtr     1 TagToRetCtr   597
		BrnRet 0  RipInvalid 0 BrnFuse 0 Microcode 1
	ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
	ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
		DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
		DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
		DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
		DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
		OpDcMissOpenMemReqs 12 DcMissLat     0 TlbRefillLat     0
	IbsDCLinAd:     ff110008a5398920
	IbsDCPhysAd:    00000008a5398920
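
The decoded fields on the ibs_op_ctl line can be cross-checked by hand. The
sketch below decodes the raw value from the example dump with shell
arithmetic. The bit positions are an assumption based on the IBS Op control
register layout as described in AMD's processor documentation (MaxCnt[19:4]
in bits 15:0, L3MissOnly in bit 16, En in bit 17, Val in bit 18, CntCtl in
bit 19, MaxCnt[26:20] in bits 26:20, CurCnt in bits 58:32); verify them
against the documentation for your processor before relying on them.

```shell
# Raw ibs_op_ctl value copied from the example register dump.
ctl=0x000002c30006186a

# Bit positions below are assumptions taken from the register layout
# described in the lead-in, not something perf exports.
max_cnt=$(( (($ctl & 0xffff) << 4) | ($ctl & 0x7f00000) ))
l3missonly=$(( ($ctl >> 16) & 1 ))
en=$(( ($ctl >> 17) & 1 ))
val=$(( ($ctl >> 18) & 1 ))
cnt_ctl=$(( ($ctl >> 19) & 1 ))
cur_cnt=$(( ($ctl >> 32) & 0x7ffffff ))

echo "MaxCnt $max_cnt L3MissOnly $l3missonly En $en Val $val CntCtl $cnt_ctl CurCnt $cur_cnt"
```

For the example value this yields MaxCnt 100000, L3MissOnly 0, En 1, Val 1,
CntCtl 0 and CurCnt 707, matching the fields perf printed next to the raw
register.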

IBS applied in a real-world use case

	A ~90% regression was observed in tbench with a specific scheduler hint,
	which was counterintuitive. IBS profiles of good and bad runs, captured
	using perf, helped identify the exact cause of the problem:

	https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

	# perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

	Random enable adds a small degree of variability to the sample period.
	This helps in cases like long-running loops where the PMU tags the same
	instruction over and over because of a fixed sample period.

etc.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

	# perf mem record -c 100000 -- make
	# perf mem report

A normal perf mem report output will provide detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

	# perf mem report -F mem,sample,snoop
	Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
	Memory access                                 Samples  Snoop
	N/A                                           1903343  N/A
	L1 hit                                        1056754  N/A
	L2 hit                                          75231  N/A
	L3 hit                                           9496  HitM
	L3 hit                                           2270  N/A
	RAM hit                                          8710  N/A
	Remote node, same socket RAM hit                 3241  N/A
	Remote core, same node Any cache hit             1572  HitM
	Remote core, same node Any cache hit              514  N/A
	Remote node, same socket Any cache hit           1216  HitM
	Remote node, same socket Any cache hit            350  N/A
	Uncached hit                                       18  N/A
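
Because the aggregated report is plain text, it can be post-processed with
standard tools. As a minimal sketch, the following sums the samples whose
snoop result is HitM (a snoop that hit a modified cacheline). The input rows
are copied from the example report and fed in through a here-document; in
practice the input would be saved report output, and the column positions
are an assumption that holds only for the `-F mem,sample,snoop` layout shown.

```shell
# The last field is the Snoop column and the field before it is Samples.
# Rows are copied from the example report; a real run would read saved output.
hitm_samples=$(awk '$NF == "HitM" { sum += $(NF - 1) } END { print sum + 0 }' <<'EOF'
L3 hit                                           9496  HitM
L3 hit                                           2270  N/A
Remote core, same node Any cache hit             1572  HitM
Remote core, same node Any cache hit              514  N/A
Remote node, same socket Any cache hit           1216  HitM
Remote node, same socket Any cache hit            350  N/A
EOF
)

echo "HitM samples: $hitm_samples"
```

For the example rows this reports 12284 HitM samples (9496 + 1572 + 1216).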

Please refer to their man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]