perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. The data includes the
execution time in cycles. For loads and stores it also includes the data address, cache miss
events, and the data origin.

Sampling has five stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. The driver uses this minimum interval if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data has been done by either the kernel or Perf. Decoding
happens only when the recorded file is opened with 'perf report' or 'perf script'. When decoding
the data, Perf generates "synthetic samples" as if they had been generated at the time of the
recording. These samples are the same as those produced by normal Perf sampling without SPE,
although they may have more attributes associated with them. For example, a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including the full PC of the instruction and the data virtual
 and physical addresses.

 - Allows correlation between an instruction and events, such as TLB and cache misses. (The data
 source indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
dropped samples that would have made it through the filter, but it can serve as a rough guide.
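
As a rough illustration of using that count, the sketch below estimates the percentage of samples
lost to collisions. It assumes no filter is configured, so that every non-colliding sample is
written as a record, and it ignores any other source of loss. The helper name and both counts are
hypothetical.

```shell
# Hypothetical helper: rough percentage of samples lost to collisions.
#   $1 = number of collisions (for example a 'sample_collision' count)
#   $2 = number of SPE records actually written
# Assumes no filtering, so attempted samples = collisions + records.
spe_collision_pct() {
    awk -v c="$1" -v r="$2" 'BEGIN {
        total = c + r
        if (total == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * c / total
    }'
}

spe_collision_pct 50 950    # made-up counts, prints 5.0
```

If the percentage is not negligible, a larger sample period is the obvious first thing to try.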

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.
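
The weighting can be sketched as dividing each instruction's raw sample count by the number of
micro-operations it expands to. The counts and expansion factors below are made up for
illustration and the helper name is hypothetical; real expansion factors are implementation
specific.

```shell
# Hypothetical helper: weight raw sample counts by micro-ops per instruction.
# Input lines: <instruction> <raw samples> <micro-ops per instruction>
spe_weight_samples() {
    awk '{ printf "%s %.1f\n", $1, $2 / $3 }'
}

# A is always split into two micro-ops, B into one (illustrative numbers).
printf '%s\n' 'A 200 2' 'B 100 1' | spe_weight_samples
```

After weighting, A and B come out equally frequent even though A appeared twice as often in the
raw samples.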

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.
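
A minimal sketch of that estimate, assuming both counts were collected over the same run: the
ratio of 'sample_pop' to 'inst_retired' gives a coarse average number of sampled operations per
retired instruction. The helper name is hypothetical and the counts are invented.

```shell
# Hypothetical helper: coarse operations-per-retired-instruction ratio.
#   $1 = 'sample_pop' count, $2 = 'inst_retired' count
spe_ops_per_inst() {
    awk -v pop="$1" -v ret="$2" 'BEGIN {
        if (ret == 0) exit 1          # no retired instructions: ratio undefined
        printf "%.2f\n", pop / ret
    }'
}

spe_ops_per_inst 1500000 1000000    # made-up counts, prints 1.50
```

A ratio above 1 suggests that the sample population holds more operations than retired
instructions, through micro-op conversion, speculation, or both.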

Kernel Requirements
-------------------

The ARM_SPE_PMU config option must be enabled, either as a module or built in.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still active, SPE driver initialisation fails with
the console message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command
line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.
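
A quick way to check whether the PMU was registered is to look for an SPE entry in that sysfs
directory. The sketch below assumes the entry name starts with 'arm_spe' (for example arm_spe_0);
the helper name is hypothetical, and the directory argument exists only so the function can be
tried against another path.

```shell
# Hypothetical helper: succeed if an SPE PMU is registered.
#   $1 = event source directory (defaults to the real sysfs location)
spe_pmu_present() {
    dir=${1:-/sys/bus/event_source/devices}
    for entry in "$dir"/arm_spe*; do
        # If nothing matches, the pattern stays literal and the -e test fails.
        [ -e "$entry" ] && return 0
    done
    return 1
}

if spe_pmu_present; then
    echo "SPE PMU found"
else
    echo "SPE PMU not found: check ARM_SPE_PMU, KPTI and the firmware description"
fi
```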

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set with the -c option. Because the minimum interval is used by default,
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
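
To get a feel for how the period affects the amount of data, the sketch below estimates the
number of samples taken for a given number of operations in the sample population. It is only an
approximation: it ignores jitter, filtering and collisions, and the operation count and periods
are invented. The helper name is hypothetical.

```shell
# Hypothetical helper: approximate number of samples taken.
#   $1 = operations in the sample population, $2 = sample period (-c)
spe_expected_samples() {
    echo $(( $1 / $2 ))
}

spe_expected_samples 1000000000 1024      # small period, many samples
spe_expected_samples 1000000000 100000    # larger period, far fewer samples

# The period itself is passed with -c, for example:
#   perf record -e arm_spe// -c 100000 -- ./mybench
```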

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and are comma separated. For example: '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.
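
When the parameter list is assembled by a script, the event string can be built by joining the
parameters with commas. A minimal sketch (the helper name is hypothetical; the parameters are the
ones listed above):

```shell
# Hypothetical helper: build an arm_spe event string from config parameters.
# The function body runs in a subshell so the IFS change does not leak out.
spe_event() (
    IFS=,
    # "$*" joins the arguments with the first character of IFS (a comma here).
    printf 'arm_spe/%s/\n' "$*"
)

spe_event load_filter=1 min_latency=10    # prints arm_spe/load_filter=1,min_latency=10/

# Possible use:
#   perf record -e "$(spe_event load_filter=1 min_latency=10)" -- ./mybench
```

With no arguments the helper prints the plain 'arm_spe//' event.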

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
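
The mask values follow directly from the bit numbers: bit n corresponds to 1 << n. The sketch
below computes a mask from one or more bit numbers (the helper name is hypothetical). When several
bits are set, the record is expected to be kept only if all of the selected events occurred;
treat that as an assumption and check it against the PMSEVFR description for your CPU.

```shell
# Hypothetical helper: build an event_filter mask from event bit numbers.
spe_event_filter() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_filter 1      # retired instructions: 0x2
spe_event_filter 7      # mispredicted branches: 0x80
spe_event_filter 1 3    # retired instructions with an L1D refill: 0xa
```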

Viewing the data
~~~~~~~~~~~~~~~~~

By default, perf report and perf script assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example, perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running in a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets, but these only improve the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this message can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]
229