perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population; for SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.
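
The filter criteria are set through the event's config parameters, which are listed under
'Config parameters' later in this page. As a rough sketch of how an 'event_filter' mask relates to
the event bit numbers given there, the helper below ORs together one bit per argument. The function
name spe_event_mask is hypothetical and is not part of perf. How the hardware combines several set
bits is defined by the architecture's description of PMSEVFR and is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): build an event_filter mask
# from a list of event bit numbers, e.g. bit 7 for "mispredict".
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        # Set one bit per argument.
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

# Possible use, matching the event syntax shown in this page:
#   perf record -e "arm_spe/event_filter=$(spe_event_mask 7)/" -- ./mybench
```

Passing bit 1 gives the mask used for retired instructions, and bit 7 gives the mask used for
mispredicted branches.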

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
 indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
samples dropped that would have made it through the filter. It can only serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled and is not, the SPE driver will fail with the console
message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.
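
As a quick way to check the outcome of the requirements above, the sketch below lists any SPE PMUs
registered under the sysfs directory. It assumes the devices appear with names of the form
arm_spe_<n> (for example arm_spe_0). The function name spe_pmu_list is hypothetical, and the
directory is taken as a parameter only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): print the SPE PMUs found in a
# sysfs event_source directory. Returns 0 if at least one was found.
# Assumption: SPE PMU devices are named arm_spe_<n>, e.g. arm_spe_0.
spe_pmu_list() {
    dir="${1:-/sys/bus/event_source/devices}"
    found=1
    for dev in "$dir"/arm_spe_*; do
        # An unmatched glob is left as a literal pattern, so test existence.
        if [ -e "$dev" ]; then
            basename "$dev"
            found=0
        fi
    done
    return "$found"
}
```

Called with no arguments it inspects /sys/bus/event_source/devices. If it prints nothing, the
likely causes are the ones listed in this section: the driver is not built or loaded, KPTI is
enabled where it must be off, or the interrupt is not described by the firmware.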

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
  discard=1           - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. But these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above).

PMU events
~~~~~~~~~~

SPE has events that can be counted on core PMUs. These are prefixed with
SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and
SAMPLE_FEED_BR.

These events will only count when an SPE event is running on the same core that
the PMU event is opened on, otherwise they read as 0. There are various ways to
ensure that the PMU event and SPE event are scheduled together depending on the
way the event is opened. For example, both events can be opened as per-process
events on the same process, although it's not guaranteed that the PMU event is
enabled first when context switching. For that reason it may be better to open
the PMU event as a systemwide event and then open SPE on the process of interest.

Discard mode
~~~~~~~~~~~~

SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of
collecting sample data if discard mode is supported (optional from Armv8.6).
First run a system wide SPE session (or on the core of interest) using options
to minimize output. Then run perf stat:

  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
  perf stat -e SAMPLE_FEED_LD

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]