perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population; for SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures there is only one sampled operation in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example, a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
   hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
   addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
   indicates which particular cache was hit, but the meaning is implementation defined because
   different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
samples dropped that would have made it through the filter, but it can serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, the SPE driver will fail with the
console message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.
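
As a quick check of the requirements above, the following shell sketch lists any SPE PMU devices
registered in sysfs. It assumes the devices are named with an `arm_spe` prefix (for example
arm_spe_0). The function name `list_spe_pmus` and its optional directory argument are hypothetical
helpers introduced only for illustration; they are not part of perf.

```shell
# List SPE PMU devices registered with the perf core.
# Assumption: device names start with "arm_spe" (e.g. arm_spe_0).
# The optional argument overrides the sysfs directory (hypothetical, for testing).
list_spe_pmus() {
    dir="${1:-/sys/bus/event_source/devices}"
    for dev in "$dir"/arm_spe*; do
        # The glob stays unexpanded when nothing matches, so test for existence.
        [ -e "$dev" ] && basename "$dev"
    done
    return 0
}

pmus="$(list_spe_pmus)"
if [ -n "$pmus" ]; then
    echo "SPE PMU(s) found: $pmus"
else
    echo "No SPE PMU found; check the kernel config, KPTI and firmware description"
fi
```

An empty result does not say which requirement is missing, so the kernel log is still the first
place to look for the KPTI message quoted above.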

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  event_filter=<mask>         - logical AND filter on specific events (PMSEVFR) - see bitfield description below
  inv_event_filter=<mask>     - logical OR to filter out specific events (PMSNEVFR, FEAT_SPEv1p2) - see bitfield description below
  jitter=1                    - use jitter to avoid resonance when sampling (PMSIRR.RND)
  min_latency=<n>             - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1                 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1                - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  ts_enable=1                 - enable timestamping with value of generic timer (PMSCR.TS)
  discard=1                   - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)
  inv_data_src_filter=<mask>  - mask to filter from 0-63 possible data sources (PMSDSFR, FEAT_SPE_FDS) - See 'Data source filtering'

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

Only some events can be filtered on using 'event_filter' bits. The overall
filter is the logical AND of these bits, for example if bits 3 and 5 are set
only samples that have both 'L1D cache refill' AND 'TLB walk' are recorded.
When FEAT_SPEv1p2 is implemented 'inv_event_filter' can also be used to exclude
samples that have any (OR) of the filter's bits set. For example setting bits 3
and 5 in 'inv_event_filter' will exclude any samples that have either L1D cache
refill OR TLB walk. If the same bit is set in both filters it's UNPREDICTABLE
whether the sample is included or excluded. Filter bits for both event_filter
and inv_event_filter are:

  bit 1     - Instruction retired (i.e. omit speculative instructions)
  bit 2     - L1D access (FEAT_SPEv1p4)
  bit 3     - L1D refill
  bit 4     - TLB access (FEAT_SPEv1p4)
  bit 5     - TLB refill
  bit 6     - Not taken event (FEAT_SPEv1p2)
  bit 7     - Mispredict
  bit 8     - Last level cache access (FEAT_SPEv1p4)
  bit 9     - Last level cache miss (FEAT_SPEv1p4)
  bit 10    - Remote access (FEAT_SPEv1p4)
  bit 11    - Misaligned access (FEAT_SPEv1p1)
  bit 12-15 - IMPLEMENTATION DEFINED events (when implemented)
  bit 16    - Transaction (FEAT_TME)
  bit 17    - Partial or empty SME or SVE predicate (FEAT_SPEv1p1)
  bit 18    - Empty SME or SVE predicate (FEAT_SPEv1p1)
  bit 19    - L2D access (FEAT_SPEv1p4)
  bit 20    - L2D miss (FEAT_SPEv1p4)
  bit 21    - Cache data modified (FEAT_SPEv1p4)
  bit 22    - Recently fetched (FEAT_SPEv1p4)
  bit 23    - Data snooped (FEAT_SPEv1p4)
  bit 24    - Streaming SVE mode event (when FEAT_SPE_SME is implemented), or
              IMPLEMENTATION DEFINED event 24 (when implemented, only versions
              less than FEAT_SPEv1p4)
  bit 25    - SMCU or external coprocessor operation event when FEAT_SPE_SME is
              implemented, or IMPLEMENTATION DEFINED event 25 (when implemented,
              only versions less than FEAT_SPEv1p4)
  bit 26-31 - IMPLEMENTATION DEFINED events (only versions less than FEAT_SPEv1p4)
  bit 48-63 - IMPLEMENTATION DEFINED events (when implemented)

For IMPLEMENTATION DEFINED bits, refer to the CPU TRM if these bits are
implemented.
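
Because both filters take a raw bitmask, it can help to compute the value from the bit numbers
listed above. The sketch below is illustrative only: `spe_event_mask` is a hypothetical helper, not
part of perf, and it simply ORs together one bit per argument using shell arithmetic.

```shell
# Build an event_filter / inv_event_filter mask from a list of bit numbers.
# spe_event_mask is a hypothetical helper for illustration, not a perf command.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_mask 1      # bit 1, instruction retired: prints 0x2
spe_event_mask 7      # bit 7, mispredict: prints 0x80
spe_event_mask 3 5    # bits 3 and 5, L1D refill and TLB refill: prints 0x28
```

The printed value can then be used in the event string, for example
'arm_spe/event_filter=0x28/'.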

The driver will reject events if requested filter bits require unimplemented SPE
versions, but will not reject filter bits for unimplemented IMPDEF bits or when
their related feature is not present (e.g. SME). For example, if FEAT_SPEv1p2 is
not implemented, filtering on "Not taken event" (bit 6) will be rejected.

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

When set, the following filters can be used to select samples that match any of
the operation types (OR filtering). If only one is set then only samples of that
type are collected:

  branch_filter=1  - Collect branches (PMSFCR.B)
  load_filter=1    - Collect loads (PMSFCR.LD)
  store_filter=1   - Collect stores (PMSFCR.ST)

When extended filtering is supported (FEAT_SPE_EFT), SIMD and floating
point operations can also be selected:

  simd_filter=1    - Collect SIMD loads, stores and operations (PMSFCR.SIMD)
  float_filter=1   - Collect floating point loads, stores and operations (PMSFCR.FP)

When extended filtering is supported (FEAT_SPE_EFT), operation type filters can
be changed to AND using _mask fields. For example samples could be selected if
they are store AND SIMD by setting 'store_filter=1,simd_filter=1,
store_filter_mask=1,simd_filter_mask=1'.
The new masks are as follows:

  branch_filter_mask=1  - Change branch filter behavior from OR to AND (PMSFCR.Bm)
  load_filter_mask=1    - Change load filter behavior from OR to AND (PMSFCR.LDm)
  store_filter_mask=1   - Change store filter behavior from OR to AND (PMSFCR.STm)
  simd_filter_mask=1    - Change SIMD filter behavior from OR to AND (PMSFCR.SIMDm)
  float_filter_mask=1   - Change floating point filter behavior from OR to AND (PMSFCR.FPm)

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch
  0 remote-access
  900 memory
  1800 instructions

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

The instructions group contains the full list of unique samples that are not
sorted into other groups. To generate only this group use --itrace=i1i.

1i (1 instruction interval) signifies no further downsampling. Rather than an
instruction interval, this generates a sample every n SPE samples. For example
to generate the default set of events for every 100 SPE samples:

  perf report --itrace=bxofmtMai100i

Other period types, for example nanoseconds (ns), are not currently supported.

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

The latency value from the SPE sample is stored in the 'weight' field of the
Perf samples and can be displayed in Perf script and report outputs by enabling
its display from the command line.

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. However, these only increase the
   accuracy of assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).

PMU events
~~~~~~~~~~

SPE has events that can be counted on core PMUs. These are prefixed with
SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and
SAMPLE_FEED_BR.

These events will only count when an SPE event is running on the same core that
the PMU event is opened on, otherwise they read as 0. There are various ways to
ensure that the PMU event and SPE event are scheduled together depending on the
way the event is opened. One option is to open both events as per-process events
on the same process, although it's not guaranteed that the PMU event is enabled
first when context switching. For that reason it may be better to open the PMU
event as a systemwide event and then open SPE on the process of interest.

Discard mode
~~~~~~~~~~~~

SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of
collecting sample data if discard mode is supported (optional from Armv8.6).
First run a system wide SPE session (or on the core of interest) using options
to minimize output.
Then run perf stat:

  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
  perf stat -e SAMPLE_FEED_LD

Data source filtering
~~~~~~~~~~~~~~~~~~~~~

When FEAT_SPE_FDS is present, 'inv_data_src_filter' can be used as a mask to
filter on a subset (0 - 63) of possible data source IDs. The full range of data
sources is 0 - 65535 although these are unlikely to be used in practice. Data
sources are IMPDEF so refer to the TRM for the mappings. Each bit N of the
filter maps to data source N. The filter is an OR of all the bits, and the value
provided for 'inv_data_src_filter' is inverted before writing to PMSDSFR_EL1 so
that set bits exclude that data source and cleared bits include that data source.
Therefore the default value of 0 is equivalent to no filtering (all data sources
included).

For example, to include only data sources 0 and 3, clear bits 0 and 3
(0xFFFFFFFFFFFFFFF6).

When 'inv_data_src_filter' is set to 0xFFFFFFFFFFFFFFFF, any samples with any
data source set are excluded.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]