perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified.
A pseudo-random perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording.
These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
 indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs.
Therefore it cannot be used as an exact
number for samples dropped that would have made it through the filter, but can be a rough
guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, the SPE driver will fail with a
console message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.
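
The config and KPTI requirements described in this section can be checked from a shell. The
following is a minimal sketch, not part of perf or the kernel: the helper names
spe_config_state and spe_blocked_by_kpti are hypothetical, and the default config path
/boot/config-$(uname -r) is an assumption that does not hold on every distribution.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of perf or the kernel.

# Print the ARM_SPE_PMU line (=y, =m, or "is not set") from a kernel config file.
# Assumption: the config lives at /boot/config-<release> unless a path is given.
spe_config_state() {
    config_file=${1:-/boot/config-$(uname -r)}
    grep -E '^(# )?CONFIG_ARM_SPE_PMU[= ]' "$config_file"
}

# Succeed if the kernel log read from stdin contains the driver's KPTI message.
spe_blocked_by_kpti() {
    grep -q "profiling buffer inaccessible"
}

# Example use on a live system (reading the kernel log may need root):
#   spe_config_state
#   dmesg | spe_blocked_by_kpti && echo "boot with kpti=off"
```

Both helpers take their input as an argument or on stdin, so they can also be run against a
saved config file or a saved kernel log from another machine.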

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated.
For example '-e arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples.
The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. But these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above)


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]