xref: /linux/tools/perf/Documentation/perf-arm-spe.txt (revision 36ec807b627b4c0a0a382f0ae48eac7187d14b2b)
perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.
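
The interval-based selection described above can be pictured with a small simulation. This is a
sketch only: the function name `sample_points`, the `jitter_max` parameter and the uniform
perturbation model are assumptions made for illustration. They do not describe how any particular
SPE implementation generates its perturbation.

```python
import random


def sample_points(total_ops, interval, jitter_max=0, seed=0):
    """Return the positions of the operations selected for sampling.

    After each selection the next one is `interval` operations later, plus
    an optional pseudo-random perturbation in [0, jitter_max].
    The perturbation model is hypothetical and only for illustration.
    """
    rng = random.Random(seed)
    points = []
    position = 0
    while True:
        step = interval + (rng.randint(0, jitter_max) if jitter_max else 0)
        position += step
        if position > total_ops:
            break
        points.append(position)
    return points


# Fixed interval: selections land on exact multiples of the interval.
fixed = sample_points(total_ops=10_000, interval=1024)

# With perturbation: the gaps vary, which avoids locking onto periodic code.
jittered = sample_points(total_ops=10_000, interval=1024, jitter_max=255, seed=1)
```

With a fixed interval, a loop whose length divides the interval would be sampled at the same
instruction every time. The perturbation spreads the selections out.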

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures there is only one sampled operation in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.
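
The keep-or-discard decision can be sketched as a predicate over a record. The `Record` type, the
`keep` function and the way the two criteria are combined are assumptions for illustration. The
real filter is programmed through the config parameters described later in this document and has
more criteria than are shown here.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical, simplified view of one sampled operation.
    pc: int
    latency: int  # total latency in cycles
    events: int   # bitmask of events observed for this operation


def keep(record, min_latency=0, event_mask=0):
    """Return True if the record passes the illustrative filter.

    Assumption: a record is kept only if it meets the latency threshold
    and has every event in `event_mask` set.
    """
    if record.latency < min_latency:
        return False
    return (record.events & event_mask) == event_mask


records = [
    Record(pc=0x1000, latency=4, events=0b0010),
    Record(pc=0x1004, latency=40, events=0b1010),
    Record(pc=0x1008, latency=60, events=0b0010),
]

# Only the second record is both slow enough and has bit 3 set.
kept = [r.pc for r in records if keep(r, min_latency=10, event_mask=0b1000)]
```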

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up to this point, neither the kernel nor Perf has decoded the SPE data. Decoding happens only when
the recorded file is opened with 'perf report' or 'perf script'. When decoding the data, Perf
generates "synthetic samples" as if these were generated at the time of the recording. These
samples are the same as if normal sampling had been done by Perf without using SPE, although they
may have more attributes associated with them. For example, a normal sample may have just the
instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 addresses.

 - Allows correlation between an instruction and events, such as TLB and cache misses. (Data source
 indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact count of
the dropped samples that would have made it through the filter. It can still serve as a rough
guide.
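
The relationship between sample rate and collisions can be illustrated with a toy model. The
function `simulate` and its cycle-based inputs are hypothetical. The model only encodes the rule
stated above, that a newly selected operation is dropped while the previous sampled operation is
still in flight.

```python
def simulate(sample_starts, latencies):
    """Split selected operations into recorded ones and collisions.

    sample_starts: cycle at which each operation was selected (ascending).
    latencies: number of cycles each selected operation takes to finish.
    """
    recorded = []
    collisions = 0
    busy_until = 0
    for start, latency in zip(sample_starts, latencies):
        if start < busy_until:
            # Previous sampled operation has not finished: drop this one.
            collisions += 1
            continue
        recorded.append(start)
        busy_until = start + latency
    return recorded, collisions


# Short interval: the operation selected at cycle 150 collides with the
# slow operation selected at cycle 100.
recorded, collisions = simulate([0, 100, 150, 300], [20, 120, 10, 30])

# Longer interval with the same latencies: no collisions.
recorded_slow, collisions_slow = simulate([0, 200, 400, 600], [20, 120, 10, 30])
```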

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.
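
A minimal sketch of the weighting follows, assuming the number of micro-operations per instruction
is known. The helper names and the example counts are hypothetical. The ratio of 'sample_pop' to
'inst_retired' is only the coarse estimate mentioned above, not a per-instruction figure.

```python
from collections import Counter


def weighted_counts(sampled_instructions, uops_per_instruction):
    """Scale raw sample counts down by the micro-op count of each instruction.

    Instructions missing from `uops_per_instruction` are assumed to map to
    a single micro-operation.
    """
    counts = Counter(sampled_instructions)
    return {
        insn: n / uops_per_instruction.get(insn, 1)
        for insn, n in counts.items()
    }


def ops_per_instruction(sample_pop, inst_retired):
    """Coarse average number of sampled operations per retired instruction."""
    return sample_pop / inst_retired


# A is split into two micro-ops, so it shows up twice as often as B even
# though both execute equally often.
samples = ["A", "A", "B", "A", "A", "B"]
weights = weighted_counts(samples, {"A": 2})

# Made-up counter values, for illustration only.
ratio = ops_per_instruction(sample_pop=1500, inst_retired=1000)
```

After weighting, A and B have the same count, which matches how often each instruction ran.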

Kernel Requirements
-------------------

The ARM_SPE_PMU config option must be enabled, either built as a module or built in.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is not, SPE will fail with the console message
"profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.
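
One way to check for the PMU is to list that directory. The sketch below assumes SPE PMU devices
appear with names starting with "arm_spe" (for example "arm_spe_0"). That naming, the helper name
`find_spe_pmus` and the fake directory entries are assumptions for illustration.

```python
import os
import tempfile


def find_spe_pmus(devices_dir="/sys/bus/event_source/devices"):
    """Return the PMU device names that start with 'arm_spe'.

    An empty list means no SPE PMU was found (or the directory is missing).
    """
    try:
        names = os.listdir(devices_dir)
    except FileNotFoundError:
        return []
    return sorted(name for name in names if name.startswith("arm_spe"))


# Demonstration against a fake directory so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as fake:
    for name in ("armv8_pmuv3_0", "arm_spe_0", "software"):
        os.mkdir(os.path.join(fake, name))
    found = find_spe_pmus(fake)
```

On a real system, call `find_spe_pmus()` with no arguments. An empty result points back to the
requirements listed in this section.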
127*36f65f9bSJames Clark
1282adacd7fSJames ClarkCapturing SPE with perf command-line tools
1292adacd7fSJames Clark------------------------------------------
1302adacd7fSJames Clark
1312adacd7fSJames ClarkYou can record a session with SPE samples:
1322adacd7fSJames Clark
1332adacd7fSJames Clark  perf record -e arm_spe// -- ./mybench
1342adacd7fSJames Clark
1352adacd7fSJames ClarkThe sample period is set from the -c option, and because the minimum interval is used by default
1362adacd7fSJames Clarkit's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.
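
The event syntax and the -c option can be combined programmatically. The sketch below only formats
strings and does not run perf. The helper names `spe_event` and `record_command` and the period
value 4096 are hypothetical, chosen for illustration.

```python
def spe_event(**params):
    """Format config parameters as an arm_spe event string.

    Parameters are placed between the slashes, comma separated.
    """
    body = ",".join(f"{key}={value}" for key, value in params.items())
    return f"arm_spe/{body}/"


def record_command(workload, period=None, **params):
    """Build an argument list for 'perf record' with an SPE event."""
    argv = ["perf", "record", "-e", spe_event(**params)]
    if period is not None:
        # The sample period is set from the -c option.
        argv += ["-c", str(period)]
    return argv + ["--"] + list(workload)


event = spe_event(load_filter=1, min_latency=10)
cmd = record_command(["./mybench"], period=4096, load_filter=1)
```

The resulting list can be passed to `subprocess.run()` on a system where SPE is available.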

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
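
The mask values in these commands follow directly from the bit positions listed above. A short
sketch of the arithmetic, where the dictionary keys are hypothetical labels and not names used by
perf:

```python
# Bit positions taken from the list of filterable events above.
EVENT_BITS = {
    "retired": 1,
    "l1d_refill": 3,
    "tlb_refill": 5,
    "mispredict": 7,
    "misaligned": 11,
}


def event_filter(*names):
    """Return the event_filter mask with the bit for each named event set."""
    mask = 0
    for name in names:
        mask |= 1 << EVENT_BITS[name]
    return mask


retired_only = event_filter("retired")        # 1 << 1
mispredicted = event_filter("mispredict")     # 1 << 7
both = event_filter("retired", "mispredict")  # both bits set
```

Formatting a mask with `hex()` gives a value that can be placed after 'event_filter=' in the event
string.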

Viewing the data
~~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.
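
The reason the group totals can exceed the number of SPE records is that one record is counted
once in every group it matches. The sketch below illustrates this with made-up records. The
attribute sets are hypothetical and do not reflect how Perf maps SPE events to groups.

```python
from collections import Counter


def group_counts(records):
    """Count, per group, how many records carry that attribute."""
    counts = Counter()
    for attributes in records:
        counts.update(set(attributes))
    return counts


# Three made-up records; the first two each fall into several groups.
records = [
    {"l1d-access", "tlb-access", "memory"},
    {"l1d-access", "l1d-miss", "tlb-access", "memory"},
    {"branch-miss"},
]

counts = group_counts(records)
total_in_groups = sum(counts.values())
```

Here three records produce eight group entries, so summing the groups overstates the number of
unique samples.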

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets, but these only improve the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]