perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per-smt-thread i.e. each SMT hardware thread contains standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
with the Linux perf utility. The following sysfs entries are created at boot
time if IBS is supported by the hardware and kernel.

    /sys/bus/event_source/devices/ibs_op/
    /sys/bus/event_source/devices/ibs_fetch/

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
no skid, whereas the IP recorded by the regular core PMU will have some skid
(sample was generated at IP X but perf would record it at IP X+n). Hence, the
regular core PMU might not help for profiling with instruction level
precision. Further, IBS provides additional information about the sample in
question.
On the other hand, the regular core PMU has its own advantages like a
plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to IBS Op PMU when
precise_ip attribute is set:

    -e cpu-cycles:p becomes -e ibs_op//
    -e r076:p       becomes -e ibs_op//
    -e r0C1:p       becomes -e ibs_op/cnt_ctl=1/

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

    # perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

    # perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

    # perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

    # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

    # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

    # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per process (upstream v6.2 onward), attach to PID 1234, uOps event, sampling period: 100000

    # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), launch and profile a command, uOps event, sampling period: 100000

    # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse recorded profile in aggregate mode

    # perf report
    /* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

    # perf script

Raw dump of IBS registers when profiled with --raw-samples

    # perf report -D
    /* Look for PERF_RECORD_SAMPLE */

    Example register raw dump:

    ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
                    Val 1 CntCtl 0=cycles CurCnt 707
    IbsOpRip:       ffffffff8204aea7
    ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
                    BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
    ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
    ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
                    DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
                    DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
                    DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
                    DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
                    OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
    IbsDCLinAd:     ff110008a5398920
    IbsDCPhysAd:    00000008a5398920

IBS applied in a real world use case

    A ~90% regression was observed in tbench with a specific scheduler hint,
    which was counterintuitive. IBS profiles of the good and bad runs captured
    using perf helped identify the exact cause of the problem:

    https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

    # perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

    # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

    Random enable adds a small degree of variability to the sample period.
    This helps in cases like long running loops where the PMU is tagging the
    same instruction over and over because of a fixed sample period.

etc.
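
As a cross-check of the IBS Op register raw dump shown earlier, the
ibs_op_ctl value can be decoded by hand with shell arithmetic. This is only a
sketch: the bit positions are an assumption (they are not given in this page
and are believed to follow `union ibs_op_ctl` in the kernel's amd-ibs.h
header), and the variable names are made up for illustration. Verify the
layout against the kernel header or the AMD PPR before relying on it.

```shell
# Raw ibs_op_ctl value copied from the example register raw dump.
reg=$(( 0x000002c30006186a ))

# ASSUMED bit layout (not specified in this page):
#   0-15 MaxCnt low bits, 16 L3MissOnly, 17 En, 18 Val, 19 CntCtl,
#   20-26 MaxCnt extension bits, 32-58 CurCnt
max_cnt_lo=$(( reg & 0xffff ))
l3_miss_only=$(( (reg >> 16) & 0x1 ))
en=$(( (reg >> 17) & 0x1 ))
val=$(( (reg >> 18) & 0x1 ))
cnt_ctl=$(( (reg >> 19) & 0x1 ))
max_cnt_ext=$(( (reg >> 20) & 0x7f ))
cur_cnt=$(( (reg >> 32) & 0x7ffffff ))

# The stored count omits the low 4 bits, so shift left by 4 to get the period.
max_cnt=$(( ((max_cnt_ext << 16) | max_cnt_lo) << 4 ))

echo "MaxCnt $max_cnt L3MissOnly $l3_miss_only En $en Val $val CntCtl $cnt_ctl CurCnt $cur_cnt"
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

Under the assumed layout, the decoded fields agree with the annotated values
in the dump (MaxCnt 100000, CntCtl 0=cycles, CurCnt 707), which matches the
`-c 100000` sampling period used in the examples.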

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

    # perf mem record -c 100000 -- make
    # perf mem report

A normal perf mem report output will provide a detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

    # perf mem report -F mem,sample,snoop
    Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
    Memory access                                 Samples  Snoop
    N/A                                           1903343  N/A
    L1 hit                                        1056754  N/A
    L2 hit                                          75231  N/A
    L3 hit                                           9496  HitM
    L3 hit                                           2270  N/A
    RAM hit                                          8710  N/A
    Remote node, same socket RAM hit                 3241  N/A
    Remote core, same node Any cache hit             1572  HitM
    Remote core, same node Any cache hit              514  N/A
    Remote node, same socket Any cache hit           1216  HitM
    Remote node, same socket Any cache hit            350  N/A
    Uncached hit                                       18  N/A

Please refer to their man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]