perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution, to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per-SMT-thread, i.e. each SMT hardware thread contains standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by the Linux kernel and can be
exploited using the perf utility. The following files will be created at boot
time if IBS is supported by the hardware and kernel.

  /sys/bus/event_source/devices/ibs_op/
  /sys/bus/event_source/devices/ibs_fetch/

The IBS Op PMU supports two events: cycles and micro-ops. The IBS Fetch PMU
supports one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability, thus using them
requires CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with a precise IP, i.e. the IP recorded with an IBS sample
has no skid, whereas the IP recorded by the regular core PMU will have some
skid (the sample was generated at IP X but perf records it at IP X+n). Hence,
the regular core PMU might not help for profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages,
like a plethora of events, counting mode (less interference), up to 6
parallel counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to the IBS Op PMU when
the precise_ip attribute is set:

  -e cpu-cycles:p   becomes   -e ibs_op//
  -e r076:p         becomes   -e ibs_op//
  -e r0C1:p         becomes   -e ibs_op/cnt_ctl=1/

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

  # perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

  # perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

  # perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

  # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture a raw dump of the IBS registers along with
each perf sample:

  # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4
onward)

  # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

System-wide profile, cycles event, sampling period: 100000, LdLat filtering
(Zen5 onward)

  # perf record -e ibs_op/ldlat=128/ -c 100000 -a

  Supported load latency threshold values are 128 to 2048 (both inclusive).
  A latency value which is a multiple of 128 incurs slightly less profiling
  overhead compared to other values.
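Which optional event attributes (cnt_ctl, l3missonly, ldlat etc.) are
available depends on both the CPU generation and the running kernel. As a
quick check before recording, the PMU's sysfs format directory can be
listed; an attribute can be used inside the event string only if a
correspondingly named file is present there (the exact set of files will
vary from system to system):

  # ls /sys/bus/event_source/devices/ibs_op/format/
  # ls /sys/bus/event_source/devices/ibs_fetch/format/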
Per process (upstream v6.2 onward), uOps event, sampling period: 100000

  # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per workload (upstream v6.2 onward), uOps event, sampling period: 100000

  # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse the recorded profile in aggregate mode

  # perf report
  /* Select a line and press 'a' to drill down to instruction level. */

To go over each sample

  # perf script

Raw dump of IBS registers, when profiled with --raw-samples

  # perf report -D
  /* Look for PERF_RECORD_SAMPLE */

  Example register raw dump:

  ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
      Val 1 CntCtl 0=cycles CurCnt 707
  IbsOpRip:       ffffffff8204aea7
  ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
      BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
  ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
  ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
      DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
      DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
      DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
      DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
      OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
  IbsDCLinAd:     ff110008a5398920
  IbsDCPhysAd:    00000008a5398920

IBS applied in a real world usecase

  A ~90% regression was observed in tbench with a specific scheduler hint,
  which was counter-intuitive. An IBS profile of the good and the bad run,
  captured using perf, helped in identifying the exact cause of the problem:

  https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands (system-wide, per-cpu, per-process, frequency mode etc.) can
be used with the Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

  # perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

  # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

  Random enable adds a small degree of variability to the sample period.
  This helps in cases like long running loops, where the PMU would otherwise
  keep tagging the same instruction over and over because of the fixed
  sample period.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

  # perf mem record -c 100000 -- make
  # perf mem report

A normal perf mem report output will provide a detailed memory access
profile. New output fields will show related access info together. For
example:

  # perf mem report -F overhead,cache,snoop,comm
  ...
  # Samples: 92K of event 'ibs_op//'
  # Total weight : 531104
  #
  #           ---------- Cache -----------  --- Snoop ----
  # Overhead      L1     L2  L1-buf  Other     HitM  Other  Command
  # ........  ............................  ..............  ..........
  #
      76.07%    5.8%  35.7%    0.0%  34.6%    23.3%  52.8%  cc1
       5.79%    0.2%   0.0%    0.0%   5.6%     0.1%   5.7%  make
       5.78%    0.1%   4.4%    0.0%   1.2%     0.5%   5.3%  gcc
       5.33%    0.3%   3.9%    0.0%   1.1%     0.2%   5.2%  as
       5.00%    0.1%   3.8%    0.0%   1.0%     0.3%   4.7%  sh
       1.56%    0.1%   0.1%    0.0%   1.4%     0.6%   0.9%  ld
       0.28%    0.1%   0.0%    0.0%   0.2%     0.1%   0.2%  pkg-config
       0.09%    0.0%   0.0%    0.0%   0.1%     0.0%   0.1%  git
       0.03%    0.0%   0.0%    0.0%   0.0%     0.0%   0.0%  rm
  ...
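perf c2c follows the same record/report workflow. As a minimal sketch (the
workload and the 10 second measurement window below are arbitrary choices),
a system-wide contention profile can be captured and browsed with:

  # perf c2c record -a -- sleep 10
  # perf c2c report

The report lists the hottest contended cachelines and shows which offsets
within each line are accessed, and from where.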
perf mem report output can also be aggregated based on various memory access
info using the sort keys. For example:

  # perf mem report -s mem,snoop
  ...
  # Samples: 92K of event 'ibs_op//'
  # Total weight : 531104
  # Sort order : mem,snoop
  #
  # Overhead       Samples  Memory access                            Snoop
  # ........  ............  .......................................  ............
  #
      47.99%          1509  L2 hit                                   N/A
      25.08%           338  core, same node Any cache hit            HitM
      10.24%         54374  N/A                                      N/A
       6.77%         35938  L1 hit                                   N/A
       6.39%           101  core, same node Any cache hit            N/A
       3.50%            69  RAM hit                                  N/A
       0.03%           158  LFB/MAB hit                              N/A
       0.00%             2  Uncached hit                             N/A

Please refer to their respective man pages for more details.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]