1.. SPDX-License-Identifier: GPL-2.0 2 3================================== 4Eprobe - Event-based Probe Tracing 5================================== 6 7:Author: Steven Rostedt <rostedt@goodmis.org> 8 9- Written for v6.17 10 11Overview 12======== 13 14Eprobes are dynamic events that are placed on existing events to either 15dereference a field that is a pointer, or simply to limit what fields are 16recorded in the trace event. 17 18Eprobes depend on kprobe events so to enable this feature, build your kernel 19with CONFIG_EPROBE_EVENTS=y. 20 21Eprobes are created via the /sys/kernel/tracing/dynamic_events file. 22 23Synopsis of eprobe_events 24------------------------- 25:: 26 27 e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe 28 -:[EGRP/][EEVENT] : Clear a probe 29 30 EGRP : Group name of the new event. If omitted, use "eprobes" for it. 31 EEVENT : Event name. If omitted, the event name is generated and will 32 be the same event name as the event it attached to. 33 GRP : Group name of the event to attach to. 34 EVENT : Event name of the event to attach to. 35 36 FETCHARGS : Arguments. Each probe can have up to 128 args. 37 $FIELD : Fetch the value of the event field called FIELD. 38 @ADDR : Fetch memory at ADDR (ADDR should be in kernel) 39 @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) 40 $comm : Fetch current task comm. 41 +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) 42 \IMM : Store an immediate value to the argument. 43 NAME=FETCHARG : Set NAME as the argument name of FETCHARG. 44 FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types 45 (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types 46 (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char", 47 "string", "ustring", "symbol", "symstr" and "bitfield" are 48 supported. 49 50Types 51----- 52The FETCHARGS above is very similar to the kprobe events as described in 53Documentation/trace/kprobetrace.rst. 54 55The difference between eprobes and kprobes FETCHARGS is that eprobes has a 56$FIELD command that returns the content of the event field of the event 57that is attached. Eprobes do not have access to registers, stacks and function 58arguments that kprobes has. 59 60If a field argument is a pointer, it may be dereferenced just like a memory 61address using the FETCHARGS syntax. 62 63 64Attaching to dynamic events 65--------------------------- 66 67Eprobes may attach to dynamic events as well as to normal events. It may 68attach to a kprobe event, a synthetic event or a fprobe event. This is useful 69if the type of a field needs to be changed. See Example 2 below. 70 71Usage examples 72============== 73 74Example 1 75--------- 76 77The basic usage of eprobes is to limit the data that is being recorded into 78the tracing buffer. For example, a common event to trace is the sched_switch 79trace event. That has a format of:: 80 81 field:unsigned short common_type; offset:0; size:2; signed:0; 82 field:unsigned char common_flags; offset:2; size:1; signed:0; 83 field:unsigned char common_preempt_count; offset:3; size:1; signed:0; 84 field:int common_pid; offset:4; size:4; signed:1; 85 86 field:char prev_comm[16]; offset:8; size:16; signed:0; 87 field:pid_t prev_pid; offset:24; size:4; signed:1; 88 field:int prev_prio; offset:28; size:4; signed:1; 89 field:long prev_state; offset:32; size:8; signed:1; 90 field:char next_comm[16]; offset:40; size:16; signed:0; 91 field:pid_t next_pid; offset:56; size:4; signed:1; 92 field:int next_prio; offset:60; size:4; signed:1; 93 94The first four fields are common to all events and can not be limited. But the 95rest of the event has 60 bytes of information. It records the names of the 96previous and next tasks being scheduled out and in, as well as their pids and 97priorities. It also records the state of the previous task. If only the pids 98of the tasks are of interest, why waste the ring buffer with all the other 99fields? 100 101An eprobe can limit what gets recorded. Note, it does not help in performance, 102as all the fields are recorded in a temporary buffer to process the eprobe. 103:: 104 105 # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events 106 # echo 1 > /sys/kernel/tracing/events/sched/switch/enable 107 # cat /sys/kernel/tracing/trace 108 109 # tracer: nop 110 # 111 # entries-in-buffer/entries-written: 2721/2721 #P:8 112 # 113 # _-----=> irqs-off/BH-disabled 114 # / _----=> need-resched 115 # | / _---=> hardirq/softirq 116 # || / _--=> preempt-depth 117 # ||| / _-=> migrate-disable 118 # |||| / delay 119 # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 120 # | | | ||||| | | 121 sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0 122 bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141 123 kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085 124 <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082 125 bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141 126 kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085 127 bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0 128 sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0 129 <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90 130 <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16 131 kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0 132 rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0 133 <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16 134 rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0 135 136Note, without adding the "u32" after the prev_pid and next_pid, the values 137would default showing in hexadecimal. 138 139Example 2 140--------- 141 142If a specific system call is to be recorded but the syscalls events are not 143enabled, the raw_syscalls can still be used (syscalls are system call 144events are not normal events, but are created from the raw_syscalls events 145within the kernel). In order to trace the openat system call, one can create 146an event probe on top of the raw_syscalls event: 147:: 148 149 # cd /sys/kernel/tracing 150 # cat events/raw_syscalls/sys_enter/format 151 name: sys_enter 152 ID: 395 153 format: 154 field:unsigned short common_type; offset:0; size:2; signed:0; 155 field:unsigned char common_flags; offset:2; size:1; signed:0; 156 field:unsigned char common_preempt_count; offset:3; size:1; signed:0; 157 field:int common_pid; offset:4; size:4; signed:1; 158 159 field:long id; offset:8; size:8; signed:1; 160 field:unsigned long args[6]; offset:16; size:48; signed:0; 161 162 print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5] 163 164From the source code, the sys_openat() has: 165:: 166 167 int sys_openat(int dirfd, const char *path, int flags, mode_t mode) 168 { 169 return my_syscall4(__NR_openat, dirfd, path, flags, mode); 170 } 171 172The path is the second parameter, and that is what is wanted. 173:: 174 175 # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events 176 177This is being run on x86_64 where the word size is 8 bytes and the openat 178system call __NR_openat is set at 257. 179:: 180 181 # echo 'nr == 257' > events/eprobes/openat/filter 182 183Now enable the event and look at the trace. 184:: 185 186 # echo 1 > events/eprobes/openat/enable 187 # cat trace 188 189 # tracer: nop 190 # 191 # entries-in-buffer/entries-written: 4/4 #P:8 192 # 193 # _-----=> irqs-off/BH-disabled 194 # / _----=> need-resched 195 # | / _---=> hardirq/softirq 196 # || / _--=> preempt-depth 197 # ||| / _-=> migrate-disable 198 # |||| / delay 199 # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 200 # | | | ||||| | | 201 cat-1298 [003] ...2. 2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 202 cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 203 cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 204 cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 205 206The filename shows "(fault)". This is likely because the filename has not been 207pulled into memory yet and currently trace events cannot fault in memory that 208is not present. When an eprobe tries to read memory that has not been faulted 209in yet, it will show the "(fault)" text. 210 211To get around this, as the kernel will likely pull in this filename and make 212it present, attaching it to a synthetic event that can pass the address of the 213filename from the entry of the event to the end of the event, this can be used 214to show the filename when the system call returns. 215 216Remove the old eprobe:: 217 218 # echo 1 > events/eprobes/openat/enable 219 # echo '-:openat' >> dynamic_events 220 221This time make an eprobe where the address of the filename is saved:: 222 223 # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events 224 225Create a synthetic event that passes the address of the filename to the 226end of the event:: 227 228 # echo 's:filename u64 file' >> dynamic_events 229 # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger 230 # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger 231 232Now that the address of the filename has been passed to the end of the 233system call, create another eprobe to attach to the exit event to show the 234string:: 235 236 # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events 237 # echo 1 > events/eprobes/openat/enable 238 # cat trace 239 240 # tracer: nop 241 # 242 # entries-in-buffer/entries-written: 4/4 #P:8 243 # 244 # _-----=> irqs-off/BH-disabled 245 # / _----=> need-resched 246 # | / _---=> hardirq/softirq 247 # || / _--=> preempt-depth 248 # ||| / _-=> migrate-disable 249 # |||| / delay 250 # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 251 # | | | ||||| | | 252 cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache" 253 cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6" 254 cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive" 255 cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace" 256 257Example 3 258--------- 259 260If syscall trace events are available, the above would not need the first 261eprobe, but it would still need the last one:: 262 263 # echo 's:filename u64 file' >> dynamic_events 264 # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger 265 # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger 266 # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events 267 # echo 1 > events/eprobes/openat/enable 268 269And this would produce the same result as Example 2. 270