1*623526baSSteven Rostedt.. SPDX-License-Identifier: GPL-2.0 2*623526baSSteven Rostedt 3*623526baSSteven Rostedt================================== 4*623526baSSteven RostedtEprobe - Event-based Probe Tracing 5*623526baSSteven Rostedt================================== 6*623526baSSteven Rostedt 7*623526baSSteven Rostedt:Author: Steven Rostedt <rostedt@goodmis.org> 8*623526baSSteven Rostedt 9*623526baSSteven Rostedt- Written for v6.17 10*623526baSSteven Rostedt 11*623526baSSteven RostedtOverview 12*623526baSSteven Rostedt======== 13*623526baSSteven Rostedt 14*623526baSSteven RostedtEprobes are dynamic events that are placed on existing events to either 15*623526baSSteven Rostedtdereference a field that is a pointer, or simply to limit what fields are 16*623526baSSteven Rostedtrecorded in the trace event. 17*623526baSSteven Rostedt 18*623526baSSteven RostedtEprobes depend on kprobe events so to enable this feature, build your kernel 19*623526baSSteven Rostedtwith CONFIG_EPROBE_EVENTS=y. 20*623526baSSteven Rostedt 21*623526baSSteven RostedtEprobes are created via the /sys/kernel/tracing/dynamic_events file. 22*623526baSSteven Rostedt 23*623526baSSteven RostedtSynopsis of eprobe_events 24*623526baSSteven Rostedt------------------------- 25*623526baSSteven Rostedt:: 26*623526baSSteven Rostedt 27*623526baSSteven Rostedt e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe 28*623526baSSteven Rostedt -:[EGRP/][EEVENT] : Clear a probe 29*623526baSSteven Rostedt 30*623526baSSteven Rostedt EGRP : Group name of the new event. If omitted, use "eprobes" for it. 31*623526baSSteven Rostedt EEVENT : Event name. If omitted, the event name is generated and will 32*623526baSSteven Rostedt be the same event name as the event it attached to. 33*623526baSSteven Rostedt GRP : Group name of the event to attach to. 34*623526baSSteven Rostedt EVENT : Event name of the event to attach to. 35*623526baSSteven Rostedt 36*623526baSSteven Rostedt FETCHARGS : Arguments. Each probe can have up to 128 args. 37*623526baSSteven Rostedt $FIELD : Fetch the value of the event field called FIELD. 38*623526baSSteven Rostedt @ADDR : Fetch memory at ADDR (ADDR should be in kernel) 39*623526baSSteven Rostedt @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) 40*623526baSSteven Rostedt $comm : Fetch current task comm. 41*623526baSSteven Rostedt +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) 42*623526baSSteven Rostedt \IMM : Store an immediate value to the argument. 43*623526baSSteven Rostedt NAME=FETCHARG : Set NAME as the argument name of FETCHARG. 44*623526baSSteven Rostedt FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types 45*623526baSSteven Rostedt (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types 46*623526baSSteven Rostedt (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char", 47*623526baSSteven Rostedt "string", "ustring", "symbol", "symstr" and "bitfield" are 48*623526baSSteven Rostedt supported. 49*623526baSSteven Rostedt 50*623526baSSteven RostedtTypes 51*623526baSSteven Rostedt----- 52*623526baSSteven RostedtThe FETCHARGS above is very similar to the kprobe events as described in 53*623526baSSteven RostedtDocumentation/trace/kprobetrace.rst. 54*623526baSSteven Rostedt 55*623526baSSteven RostedtThe difference between eprobes and kprobes FETCHARGS is that eprobes has a 56*623526baSSteven Rostedt$FIELD command that returns the content of the event field of the event 57*623526baSSteven Rostedtthat is attached. Eprobes do not have access to registers, stacks and function 58*623526baSSteven Rostedtarguments that kprobes has. 59*623526baSSteven Rostedt 60*623526baSSteven RostedtIf a field argument is a pointer, it may be dereferenced just like a memory 61*623526baSSteven Rostedtaddress using the FETCHARGS syntax. 62*623526baSSteven Rostedt 63*623526baSSteven Rostedt 64*623526baSSteven RostedtAttaching to dynamic events 65*623526baSSteven Rostedt--------------------------- 66*623526baSSteven Rostedt 67*623526baSSteven RostedtEprobes may attach to dynamic events as well as to normal events. It may 68*623526baSSteven Rostedtattach to a kprobe event, a synthetic event or a fprobe event. This is useful 69*623526baSSteven Rostedtif the type of a field needs to be changed. See Example 2 below. 70*623526baSSteven Rostedt 71*623526baSSteven RostedtUsage examples 72*623526baSSteven Rostedt============== 73*623526baSSteven Rostedt 74*623526baSSteven RostedtExample 1 75*623526baSSteven Rostedt--------- 76*623526baSSteven Rostedt 77*623526baSSteven RostedtThe basic usage of eprobes is to limit the data that is being recorded into 78*623526baSSteven Rostedtthe tracing buffer. For example, a common event to trace is the sched_switch 79*623526baSSteven Rostedttrace event. That has a format of:: 80*623526baSSteven Rostedt 81*623526baSSteven Rostedt field:unsigned short common_type; offset:0; size:2; signed:0; 82*623526baSSteven Rostedt field:unsigned char common_flags; offset:2; size:1; signed:0; 83*623526baSSteven Rostedt field:unsigned char common_preempt_count; offset:3; size:1; signed:0; 84*623526baSSteven Rostedt field:int common_pid; offset:4; size:4; signed:1; 85*623526baSSteven Rostedt 86*623526baSSteven Rostedt field:char prev_comm[16]; offset:8; size:16; signed:0; 87*623526baSSteven Rostedt field:pid_t prev_pid; offset:24; size:4; signed:1; 88*623526baSSteven Rostedt field:int prev_prio; offset:28; size:4; signed:1; 89*623526baSSteven Rostedt field:long prev_state; offset:32; size:8; signed:1; 90*623526baSSteven Rostedt field:char next_comm[16]; offset:40; size:16; signed:0; 91*623526baSSteven Rostedt field:pid_t next_pid; offset:56; size:4; signed:1; 92*623526baSSteven Rostedt field:int next_prio; offset:60; size:4; signed:1; 93*623526baSSteven Rostedt 94*623526baSSteven RostedtThe first four fields are common to all events and can not be limited. But the 95*623526baSSteven Rostedtrest of the event has 60 bytes of information. It records the names of the 96*623526baSSteven Rostedtprevious and next tasks being scheduled out and in, as well as their pids and 97*623526baSSteven Rostedtpriorities. It also records the state of the previous task. If only the pids 98*623526baSSteven Rostedtof the tasks are of interest, why waste the ring buffer with all the other 99*623526baSSteven Rostedtfields? 100*623526baSSteven Rostedt 101*623526baSSteven RostedtAn eprobe can limit what gets recorded. Note, it does not help in performance, 102*623526baSSteven Rostedtas all the fields are recorded in a temporary buffer to process the eprobe. 103*623526baSSteven Rostedt:: 104*623526baSSteven Rostedt 105*623526baSSteven Rostedt # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events 106*623526baSSteven Rostedt # echo 1 > /sys/kernel/tracing/events/sched/switch/enable 107*623526baSSteven Rostedt # cat /sys/kernel/tracing/trace 108*623526baSSteven Rostedt 109*623526baSSteven Rostedt # tracer: nop 110*623526baSSteven Rostedt # 111*623526baSSteven Rostedt # entries-in-buffer/entries-written: 2721/2721 #P:8 112*623526baSSteven Rostedt # 113*623526baSSteven Rostedt # _-----=> irqs-off/BH-disabled 114*623526baSSteven Rostedt # / _----=> need-resched 115*623526baSSteven Rostedt # | / _---=> hardirq/softirq 116*623526baSSteven Rostedt # || / _--=> preempt-depth 117*623526baSSteven Rostedt # ||| / _-=> migrate-disable 118*623526baSSteven Rostedt # |||| / delay 119*623526baSSteven Rostedt # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 120*623526baSSteven Rostedt # | | | ||||| | | 121*623526baSSteven Rostedt sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0 122*623526baSSteven Rostedt bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141 123*623526baSSteven Rostedt kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085 124*623526baSSteven Rostedt <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082 125*623526baSSteven Rostedt bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141 126*623526baSSteven Rostedt kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085 127*623526baSSteven Rostedt bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0 128*623526baSSteven Rostedt sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0 129*623526baSSteven Rostedt <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90 130*623526baSSteven Rostedt <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16 131*623526baSSteven Rostedt kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0 132*623526baSSteven Rostedt rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0 133*623526baSSteven Rostedt <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16 134*623526baSSteven Rostedt rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0 135*623526baSSteven Rostedt 136*623526baSSteven RostedtNote, without adding the "u32" after the prev_pid and next_pid, the values 137*623526baSSteven Rostedtwould default showing in hexadecimal. 138*623526baSSteven Rostedt 139*623526baSSteven RostedtExample 2 140*623526baSSteven Rostedt--------- 141*623526baSSteven Rostedt 142*623526baSSteven RostedtIf a specific system call is to be recorded but the syscalls events are not 143*623526baSSteven Rostedtenabled, the raw_syscalls can still be used (syscalls are system call 144*623526baSSteven Rostedtevents are not normal events, but are created from the raw_syscalls events 145*623526baSSteven Rostedtwithin the kernel). In order to trace the openat system call, one can create 146*623526baSSteven Rostedtan event probe on top of the raw_syscalls event: 147*623526baSSteven Rostedt:: 148*623526baSSteven Rostedt 149*623526baSSteven Rostedt # cd /sys/kernel/tracing 150*623526baSSteven Rostedt # cat events/raw_syscalls/sys_enter/format 151*623526baSSteven Rostedt name: sys_enter 152*623526baSSteven Rostedt ID: 395 153*623526baSSteven Rostedt format: 154*623526baSSteven Rostedt field:unsigned short common_type; offset:0; size:2; signed:0; 155*623526baSSteven Rostedt field:unsigned char common_flags; offset:2; size:1; signed:0; 156*623526baSSteven Rostedt field:unsigned char common_preempt_count; offset:3; size:1; signed:0; 157*623526baSSteven Rostedt field:int common_pid; offset:4; size:4; signed:1; 158*623526baSSteven Rostedt 159*623526baSSteven Rostedt field:long id; offset:8; size:8; signed:1; 160*623526baSSteven Rostedt field:unsigned long args[6]; offset:16; size:48; signed:0; 161*623526baSSteven Rostedt 162*623526baSSteven Rostedt print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5] 163*623526baSSteven Rostedt 164*623526baSSteven RostedtFrom the source code, the sys_openat() has: 165*623526baSSteven Rostedt:: 166*623526baSSteven Rostedt 167*623526baSSteven Rostedt int sys_openat(int dirfd, const char *path, int flags, mode_t mode) 168*623526baSSteven Rostedt { 169*623526baSSteven Rostedt return my_syscall4(__NR_openat, dirfd, path, flags, mode); 170*623526baSSteven Rostedt } 171*623526baSSteven Rostedt 172*623526baSSteven RostedtThe path is the second parameter, and that is what is wanted. 173*623526baSSteven Rostedt:: 174*623526baSSteven Rostedt 175*623526baSSteven Rostedt # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events 176*623526baSSteven Rostedt 177*623526baSSteven RostedtThis is being run on x86_64 where the word size is 8 bytes and the openat 178*623526baSSteven Rostedtsystem call __NR_openat is set at 257. 179*623526baSSteven Rostedt:: 180*623526baSSteven Rostedt 181*623526baSSteven Rostedt # echo 'nr == 257' > events/eprobes/openat/filter 182*623526baSSteven Rostedt 183*623526baSSteven RostedtNow enable the event and look at the trace. 184*623526baSSteven Rostedt:: 185*623526baSSteven Rostedt 186*623526baSSteven Rostedt # echo 1 > events/eprobes/openat/enable 187*623526baSSteven Rostedt # cat trace 188*623526baSSteven Rostedt 189*623526baSSteven Rostedt # tracer: nop 190*623526baSSteven Rostedt # 191*623526baSSteven Rostedt # entries-in-buffer/entries-written: 4/4 #P:8 192*623526baSSteven Rostedt # 193*623526baSSteven Rostedt # _-----=> irqs-off/BH-disabled 194*623526baSSteven Rostedt # / _----=> need-resched 195*623526baSSteven Rostedt # | / _---=> hardirq/softirq 196*623526baSSteven Rostedt # || / _--=> preempt-depth 197*623526baSSteven Rostedt # ||| / _-=> migrate-disable 198*623526baSSteven Rostedt # |||| / delay 199*623526baSSteven Rostedt # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 200*623526baSSteven Rostedt # | | | ||||| | | 201*623526baSSteven Rostedt cat-1298 [003] ...2. 2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 202*623526baSSteven Rostedt cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 203*623526baSSteven Rostedt cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 204*623526baSSteven Rostedt cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) 205*623526baSSteven Rostedt 206*623526baSSteven RostedtThe filename shows "(fault)". This is likely because the filename has not been 207*623526baSSteven Rostedtpulled into memory yet and currently trace events cannot fault in memory that 208*623526baSSteven Rostedtis not present. When an eprobe tries to read memory that has not been faulted 209*623526baSSteven Rostedtin yet, it will show the "(fault)" text. 210*623526baSSteven Rostedt 211*623526baSSteven RostedtTo get around this, as the kernel will likely pull in this filename and make 212*623526baSSteven Rostedtit present, attaching it to a synthetic event that can pass the address of the 213*623526baSSteven Rostedtfilename from the entry of the event to the end of the event, this can be used 214*623526baSSteven Rostedtto show the filename when the system call returns. 215*623526baSSteven Rostedt 216*623526baSSteven RostedtRemove the old eprobe:: 217*623526baSSteven Rostedt 218*623526baSSteven Rostedt # echo 1 > events/eprobes/openat/enable 219*623526baSSteven Rostedt # echo '-:openat' >> dynamic_events 220*623526baSSteven Rostedt 221*623526baSSteven RostedtThis time make an eprobe where the address of the filename is saved:: 222*623526baSSteven Rostedt 223*623526baSSteven Rostedt # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events 224*623526baSSteven Rostedt 225*623526baSSteven RostedtCreate a synthetic event that passes the address of the filename to the 226*623526baSSteven Rostedtend of the event:: 227*623526baSSteven Rostedt 228*623526baSSteven Rostedt # echo 's:filename u64 file' >> dynamic_events 229*623526baSSteven Rostedt # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger 230*623526baSSteven Rostedt # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger 231*623526baSSteven Rostedt 232*623526baSSteven RostedtNow that the address of the filename has been passed to the end of the 233*623526baSSteven Rostedtsystem call, create another eprobe to attach to the exit event to show the 234*623526baSSteven Rostedtstring:: 235*623526baSSteven Rostedt 236*623526baSSteven Rostedt # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events 237*623526baSSteven Rostedt # echo 1 > events/eprobes/openat/enable 238*623526baSSteven Rostedt # cat trace 239*623526baSSteven Rostedt 240*623526baSSteven Rostedt # tracer: nop 241*623526baSSteven Rostedt # 242*623526baSSteven Rostedt # entries-in-buffer/entries-written: 4/4 #P:8 243*623526baSSteven Rostedt # 244*623526baSSteven Rostedt # _-----=> irqs-off/BH-disabled 245*623526baSSteven Rostedt # / _----=> need-resched 246*623526baSSteven Rostedt # | / _---=> hardirq/softirq 247*623526baSSteven Rostedt # || / _--=> preempt-depth 248*623526baSSteven Rostedt # ||| / _-=> migrate-disable 249*623526baSSteven Rostedt # |||| / delay 250*623526baSSteven Rostedt # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 251*623526baSSteven Rostedt # | | | ||||| | | 252*623526baSSteven Rostedt cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache" 253*623526baSSteven Rostedt cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6" 254*623526baSSteven Rostedt cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive" 255*623526baSSteven Rostedt cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace" 256*623526baSSteven Rostedt 257*623526baSSteven RostedtExample 3 258*623526baSSteven Rostedt--------- 259*623526baSSteven Rostedt 260*623526baSSteven RostedtIf syscall trace events are available, the above would not need the first 261*623526baSSteven Rostedteprobe, but it would still need the last one:: 262*623526baSSteven Rostedt 263*623526baSSteven Rostedt # echo 's:filename u64 file' >> dynamic_events 264*623526baSSteven Rostedt # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger 265*623526baSSteven Rostedt # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger 266*623526baSSteven Rostedt # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events 267*623526baSSteven Rostedt # echo 1 > events/eprobes/openat/enable 268*623526baSSteven Rostedt 269*623526baSSteven RostedtAnd this would produce the same result as Example 2. 270