xref: /linux/Documentation/trace/eprobetrace.rst (revision d6f38c12396397e48092ad9e8a4d7be4de51b942)
1.. SPDX-License-Identifier: GPL-2.0
2
3==================================
4Eprobe - Event-based Probe Tracing
5==================================
6
7:Author: Steven Rostedt <rostedt@goodmis.org>
8
9- Written for v6.17
10
11Overview
12========
13
14Eprobes are dynamic events that are placed on existing events to either
15dereference a field that is a pointer, or simply to limit what fields are
16recorded in the trace event.
17
18Eprobes depend on kprobe events so to enable this feature, build your kernel
19with CONFIG_EPROBE_EVENTS=y.
20
21Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
22
23Synopsis of eprobe_events
24-------------------------
25::
26
27  e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]	: Set a probe
28  -:[EGRP/][EEVENT]				: Clear a probe
29
30 EGRP		: Group name of the new event. If omitted, use "eprobes" for it.
31 EEVENT		: Event name. If omitted, the event name is generated and will
32		  be the same event name as the event it attached to.
33 GRP		: Group name of the event to attach to.
34 EVENT		: Event name of the event to attach to.
35
36 FETCHARGS	: Arguments. Each probe can have up to 128 args.
37  $FIELD	: Fetch the value of the event field called FIELD.
38  @ADDR		: Fetch memory at ADDR (ADDR should be in kernel)
39  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
40  $comm		: Fetch current task comm.
41  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
42  \IMM		: Store an immediate value to the argument.
43  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
44  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
45		  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
46		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
47                  "string", "ustring", "symbol", "symstr" and "bitfield" are
48                  supported.
49
50Types
51-----
52The FETCHARGS above is very similar to the kprobe events as described in
53Documentation/trace/kprobetrace.rst.
54
55The difference between eprobes and kprobes FETCHARGS is that eprobes has a
56$FIELD command that returns the content of the event field of the event
57that is attached. Eprobes do not have access to registers, stacks and function
58arguments that kprobes has.
59
60If a field argument is a pointer, it may be dereferenced just like a memory
61address using the FETCHARGS syntax.
62
63
64Attaching to dynamic events
65---------------------------
66
67Eprobes may attach to dynamic events as well as to normal events. It may
68attach to a kprobe event, a synthetic event or a fprobe event. This is useful
69if the type of a field needs to be changed. See Example 2 below.
70
71Usage examples
72==============
73
74Example 1
75---------
76
77The basic usage of eprobes is to limit the data that is being recorded into
78the tracing buffer. For example, a common event to trace is the sched_switch
79trace event. That has a format of::
80
81	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
82	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
83	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
84	field:int common_pid;	offset:4;	size:4;	signed:1;
85
86	field:char prev_comm[16];	offset:8;	size:16;	signed:0;
87	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
88	field:int prev_prio;	offset:28;	size:4;	signed:1;
89	field:long prev_state;	offset:32;	size:8;	signed:1;
90	field:char next_comm[16];	offset:40;	size:16;	signed:0;
91	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
92	field:int next_prio;	offset:60;	size:4;	signed:1;
93
94The first four fields are common to all events and can not be limited. But the
95rest of the event has 60 bytes of information. It records the names of the
96previous and next tasks being scheduled out and in, as well as their pids and
97priorities. It also records the state of the previous task. If only the pids
98of the tasks are of interest, why waste the ring buffer with all the other
99fields?
100
101An eprobe can limit what gets recorded. Note, it does not help in performance,
102as all the fields are recorded in a temporary buffer to process the eprobe.
103::
104
105 # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
106 # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
107 # cat /sys/kernel/tracing/trace
108
109 # tracer: nop
110 #
111 # entries-in-buffer/entries-written: 2721/2721   #P:8
112 #
113 #                                _-----=> irqs-off/BH-disabled
114 #                               / _----=> need-resched
115 #                              | / _---=> hardirq/softirq
116 #                              || / _--=> preempt-depth
117 #                              ||| / _-=> migrate-disable
118 #                              |||| /     delay
119 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
120 #              | |         |   |||||     |         |
121     sshd-session-1082    [004] d..4.  5041.239906: switch: (sched.sched_switch) prev=1082 next=0
122             bash-1085    [001] d..4.  5041.240198: switch: (sched.sched_switch) prev=1085 next=141
123    kworker/u34:5-141     [001] d..4.  5041.240259: switch: (sched.sched_switch) prev=141 next=1085
124           <idle>-0       [004] d..4.  5041.240354: switch: (sched.sched_switch) prev=0 next=1082
125             bash-1085    [001] d..4.  5041.240385: switch: (sched.sched_switch) prev=1085 next=141
126    kworker/u34:5-141     [001] d..4.  5041.240410: switch: (sched.sched_switch) prev=141 next=1085
127             bash-1085    [001] d..4.  5041.240478: switch: (sched.sched_switch) prev=1085 next=0
128     sshd-session-1082    [004] d..4.  5041.240526: switch: (sched.sched_switch) prev=1082 next=0
129           <idle>-0       [001] d..4.  5041.247524: switch: (sched.sched_switch) prev=0 next=90
130           <idle>-0       [002] d..4.  5041.247545: switch: (sched.sched_switch) prev=0 next=16
131      kworker/1:1-90      [001] d..4.  5041.247580: switch: (sched.sched_switch) prev=90 next=0
132        rcu_sched-16      [002] d..4.  5041.247591: switch: (sched.sched_switch) prev=16 next=0
133           <idle>-0       [002] d..4.  5041.257536: switch: (sched.sched_switch) prev=0 next=16
134        rcu_sched-16      [002] d..4.  5041.257573: switch: (sched.sched_switch) prev=16 next=0
135
136Note, without adding the "u32" after the prev_pid and next_pid, the values
137would default showing in hexadecimal.
138
139Example 2
140---------
141
142If a specific system call is to be recorded but the syscalls events are not
143enabled, the raw_syscalls can still be used (syscalls are system call
144events are not normal events, but are created from the raw_syscalls events
145within the kernel). In order to trace the openat system call, one can create
146an event probe on top of the raw_syscalls event:
147::
148
149 # cd /sys/kernel/tracing
150 # cat events/raw_syscalls/sys_enter/format
151 name: sys_enter
152 ID: 395
153 format:
154	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
155	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
156	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
157	field:int common_pid;	offset:4;	size:4;	signed:1;
158
159	field:long id;	offset:8;	size:8;	signed:1;
160	field:unsigned long args[6];	offset:16;	size:48;	signed:0;
161
162 print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
163
164From the source code, the sys_openat() has:
165::
166
167 int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
168 {
169	return my_syscall4(__NR_openat, dirfd, path, flags, mode);
170 }
171
172The path is the second parameter, and that is what is wanted.
173::
174
175 # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
176
177This is being run on x86_64 where the word size is 8 bytes and the openat
178system call __NR_openat is set at 257.
179::
180
181 # echo 'nr == 257' > events/eprobes/openat/filter
182
183Now enable the event and look at the trace.
184::
185
186 # echo 1 > events/eprobes/openat/enable
187 # cat trace
188
189 # tracer: nop
190 #
191 # entries-in-buffer/entries-written: 4/4   #P:8
192 #
193 #                                _-----=> irqs-off/BH-disabled
194 #                               / _----=> need-resched
195 #                              | / _---=> hardirq/softirq
196 #                              || / _--=> preempt-depth
197 #                              ||| / _-=> migrate-disable
198 #                              |||| /     delay
199 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
200 #              | |         |   |||||     |         |
201              cat-1298    [003] ...2.  2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
202              cat-1298    [003] ...2.  2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
203              cat-1298    [003] ...2.  2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
204              cat-1298    [003] ...2.  2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
205
206The filename shows "(fault)". This is likely because the filename has not been
207pulled into memory yet and currently trace events cannot fault in memory that
208is not present. When an eprobe tries to read memory that has not been faulted
209in yet, it will show the "(fault)" text.
210
211To get around this, as the kernel will likely pull in this filename and make
212it present, attaching it to a synthetic event that can pass the address of the
213filename from the entry of the event to the end of the event, this can be used
214to show the filename when the system call returns.
215
216Remove the old eprobe::
217
218 # echo 1 > events/eprobes/openat/enable
219 # echo '-:openat' >> dynamic_events
220
221This time make an eprobe where the address of the filename is saved::
222
223 # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
224
225Create a synthetic event that passes the address of the filename to the
226end of the event::
227
228 # echo 's:filename u64 file' >> dynamic_events
229 # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
230 # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
231
232Now that the address of the filename has been passed to the end of the
233system call, create another eprobe to attach to the exit event to show the
234string::
235
236 # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
237 # echo 1 > events/eprobes/openat/enable
238 # cat trace
239
240 # tracer: nop
241 #
242 # entries-in-buffer/entries-written: 4/4   #P:8
243 #
244 #                                _-----=> irqs-off/BH-disabled
245 #                               / _----=> need-resched
246 #                              | / _---=> hardirq/softirq
247 #                              || / _--=> preempt-depth
248 #                              ||| / _-=> migrate-disable
249 #                              |||| /     delay
250 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
251 #              | |         |   |||||     |         |
252              cat-1331    [001] ...5.  2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
253              cat-1331    [001] ...5.  2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
254              cat-1331    [001] ...5.  2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
255              cat-1331    [001] ...5.  2944.831362: openat: (synthetic.filename) filename="trace"
256
257Example 3
258---------
259
260If syscall trace events are available, the above would not need the first
261eprobe, but it would still need the last one::
262
263 # echo 's:filename u64 file' >> dynamic_events
264 # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
265 # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
266 # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
267 # echo 1 > events/eprobes/openat/enable
268
269And this would produce the same result as Example 2.
270