xref: /linux/Documentation/trace/fprobetrace.rst (revision 0678df8271820bcf8fb4f877129f05d68a237de4)
1.. SPDX-License-Identifier: GPL-2.0
2
3==========================
4Fprobe-based Event Tracing
5==========================
6
7.. Author: Masami Hiramatsu <mhiramat@kernel.org>
8
9Overview
10--------
11
12Fprobe event is similar to the kprobe event, but limited to probe on
13the function entry and exit only. It is good enough for many use cases
14which only traces some specific functions.
15
16This document also covers tracepoint probe events (tprobe) since this
17is also works only on the tracepoint entry. User can trace a part of
18tracepoint argument, or the tracepoint without trace-event, which is
19not exposed on tracefs.
20
21As same as other dynamic events, fprobe events and tracepoint probe
22events are defined via `dynamic_events` interface file on tracefs.
23
24Synopsis of fprobe-events
25-------------------------
26::
27
28  f[:[GRP1/][EVENT1]] SYM [FETCHARGS]                       : Probe on function entry
29  f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS]     : Probe on function exit
30  t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS]                : Probe on tracepoint
31
32 GRP1           : Group name for fprobe. If omitted, use "fprobes" for it.
33 GRP2           : Group name for tprobe. If omitted, use "tracepoints" for it.
34 EVENT1         : Event name for fprobe. If omitted, the event name is
35                  "SYM__entry" or "SYM__exit".
36 EVENT2         : Event name for tprobe. If omitted, the event name is
37                  the same as "TRACEPOINT", but if the "TRACEPOINT" starts
38                  with a digit character, "_TRACEPOINT" is used.
39 MAXACTIVE      : Maximum number of instances of the specified function that
40                  can be probed simultaneously, or 0 for the default value
41                  as defined in Documentation/trace/fprobe.rst
42
43 FETCHARGS      : Arguments. Each probe can have up to 128 args.
44  ARG           : Fetch "ARG" function argument using BTF (only for function
45                  entry or tracepoint.) (\*1)
46  @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
47  @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
48  $stackN       : Fetch Nth entry of stack (N >= 0)
49  $stack        : Fetch stack address.
50  $argN         : Fetch the Nth function argument. (N >= 1) (\*2)
51  $retval       : Fetch return value.(\*3)
52  $comm         : Fetch current task comm.
53  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
54  \IMM          : Store an immediate value to the argument.
55  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
56  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
57                  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
58                  (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
59                  and bitfield are supported.
60
61  (\*1) This is available only when BTF is enabled.
62  (\*2) only for the probe on function entry (offs == 0). Note, this argument access
63        is best effort, because depending on the argument type, it may be passed on
64        the stack. But this only support the arguments via registers.
65  (\*3) only for return probe. Note that this is also best effort. Depending on the
66        return value type, it might be passed via a pair of registers. But this only
67        accesses one register.
68  (\*4) this is useful for fetching a field of data structures.
69  (\*5) "u" means user-space dereference.
70
71For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`.
72
73BTF arguments
74-------------
75BTF (BPF Type Format) argument allows user to trace function and tracepoint
76parameters by its name instead of ``$argN``. This feature is available if the
77kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF.
78If user only specify the BTF argument, the event's argument name is also
79automatically set by the given name. ::
80
81 # echo 'f:myprobe vfs_read count pos' >> dynamic_events
82 # cat dynamic_events
83 f:fprobes/myprobe vfs_read count=count pos=pos
84
85It also chooses the fetch type from BTF information. For example, in the above
86example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus,
87both are converted to 64bit unsigned long, but only ``pos`` has "%Lx"
88print-format as below ::
89
90 # cat events/fprobes/myprobe/format
91 name: myprobe
92 ID: 1313
93 format:
94	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
95	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
96	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
97	field:int common_pid;	offset:4;	size:4;	signed:1;
98
99	field:unsigned long __probe_ip;	offset:8;	size:8;	signed:0;
100	field:u64 count;	offset:16;	size:8;	signed:0;
101	field:u64 pos;	offset:24;	size:8;	signed:0;
102
103 print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos
104
105If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*``
106is expanded to all function arguments of the function or the tracepoint. ::
107
108 # echo 'f:myprobe vfs_read $arg*' >> dynamic_events
109 # cat dynamic_events
110 f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos
111
112BTF also affects the ``$retval``. If user doesn't set any type, the retval
113type is automatically picked from the BTF. If the function returns ``void``,
114``$retval`` is rejected.
115
116You can access the data fields of a data structure using allow operator ``->``
117(for pointer type) and dot operator ``.`` (for data structure type.)::
118
119# echo 't sched_switch preempt prev_pid=prev->pid next_pid=next->pid' >> dynamic_events
120
121The field access operators, ``->`` and ``.`` can be combined for accessing deeper
122members and other structure members pointed by the member. e.g. ``foo->bar.baz->qux``
123If there is non-name union member, you can directly access it as the C code does.
124For example::
125
126 struct {
127	union {
128	int a;
129	int b;
130	};
131 } *foo;
132
133To access ``a`` and ``b``, use ``foo->a`` and ``foo->b`` in this case.
134
135This data field access is available for the return value via ``$retval``,
136e.g. ``$retval->name``.
137
138For these BTF arguments and fields, ``:string`` and ``:ustring`` change the
139behavior. If these are used for BTF argument or field, it checks whether
140the BTF type of the argument or the data field is ``char *`` or ``char []``,
141or not.  If not, it rejects applying the string types. Also, with the BTF
142support, you don't need a memory dereference operator (``+0(PTR)``) for
143accessing the string pointed by a ``PTR``. It automatically adds the memory
144dereference operator according to the BTF type. e.g. ::
145
146# echo 't sched_switch prev->comm:string' >> dynamic_events
147# echo 'f getname_flags%return $retval->name:string' >> dynamic_events
148
149The ``prev->comm`` is an embedded char array in the data structure, and
150``$retval->name`` is a char pointer in the data structure. But in both
151cases, you can use ``:string`` type to get the string.
152
153
154Usage examples
155--------------
156Here is an example to add fprobe events on ``vfs_read()`` function entry
157and exit, with BTF arguments.
158::
159
160  # echo 'f vfs_read $arg*' >> dynamic_events
161  # echo 'f vfs_read%return $retval' >> dynamic_events
162  # cat dynamic_events
163 f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
164 f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
165  # echo 1 > events/fprobes/enable
166  # head -n 20 trace | tail
167 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
168 #              | |         |   |||||     |         |
169               sh-70      [000] ...1.   335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
170               sh-70      [000] .....   335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
171               sh-70      [000] ...1.   335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
172               sh-70      [000] .....   335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
173               sh-70      [000] ...1.   335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
174               sh-70      [000] .....   335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
175               sh-70      [000] ...1.   336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
176               sh-70      [000] .....   336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
177
178You can see all function arguments and return values are recorded as signed int.
179
180Also, here is an example of tracepoint events on ``sched_switch`` tracepoint.
181To compare the result, this also enables the ``sched_switch`` traceevent too.
182::
183
184  # echo 't sched_switch $arg*' >> dynamic_events
185  # echo 1 > events/sched/sched_switch/enable
186  # echo 1 > events/tracepoints/sched_switch/enable
187  # echo > trace
188  # head -n 20 trace | tail
189 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
190 #              | |         |   |||||     |         |
191               sh-70      [000] d..2.  3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
192               sh-70      [000] d..3.  3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
193           <idle>-0       [000] d..2.  3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
194           <idle>-0       [000] d..3.  3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
195      rcu_preempt-16      [000] d..2.  3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
196      rcu_preempt-16      [000] d..3.  3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
197           <idle>-0       [000] d..2.  3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
198           <idle>-0       [000] d..3.  3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
199
200As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on
201the other hand, the ``sched_switch`` tracepoint probe event shows *raw*
202parameters. This means you can access any field values in the task
203structure pointed by the ``prev`` and ``next`` arguments.
204
205For example, usually ``task_struct::start_time`` is not traced, but with this
206traceprobe event, you can trace that field as below.
207::
208
209  # echo 't sched_switch comm=next->comm:string next->start_time' > dynamic_events
210  # head -n 20 trace | tail
211 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
212 #              | |         |   |||||     |         |
213               sh-70      [000] d..3.  5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
214      rcu_preempt-16      [000] d..3.  5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
215               sh-70      [000] d..3.  5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
216           <idle>-0       [000] d..3.  5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
217      rcu_preempt-16      [000] d..3.  5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
218           <idle>-0       [000] d..3.  5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
219      kworker/0:1-14      [000] d..3.  5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
220           <idle>-0       [000] d..3.  5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
221