xref: /linux/Documentation/trace/events.rst (revision e8744fbc83188693f3590020b14d50df3387fc5a)
1=============
2Event Tracing
3=============
4
5:Author: Theodore Ts'o
6:Updated: Li Zefan and Tom Zanussi
7
81. Introduction
9===============
10
11Tracepoints (see Documentation/trace/tracepoints.rst) can be used
12without creating custom kernel modules to register probe functions
13using the event tracing infrastructure.
14
15Not all tracepoints can be traced using the event tracing system;
16the kernel developer must provide code snippets which define how the
17tracing information is saved into the tracing buffer, and how the
18tracing information should be printed.
19
202. Using Event Tracing
21======================
22
232.1 Via the 'set_event' interface
24---------------------------------
25
26The events which are available for tracing can be found in the file
27/sys/kernel/tracing/available_events.
28
29To enable a particular event, such as 'sched_wakeup', simply echo it
30to /sys/kernel/tracing/set_event. For example::
31
32	# echo sched_wakeup >> /sys/kernel/tracing/set_event
33
34.. Note:: '>>' is necessary, otherwise it will firstly disable all the events.
35
36To disable an event, echo the event name to the set_event file prefixed
37with an exclamation point::
38
39	# echo '!sched_wakeup' >> /sys/kernel/tracing/set_event
40
41To disable all events, echo an empty line to the set_event file::
42
43	# echo > /sys/kernel/tracing/set_event
44
45To enable all events, echo ``*:*`` or ``*:`` to the set_event file::
46
47	# echo *:* > /sys/kernel/tracing/set_event
48
49The events are organized into subsystems, such as ext4, irq, sched,
50etc., and a full event name looks like this: <subsystem>:<event>.  The
51subsystem name is optional, but it is displayed in the available_events
52file.  All of the events in a subsystem can be specified via the syntax
53``<subsystem>:*``; for example, to enable all irq events, you can use the
54command::
55
56	# echo 'irq:*' > /sys/kernel/tracing/set_event
57
58The set_event file may also be used to enable events associated to only
59a specific module::
60
61	# echo ':mod:<module>' > /sys/kernel/tracing/set_event
62
63Will enable all events in the module ``<module>``.  If the module is not yet
64loaded, the string will be saved and when a module is that matches ``<module>``
65is loaded, then it will apply the enabling of events then.
66
67The text before ``:mod:`` will be parsed to specify specific events that the
68module creates::
69
70	# echo '<match>:mod:<module>' > /sys/kernel/tracing/set_event
71
72The above will enable any system or event that ``<match>`` matches. If
73``<match>`` is ``"*"`` then it will match all events.
74
75To enable only a specific event within a system::
76
77	# echo '<system>:<event>:mod:<module>' > /sys/kernel/tracing/set_event
78
79If ``<event>`` is ``"*"`` then it will match all events within the system
80for a given module.
81
822.2 Via the 'enable' toggle
83---------------------------
84
85The events available are also listed in /sys/kernel/tracing/events/ hierarchy
86of directories.
87
88To enable event 'sched_wakeup'::
89
90	# echo 1 > /sys/kernel/tracing/events/sched/sched_wakeup/enable
91
92To disable it::
93
94	# echo 0 > /sys/kernel/tracing/events/sched/sched_wakeup/enable
95
96To enable all events in sched subsystem::
97
98	# echo 1 > /sys/kernel/tracing/events/sched/enable
99
100To enable all events::
101
102	# echo 1 > /sys/kernel/tracing/events/enable
103
104When reading one of these enable files, there are four results:
105
106 - 0 - all events this file affects are disabled
107 - 1 - all events this file affects are enabled
108 - X - there is a mixture of events enabled and disabled
109 - ? - this file does not affect any event
110
1112.3 Boot option
112---------------
113
114In order to facilitate early boot debugging, use boot option::
115
116	trace_event=[event-list]
117
118event-list is a comma separated list of events. See section 2.1 for event
119format.
120
1213. Defining an event-enabled tracepoint
122=======================================
123
124See The example provided in samples/trace_events
125
1264. Event formats
127================
128
129Each trace event has a 'format' file associated with it that contains
130a description of each field in a logged event.  This information can
131be used to parse the binary trace stream, and is also the place to
132find the field names that can be used in event filters (see section 5).
133
134It also displays the format string that will be used to print the
135event in text mode, along with the event name and ID used for
136profiling.
137
138Every event has a set of ``common`` fields associated with it; these are
139the fields prefixed with ``common_``.  The other fields vary between
140events and correspond to the fields defined in the TRACE_EVENT
141definition for that event.
142
143Each field in the format has the form::
144
145     field:field-type field-name; offset:N; size:N;
146
147where offset is the offset of the field in the trace record and size
148is the size of the data item, in bytes.
149
150For example, here's the information displayed for the 'sched_wakeup'
151event::
152
153	# cat /sys/kernel/tracing/events/sched/sched_wakeup/format
154
155	name: sched_wakeup
156	ID: 60
157	format:
158		field:unsigned short common_type;	offset:0;	size:2;
159		field:unsigned char common_flags;	offset:2;	size:1;
160		field:unsigned char common_preempt_count;	offset:3;	size:1;
161		field:int common_pid;	offset:4;	size:4;
162		field:int common_tgid;	offset:8;	size:4;
163
164		field:char comm[TASK_COMM_LEN];	offset:12;	size:16;
165		field:pid_t pid;	offset:28;	size:4;
166		field:int prio;	offset:32;	size:4;
167		field:int success;	offset:36;	size:4;
168		field:int cpu;	offset:40;	size:4;
169
170	print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
171		   REC->prio, REC->success, REC->cpu
172
173This event contains 10 fields, the first 5 common and the remaining 5
174event-specific.  All the fields for this event are numeric, except for
175'comm' which is a string, a distinction important for event filtering.
176
1775. Event filtering
178==================
179
180Trace events can be filtered in the kernel by associating boolean
181'filter expressions' with them.  As soon as an event is logged into
182the trace buffer, its fields are checked against the filter expression
183associated with that event type.  An event with field values that
184'match' the filter will appear in the trace output, and an event whose
185values don't match will be discarded.  An event with no filter
186associated with it matches everything, and is the default when no
187filter has been set for an event.
188
1895.1 Expression syntax
190---------------------
191
192A filter expression consists of one or more 'predicates' that can be
193combined using the logical operators '&&' and '||'.  A predicate is
194simply a clause that compares the value of a field contained within a
195logged event with a constant value and returns either 0 or 1 depending
196on whether the field value matched (1) or didn't match (0)::
197
198	  field-name relational-operator value
199
200Parentheses can be used to provide arbitrary logical groupings and
201double-quotes can be used to prevent the shell from interpreting
202operators as shell metacharacters.
203
204The field-names available for use in filters can be found in the
205'format' files for trace events (see section 4).
206
207The relational-operators depend on the type of the field being tested:
208
209The operators available for numeric fields are:
210
211==, !=, <, <=, >, >=, &
212
213And for string fields they are:
214
215==, !=, ~
216
217The glob (~) accepts a wild card character (\*,?) and character classes
218([). For example::
219
220  prev_comm ~ "*sh"
221  prev_comm ~ "sh*"
222  prev_comm ~ "*sh*"
223  prev_comm ~ "ba*sh"
224
225If the field is a pointer that points into user space (for example
226"filename" from sys_enter_openat), then you have to append ".ustring" to the
227field name::
228
229  filename.ustring ~ "password"
230
231As the kernel will have to know how to retrieve the memory that the pointer
232is at from user space.
233
234You can convert any long type to a function address and search by function name::
235
236  call_site.function == security_prepare_creds
237
238The above will filter when the field "call_site" falls on the address within
239"security_prepare_creds". That is, it will compare the value of "call_site" and
240the filter will return true if it is greater than or equal to the start of
241the function "security_prepare_creds" and less than the end of that function.
242
243The ".function" postfix can only be attached to values of size long, and can only
244be compared with "==" or "!=".
245
246Cpumask fields or scalar fields that encode a CPU number can be filtered using
247a user-provided cpumask in cpulist format. The format is as follows::
248
249  CPUS{$cpulist}
250
251Operators available to cpumask filtering are:
252
253& (intersection), ==, !=
254
255For example, this will filter events that have their .target_cpu field present
256in the given cpumask::
257
258  target_cpu & CPUS{17-42}
259
2605.2 Setting filters
261-------------------
262
263A filter for an individual event is set by writing a filter expression
264to the 'filter' file for the given event.
265
266For example::
267
268	# cd /sys/kernel/tracing/events/sched/sched_wakeup
269	# echo "common_preempt_count > 4" > filter
270
271A slightly more involved example::
272
273	# cd /sys/kernel/tracing/events/signal/signal_generate
274	# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
275
276If there is an error in the expression, you'll get an 'Invalid
277argument' error when setting it, and the erroneous string along with
278an error message can be seen by looking at the filter e.g.::
279
280	# cd /sys/kernel/tracing/events/signal/signal_generate
281	# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
282	-bash: echo: write error: Invalid argument
283	# cat filter
284	((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
285	^
286	parse_error: Field not found
287
288Currently the caret ('^') for an error always appears at the beginning of
289the filter string; the error message should still be useful though
290even without more accurate position info.
291
2925.2.1 Filter limitations
293------------------------
294
295If a filter is placed on a string pointer ``(char *)`` that does not point
296to a string on the ring buffer, but instead points to kernel or user space
297memory, then, for safety reasons, at most 1024 bytes of the content is
298copied onto a temporary buffer to do the compare. If the copy of the memory
299faults (the pointer points to memory that should not be accessed), then the
300string compare will be treated as not matching.
301
3025.3 Clearing filters
303--------------------
304
305To clear the filter for an event, write a '0' to the event's filter
306file.
307
308To clear the filters for all events in a subsystem, write a '0' to the
309subsystem's filter file.
310
3115.4 Subsystem filters
312---------------------
313
314For convenience, filters for every event in a subsystem can be set or
315cleared as a group by writing a filter expression into the filter file
316at the root of the subsystem.  Note however, that if a filter for any
317event within the subsystem lacks a field specified in the subsystem
318filter, or if the filter can't be applied for any other reason, the
319filter for that event will retain its previous setting.  This can
320result in an unintended mixture of filters which could lead to
321confusing (to the user who might think different filters are in
322effect) trace output.  Only filters that reference just the common
323fields can be guaranteed to propagate successfully to all events.
324
325Here are a few subsystem filter examples that also illustrate the
326above points:
327
328Clear the filters on all events in the sched subsystem::
329
330	# cd /sys/kernel/tracing/events/sched
331	# echo 0 > filter
332	# cat sched_switch/filter
333	none
334	# cat sched_wakeup/filter
335	none
336
337Set a filter using only common fields for all events in the sched
338subsystem (all events end up with the same filter)::
339
340	# cd /sys/kernel/tracing/events/sched
341	# echo common_pid == 0 > filter
342	# cat sched_switch/filter
343	common_pid == 0
344	# cat sched_wakeup/filter
345	common_pid == 0
346
347Attempt to set a filter using a non-common field for all events in the
348sched subsystem (all events but those that have a prev_pid field retain
349their old filters)::
350
351	# cd /sys/kernel/tracing/events/sched
352	# echo prev_pid == 0 > filter
353	# cat sched_switch/filter
354	prev_pid == 0
355	# cat sched_wakeup/filter
356	common_pid == 0
357
3585.5 PID filtering
359-----------------
360
361The set_event_pid file in the same directory as the top events directory
362exists, will filter all events from tracing any task that does not have the
363PID listed in the set_event_pid file.
364::
365
366	# cd /sys/kernel/tracing
367	# echo $$ > set_event_pid
368	# echo 1 > events/enable
369
370Will only trace events for the current task.
371
372To add more PIDs without losing the PIDs already included, use '>>'.
373::
374
375	# echo 123 244 1 >> set_event_pid
376
377
3786. Event triggers
379=================
380
381Trace events can be made to conditionally invoke trigger 'commands'
382which can take various forms and are described in detail below;
383examples would be enabling or disabling other trace events or invoking
384a stack trace whenever the trace event is hit.  Whenever a trace event
385with attached triggers is invoked, the set of trigger commands
386associated with that event is invoked.  Any given trigger can
387additionally have an event filter of the same form as described in
388section 5 (Event filtering) associated with it - the command will only
389be invoked if the event being invoked passes the associated filter.
390If no filter is associated with the trigger, it always passes.
391
392Triggers are added to and removed from a particular event by writing
393trigger expressions to the 'trigger' file for the given event.
394
395A given event can have any number of triggers associated with it,
396subject to any restrictions that individual commands may have in that
397regard.
398
399Event triggers are implemented on top of "soft" mode, which means that
400whenever a trace event has one or more triggers associated with it,
401the event is activated even if it isn't actually enabled, but is
402disabled in a "soft" mode.  That is, the tracepoint will be called,
403but just will not be traced, unless of course it's actually enabled.
404This scheme allows triggers to be invoked even for events that aren't
405enabled, and also allows the current event filter implementation to be
406used for conditionally invoking triggers.
407
408The syntax for event triggers is roughly based on the syntax for
409set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands'
410section of Documentation/trace/ftrace.rst), but there are major
411differences and the implementation isn't currently tied to it in any
412way, so beware about making generalizations between the two.
413
414.. Note::
415     Writing into trace_marker (See Documentation/trace/ftrace.rst)
416     can also enable triggers that are written into
417     /sys/kernel/tracing/events/ftrace/print/trigger
418
4196.1 Expression syntax
420---------------------
421
422Triggers are added by echoing the command to the 'trigger' file::
423
424  # echo 'command[:count] [if filter]' > trigger
425
426Triggers are removed by echoing the same command but starting with '!'
427to the 'trigger' file::
428
429  # echo '!command[:count] [if filter]' > trigger
430
431The [if filter] part isn't used in matching commands when removing, so
432leaving that off in a '!' command will accomplish the same thing as
433having it in.
434
435The filter syntax is the same as that described in the 'Event
436filtering' section above.
437
438For ease of use, writing to the trigger file using '>' currently just
439adds or removes a single trigger and there's no explicit '>>' support
440('>' actually behaves like '>>') or truncation support to remove all
441triggers (you have to use '!' for each one added.)
442
4436.2 Supported trigger commands
444------------------------------
445
446The following commands are supported:
447
448- enable_event/disable_event
449
450  These commands can enable or disable another trace event whenever
451  the triggering event is hit.  When these commands are registered,
452  the other trace event is activated, but disabled in a "soft" mode.
453  That is, the tracepoint will be called, but just will not be traced.
454  The event tracepoint stays in this mode as long as there's a trigger
455  in effect that can trigger it.
456
457  For example, the following trigger causes kmalloc events to be
458  traced when a read system call is entered, and the :1 at the end
459  specifies that this enablement happens only once::
460
461	  # echo 'enable_event:kmem:kmalloc:1' > \
462	      /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
463
464  The following trigger causes kmalloc events to stop being traced
465  when a read system call exits.  This disablement happens on every
466  read system call exit::
467
468	  # echo 'disable_event:kmem:kmalloc' > \
469	      /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger
470
471  The format is::
472
473      enable_event:<system>:<event>[:count]
474      disable_event:<system>:<event>[:count]
475
476  To remove the above commands::
477
478	  # echo '!enable_event:kmem:kmalloc:1' > \
479	      /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
480
481	  # echo '!disable_event:kmem:kmalloc' > \
482	      /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger
483
484  Note that there can be any number of enable/disable_event triggers
485  per triggering event, but there can only be one trigger per
486  triggered event. e.g. sys_enter_read can have triggers enabling both
487  kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc
488  versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if
489  bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they
490  could be combined into a single filter on kmem:kmalloc though).
491
492- stacktrace
493
494  This command dumps a stacktrace in the trace buffer whenever the
495  triggering event occurs.
496
497  For example, the following trigger dumps a stacktrace every time the
498  kmalloc tracepoint is hit::
499
500	  # echo 'stacktrace' > \
501		/sys/kernel/tracing/events/kmem/kmalloc/trigger
502
503  The following trigger dumps a stacktrace the first 5 times a kmalloc
504  request happens with a size >= 64K::
505
506	  # echo 'stacktrace:5 if bytes_req >= 65536' > \
507		/sys/kernel/tracing/events/kmem/kmalloc/trigger
508
509  The format is::
510
511      stacktrace[:count]
512
513  To remove the above commands::
514
515	  # echo '!stacktrace' > \
516		/sys/kernel/tracing/events/kmem/kmalloc/trigger
517
518	  # echo '!stacktrace:5 if bytes_req >= 65536' > \
519		/sys/kernel/tracing/events/kmem/kmalloc/trigger
520
521  The latter can also be removed more simply by the following (without
522  the filter)::
523
524	  # echo '!stacktrace:5' > \
525		/sys/kernel/tracing/events/kmem/kmalloc/trigger
526
527  Note that there can be only one stacktrace trigger per triggering
528  event.
529
530- snapshot
531
532  This command causes a snapshot to be triggered whenever the
533  triggering event occurs.
534
535  The following command creates a snapshot every time a block request
536  queue is unplugged with a depth > 1.  If you were tracing a set of
537  events or functions at the time, the snapshot trace buffer would
538  capture those events when the trigger event occurred::
539
540	  # echo 'snapshot if nr_rq > 1' > \
541		/sys/kernel/tracing/events/block/block_unplug/trigger
542
543  To only snapshot once::
544
545	  # echo 'snapshot:1 if nr_rq > 1' > \
546		/sys/kernel/tracing/events/block/block_unplug/trigger
547
548  To remove the above commands::
549
550	  # echo '!snapshot if nr_rq > 1' > \
551		/sys/kernel/tracing/events/block/block_unplug/trigger
552
553	  # echo '!snapshot:1 if nr_rq > 1' > \
554		/sys/kernel/tracing/events/block/block_unplug/trigger
555
556  Note that there can be only one snapshot trigger per triggering
557  event.
558
559- traceon/traceoff
560
561  These commands turn tracing on and off when the specified events are
562  hit. The parameter determines how many times the tracing system is
563  turned on and off. If unspecified, there is no limit.
564
565  The following command turns tracing off the first time a block
566  request queue is unplugged with a depth > 1.  If you were tracing a
567  set of events or functions at the time, you could then examine the
568  trace buffer to see the sequence of events that led up to the
569  trigger event::
570
571	  # echo 'traceoff:1 if nr_rq > 1' > \
572		/sys/kernel/tracing/events/block/block_unplug/trigger
573
574  To always disable tracing when nr_rq  > 1::
575
576	  # echo 'traceoff if nr_rq > 1' > \
577		/sys/kernel/tracing/events/block/block_unplug/trigger
578
579  To remove the above commands::
580
581	  # echo '!traceoff:1 if nr_rq > 1' > \
582		/sys/kernel/tracing/events/block/block_unplug/trigger
583
584	  # echo '!traceoff if nr_rq > 1' > \
585		/sys/kernel/tracing/events/block/block_unplug/trigger
586
587  Note that there can be only one traceon or traceoff trigger per
588  triggering event.
589
590- hist
591
592  This command aggregates event hits into a hash table keyed on one or
593  more trace event format fields (or stacktrace) and a set of running
594  totals derived from one or more trace event format fields and/or
595  event counts (hitcount).
596
597  See Documentation/trace/histogram.rst for details and examples.
598
5997. In-kernel trace event API
600============================
601
602In most cases, the command-line interface to trace events is more than
603sufficient.  Sometimes, however, applications might find the need for
604more complex relationships than can be expressed through a simple
605series of linked command-line expressions, or putting together sets of
606commands may be simply too cumbersome.  An example might be an
607application that needs to 'listen' to the trace stream in order to
608maintain an in-kernel state machine detecting, for instance, when an
609illegal kernel state occurs in the scheduler.
610
611The trace event subsystem provides an in-kernel API allowing modules
612or other kernel code to generate user-defined 'synthetic' events at
613will, which can be used to either augment the existing trace stream
614and/or signal that a particular important state has occurred.
615
616A similar in-kernel API is also available for creating kprobe and
617kretprobe events.
618
619Both the synthetic event and k/ret/probe event APIs are built on top
620of a lower-level "dynevent_cmd" event command API, which is also
621available for more specialized applications, or as the basis of other
622higher-level trace event APIs.
623
624The API provided for these purposes is describe below and allows the
625following:
626
627  - dynamically creating synthetic event definitions
628  - dynamically creating kprobe and kretprobe event definitions
629  - tracing synthetic events from in-kernel code
630  - the low-level "dynevent_cmd" API
631
6327.1 Dyamically creating synthetic event definitions
633---------------------------------------------------
634
635There are a couple ways to create a new synthetic event from a kernel
636module or other kernel code.
637
638The first creates the event in one step, using synth_event_create().
639In this method, the name of the event to create and an array defining
640the fields is supplied to synth_event_create().  If successful, a
641synthetic event with that name and fields will exist following that
642call.  For example, to create a new "schedtest" synthetic event::
643
644  ret = synth_event_create("schedtest", sched_fields,
645                           ARRAY_SIZE(sched_fields), THIS_MODULE);
646
647The sched_fields param in this example points to an array of struct
648synth_field_desc, each of which describes an event field by type and
649name::
650
651  static struct synth_field_desc sched_fields[] = {
652        { .type = "pid_t",              .name = "next_pid_field" },
653        { .type = "char[16]",           .name = "next_comm_field" },
654        { .type = "u64",                .name = "ts_ns" },
655        { .type = "u64",                .name = "ts_ms" },
656        { .type = "unsigned int",       .name = "cpu" },
657        { .type = "char[64]",           .name = "my_string_field" },
658        { .type = "int",                .name = "my_int_field" },
659  };
660
661See synth_field_size() for available types.
662
663If field_name contains [n], the field is considered to be a static array.
664
665If field_names contains[] (no subscript), the field is considered to
666be a dynamic array, which will only take as much space in the event as
667is required to hold the array.
668
669Because space for an event is reserved before assigning field values
670to the event, using dynamic arrays implies that the piecewise
671in-kernel API described below can't be used with dynamic arrays.  The
672other non-piecewise in-kernel APIs can, however, be used with dynamic
673arrays.
674
675If the event is created from within a module, a pointer to the module
676must be passed to synth_event_create().  This will ensure that the
677trace buffer won't contain unreadable events when the module is
678removed.
679
680At this point, the event object is ready to be used for generating new
681events.
682
683In the second method, the event is created in several steps.  This
684allows events to be created dynamically and without the need to create
685and populate an array of fields beforehand.
686
687To use this method, an empty or partially empty synthetic event should
688first be created using synth_event_gen_cmd_start() or
689synth_event_gen_cmd_array_start().  For synth_event_gen_cmd_start(),
690the name of the event along with one or more pairs of args each pair
691representing a 'type field_name;' field specification should be
692supplied.  For synth_event_gen_cmd_array_start(), the name of the
693event along with an array of struct synth_field_desc should be
694supplied. Before calling synth_event_gen_cmd_start() or
695synth_event_gen_cmd_array_start(), the user should create and
696initialize a dynevent_cmd object using synth_event_cmd_init().
697
698For example, to create a new "schedtest" synthetic event with two
699fields::
700
701  struct dynevent_cmd cmd;
702  char *buf;
703
704  /* Create a buffer to hold the generated command */
705  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
706
707  /* Before generating the command, initialize the cmd object */
708  synth_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
709
710  ret = synth_event_gen_cmd_start(&cmd, "schedtest", THIS_MODULE,
711                                  "pid_t", "next_pid_field",
712                                  "u64", "ts_ns");
713
714Alternatively, using an array of struct synth_field_desc fields
715containing the same information::
716
717  ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE,
718                                        fields, n_fields);
719
720Once the synthetic event object has been created, it can then be
721populated with more fields.  Fields are added one by one using
722synth_event_add_field(), supplying the dynevent_cmd object, a field
723type, and a field name.  For example, to add a new int field named
724"intfield", the following call should be made::
725
726  ret = synth_event_add_field(&cmd, "int", "intfield");
727
728See synth_field_size() for available types. If field_name contains [n]
729the field is considered to be an array.
730
731A group of fields can also be added all at once using an array of
732synth_field_desc with add_synth_fields().  For example, this would add
733just the first four sched_fields::
734
735  ret = synth_event_add_fields(&cmd, sched_fields, 4);
736
737If you already have a string of the form 'type field_name',
738synth_event_add_field_str() can be used to add it as-is; it will
739also automatically append a ';' to the string.
740
741Once all the fields have been added, the event should be finalized and
742registered by calling the synth_event_gen_cmd_end() function::
743
744  ret = synth_event_gen_cmd_end(&cmd);
745
746At this point, the event object is ready to be used for tracing new
747events.
748
7497.2 Tracing synthetic events from in-kernel code
750------------------------------------------------
751
752To trace a synthetic event, there are several options.  The first
753option is to trace the event in one call, using synth_event_trace()
754with a variable number of values, or synth_event_trace_array() with an
755array of values to be set.  A second option can be used to avoid the
756need for a pre-formed array of values or list of arguments, via
757synth_event_trace_start() and synth_event_trace_end() along with
758synth_event_add_next_val() or synth_event_add_val() to add the values
759piecewise.
760
7617.2.1 Tracing a synthetic event all at once
762-------------------------------------------
763
764To trace a synthetic event all at once, the synth_event_trace() or
765synth_event_trace_array() functions can be used.
766
767The synth_event_trace() function is passed the trace_event_file
768representing the synthetic event (which can be retrieved using
769trace_get_event_file() using the synthetic event name, "synthetic" as
770the system name, and the trace instance name (NULL if using the global
771trace array)), along with an variable number of u64 args, one for each
772synthetic event field, and the number of values being passed.
773
774So, to trace an event corresponding to the synthetic event definition
775above, code like the following could be used::
776
777  ret = synth_event_trace(create_synth_test, 7, /* number of values */
778                          444,             /* next_pid_field */
779                          (u64)"clackers", /* next_comm_field */
780                          1000000,         /* ts_ns */
781                          1000,            /* ts_ms */
782                          smp_processor_id(),/* cpu */
783                          (u64)"Thneed",   /* my_string_field */
784                          999);            /* my_int_field */
785
786All vals should be cast to u64, and string vals are just pointers to
787strings, cast to u64.  Strings will be copied into space reserved in
788the event for the string, using these pointers.
789
790Alternatively, the synth_event_trace_array() function can be used to
791accomplish the same thing.  It is passed the trace_event_file
792representing the synthetic event (which can be retrieved using
793trace_get_event_file() using the synthetic event name, "synthetic" as
794the system name, and the trace instance name (NULL if using the global
795trace array)), along with an array of u64, one for each synthetic
796event field.
797
798To trace an event corresponding to the synthetic event definition
799above, code like the following could be used::
800
801  u64 vals[7];
802
803  vals[0] = 777;                  /* next_pid_field */
804  vals[1] = (u64)"tiddlywinks";   /* next_comm_field */
805  vals[2] = 1000000;              /* ts_ns */
806  vals[3] = 1000;                 /* ts_ms */
807  vals[4] = smp_processor_id();   /* cpu */
808  vals[5] = (u64)"thneed";        /* my_string_field */
809  vals[6] = 398;                  /* my_int_field */
810
811The 'vals' array is just an array of u64, the number of which must
812match the number of field in the synthetic event, and which must be in
813the same order as the synthetic event fields.
814
815All vals should be cast to u64, and string vals are just pointers to
816strings, cast to u64.  Strings will be copied into space reserved in
817the event for the string, using these pointers.
818
819In order to trace a synthetic event, a pointer to the trace event file
820is needed.  The trace_get_event_file() function can be used to get
821it - it will find the file in the given trace instance (in this case
822NULL since the top trace array is being used) while at the same time
823preventing the instance containing it from going away::
824
825       schedtest_event_file = trace_get_event_file(NULL, "synthetic",
826                                                   "schedtest");
827
828Before tracing the event, it should be enabled in some way, otherwise
829the synthetic event won't actually show up in the trace buffer.
830
831To enable a synthetic event from the kernel, trace_array_set_clr_event()
832can be used (which is not specific to synthetic events, so does need
833the "synthetic" system name to be specified explicitly).
834
835To enable the event, pass 'true' to it::
836
837       trace_array_set_clr_event(schedtest_event_file->tr,
838                                 "synthetic", "schedtest", true);
839
840To disable it pass false::
841
842       trace_array_set_clr_event(schedtest_event_file->tr,
843                                 "synthetic", "schedtest", false);
844
845Finally, synth_event_trace_array() can be used to actually trace the
846event, which should be visible in the trace buffer afterwards::
847
848       ret = synth_event_trace_array(schedtest_event_file, vals,
849                                     ARRAY_SIZE(vals));
850
851To remove the synthetic event, the event should be disabled, and the
852trace instance should be 'put' back using trace_put_event_file()::
853
854       trace_array_set_clr_event(schedtest_event_file->tr,
855                                 "synthetic", "schedtest", false);
856       trace_put_event_file(schedtest_event_file);
857
858If those have been successful, synth_event_delete() can be called to
859remove the event::
860
861       ret = synth_event_delete("schedtest");
862
8637.2.2 Tracing a synthetic event piecewise
864-----------------------------------------
865
866To trace a synthetic using the piecewise method described above, the
867synth_event_trace_start() function is used to 'open' the synthetic
868event trace::
869
870       struct synth_event_trace_state trace_state;
871
872       ret = synth_event_trace_start(schedtest_event_file, &trace_state);
873
874It's passed the trace_event_file representing the synthetic event
875using the same methods as described above, along with a pointer to a
876struct synth_event_trace_state object, which will be zeroed before use and
877used to maintain state between this and following calls.
878
879Once the event has been opened, which means space for it has been
880reserved in the trace buffer, the individual fields can be set.  There
881are two ways to do that, either one after another for each field in
882the event, which requires no lookups, or by name, which does.  The
883tradeoff is flexibility in doing the assignments vs the cost of a
884lookup per field.
885
886To assign the values one after the other without lookups,
887synth_event_add_next_val() should be used.  Each call is passed the
888same synth_event_trace_state object used in the synth_event_trace_start(),
889along with the value to set the next field in the event.  After each
890field is set, the 'cursor' points to the next field, which will be set
891by the subsequent call, continuing until all the fields have been set
892in order.  The same sequence of calls as in the above examples using
893this method would be (without error-handling code)::
894
895       /* next_pid_field */
896       ret = synth_event_add_next_val(777, &trace_state);
897
898       /* next_comm_field */
899       ret = synth_event_add_next_val((u64)"slinky", &trace_state);
900
901       /* ts_ns */
902       ret = synth_event_add_next_val(1000000, &trace_state);
903
904       /* ts_ms */
905       ret = synth_event_add_next_val(1000, &trace_state);
906
907       /* cpu */
908       ret = synth_event_add_next_val(smp_processor_id(), &trace_state);
909
910       /* my_string_field */
911       ret = synth_event_add_next_val((u64)"thneed_2.01", &trace_state);
912
913       /* my_int_field */
914       ret = synth_event_add_next_val(395, &trace_state);
915
916To assign the values in any order, synth_event_add_val() should be
917used.  Each call is passed the same synth_event_trace_state object used in
918the synth_event_trace_start(), along with the field name of the field
919to set and the value to set it to.  The same sequence of calls as in
920the above examples using this method would be (without error-handling
921code)::
922
923       ret = synth_event_add_val("next_pid_field", 777, &trace_state);
924       ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
925                                 &trace_state);
926       ret = synth_event_add_val("ts_ns", 1000000, &trace_state);
927       ret = synth_event_add_val("ts_ms", 1000, &trace_state);
928       ret = synth_event_add_val("cpu", smp_processor_id(), &trace_state);
929       ret = synth_event_add_val("my_string_field", (u64)"thneed_9",
930                                 &trace_state);
931       ret = synth_event_add_val("my_int_field", 3999, &trace_state);
932
933Note that synth_event_add_next_val() and synth_event_add_val() are
934incompatible if used within the same trace of an event - either one
935can be used but not both at the same time.
936
937Finally, the event won't be actually traced until it's 'closed',
938which is done using synth_event_trace_end(), which takes only the
939struct synth_event_trace_state object used in the previous calls::
940
941       ret = synth_event_trace_end(&trace_state);
942
943Note that synth_event_trace_end() must be called at the end regardless
944of whether any of the add calls failed (say due to a bad field name
945being passed in).
946
9477.3 Dyamically creating kprobe and kretprobe event definitions
948--------------------------------------------------------------
949
950To create a kprobe or kretprobe trace event from kernel code, the
951kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
952functions can be used.
953
954To create a kprobe event, an empty or partially empty kprobe event
955should first be created using kprobe_event_gen_cmd_start().  The name
956of the event and the probe location should be specified along with one
957or args each representing a probe field should be supplied to this
958function.  Before calling kprobe_event_gen_cmd_start(), the user
959should create and initialize a dynevent_cmd object using
960kprobe_event_cmd_init().
961
962For example, to create a new "schedtest" kprobe event with two fields::
963
964  struct dynevent_cmd cmd;
965  char *buf;
966
967  /* Create a buffer to hold the generated command */
968  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
969
970  /* Before generating the command, initialize the cmd object */
971  kprobe_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
972
973  /*
974   * Define the gen_kprobe_test event with the first 2 kprobe
975   * fields.
976   */
977  ret = kprobe_event_gen_cmd_start(&cmd, "gen_kprobe_test", "do_sys_open",
978                                   "dfd=%ax", "filename=%dx");
979
980Once the kprobe event object has been created, it can then be
981populated with more fields.  Fields can be added using
982kprobe_event_add_fields(), supplying the dynevent_cmd object along
983with a variable arg list of probe fields.  For example, to add a
984couple additional fields, the following call could be made::
985
986  ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");
987
988Once all the fields have been added, the event should be finalized and
989registered by calling the kprobe_event_gen_cmd_end() or
990kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
991or kretprobe command was started::
992
993  ret = kprobe_event_gen_cmd_end(&cmd);
994
995or::
996
997  ret = kretprobe_event_gen_cmd_end(&cmd);
998
999At this point, the event object is ready to be used for tracing new
1000events.
1001
1002Similarly, a kretprobe event can be created using
1003kretprobe_event_gen_cmd_start() with a probe name and location and
1004additional params such as $retval::
1005
1006  ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
1007                                      "do_sys_open", "$retval");
1008
1009Similar to the synthetic event case, code like the following can be
1010used to enable the newly created kprobe event::
1011
1012  gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test");
1013
1014  ret = trace_array_set_clr_event(gen_kprobe_test->tr,
1015                                  "kprobes", "gen_kprobe_test", true);
1016
1017Finally, also similar to synthetic events, the following code can be
1018used to give the kprobe event file back and delete the event::
1019
1020  trace_put_event_file(gen_kprobe_test);
1021
1022  ret = kprobe_event_delete("gen_kprobe_test");
1023
10247.4 The "dynevent_cmd" low-level API
1025------------------------------------
1026
1027Both the in-kernel synthetic event and kprobe interfaces are built on
1028top of a lower-level "dynevent_cmd" interface.  This interface is
1029meant to provide the basis for higher-level interfaces such as the
1030synthetic and kprobe interfaces, which can be used as examples.
1031
1032The basic idea is simple and amounts to providing a general-purpose
1033layer that can be used to generate trace event commands.  The
1034generated command strings can then be passed to the command-parsing
1035and event creation code that already exists in the trace event
1036subsystem for creating the corresponding trace events.
1037
1038In a nutshell, the way it works is that the higher-level interface
1039code creates a struct dynevent_cmd object, then uses a couple
1040functions, dynevent_arg_add() and dynevent_arg_pair_add() to build up
1041a command string, which finally causes the command to be executed
1042using the dynevent_create() function.  The details of the interface
1043are described below.
1044
1045The first step in building a new command string is to create and
1046initialize an instance of a dynevent_cmd.  Here, for instance, we
1047create a dynevent_cmd on the stack and initialize it::
1048
1049  struct dynevent_cmd cmd;
1050  char *buf;
1051  int ret;
1052
1053  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
1054
1055  dynevent_cmd_init(cmd, buf, maxlen, DYNEVENT_TYPE_FOO,
1056                    foo_event_run_command);
1057
1058The dynevent_cmd initialization needs to be given a user-specified
1059buffer and the length of the buffer (MAX_DYNEVENT_CMD_LEN can be used
1060for this purpose - at 2k it's generally too big to be comfortably put
1061on the stack, so is dynamically allocated), a dynevent type id, which
1062is meant to be used to check that further API calls are for the
1063correct command type, and a pointer to an event-specific run_command()
1064callback that will be called to actually execute the event-specific
1065command function.
1066
1067Once that's done, the command string can by built up by successive
1068calls to argument-adding functions.
1069
1070To add a single argument, define and initialize a struct dynevent_arg
1071or struct dynevent_arg_pair object.  Here's an example of the simplest
1072possible arg addition, which is simply to append the given string as
1073a whitespace-separated argument to the command::
1074
1075  struct dynevent_arg arg;
1076
1077  dynevent_arg_init(&arg, NULL, 0);
1078
1079  arg.str = name;
1080
1081  ret = dynevent_arg_add(cmd, &arg);
1082
1083The arg object is first initialized using dynevent_arg_init() and in
1084this case the parameters are NULL or 0, which means there's no
1085optional sanity-checking function or separator appended to the end of
1086the arg.
1087
1088Here's another more complicated example using an 'arg pair', which is
1089used to create an argument that consists of a couple components added
1090together as a unit, for example, a 'type field_name;' arg or a simple
1091expression arg e.g. 'flags=%cx'::
1092
1093  struct dynevent_arg_pair arg_pair;
1094
1095  dynevent_arg_pair_init(&arg_pair, dynevent_foo_check_arg_fn, 0, ';');
1096
1097  arg_pair.lhs = type;
1098  arg_pair.rhs = name;
1099
1100  ret = dynevent_arg_pair_add(cmd, &arg_pair);
1101
1102Again, the arg_pair is first initialized, in this case with a callback
1103function used to check the sanity of the args (for example, that
1104neither part of the pair is NULL), along with a character to be used
1105to add an operator between the pair (here none) and a separator to be
1106appended onto the end of the arg pair (here ';').
1107
1108There's also a dynevent_str_add() function that can be used to simply
1109add a string as-is, with no spaces, delimiters, or arg check.
1110
1111Any number of dynevent_*_add() calls can be made to build up the string
1112(until its length surpasses cmd->maxlen).  When all the arguments have
1113been added and the command string is complete, the only thing left to
1114do is run the command, which happens by simply calling
1115dynevent_create()::
1116
1117  ret = dynevent_create(&cmd);
1118
1119At that point, if the return value is 0, the dynamic event has been
1120created and is ready to use.
1121
1122See the dynevent_cmd function definitions themselves for the details
1123of the API.
1124