1============= 2Event Tracing 3============= 4 5:Author: Theodore Ts'o 6:Updated: Li Zefan and Tom Zanussi 7 81. Introduction 9=============== 10 11Tracepoints (see Documentation/trace/tracepoints.rst) can be used 12without creating custom kernel modules to register probe functions 13using the event tracing infrastructure. 14 15Not all tracepoints can be traced using the event tracing system; 16the kernel developer must provide code snippets which define how the 17tracing information is saved into the tracing buffer, and how the 18tracing information should be printed. 19 202. Using Event Tracing 21====================== 22 232.1 Via the 'set_event' interface 24--------------------------------- 25 26The events which are available for tracing can be found in the file 27/sys/kernel/tracing/available_events. 28 29To enable a particular event, such as 'sched_wakeup', simply echo it 30to /sys/kernel/tracing/set_event. For example:: 31 32 # echo sched_wakeup >> /sys/kernel/tracing/set_event 33 34.. Note:: '>>' is necessary, otherwise it will firstly disable all the events. 35 36To disable an event, echo the event name to the set_event file prefixed 37with an exclamation point:: 38 39 # echo '!sched_wakeup' >> /sys/kernel/tracing/set_event 40 41To disable all events, echo an empty line to the set_event file:: 42 43 # echo > /sys/kernel/tracing/set_event 44 45To enable all events, echo ``*:*`` or ``*:`` to the set_event file:: 46 47 # echo *:* > /sys/kernel/tracing/set_event 48 49The events are organized into subsystems, such as ext4, irq, sched, 50etc., and a full event name looks like this: <subsystem>:<event>. The 51subsystem name is optional, but it is displayed in the available_events 52file. All of the events in a subsystem can be specified via the syntax 53``<subsystem>:*``; for example, to enable all irq events, you can use the 54command:: 55 56 # echo 'irq:*' > /sys/kernel/tracing/set_event 57 58The set_event file may also be used to enable events associated to only 59a specific module:: 60 61 # echo ':mod:<module>' > /sys/kernel/tracing/set_event 62 63Will enable all events in the module ``<module>``. If the module is not yet 64loaded, the string will be saved and when a module is that matches ``<module>`` 65is loaded, then it will apply the enabling of events then. 66 67The text before ``:mod:`` will be parsed to specify specific events that the 68module creates:: 69 70 # echo '<match>:mod:<module>' > /sys/kernel/tracing/set_event 71 72The above will enable any system or event that ``<match>`` matches. If 73``<match>`` is ``"*"`` then it will match all events. 74 75To enable only a specific event within a system:: 76 77 # echo '<system>:<event>:mod:<module>' > /sys/kernel/tracing/set_event 78 79If ``<event>`` is ``"*"`` then it will match all events within the system 80for a given module. 81 822.2 Via the 'enable' toggle 83--------------------------- 84 85The events available are also listed in /sys/kernel/tracing/events/ hierarchy 86of directories. 87 88To enable event 'sched_wakeup':: 89 90 # echo 1 > /sys/kernel/tracing/events/sched/sched_wakeup/enable 91 92To disable it:: 93 94 # echo 0 > /sys/kernel/tracing/events/sched/sched_wakeup/enable 95 96To enable all events in sched subsystem:: 97 98 # echo 1 > /sys/kernel/tracing/events/sched/enable 99 100To enable all events:: 101 102 # echo 1 > /sys/kernel/tracing/events/enable 103 104When reading one of these enable files, there are four results: 105 106 - 0 - all events this file affects are disabled 107 - 1 - all events this file affects are enabled 108 - X - there is a mixture of events enabled and disabled 109 - ? - this file does not affect any event 110 1112.3 Boot option 112--------------- 113 114In order to facilitate early boot debugging, use boot option:: 115 116 trace_event=[event-list] 117 118event-list is a comma separated list of events. See section 2.1 for event 119format. 120 1213. Defining an event-enabled tracepoint 122======================================= 123 124See The example provided in samples/trace_events 125 1264. Event formats 127================ 128 129Each trace event has a 'format' file associated with it that contains 130a description of each field in a logged event. This information can 131be used to parse the binary trace stream, and is also the place to 132find the field names that can be used in event filters (see section 5). 133 134It also displays the format string that will be used to print the 135event in text mode, along with the event name and ID used for 136profiling. 137 138Every event has a set of ``common`` fields associated with it; these are 139the fields prefixed with ``common_``. The other fields vary between 140events and correspond to the fields defined in the TRACE_EVENT 141definition for that event. 142 143Each field in the format has the form:: 144 145 field:field-type field-name; offset:N; size:N; 146 147where offset is the offset of the field in the trace record and size 148is the size of the data item, in bytes. 149 150For example, here's the information displayed for the 'sched_wakeup' 151event:: 152 153 # cat /sys/kernel/tracing/events/sched/sched_wakeup/format 154 155 name: sched_wakeup 156 ID: 60 157 format: 158 field:unsigned short common_type; offset:0; size:2; 159 field:unsigned char common_flags; offset:2; size:1; 160 field:unsigned char common_preempt_count; offset:3; size:1; 161 field:int common_pid; offset:4; size:4; 162 field:int common_tgid; offset:8; size:4; 163 164 field:char comm[TASK_COMM_LEN]; offset:12; size:16; 165 field:pid_t pid; offset:28; size:4; 166 field:int prio; offset:32; size:4; 167 field:int success; offset:36; size:4; 168 field:int cpu; offset:40; size:4; 169 170 print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, 171 REC->prio, REC->success, REC->cpu 172 173This event contains 10 fields, the first 5 common and the remaining 5 174event-specific. All the fields for this event are numeric, except for 175'comm' which is a string, a distinction important for event filtering. 176 1775. Event filtering 178================== 179 180Trace events can be filtered in the kernel by associating boolean 181'filter expressions' with them. As soon as an event is logged into 182the trace buffer, its fields are checked against the filter expression 183associated with that event type. An event with field values that 184'match' the filter will appear in the trace output, and an event whose 185values don't match will be discarded. An event with no filter 186associated with it matches everything, and is the default when no 187filter has been set for an event. 188 1895.1 Expression syntax 190--------------------- 191 192A filter expression consists of one or more 'predicates' that can be 193combined using the logical operators '&&' and '||'. A predicate is 194simply a clause that compares the value of a field contained within a 195logged event with a constant value and returns either 0 or 1 depending 196on whether the field value matched (1) or didn't match (0):: 197 198 field-name relational-operator value 199 200Parentheses can be used to provide arbitrary logical groupings and 201double-quotes can be used to prevent the shell from interpreting 202operators as shell metacharacters. 203 204The field-names available for use in filters can be found in the 205'format' files for trace events (see section 4). 206 207The relational-operators depend on the type of the field being tested: 208 209The operators available for numeric fields are: 210 211==, !=, <, <=, >, >=, & 212 213And for string fields they are: 214 215==, !=, ~ 216 217The glob (~) accepts a wild card character (\*,?) and character classes 218([). For example:: 219 220 prev_comm ~ "*sh" 221 prev_comm ~ "sh*" 222 prev_comm ~ "*sh*" 223 prev_comm ~ "ba*sh" 224 225If the field is a pointer that points into user space (for example 226"filename" from sys_enter_openat), then you have to append ".ustring" to the 227field name:: 228 229 filename.ustring ~ "password" 230 231As the kernel will have to know how to retrieve the memory that the pointer 232is at from user space. 233 234You can convert any long type to a function address and search by function name:: 235 236 call_site.function == security_prepare_creds 237 238The above will filter when the field "call_site" falls on the address within 239"security_prepare_creds". That is, it will compare the value of "call_site" and 240the filter will return true if it is greater than or equal to the start of 241the function "security_prepare_creds" and less than the end of that function. 242 243The ".function" postfix can only be attached to values of size long, and can only 244be compared with "==" or "!=". 245 246Cpumask fields or scalar fields that encode a CPU number can be filtered using 247a user-provided cpumask in cpulist format. The format is as follows:: 248 249 CPUS{$cpulist} 250 251Operators available to cpumask filtering are: 252 253& (intersection), ==, != 254 255For example, this will filter events that have their .target_cpu field present 256in the given cpumask:: 257 258 target_cpu & CPUS{17-42} 259 2605.2 Setting filters 261------------------- 262 263A filter for an individual event is set by writing a filter expression 264to the 'filter' file for the given event. 265 266For example:: 267 268 # cd /sys/kernel/tracing/events/sched/sched_wakeup 269 # echo "common_preempt_count > 4" > filter 270 271A slightly more involved example:: 272 273 # cd /sys/kernel/tracing/events/signal/signal_generate 274 # echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter 275 276If there is an error in the expression, you'll get an 'Invalid 277argument' error when setting it, and the erroneous string along with 278an error message can be seen by looking at the filter e.g.:: 279 280 # cd /sys/kernel/tracing/events/signal/signal_generate 281 # echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter 282 -bash: echo: write error: Invalid argument 283 # cat filter 284 ((sig >= 10 && sig < 15) || dsig == 17) && comm != bash 285 ^ 286 parse_error: Field not found 287 288Currently the caret ('^') for an error always appears at the beginning of 289the filter string; the error message should still be useful though 290even without more accurate position info. 291 2925.2.1 Filter limitations 293------------------------ 294 295If a filter is placed on a string pointer ``(char *)`` that does not point 296to a string on the ring buffer, but instead points to kernel or user space 297memory, then, for safety reasons, at most 1024 bytes of the content is 298copied onto a temporary buffer to do the compare. If the copy of the memory 299faults (the pointer points to memory that should not be accessed), then the 300string compare will be treated as not matching. 301 3025.3 Clearing filters 303-------------------- 304 305To clear the filter for an event, write a '0' to the event's filter 306file. 307 308To clear the filters for all events in a subsystem, write a '0' to the 309subsystem's filter file. 310 3115.4 Subsystem filters 312--------------------- 313 314For convenience, filters for every event in a subsystem can be set or 315cleared as a group by writing a filter expression into the filter file 316at the root of the subsystem. Note however, that if a filter for any 317event within the subsystem lacks a field specified in the subsystem 318filter, or if the filter can't be applied for any other reason, the 319filter for that event will retain its previous setting. This can 320result in an unintended mixture of filters which could lead to 321confusing (to the user who might think different filters are in 322effect) trace output. Only filters that reference just the common 323fields can be guaranteed to propagate successfully to all events. 324 325Here are a few subsystem filter examples that also illustrate the 326above points: 327 328Clear the filters on all events in the sched subsystem:: 329 330 # cd /sys/kernel/tracing/events/sched 331 # echo 0 > filter 332 # cat sched_switch/filter 333 none 334 # cat sched_wakeup/filter 335 none 336 337Set a filter using only common fields for all events in the sched 338subsystem (all events end up with the same filter):: 339 340 # cd /sys/kernel/tracing/events/sched 341 # echo common_pid == 0 > filter 342 # cat sched_switch/filter 343 common_pid == 0 344 # cat sched_wakeup/filter 345 common_pid == 0 346 347Attempt to set a filter using a non-common field for all events in the 348sched subsystem (all events but those that have a prev_pid field retain 349their old filters):: 350 351 # cd /sys/kernel/tracing/events/sched 352 # echo prev_pid == 0 > filter 353 # cat sched_switch/filter 354 prev_pid == 0 355 # cat sched_wakeup/filter 356 common_pid == 0 357 3585.5 PID filtering 359----------------- 360 361The set_event_pid file in the same directory as the top events directory 362exists, will filter all events from tracing any task that does not have the 363PID listed in the set_event_pid file. 364:: 365 366 # cd /sys/kernel/tracing 367 # echo $$ > set_event_pid 368 # echo 1 > events/enable 369 370Will only trace events for the current task. 371 372To add more PIDs without losing the PIDs already included, use '>>'. 373:: 374 375 # echo 123 244 1 >> set_event_pid 376 377 3786. Event triggers 379================= 380 381Trace events can be made to conditionally invoke trigger 'commands' 382which can take various forms and are described in detail below; 383examples would be enabling or disabling other trace events or invoking 384a stack trace whenever the trace event is hit. Whenever a trace event 385with attached triggers is invoked, the set of trigger commands 386associated with that event is invoked. Any given trigger can 387additionally have an event filter of the same form as described in 388section 5 (Event filtering) associated with it - the command will only 389be invoked if the event being invoked passes the associated filter. 390If no filter is associated with the trigger, it always passes. 391 392Triggers are added to and removed from a particular event by writing 393trigger expressions to the 'trigger' file for the given event. 394 395A given event can have any number of triggers associated with it, 396subject to any restrictions that individual commands may have in that 397regard. 398 399Event triggers are implemented on top of "soft" mode, which means that 400whenever a trace event has one or more triggers associated with it, 401the event is activated even if it isn't actually enabled, but is 402disabled in a "soft" mode. That is, the tracepoint will be called, 403but just will not be traced, unless of course it's actually enabled. 404This scheme allows triggers to be invoked even for events that aren't 405enabled, and also allows the current event filter implementation to be 406used for conditionally invoking triggers. 407 408The syntax for event triggers is roughly based on the syntax for 409set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands' 410section of Documentation/trace/ftrace.rst), but there are major 411differences and the implementation isn't currently tied to it in any 412way, so beware about making generalizations between the two. 413 414.. Note:: 415 Writing into trace_marker (See Documentation/trace/ftrace.rst) 416 can also enable triggers that are written into 417 /sys/kernel/tracing/events/ftrace/print/trigger 418 4196.1 Expression syntax 420--------------------- 421 422Triggers are added by echoing the command to the 'trigger' file:: 423 424 # echo 'command[:count] [if filter]' > trigger 425 426Triggers are removed by echoing the same command but starting with '!' 427to the 'trigger' file:: 428 429 # echo '!command[:count] [if filter]' > trigger 430 431The [if filter] part isn't used in matching commands when removing, so 432leaving that off in a '!' command will accomplish the same thing as 433having it in. 434 435The filter syntax is the same as that described in the 'Event 436filtering' section above. 437 438For ease of use, writing to the trigger file using '>' currently just 439adds or removes a single trigger and there's no explicit '>>' support 440('>' actually behaves like '>>') or truncation support to remove all 441triggers (you have to use '!' for each one added.) 442 4436.2 Supported trigger commands 444------------------------------ 445 446The following commands are supported: 447 448- enable_event/disable_event 449 450 These commands can enable or disable another trace event whenever 451 the triggering event is hit. When these commands are registered, 452 the other trace event is activated, but disabled in a "soft" mode. 453 That is, the tracepoint will be called, but just will not be traced. 454 The event tracepoint stays in this mode as long as there's a trigger 455 in effect that can trigger it. 456 457 For example, the following trigger causes kmalloc events to be 458 traced when a read system call is entered, and the :1 at the end 459 specifies that this enablement happens only once:: 460 461 # echo 'enable_event:kmem:kmalloc:1' > \ 462 /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger 463 464 The following trigger causes kmalloc events to stop being traced 465 when a read system call exits. This disablement happens on every 466 read system call exit:: 467 468 # echo 'disable_event:kmem:kmalloc' > \ 469 /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger 470 471 The format is:: 472 473 enable_event:<system>:<event>[:count] 474 disable_event:<system>:<event>[:count] 475 476 To remove the above commands:: 477 478 # echo '!enable_event:kmem:kmalloc:1' > \ 479 /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger 480 481 # echo '!disable_event:kmem:kmalloc' > \ 482 /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger 483 484 Note that there can be any number of enable/disable_event triggers 485 per triggering event, but there can only be one trigger per 486 triggered event. e.g. sys_enter_read can have triggers enabling both 487 kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc 488 versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if 489 bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they 490 could be combined into a single filter on kmem:kmalloc though). 491 492- stacktrace 493 494 This command dumps a stacktrace in the trace buffer whenever the 495 triggering event occurs. 496 497 For example, the following trigger dumps a stacktrace every time the 498 kmalloc tracepoint is hit:: 499 500 # echo 'stacktrace' > \ 501 /sys/kernel/tracing/events/kmem/kmalloc/trigger 502 503 The following trigger dumps a stacktrace the first 5 times a kmalloc 504 request happens with a size >= 64K:: 505 506 # echo 'stacktrace:5 if bytes_req >= 65536' > \ 507 /sys/kernel/tracing/events/kmem/kmalloc/trigger 508 509 The format is:: 510 511 stacktrace[:count] 512 513 To remove the above commands:: 514 515 # echo '!stacktrace' > \ 516 /sys/kernel/tracing/events/kmem/kmalloc/trigger 517 518 # echo '!stacktrace:5 if bytes_req >= 65536' > \ 519 /sys/kernel/tracing/events/kmem/kmalloc/trigger 520 521 The latter can also be removed more simply by the following (without 522 the filter):: 523 524 # echo '!stacktrace:5' > \ 525 /sys/kernel/tracing/events/kmem/kmalloc/trigger 526 527 Note that there can be only one stacktrace trigger per triggering 528 event. 529 530- snapshot 531 532 This command causes a snapshot to be triggered whenever the 533 triggering event occurs. 534 535 The following command creates a snapshot every time a block request 536 queue is unplugged with a depth > 1. If you were tracing a set of 537 events or functions at the time, the snapshot trace buffer would 538 capture those events when the trigger event occurred:: 539 540 # echo 'snapshot if nr_rq > 1' > \ 541 /sys/kernel/tracing/events/block/block_unplug/trigger 542 543 To only snapshot once:: 544 545 # echo 'snapshot:1 if nr_rq > 1' > \ 546 /sys/kernel/tracing/events/block/block_unplug/trigger 547 548 To remove the above commands:: 549 550 # echo '!snapshot if nr_rq > 1' > \ 551 /sys/kernel/tracing/events/block/block_unplug/trigger 552 553 # echo '!snapshot:1 if nr_rq > 1' > \ 554 /sys/kernel/tracing/events/block/block_unplug/trigger 555 556 Note that there can be only one snapshot trigger per triggering 557 event. 558 559- traceon/traceoff 560 561 These commands turn tracing on and off when the specified events are 562 hit. The parameter determines how many times the tracing system is 563 turned on and off. If unspecified, there is no limit. 564 565 The following command turns tracing off the first time a block 566 request queue is unplugged with a depth > 1. If you were tracing a 567 set of events or functions at the time, you could then examine the 568 trace buffer to see the sequence of events that led up to the 569 trigger event:: 570 571 # echo 'traceoff:1 if nr_rq > 1' > \ 572 /sys/kernel/tracing/events/block/block_unplug/trigger 573 574 To always disable tracing when nr_rq > 1:: 575 576 # echo 'traceoff if nr_rq > 1' > \ 577 /sys/kernel/tracing/events/block/block_unplug/trigger 578 579 To remove the above commands:: 580 581 # echo '!traceoff:1 if nr_rq > 1' > \ 582 /sys/kernel/tracing/events/block/block_unplug/trigger 583 584 # echo '!traceoff if nr_rq > 1' > \ 585 /sys/kernel/tracing/events/block/block_unplug/trigger 586 587 Note that there can be only one traceon or traceoff trigger per 588 triggering event. 589 590- hist 591 592 This command aggregates event hits into a hash table keyed on one or 593 more trace event format fields (or stacktrace) and a set of running 594 totals derived from one or more trace event format fields and/or 595 event counts (hitcount). 596 597 See Documentation/trace/histogram.rst for details and examples. 598 5997. In-kernel trace event API 600============================ 601 602In most cases, the command-line interface to trace events is more than 603sufficient. Sometimes, however, applications might find the need for 604more complex relationships than can be expressed through a simple 605series of linked command-line expressions, or putting together sets of 606commands may be simply too cumbersome. An example might be an 607application that needs to 'listen' to the trace stream in order to 608maintain an in-kernel state machine detecting, for instance, when an 609illegal kernel state occurs in the scheduler. 610 611The trace event subsystem provides an in-kernel API allowing modules 612or other kernel code to generate user-defined 'synthetic' events at 613will, which can be used to either augment the existing trace stream 614and/or signal that a particular important state has occurred. 615 616A similar in-kernel API is also available for creating kprobe and 617kretprobe events. 618 619Both the synthetic event and k/ret/probe event APIs are built on top 620of a lower-level "dynevent_cmd" event command API, which is also 621available for more specialized applications, or as the basis of other 622higher-level trace event APIs. 623 624The API provided for these purposes is describe below and allows the 625following: 626 627 - dynamically creating synthetic event definitions 628 - dynamically creating kprobe and kretprobe event definitions 629 - tracing synthetic events from in-kernel code 630 - the low-level "dynevent_cmd" API 631 6327.1 Dyamically creating synthetic event definitions 633--------------------------------------------------- 634 635There are a couple ways to create a new synthetic event from a kernel 636module or other kernel code. 637 638The first creates the event in one step, using synth_event_create(). 639In this method, the name of the event to create and an array defining 640the fields is supplied to synth_event_create(). If successful, a 641synthetic event with that name and fields will exist following that 642call. For example, to create a new "schedtest" synthetic event:: 643 644 ret = synth_event_create("schedtest", sched_fields, 645 ARRAY_SIZE(sched_fields), THIS_MODULE); 646 647The sched_fields param in this example points to an array of struct 648synth_field_desc, each of which describes an event field by type and 649name:: 650 651 static struct synth_field_desc sched_fields[] = { 652 { .type = "pid_t", .name = "next_pid_field" }, 653 { .type = "char[16]", .name = "next_comm_field" }, 654 { .type = "u64", .name = "ts_ns" }, 655 { .type = "u64", .name = "ts_ms" }, 656 { .type = "unsigned int", .name = "cpu" }, 657 { .type = "char[64]", .name = "my_string_field" }, 658 { .type = "int", .name = "my_int_field" }, 659 }; 660 661See synth_field_size() for available types. 662 663If field_name contains [n], the field is considered to be a static array. 664 665If field_names contains[] (no subscript), the field is considered to 666be a dynamic array, which will only take as much space in the event as 667is required to hold the array. 668 669Because space for an event is reserved before assigning field values 670to the event, using dynamic arrays implies that the piecewise 671in-kernel API described below can't be used with dynamic arrays. The 672other non-piecewise in-kernel APIs can, however, be used with dynamic 673arrays. 674 675If the event is created from within a module, a pointer to the module 676must be passed to synth_event_create(). This will ensure that the 677trace buffer won't contain unreadable events when the module is 678removed. 679 680At this point, the event object is ready to be used for generating new 681events. 682 683In the second method, the event is created in several steps. This 684allows events to be created dynamically and without the need to create 685and populate an array of fields beforehand. 686 687To use this method, an empty or partially empty synthetic event should 688first be created using synth_event_gen_cmd_start() or 689synth_event_gen_cmd_array_start(). For synth_event_gen_cmd_start(), 690the name of the event along with one or more pairs of args each pair 691representing a 'type field_name;' field specification should be 692supplied. For synth_event_gen_cmd_array_start(), the name of the 693event along with an array of struct synth_field_desc should be 694supplied. Before calling synth_event_gen_cmd_start() or 695synth_event_gen_cmd_array_start(), the user should create and 696initialize a dynevent_cmd object using synth_event_cmd_init(). 697 698For example, to create a new "schedtest" synthetic event with two 699fields:: 700 701 struct dynevent_cmd cmd; 702 char *buf; 703 704 /* Create a buffer to hold the generated command */ 705 buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL); 706 707 /* Before generating the command, initialize the cmd object */ 708 synth_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN); 709 710 ret = synth_event_gen_cmd_start(&cmd, "schedtest", THIS_MODULE, 711 "pid_t", "next_pid_field", 712 "u64", "ts_ns"); 713 714Alternatively, using an array of struct synth_field_desc fields 715containing the same information:: 716 717 ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE, 718 fields, n_fields); 719 720Once the synthetic event object has been created, it can then be 721populated with more fields. Fields are added one by one using 722synth_event_add_field(), supplying the dynevent_cmd object, a field 723type, and a field name. For example, to add a new int field named 724"intfield", the following call should be made:: 725 726 ret = synth_event_add_field(&cmd, "int", "intfield"); 727 728See synth_field_size() for available types. If field_name contains [n] 729the field is considered to be an array. 730 731A group of fields can also be added all at once using an array of 732synth_field_desc with add_synth_fields(). For example, this would add 733just the first four sched_fields:: 734 735 ret = synth_event_add_fields(&cmd, sched_fields, 4); 736 737If you already have a string of the form 'type field_name', 738synth_event_add_field_str() can be used to add it as-is; it will 739also automatically append a ';' to the string. 740 741Once all the fields have been added, the event should be finalized and 742registered by calling the synth_event_gen_cmd_end() function:: 743 744 ret = synth_event_gen_cmd_end(&cmd); 745 746At this point, the event object is ready to be used for tracing new 747events. 748 7497.2 Tracing synthetic events from in-kernel code 750------------------------------------------------ 751 752To trace a synthetic event, there are several options. The first 753option is to trace the event in one call, using synth_event_trace() 754with a variable number of values, or synth_event_trace_array() with an 755array of values to be set. A second option can be used to avoid the 756need for a pre-formed array of values or list of arguments, via 757synth_event_trace_start() and synth_event_trace_end() along with 758synth_event_add_next_val() or synth_event_add_val() to add the values 759piecewise. 760 7617.2.1 Tracing a synthetic event all at once 762------------------------------------------- 763 764To trace a synthetic event all at once, the synth_event_trace() or 765synth_event_trace_array() functions can be used. 766 767The synth_event_trace() function is passed the trace_event_file 768representing the synthetic event (which can be retrieved using 769trace_get_event_file() using the synthetic event name, "synthetic" as 770the system name, and the trace instance name (NULL if using the global 771trace array)), along with an variable number of u64 args, one for each 772synthetic event field, and the number of values being passed. 773 774So, to trace an event corresponding to the synthetic event definition 775above, code like the following could be used:: 776 777 ret = synth_event_trace(create_synth_test, 7, /* number of values */ 778 444, /* next_pid_field */ 779 (u64)"clackers", /* next_comm_field */ 780 1000000, /* ts_ns */ 781 1000, /* ts_ms */ 782 smp_processor_id(),/* cpu */ 783 (u64)"Thneed", /* my_string_field */ 784 999); /* my_int_field */ 785 786All vals should be cast to u64, and string vals are just pointers to 787strings, cast to u64. Strings will be copied into space reserved in 788the event for the string, using these pointers. 789 790Alternatively, the synth_event_trace_array() function can be used to 791accomplish the same thing. It is passed the trace_event_file 792representing the synthetic event (which can be retrieved using 793trace_get_event_file() using the synthetic event name, "synthetic" as 794the system name, and the trace instance name (NULL if using the global 795trace array)), along with an array of u64, one for each synthetic 796event field. 797 798To trace an event corresponding to the synthetic event definition 799above, code like the following could be used:: 800 801 u64 vals[7]; 802 803 vals[0] = 777; /* next_pid_field */ 804 vals[1] = (u64)"tiddlywinks"; /* next_comm_field */ 805 vals[2] = 1000000; /* ts_ns */ 806 vals[3] = 1000; /* ts_ms */ 807 vals[4] = smp_processor_id(); /* cpu */ 808 vals[5] = (u64)"thneed"; /* my_string_field */ 809 vals[6] = 398; /* my_int_field */ 810 811The 'vals' array is just an array of u64, the number of which must 812match the number of field in the synthetic event, and which must be in 813the same order as the synthetic event fields. 814 815All vals should be cast to u64, and string vals are just pointers to 816strings, cast to u64. Strings will be copied into space reserved in 817the event for the string, using these pointers. 818 819In order to trace a synthetic event, a pointer to the trace event file 820is needed. The trace_get_event_file() function can be used to get 821it - it will find the file in the given trace instance (in this case 822NULL since the top trace array is being used) while at the same time 823preventing the instance containing it from going away:: 824 825 schedtest_event_file = trace_get_event_file(NULL, "synthetic", 826 "schedtest"); 827 828Before tracing the event, it should be enabled in some way, otherwise 829the synthetic event won't actually show up in the trace buffer. 830 831To enable a synthetic event from the kernel, trace_array_set_clr_event() 832can be used (which is not specific to synthetic events, so does need 833the "synthetic" system name to be specified explicitly). 834 835To enable the event, pass 'true' to it:: 836 837 trace_array_set_clr_event(schedtest_event_file->tr, 838 "synthetic", "schedtest", true); 839 840To disable it pass false:: 841 842 trace_array_set_clr_event(schedtest_event_file->tr, 843 "synthetic", "schedtest", false); 844 845Finally, synth_event_trace_array() can be used to actually trace the 846event, which should be visible in the trace buffer afterwards:: 847 848 ret = synth_event_trace_array(schedtest_event_file, vals, 849 ARRAY_SIZE(vals)); 850 851To remove the synthetic event, the event should be disabled, and the 852trace instance should be 'put' back using trace_put_event_file():: 853 854 trace_array_set_clr_event(schedtest_event_file->tr, 855 "synthetic", "schedtest", false); 856 trace_put_event_file(schedtest_event_file); 857 858If those have been successful, synth_event_delete() can be called to 859remove the event:: 860 861 ret = synth_event_delete("schedtest"); 862 8637.2.2 Tracing a synthetic event piecewise 864----------------------------------------- 865 866To trace a synthetic using the piecewise method described above, the 867synth_event_trace_start() function is used to 'open' the synthetic 868event trace:: 869 870 struct synth_event_trace_state trace_state; 871 872 ret = synth_event_trace_start(schedtest_event_file, &trace_state); 873 874It's passed the trace_event_file representing the synthetic event 875using the same methods as described above, along with a pointer to a 876struct synth_event_trace_state object, which will be zeroed before use and 877used to maintain state between this and following calls. 878 879Once the event has been opened, which means space for it has been 880reserved in the trace buffer, the individual fields can be set. There 881are two ways to do that, either one after another for each field in 882the event, which requires no lookups, or by name, which does. The 883tradeoff is flexibility in doing the assignments vs the cost of a 884lookup per field. 885 886To assign the values one after the other without lookups, 887synth_event_add_next_val() should be used. Each call is passed the 888same synth_event_trace_state object used in the synth_event_trace_start(), 889along with the value to set the next field in the event. After each 890field is set, the 'cursor' points to the next field, which will be set 891by the subsequent call, continuing until all the fields have been set 892in order. The same sequence of calls as in the above examples using 893this method would be (without error-handling code):: 894 895 /* next_pid_field */ 896 ret = synth_event_add_next_val(777, &trace_state); 897 898 /* next_comm_field */ 899 ret = synth_event_add_next_val((u64)"slinky", &trace_state); 900 901 /* ts_ns */ 902 ret = synth_event_add_next_val(1000000, &trace_state); 903 904 /* ts_ms */ 905 ret = synth_event_add_next_val(1000, &trace_state); 906 907 /* cpu */ 908 ret = synth_event_add_next_val(smp_processor_id(), &trace_state); 909 910 /* my_string_field */ 911 ret = synth_event_add_next_val((u64)"thneed_2.01", &trace_state); 912 913 /* my_int_field */ 914 ret = synth_event_add_next_val(395, &trace_state); 915 916To assign the values in any order, synth_event_add_val() should be 917used. Each call is passed the same synth_event_trace_state object used in 918the synth_event_trace_start(), along with the field name of the field 919to set and the value to set it to. The same sequence of calls as in 920the above examples using this method would be (without error-handling 921code):: 922 923 ret = synth_event_add_val("next_pid_field", 777, &trace_state); 924 ret = synth_event_add_val("next_comm_field", (u64)"silly putty", 925 &trace_state); 926 ret = synth_event_add_val("ts_ns", 1000000, &trace_state); 927 ret = synth_event_add_val("ts_ms", 1000, &trace_state); 928 ret = synth_event_add_val("cpu", smp_processor_id(), &trace_state); 929 ret = synth_event_add_val("my_string_field", (u64)"thneed_9", 930 &trace_state); 931 ret = synth_event_add_val("my_int_field", 3999, &trace_state); 932 933Note that synth_event_add_next_val() and synth_event_add_val() are 934incompatible if used within the same trace of an event - either one 935can be used but not both at the same time. 936 937Finally, the event won't be actually traced until it's 'closed', 938which is done using synth_event_trace_end(), which takes only the 939struct synth_event_trace_state object used in the previous calls:: 940 941 ret = synth_event_trace_end(&trace_state); 942 943Note that synth_event_trace_end() must be called at the end regardless 944of whether any of the add calls failed (say due to a bad field name 945being passed in). 946 9477.3 Dyamically creating kprobe and kretprobe event definitions 948-------------------------------------------------------------- 949 950To create a kprobe or kretprobe trace event from kernel code, the 951kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start() 952functions can be used. 953 954To create a kprobe event, an empty or partially empty kprobe event 955should first be created using kprobe_event_gen_cmd_start(). The name 956of the event and the probe location should be specified along with one 957or args each representing a probe field should be supplied to this 958function. Before calling kprobe_event_gen_cmd_start(), the user 959should create and initialize a dynevent_cmd object using 960kprobe_event_cmd_init(). 961 962For example, to create a new "schedtest" kprobe event with two fields:: 963 964 struct dynevent_cmd cmd; 965 char *buf; 966 967 /* Create a buffer to hold the generated command */ 968 buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL); 969 970 /* Before generating the command, initialize the cmd object */ 971 kprobe_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN); 972 973 /* 974 * Define the gen_kprobe_test event with the first 2 kprobe 975 * fields. 976 */ 977 ret = kprobe_event_gen_cmd_start(&cmd, "gen_kprobe_test", "do_sys_open", 978 "dfd=%ax", "filename=%dx"); 979 980Once the kprobe event object has been created, it can then be 981populated with more fields. Fields can be added using 982kprobe_event_add_fields(), supplying the dynevent_cmd object along 983with a variable arg list of probe fields. For example, to add a 984couple additional fields, the following call could be made:: 985 986 ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)"); 987 988Once all the fields have been added, the event should be finalized and 989registered by calling the kprobe_event_gen_cmd_end() or 990kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe 991or kretprobe command was started:: 992 993 ret = kprobe_event_gen_cmd_end(&cmd); 994 995or:: 996 997 ret = kretprobe_event_gen_cmd_end(&cmd); 998 999At this point, the event object is ready to be used for tracing new 1000events. 1001 1002Similarly, a kretprobe event can be created using 1003kretprobe_event_gen_cmd_start() with a probe name and location and 1004additional params such as $retval:: 1005 1006 ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test", 1007 "do_sys_open", "$retval"); 1008 1009Similar to the synthetic event case, code like the following can be 1010used to enable the newly created kprobe event:: 1011 1012 gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test"); 1013 1014 ret = trace_array_set_clr_event(gen_kprobe_test->tr, 1015 "kprobes", "gen_kprobe_test", true); 1016 1017Finally, also similar to synthetic events, the following code can be 1018used to give the kprobe event file back and delete the event:: 1019 1020 trace_put_event_file(gen_kprobe_test); 1021 1022 ret = kprobe_event_delete("gen_kprobe_test"); 1023 10247.4 The "dynevent_cmd" low-level API 1025------------------------------------ 1026 1027Both the in-kernel synthetic event and kprobe interfaces are built on 1028top of a lower-level "dynevent_cmd" interface. This interface is 1029meant to provide the basis for higher-level interfaces such as the 1030synthetic and kprobe interfaces, which can be used as examples. 1031 1032The basic idea is simple and amounts to providing a general-purpose 1033layer that can be used to generate trace event commands. The 1034generated command strings can then be passed to the command-parsing 1035and event creation code that already exists in the trace event 1036subsystem for creating the corresponding trace events. 1037 1038In a nutshell, the way it works is that the higher-level interface 1039code creates a struct dynevent_cmd object, then uses a couple 1040functions, dynevent_arg_add() and dynevent_arg_pair_add() to build up 1041a command string, which finally causes the command to be executed 1042using the dynevent_create() function. The details of the interface 1043are described below. 1044 1045The first step in building a new command string is to create and 1046initialize an instance of a dynevent_cmd. Here, for instance, we 1047create a dynevent_cmd on the stack and initialize it:: 1048 1049 struct dynevent_cmd cmd; 1050 char *buf; 1051 int ret; 1052 1053 buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL); 1054 1055 dynevent_cmd_init(cmd, buf, maxlen, DYNEVENT_TYPE_FOO, 1056 foo_event_run_command); 1057 1058The dynevent_cmd initialization needs to be given a user-specified 1059buffer and the length of the buffer (MAX_DYNEVENT_CMD_LEN can be used 1060for this purpose - at 2k it's generally too big to be comfortably put 1061on the stack, so is dynamically allocated), a dynevent type id, which 1062is meant to be used to check that further API calls are for the 1063correct command type, and a pointer to an event-specific run_command() 1064callback that will be called to actually execute the event-specific 1065command function. 1066 1067Once that's done, the command string can by built up by successive 1068calls to argument-adding functions. 1069 1070To add a single argument, define and initialize a struct dynevent_arg 1071or struct dynevent_arg_pair object. Here's an example of the simplest 1072possible arg addition, which is simply to append the given string as 1073a whitespace-separated argument to the command:: 1074 1075 struct dynevent_arg arg; 1076 1077 dynevent_arg_init(&arg, NULL, 0); 1078 1079 arg.str = name; 1080 1081 ret = dynevent_arg_add(cmd, &arg); 1082 1083The arg object is first initialized using dynevent_arg_init() and in 1084this case the parameters are NULL or 0, which means there's no 1085optional sanity-checking function or separator appended to the end of 1086the arg. 1087 1088Here's another more complicated example using an 'arg pair', which is 1089used to create an argument that consists of a couple components added 1090together as a unit, for example, a 'type field_name;' arg or a simple 1091expression arg e.g. 'flags=%cx':: 1092 1093 struct dynevent_arg_pair arg_pair; 1094 1095 dynevent_arg_pair_init(&arg_pair, dynevent_foo_check_arg_fn, 0, ';'); 1096 1097 arg_pair.lhs = type; 1098 arg_pair.rhs = name; 1099 1100 ret = dynevent_arg_pair_add(cmd, &arg_pair); 1101 1102Again, the arg_pair is first initialized, in this case with a callback 1103function used to check the sanity of the args (for example, that 1104neither part of the pair is NULL), along with a character to be used 1105to add an operator between the pair (here none) and a separator to be 1106appended onto the end of the arg pair (here ';'). 1107 1108There's also a dynevent_str_add() function that can be used to simply 1109add a string as-is, with no spaces, delimiters, or arg check. 1110 1111Any number of dynevent_*_add() calls can be made to build up the string 1112(until its length surpasses cmd->maxlen). When all the arguments have 1113been added and the command string is complete, the only thing left to 1114do is run the command, which happens by simply calling 1115dynevent_create():: 1116 1117 ret = dynevent_create(&cmd); 1118 1119At that point, if the return value is 0, the dynamic event has been 1120created and is ready to use. 1121 1122See the dynevent_cmd function definitions themselves for the details 1123of the API. 1124