1======================== 2ftrace - Function Tracer 3======================== 4 5Copyright 2008 Red Hat Inc. 6 7:Author: Steven Rostedt <srostedt@redhat.com> 8:License: The GNU Free Documentation License, Version 1.2 9 (dual licensed under the GPL v2) 10:Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, 11 John Kacur, and David Teigland. 12 13- Written for: 2.6.28-rc2 14- Updated for: 3.10 15- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt 16- Converted to rst format - Changbin Du <changbin.du@intel.com> 17 18Introduction 19------------ 20 21Ftrace is an internal tracer designed to help out developers and 22designers of systems to find what is going on inside the kernel. 23It can be used for debugging or analyzing latencies and 24performance issues that take place outside of user-space. 25 26Although ftrace is typically considered the function tracer, it 27is really a frame work of several assorted tracing utilities. 28There's latency tracing to examine what occurs between interrupts 29disabled and enabled, as well as for preemption and from a time 30a task is woken to the task is actually scheduled in. 31 32One of the most common uses of ftrace is the event tracing. 33Through out the kernel is hundreds of static event points that 34can be enabled via the tracefs file system to see what is 35going on in certain parts of the kernel. 36 37See events.txt for more information. 38 39 40Implementation Details 41---------------------- 42 43See :doc:`ftrace-design` for details for arch porters and such. 44 45 46The File System 47--------------- 48 49Ftrace uses the tracefs file system to hold the control files as 50well as the files to display output. 51 52When tracefs is configured into the kernel (which selecting any ftrace 53option will do) the directory /sys/kernel/tracing will be created. To mount 54this directory, you can add to your /etc/fstab file:: 55 56 tracefs /sys/kernel/tracing tracefs defaults 0 0 57 58Or you can mount it at run time with:: 59 60 mount -t tracefs nodev /sys/kernel/tracing 61 62For quicker access to that directory you may want to make a soft link to 63it:: 64 65 ln -s /sys/kernel/tracing /tracing 66 67.. attention:: 68 69 Before 4.1, all ftrace tracing control files were within the debugfs 70 file system, which is typically located at /sys/kernel/debug/tracing. 71 For backward compatibility, when mounting the debugfs file system, 72 the tracefs file system will be automatically mounted at: 73 74 /sys/kernel/debug/tracing 75 76 All files located in the tracefs file system will be located in that 77 debugfs file system directory as well. 78 79.. attention:: 80 81 Any selected ftrace option will also create the tracefs file system. 82 The rest of the document will assume that you are in the ftrace directory 83 (cd /sys/kernel/tracing) and will only concentrate on the files within that 84 directory and not distract from the content with the extended 85 "/sys/kernel/tracing" path name. 86 87That's it! (assuming that you have ftrace configured into your kernel) 88 89After mounting tracefs you will have access to the control and output files 90of ftrace. Here is a list of some of the key files: 91 92 93 Note: all time values are in microseconds. 94 95 current_tracer: 96 97 This is used to set or display the current tracer 98 that is configured. 99 100 available_tracers: 101 102 This holds the different types of tracers that 103 have been compiled into the kernel. The 104 tracers listed here can be configured by 105 echoing their name into current_tracer. 106 107 tracing_on: 108 109 This sets or displays whether writing to the trace 110 ring buffer is enabled. Echo 0 into this file to disable 111 the tracer or 1 to enable it. Note, this only disables 112 writing to the ring buffer, the tracing overhead may 113 still be occurring. 114 115 The kernel function tracing_off() can be used within the 116 kernel to disable writing to the ring buffer, which will 117 set this file to "0". User space can re-enable tracing by 118 echoing "1" into the file. 119 120 Note, the function and event trigger "traceoff" will also 121 set this file to zero and stop tracing. Which can also 122 be re-enabled by user space using this file. 123 124 trace: 125 126 This file holds the output of the trace in a human 127 readable format (described below). Note, tracing is temporarily 128 disabled while this file is being read (opened). 129 130 trace_pipe: 131 132 The output is the same as the "trace" file but this 133 file is meant to be streamed with live tracing. 134 Reads from this file will block until new data is 135 retrieved. Unlike the "trace" file, this file is a 136 consumer. This means reading from this file causes 137 sequential reads to display more current data. Once 138 data is read from this file, it is consumed, and 139 will not be read again with a sequential read. The 140 "trace" file is static, and if the tracer is not 141 adding more data, it will display the same 142 information every time it is read. This file will not 143 disable tracing while being read. 144 145 trace_options: 146 147 This file lets the user control the amount of data 148 that is displayed in one of the above output 149 files. Options also exist to modify how a tracer 150 or events work (stack traces, timestamps, etc). 151 152 options: 153 154 This is a directory that has a file for every available 155 trace option (also in trace_options). Options may also be set 156 or cleared by writing a "1" or "0" respectively into the 157 corresponding file with the option name. 158 159 tracing_max_latency: 160 161 Some of the tracers record the max latency. 162 For example, the maximum time that interrupts are disabled. 163 The maximum time is saved in this file. The max trace will also be 164 stored, and displayed by "trace". A new max trace will only be 165 recorded if the latency is greater than the value in this file 166 (in microseconds). 167 168 By echoing in a time into this file, no latency will be recorded 169 unless it is greater than the time in this file. 170 171 tracing_thresh: 172 173 Some latency tracers will record a trace whenever the 174 latency is greater than the number in this file. 175 Only active when the file contains a number greater than 0. 176 (in microseconds) 177 178 buffer_size_kb: 179 180 This sets or displays the number of kilobytes each CPU 181 buffer holds. By default, the trace buffers are the same size 182 for each CPU. The displayed number is the size of the 183 CPU buffer and not total size of all buffers. The 184 trace buffers are allocated in pages (blocks of memory 185 that the kernel uses for allocation, usually 4 KB in size). 186 If the last page allocated has room for more bytes 187 than requested, the rest of the page will be used, 188 making the actual allocation bigger than requested or shown. 189 ( Note, the size may not be a multiple of the page size 190 due to buffer management meta-data. ) 191 192 Buffer sizes for individual CPUs may vary 193 (see "per_cpu/cpu0/buffer_size_kb" below), and if they do 194 this file will show "X". 195 196 buffer_total_size_kb: 197 198 This displays the total combined size of all the trace buffers. 199 200 free_buffer: 201 202 If a process is performing tracing, and the ring buffer should be 203 shrunk "freed" when the process is finished, even if it were to be 204 killed by a signal, this file can be used for that purpose. On close 205 of this file, the ring buffer will be resized to its minimum size. 206 Having a process that is tracing also open this file, when the process 207 exits its file descriptor for this file will be closed, and in doing so, 208 the ring buffer will be "freed". 209 210 It may also stop tracing if disable_on_free option is set. 211 212 tracing_cpumask: 213 214 This is a mask that lets the user only trace on specified CPUs. 215 The format is a hex string representing the CPUs. 216 217 set_ftrace_filter: 218 219 When dynamic ftrace is configured in (see the 220 section below "dynamic ftrace"), the code is dynamically 221 modified (code text rewrite) to disable calling of the 222 function profiler (mcount). This lets tracing be configured 223 in with practically no overhead in performance. This also 224 has a side effect of enabling or disabling specific functions 225 to be traced. Echoing names of functions into this file 226 will limit the trace to only those functions. 227 This influences the tracers "function" and "function_graph" 228 and thus also function profiling (see "function_profile_enabled"). 229 230 The functions listed in "available_filter_functions" are what 231 can be written into this file. 232 233 This interface also allows for commands to be used. See the 234 "Filter commands" section for more details. 235 236 set_ftrace_notrace: 237 238 This has an effect opposite to that of 239 set_ftrace_filter. Any function that is added here will not 240 be traced. If a function exists in both set_ftrace_filter 241 and set_ftrace_notrace, the function will _not_ be traced. 242 243 set_ftrace_pid: 244 245 Have the function tracer only trace the threads whose PID are 246 listed in this file. 247 248 If the "function-fork" option is set, then when a task whose 249 PID is listed in this file forks, the child's PID will 250 automatically be added to this file, and the child will be 251 traced by the function tracer as well. This option will also 252 cause PIDs of tasks that exit to be removed from the file. 253 254 set_event_pid: 255 256 Have the events only trace a task with a PID listed in this file. 257 Note, sched_switch and sched_wake_up will also trace events 258 listed in this file. 259 260 To have the PIDs of children of tasks with their PID in this file 261 added on fork, enable the "event-fork" option. That option will also 262 cause the PIDs of tasks to be removed from this file when the task 263 exits. 264 265 set_graph_function: 266 267 Functions listed in this file will cause the function graph 268 tracer to only trace these functions and the functions that 269 they call. (See the section "dynamic ftrace" for more details). 270 Note, set_ftrace_filter and set_ftrace_notrace still affects 271 what functions are being traced. 272 273 set_graph_notrace: 274 275 Similar to set_graph_function, but will disable function graph 276 tracing when the function is hit until it exits the function. 277 This makes it possible to ignore tracing functions that are called 278 by a specific function. 279 280 available_filter_functions: 281 282 This lists the functions that ftrace has processed and can trace. 283 These are the function names that you can pass to 284 "set_ftrace_filter", "set_ftrace_notrace", 285 "set_graph_function", or "set_graph_notrace". 286 (See the section "dynamic ftrace" below for more details.) 287 288 dyn_ftrace_total_info: 289 290 This file is for debugging purposes. The number of functions that 291 have been converted to nops and are available to be traced. 292 293 enabled_functions: 294 295 This file is more for debugging ftrace, but can also be useful 296 in seeing if any function has a callback attached to it. 297 Not only does the trace infrastructure use ftrace function 298 trace utility, but other subsystems might too. This file 299 displays all functions that have a callback attached to them 300 as well as the number of callbacks that have been attached. 301 Note, a callback may also call multiple functions which will 302 not be listed in this count. 303 304 If the callback registered to be traced by a function with 305 the "save regs" attribute (thus even more overhead), a 'R' 306 will be displayed on the same line as the function that 307 is returning registers. 308 309 If the callback registered to be traced by a function with 310 the "ip modify" attribute (thus the regs->ip can be changed), 311 an 'I' will be displayed on the same line as the function that 312 can be overridden. 313 314 If the architecture supports it, it will also show what callback 315 is being directly called by the function. If the count is greater 316 than 1 it most likely will be ftrace_ops_list_func(). 317 318 If the callback of the function jumps to a trampoline that is 319 specific to a the callback and not the standard trampoline, 320 its address will be printed as well as the function that the 321 trampoline calls. 322 323 function_profile_enabled: 324 325 When set it will enable all functions with either the function 326 tracer, or if configured, the function graph tracer. It will 327 keep a histogram of the number of functions that were called 328 and if the function graph tracer was configured, it will also keep 329 track of the time spent in those functions. The histogram 330 content can be displayed in the files: 331 332 trace_stats/function<cpu> ( function0, function1, etc). 333 334 trace_stats: 335 336 A directory that holds different tracing stats. 337 338 kprobe_events: 339 340 Enable dynamic trace points. See kprobetrace.txt. 341 342 kprobe_profile: 343 344 Dynamic trace points stats. See kprobetrace.txt. 345 346 max_graph_depth: 347 348 Used with the function graph tracer. This is the max depth 349 it will trace into a function. Setting this to a value of 350 one will show only the first kernel function that is called 351 from user space. 352 353 printk_formats: 354 355 This is for tools that read the raw format files. If an event in 356 the ring buffer references a string, only a pointer to the string 357 is recorded into the buffer and not the string itself. This prevents 358 tools from knowing what that string was. This file displays the string 359 and address for the string allowing tools to map the pointers to what 360 the strings were. 361 362 saved_cmdlines: 363 364 Only the pid of the task is recorded in a trace event unless 365 the event specifically saves the task comm as well. Ftrace 366 makes a cache of pid mappings to comms to try to display 367 comms for events. If a pid for a comm is not listed, then 368 "<...>" is displayed in the output. 369 370 If the option "record-cmd" is set to "0", then comms of tasks 371 will not be saved during recording. By default, it is enabled. 372 373 saved_cmdlines_size: 374 375 By default, 128 comms are saved (see "saved_cmdlines" above). To 376 increase or decrease the amount of comms that are cached, echo 377 in a the number of comms to cache, into this file. 378 379 saved_tgids: 380 381 If the option "record-tgid" is set, on each scheduling context switch 382 the Task Group ID of a task is saved in a table mapping the PID of 383 the thread to its TGID. By default, the "record-tgid" option is 384 disabled. 385 386 snapshot: 387 388 This displays the "snapshot" buffer and also lets the user 389 take a snapshot of the current running trace. 390 See the "Snapshot" section below for more details. 391 392 stack_max_size: 393 394 When the stack tracer is activated, this will display the 395 maximum stack size it has encountered. 396 See the "Stack Trace" section below. 397 398 stack_trace: 399 400 This displays the stack back trace of the largest stack 401 that was encountered when the stack tracer is activated. 402 See the "Stack Trace" section below. 403 404 stack_trace_filter: 405 406 This is similar to "set_ftrace_filter" but it limits what 407 functions the stack tracer will check. 408 409 trace_clock: 410 411 Whenever an event is recorded into the ring buffer, a 412 "timestamp" is added. This stamp comes from a specified 413 clock. By default, ftrace uses the "local" clock. This 414 clock is very fast and strictly per cpu, but on some 415 systems it may not be monotonic with respect to other 416 CPUs. In other words, the local clocks may not be in sync 417 with local clocks on other CPUs. 418 419 Usual clocks for tracing:: 420 421 # cat trace_clock 422 [local] global counter x86-tsc 423 424 The clock with the square brackets around it is the one in effect. 425 426 local: 427 Default clock, but may not be in sync across CPUs 428 429 global: 430 This clock is in sync with all CPUs but may 431 be a bit slower than the local clock. 432 433 counter: 434 This is not a clock at all, but literally an atomic 435 counter. It counts up one by one, but is in sync 436 with all CPUs. This is useful when you need to 437 know exactly the order events occurred with respect to 438 each other on different CPUs. 439 440 uptime: 441 This uses the jiffies counter and the time stamp 442 is relative to the time since boot up. 443 444 perf: 445 This makes ftrace use the same clock that perf uses. 446 Eventually perf will be able to read ftrace buffers 447 and this will help out in interleaving the data. 448 449 x86-tsc: 450 Architectures may define their own clocks. For 451 example, x86 uses its own TSC cycle clock here. 452 453 ppc-tb: 454 This uses the powerpc timebase register value. 455 This is in sync across CPUs and can also be used 456 to correlate events across hypervisor/guest if 457 tb_offset is known. 458 459 mono: 460 This uses the fast monotonic clock (CLOCK_MONOTONIC) 461 which is monotonic and is subject to NTP rate adjustments. 462 463 mono_raw: 464 This is the raw monotonic clock (CLOCK_MONOTONIC_RAW) 465 which is montonic but is not subject to any rate adjustments 466 and ticks at the same rate as the hardware clocksource. 467 468 boot: 469 Same as mono. Used to be a separate clock which accounted 470 for the time spent in suspend while CLOCK_MONOTONIC did 471 not. 472 473 To set a clock, simply echo the clock name into this file:: 474 475 # echo global > trace_clock 476 477 trace_marker: 478 479 This is a very useful file for synchronizing user space 480 with events happening in the kernel. Writing strings into 481 this file will be written into the ftrace buffer. 482 483 It is useful in applications to open this file at the start 484 of the application and just reference the file descriptor 485 for the file:: 486 487 void trace_write(const char *fmt, ...) 488 { 489 va_list ap; 490 char buf[256]; 491 int n; 492 493 if (trace_fd < 0) 494 return; 495 496 va_start(ap, fmt); 497 n = vsnprintf(buf, 256, fmt, ap); 498 va_end(ap); 499 500 write(trace_fd, buf, n); 501 } 502 503 start:: 504 505 trace_fd = open("trace_marker", WR_ONLY); 506 507 trace_marker_raw: 508 509 This is similar to trace_marker above, but is meant for for binary data 510 to be written to it, where a tool can be used to parse the data 511 from trace_pipe_raw. 512 513 uprobe_events: 514 515 Add dynamic tracepoints in programs. 516 See uprobetracer.txt 517 518 uprobe_profile: 519 520 Uprobe statistics. See uprobetrace.txt 521 522 instances: 523 524 This is a way to make multiple trace buffers where different 525 events can be recorded in different buffers. 526 See "Instances" section below. 527 528 events: 529 530 This is the trace event directory. It holds event tracepoints 531 (also known as static tracepoints) that have been compiled 532 into the kernel. It shows what event tracepoints exist 533 and how they are grouped by system. There are "enable" 534 files at various levels that can enable the tracepoints 535 when a "1" is written to them. 536 537 See events.txt for more information. 538 539 set_event: 540 541 By echoing in the event into this file, will enable that event. 542 543 See events.txt for more information. 544 545 available_events: 546 547 A list of events that can be enabled in tracing. 548 549 See events.txt for more information. 550 551 timestamp_mode: 552 553 Certain tracers may change the timestamp mode used when 554 logging trace events into the event buffer. Events with 555 different modes can coexist within a buffer but the mode in 556 effect when an event is logged determines which timestamp mode 557 is used for that event. The default timestamp mode is 558 'delta'. 559 560 Usual timestamp modes for tracing: 561 562 # cat timestamp_mode 563 [delta] absolute 564 565 The timestamp mode with the square brackets around it is the 566 one in effect. 567 568 delta: Default timestamp mode - timestamp is a delta against 569 a per-buffer timestamp. 570 571 absolute: The timestamp is a full timestamp, not a delta 572 against some other value. As such it takes up more 573 space and is less efficient. 574 575 hwlat_detector: 576 577 Directory for the Hardware Latency Detector. 578 See "Hardware Latency Detector" section below. 579 580 per_cpu: 581 582 This is a directory that contains the trace per_cpu information. 583 584 per_cpu/cpu0/buffer_size_kb: 585 586 The ftrace buffer is defined per_cpu. That is, there's a separate 587 buffer for each CPU to allow writes to be done atomically, 588 and free from cache bouncing. These buffers may have different 589 size buffers. This file is similar to the buffer_size_kb 590 file, but it only displays or sets the buffer size for the 591 specific CPU. (here cpu0). 592 593 per_cpu/cpu0/trace: 594 595 This is similar to the "trace" file, but it will only display 596 the data specific for the CPU. If written to, it only clears 597 the specific CPU buffer. 598 599 per_cpu/cpu0/trace_pipe 600 601 This is similar to the "trace_pipe" file, and is a consuming 602 read, but it will only display (and consume) the data specific 603 for the CPU. 604 605 per_cpu/cpu0/trace_pipe_raw 606 607 For tools that can parse the ftrace ring buffer binary format, 608 the trace_pipe_raw file can be used to extract the data 609 from the ring buffer directly. With the use of the splice() 610 system call, the buffer data can be quickly transferred to 611 a file or to the network where a server is collecting the 612 data. 613 614 Like trace_pipe, this is a consuming reader, where multiple 615 reads will always produce different data. 616 617 per_cpu/cpu0/snapshot: 618 619 This is similar to the main "snapshot" file, but will only 620 snapshot the current CPU (if supported). It only displays 621 the content of the snapshot for a given CPU, and if 622 written to, only clears this CPU buffer. 623 624 per_cpu/cpu0/snapshot_raw: 625 626 Similar to the trace_pipe_raw, but will read the binary format 627 from the snapshot buffer for the given CPU. 628 629 per_cpu/cpu0/stats: 630 631 This displays certain stats about the ring buffer: 632 633 entries: 634 The number of events that are still in the buffer. 635 636 overrun: 637 The number of lost events due to overwriting when 638 the buffer was full. 639 640 commit overrun: 641 Should always be zero. 642 This gets set if so many events happened within a nested 643 event (ring buffer is re-entrant), that it fills the 644 buffer and starts dropping events. 645 646 bytes: 647 Bytes actually read (not overwritten). 648 649 oldest event ts: 650 The oldest timestamp in the buffer 651 652 now ts: 653 The current timestamp 654 655 dropped events: 656 Events lost due to overwrite option being off. 657 658 read events: 659 The number of events read. 660 661The Tracers 662----------- 663 664Here is the list of current tracers that may be configured. 665 666 "function" 667 668 Function call tracer to trace all kernel functions. 669 670 "function_graph" 671 672 Similar to the function tracer except that the 673 function tracer probes the functions on their entry 674 whereas the function graph tracer traces on both entry 675 and exit of the functions. It then provides the ability 676 to draw a graph of function calls similar to C code 677 source. 678 679 "blk" 680 681 The block tracer. The tracer used by the blktrace user 682 application. 683 684 "hwlat" 685 686 The Hardware Latency tracer is used to detect if the hardware 687 produces any latency. See "Hardware Latency Detector" section 688 below. 689 690 "irqsoff" 691 692 Traces the areas that disable interrupts and saves 693 the trace with the longest max latency. 694 See tracing_max_latency. When a new max is recorded, 695 it replaces the old trace. It is best to view this 696 trace with the latency-format option enabled, which 697 happens automatically when the tracer is selected. 698 699 "preemptoff" 700 701 Similar to irqsoff but traces and records the amount of 702 time for which preemption is disabled. 703 704 "preemptirqsoff" 705 706 Similar to irqsoff and preemptoff, but traces and 707 records the largest time for which irqs and/or preemption 708 is disabled. 709 710 "wakeup" 711 712 Traces and records the max latency that it takes for 713 the highest priority task to get scheduled after 714 it has been woken up. 715 Traces all tasks as an average developer would expect. 716 717 "wakeup_rt" 718 719 Traces and records the max latency that it takes for just 720 RT tasks (as the current "wakeup" does). This is useful 721 for those interested in wake up timings of RT tasks. 722 723 "wakeup_dl" 724 725 Traces and records the max latency that it takes for 726 a SCHED_DEADLINE task to be woken (as the "wakeup" and 727 "wakeup_rt" does). 728 729 "mmiotrace" 730 731 A special tracer that is used to trace binary module. 732 It will trace all the calls that a module makes to the 733 hardware. Everything it writes and reads from the I/O 734 as well. 735 736 "branch" 737 738 This tracer can be configured when tracing likely/unlikely 739 calls within the kernel. It will trace when a likely and 740 unlikely branch is hit and if it was correct in its prediction 741 of being correct. 742 743 "nop" 744 745 This is the "trace nothing" tracer. To remove all 746 tracers from tracing simply echo "nop" into 747 current_tracer. 748 749 750Examples of using the tracer 751---------------------------- 752 753Here are typical examples of using the tracers when controlling 754them only with the tracefs interface (without using any 755user-land utilities). 756 757Output format: 758-------------- 759 760Here is an example of the output format of the file "trace":: 761 762 # tracer: function 763 # 764 # entries-in-buffer/entries-written: 140080/250280 #P:4 765 # 766 # _-----=> irqs-off 767 # / _----=> need-resched 768 # | / _---=> hardirq/softirq 769 # || / _--=> preempt-depth 770 # ||| / delay 771 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 772 # | | | |||| | | 773 bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath 774 bash-1977 [000] .... 17284.993653: __close_fd <-sys_close 775 bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd 776 sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify 777 bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock 778 bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd 779 bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock 780 bash-1977 [000] .... 17284.993657: filp_close <-__close_fd 781 bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close 782 sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath 783 .... 784 785A header is printed with the tracer name that is represented by 786the trace. In this case the tracer is "function". Then it shows the 787number of events in the buffer as well as the total number of entries 788that were written. The difference is the number of entries that were 789lost due to the buffer filling up (250280 - 140080 = 110200 events 790lost). 791 792The header explains the content of the events. Task name "bash", the task 793PID "1977", the CPU that it was running on "000", the latency format 794(explained below), the timestamp in <secs>.<usecs> format, the 795function name that was traced "sys_close" and the parent function that 796called this function "system_call_fastpath". The timestamp is the time 797at which the function was entered. 798 799Latency trace format 800-------------------- 801 802When the latency-format option is enabled or when one of the latency 803tracers is set, the trace file gives somewhat more information to see 804why a latency happened. Here is a typical trace:: 805 806 # tracer: irqsoff 807 # 808 # irqsoff latency trace v1.1.5 on 3.8.0-test+ 809 # -------------------------------------------------------------------- 810 # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 811 # ----------------- 812 # | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0) 813 # ----------------- 814 # => started at: __lock_task_sighand 815 # => ended at: _raw_spin_unlock_irqrestore 816 # 817 # 818 # _------=> CPU# 819 # / _-----=> irqs-off 820 # | / _----=> need-resched 821 # || / _---=> hardirq/softirq 822 # ||| / _--=> preempt-depth 823 # |||| / delay 824 # cmd pid ||||| time | caller 825 # \ / ||||| \ | / 826 ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand 827 ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore 828 ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore 829 ps-6143 2d..1 306us : <stack trace> 830 => trace_hardirqs_on_caller 831 => trace_hardirqs_on 832 => _raw_spin_unlock_irqrestore 833 => do_task_stat 834 => proc_tgid_stat 835 => proc_single_show 836 => seq_read 837 => vfs_read 838 => sys_read 839 => system_call_fastpath 840 841 842This shows that the current tracer is "irqsoff" tracing the time 843for which interrupts were disabled. It gives the trace version (which 844never changes) and the version of the kernel upon which this was executed on 845(3.8). Then it displays the max latency in microseconds (259 us). The number 846of trace entries displayed and the total number (both are four: #4/4). 847VP, KP, SP, and HP are always zero and are reserved for later use. 848#P is the number of online CPUs (#P:4). 849 850The task is the process that was running when the latency 851occurred. (ps pid: 6143). 852 853The start and stop (the functions in which the interrupts were 854disabled and enabled respectively) that caused the latencies: 855 856 - __lock_task_sighand is where the interrupts were disabled. 857 - _raw_spin_unlock_irqrestore is where they were enabled again. 858 859The next lines after the header are the trace itself. The header 860explains which is which. 861 862 cmd: The name of the process in the trace. 863 864 pid: The PID of that process. 865 866 CPU#: The CPU which the process was running on. 867 868 irqs-off: 'd' interrupts are disabled. '.' otherwise. 869 .. caution:: If the architecture does not support a way to 870 read the irq flags variable, an 'X' will always 871 be printed here. 872 873 need-resched: 874 - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set, 875 - 'n' only TIF_NEED_RESCHED is set, 876 - 'p' only PREEMPT_NEED_RESCHED is set, 877 - '.' otherwise. 878 879 hardirq/softirq: 880 - 'Z' - NMI occurred inside a hardirq 881 - 'z' - NMI is running 882 - 'H' - hard irq occurred inside a softirq. 883 - 'h' - hard irq is running 884 - 's' - soft irq is running 885 - '.' - normal context. 886 887 preempt-depth: The level of preempt_disabled 888 889The above is mostly meaningful for kernel developers. 890 891 time: 892 When the latency-format option is enabled, the trace file 893 output includes a timestamp relative to the start of the 894 trace. This differs from the output when latency-format 895 is disabled, which includes an absolute timestamp. 896 897 delay: 898 This is just to help catch your eye a bit better. And 899 needs to be fixed to be only relative to the same CPU. 900 The marks are determined by the difference between this 901 current trace and the next trace. 902 903 - '$' - greater than 1 second 904 - '@' - greater than 100 milisecond 905 - '*' - greater than 10 milisecond 906 - '#' - greater than 1000 microsecond 907 - '!' - greater than 100 microsecond 908 - '+' - greater than 10 microsecond 909 - ' ' - less than or equal to 10 microsecond. 910 911 The rest is the same as the 'trace' file. 912 913 Note, the latency tracers will usually end with a back trace 914 to easily find where the latency occurred. 915 916trace_options 917------------- 918 919The trace_options file (or the options directory) is used to control 920what gets printed in the trace output, or manipulate the tracers. 921To see what is available, simply cat the file:: 922 923 cat trace_options 924 print-parent 925 nosym-offset 926 nosym-addr 927 noverbose 928 noraw 929 nohex 930 nobin 931 noblock 932 trace_printk 933 annotate 934 nouserstacktrace 935 nosym-userobj 936 noprintk-msg-only 937 context-info 938 nolatency-format 939 record-cmd 940 norecord-tgid 941 overwrite 942 nodisable_on_free 943 irq-info 944 markers 945 noevent-fork 946 function-trace 947 nofunction-fork 948 nodisplay-graph 949 nostacktrace 950 nobranch 951 952To disable one of the options, echo in the option prepended with 953"no":: 954 955 echo noprint-parent > trace_options 956 957To enable an option, leave off the "no":: 958 959 echo sym-offset > trace_options 960 961Here are the available options: 962 963 print-parent 964 On function traces, display the calling (parent) 965 function as well as the function being traced. 966 :: 967 968 print-parent: 969 bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul 970 971 noprint-parent: 972 bash-4000 [01] 1477.606694: simple_strtoul 973 974 975 sym-offset 976 Display not only the function name, but also the 977 offset in the function. For example, instead of 978 seeing just "ktime_get", you will see 979 "ktime_get+0xb/0x20". 980 :: 981 982 sym-offset: 983 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 984 985 sym-addr 986 This will also display the function address as well 987 as the function name. 988 :: 989 990 sym-addr: 991 bash-4000 [01] 1477.606694: simple_strtoul <c0339346> 992 993 verbose 994 This deals with the trace file when the 995 latency-format option is enabled. 996 :: 997 998 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ 999 (+0.000ms): simple_strtoul (kstrtoul) 1000 1001 raw 1002 This will display raw numbers. This option is best for 1003 use with user applications that can translate the raw 1004 numbers better than having it done in the kernel. 1005 1006 hex 1007 Similar to raw, but the numbers will be in a hexadecimal format. 1008 1009 bin 1010 This will print out the formats in raw binary. 1011 1012 block 1013 When set, reading trace_pipe will not block when polled. 1014 1015 trace_printk 1016 Can disable trace_printk() from writing into the buffer. 1017 1018 annotate 1019 It is sometimes confusing when the CPU buffers are full 1020 and one CPU buffer had a lot of events recently, thus 1021 a shorter time frame, were another CPU may have only had 1022 a few events, which lets it have older events. When 1023 the trace is reported, it shows the oldest events first, 1024 and it may look like only one CPU ran (the one with the 1025 oldest events). When the annotate option is set, it will 1026 display when a new CPU buffer started:: 1027 1028 <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on 1029 <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on 1030 <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore 1031 ##### CPU 2 buffer started #### 1032 <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle 1033 <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog 1034 <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock 1035 1036 userstacktrace 1037 This option changes the trace. It records a 1038 stacktrace of the current user space thread after 1039 each trace event. 1040 1041 sym-userobj 1042 when user stacktrace are enabled, look up which 1043 object the address belongs to, and print a 1044 relative address. This is especially useful when 1045 ASLR is on, otherwise you don't get a chance to 1046 resolve the address to object/file/line after 1047 the app is no longer running 1048 1049 The lookup is performed when you read 1050 trace,trace_pipe. Example:: 1051 1052 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 1053 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] 1054 1055 1056 printk-msg-only 1057 When set, trace_printk()s will only show the format 1058 and not their parameters (if trace_bprintk() or 1059 trace_bputs() was used to save the trace_printk()). 1060 1061 context-info 1062 Show only the event data. Hides the comm, PID, 1063 timestamp, CPU, and other useful data. 1064 1065 latency-format 1066 This option changes the trace output. When it is enabled, 1067 the trace displays additional information about the 1068 latency, as described in "Latency trace format". 1069 1070 record-cmd 1071 When any event or tracer is enabled, a hook is enabled 1072 in the sched_switch trace point to fill comm cache 1073 with mapped pids and comms. But this may cause some 1074 overhead, and if you only care about pids, and not the 1075 name of the task, disabling this option can lower the 1076 impact of tracing. See "saved_cmdlines". 1077 1078 record-tgid 1079 When any event or tracer is enabled, a hook is enabled 1080 in the sched_switch trace point to fill the cache of 1081 mapped Thread Group IDs (TGID) mapping to pids. See 1082 "saved_tgids". 1083 1084 overwrite 1085 This controls what happens when the trace buffer is 1086 full. If "1" (default), the oldest events are 1087 discarded and overwritten. If "0", then the newest 1088 events are discarded. 1089 (see per_cpu/cpu0/stats for overrun and dropped) 1090 1091 disable_on_free 1092 When the free_buffer is closed, tracing will 1093 stop (tracing_on set to 0). 1094 1095 irq-info 1096 Shows the interrupt, preempt count, need resched data. 1097 When disabled, the trace looks like:: 1098 1099 # tracer: function 1100 # 1101 # entries-in-buffer/entries-written: 144405/9452052 #P:4 1102 # 1103 # TASK-PID CPU# TIMESTAMP FUNCTION 1104 # | | | | | 1105 <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up 1106 <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89 1107 <idle>-0 [002] 23636.756055: enqueue_task <-activate_task 1108 1109 1110 markers 1111 When set, the trace_marker is writable (only by root). 1112 When disabled, the trace_marker will error with EINVAL 1113 on write. 1114 1115 event-fork 1116 When set, tasks with PIDs listed in set_event_pid will have 1117 the PIDs of their children added to set_event_pid when those 1118 tasks fork. Also, when tasks with PIDs in set_event_pid exit, 1119 their PIDs will be removed from the file. 1120 1121 function-trace 1122 The latency tracers will enable function tracing 1123 if this option is enabled (default it is). When 1124 it is disabled, the latency tracers do not trace 1125 functions. This keeps the overhead of the tracer down 1126 when performing latency tests. 1127 1128 function-fork 1129 When set, tasks with PIDs listed in set_ftrace_pid will 1130 have the PIDs of their children added to set_ftrace_pid 1131 when those tasks fork. Also, when tasks with PIDs in 1132 set_ftrace_pid exit, their PIDs will be removed from the 1133 file. 1134 1135 display-graph 1136 When set, the latency tracers (irqsoff, wakeup, etc) will 1137 use function graph tracing instead of function tracing. 1138 1139 stacktrace 1140 When set, a stack trace is recorded after any trace event 1141 is recorded. 1142 1143 branch 1144 Enable branch tracing with the tracer. This enables branch 1145 tracer along with the currently set tracer. Enabling this 1146 with the "nop" tracer is the same as just enabling the 1147 "branch" tracer. 1148 1149.. tip:: Some tracers have their own options. They only appear in this 1150 file when the tracer is active. They always appear in the 1151 options directory. 1152 1153 1154Here are the per tracer options: 1155 1156Options for function tracer: 1157 1158 func_stack_trace 1159 When set, a stack trace is recorded after every 1160 function that is recorded. NOTE! Limit the functions 1161 that are recorded before enabling this, with 1162 "set_ftrace_filter" otherwise the system performance 1163 will be critically degraded. Remember to disable 1164 this option before clearing the function filter. 1165 1166Options for function_graph tracer: 1167 1168 Since the function_graph tracer has a slightly different output 1169 it has its own options to control what is displayed. 1170 1171 funcgraph-overrun 1172 When set, the "overrun" of the graph stack is 1173 displayed after each function traced. The 1174 overrun, is when the stack depth of the calls 1175 is greater than what is reserved for each task. 1176 Each task has a fixed array of functions to 1177 trace in the call graph. If the depth of the 1178 calls exceeds that, the function is not traced. 1179 The overrun is the number of functions missed 1180 due to exceeding this array. 1181 1182 funcgraph-cpu 1183 When set, the CPU number of the CPU where the trace 1184 occurred is displayed. 1185 1186 funcgraph-overhead 1187 When set, if the function takes longer than 1188 A certain amount, then a delay marker is 1189 displayed. See "delay" above, under the 1190 header description. 1191 1192 funcgraph-proc 1193 Unlike other tracers, the process' command line 1194 is not displayed by default, but instead only 1195 when a task is traced in and out during a context 1196 switch. Enabling this options has the command 1197 of each process displayed at every line. 1198 1199 funcgraph-duration 1200 At the end of each function (the return) 1201 the duration of the amount of time in the 1202 function is displayed in microseconds. 1203 1204 funcgraph-abstime 1205 When set, the timestamp is displayed at each line. 1206 1207 funcgraph-irqs 1208 When disabled, functions that happen inside an 1209 interrupt will not be traced. 1210 1211 funcgraph-tail 1212 When set, the return event will include the function 1213 that it represents. By default this is off, and 1214 only a closing curly bracket "}" is displayed for 1215 the return of a function. 1216 1217 sleep-time 1218 When running function graph tracer, to include 1219 the time a task schedules out in its function. 1220 When enabled, it will account time the task has been 1221 scheduled out as part of the function call. 1222 1223 graph-time 1224 When running function profiler with function graph tracer, 1225 to include the time to call nested functions. When this is 1226 not set, the time reported for the function will only 1227 include the time the function itself executed for, not the 1228 time for functions that it called. 1229 1230Options for blk tracer: 1231 1232 blk_classic 1233 Shows a more minimalistic output. 1234 1235 1236irqsoff 1237------- 1238 1239When interrupts are disabled, the CPU can not react to any other 1240external event (besides NMIs and SMIs). This prevents the timer 1241interrupt from triggering or the mouse interrupt from letting 1242the kernel know of a new mouse event. The result is a latency 1243with the reaction time. 1244 1245The irqsoff tracer tracks the time for which interrupts are 1246disabled. When a new maximum latency is hit, the tracer saves 1247the trace leading up to that latency point so that every time a 1248new maximum is reached, the old saved trace is discarded and the 1249new trace is saved. 1250 1251To reset the maximum, echo 0 into tracing_max_latency. Here is 1252an example:: 1253 1254 # echo 0 > options/function-trace 1255 # echo irqsoff > current_tracer 1256 # echo 1 > tracing_on 1257 # echo 0 > tracing_max_latency 1258 # ls -ltr 1259 [...] 1260 # echo 0 > tracing_on 1261 # cat trace 1262 # tracer: irqsoff 1263 # 1264 # irqsoff latency trace v1.1.5 on 3.8.0-test+ 1265 # -------------------------------------------------------------------- 1266 # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1267 # ----------------- 1268 # | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0) 1269 # ----------------- 1270 # => started at: run_timer_softirq 1271 # => ended at: run_timer_softirq 1272 # 1273 # 1274 # _------=> CPU# 1275 # / _-----=> irqs-off 1276 # | / _----=> need-resched 1277 # || / _---=> hardirq/softirq 1278 # ||| / _--=> preempt-depth 1279 # |||| / delay 1280 # cmd pid ||||| time | caller 1281 # \ / ||||| \ | / 1282 <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq 1283 <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq 1284 <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq 1285 <idle>-0 0dNs3 25us : <stack trace> 1286 => _raw_spin_unlock_irq 1287 => run_timer_softirq 1288 => __do_softirq 1289 => call_softirq 1290 => do_softirq 1291 => irq_exit 1292 => smp_apic_timer_interrupt 1293 => apic_timer_interrupt 1294 => rcu_idle_exit 1295 => cpu_idle 1296 => rest_init 1297 => start_kernel 1298 => x86_64_start_reservations 1299 => x86_64_start_kernel 1300 1301Here we see that that we had a latency of 16 microseconds (which is 1302very good). The _raw_spin_lock_irq in run_timer_softirq disabled 1303interrupts. The difference between the 16 and the displayed 1304timestamp 25us occurred because the clock was incremented 1305between the time of recording the max latency and the time of 1306recording the function that had that latency. 1307 1308Note the above example had function-trace not set. If we set 1309function-trace, we get a much larger output:: 1310 1311 with echo 1 > options/function-trace 1312 1313 # tracer: irqsoff 1314 # 1315 # irqsoff latency trace v1.1.5 on 3.8.0-test+ 1316 # -------------------------------------------------------------------- 1317 # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1318 # ----------------- 1319 # | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0) 1320 # ----------------- 1321 # => started at: ata_scsi_queuecmd 1322 # => ended at: ata_scsi_queuecmd 1323 # 1324 # 1325 # _------=> CPU# 1326 # / _-----=> irqs-off 1327 # | / _----=> need-resched 1328 # || / _---=> hardirq/softirq 1329 # ||| / _--=> preempt-depth 1330 # |||| / delay 1331 # cmd pid ||||| time | caller 1332 # \ / ||||| \ | / 1333 bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd 1334 bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave 1335 bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd 1336 bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev 1337 bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev 1338 bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd 1339 bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd 1340 bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd 1341 bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat 1342 [...] 1343 bash-2042 3d..1 67us : delay_tsc <-__delay 1344 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc 1345 bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc 1346 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc 1347 bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc 1348 bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue 1349 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd 1350 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd 1351 bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd 1352 bash-2042 3d..1 120us : <stack trace> 1353 => _raw_spin_unlock_irqrestore 1354 => ata_scsi_queuecmd 1355 => scsi_dispatch_cmd 1356 => scsi_request_fn 1357 => __blk_run_queue_uncond 1358 => __blk_run_queue 1359 => blk_queue_bio 1360 => generic_make_request 1361 => submit_bio 1362 => submit_bh 1363 => __ext3_get_inode_loc 1364 => ext3_iget 1365 => ext3_lookup 1366 => lookup_real 1367 => __lookup_hash 1368 => walk_component 1369 => lookup_last 1370 => path_lookupat 1371 => filename_lookup 1372 => user_path_at_empty 1373 => user_path_at 1374 => vfs_fstatat 1375 => vfs_stat 1376 => sys_newstat 1377 => system_call_fastpath 1378 1379 1380Here we traced a 71 microsecond latency. But we also see all the 1381functions that were called during that time. Note that by 1382enabling function tracing, we incur an added overhead. This 1383overhead may extend the latency times. But nevertheless, this 1384trace has provided some very helpful debugging information. 1385 1386 1387preemptoff 1388---------- 1389 1390When preemption is disabled, we may be able to receive 1391interrupts but the task cannot be preempted and a higher 1392priority task must wait for preemption to be enabled again 1393before it can preempt a lower priority task. 1394 1395The preemptoff tracer traces the places that disable preemption. 1396Like the irqsoff tracer, it records the maximum latency for 1397which preemption was disabled. The control of preemptoff tracer 1398is much like the irqsoff tracer. 1399:: 1400 1401 # echo 0 > options/function-trace 1402 # echo preemptoff > current_tracer 1403 # echo 1 > tracing_on 1404 # echo 0 > tracing_max_latency 1405 # ls -ltr 1406 [...] 1407 # echo 0 > tracing_on 1408 # cat trace 1409 # tracer: preemptoff 1410 # 1411 # preemptoff latency trace v1.1.5 on 3.8.0-test+ 1412 # -------------------------------------------------------------------- 1413 # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1414 # ----------------- 1415 # | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0) 1416 # ----------------- 1417 # => started at: do_IRQ 1418 # => ended at: do_IRQ 1419 # 1420 # 1421 # _------=> CPU# 1422 # / _-----=> irqs-off 1423 # | / _----=> need-resched 1424 # || / _---=> hardirq/softirq 1425 # ||| / _--=> preempt-depth 1426 # |||| / delay 1427 # cmd pid ||||| time | caller 1428 # \ / ||||| \ | / 1429 sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ 1430 sshd-1991 1d..1 46us : irq_exit <-do_IRQ 1431 sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ 1432 sshd-1991 1d..1 52us : <stack trace> 1433 => sub_preempt_count 1434 => irq_exit 1435 => do_IRQ 1436 => ret_from_intr 1437 1438 1439This has some more changes. Preemption was disabled when an 1440interrupt came in (notice the 'h'), and was enabled on exit. 1441But we also see that interrupts have been disabled when entering 1442the preempt off section and leaving it (the 'd'). We do not know if 1443interrupts were enabled in the mean time or shortly after this 1444was over. 1445:: 1446 1447 # tracer: preemptoff 1448 # 1449 # preemptoff latency trace v1.1.5 on 3.8.0-test+ 1450 # -------------------------------------------------------------------- 1451 # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1452 # ----------------- 1453 # | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0) 1454 # ----------------- 1455 # => started at: wake_up_new_task 1456 # => ended at: task_rq_unlock 1457 # 1458 # 1459 # _------=> CPU# 1460 # / _-----=> irqs-off 1461 # | / _----=> need-resched 1462 # || / _---=> hardirq/softirq 1463 # ||| / _--=> preempt-depth 1464 # |||| / delay 1465 # cmd pid ||||| time | caller 1466 # \ / ||||| \ | / 1467 bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task 1468 bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq 1469 bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair 1470 bash-1994 1d..1 1us : source_load <-select_task_rq_fair 1471 bash-1994 1d..1 1us : source_load <-select_task_rq_fair 1472 [...] 1473 bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt 1474 bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter 1475 bash-1994 1d..1 13us : add_preempt_count <-irq_enter 1476 bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt 1477 bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt 1478 bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt 1479 bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock 1480 bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt 1481 [...] 1482 bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event 1483 bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt 1484 bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit 1485 bash-1994 1d..2 36us : do_softirq <-irq_exit 1486 bash-1994 1d..2 36us : __do_softirq <-call_softirq 1487 bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq 1488 bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq 1489 bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq 1490 bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock 1491 bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq 1492 [...] 1493 bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks 1494 bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq 1495 bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable 1496 bash-1994 1dN.2 82us : idle_cpu <-irq_exit 1497 bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit 1498 bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit 1499 bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock 1500 bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock 1501 bash-1994 1.N.1 104us : <stack trace> 1502 => sub_preempt_count 1503 => _raw_spin_unlock_irqrestore 1504 => task_rq_unlock 1505 => wake_up_new_task 1506 => do_fork 1507 => sys_clone 1508 => stub_clone 1509 1510 1511The above is an example of the preemptoff trace with 1512function-trace set. Here we see that interrupts were not disabled 1513the entire time. The irq_enter code lets us know that we entered 1514an interrupt 'h'. Before that, the functions being traced still 1515show that it is not in an interrupt, but we can see from the 1516functions themselves that this is not the case. 1517 1518preemptirqsoff 1519-------------- 1520 1521Knowing the locations that have interrupts disabled or 1522preemption disabled for the longest times is helpful. But 1523sometimes we would like to know when either preemption and/or 1524interrupts are disabled. 1525 1526Consider the following code:: 1527 1528 local_irq_disable(); 1529 call_function_with_irqs_off(); 1530 preempt_disable(); 1531 call_function_with_irqs_and_preemption_off(); 1532 local_irq_enable(); 1533 call_function_with_preemption_off(); 1534 preempt_enable(); 1535 1536The irqsoff tracer will record the total length of 1537call_function_with_irqs_off() and 1538call_function_with_irqs_and_preemption_off(). 1539 1540The preemptoff tracer will record the total length of 1541call_function_with_irqs_and_preemption_off() and 1542call_function_with_preemption_off(). 1543 1544But neither will trace the time that interrupts and/or 1545preemption is disabled. This total time is the time that we can 1546not schedule. To record this time, use the preemptirqsoff 1547tracer. 1548 1549Again, using this trace is much like the irqsoff and preemptoff 1550tracers. 1551:: 1552 1553 # echo 0 > options/function-trace 1554 # echo preemptirqsoff > current_tracer 1555 # echo 1 > tracing_on 1556 # echo 0 > tracing_max_latency 1557 # ls -ltr 1558 [...] 1559 # echo 0 > tracing_on 1560 # cat trace 1561 # tracer: preemptirqsoff 1562 # 1563 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ 1564 # -------------------------------------------------------------------- 1565 # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1566 # ----------------- 1567 # | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0) 1568 # ----------------- 1569 # => started at: ata_scsi_queuecmd 1570 # => ended at: ata_scsi_queuecmd 1571 # 1572 # 1573 # _------=> CPU# 1574 # / _-----=> irqs-off 1575 # | / _----=> need-resched 1576 # || / _---=> hardirq/softirq 1577 # ||| / _--=> preempt-depth 1578 # |||| / delay 1579 # cmd pid ||||| time | caller 1580 # \ / ||||| \ | / 1581 ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd 1582 ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd 1583 ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd 1584 ls-2230 3...1 111us : <stack trace> 1585 => sub_preempt_count 1586 => _raw_spin_unlock_irqrestore 1587 => ata_scsi_queuecmd 1588 => scsi_dispatch_cmd 1589 => scsi_request_fn 1590 => __blk_run_queue_uncond 1591 => __blk_run_queue 1592 => blk_queue_bio 1593 => generic_make_request 1594 => submit_bio 1595 => submit_bh 1596 => ext3_bread 1597 => ext3_dir_bread 1598 => htree_dirblock_to_tree 1599 => ext3_htree_fill_tree 1600 => ext3_readdir 1601 => vfs_readdir 1602 => sys_getdents 1603 => system_call_fastpath 1604 1605 1606The trace_hardirqs_off_thunk is called from assembly on x86 when 1607interrupts are disabled in the assembly code. Without the 1608function tracing, we do not know if interrupts were enabled 1609within the preemption points. We do see that it started with 1610preemption enabled. 1611 1612Here is a trace with function-trace set:: 1613 1614 # tracer: preemptirqsoff 1615 # 1616 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ 1617 # -------------------------------------------------------------------- 1618 # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1619 # ----------------- 1620 # | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0) 1621 # ----------------- 1622 # => started at: schedule 1623 # => ended at: mutex_unlock 1624 # 1625 # 1626 # _------=> CPU# 1627 # / _-----=> irqs-off 1628 # | / _----=> need-resched 1629 # || / _---=> hardirq/softirq 1630 # ||| / _--=> preempt-depth 1631 # |||| / delay 1632 # cmd pid ||||| time | caller 1633 # \ / ||||| \ | / 1634 kworker/-59 3...1 0us : __schedule <-schedule 1635 kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch 1636 kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq 1637 kworker/-59 3d..2 1us : deactivate_task <-__schedule 1638 kworker/-59 3d..2 1us : dequeue_task <-deactivate_task 1639 kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task 1640 kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task 1641 kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair 1642 kworker/-59 3d..2 2us : update_min_vruntime <-update_curr 1643 kworker/-59 3d..2 3us : cpuacct_charge <-update_curr 1644 kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge 1645 kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge 1646 kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair 1647 kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair 1648 kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair 1649 kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair 1650 kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair 1651 kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair 1652 kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule 1653 kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping 1654 kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule 1655 kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task 1656 kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair 1657 kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair 1658 kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity 1659 ls-2269 3d..2 7us : finish_task_switch <-__schedule 1660 ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch 1661 ls-2269 3d..2 8us : do_IRQ <-ret_from_intr 1662 ls-2269 3d..2 8us : irq_enter <-do_IRQ 1663 ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter 1664 ls-2269 3d..2 9us : add_preempt_count <-irq_enter 1665 ls-2269 3d.h2 9us : exit_idle <-do_IRQ 1666 [...] 1667 ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock 1668 ls-2269 3d.h2 20us : irq_exit <-do_IRQ 1669 ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit 1670 ls-2269 3d..3 21us : do_softirq <-irq_exit 1671 ls-2269 3d..3 21us : __do_softirq <-call_softirq 1672 ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq 1673 ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip 1674 ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip 1675 ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr 1676 ls-2269 3d.s5 31us : irq_enter <-do_IRQ 1677 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter 1678 [...] 1679 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter 1680 ls-2269 3d.s5 32us : add_preempt_count <-irq_enter 1681 ls-2269 3d.H5 32us : exit_idle <-do_IRQ 1682 ls-2269 3d.H5 32us : handle_irq <-do_IRQ 1683 ls-2269 3d.H5 32us : irq_to_desc <-handle_irq 1684 ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq 1685 [...] 1686 ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll 1687 ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action 1688 ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq 1689 ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable 1690 ls-2269 3d..3 159us : idle_cpu <-irq_exit 1691 ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit 1692 ls-2269 3d..3 160us : sub_preempt_count <-irq_exit 1693 ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock 1694 ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock 1695 ls-2269 3d... 186us : <stack trace> 1696 => __mutex_unlock_slowpath 1697 => mutex_unlock 1698 => process_output 1699 => n_tty_write 1700 => tty_write 1701 => vfs_write 1702 => sys_write 1703 => system_call_fastpath 1704 1705This is an interesting trace. It started with kworker running and 1706scheduling out and ls taking over. But as soon as ls released the 1707rq lock and enabled interrupts (but not preemption) an interrupt 1708triggered. When the interrupt finished, it started running softirqs. 1709But while the softirq was running, another interrupt triggered. 1710When an interrupt is running inside a softirq, the annotation is 'H'. 1711 1712 1713wakeup 1714------ 1715 1716One common case that people are interested in tracing is the 1717time it takes for a task that is woken to actually wake up. 1718Now for non Real-Time tasks, this can be arbitrary. But tracing 1719it none the less can be interesting. 1720 1721Without function tracing:: 1722 1723 # echo 0 > options/function-trace 1724 # echo wakeup > current_tracer 1725 # echo 1 > tracing_on 1726 # echo 0 > tracing_max_latency 1727 # chrt -f 5 sleep 1 1728 # echo 0 > tracing_on 1729 # cat trace 1730 # tracer: wakeup 1731 # 1732 # wakeup latency trace v1.1.5 on 3.8.0-test+ 1733 # -------------------------------------------------------------------- 1734 # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1735 # ----------------- 1736 # | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0) 1737 # ----------------- 1738 # 1739 # _------=> CPU# 1740 # / _-----=> irqs-off 1741 # | / _----=> need-resched 1742 # || / _---=> hardirq/softirq 1743 # ||| / _--=> preempt-depth 1744 # |||| / delay 1745 # cmd pid ||||| time | caller 1746 # \ / ||||| \ | / 1747 <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H 1748 <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up 1749 <idle>-0 3d..3 15us : __schedule <-schedule 1750 <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H 1751 1752The tracer only traces the highest priority task in the system 1753to avoid tracing the normal circumstances. Here we see that 1754the kworker with a nice priority of -20 (not very nice), took 1755just 15 microseconds from the time it woke up, to the time it 1756ran. 1757 1758Non Real-Time tasks are not that interesting. A more interesting 1759trace is to concentrate only on Real-Time tasks. 1760 1761wakeup_rt 1762--------- 1763 1764In a Real-Time environment it is very important to know the 1765wakeup time it takes for the highest priority task that is woken 1766up to the time that it executes. This is also known as "schedule 1767latency". I stress the point that this is about RT tasks. It is 1768also important to know the scheduling latency of non-RT tasks, 1769but the average schedule latency is better for non-RT tasks. 1770Tools like LatencyTop are more appropriate for such 1771measurements. 1772 1773Real-Time environments are interested in the worst case latency. 1774That is the longest latency it takes for something to happen, 1775and not the average. We can have a very fast scheduler that may 1776only have a large latency once in a while, but that would not 1777work well with Real-Time tasks. The wakeup_rt tracer was designed 1778to record the worst case wakeups of RT tasks. Non-RT tasks are 1779not recorded because the tracer only records one worst case and 1780tracing non-RT tasks that are unpredictable will overwrite the 1781worst case latency of RT tasks (just run the normal wakeup 1782tracer for a while to see that effect). 1783 1784Since this tracer only deals with RT tasks, we will run this 1785slightly differently than we did with the previous tracers. 1786Instead of performing an 'ls', we will run 'sleep 1' under 1787'chrt' which changes the priority of the task. 1788:: 1789 1790 # echo 0 > options/function-trace 1791 # echo wakeup_rt > current_tracer 1792 # echo 1 > tracing_on 1793 # echo 0 > tracing_max_latency 1794 # chrt -f 5 sleep 1 1795 # echo 0 > tracing_on 1796 # cat trace 1797 # tracer: wakeup 1798 # 1799 # tracer: wakeup_rt 1800 # 1801 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ 1802 # -------------------------------------------------------------------- 1803 # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1804 # ----------------- 1805 # | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5) 1806 # ----------------- 1807 # 1808 # _------=> CPU# 1809 # / _-----=> irqs-off 1810 # | / _----=> need-resched 1811 # || / _---=> hardirq/softirq 1812 # ||| / _--=> preempt-depth 1813 # |||| / delay 1814 # cmd pid ||||| time | caller 1815 # \ / ||||| \ | / 1816 <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep 1817 <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up 1818 <idle>-0 3d..3 5us : __schedule <-schedule 1819 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep 1820 1821 1822Running this on an idle system, we see that it only took 5 microseconds 1823to perform the task switch. Note, since the trace point in the schedule 1824is before the actual "switch", we stop the tracing when the recorded task 1825is about to schedule in. This may change if we add a new marker at the 1826end of the scheduler. 1827 1828Notice that the recorded task is 'sleep' with the PID of 2389 1829and it has an rt_prio of 5. This priority is user-space priority 1830and not the internal kernel priority. The policy is 1 for 1831SCHED_FIFO and 2 for SCHED_RR. 1832 1833Note, that the trace data shows the internal priority (99 - rtprio). 1834:: 1835 1836 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep 1837 1838The 0:120:R means idle was running with a nice priority of 0 (120 - 120) 1839and in the running state 'R'. The sleep task was scheduled in with 18402389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94) 1841and it too is in the running state. 1842 1843Doing the same with chrt -r 5 and function-trace set. 1844:: 1845 1846 echo 1 > options/function-trace 1847 1848 # tracer: wakeup_rt 1849 # 1850 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ 1851 # -------------------------------------------------------------------- 1852 # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1853 # ----------------- 1854 # | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5) 1855 # ----------------- 1856 # 1857 # _------=> CPU# 1858 # / _-----=> irqs-off 1859 # | / _----=> need-resched 1860 # || / _---=> hardirq/softirq 1861 # ||| / _--=> preempt-depth 1862 # |||| / delay 1863 # cmd pid ||||| time | caller 1864 # \ / ||||| \ | / 1865 <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep 1866 <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up 1867 <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup 1868 <idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr 1869 <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup 1870 <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up 1871 <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock 1872 <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up 1873 <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up 1874 <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore 1875 <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer 1876 <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock 1877 <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt 1878 <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock 1879 <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt 1880 <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event 1881 <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event 1882 <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event 1883 <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt 1884 <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit 1885 <idle>-0 3dN.2 9us : idle_cpu <-irq_exit 1886 <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit 1887 <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit 1888 <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit 1889 <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle 1890 <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit 1891 <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle 1892 <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit 1893 <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit 1894 <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit 1895 <idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit 1896 <idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz 1897 <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock 1898 <idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz 1899 <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update 1900 <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz 1901 <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock 1902 <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit 1903 <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit 1904 <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit 1905 <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel 1906 <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel 1907 <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 1908 <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave 1909 <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16 1910 <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer 1911 <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram 1912 <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event 1913 <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event 1914 <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event 1915 <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel 1916 <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore 1917 <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit 1918 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward 1919 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward 1920 <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 1921 <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns 1922 <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns 1923 <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 1924 <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave 1925 <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns 1926 <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns 1927 <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns 1928 <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event 1929 <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event 1930 <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event 1931 <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns 1932 <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore 1933 <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit 1934 <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks 1935 <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle 1936 <idle>-0 3.N.. 25us : schedule <-cpu_idle 1937 <idle>-0 3.N.. 25us : __schedule <-preempt_schedule 1938 <idle>-0 3.N.. 26us : add_preempt_count <-__schedule 1939 <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule 1940 <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch 1941 <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch 1942 <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule 1943 <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq 1944 <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule 1945 <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task 1946 <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task 1947 <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt 1948 <idle>-0 3d..3 29us : __schedule <-preempt_schedule 1949 <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep 1950 1951This isn't that big of a trace, even with function tracing enabled, 1952so I included the entire trace. 1953 1954The interrupt went off while when the system was idle. Somewhere 1955before task_woken_rt() was called, the NEED_RESCHED flag was set, 1956this is indicated by the first occurrence of the 'N' flag. 1957 1958Latency tracing and events 1959-------------------------- 1960As function tracing can induce a much larger latency, but without 1961seeing what happens within the latency it is hard to know what 1962caused it. There is a middle ground, and that is with enabling 1963events. 1964:: 1965 1966 # echo 0 > options/function-trace 1967 # echo wakeup_rt > current_tracer 1968 # echo 1 > events/enable 1969 # echo 1 > tracing_on 1970 # echo 0 > tracing_max_latency 1971 # chrt -f 5 sleep 1 1972 # echo 0 > tracing_on 1973 # cat trace 1974 # tracer: wakeup_rt 1975 # 1976 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ 1977 # -------------------------------------------------------------------- 1978 # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1979 # ----------------- 1980 # | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5) 1981 # ----------------- 1982 # 1983 # _------=> CPU# 1984 # / _-----=> irqs-off 1985 # | / _----=> need-resched 1986 # || / _---=> hardirq/softirq 1987 # ||| / _--=> preempt-depth 1988 # |||| / delay 1989 # cmd pid ||||| time | caller 1990 # \ / ||||| \ | / 1991 <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep 1992 <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up 1993 <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002 1994 <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8 1995 <idle>-0 2.N.2 2us : power_end: cpu_id=2 1996 <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2 1997 <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0 1998 <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000 1999 <idle>-0 2.N.2 5us : rcu_utilization: Start context switch 2000 <idle>-0 2.N.2 5us : rcu_utilization: End context switch 2001 <idle>-0 2d..3 6us : __schedule <-schedule 2002 <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep 2003 2004 2005Hardware Latency Detector 2006------------------------- 2007 2008The hardware latency detector is executed by enabling the "hwlat" tracer. 2009 2010NOTE, this tracer will affect the performance of the system as it will 2011periodically make a CPU constantly busy with interrupts disabled. 2012:: 2013 2014 # echo hwlat > current_tracer 2015 # sleep 100 2016 # cat trace 2017 # tracer: hwlat 2018 # 2019 # _-----=> irqs-off 2020 # / _----=> need-resched 2021 # | / _---=> hardirq/softirq 2022 # || / _--=> preempt-depth 2023 # ||| / delay 2024 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 2025 # | | | |||| | | 2026 <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940 2027 <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365 2028 <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062 2029 <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633 2030 <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961 2031 <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1 2032 2033 2034The above output is somewhat the same in the header. All events will have 2035interrupts disabled 'd'. Under the FUNCTION title there is: 2036 2037 #1 2038 This is the count of events recorded that were greater than the 2039 tracing_threshold (See below). 2040 2041 inner/outer(us): 12/14 2042 2043 This shows two numbers as "inner latency" and "outer latency". The test 2044 runs in a loop checking a timestamp twice. The latency detected within 2045 the two timestamps is the "inner latency" and the latency detected 2046 after the previous timestamp and the next timestamp in the loop is 2047 the "outer latency". 2048 2049 ts:1499801089.066141940 2050 2051 The absolute timestamp that the event happened. 2052 2053 nmi-total:4 nmi-count:1 2054 2055 On architectures that support it, if an NMI comes in during the 2056 test, the time spent in NMI is reported in "nmi-total" (in 2057 microseconds). 2058 2059 All architectures that have NMIs will show the "nmi-count" if an 2060 NMI comes in during the test. 2061 2062hwlat files: 2063 2064 tracing_threshold 2065 This gets automatically set to "10" to represent 10 2066 microseconds. This is the threshold of latency that 2067 needs to be detected before the trace will be recorded. 2068 2069 Note, when hwlat tracer is finished (another tracer is 2070 written into "current_tracer"), the original value for 2071 tracing_threshold is placed back into this file. 2072 2073 hwlat_detector/width 2074 The length of time the test runs with interrupts disabled. 2075 2076 hwlat_detector/window 2077 The length of time of the window which the test 2078 runs. That is, the test will run for "width" 2079 microseconds per "window" microseconds 2080 2081 tracing_cpumask 2082 When the test is started. A kernel thread is created that 2083 runs the test. This thread will alternate between CPUs 2084 listed in the tracing_cpumask between each period 2085 (one "window"). To limit the test to specific CPUs 2086 set the mask in this file to only the CPUs that the test 2087 should run on. 2088 2089function 2090-------- 2091 2092This tracer is the function tracer. Enabling the function tracer 2093can be done from the debug file system. Make sure the 2094ftrace_enabled is set; otherwise this tracer is a nop. 2095See the "ftrace_enabled" section below. 2096:: 2097 2098 # sysctl kernel.ftrace_enabled=1 2099 # echo function > current_tracer 2100 # echo 1 > tracing_on 2101 # usleep 1 2102 # echo 0 > tracing_on 2103 # cat trace 2104 # tracer: function 2105 # 2106 # entries-in-buffer/entries-written: 24799/24799 #P:4 2107 # 2108 # _-----=> irqs-off 2109 # / _----=> need-resched 2110 # | / _---=> hardirq/softirq 2111 # || / _--=> preempt-depth 2112 # ||| / delay 2113 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 2114 # | | | |||| | | 2115 bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write 2116 bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock 2117 bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify 2118 bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify 2119 bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify 2120 bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock 2121 bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock 2122 bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify 2123 [...] 2124 2125 2126Note: function tracer uses ring buffers to store the above 2127entries. The newest data may overwrite the oldest data. 2128Sometimes using echo to stop the trace is not sufficient because 2129the tracing could have overwritten the data that you wanted to 2130record. For this reason, it is sometimes better to disable 2131tracing directly from a program. This allows you to stop the 2132tracing at the point that you hit the part that you are 2133interested in. To disable the tracing directly from a C program, 2134something like following code snippet can be used:: 2135 2136 int trace_fd; 2137 [...] 2138 int main(int argc, char *argv[]) { 2139 [...] 2140 trace_fd = open(tracing_file("tracing_on"), O_WRONLY); 2141 [...] 2142 if (condition_hit()) { 2143 write(trace_fd, "0", 1); 2144 } 2145 [...] 2146 } 2147 2148 2149Single thread tracing 2150--------------------- 2151 2152By writing into set_ftrace_pid you can trace a 2153single thread. For example:: 2154 2155 # cat set_ftrace_pid 2156 no pid 2157 # echo 3111 > set_ftrace_pid 2158 # cat set_ftrace_pid 2159 3111 2160 # echo function > current_tracer 2161 # cat trace | head 2162 # tracer: function 2163 # 2164 # TASK-PID CPU# TIMESTAMP FUNCTION 2165 # | | | | | 2166 yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return 2167 yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range 2168 yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel 2169 yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel 2170 yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll 2171 yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll 2172 # echo > set_ftrace_pid 2173 # cat trace |head 2174 # tracer: function 2175 # 2176 # TASK-PID CPU# TIMESTAMP FUNCTION 2177 # | | | | | 2178 ##### CPU 3 buffer started #### 2179 yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait 2180 yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry 2181 yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry 2182 yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit 2183 yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit 2184 2185If you want to trace a function when executing, you could use 2186something like this simple program. 2187:: 2188 2189 #include <stdio.h> 2190 #include <stdlib.h> 2191 #include <sys/types.h> 2192 #include <sys/stat.h> 2193 #include <fcntl.h> 2194 #include <unistd.h> 2195 #include <string.h> 2196 2197 #define _STR(x) #x 2198 #define STR(x) _STR(x) 2199 #define MAX_PATH 256 2200 2201 const char *find_tracefs(void) 2202 { 2203 static char tracefs[MAX_PATH+1]; 2204 static int tracefs_found; 2205 char type[100]; 2206 FILE *fp; 2207 2208 if (tracefs_found) 2209 return tracefs; 2210 2211 if ((fp = fopen("/proc/mounts","r")) == NULL) { 2212 perror("/proc/mounts"); 2213 return NULL; 2214 } 2215 2216 while (fscanf(fp, "%*s %" 2217 STR(MAX_PATH) 2218 "s %99s %*s %*d %*d\n", 2219 tracefs, type) == 2) { 2220 if (strcmp(type, "tracefs") == 0) 2221 break; 2222 } 2223 fclose(fp); 2224 2225 if (strcmp(type, "tracefs") != 0) { 2226 fprintf(stderr, "tracefs not mounted"); 2227 return NULL; 2228 } 2229 2230 strcat(tracefs, "/tracing/"); 2231 tracefs_found = 1; 2232 2233 return tracefs; 2234 } 2235 2236 const char *tracing_file(const char *file_name) 2237 { 2238 static char trace_file[MAX_PATH+1]; 2239 snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name); 2240 return trace_file; 2241 } 2242 2243 int main (int argc, char **argv) 2244 { 2245 if (argc < 1) 2246 exit(-1); 2247 2248 if (fork() > 0) { 2249 int fd, ffd; 2250 char line[64]; 2251 int s; 2252 2253 ffd = open(tracing_file("current_tracer"), O_WRONLY); 2254 if (ffd < 0) 2255 exit(-1); 2256 write(ffd, "nop", 3); 2257 2258 fd = open(tracing_file("set_ftrace_pid"), O_WRONLY); 2259 s = sprintf(line, "%d\n", getpid()); 2260 write(fd, line, s); 2261 2262 write(ffd, "function", 8); 2263 2264 close(fd); 2265 close(ffd); 2266 2267 execvp(argv[1], argv+1); 2268 } 2269 2270 return 0; 2271 } 2272 2273Or this simple script! 2274:: 2275 2276 #!/bin/bash 2277 2278 tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts` 2279 echo nop > $tracefs/tracing/current_tracer 2280 echo 0 > $tracefs/tracing/tracing_on 2281 echo $$ > $tracefs/tracing/set_ftrace_pid 2282 echo function > $tracefs/tracing/current_tracer 2283 echo 1 > $tracefs/tracing/tracing_on 2284 exec "$@" 2285 2286 2287function graph tracer 2288--------------------------- 2289 2290This tracer is similar to the function tracer except that it 2291probes a function on its entry and its exit. This is done by 2292using a dynamically allocated stack of return addresses in each 2293task_struct. On function entry the tracer overwrites the return 2294address of each function traced to set a custom probe. Thus the 2295original return address is stored on the stack of return address 2296in the task_struct. 2297 2298Probing on both ends of a function leads to special features 2299such as: 2300 2301- measure of a function's time execution 2302- having a reliable call stack to draw function calls graph 2303 2304This tracer is useful in several situations: 2305 2306- you want to find the reason of a strange kernel behavior and 2307 need to see what happens in detail on any areas (or specific 2308 ones). 2309 2310- you are experiencing weird latencies but it's difficult to 2311 find its origin. 2312 2313- you want to find quickly which path is taken by a specific 2314 function 2315 2316- you just want to peek inside a working kernel and want to see 2317 what happens there. 2318 2319:: 2320 2321 # tracer: function_graph 2322 # 2323 # CPU DURATION FUNCTION CALLS 2324 # | | | | | | | 2325 2326 0) | sys_open() { 2327 0) | do_sys_open() { 2328 0) | getname() { 2329 0) | kmem_cache_alloc() { 2330 0) 1.382 us | __might_sleep(); 2331 0) 2.478 us | } 2332 0) | strncpy_from_user() { 2333 0) | might_fault() { 2334 0) 1.389 us | __might_sleep(); 2335 0) 2.553 us | } 2336 0) 3.807 us | } 2337 0) 7.876 us | } 2338 0) | alloc_fd() { 2339 0) 0.668 us | _spin_lock(); 2340 0) 0.570 us | expand_files(); 2341 0) 0.586 us | _spin_unlock(); 2342 2343 2344There are several columns that can be dynamically 2345enabled/disabled. You can use every combination of options you 2346want, depending on your needs. 2347 2348- The cpu number on which the function executed is default 2349 enabled. It is sometimes better to only trace one cpu (see 2350 tracing_cpu_mask file) or you might sometimes see unordered 2351 function calls while cpu tracing switch. 2352 2353 - hide: echo nofuncgraph-cpu > trace_options 2354 - show: echo funcgraph-cpu > trace_options 2355 2356- The duration (function's time of execution) is displayed on 2357 the closing bracket line of a function or on the same line 2358 than the current function in case of a leaf one. It is default 2359 enabled. 2360 2361 - hide: echo nofuncgraph-duration > trace_options 2362 - show: echo funcgraph-duration > trace_options 2363 2364- The overhead field precedes the duration field in case of 2365 reached duration thresholds. 2366 2367 - hide: echo nofuncgraph-overhead > trace_options 2368 - show: echo funcgraph-overhead > trace_options 2369 - depends on: funcgraph-duration 2370 2371 ie:: 2372 2373 3) # 1837.709 us | } /* __switch_to */ 2374 3) | finish_task_switch() { 2375 3) 0.313 us | _raw_spin_unlock_irq(); 2376 3) 3.177 us | } 2377 3) # 1889.063 us | } /* __schedule */ 2378 3) ! 140.417 us | } /* __schedule */ 2379 3) # 2034.948 us | } /* schedule */ 2380 3) * 33998.59 us | } /* schedule_preempt_disabled */ 2381 2382 [...] 2383 2384 1) 0.260 us | msecs_to_jiffies(); 2385 1) 0.313 us | __rcu_read_unlock(); 2386 1) + 61.770 us | } 2387 1) + 64.479 us | } 2388 1) 0.313 us | rcu_bh_qs(); 2389 1) 0.313 us | __local_bh_enable(); 2390 1) ! 217.240 us | } 2391 1) 0.365 us | idle_cpu(); 2392 1) | rcu_irq_exit() { 2393 1) 0.417 us | rcu_eqs_enter_common.isra.47(); 2394 1) 3.125 us | } 2395 1) ! 227.812 us | } 2396 1) ! 457.395 us | } 2397 1) @ 119760.2 us | } 2398 2399 [...] 2400 2401 2) | handle_IPI() { 2402 1) 6.979 us | } 2403 2) 0.417 us | scheduler_ipi(); 2404 1) 9.791 us | } 2405 1) + 12.917 us | } 2406 2) 3.490 us | } 2407 1) + 15.729 us | } 2408 1) + 18.542 us | } 2409 2) $ 3594274 us | } 2410 2411Flags:: 2412 2413 + means that the function exceeded 10 usecs. 2414 ! means that the function exceeded 100 usecs. 2415 # means that the function exceeded 1000 usecs. 2416 * means that the function exceeded 10 msecs. 2417 @ means that the function exceeded 100 msecs. 2418 $ means that the function exceeded 1 sec. 2419 2420 2421- The task/pid field displays the thread cmdline and pid which 2422 executed the function. It is default disabled. 2423 2424 - hide: echo nofuncgraph-proc > trace_options 2425 - show: echo funcgraph-proc > trace_options 2426 2427 ie:: 2428 2429 # tracer: function_graph 2430 # 2431 # CPU TASK/PID DURATION FUNCTION CALLS 2432 # | | | | | | | | | 2433 0) sh-4802 | | d_free() { 2434 0) sh-4802 | | call_rcu() { 2435 0) sh-4802 | | __call_rcu() { 2436 0) sh-4802 | 0.616 us | rcu_process_gp_end(); 2437 0) sh-4802 | 0.586 us | check_for_new_grace_period(); 2438 0) sh-4802 | 2.899 us | } 2439 0) sh-4802 | 4.040 us | } 2440 0) sh-4802 | 5.151 us | } 2441 0) sh-4802 | + 49.370 us | } 2442 2443 2444- The absolute time field is an absolute timestamp given by the 2445 system clock since it started. A snapshot of this time is 2446 given on each entry/exit of functions 2447 2448 - hide: echo nofuncgraph-abstime > trace_options 2449 - show: echo funcgraph-abstime > trace_options 2450 2451 ie:: 2452 2453 # 2454 # TIME CPU DURATION FUNCTION CALLS 2455 # | | | | | | | | 2456 360.774522 | 1) 0.541 us | } 2457 360.774522 | 1) 4.663 us | } 2458 360.774523 | 1) 0.541 us | __wake_up_bit(); 2459 360.774524 | 1) 6.796 us | } 2460 360.774524 | 1) 7.952 us | } 2461 360.774525 | 1) 9.063 us | } 2462 360.774525 | 1) 0.615 us | journal_mark_dirty(); 2463 360.774527 | 1) 0.578 us | __brelse(); 2464 360.774528 | 1) | reiserfs_prepare_for_journal() { 2465 360.774528 | 1) | unlock_buffer() { 2466 360.774529 | 1) | wake_up_bit() { 2467 360.774529 | 1) | bit_waitqueue() { 2468 360.774530 | 1) 0.594 us | __phys_addr(); 2469 2470 2471The function name is always displayed after the closing bracket 2472for a function if the start of that function is not in the 2473trace buffer. 2474 2475Display of the function name after the closing bracket may be 2476enabled for functions whose start is in the trace buffer, 2477allowing easier searching with grep for function durations. 2478It is default disabled. 2479 2480 - hide: echo nofuncgraph-tail > trace_options 2481 - show: echo funcgraph-tail > trace_options 2482 2483 Example with nofuncgraph-tail (default):: 2484 2485 0) | putname() { 2486 0) | kmem_cache_free() { 2487 0) 0.518 us | __phys_addr(); 2488 0) 1.757 us | } 2489 0) 2.861 us | } 2490 2491 Example with funcgraph-tail:: 2492 2493 0) | putname() { 2494 0) | kmem_cache_free() { 2495 0) 0.518 us | __phys_addr(); 2496 0) 1.757 us | } /* kmem_cache_free() */ 2497 0) 2.861 us | } /* putname() */ 2498 2499You can put some comments on specific functions by using 2500trace_printk() For example, if you want to put a comment inside 2501the __might_sleep() function, you just have to include 2502<linux/ftrace.h> and call trace_printk() inside __might_sleep():: 2503 2504 trace_printk("I'm a comment!\n") 2505 2506will produce:: 2507 2508 1) | __might_sleep() { 2509 1) | /* I'm a comment! */ 2510 1) 1.449 us | } 2511 2512 2513You might find other useful features for this tracer in the 2514following "dynamic ftrace" section such as tracing only specific 2515functions or tasks. 2516 2517dynamic ftrace 2518-------------- 2519 2520If CONFIG_DYNAMIC_FTRACE is set, the system will run with 2521virtually no overhead when function tracing is disabled. The way 2522this works is the mcount function call (placed at the start of 2523every kernel function, produced by the -pg switch in gcc), 2524starts of pointing to a simple return. (Enabling FTRACE will 2525include the -pg switch in the compiling of the kernel.) 2526 2527At compile time every C file object is run through the 2528recordmcount program (located in the scripts directory). This 2529program will parse the ELF headers in the C object to find all 2530the locations in the .text section that call mcount. Starting 2531with gcc verson 4.6, the -mfentry has been added for x86, which 2532calls "__fentry__" instead of "mcount". Which is called before 2533the creation of the stack frame. 2534 2535Note, not all sections are traced. They may be prevented by either 2536a notrace, or blocked another way and all inline functions are not 2537traced. Check the "available_filter_functions" file to see what functions 2538can be traced. 2539 2540A section called "__mcount_loc" is created that holds 2541references to all the mcount/fentry call sites in the .text section. 2542The recordmcount program re-links this section back into the 2543original object. The final linking stage of the kernel will add all these 2544references into a single table. 2545 2546On boot up, before SMP is initialized, the dynamic ftrace code 2547scans this table and updates all the locations into nops. It 2548also records the locations, which are added to the 2549available_filter_functions list. Modules are processed as they 2550are loaded and before they are executed. When a module is 2551unloaded, it also removes its functions from the ftrace function 2552list. This is automatic in the module unload code, and the 2553module author does not need to worry about it. 2554 2555When tracing is enabled, the process of modifying the function 2556tracepoints is dependent on architecture. The old method is to use 2557kstop_machine to prevent races with the CPUs executing code being 2558modified (which can cause the CPU to do undesirable things, especially 2559if the modified code crosses cache (or page) boundaries), and the nops are 2560patched back to calls. But this time, they do not call mcount 2561(which is just a function stub). They now call into the ftrace 2562infrastructure. 2563 2564The new method of modifying the function tracepoints is to place 2565a breakpoint at the location to be modified, sync all CPUs, modify 2566the rest of the instruction not covered by the breakpoint. Sync 2567all CPUs again, and then remove the breakpoint with the finished 2568version to the ftrace call site. 2569 2570Some archs do not even need to monkey around with the synchronization, 2571and can just slap the new code on top of the old without any 2572problems with other CPUs executing it at the same time. 2573 2574One special side-effect to the recording of the functions being 2575traced is that we can now selectively choose which functions we 2576wish to trace and which ones we want the mcount calls to remain 2577as nops. 2578 2579Two files are used, one for enabling and one for disabling the 2580tracing of specified functions. They are: 2581 2582 set_ftrace_filter 2583 2584and 2585 2586 set_ftrace_notrace 2587 2588A list of available functions that you can add to these files is 2589listed in: 2590 2591 available_filter_functions 2592 2593:: 2594 2595 # cat available_filter_functions 2596 put_prev_task_idle 2597 kmem_cache_create 2598 pick_next_task_rt 2599 get_online_cpus 2600 pick_next_task_fair 2601 mutex_lock 2602 [...] 2603 2604If I am only interested in sys_nanosleep and hrtimer_interrupt:: 2605 2606 # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter 2607 # echo function > current_tracer 2608 # echo 1 > tracing_on 2609 # usleep 1 2610 # echo 0 > tracing_on 2611 # cat trace 2612 # tracer: function 2613 # 2614 # entries-in-buffer/entries-written: 5/5 #P:4 2615 # 2616 # _-----=> irqs-off 2617 # / _----=> need-resched 2618 # | / _---=> hardirq/softirq 2619 # || / _--=> preempt-depth 2620 # ||| / delay 2621 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 2622 # | | | |||| | | 2623 usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath 2624 <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt 2625 usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt 2626 <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt 2627 <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt 2628 2629To see which functions are being traced, you can cat the file: 2630:: 2631 2632 # cat set_ftrace_filter 2633 hrtimer_interrupt 2634 sys_nanosleep 2635 2636 2637Perhaps this is not enough. The filters also allow glob(7) matching. 2638 2639 ``<match>*`` 2640 will match functions that begin with <match> 2641 ``*<match>`` 2642 will match functions that end with <match> 2643 ``*<match>*`` 2644 will match functions that have <match> in it 2645 ``<match1>*<match2>`` 2646 will match functions that begin with <match1> and end with <match2> 2647 2648.. note:: 2649 It is better to use quotes to enclose the wild cards, 2650 otherwise the shell may expand the parameters into names 2651 of files in the local directory. 2652 2653:: 2654 2655 # echo 'hrtimer_*' > set_ftrace_filter 2656 2657Produces:: 2658 2659 # tracer: function 2660 # 2661 # entries-in-buffer/entries-written: 897/897 #P:4 2662 # 2663 # _-----=> irqs-off 2664 # / _----=> need-resched 2665 # | / _---=> hardirq/softirq 2666 # || / _--=> preempt-depth 2667 # ||| / delay 2668 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 2669 # | | | |||| | | 2670 <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit 2671 <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel 2672 <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer 2673 <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit 2674 <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 2675 <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt 2676 <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter 2677 <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem 2678 2679Notice that we lost the sys_nanosleep. 2680:: 2681 2682 # cat set_ftrace_filter 2683 hrtimer_run_queues 2684 hrtimer_run_pending 2685 hrtimer_init 2686 hrtimer_cancel 2687 hrtimer_try_to_cancel 2688 hrtimer_forward 2689 hrtimer_start 2690 hrtimer_reprogram 2691 hrtimer_force_reprogram 2692 hrtimer_get_next_event 2693 hrtimer_interrupt 2694 hrtimer_nanosleep 2695 hrtimer_wakeup 2696 hrtimer_get_remaining 2697 hrtimer_get_res 2698 hrtimer_init_sleeper 2699 2700 2701This is because the '>' and '>>' act just like they do in bash. 2702To rewrite the filters, use '>' 2703To append to the filters, use '>>' 2704 2705To clear out a filter so that all functions will be recorded 2706again:: 2707 2708 # echo > set_ftrace_filter 2709 # cat set_ftrace_filter 2710 # 2711 2712Again, now we want to append. 2713 2714:: 2715 2716 # echo sys_nanosleep > set_ftrace_filter 2717 # cat set_ftrace_filter 2718 sys_nanosleep 2719 # echo 'hrtimer_*' >> set_ftrace_filter 2720 # cat set_ftrace_filter 2721 hrtimer_run_queues 2722 hrtimer_run_pending 2723 hrtimer_init 2724 hrtimer_cancel 2725 hrtimer_try_to_cancel 2726 hrtimer_forward 2727 hrtimer_start 2728 hrtimer_reprogram 2729 hrtimer_force_reprogram 2730 hrtimer_get_next_event 2731 hrtimer_interrupt 2732 sys_nanosleep 2733 hrtimer_nanosleep 2734 hrtimer_wakeup 2735 hrtimer_get_remaining 2736 hrtimer_get_res 2737 hrtimer_init_sleeper 2738 2739 2740The set_ftrace_notrace prevents those functions from being 2741traced. 2742:: 2743 2744 # echo '*preempt*' '*lock*' > set_ftrace_notrace 2745 2746Produces:: 2747 2748 # tracer: function 2749 # 2750 # entries-in-buffer/entries-written: 39608/39608 #P:4 2751 # 2752 # _-----=> irqs-off 2753 # / _----=> need-resched 2754 # | / _---=> hardirq/softirq 2755 # || / _--=> preempt-depth 2756 # ||| / delay 2757 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 2758 # | | | |||| | | 2759 bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open 2760 bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last 2761 bash-1994 [000] .... 4342.324897: ima_file_check <-do_last 2762 bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check 2763 bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement 2764 bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action 2765 bash-1994 [000] .... 4342.324899: do_truncate <-do_last 2766 bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate 2767 bash-1994 [000] .... 4342.324899: notify_change <-do_truncate 2768 bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change 2769 bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time 2770 bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time 2771 2772We can see that there's no more lock or preempt tracing. 2773 2774 2775Dynamic ftrace with the function graph tracer 2776--------------------------------------------- 2777 2778Although what has been explained above concerns both the 2779function tracer and the function-graph-tracer, there are some 2780special features only available in the function-graph tracer. 2781 2782If you want to trace only one function and all of its children, 2783you just have to echo its name into set_graph_function:: 2784 2785 echo __do_fault > set_graph_function 2786 2787will produce the following "expanded" trace of the __do_fault() 2788function:: 2789 2790 0) | __do_fault() { 2791 0) | filemap_fault() { 2792 0) | find_lock_page() { 2793 0) 0.804 us | find_get_page(); 2794 0) | __might_sleep() { 2795 0) 1.329 us | } 2796 0) 3.904 us | } 2797 0) 4.979 us | } 2798 0) 0.653 us | _spin_lock(); 2799 0) 0.578 us | page_add_file_rmap(); 2800 0) 0.525 us | native_set_pte_at(); 2801 0) 0.585 us | _spin_unlock(); 2802 0) | unlock_page() { 2803 0) 0.541 us | page_waitqueue(); 2804 0) 0.639 us | __wake_up_bit(); 2805 0) 2.786 us | } 2806 0) + 14.237 us | } 2807 0) | __do_fault() { 2808 0) | filemap_fault() { 2809 0) | find_lock_page() { 2810 0) 0.698 us | find_get_page(); 2811 0) | __might_sleep() { 2812 0) 1.412 us | } 2813 0) 3.950 us | } 2814 0) 5.098 us | } 2815 0) 0.631 us | _spin_lock(); 2816 0) 0.571 us | page_add_file_rmap(); 2817 0) 0.526 us | native_set_pte_at(); 2818 0) 0.586 us | _spin_unlock(); 2819 0) | unlock_page() { 2820 0) 0.533 us | page_waitqueue(); 2821 0) 0.638 us | __wake_up_bit(); 2822 0) 2.793 us | } 2823 0) + 14.012 us | } 2824 2825You can also expand several functions at once:: 2826 2827 echo sys_open > set_graph_function 2828 echo sys_close >> set_graph_function 2829 2830Now if you want to go back to trace all functions you can clear 2831this special filter via:: 2832 2833 echo > set_graph_function 2834 2835 2836ftrace_enabled 2837-------------- 2838 2839Note, the proc sysctl ftrace_enable is a big on/off switch for the 2840function tracer. By default it is enabled (when function tracing is 2841enabled in the kernel). If it is disabled, all function tracing is 2842disabled. This includes not only the function tracers for ftrace, but 2843also for any other uses (perf, kprobes, stack tracing, profiling, etc). 2844 2845Please disable this with care. 2846 2847This can be disable (and enabled) with:: 2848 2849 sysctl kernel.ftrace_enabled=0 2850 sysctl kernel.ftrace_enabled=1 2851 2852 or 2853 2854 echo 0 > /proc/sys/kernel/ftrace_enabled 2855 echo 1 > /proc/sys/kernel/ftrace_enabled 2856 2857 2858Filter commands 2859--------------- 2860 2861A few commands are supported by the set_ftrace_filter interface. 2862Trace commands have the following format:: 2863 2864 <function>:<command>:<parameter> 2865 2866The following commands are supported: 2867 2868- mod: 2869 This command enables function filtering per module. The 2870 parameter defines the module. For example, if only the write* 2871 functions in the ext3 module are desired, run: 2872 2873 echo 'write*:mod:ext3' > set_ftrace_filter 2874 2875 This command interacts with the filter in the same way as 2876 filtering based on function names. Thus, adding more functions 2877 in a different module is accomplished by appending (>>) to the 2878 filter file. Remove specific module functions by prepending 2879 '!':: 2880 2881 echo '!writeback*:mod:ext3' >> set_ftrace_filter 2882 2883 Mod command supports module globbing. Disable tracing for all 2884 functions except a specific module:: 2885 2886 echo '!*:mod:!ext3' >> set_ftrace_filter 2887 2888 Disable tracing for all modules, but still trace kernel:: 2889 2890 echo '!*:mod:*' >> set_ftrace_filter 2891 2892 Enable filter only for kernel:: 2893 2894 echo '*write*:mod:!*' >> set_ftrace_filter 2895 2896 Enable filter for module globbing:: 2897 2898 echo '*write*:mod:*snd*' >> set_ftrace_filter 2899 2900- traceon/traceoff: 2901 These commands turn tracing on and off when the specified 2902 functions are hit. The parameter determines how many times the 2903 tracing system is turned on and off. If unspecified, there is 2904 no limit. For example, to disable tracing when a schedule bug 2905 is hit the first 5 times, run:: 2906 2907 echo '__schedule_bug:traceoff:5' > set_ftrace_filter 2908 2909 To always disable tracing when __schedule_bug is hit:: 2910 2911 echo '__schedule_bug:traceoff' > set_ftrace_filter 2912 2913 These commands are cumulative whether or not they are appended 2914 to set_ftrace_filter. To remove a command, prepend it by '!' 2915 and drop the parameter:: 2916 2917 echo '!__schedule_bug:traceoff:0' > set_ftrace_filter 2918 2919 The above removes the traceoff command for __schedule_bug 2920 that have a counter. To remove commands without counters:: 2921 2922 echo '!__schedule_bug:traceoff' > set_ftrace_filter 2923 2924- snapshot: 2925 Will cause a snapshot to be triggered when the function is hit. 2926 :: 2927 2928 echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter 2929 2930 To only snapshot once: 2931 :: 2932 2933 echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter 2934 2935 To remove the above commands:: 2936 2937 echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter 2938 echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter 2939 2940- enable_event/disable_event: 2941 These commands can enable or disable a trace event. Note, because 2942 function tracing callbacks are very sensitive, when these commands 2943 are registered, the trace point is activated, but disabled in 2944 a "soft" mode. That is, the tracepoint will be called, but 2945 just will not be traced. The event tracepoint stays in this mode 2946 as long as there's a command that triggers it. 2947 :: 2948 2949 echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \ 2950 set_ftrace_filter 2951 2952 The format is:: 2953 2954 <function>:enable_event:<system>:<event>[:count] 2955 <function>:disable_event:<system>:<event>[:count] 2956 2957 To remove the events commands:: 2958 2959 echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \ 2960 set_ftrace_filter 2961 echo '!schedule:disable_event:sched:sched_switch' > \ 2962 set_ftrace_filter 2963 2964- dump: 2965 When the function is hit, it will dump the contents of the ftrace 2966 ring buffer to the console. This is useful if you need to debug 2967 something, and want to dump the trace when a certain function 2968 is hit. Perhaps its a function that is called before a tripple 2969 fault happens and does not allow you to get a regular dump. 2970 2971- cpudump: 2972 When the function is hit, it will dump the contents of the ftrace 2973 ring buffer for the current CPU to the console. Unlike the "dump" 2974 command, it only prints out the contents of the ring buffer for the 2975 CPU that executed the function that triggered the dump. 2976 2977trace_pipe 2978---------- 2979 2980The trace_pipe outputs the same content as the trace file, but 2981the effect on the tracing is different. Every read from 2982trace_pipe is consumed. This means that subsequent reads will be 2983different. The trace is live. 2984:: 2985 2986 # echo function > current_tracer 2987 # cat trace_pipe > /tmp/trace.out & 2988 [1] 4153 2989 # echo 1 > tracing_on 2990 # usleep 1 2991 # echo 0 > tracing_on 2992 # cat trace 2993 # tracer: function 2994 # 2995 # entries-in-buffer/entries-written: 0/0 #P:4 2996 # 2997 # _-----=> irqs-off 2998 # / _----=> need-resched 2999 # | / _---=> hardirq/softirq 3000 # || / _--=> preempt-depth 3001 # ||| / delay 3002 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 3003 # | | | |||| | | 3004 3005 # 3006 # cat /tmp/trace.out 3007 bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write 3008 bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock 3009 bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify 3010 bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify 3011 bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify 3012 bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock 3013 bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock 3014 bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify 3015 bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath 3016 3017 3018Note, reading the trace_pipe file will block until more input is 3019added. 3020 3021trace entries 3022------------- 3023 3024Having too much or not enough data can be troublesome in 3025diagnosing an issue in the kernel. The file buffer_size_kb is 3026used to modify the size of the internal trace buffers. The 3027number listed is the number of entries that can be recorded per 3028CPU. To know the full size, multiply the number of possible CPUs 3029with the number of entries. 3030:: 3031 3032 # cat buffer_size_kb 3033 1408 (units kilobytes) 3034 3035Or simply read buffer_total_size_kb 3036:: 3037 3038 # cat buffer_total_size_kb 3039 5632 3040 3041To modify the buffer, simple echo in a number (in 1024 byte segments). 3042:: 3043 3044 # echo 10000 > buffer_size_kb 3045 # cat buffer_size_kb 3046 10000 (units kilobytes) 3047 3048It will try to allocate as much as possible. If you allocate too 3049much, it can cause Out-Of-Memory to trigger. 3050:: 3051 3052 # echo 1000000000000 > buffer_size_kb 3053 -bash: echo: write error: Cannot allocate memory 3054 # cat buffer_size_kb 3055 85 3056 3057The per_cpu buffers can be changed individually as well: 3058:: 3059 3060 # echo 10000 > per_cpu/cpu0/buffer_size_kb 3061 # echo 100 > per_cpu/cpu1/buffer_size_kb 3062 3063When the per_cpu buffers are not the same, the buffer_size_kb 3064at the top level will just show an X 3065:: 3066 3067 # cat buffer_size_kb 3068 X 3069 3070This is where the buffer_total_size_kb is useful: 3071:: 3072 3073 # cat buffer_total_size_kb 3074 12916 3075 3076Writing to the top level buffer_size_kb will reset all the buffers 3077to be the same again. 3078 3079Snapshot 3080-------- 3081CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature 3082available to all non latency tracers. (Latency tracers which 3083record max latency, such as "irqsoff" or "wakeup", can't use 3084this feature, since those are already using the snapshot 3085mechanism internally.) 3086 3087Snapshot preserves a current trace buffer at a particular point 3088in time without stopping tracing. Ftrace swaps the current 3089buffer with a spare buffer, and tracing continues in the new 3090current (=previous spare) buffer. 3091 3092The following tracefs files in "tracing" are related to this 3093feature: 3094 3095 snapshot: 3096 3097 This is used to take a snapshot and to read the output 3098 of the snapshot. Echo 1 into this file to allocate a 3099 spare buffer and to take a snapshot (swap), then read 3100 the snapshot from this file in the same format as 3101 "trace" (described above in the section "The File 3102 System"). Both reads snapshot and tracing are executable 3103 in parallel. When the spare buffer is allocated, echoing 3104 0 frees it, and echoing else (positive) values clear the 3105 snapshot contents. 3106 More details are shown in the table below. 3107 3108 +--------------+------------+------------+------------+ 3109 |status\\input | 0 | 1 | else | 3110 +==============+============+============+============+ 3111 |not allocated |(do nothing)| alloc+swap |(do nothing)| 3112 +--------------+------------+------------+------------+ 3113 |allocated | free | swap | clear | 3114 +--------------+------------+------------+------------+ 3115 3116Here is an example of using the snapshot feature. 3117:: 3118 3119 # echo 1 > events/sched/enable 3120 # echo 1 > snapshot 3121 # cat snapshot 3122 # tracer: nop 3123 # 3124 # entries-in-buffer/entries-written: 71/71 #P:8 3125 # 3126 # _-----=> irqs-off 3127 # / _----=> need-resched 3128 # | / _---=> hardirq/softirq 3129 # || / _--=> preempt-depth 3130 # ||| / delay 3131 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 3132 # | | | |||| | | 3133 <idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120 3134 sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120 3135 [...] 3136 <idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120 3137 3138 # cat trace 3139 # tracer: nop 3140 # 3141 # entries-in-buffer/entries-written: 77/77 #P:8 3142 # 3143 # _-----=> irqs-off 3144 # / _----=> need-resched 3145 # | / _---=> hardirq/softirq 3146 # || / _--=> preempt-depth 3147 # ||| / delay 3148 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 3149 # | | | |||| | | 3150 <idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120 3151 snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120 3152 [...] 3153 3154 3155If you try to use this snapshot feature when current tracer is 3156one of the latency tracers, you will get the following results. 3157:: 3158 3159 # echo wakeup > current_tracer 3160 # echo 1 > snapshot 3161 bash: echo: write error: Device or resource busy 3162 # cat snapshot 3163 cat: snapshot: Device or resource busy 3164 3165 3166Instances 3167--------- 3168In the tracefs tracing directory is a directory called "instances". 3169This directory can have new directories created inside of it using 3170mkdir, and removing directories with rmdir. The directory created 3171with mkdir in this directory will already contain files and other 3172directories after it is created. 3173:: 3174 3175 # mkdir instances/foo 3176 # ls instances/foo 3177 buffer_size_kb buffer_total_size_kb events free_buffer per_cpu 3178 set_event snapshot trace trace_clock trace_marker trace_options 3179 trace_pipe tracing_on 3180 3181As you can see, the new directory looks similar to the tracing directory 3182itself. In fact, it is very similar, except that the buffer and 3183events are agnostic from the main director, or from any other 3184instances that are created. 3185 3186The files in the new directory work just like the files with the 3187same name in the tracing directory except the buffer that is used 3188is a separate and new buffer. The files affect that buffer but do not 3189affect the main buffer with the exception of trace_options. Currently, 3190the trace_options affect all instances and the top level buffer 3191the same, but this may change in future releases. That is, options 3192may become specific to the instance they reside in. 3193 3194Notice that none of the function tracer files are there, nor is 3195current_tracer and available_tracers. This is because the buffers 3196can currently only have events enabled for them. 3197:: 3198 3199 # mkdir instances/foo 3200 # mkdir instances/bar 3201 # mkdir instances/zoot 3202 # echo 100000 > buffer_size_kb 3203 # echo 1000 > instances/foo/buffer_size_kb 3204 # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb 3205 # echo function > current_trace 3206 # echo 1 > instances/foo/events/sched/sched_wakeup/enable 3207 # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable 3208 # echo 1 > instances/foo/events/sched/sched_switch/enable 3209 # echo 1 > instances/bar/events/irq/enable 3210 # echo 1 > instances/zoot/events/syscalls/enable 3211 # cat trace_pipe 3212 CPU:2 [LOST 11745 EVENTS] 3213 bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist 3214 bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave 3215 bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist 3216 bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist 3217 bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock 3218 bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype 3219 bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist 3220 bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist 3221 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics 3222 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics 3223 bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process 3224 [...] 3225 3226 # cat instances/foo/trace_pipe 3227 bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 3228 bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 3229 <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003 3230 <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120 3231 rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 3232 bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 3233 bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 3234 bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120 3235 kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001 3236 kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120 3237 [...] 3238 3239 # cat instances/bar/trace_pipe 3240 migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX] 3241 <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX] 3242 bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER] 3243 bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU] 3244 bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER] 3245 bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER] 3246 bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU] 3247 bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU] 3248 sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4 3249 sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled 3250 sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0 3251 sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled 3252 [...] 3253 3254 # cat instances/zoot/trace 3255 # tracer: nop 3256 # 3257 # entries-in-buffer/entries-written: 18996/18996 #P:4 3258 # 3259 # _-----=> irqs-off 3260 # / _----=> need-resched 3261 # | / _---=> hardirq/softirq 3262 # || / _--=> preempt-depth 3263 # ||| / delay 3264 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 3265 # | | | |||| | | 3266 bash-1998 [000] d... 140.733501: sys_write -> 0x2 3267 bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1) 3268 bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1 3269 bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0) 3270 bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1 3271 bash-1998 [000] d... 140.733510: sys_close(fd: a) 3272 bash-1998 [000] d... 140.733510: sys_close -> 0x0 3273 bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8) 3274 bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0 3275 bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8) 3276 bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0 3277 3278You can see that the trace of the top most trace buffer shows only 3279the function tracing. The foo instance displays wakeups and task 3280switches. 3281 3282To remove the instances, simply delete their directories: 3283:: 3284 3285 # rmdir instances/foo 3286 # rmdir instances/bar 3287 # rmdir instances/zoot 3288 3289Note, if a process has a trace file open in one of the instance 3290directories, the rmdir will fail with EBUSY. 3291 3292 3293Stack trace 3294----------- 3295Since the kernel has a fixed sized stack, it is important not to 3296waste it in functions. A kernel developer must be conscience of 3297what they allocate on the stack. If they add too much, the system 3298can be in danger of a stack overflow, and corruption will occur, 3299usually leading to a system panic. 3300 3301There are some tools that check this, usually with interrupts 3302periodically checking usage. But if you can perform a check 3303at every function call that will become very useful. As ftrace provides 3304a function tracer, it makes it convenient to check the stack size 3305at every function call. This is enabled via the stack tracer. 3306 3307CONFIG_STACK_TRACER enables the ftrace stack tracing functionality. 3308To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled. 3309:: 3310 3311 # echo 1 > /proc/sys/kernel/stack_tracer_enabled 3312 3313You can also enable it from the kernel command line to trace 3314the stack size of the kernel during boot up, by adding "stacktrace" 3315to the kernel command line parameter. 3316 3317After running it for a few minutes, the output looks like: 3318:: 3319 3320 # cat stack_max_size 3321 2928 3322 3323 # cat stack_trace 3324 Depth Size Location (18 entries) 3325 ----- ---- -------- 3326 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac 3327 1) 2704 160 find_busiest_group+0x31/0x1f1 3328 2) 2544 256 load_balance+0xd9/0x662 3329 3) 2288 80 idle_balance+0xbb/0x130 3330 4) 2208 128 __schedule+0x26e/0x5b9 3331 5) 2080 16 schedule+0x64/0x66 3332 6) 2064 128 schedule_timeout+0x34/0xe0 3333 7) 1936 112 wait_for_common+0x97/0xf1 3334 8) 1824 16 wait_for_completion+0x1d/0x1f 3335 9) 1808 128 flush_work+0xfe/0x119 3336 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20 3337 11) 1664 48 input_available_p+0x1d/0x5c 3338 12) 1616 48 n_tty_poll+0x6d/0x134 3339 13) 1568 64 tty_poll+0x64/0x7f 3340 14) 1504 880 do_select+0x31e/0x511 3341 15) 624 400 core_sys_select+0x177/0x216 3342 16) 224 96 sys_select+0x91/0xb9 3343 17) 128 128 system_call_fastpath+0x16/0x1b 3344 3345Note, if -mfentry is being used by gcc, functions get traced before 3346they set up the stack frame. This means that leaf level functions 3347are not tested by the stack tracer when -mfentry is used. 3348 3349Currently, -mfentry is used by gcc 4.6.0 and above on x86 only. 3350 3351More 3352---- 3353More details can be found in the source code, in the `kernel/trace/*.c` files. 3354