1========================
2ftrace - Function Tracer
3========================
4
5Copyright 2008 Red Hat Inc.
6
7:Author:   Steven Rostedt <srostedt@redhat.com>
8:License:  The GNU Free Documentation License, Version 1.2
9          (dual licensed under the GPL v2)
10:Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
11		      John Kacur, and David Teigland.
12
13- Written for: 2.6.28-rc2
14- Updated for: 3.10
15- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
16- Converted to rst format - Changbin Du <changbin.du@intel.com>
17
18Introduction
19------------
20
21Ftrace is an internal tracer designed to help out developers and
22designers of systems to find what is going on inside the kernel.
23It can be used for debugging or analyzing latencies and
24performance issues that take place outside of user-space.
25
26Although ftrace is typically considered the function tracer, it
27is really a framework of several assorted tracing utilities.
There is latency tracing to examine what occurs between interrupts
being disabled and enabled, as well as for preemption, and from the
time a task is woken to the time it is actually scheduled in.
31
One of the most common uses of ftrace is event tracing.
Throughout the kernel are hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.
36
See events.rst for more information.
38
39
40Implementation Details
41----------------------
42
43See :doc:`ftrace-design` for details for arch porters and such.
44
45
46The File System
47---------------
48
49Ftrace uses the tracefs file system to hold the control files as
50well as the files to display output.
51
52When tracefs is configured into the kernel (which selecting any ftrace
53option will do) the directory /sys/kernel/tracing will be created. To mount
54this directory, you can add to your /etc/fstab file::
55
56 tracefs       /sys/kernel/tracing       tracefs defaults        0       0
57
58Or you can mount it at run time with::
59
60 mount -t tracefs nodev /sys/kernel/tracing
61
62For quicker access to that directory you may want to make a soft link to
63it::
64
65 ln -s /sys/kernel/tracing /tracing
66
67.. attention::
68
69  Before 4.1, all ftrace tracing control files were within the debugfs
70  file system, which is typically located at /sys/kernel/debug/tracing.
71  For backward compatibility, when mounting the debugfs file system,
72  the tracefs file system will be automatically mounted at:
73
74  /sys/kernel/debug/tracing
75
76  All files located in the tracefs file system will be located in that
77  debugfs file system directory as well.
78
79.. attention::
80
81  Any selected ftrace option will also create the tracefs file system.
82  The rest of the document will assume that you are in the ftrace directory
83  (cd /sys/kernel/tracing) and will only concentrate on the files within that
84  directory and not distract from the content with the extended
85  "/sys/kernel/tracing" path name.
86
87That's it! (assuming that you have ftrace configured into your kernel)
88
89After mounting tracefs you will have access to the control and output files
90of ftrace. Here is a list of some of the key files:
91
92
93 Note: all time values are in microseconds.
94
95  current_tracer:
96
97	This is used to set or display the current tracer
98	that is configured.
99
100  available_tracers:
101
102	This holds the different types of tracers that
103	have been compiled into the kernel. The
104	tracers listed here can be configured by
105	echoing their name into current_tracer.
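
	For example, to list the compiled-in tracers and then select the
	function tracer (assuming it was configured in)::

	  # cat available_tracers
	  # echo function > current_tracer
	  # cat current_tracer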
106
107  tracing_on:
108
109	This sets or displays whether writing to the trace
110	ring buffer is enabled. Echo 0 into this file to disable
111	the tracer or 1 to enable it. Note, this only disables
112	writing to the ring buffer, the tracing overhead may
113	still be occurring.
114
115	The kernel function tracing_off() can be used within the
116	kernel to disable writing to the ring buffer, which will
117	set this file to "0". User space can re-enable tracing by
118	echoing "1" into the file.
119
120	Note, the function and event trigger "traceoff" will also
121	set this file to zero and stop tracing. Which can also
122	be re-enabled by user space using this file.
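
	For example, recording can be paused and resumed around a
	workload ("./my_test" below is only a placeholder for whatever
	is being measured)::

	  # echo 1 > tracing_on
	  # ./my_test
	  # echo 0 > tracing_on
	  # cat trace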
123
124  trace:
125
126	This file holds the output of the trace in a human
127	readable format (described below). Note, tracing is temporarily
128	disabled when the file is open for reading. Once all readers
129	are closed, tracing is re-enabled.
130
131  trace_pipe:
132
133	The output is the same as the "trace" file but this
134	file is meant to be streamed with live tracing.
135	Reads from this file will block until new data is
136	retrieved.  Unlike the "trace" file, this file is a
137	consumer. This means reading from this file causes
138	sequential reads to display more current data. Once
139	data is read from this file, it is consumed, and
140	will not be read again with a sequential read. The
141	"trace" file is static, and if the tracer is not
142	adding more data, it will display the same
143	information every time it is read. Unlike the
144	"trace" file, opening this file for reading will not
145	temporarily disable tracing.
146
147  trace_options:
148
149	This file lets the user control the amount of data
150	that is displayed in one of the above output
151	files. Options also exist to modify how a tracer
152	or events work (stack traces, timestamps, etc).
153
154  options:
155
156	This is a directory that has a file for every available
157	trace option (also in trace_options). Options may also be set
158	or cleared by writing a "1" or "0" respectively into the
159	corresponding file with the option name.
160
161  tracing_max_latency:
162
163	Some of the tracers record the max latency.
164	For example, the maximum time that interrupts are disabled.
165	The maximum time is saved in this file. The max trace will also be
	stored, and displayed by "trace". A new max trace will only be
167	recorded if the latency is greater than the value in this file
168	(in microseconds).
169
	By echoing a time into this file, no latency will be recorded
171	unless it is greater than the time in this file.
172
173  tracing_thresh:
174
175	Some latency tracers will record a trace whenever the
176	latency is greater than the number in this file.
177	Only active when the file contains a number greater than 0.
178	(in microseconds)
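
	For example, when using one of the latency tracers, the saved
	max can be reset and only latencies above 200 microseconds
	recorded (the threshold value is just an illustration)::

	  # echo 0 > tracing_max_latency
	  # echo 200 > tracing_thresh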
179
180  buffer_size_kb:
181
182	This sets or displays the number of kilobytes each CPU
183	buffer holds. By default, the trace buffers are the same size
184	for each CPU. The displayed number is the size of the
185	CPU buffer and not total size of all buffers. The
186	trace buffers are allocated in pages (blocks of memory
187	that the kernel uses for allocation, usually 4 KB in size).
188	If the last page allocated has room for more bytes
189	than requested, the rest of the page will be used,
190	making the actual allocation bigger than requested or shown.
191	( Note, the size may not be a multiple of the page size
192	due to buffer management meta-data. )
193
194	Buffer sizes for individual CPUs may vary
195	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
196	this file will show "X".
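
	For example, to grow each per-CPU buffer to 8 MB (the size is
	only an illustration) and then check the combined size::

	  # echo 8192 > buffer_size_kb
	  # cat buffer_total_size_kb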
197
198  buffer_total_size_kb:
199
200	This displays the total combined size of all the trace buffers.
201
202  free_buffer:
203
	If a process is performing tracing, and the ring buffer should be
	shrunk ("freed") when the process is finished, even if it were to be
	killed by a signal, this file can be used for that purpose. On close
	of this file, the ring buffer will be resized to its minimum size.
	Having a process that is tracing also open this file means that when
	the process exits, its file descriptor for this file will be closed,
	and in doing so, the ring buffer will be "freed".
211
212	It may also stop tracing if disable_on_free option is set.
213
214  tracing_cpumask:
215
216	This is a mask that lets the user only trace on specified CPUs.
217	The format is a hex string representing the CPUs.
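
	For example, to restrict tracing to CPUs 0 and 1 (mask 0x3),
	assuming the system has at least two CPUs::

	  # echo 3 > tracing_cpumask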
218
219  set_ftrace_filter:
220
221	When dynamic ftrace is configured in (see the
222	section below "dynamic ftrace"), the code is dynamically
223	modified (code text rewrite) to disable calling of the
224	function profiler (mcount). This lets tracing be configured
225	in with practically no overhead in performance.  This also
226	has a side effect of enabling or disabling specific functions
227	to be traced. Echoing names of functions into this file
228	will limit the trace to only those functions.
229	This influences the tracers "function" and "function_graph"
230	and thus also function profiling (see "function_profile_enabled").
231
232	The functions listed in "available_filter_functions" are what
233	can be written into this file.
234
235	This interface also allows for commands to be used. See the
236	"Filter commands" section for more details.
237
	As a speed up, since processing strings can be quite expensive
	and requires a check of all functions registered to tracing,
	an index can be written into this file instead. A number
	(starting with "1") written will select the function at the
	corresponding line position of the "available_filter_functions"
	file.
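
	As a brief sketch, the following limits tracing to the function
	"schedule" plus everything matching a "sched*" glob (both names
	are chosen purely for illustration) and then clears the filter
	again::

	  # echo schedule > set_ftrace_filter
	  # echo 'sched*' >> set_ftrace_filter
	  # echo > set_ftrace_filter

	Writing "1" instead would select whatever function sits on the
	first line of "available_filter_functions".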
243
244  set_ftrace_notrace:
245
246	This has an effect opposite to that of
247	set_ftrace_filter. Any function that is added here will not
248	be traced. If a function exists in both set_ftrace_filter
249	and set_ftrace_notrace,	the function will _not_ be traced.
250
251  set_ftrace_pid:
252
	Have the function tracer only trace the threads whose PIDs are
254	listed in this file.
255
256	If the "function-fork" option is set, then when a task whose
257	PID is listed in this file forks, the child's PID will
258	automatically be added to this file, and the child will be
259	traced by the function tracer as well. This option will also
260	cause PIDs of tasks that exit to be removed from the file.
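
	For example, to trace only the current shell and the children it
	subsequently forks (a sketch using the shell's own $$ PID)::

	  # echo 1 > options/function-fork
	  # echo $$ > set_ftrace_pid
	  # echo function > current_tracer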
261
262  set_event_pid:
263
264	Have the events only trace a task with a PID listed in this file.
	Note, the sched_switch and sched_wakeup events will also trace
	the tasks listed in this file.
267
268	To have the PIDs of children of tasks with their PID in this file
269	added on fork, enable the "event-fork" option. That option will also
270	cause the PIDs of tasks to be removed from this file when the task
271	exits.
272
273  set_graph_function:
274
275	Functions listed in this file will cause the function graph
276	tracer to only trace these functions and the functions that
277	they call. (See the section "dynamic ftrace" for more details).
	Note, set_ftrace_filter and set_ftrace_notrace still affect
	what functions are being traced.
280
281  set_graph_notrace:
282
283	Similar to set_graph_function, but will disable function graph
284	tracing when the function is hit until it exits the function.
285	This makes it possible to ignore tracing functions that are called
286	by a specific function.
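
	As a short illustration (the function name is arbitrary), the
	following graphs kmem_cache_alloc and everything it calls; a
	function written into set_graph_notrace, and everything it
	calls, would instead be left out of the graph::

	  # echo kmem_cache_alloc > set_graph_function
	  # echo function_graph > current_tracer
	  # cat trace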
287
288  available_filter_functions:
289
290	This lists the functions that ftrace has processed and can trace.
291	These are the function names that you can pass to
292	"set_ftrace_filter", "set_ftrace_notrace",
293	"set_graph_function", or "set_graph_notrace".
294	(See the section "dynamic ftrace" below for more details.)
295
296  dyn_ftrace_total_info:
297
	This file is for debugging purposes. It shows the number of
	functions that have been converted to nops and are available
	to be traced.
300
301  enabled_functions:
302
303	This file is more for debugging ftrace, but can also be useful
304	in seeing if any function has a callback attached to it.
	Not only does the trace infrastructure use the ftrace function
	tracing utility, but other subsystems might too. This file
307	displays all functions that have a callback attached to them
308	as well as the number of callbacks that have been attached.
309	Note, a callback may also call multiple functions which will
310	not be listed in this count.
311
	If the callback registered to be traced by a function was
	registered with the "save regs" attribute (thus even more
	overhead), an 'R' will be displayed on the same line as the
	function that is returning registers.

	If the callback registered to be traced by a function was
	registered with the "ip modify" attribute (thus the regs->ip
	can be changed), an 'I' will be displayed on the same line
	as the function that can be overridden.
321
322	If the architecture supports it, it will also show what callback
323	is being directly called by the function. If the count is greater
324	than 1 it most likely will be ftrace_ops_list_func().
325
326	If the callback of the function jumps to a trampoline that is
	specific to the callback and not the standard trampoline,
328	its address will be printed as well as the function that the
329	trampoline calls.
330
331  function_profile_enabled:
332
	When set, it will profile all functions using either the function
	tracer or, if configured, the function graph tracer. It will
	keep a histogram of the number of times each function was called
	and, if the function graph tracer was configured, it will also
	keep track of the time spent in those functions. The histogram
	content can be displayed in the files:
339
340	trace_stat/function<cpu> ( function0, function1, etc).
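
	For example (assuming CONFIG_FUNCTION_PROFILER is enabled)::

	  # echo 1 > function_profile_enabled
	  # head trace_stat/function0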
341
342  trace_stat:
343
344	A directory that holds different tracing stats.
345
346  kprobe_events:
347
	Enable dynamic trace points. See kprobetrace.rst.
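
	A minimal sketch (the probed symbol do_sys_open is only an
	example; see kprobetrace.rst for the full syntax)::

	  # echo 'p:myprobe do_sys_open' >> kprobe_events
	  # echo 1 > events/kprobes/myprobe/enable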
349
350  kprobe_profile:
351
	Dynamic trace points stats. See kprobetrace.rst.
353
354  max_graph_depth:
355
356	Used with the function graph tracer. This is the max depth
357	it will trace into a function. Setting this to a value of
358	one will show only the first kernel function that is called
359	from user space.
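
	For example, assuming the function_graph tracer is configured
	in, the following shows only the top-level kernel functions
	entered from user space::

	  # echo function_graph > current_tracer
	  # echo 1 > max_graph_depth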
360
361  printk_formats:
362
363	This is for tools that read the raw format files. If an event in
364	the ring buffer references a string, only a pointer to the string
365	is recorded into the buffer and not the string itself. This prevents
366	tools from knowing what that string was. This file displays the string
	and address for the string, allowing tools to map the pointers to what
368	the strings were.
369
370  saved_cmdlines:
371
372	Only the pid of the task is recorded in a trace event unless
373	the event specifically saves the task comm as well. Ftrace
374	makes a cache of pid mappings to comms to try to display
375	comms for events. If a pid for a comm is not listed, then
376	"<...>" is displayed in the output.
377
378	If the option "record-cmd" is set to "0", then comms of tasks
379	will not be saved during recording. By default, it is enabled.
380
381  saved_cmdlines_size:
382
383	By default, 128 comms are saved (see "saved_cmdlines" above). To
	increase or decrease the number of comms that are cached, echo
	the number of comms to cache into this file.
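
	For example, to cache 1024 comms instead of the default 128::

	  # echo 1024 > saved_cmdlines_size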
386
387  saved_tgids:
388
389	If the option "record-tgid" is set, on each scheduling context switch
390	the Task Group ID of a task is saved in a table mapping the PID of
391	the thread to its TGID. By default, the "record-tgid" option is
392	disabled.
393
394  snapshot:
395
396	This displays the "snapshot" buffer and also lets the user
397	take a snapshot of the current running trace.
398	See the "Snapshot" section below for more details.
399
400  stack_max_size:
401
402	When the stack tracer is activated, this will display the
403	maximum stack size it has encountered.
404	See the "Stack Trace" section below.
405
406  stack_trace:
407
408	This displays the stack back trace of the largest stack
409	that was encountered when the stack tracer is activated.
410	See the "Stack Trace" section below.
411
412  stack_trace_filter:
413
414	This is similar to "set_ftrace_filter" but it limits what
415	functions the stack tracer will check.
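
	A short sketch of exercising the stack tracer (assuming
	CONFIG_STACK_TRACER is enabled)::

	  # echo 1 > /proc/sys/kernel/stack_tracer_enabled
	  # cat stack_max_size
	  # cat stack_trace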
416
417  trace_clock:
418
419	Whenever an event is recorded into the ring buffer, a
420	"timestamp" is added. This stamp comes from a specified
421	clock. By default, ftrace uses the "local" clock. This
422	clock is very fast and strictly per cpu, but on some
423	systems it may not be monotonic with respect to other
424	CPUs. In other words, the local clocks may not be in sync
425	with local clocks on other CPUs.
426
427	Usual clocks for tracing::
428
429	  # cat trace_clock
430	  [local] global counter x86-tsc
431
432	The clock with the square brackets around it is the one in effect.
433
434	local:
435		Default clock, but may not be in sync across CPUs
436
437	global:
438		This clock is in sync with all CPUs but may
439		be a bit slower than the local clock.
440
441	counter:
442		This is not a clock at all, but literally an atomic
443		counter. It counts up one by one, but is in sync
444		with all CPUs. This is useful when you need to
445		know exactly the order events occurred with respect to
446		each other on different CPUs.
447
448	uptime:
449		This uses the jiffies counter and the time stamp
450		is relative to the time since boot up.
451
452	perf:
453		This makes ftrace use the same clock that perf uses.
454		Eventually perf will be able to read ftrace buffers
455		and this will help out in interleaving the data.
456
457	x86-tsc:
458		Architectures may define their own clocks. For
459		example, x86 uses its own TSC cycle clock here.
460
461	ppc-tb:
462		This uses the powerpc timebase register value.
463		This is in sync across CPUs and can also be used
464		to correlate events across hypervisor/guest if
465		tb_offset is known.
466
467	mono:
468		This uses the fast monotonic clock (CLOCK_MONOTONIC)
469		which is monotonic and is subject to NTP rate adjustments.
470
471	mono_raw:
472		This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
473		which is monotonic but is not subject to any rate adjustments
474		and ticks at the same rate as the hardware clocksource.
475
476	boot:
477		This is the boot clock (CLOCK_BOOTTIME) and is based on the
478		fast monotonic clock, but also accounts for time spent in
479		suspend. Since the clock access is designed for use in
480		tracing in the suspend path, some side effects are possible
	if the clock is accessed after the suspend time is accounted for
	but before the fast mono clock is updated. In this case, the
	clock update appears to happen slightly sooner than it normally
	would have.
484		Also on 32-bit systems, it's possible that the 64-bit boot offset
485		sees a partial update. These effects are rare and post
486		processing should be able to handle them. See comments in the
487		ktime_get_boot_fast_ns() function for more information.
488
489	To set a clock, simply echo the clock name into this file::
490
491	  # echo global > trace_clock
492
493  trace_marker:
494
495	This is a very useful file for synchronizing user space
	with events happening in the kernel. Strings written into
	this file will be written into the ftrace buffer.
498
499	It is useful in applications to open this file at the start
500	of the application and just reference the file descriptor
501	for the file::
502
503		void trace_write(const char *fmt, ...)
504		{
505			va_list ap;
506			char buf[256];
507			int n;
508
509			if (trace_fd < 0)
510				return;
511
			va_start(ap, fmt);
			n = vsnprintf(buf, 256, fmt, ap);
			va_end(ap);

			/* vsnprintf() returns the untruncated length; clamp it */
			if (n >= (int)sizeof(buf))
				n = sizeof(buf) - 1;

			write(trace_fd, buf, n);
517		}
518
	At the start of the application::

		trace_fd = open("trace_marker", O_WRONLY);
522
523	Note: Writing into the trace_marker file can also initiate triggers
524	      that are written into /sys/kernel/tracing/events/ftrace/print/trigger
525	      See "Event triggers" in Documentation/trace/events.rst and an
526              example in Documentation/trace/histogram.rst (Section 3.)
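
	A quick way to exercise it from the shell (the string itself is
	arbitrary)::

	  # echo hello_world > trace_marker
	  # tail -n 1 trace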
527
528  trace_marker_raw:
529
	This is similar to trace_marker above, but is meant for binary data
531	to be written to it, where a tool can be used to parse the data
532	from trace_pipe_raw.
533
534  uprobe_events:
535
536	Add dynamic tracepoints in programs.
	See uprobetracer.rst
538
539  uprobe_profile:
540
	Uprobe statistics. See uprobetracer.rst
542
543  instances:
544
545	This is a way to make multiple trace buffers where different
546	events can be recorded in different buffers.
547	See "Instances" section below.
548
549  events:
550
551	This is the trace event directory. It holds event tracepoints
552	(also known as static tracepoints) that have been compiled
553	into the kernel. It shows what event tracepoints exist
554	and how they are grouped by system. There are "enable"
555	files at various levels that can enable the tracepoints
556	when a "1" is written to them.
557
	See events.rst for more information.
559
560  set_event:
561
	Echoing an event name into this file will enable that event.
563
	See events.rst for more information.
565
566  available_events:
567
568	A list of events that can be enabled in tracing.
569
	See events.rst for more information.
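
	For example, to enable a single scheduler event, then every
	event in the sched subsystem, and finally clear them all
	again::

	  # echo sched:sched_switch > set_event
	  # echo 'sched:*' >> set_event
	  # echo > set_event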
571
572  timestamp_mode:
573
574	Certain tracers may change the timestamp mode used when
575	logging trace events into the event buffer.  Events with
576	different modes can coexist within a buffer but the mode in
577	effect when an event is logged determines which timestamp mode
578	is used for that event.  The default timestamp mode is
579	'delta'.
580
	Usual timestamp modes for tracing::

	  # cat timestamp_mode
	  [delta] absolute

	The timestamp mode with the square brackets around it is the
	one in effect.

	delta: Default timestamp mode - timestamp is a delta against
	       a per-buffer timestamp.

	absolute: The timestamp is a full timestamp, not a delta
	          against some other value.  As such it takes up more
	          space and is less efficient.
595
596  hwlat_detector:
597
598	Directory for the Hardware Latency Detector.
599	See "Hardware Latency Detector" section below.
600
601  per_cpu:
602
603	This is a directory that contains the trace per_cpu information.
604
605  per_cpu/cpu0/buffer_size_kb:
606
607	The ftrace buffer is defined per_cpu. That is, there's a separate
608	buffer for each CPU to allow writes to be done atomically,
	and free from cache bouncing. These buffers may be of different
	sizes. This file is similar to the buffer_size_kb
611	file, but it only displays or sets the buffer size for the
612	specific CPU. (here cpu0).
613
614  per_cpu/cpu0/trace:
615
616	This is similar to the "trace" file, but it will only display
617	the data specific for the CPU. If written to, it only clears
618	the specific CPU buffer.
619
  per_cpu/cpu0/trace_pipe:
621
622	This is similar to the "trace_pipe" file, and is a consuming
623	read, but it will only display (and consume) the data specific
624	for the CPU.
625
  per_cpu/cpu0/trace_pipe_raw:
627
628	For tools that can parse the ftrace ring buffer binary format,
629	the trace_pipe_raw file can be used to extract the data
630	from the ring buffer directly. With the use of the splice()
631	system call, the buffer data can be quickly transferred to
632	a file or to the network where a server is collecting the
633	data.
634
635	Like trace_pipe, this is a consuming reader, where multiple
636	reads will always produce different data.
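
	For example, while tracing is active, a bounded amount of the
	raw data can also be captured with an ordinary read (the output
	file name and counts are arbitrary); decoding it still requires
	a tool that understands the ring buffer format::

	  # dd if=per_cpu/cpu0/trace_pipe_raw of=/tmp/raw0.dat bs=4096 count=10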
637
638  per_cpu/cpu0/snapshot:
639
640	This is similar to the main "snapshot" file, but will only
641	snapshot the current CPU (if supported). It only displays
642	the content of the snapshot for a given CPU, and if
643	written to, only clears this CPU buffer.
644
645  per_cpu/cpu0/snapshot_raw:
646
647	Similar to the trace_pipe_raw, but will read the binary format
648	from the snapshot buffer for the given CPU.
649
650  per_cpu/cpu0/stats:
651
652	This displays certain stats about the ring buffer:
653
654	entries:
655		The number of events that are still in the buffer.
656
657	overrun:
658		The number of lost events due to overwriting when
659		the buffer was full.
660
661	commit overrun:
662		Should always be zero.
663		This gets set if so many events happened within a nested
664		event (ring buffer is re-entrant), that it fills the
665		buffer and starts dropping events.
666
667	bytes:
668		Bytes actually read (not overwritten).
669
670	oldest event ts:
671		The oldest timestamp in the buffer
672
673	now ts:
674		The current timestamp
675
676	dropped events:
677		Events lost due to overwrite option being off.
678
679	read events:
680		The number of events read.
681
682The Tracers
683-----------
684
685Here is the list of current tracers that may be configured.
686
687  "function"
688
689	Function call tracer to trace all kernel functions.
690
691  "function_graph"
692
693	Similar to the function tracer except that the
694	function tracer probes the functions on their entry
695	whereas the function graph tracer traces on both entry
696	and exit of the functions. It then provides the ability
697	to draw a graph of function calls similar to C code
698	source.
699
700  "blk"
701
702	The block tracer. The tracer used by the blktrace user
703	application.
704
705  "hwlat"
706
707	The Hardware Latency tracer is used to detect if the hardware
708	produces any latency. See "Hardware Latency Detector" section
709	below.
710
711  "irqsoff"
712
713	Traces the areas that disable interrupts and saves
714	the trace with the longest max latency.
715	See tracing_max_latency. When a new max is recorded,
716	it replaces the old trace. It is best to view this
717	trace with the latency-format option enabled, which
718	happens automatically when the tracer is selected.
719
720  "preemptoff"
721
722	Similar to irqsoff but traces and records the amount of
723	time for which preemption is disabled.
724
725  "preemptirqsoff"
726
727	Similar to irqsoff and preemptoff, but traces and
728	records the largest time for which irqs and/or preemption
729	is disabled.
730
731  "wakeup"
732
733	Traces and records the max latency that it takes for
734	the highest priority task to get scheduled after
735	it has been woken up.
736        Traces all tasks as an average developer would expect.
737
738  "wakeup_rt"
739
740        Traces and records the max latency that it takes for just
741        RT tasks (as the current "wakeup" does). This is useful
742        for those interested in wake up timings of RT tasks.
743
744  "wakeup_dl"
745
746	Traces and records the max latency that it takes for
	a SCHED_DEADLINE task to be woken (as the "wakeup" and
	"wakeup_rt" tracers do).
749
750  "mmiotrace"
751
	A special tracer that is used to trace binary modules.
	It will trace all the calls that a module makes to the
	hardware, everything it writes to and reads from the I/O
	as well.
756
757  "branch"
758
759	This tracer can be configured when tracing likely/unlikely
	calls within the kernel. It will trace when a likely or
	unlikely branch is hit and whether its prediction was
	correct.
763
764  "nop"
765
766	This is the "trace nothing" tracer. To remove all
767	tracers from tracing simply echo "nop" into
768	current_tracer.
769
770Error conditions
771----------------
772
773  For most ftrace commands, failure modes are obvious and communicated
774  using standard return codes.
775
776  For other more involved commands, extended error information may be
777  available via the tracing/error_log file.  For the commands that
778  support it, reading the tracing/error_log file after an error will
779  display more detailed information about what went wrong, if
780  information is available.  The tracing/error_log file is a circular
781  error log displaying a small number (currently, 8) of ftrace errors
782  for the last (8) failed commands.
783
784  The extended error information and usage takes the form shown in
785  this example::
786
787    # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
788    echo: write error: Invalid argument
789
790    # cat /sys/kernel/debug/tracing/error_log
791    [ 5348.887237] location: error: Couldn't yyy: zzz
792      Command: xxx
793               ^
794    [ 7517.023364] location: error: Bad rrr: sss
795      Command: ppp qqq
796                   ^
797
798  To clear the error log, echo the empty string into it::
799
800    # echo > /sys/kernel/debug/tracing/error_log
801
802Examples of using the tracer
803----------------------------
804
805Here are typical examples of using the tracers when controlling
806them only with the tracefs interface (without using any
807user-land utilities).
808
809Output format:
810--------------
811
812Here is an example of the output format of the file "trace"::
813
814  # tracer: function
815  #
816  # entries-in-buffer/entries-written: 140080/250280   #P:4
817  #
818  #                              _-----=> irqs-off
819  #                             / _----=> need-resched
820  #                            | / _---=> hardirq/softirq
821  #                            || / _--=> preempt-depth
822  #                            ||| /     delay
823  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
824  #              | |       |   ||||       |         |
825              bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
826              bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
827              bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
828              sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
829              bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
830              bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
831              bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
832              bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
833              bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
834              sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
835              ....
836
837A header is printed with the tracer name that is represented by
838the trace. In this case the tracer is "function". Then it shows the
839number of events in the buffer as well as the total number of entries
840that were written. The difference is the number of entries that were
841lost due to the buffer filling up (250280 - 140080 = 110200 events
842lost).
843
844The header explains the content of the events. Task name "bash", the task
845PID "1977", the CPU that it was running on "000", the latency format
846(explained below), the timestamp in <secs>.<usecs> format, the
847function name that was traced "sys_close" and the parent function that
848called this function "system_call_fastpath". The timestamp is the time
849at which the function was entered.
850
851Latency trace format
852--------------------
853
854When the latency-format option is enabled or when one of the latency
855tracers is set, the trace file gives somewhat more information to see
856why a latency happened. Here is a typical trace::
857
858  # tracer: irqsoff
859  #
860  # irqsoff latency trace v1.1.5 on 3.8.0-test+
861  # --------------------------------------------------------------------
862  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
863  #    -----------------
864  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
865  #    -----------------
866  #  => started at: __lock_task_sighand
867  #  => ended at:   _raw_spin_unlock_irqrestore
868  #
869  #
870  #                  _------=> CPU#
871  #                 / _-----=> irqs-off
872  #                | / _----=> need-resched
873  #                || / _---=> hardirq/softirq
874  #                ||| / _--=> preempt-depth
875  #                |||| /     delay
876  #  cmd     pid   ||||| time  |   caller
877  #     \   /      |||||  \    |   /
878        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
879        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
880        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
881        ps-6143    2d..1  306us : <stack trace>
882   => trace_hardirqs_on_caller
883   => trace_hardirqs_on
884   => _raw_spin_unlock_irqrestore
885   => do_task_stat
886   => proc_tgid_stat
887   => proc_single_show
888   => seq_read
889   => vfs_read
890   => sys_read
891   => system_call_fastpath
892
893
894This shows that the current tracer is "irqsoff" tracing the time
895for which interrupts were disabled. It gives the trace version (which
never changes) and the version of the kernel upon which this was executed
897(3.8). Then it displays the max latency in microseconds (259 us). The number
898of trace entries displayed and the total number (both are four: #4/4).
899VP, KP, SP, and HP are always zero and are reserved for later use.
900#P is the number of online CPUs (#P:4).
901
902The task is the process that was running when the latency
903occurred. (ps pid: 6143).
904
905The start and stop (the functions in which the interrupts were
906disabled and enabled respectively) that caused the latencies:
907
908  - __lock_task_sighand is where the interrupts were disabled.
909  - _raw_spin_unlock_irqrestore is where they were enabled again.
910
911The next lines after the header are the trace itself. The header
912explains which is which.
913
914  cmd: The name of the process in the trace.
915
916  pid: The PID of that process.
917
918  CPU#: The CPU which the process was running on.
919
920  irqs-off: 'd' interrupts are disabled. '.' otherwise.
921	.. caution:: If the architecture does not support a way to
922		read the irq flags variable, an 'X' will always
923		be printed here.
924
925  need-resched:
	- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
927	- 'n' only TIF_NEED_RESCHED is set,
928	- 'p' only PREEMPT_NEED_RESCHED is set,
929	- '.' otherwise.
930
931  hardirq/softirq:
932	- 'Z' - NMI occurred inside a hardirq
933	- 'z' - NMI is running
934	- 'H' - hard irq occurred inside a softirq.
935	- 'h' - hard irq is running
936	- 's' - soft irq is running
937	- '.' - normal context.
938
939  preempt-depth: The level of preempt_disabled
940
941The above is mostly meaningful for kernel developers.
942
943  time:
944	When the latency-format option is enabled, the trace file
945	output includes a timestamp relative to the start of the
946	trace. This differs from the output when latency-format
947	is disabled, which includes an absolute timestamp.
948
949  delay:
950	This is just to help catch your eye a bit better. And
951	needs to be fixed to be only relative to the same CPU.
952	The marks are determined by the difference between this
953	current trace and the next trace.
954
955	  - '$' - greater than 1 second
956	  - '@' - greater than 100 millisecond
957	  - '*' - greater than 10 millisecond
958	  - '#' - greater than 1000 microsecond
959	  - '!' - greater than 100 microsecond
960	  - '+' - greater than 10 microsecond
961	  - ' ' - less than or equal to 10 microsecond.
962
963  The rest is the same as the 'trace' file.
964
965  Note, the latency tracers will usually end with a back trace
966  to easily find where the latency occurred.
967
968trace_options
969-------------
970
971The trace_options file (or the options directory) is used to control
972what gets printed in the trace output, or manipulate the tracers.
973To see what is available, simply cat the file::
974
975  cat trace_options
976	print-parent
977	nosym-offset
978	nosym-addr
979	noverbose
980	noraw
981	nohex
982	nobin
983	noblock
984	trace_printk
985	annotate
986	nouserstacktrace
987	nosym-userobj
988	noprintk-msg-only
989	context-info
990	nolatency-format
991	record-cmd
992	norecord-tgid
993	overwrite
994	nodisable_on_free
995	irq-info
996	markers
997	noevent-fork
998	function-trace
999	nofunction-fork
1000	nodisplay-graph
1001	nostacktrace
1002	nobranch
1003
1004To disable one of the options, echo in the option prepended with
1005"no"::
1006
1007  echo noprint-parent > trace_options
1008
1009To enable an option, leave off the "no"::
1010
1011  echo sym-offset > trace_options
1012
1013Here are the available options:
1014
1015  print-parent
1016	On function traces, display the calling (parent)
1017	function as well as the function being traced.
1018	::
1019
1020	  print-parent:
1021	   bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
1022
1023	  noprint-parent:
1024	   bash-4000  [01]  1477.606694: simple_strtoul
1025
1026
1027  sym-offset
1028	Display not only the function name, but also the
1029	offset in the function. For example, instead of
1030	seeing just "ktime_get", you will see
1031	"ktime_get+0xb/0x20".
1032	::
1033
1034	  sym-offset:
1035	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
1036
1037  sym-addr
1038	This will also display the function address as well
1039	as the function name.
1040	::
1041
1042	  sym-addr:
1043	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
1044
1045  verbose
1046	This deals with the trace file when the
1047        latency-format option is enabled.
1048	::
1049
1050	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
1051	    (+0.000ms): simple_strtoul (kstrtoul)
1052
1053  raw
1054	This will display raw numbers. This option is best for
1055	use with user applications that can translate the raw
1056	numbers better than having it done in the kernel.
1057
1058  hex
1059	Similar to raw, but the numbers will be in a hexadecimal format.
1060
1061  bin
1062	This will print out the formats in raw binary.
1063
1064  block
1065	When set, reading trace_pipe will not block when polled.
1066
1067  trace_printk
1068	Can disable trace_printk() from writing into the buffer.
1069
1070  annotate
1071	It is sometimes confusing when the CPU buffers are full
1072	and one CPU buffer had a lot of events recently, thus
	a shorter time frame, where another CPU may have only had
1074	a few events, which lets it have older events. When
1075	the trace is reported, it shows the oldest events first,
1076	and it may look like only one CPU ran (the one with the
1077	oldest events). When the annotate option is set, it will
1078	display when a new CPU buffer started::
1079
1080			  <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
1081			  <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
1082			  <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
1083		##### CPU 2 buffer started ####
1084			  <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
1085			  <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
1086			  <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
1087
1088  userstacktrace
1089	This option changes the trace. It records a
1090	stacktrace of the current user space thread after
1091	each trace event.
1092
1093  sym-userobj
	When user stacktraces are enabled, look up which
	object the address belongs to, and print a
	relative address. This is especially useful when
	ASLR is on, otherwise you don't get a chance to
	resolve the address to object/file/line after
	the app is no longer running.
1100
1101	The lookup is performed when you read
	trace, trace_pipe. Example::
1103
1104		  a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
1105		  x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
1106
1107
1108  printk-msg-only
1109	When set, trace_printk()s will only show the format
1110	and not their parameters (if trace_bprintk() or
1111	trace_bputs() was used to save the trace_printk()).
1112
1113  context-info
1114	Show only the event data. Hides the comm, PID,
1115	timestamp, CPU, and other useful data.
1116
1117  latency-format
1118	This option changes the trace output. When it is enabled,
1119	the trace displays additional information about the
1120	latency, as described in "Latency trace format".
1121
1122  record-cmd
1123	When any event or tracer is enabled, a hook is enabled
	in the sched_switch trace point to fill the comm cache
1125	with mapped pids and comms. But this may cause some
1126	overhead, and if you only care about pids, and not the
1127	name of the task, disabling this option can lower the
1128	impact of tracing. See "saved_cmdlines".
1129
1130  record-tgid
1131	When any event or tracer is enabled, a hook is enabled
	in the sched_switch trace point to fill the cache mapping
	Thread Group IDs (TGIDs) to pids. See
1134	"saved_tgids".
1135
1136  overwrite
1137	This controls what happens when the trace buffer is
1138	full. If "1" (default), the oldest events are
1139	discarded and overwritten. If "0", then the newest
1140	events are discarded.
1141	(see per_cpu/cpu0/stats for overrun and dropped)
1142
1143  disable_on_free
1144	When the free_buffer is closed, tracing will
1145	stop (tracing_on set to 0).
1146
1147  irq-info
1148	Shows the interrupt, preempt count, need resched data.
1149	When disabled, the trace looks like::
1150
1151		# tracer: function
1152		#
1153		# entries-in-buffer/entries-written: 144405/9452052   #P:4
1154		#
1155		#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
1156		#              | |       |          |         |
1157			  <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1158			  <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
1159			  <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
1160
1161
1162  markers
1163	When set, the trace_marker is writable (only by root).
1164	When disabled, the trace_marker will error with EINVAL
1165	on write.
1166
1167  event-fork
1168	When set, tasks with PIDs listed in set_event_pid will have
1169	the PIDs of their children added to set_event_pid when those
1170	tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1171	their PIDs will be removed from the file.
1172
1173  function-trace
1174	The latency tracers will enable function tracing
	if this option is enabled (which it is by default). When
1176	it is disabled, the latency tracers do not trace
1177	functions. This keeps the overhead of the tracer down
1178	when performing latency tests.
1179
1180  function-fork
1181	When set, tasks with PIDs listed in set_ftrace_pid will
1182	have the PIDs of their children added to set_ftrace_pid
1183	when those tasks fork. Also, when tasks with PIDs in
1184	set_ftrace_pid exit, their PIDs will be removed from the
1185	file.
1186
1187  display-graph
1188	When set, the latency tracers (irqsoff, wakeup, etc) will
1189	use function graph tracing instead of function tracing.
1190
1191  stacktrace
1192	When set, a stack trace is recorded after any trace event
1193	is recorded.
1194
1195  branch
	Enable branch tracing with the tracer. This enables the branch
	tracer along with the currently set tracer. Enabling this
1198	with the "nop" tracer is the same as just enabling the
1199	"branch" tracer.
1200
1201.. tip:: Some tracers have their own options. They only appear in this
1202       file when the tracer is active. They always appear in the
1203       options directory.
1204
1205
1206Here are the per tracer options:
1207
1208Options for function tracer:
1209
1210  func_stack_trace
1211	When set, a stack trace is recorded after every
1212	function that is recorded. NOTE! Limit the functions
	that are recorded before enabling this with
	"set_ftrace_filter", otherwise the system performance
1215	will be critically degraded. Remember to disable
1216	this option before clearing the function filter.
1217
1218Options for function_graph tracer:
1219
1220 Since the function_graph tracer has a slightly different output
1221 it has its own options to control what is displayed.
1222
1223  funcgraph-overrun
1224	When set, the "overrun" of the graph stack is
1225	displayed after each function traced. The
	overrun is when the stack depth of the calls
1227	is greater than what is reserved for each task.
1228	Each task has a fixed array of functions to
1229	trace in the call graph. If the depth of the
1230	calls exceeds that, the function is not traced.
1231	The overrun is the number of functions missed
1232	due to exceeding this array.
1233
1234  funcgraph-cpu
1235	When set, the CPU number of the CPU where the trace
1236	occurred is displayed.
1237
1238  funcgraph-overhead
	When set, if the function takes longer than
	a certain amount, then a delay marker is
1241	displayed. See "delay" above, under the
1242	header description.
1243
1244  funcgraph-proc
1245	Unlike other tracers, the process' command line
1246	is not displayed by default, but instead only
1247	when a task is traced in and out during a context
	switch. Enabling this option displays the command
	of each process at every line.
1250
1251  funcgraph-duration
	At the end of each function (the return),
	the duration of the time spent in the
	function is displayed in microseconds.
1255
1256  funcgraph-abstime
1257	When set, the timestamp is displayed at each line.
1258
1259  funcgraph-irqs
1260	When disabled, functions that happen inside an
1261	interrupt will not be traced.
1262
1263  funcgraph-tail
1264	When set, the return event will include the function
1265	that it represents. By default this is off, and
1266	only a closing curly bracket "}" is displayed for
1267	the return of a function.
1268
1269  sleep-time
	When running the function graph tracer, include
	the time a task schedules out in its function.
	When enabled, the time the task has been scheduled
	out is accounted as part of the function call.
1274
1275  graph-time
	When running the function profiler with the function graph
	tracer, include the time to call nested functions. When this is
1278	not set, the time reported for the function will only
1279	include the time the function itself executed for, not the
1280	time for functions that it called.
1281
1282Options for blk tracer:
1283
1284  blk_classic
1285	Shows a more minimalistic output.
1286
1287
1288irqsoff
1289-------
1290
1291When interrupts are disabled, the CPU can not react to any other
1292external event (besides NMIs and SMIs). This prevents the timer
1293interrupt from triggering or the mouse interrupt from letting
1294the kernel know of a new mouse event. The result is a latency
1295with the reaction time.
1296
1297The irqsoff tracer tracks the time for which interrupts are
1298disabled. When a new maximum latency is hit, the tracer saves
1299the trace leading up to that latency point so that every time a
1300new maximum is reached, the old saved trace is discarded and the
1301new trace is saved.
1302
1303To reset the maximum, echo 0 into tracing_max_latency. Here is
1304an example::
1305
1306  # echo 0 > options/function-trace
1307  # echo irqsoff > current_tracer
1308  # echo 1 > tracing_on
1309  # echo 0 > tracing_max_latency
1310  # ls -ltr
1311  [...]
1312  # echo 0 > tracing_on
1313  # cat trace
1314  # tracer: irqsoff
1315  #
1316  # irqsoff latency trace v1.1.5 on 3.8.0-test+
1317  # --------------------------------------------------------------------
1318  # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1319  #    -----------------
1320  #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1321  #    -----------------
1322  #  => started at: run_timer_softirq
1323  #  => ended at:   run_timer_softirq
1324  #
1325  #
1326  #                  _------=> CPU#
1327  #                 / _-----=> irqs-off
1328  #                | / _----=> need-resched
1329  #                || / _---=> hardirq/softirq
1330  #                ||| / _--=> preempt-depth
1331  #                |||| /     delay
1332  #  cmd     pid   ||||| time  |   caller
1333  #     \   /      |||||  \    |   /
1334    <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
1335    <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
1336    <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
1337    <idle>-0       0dNs3   25us : <stack trace>
1338   => _raw_spin_unlock_irq
1339   => run_timer_softirq
1340   => __do_softirq
1341   => call_softirq
1342   => do_softirq
1343   => irq_exit
1344   => smp_apic_timer_interrupt
1345   => apic_timer_interrupt
1346   => rcu_idle_exit
1347   => cpu_idle
1348   => rest_init
1349   => start_kernel
1350   => x86_64_start_reservations
1351   => x86_64_start_kernel
1352
Here we see that we had a latency of 16 microseconds (which is
1354very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1355interrupts. The difference between the 16 and the displayed
1356timestamp 25us occurred because the clock was incremented
1357between the time of recording the max latency and the time of
1358recording the function that had that latency.
1359
1360Note the above example had function-trace not set. If we set
1361function-trace, we get a much larger output::
1362
1363 with echo 1 > options/function-trace
1364
1365  # tracer: irqsoff
1366  #
1367  # irqsoff latency trace v1.1.5 on 3.8.0-test+
1368  # --------------------------------------------------------------------
1369  # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1370  #    -----------------
1371  #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1372  #    -----------------
1373  #  => started at: ata_scsi_queuecmd
1374  #  => ended at:   ata_scsi_queuecmd
1375  #
1376  #
1377  #                  _------=> CPU#
1378  #                 / _-----=> irqs-off
1379  #                | / _----=> need-resched
1380  #                || / _---=> hardirq/softirq
1381  #                ||| / _--=> preempt-depth
1382  #                |||| /     delay
1383  #  cmd     pid   ||||| time  |   caller
1384  #     \   /      |||||  \    |   /
1385      bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1386      bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
1387      bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1388      bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1389      bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1390      bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
1391      bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
1392      bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1393      bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1394  [...]
1395      bash-2042    3d..1   67us : delay_tsc <-__delay
1396      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1397      bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
1398      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1399      bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
1400      bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1401      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1402      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1403      bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1404      bash-2042    3d..1  120us : <stack trace>
1405   => _raw_spin_unlock_irqrestore
1406   => ata_scsi_queuecmd
1407   => scsi_dispatch_cmd
1408   => scsi_request_fn
1409   => __blk_run_queue_uncond
1410   => __blk_run_queue
1411   => blk_queue_bio
1412   => generic_make_request
1413   => submit_bio
1414   => submit_bh
1415   => __ext3_get_inode_loc
1416   => ext3_iget
1417   => ext3_lookup
1418   => lookup_real
1419   => __lookup_hash
1420   => walk_component
1421   => lookup_last
1422   => path_lookupat
1423   => filename_lookup
1424   => user_path_at_empty
1425   => user_path_at
1426   => vfs_fstatat
1427   => vfs_stat
1428   => sys_newstat
1429   => system_call_fastpath
1430
1431
1432Here we traced a 71 microsecond latency. But we also see all the
1433functions that were called during that time. Note that by
1434enabling function tracing, we incur an added overhead. This
1435overhead may extend the latency times. But nevertheless, this
1436trace has provided some very helpful debugging information.
1437
1438If we prefer function graph output instead of function, we can set
1439display-graph option::
1440
1441 with echo 1 > options/display-graph
1442
1443  # tracer: irqsoff
1444  #
1445  # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
1446  # --------------------------------------------------------------------
1447  # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
1448  #    -----------------
1449  #    | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
1450  #    -----------------
1451  #  => started at: free_debug_processing
1452  #  => ended at:   return_to_handler
1453  #
1454  #
1455  #                                       _-----=> irqs-off
1456  #                                      / _----=> need-resched
1457  #                                     | / _---=> hardirq/softirq
1458  #                                     || / _--=> preempt-depth
1459  #                                     ||| /
1460  #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
1461  #      |          |     |    |        ||||      |   |                     |   |   |   |
1462          0 us |   0)   bash-1507    |  d... |   0.000 us    |  _raw_spin_lock_irqsave();
1463          0 us |   0)   bash-1507    |  d..1 |   0.378 us    |    do_raw_spin_trylock();
1464          1 us |   0)   bash-1507    |  d..2 |               |    set_track() {
1465          2 us |   0)   bash-1507    |  d..2 |               |      save_stack_trace() {
1466          2 us |   0)   bash-1507    |  d..2 |               |        __save_stack_trace() {
1467          3 us |   0)   bash-1507    |  d..2 |               |          __unwind_start() {
1468          3 us |   0)   bash-1507    |  d..2 |               |            get_stack_info() {
1469          3 us |   0)   bash-1507    |  d..2 |   0.351 us    |              in_task_stack();
1470          4 us |   0)   bash-1507    |  d..2 |   1.107 us    |            }
1471  [...]
1472       3750 us |   0)   bash-1507    |  d..1 |   0.516 us    |      do_raw_spin_unlock();
1473       3750 us |   0)   bash-1507    |  d..1 |   0.000 us    |  _raw_spin_unlock_irqrestore();
1474       3764 us |   0)   bash-1507    |  d..1 |   0.000 us    |  tracer_hardirqs_on();
1475      bash-1507    0d..1 3792us : <stack trace>
1476   => free_debug_processing
1477   => __slab_free
1478   => kmem_cache_free
1479   => vm_area_free
1480   => remove_vma
1481   => exit_mmap
1482   => mmput
1483   => flush_old_exec
1484   => load_elf_binary
1485   => search_binary_handler
1486   => __do_execve_file.isra.32
1487   => __x64_sys_execve
1488   => do_syscall_64
1489   => entry_SYSCALL_64_after_hwframe
1490
1491preemptoff
1492----------
1493
1494When preemption is disabled, we may be able to receive
1495interrupts but the task cannot be preempted and a higher
1496priority task must wait for preemption to be enabled again
1497before it can preempt a lower priority task.
1498
1499The preemptoff tracer traces the places that disable preemption.
1500Like the irqsoff tracer, it records the maximum latency for
1501which preemption was disabled. The control of preemptoff tracer
1502is much like the irqsoff tracer.
1503::
1504
1505  # echo 0 > options/function-trace
1506  # echo preemptoff > current_tracer
1507  # echo 1 > tracing_on
1508  # echo 0 > tracing_max_latency
1509  # ls -ltr
1510  [...]
1511  # echo 0 > tracing_on
1512  # cat trace
1513  # tracer: preemptoff
1514  #
1515  # preemptoff latency trace v1.1.5 on 3.8.0-test+
1516  # --------------------------------------------------------------------
1517  # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1518  #    -----------------
1519  #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1520  #    -----------------
1521  #  => started at: do_IRQ
1522  #  => ended at:   do_IRQ
1523  #
1524  #
1525  #                  _------=> CPU#
1526  #                 / _-----=> irqs-off
1527  #                | / _----=> need-resched
1528  #                || / _---=> hardirq/softirq
1529  #                ||| / _--=> preempt-depth
1530  #                |||| /     delay
1531  #  cmd     pid   ||||| time  |   caller
1532  #     \   /      |||||  \    |   /
1533      sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
1534      sshd-1991    1d..1   46us : irq_exit <-do_IRQ
1535      sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
1536      sshd-1991    1d..1   52us : <stack trace>
1537   => sub_preempt_count
1538   => irq_exit
1539   => do_IRQ
1540   => ret_from_intr
1541
1542
This trace shows a few more changes. Preemption was disabled when an
interrupt came in (notice the 'h'), and was enabled again on exit.
But we also see that interrupts were disabled when entering
the preempt off section and when leaving it (the 'd'). We do not know if
interrupts were enabled in the meantime or shortly after this
was over.
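
Enabling function tracing gives far more detail. As a sketch, the trace
below can be captured with the same sequence of commands as before, but
with the function-trace option left on::

  # echo 1 > options/function-trace
  # echo preemptoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
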
1549::
1550
1551  # tracer: preemptoff
1552  #
1553  # preemptoff latency trace v1.1.5 on 3.8.0-test+
1554  # --------------------------------------------------------------------
1555  # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1556  #    -----------------
1557  #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1558  #    -----------------
1559  #  => started at: wake_up_new_task
1560  #  => ended at:   task_rq_unlock
1561  #
1562  #
1563  #                  _------=> CPU#
1564  #                 / _-----=> irqs-off
1565  #                | / _----=> need-resched
1566  #                || / _---=> hardirq/softirq
1567  #                ||| / _--=> preempt-depth
1568  #                |||| /     delay
1569  #  cmd     pid   ||||| time  |   caller
1570  #     \   /      |||||  \    |   /
1571      bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
1572      bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
1573      bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
1574      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1575      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1576  [...]
1577      bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
1578      bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
1579      bash-1994    1d..1   13us : add_preempt_count <-irq_enter
1580      bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
1581      bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1582      bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
1583      bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
1584      bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
1585  [...]
1586      bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
1587      bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
1588      bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
1589      bash-1994    1d..2   36us : do_softirq <-irq_exit
1590      bash-1994    1d..2   36us : __do_softirq <-call_softirq
1591      bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
1592      bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
1593      bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
1594      bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
1595      bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
1596  [...]
1597      bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
1598      bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
1599      bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
1600      bash-1994    1dN.2   82us : idle_cpu <-irq_exit
1601      bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
1602      bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
1603      bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1604      bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
1605      bash-1994    1.N.1  104us : <stack trace>
1606   => sub_preempt_count
1607   => _raw_spin_unlock_irqrestore
1608   => task_rq_unlock
1609   => wake_up_new_task
1610   => do_fork
1611   => sys_clone
1612   => stub_clone
1613
1614
The above is an example of the preemptoff trace with
function-trace set. Here we see that interrupts were not disabled
the entire time. The irq_enter code lets us know that we entered
an interrupt 'h'. Just before that, the flags do not yet show that
we are in an interrupt, but the functions being traced (such as
smp_apic_timer_interrupt) tell us that we already are.
1621
1622preemptirqsoff
1623--------------
1624
Knowing the locations that have interrupts disabled or
preemption disabled for the longest times is helpful. But
sometimes we would like to know the longest time for which either
preemption or interrupts (or both) are disabled.
1629
1630Consider the following code::
1631
1632    local_irq_disable();
1633    call_function_with_irqs_off();
1634    preempt_disable();
1635    call_function_with_irqs_and_preemption_off();
1636    local_irq_enable();
1637    call_function_with_preemption_off();
1638    preempt_enable();
1639
1640The irqsoff tracer will record the total length of
1641call_function_with_irqs_off() and
1642call_function_with_irqs_and_preemption_off().
1643
1644The preemptoff tracer will record the total length of
1645call_function_with_irqs_and_preemption_off() and
1646call_function_with_preemption_off().
1647
But neither will trace the total time that either interrupts or
preemption is disabled. This total time is the time during which we
cannot schedule. To record this time, use the preemptirqsoff
tracer.
1652
1653Again, using this trace is much like the irqsoff and preemptoff
1654tracers.
1655::
1656
1657  # echo 0 > options/function-trace
1658  # echo preemptirqsoff > current_tracer
1659  # echo 1 > tracing_on
1660  # echo 0 > tracing_max_latency
1661  # ls -ltr
1662  [...]
1663  # echo 0 > tracing_on
1664  # cat trace
1665  # tracer: preemptirqsoff
1666  #
1667  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1668  # --------------------------------------------------------------------
1669  # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1670  #    -----------------
1671  #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1672  #    -----------------
1673  #  => started at: ata_scsi_queuecmd
1674  #  => ended at:   ata_scsi_queuecmd
1675  #
1676  #
1677  #                  _------=> CPU#
1678  #                 / _-----=> irqs-off
1679  #                | / _----=> need-resched
1680  #                || / _---=> hardirq/softirq
1681  #                ||| / _--=> preempt-depth
1682  #                |||| /     delay
1683  #  cmd     pid   ||||| time  |   caller
1684  #     \   /      |||||  \    |   /
1685        ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1686        ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1687        ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
1688        ls-2230    3...1  111us : <stack trace>
1689   => sub_preempt_count
1690   => _raw_spin_unlock_irqrestore
1691   => ata_scsi_queuecmd
1692   => scsi_dispatch_cmd
1693   => scsi_request_fn
1694   => __blk_run_queue_uncond
1695   => __blk_run_queue
1696   => blk_queue_bio
1697   => generic_make_request
1698   => submit_bio
1699   => submit_bh
1700   => ext3_bread
1701   => ext3_dir_bread
1702   => htree_dirblock_to_tree
1703   => ext3_htree_fill_tree
1704   => ext3_readdir
1705   => vfs_readdir
1706   => sys_getdents
1707   => system_call_fastpath
1708
1709
1710The trace_hardirqs_off_thunk is called from assembly on x86 when
interrupts are disabled in the assembly code. Without
function tracing, we do not know if interrupts were enabled
1713within the preemption points. We do see that it started with
1714preemption enabled.
1715
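To turn function tracing back on for the next run (a minimal sketch)::

  # echo 1 > options/function-trace
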
1716Here is a trace with function-trace set::
1717
1718  # tracer: preemptirqsoff
1719  #
1720  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1721  # --------------------------------------------------------------------
1722  # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1723  #    -----------------
1724  #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1725  #    -----------------
1726  #  => started at: schedule
1727  #  => ended at:   mutex_unlock
1728  #
1729  #
1730  #                  _------=> CPU#
1731  #                 / _-----=> irqs-off
1732  #                | / _----=> need-resched
1733  #                || / _---=> hardirq/softirq
1734  #                ||| / _--=> preempt-depth
1735  #                |||| /     delay
1736  #  cmd     pid   ||||| time  |   caller
1737  #     \   /      |||||  \    |   /
1738  kworker/-59      3...1    0us : __schedule <-schedule
1739  kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
1740  kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
1741  kworker/-59      3d..2    1us : deactivate_task <-__schedule
1742  kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
1743  kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
1744  kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
1745  kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
1746  kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
1747  kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
1748  kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
1749  kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
1750  kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1751  kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
1752  kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
1753  kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
1754  kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
1755  kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
1756  kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
1757  kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
1758  kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
1759  kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
1760  kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
1761  kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
1762  kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
1763        ls-2269    3d..2    7us : finish_task_switch <-__schedule
1764        ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
1765        ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
1766        ls-2269    3d..2    8us : irq_enter <-do_IRQ
1767        ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
1768        ls-2269    3d..2    9us : add_preempt_count <-irq_enter
1769        ls-2269    3d.h2    9us : exit_idle <-do_IRQ
1770  [...]
1771        ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
1772        ls-2269    3d.h2   20us : irq_exit <-do_IRQ
1773        ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
1774        ls-2269    3d..3   21us : do_softirq <-irq_exit
1775        ls-2269    3d..3   21us : __do_softirq <-call_softirq
1776        ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
1777        ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
1778        ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
1779        ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
1780        ls-2269    3d.s5   31us : irq_enter <-do_IRQ
1781        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1782  [...]
1783        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1784        ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
1785        ls-2269    3d.H5   32us : exit_idle <-do_IRQ
1786        ls-2269    3d.H5   32us : handle_irq <-do_IRQ
1787        ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
1788        ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
1789  [...]
1790        ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1791        ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1792        ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
1793        ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
1794        ls-2269    3d..3  159us : idle_cpu <-irq_exit
1795        ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
1796        ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
1797        ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
1798        ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
1799        ls-2269    3d...  186us : <stack trace>
1800   => __mutex_unlock_slowpath
1801   => mutex_unlock
1802   => process_output
1803   => n_tty_write
1804   => tty_write
1805   => vfs_write
1806   => sys_write
1807   => system_call_fastpath
1808
1809This is an interesting trace. It started with kworker running and
1810scheduling out and ls taking over. But as soon as ls released the
1811rq lock and enabled interrupts (but not preemption) an interrupt
1812triggered. When the interrupt finished, it started running softirqs.
1813But while the softirq was running, another interrupt triggered.
1814When an interrupt is running inside a softirq, the annotation is 'H'.
1815
1816
1817wakeup
1818------
1819
1820One common case that people are interested in tracing is the
time it takes from when a task is woken to when it actually runs.
Now for non Real-Time tasks, this can be arbitrary. But tracing
it nonetheless can be interesting.
1824
1825Without function tracing::
1826
1827  # echo 0 > options/function-trace
1828  # echo wakeup > current_tracer
1829  # echo 1 > tracing_on
1830  # echo 0 > tracing_max_latency
1831  # chrt -f 5 sleep 1
1832  # echo 0 > tracing_on
1833  # cat trace
1834  # tracer: wakeup
1835  #
1836  # wakeup latency trace v1.1.5 on 3.8.0-test+
1837  # --------------------------------------------------------------------
1838  # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1839  #    -----------------
1840  #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1841  #    -----------------
1842  #
1843  #                  _------=> CPU#
1844  #                 / _-----=> irqs-off
1845  #                | / _----=> need-resched
1846  #                || / _---=> hardirq/softirq
1847  #                ||| / _--=> preempt-depth
1848  #                |||| /     delay
1849  #  cmd     pid   ||||| time  |   caller
1850  #     \   /      |||||  \    |   /
1851    <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
1852    <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1853    <idle>-0       3d..3   15us : __schedule <-schedule
1854    <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
1855
The tracer only traces the highest priority task in the system
to avoid tracing normal circumstances. Here we see that
the kworker with a nice priority of -20 (not very nice) took
just 15 microseconds from the time it woke up to the time it
ran.
1861
1862Non Real-Time tasks are not that interesting. A more interesting
1863trace is to concentrate only on Real-Time tasks.
1864
1865wakeup_rt
1866---------
1867
In a Real-Time environment it is very important to know the
time it takes from when the highest priority task is woken
up to when it executes. This is also known as "schedule
latency". I stress the point that this is about RT tasks. It is
also important to know the scheduling latency of non-RT tasks,
but for non-RT tasks the average schedule latency is usually more interesting.
1874Tools like LatencyTop are more appropriate for such
1875measurements.
1876
1877Real-Time environments are interested in the worst case latency.
1878That is the longest latency it takes for something to happen,
1879and not the average. We can have a very fast scheduler that may
1880only have a large latency once in a while, but that would not
1881work well with Real-Time tasks.  The wakeup_rt tracer was designed
1882to record the worst case wakeups of RT tasks. Non-RT tasks are
1883not recorded because the tracer only records one worst case and
1884tracing non-RT tasks that are unpredictable will overwrite the
1885worst case latency of RT tasks (just run the normal wakeup
1886tracer for a while to see that effect).
1887
1888Since this tracer only deals with RT tasks, we will run this
1889slightly differently than we did with the previous tracers.
1890Instead of performing an 'ls', we will run 'sleep 1' under
1891'chrt' which changes the priority of the task.
1892::
1893
1894  # echo 0 > options/function-trace
1895  # echo wakeup_rt > current_tracer
1896  # echo 1 > tracing_on
1897  # echo 0 > tracing_max_latency
1898  # chrt -f 5 sleep 1
1899  # echo 0 > tracing_on
1900  # cat trace
1903  # tracer: wakeup_rt
1904  #
1905  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1906  # --------------------------------------------------------------------
1907  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1908  #    -----------------
1909  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1910  #    -----------------
1911  #
1912  #                  _------=> CPU#
1913  #                 / _-----=> irqs-off
1914  #                | / _----=> need-resched
1915  #                || / _---=> hardirq/softirq
1916  #                ||| / _--=> preempt-depth
1917  #                |||| /     delay
1918  #  cmd     pid   ||||| time  |   caller
1919  #     \   /      |||||  \    |   /
1920    <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
1921    <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1922    <idle>-0       3d..3    5us : __schedule <-schedule
1923    <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1924
1925
1926Running this on an idle system, we see that it only took 5 microseconds
to perform the task switch.  Note, since the trace point in the scheduler
1928is before the actual "switch", we stop the tracing when the recorded task
1929is about to schedule in. This may change if we add a new marker at the
1930end of the scheduler.
1931
1932Notice that the recorded task is 'sleep' with the PID of 2389
1933and it has an rt_prio of 5. This priority is user-space priority
1934and not the internal kernel priority. The policy is 1 for
1935SCHED_FIFO and 2 for SCHED_RR.
1936
Note that the trace data shows the internal priority (99 - rtprio).
1938::
1939
1940  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1941
1942The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
1943and in the running state 'R'. The sleep task was scheduled in with
2389: 94:R. That is, its priority is the internal kernel rtprio (99 - 5 = 94)
1945and it too is in the running state.
1946
1947Doing the same with chrt -r 5 and function-trace set.
1948::
1949
1950  echo 1 > options/function-trace
1951
1952  # tracer: wakeup_rt
1953  #
1954  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1955  # --------------------------------------------------------------------
1956  # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1957  #    -----------------
1958  #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
1959  #    -----------------
1960  #
1961  #                  _------=> CPU#
1962  #                 / _-----=> irqs-off
1963  #                | / _----=> need-resched
1964  #                || / _---=> hardirq/softirq
1965  #                ||| / _--=> preempt-depth
1966  #                |||| /     delay
1967  #  cmd     pid   ||||| time  |   caller
1968  #     \   /      |||||  \    |   /
1969    <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
1970    <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1971    <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
1972    <idle>-0       3d.h3    3us : resched_curr <-check_preempt_curr
1973    <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
1974    <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
1975    <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
1976    <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
1977    <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
1978    <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1979    <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
1980    <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
1981    <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
1982    <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
1983    <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
1984    <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
1985    <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
1986    <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
1987    <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
1988    <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
1989    <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
1990    <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
1991    <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
1992    <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
1993    <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
1994    <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
1995    <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
1996    <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
1997    <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
1998    <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
1999    <idle>-0       3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
2000    <idle>-0       3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
2001    <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
2002    <idle>-0       3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
2003    <idle>-0       3dN.2   14us : sched_avg_update <-__cpu_load_update
2004    <idle>-0       3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
2005    <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
2006    <idle>-0       3dN.1   15us : calc_load_nohz_stop <-tick_nohz_idle_exit
2007    <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
2008    <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
2009    <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
2010    <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
2011    <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2012    <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
2013    <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
2014    <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
2015    <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
2016    <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
2017    <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
2018    <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
2019    <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
2020    <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2021    <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
2022    <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
2023    <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
2024    <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2025    <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
2026    <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
2027    <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2028    <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
2029    <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
2030    <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
2031    <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
2032    <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
2033    <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
2034    <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
2035    <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
2036    <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2037    <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
2038    <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
2039    <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
2040    <idle>-0       3.N..   25us : schedule <-cpu_idle
2041    <idle>-0       3.N..   25us : __schedule <-preempt_schedule
2042    <idle>-0       3.N..   26us : add_preempt_count <-__schedule
2043    <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
2044    <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
2045    <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
2046    <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
2047    <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
2048    <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
2049    <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
2050    <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
2051    <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
2052    <idle>-0       3d..3   29us : __schedule <-preempt_schedule
2053    <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
2054
2055This isn't that big of a trace, even with function tracing enabled,
2056so I included the entire trace.
2057
The interrupt went off while the system was idle. Somewhere
before task_woken_rt() was called, the NEED_RESCHED flag was set;
this is indicated by the first occurrence of the 'N' flag.
2061
2062Latency tracing and events
2063--------------------------

Function tracing can induce a much larger latency, but without
seeing what happens within the latency it is hard to know what
caused it. There is a middle ground, and that is enabling
events.
2068::
2069
2070  # echo 0 > options/function-trace
2071  # echo wakeup_rt > current_tracer
2072  # echo 1 > events/enable
2073  # echo 1 > tracing_on
2074  # echo 0 > tracing_max_latency
2075  # chrt -f 5 sleep 1
2076  # echo 0 > tracing_on
2077  # cat trace
2078  # tracer: wakeup_rt
2079  #
2080  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2081  # --------------------------------------------------------------------
2082  # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2083  #    -----------------
2084  #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
2085  #    -----------------
2086  #
2087  #                  _------=> CPU#
2088  #                 / _-----=> irqs-off
2089  #                | / _----=> need-resched
2090  #                || / _---=> hardirq/softirq
2091  #                ||| / _--=> preempt-depth
2092  #                |||| /     delay
2093  #  cmd     pid   ||||| time  |   caller
2094  #     \   /      |||||  \    |   /
2095    <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
2096    <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2097    <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
2098    <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
2099    <idle>-0       2.N.2    2us : power_end: cpu_id=2
2100    <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
2101    <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
2102    <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
2103    <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
2104    <idle>-0       2.N.2    5us : rcu_utilization: End context switch
2105    <idle>-0       2d..3    6us : __schedule <-schedule
2106    <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
2107
2108
2109Hardware Latency Detector
2110-------------------------
2111
2112The hardware latency detector is executed by enabling the "hwlat" tracer.
2113
2114NOTE, this tracer will affect the performance of the system as it will
2115periodically make a CPU constantly busy with interrupts disabled.
2116::
2117
2118  # echo hwlat > current_tracer
2119  # sleep 100
2120  # cat trace
2121  # tracer: hwlat
2122  #
2123  #                              _-----=> irqs-off
2124  #                             / _----=> need-resched
2125  #                            | / _---=> hardirq/softirq
2126  #                            || / _--=> preempt-depth
2127  #                            ||| /     delay
2128  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2129  #              | |       |   ||||       |         |
2130             <...>-3638  [001] d... 19452.055471: #1     inner/outer(us):   12/14    ts:1499801089.066141940
2131             <...>-3638  [003] d... 19454.071354: #2     inner/outer(us):   11/9     ts:1499801091.082164365
2132             <...>-3638  [002] dn.. 19461.126852: #3     inner/outer(us):   12/9     ts:1499801098.138150062
2133             <...>-3638  [001] d... 19488.340960: #4     inner/outer(us):    8/12    ts:1499801125.354139633
2134             <...>-3638  [003] d... 19494.388553: #5     inner/outer(us):    8/12    ts:1499801131.402150961
2135             <...>-3638  [003] d... 19501.283419: #6     inner/outer(us):    0/12    ts:1499801138.297435289 nmi-total:4 nmi-count:1
2136
2137
The above output has a header similar to the other tracers. All events will have
2139interrupts disabled 'd'. Under the FUNCTION title there is:
2140
2141 #1
2142	This is the count of events recorded that were greater than the
2143	tracing_threshold (See below).
2144
2145 inner/outer(us):   12/14
2146
      This shows two numbers: the "inner latency" and the "outer latency". The test
      runs in a loop checking a timestamp twice. The latency detected within
      the two timestamps is the "inner latency" and the latency detected
      between the previous timestamp and the next timestamp of the loop is
      the "outer latency".
2152
2153 ts:1499801089.066141940
2154
2155      The absolute timestamp that the event happened.
2156
2157 nmi-total:4 nmi-count:1
2158
2159      On architectures that support it, if an NMI comes in during the
2160      test, the time spent in NMI is reported in "nmi-total" (in
2161      microseconds).
2162
2163      All architectures that have NMIs will show the "nmi-count" if an
2164      NMI comes in during the test.
2165
2166hwlat files:
2167
2168  tracing_threshold
2169	This gets automatically set to "10" to represent 10
2170	microseconds. This is the threshold of latency that
2171	needs to be detected before the trace will be recorded.
2172
2173	Note, when hwlat tracer is finished (another tracer is
2174	written into "current_tracer"), the original value for
2175	tracing_threshold is placed back into this file.
2176
2177  hwlat_detector/width
2178	The length of time the test runs with interrupts disabled.
2179
2180  hwlat_detector/window
	The length of time of the window in which the test
	runs. That is, the test will run for "width"
	microseconds per "window" microseconds.
2184
2185  tracing_cpumask
	When the test is started, a kernel thread is created to
	run the test. This thread will alternate between the CPUs
	listed in the tracing_cpumask between each period
	(one "window"). To limit the test to specific CPUs,
	set the mask in this file to only the CPUs that the test
	should run on.
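
As a sketch (the values are arbitrary examples), these files can be
used to tune the detector before it is started: a 20 microsecond
threshold, a 250000 microsecond test width inside a 1000000 microsecond
window, and CPUs 0 and 1 only (mask 3)::

  # echo 20 > tracing_threshold
  # echo 250000 > hwlat_detector/width
  # echo 1000000 > hwlat_detector/window
  # echo 3 > tracing_cpumask
  # echo hwlat > current_tracer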
2192
2193function
2194--------
2195
2196This tracer is the function tracer. Enabling the function tracer
can be done from the tracefs file system. Make sure
ftrace_enabled is set; otherwise this tracer is a nop.
2199See the "ftrace_enabled" section below.
2200::
2201
2202  # sysctl kernel.ftrace_enabled=1
2203  # echo function > current_tracer
2204  # echo 1 > tracing_on
2205  # usleep 1
2206  # echo 0 > tracing_on
2207  # cat trace
2208  # tracer: function
2209  #
2210  # entries-in-buffer/entries-written: 24799/24799   #P:4
2211  #
2212  #                              _-----=> irqs-off
2213  #                             / _----=> need-resched
2214  #                            | / _---=> hardirq/softirq
2215  #                            || / _--=> preempt-depth
2216  #                            ||| /     delay
2217  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2218  #              | |       |   ||||       |         |
2219              bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
2220              bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2221              bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
2222              bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
2223              bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
2224              bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
2225              bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
2226              bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
2227  [...]
2228
2229
2230Note: function tracer uses ring buffers to store the above
2231entries. The newest data may overwrite the oldest data.
2232Sometimes using echo to stop the trace is not sufficient because
2233the tracing could have overwritten the data that you wanted to
2234record. For this reason, it is sometimes better to disable
2235tracing directly from a program. This allows you to stop the
2236tracing at the point that you hit the part that you are
2237interested in. To disable the tracing directly from a C program,
something like the following code snippet can be used::
2239
2240	int trace_fd;
2241	[...]
2242	int main(int argc, char *argv[]) {
2243		[...]
2244		trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2245		[...]
2246		if (condition_hit()) {
2247			write(trace_fd, "0", 1);
2248		}
2249		[...]
2250	}
2251
2252
2253Single thread tracing
2254---------------------
2255
2256By writing into set_ftrace_pid you can trace a
2257single thread. For example::
2258
2259  # cat set_ftrace_pid
2260  no pid
2261  # echo 3111 > set_ftrace_pid
2262  # cat set_ftrace_pid
2263  3111
2264  # echo function > current_tracer
2265  # cat trace | head
2266  # tracer: function
2267  #
2268  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2269  #              | |       |          |         |
2270      yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
2271      yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2272      yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2273      yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2274      yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
2275      yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
2276  # echo > set_ftrace_pid
2277  # cat trace |head
2278  # tracer: function
2279  #
2280  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2281  #              | |       |          |         |
2282  ##### CPU 3 buffer started ####
2283      yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
2284      yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
2285      yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
2286      yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
2287      yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
2288
If you want to trace the functions executed while a program runs,
you could use something like this simple program.
2291::
2292
2293	#include <stdio.h>
2294	#include <stdlib.h>
2295	#include <sys/types.h>
2296	#include <sys/stat.h>
2297	#include <fcntl.h>
2298	#include <unistd.h>
2299	#include <string.h>
2300
2301	#define _STR(x) #x
2302	#define STR(x) _STR(x)
2303	#define MAX_PATH 256
2304
2305	const char *find_tracefs(void)
2306	{
2307	       static char tracefs[MAX_PATH+1];
2308	       static int tracefs_found;
2309	       char type[100];
2310	       FILE *fp;
2311
2312	       if (tracefs_found)
2313		       return tracefs;
2314
2315	       if ((fp = fopen("/proc/mounts","r")) == NULL) {
2316		       perror("/proc/mounts");
2317		       return NULL;
2318	       }
2319
2320	       while (fscanf(fp, "%*s %"
2321		             STR(MAX_PATH)
2322		             "s %99s %*s %*d %*d\n",
2323		             tracefs, type) == 2) {
2324		       if (strcmp(type, "tracefs") == 0)
2325		               break;
2326	       }
2327	       fclose(fp);
2328
2329	       if (strcmp(type, "tracefs") != 0) {
2330		       fprintf(stderr, "tracefs not mounted");
2331		       return NULL;
2332	       }
2333
	       /* the tracefs mount point is itself the tracing directory */
	       tracefs_found = 1;
2336
2337	       return tracefs;
2338	}
2339
2340	const char *tracing_file(const char *file_name)
2341	{
2342	       static char trace_file[MAX_PATH+1];
2343	       snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2344	       return trace_file;
2345	}
2346
2347	int main (int argc, char **argv)
2348	{
		if (argc < 2)
2350		        exit(-1);
2351
2352		if (fork() > 0) {
2353		        int fd, ffd;
2354		        char line[64];
2355		        int s;
2356
2357		        ffd = open(tracing_file("current_tracer"), O_WRONLY);
2358		        if (ffd < 0)
2359		                exit(-1);
2360		        write(ffd, "nop", 3);
2361
2362		        fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2363		        s = sprintf(line, "%d\n", getpid());
2364		        write(fd, line, s);
2365
2366		        write(ffd, "function", 8);
2367
2368		        close(fd);
2369		        close(ffd);
2370
2371		        execvp(argv[1], argv+1);
2372		}
2373
2374		return 0;
2375	}
2376
2377Or this simple script!
2378::
2379
2380  #!/bin/bash
2381
  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts | head -1`
  echo nop > $tracefs/current_tracer
  echo 0 > $tracefs/tracing_on
  echo $$ > $tracefs/set_ftrace_pid
  echo function > $tracefs/current_tracer
  echo 1 > $tracefs/tracing_on
2388  exec "$@"
2389
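For example, if the script above were saved as a (hypothetical) file
named ftrace-this, running "./ftrace-this ls -l" would function-trace
only the ls -l command, which keeps the script's pid thanks to the
exec.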
2390
2391function graph tracer
2392---------------------------
2393
2394This tracer is similar to the function tracer except that it
2395probes a function on its entry and its exit. This is done by
2396using a dynamically allocated stack of return addresses in each
2397task_struct. On function entry the tracer overwrites the return
2398address of each function traced to set a custom probe. Thus the
2399original return address is stored on the stack of return address
2400in the task_struct.
2401
2402Probing on both ends of a function leads to special features
2403such as:
2404
- measuring a function's execution time
- having a reliable call stack to draw a graph of function calls
2407
2408This tracer is useful in several situations:
2409
- you want to find the reason for strange kernel behavior and
  need to see what happens in detail in any area (or in specific
  ones).

- you are experiencing weird latencies but it's difficult to
  find their origin.

- you want to quickly find which path is taken by a specific
  function
2419
2420- you just want to peek inside a working kernel and want to see
2421  what happens there.
2422
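As a minimal sketch, the graph output shown next can be produced by
selecting the tracer like any other::

  # echo function_graph > current_tracer
  # cat trace
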
2423::
2424
2425  # tracer: function_graph
2426  #
2427  # CPU  DURATION                  FUNCTION CALLS
2428  # |     |   |                     |   |   |   |
2429
2430   0)               |  sys_open() {
2431   0)               |    do_sys_open() {
2432   0)               |      getname() {
2433   0)               |        kmem_cache_alloc() {
2434   0)   1.382 us    |          __might_sleep();
2435   0)   2.478 us    |        }
2436   0)               |        strncpy_from_user() {
2437   0)               |          might_fault() {
2438   0)   1.389 us    |            __might_sleep();
2439   0)   2.553 us    |          }
2440   0)   3.807 us    |        }
2441   0)   7.876 us    |      }
2442   0)               |      alloc_fd() {
2443   0)   0.668 us    |        _spin_lock();
2444   0)   0.570 us    |        expand_files();
2445   0)   0.586 us    |        _spin_unlock();
2446
2447
2448There are several columns that can be dynamically
2449enabled/disabled. You can use every combination of options you
2450want, depending on your needs.
2451
- The cpu number on which the function executed is default
  enabled.  It is sometimes better to only trace one cpu (see
  the tracing_cpumask file) or you might sometimes see unordered
  function calls when the trace switches between cpus.
2456
2457	- hide: echo nofuncgraph-cpu > trace_options
2458	- show: echo funcgraph-cpu > trace_options
2459
- The duration (function's time of execution) is displayed on
  the closing bracket line of a function or on the same line
  as the current function in the case of a leaf one. It is default
  enabled.
2464
2465	- hide: echo nofuncgraph-duration > trace_options
2466	- show: echo funcgraph-duration > trace_options
2467
- The overhead field precedes the duration field when certain
  duration thresholds are reached.
2470
2471	- hide: echo nofuncgraph-overhead > trace_options
2472	- show: echo funcgraph-overhead > trace_options
2473	- depends on: funcgraph-duration
2474
2475  ie::
2476
2477    3) # 1837.709 us |          } /* __switch_to */
2478    3)               |          finish_task_switch() {
2479    3)   0.313 us    |            _raw_spin_unlock_irq();
2480    3)   3.177 us    |          }
2481    3) # 1889.063 us |        } /* __schedule */
2482    3) ! 140.417 us  |      } /* __schedule */
2483    3) # 2034.948 us |    } /* schedule */
2484    3) * 33998.59 us |  } /* schedule_preempt_disabled */
2485
2486    [...]
2487
2488    1)   0.260 us    |              msecs_to_jiffies();
2489    1)   0.313 us    |              __rcu_read_unlock();
2490    1) + 61.770 us   |            }
2491    1) + 64.479 us   |          }
2492    1)   0.313 us    |          rcu_bh_qs();
2493    1)   0.313 us    |          __local_bh_enable();
2494    1) ! 217.240 us  |        }
2495    1)   0.365 us    |        idle_cpu();
2496    1)               |        rcu_irq_exit() {
2497    1)   0.417 us    |          rcu_eqs_enter_common.isra.47();
2498    1)   3.125 us    |        }
2499    1) ! 227.812 us  |      }
2500    1) ! 457.395 us  |    }
2501    1) @ 119760.2 us |  }
2502
2503    [...]
2504
2505    2)               |    handle_IPI() {
2506    1)   6.979 us    |                  }
2507    2)   0.417 us    |      scheduler_ipi();
2508    1)   9.791 us    |                }
2509    1) + 12.917 us   |              }
2510    2)   3.490 us    |    }
2511    1) + 15.729 us   |            }
2512    1) + 18.542 us   |          }
2513    2) $ 3594274 us  |  }
2514
2515Flags::
2516
2517  + means that the function exceeded 10 usecs.
2518  ! means that the function exceeded 100 usecs.
2519  # means that the function exceeded 1000 usecs.
2520  * means that the function exceeded 10 msecs.
2521  @ means that the function exceeded 100 msecs.
2522  $ means that the function exceeded 1 sec.
2523
2524
2525- The task/pid field displays the thread cmdline and pid which
2526  executed the function. It is default disabled.
2527
2528	- hide: echo nofuncgraph-proc > trace_options
2529	- show: echo funcgraph-proc > trace_options
2530
2531  ie::
2532
2533    # tracer: function_graph
2534    #
2535    # CPU  TASK/PID        DURATION                  FUNCTION CALLS
2536    # |    |    |           |   |                     |   |   |   |
2537    0)    sh-4802     |               |                  d_free() {
2538    0)    sh-4802     |               |                    call_rcu() {
2539    0)    sh-4802     |               |                      __call_rcu() {
2540    0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
2541    0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
2542    0)    sh-4802     |   2.899 us    |                      }
2543    0)    sh-4802     |   4.040 us    |                    }
2544    0)    sh-4802     |   5.151 us    |                  }
2545    0)    sh-4802     | + 49.370 us   |                }
2546
2547
2548- The absolute time field is an absolute timestamp given by the
2549  system clock since it started. A snapshot of this time is
  given on each entry/exit of functions.
2551
2552	- hide: echo nofuncgraph-abstime > trace_options
2553	- show: echo funcgraph-abstime > trace_options
2554
2555  ie::
2556
2557    #
2558    #      TIME       CPU  DURATION                  FUNCTION CALLS
2559    #       |         |     |   |                     |   |   |   |
2560    360.774522 |   1)   0.541 us    |                                          }
2561    360.774522 |   1)   4.663 us    |                                        }
2562    360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
2563    360.774524 |   1)   6.796 us    |                                      }
2564    360.774524 |   1)   7.952 us    |                                    }
2565    360.774525 |   1)   9.063 us    |                                  }
2566    360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
2567    360.774527 |   1)   0.578 us    |                                  __brelse();
2568    360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
2569    360.774528 |   1)               |                                    unlock_buffer() {
2570    360.774529 |   1)               |                                      wake_up_bit() {
2571    360.774529 |   1)               |                                        bit_waitqueue() {
2572    360.774530 |   1)   0.594 us    |                                          __phys_addr();
2573
2574
2575The function name is always displayed after the closing bracket
2576for a function if the start of that function is not in the
2577trace buffer.
2578
2579Display of the function name after the closing bracket may be
2580enabled for functions whose start is in the trace buffer,
2581allowing easier searching with grep for function durations.
2582It is default disabled.
2583
2584	- hide: echo nofuncgraph-tail > trace_options
2585	- show: echo funcgraph-tail > trace_options
2586
2587  Example with nofuncgraph-tail (default)::
2588
2589    0)               |      putname() {
2590    0)               |        kmem_cache_free() {
2591    0)   0.518 us    |          __phys_addr();
2592    0)   1.757 us    |        }
2593    0)   2.861 us    |      }
2594
2595  Example with funcgraph-tail::
2596
2597    0)               |      putname() {
2598    0)               |        kmem_cache_free() {
2599    0)   0.518 us    |          __phys_addr();
2600    0)   1.757 us    |        } /* kmem_cache_free() */
2601    0)   2.861 us    |      } /* putname() */
2602
2603You can put some comments on specific functions by using
trace_printk(). For example, if you want to put a comment inside
2605the __might_sleep() function, you just have to include
2606<linux/ftrace.h> and call trace_printk() inside __might_sleep()::
2607
2608	trace_printk("I'm a comment!\n")
2609
2610will produce::
2611
2612   1)               |             __might_sleep() {
2613   1)               |                /* I'm a comment! */
2614   1)   1.449 us    |             }
2615
2616
2617You might find other useful features for this tracer in the
2618following "dynamic ftrace" section such as tracing only specific
2619functions or tasks.
2620
2621dynamic ftrace
2622--------------
2623
2624If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2625virtually no overhead when function tracing is disabled. The way
2626this works is the mcount function call (placed at the start of
2627every kernel function, produced by the -pg switch in gcc),
2628starts of pointing to a simple return. (Enabling FTRACE will
2629include the -pg switch in the compiling of the kernel.)
2630
2631At compile time every C file object is run through the
2632recordmcount program (located in the scripts directory). This
2633program will parse the ELF headers in the C object to find all
2634the locations in the .text section that call mcount. Starting
with gcc version 4.6, the -mfentry option has been added for x86, which
calls "__fentry__" instead of "mcount", and which is called before
the creation of the stack frame.
2638
Note, not all functions are traced. They may be prevented by either
a notrace annotation, or blocked another way, and all inline functions
are not traced. Check the "available_filter_functions" file to see what
functions can be traced.
2643
2644A section called "__mcount_loc" is created that holds
2645references to all the mcount/fentry call sites in the .text section.
2646The recordmcount program re-links this section back into the
2647original object. The final linking stage of the kernel will add all these
2648references into a single table.
2649
2650On boot up, before SMP is initialized, the dynamic ftrace code
2651scans this table and updates all the locations into nops. It
2652also records the locations, which are added to the
2653available_filter_functions list.  Modules are processed as they
2654are loaded and before they are executed.  When a module is
2655unloaded, it also removes its functions from the ftrace function
2656list. This is automatic in the module unload code, and the
2657module author does not need to worry about it.
2658
2659When tracing is enabled, the process of modifying the function
2660tracepoints is dependent on architecture. The old method is to use
2661kstop_machine to prevent races with the CPUs executing code being
2662modified (which can cause the CPU to do undesirable things, especially
2663if the modified code crosses cache (or page) boundaries), and the nops are
2664patched back to calls. But this time, they do not call mcount
2665(which is just a function stub). They now call into the ftrace
2666infrastructure.
2667
The new method of modifying the function tracepoints is to place
a breakpoint at the location to be modified, sync all CPUs, modify
the rest of the instruction not covered by the breakpoint, sync
all CPUs again, and then remove the breakpoint, leaving the finished
version of the ftrace call site in place.
2673
2674Some archs do not even need to monkey around with the synchronization,
2675and can just slap the new code on top of the old without any
2676problems with other CPUs executing it at the same time.
2677
2678One special side-effect to the recording of the functions being
2679traced is that we can now selectively choose which functions we
2680wish to trace and which ones we want the mcount calls to remain
2681as nops.
2682
2683Two files are used, one for enabling and one for disabling the
2684tracing of specified functions. They are:
2685
2686  set_ftrace_filter
2687
2688and
2689
2690  set_ftrace_notrace
2691
2692A list of available functions that you can add to these files is
2693listed in:
2694
2695   available_filter_functions
2696
2697::
2698
2699  # cat available_filter_functions
2700  put_prev_task_idle
2701  kmem_cache_create
2702  pick_next_task_rt
2703  get_online_cpus
2704  pick_next_task_fair
2705  mutex_lock
2706  [...]
2707
2708If I am only interested in sys_nanosleep and hrtimer_interrupt::
2709
2710  # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2711  # echo function > current_tracer
2712  # echo 1 > tracing_on
2713  # usleep 1
2714  # echo 0 > tracing_on
2715  # cat trace
2716  # tracer: function
2717  #
2718  # entries-in-buffer/entries-written: 5/5   #P:4
2719  #
2720  #                              _-----=> irqs-off
2721  #                             / _----=> need-resched
2722  #                            | / _---=> hardirq/softirq
2723  #                            || / _--=> preempt-depth
2724  #                            ||| /     delay
2725  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2726  #              | |       |   ||||       |         |
2727            usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
2728            <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2729            usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2730            <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2731            <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2732
2733To see which functions are being traced, you can cat the file:
2734::
2735
2736  # cat set_ftrace_filter
2737  hrtimer_interrupt
2738  sys_nanosleep
2739
2740
2741Perhaps this is not enough. The filters also allow glob(7) matching.
2742
2743  ``<match>*``
2744	will match functions that begin with <match>
2745  ``*<match>``
2746	will match functions that end with <match>
2747  ``*<match>*``
2748	will match functions that have <match> in it
2749  ``<match1>*<match2>``
2750	will match functions that begin with <match1> and end with <match2>
2751
2752.. note::
2753      It is better to use quotes to enclose the wild cards,
2754      otherwise the shell may expand the parameters into names
2755      of files in the local directory.
2756
2757::
2758
2759  # echo 'hrtimer_*' > set_ftrace_filter
2760
2761Produces::
2762
2763  # tracer: function
2764  #
2765  # entries-in-buffer/entries-written: 897/897   #P:4
2766  #
2767  #                              _-----=> irqs-off
2768  #                             / _----=> need-resched
2769  #                            | / _---=> hardirq/softirq
2770  #                            || / _--=> preempt-depth
2771  #                            ||| /     delay
2772  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2773  #              | |       |   ||||       |         |
2774            <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2775            <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2776            <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2777            <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2778            <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2779            <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2780            <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2781            <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
2782
2783Notice that we lost the sys_nanosleep.
2784::
2785
2786  # cat set_ftrace_filter
2787  hrtimer_run_queues
2788  hrtimer_run_pending
2789  hrtimer_init
2790  hrtimer_cancel
2791  hrtimer_try_to_cancel
2792  hrtimer_forward
2793  hrtimer_start
2794  hrtimer_reprogram
2795  hrtimer_force_reprogram
2796  hrtimer_get_next_event
2797  hrtimer_interrupt
2798  hrtimer_nanosleep
2799  hrtimer_wakeup
2800  hrtimer_get_remaining
2801  hrtimer_get_res
2802  hrtimer_init_sleeper
2803
2804
2805This is because the '>' and '>>' act just like they do in bash.
To rewrite the filters, use '>'.
To append to the filters, use '>>'.
2808
2809To clear out a filter so that all functions will be recorded
2810again::
2811
2812 # echo > set_ftrace_filter
2813 # cat set_ftrace_filter
2814 #
2815
2816Again, now we want to append.
2817
2818::
2819
2820  # echo sys_nanosleep > set_ftrace_filter
2821  # cat set_ftrace_filter
2822  sys_nanosleep
2823  # echo 'hrtimer_*' >> set_ftrace_filter
2824  # cat set_ftrace_filter
2825  hrtimer_run_queues
2826  hrtimer_run_pending
2827  hrtimer_init
2828  hrtimer_cancel
2829  hrtimer_try_to_cancel
2830  hrtimer_forward
2831  hrtimer_start
2832  hrtimer_reprogram
2833  hrtimer_force_reprogram
2834  hrtimer_get_next_event
2835  hrtimer_interrupt
2836  sys_nanosleep
2837  hrtimer_nanosleep
2838  hrtimer_wakeup
2839  hrtimer_get_remaining
2840  hrtimer_get_res
2841  hrtimer_init_sleeper
2842
2843
The set_ftrace_notrace file prevents the functions listed in it from
being traced.
2846::
2847
2848  # echo '*preempt*' '*lock*' > set_ftrace_notrace
2849
2850Produces::
2851
2852  # tracer: function
2853  #
2854  # entries-in-buffer/entries-written: 39608/39608   #P:4
2855  #
2856  #                              _-----=> irqs-off
2857  #                             / _----=> need-resched
2858  #                            | / _---=> hardirq/softirq
2859  #                            || / _--=> preempt-depth
2860  #                            ||| /     delay
2861  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2862  #              | |       |   ||||       |         |
2863              bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
2864              bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
2865              bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
2866              bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
2867              bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
2868              bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
2869              bash-1994  [000] ....  4342.324899: do_truncate <-do_last
2870              bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
2871              bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
2872              bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
2873              bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
2874              bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
2875
2876We can see that there's no more lock or preempt tracing.
2877
2878Selecting function filters via index
2879------------------------------------
2880
Because processing of strings is expensive (the address of the function
needs to be looked up before comparing to the string being passed in),
an index can be used instead to enable functions. This is useful when
setting thousands of specific functions at a time. By passing in a list
of numbers, no string processing occurs. Instead, the function at the
given position in the internal array (which corresponds to the order of
functions in the "available_filter_functions" file) is selected.
2888
2889::
2890
2891  # echo 1 > set_ftrace_filter
2892
This selects the first function listed in "available_filter_functions".
2894
2895::
2896
2897  # head -1 available_filter_functions
2898  trace_initcall_finish_cb
2899
2900  # cat set_ftrace_filter
2901  trace_initcall_finish_cb
2902
2903  # head -50 available_filter_functions | tail -1
2904  x86_pmu_commit_txn
2905
2906  # echo 1 50 > set_ftrace_filter
2907  # cat set_ftrace_filter
2908  trace_initcall_finish_cb
2909  x86_pmu_commit_txn
2910
2911Dynamic ftrace with the function graph tracer
2912---------------------------------------------
2913
Although what has been explained above concerns both the function
tracer and the function-graph tracer, there are some special features
only available in the function-graph tracer.
2917
2918If you want to trace only one function and all of its children,
2919you just have to echo its name into set_graph_function::
2920
2921 echo __do_fault > set_graph_function
2922
2923will produce the following "expanded" trace of the __do_fault()
2924function::
2925
2926   0)               |  __do_fault() {
2927   0)               |    filemap_fault() {
2928   0)               |      find_lock_page() {
2929   0)   0.804 us    |        find_get_page();
2930   0)               |        __might_sleep() {
2931   0)   1.329 us    |        }
2932   0)   3.904 us    |      }
2933   0)   4.979 us    |    }
2934   0)   0.653 us    |    _spin_lock();
2935   0)   0.578 us    |    page_add_file_rmap();
2936   0)   0.525 us    |    native_set_pte_at();
2937   0)   0.585 us    |    _spin_unlock();
2938   0)               |    unlock_page() {
2939   0)   0.541 us    |      page_waitqueue();
2940   0)   0.639 us    |      __wake_up_bit();
2941   0)   2.786 us    |    }
2942   0) + 14.237 us   |  }
2943   0)               |  __do_fault() {
2944   0)               |    filemap_fault() {
2945   0)               |      find_lock_page() {
2946   0)   0.698 us    |        find_get_page();
2947   0)               |        __might_sleep() {
2948   0)   1.412 us    |        }
2949   0)   3.950 us    |      }
2950   0)   5.098 us    |    }
2951   0)   0.631 us    |    _spin_lock();
2952   0)   0.571 us    |    page_add_file_rmap();
2953   0)   0.526 us    |    native_set_pte_at();
2954   0)   0.586 us    |    _spin_unlock();
2955   0)               |    unlock_page() {
2956   0)   0.533 us    |      page_waitqueue();
2957   0)   0.638 us    |      __wake_up_bit();
2958   0)   2.793 us    |    }
2959   0) + 14.012 us   |  }
2960
2961You can also expand several functions at once::
2962
2963 echo sys_open > set_graph_function
2964 echo sys_close >> set_graph_function
2965
Now, if you want to go back to tracing all functions, you can clear
this special filter via::
2968
2969 echo > set_graph_function
2970
2971
2972ftrace_enabled
2973--------------
2974
Note, the proc sysctl ftrace_enabled is a big on/off switch for the
function tracer. By default it is enabled (when function tracing is
enabled in the kernel). If it is disabled, all function tracing is
disabled. This includes not only the function tracers for ftrace, but
also for any other uses (perf, kprobes, stack tracing, profiling, etc.).
2980
2981Please disable this with care.
2982
This can be disabled (and enabled) with::

  sysctl kernel.ftrace_enabled=0
  sysctl kernel.ftrace_enabled=1

or::

  echo 0 > /proc/sys/kernel/ftrace_enabled
  echo 1 > /proc/sys/kernel/ftrace_enabled
2992
2993
2994Filter commands
2995---------------
2996
2997A few commands are supported by the set_ftrace_filter interface.
2998Trace commands have the following format::
2999
3000  <function>:<command>:<parameter>
3001
3002The following commands are supported:
3003
3004- mod:
3005  This command enables function filtering per module. The
3006  parameter defines the module. For example, if only the write*
3007  functions in the ext3 module are desired, run:
  functions in the ext3 module are desired, run::
3009   echo 'write*:mod:ext3' > set_ftrace_filter
3010
3011  This command interacts with the filter in the same way as
3012  filtering based on function names. Thus, adding more functions
3013  in a different module is accomplished by appending (>>) to the
3014  filter file. Remove specific module functions by prepending
3015  '!'::
3016
3017   echo '!writeback*:mod:ext3' >> set_ftrace_filter
3018
  The mod command also supports module globbing. To disable tracing
  for all functions except those in a specific module::
3021
3022   echo '!*:mod:!ext3' >> set_ftrace_filter
3023
  To disable tracing for all modules, but still trace the kernel::
3025
3026   echo '!*:mod:*' >> set_ftrace_filter
3027
  To enable the filter only for built-in (non-module) kernel functions::
3029
3030   echo '*write*:mod:!*' >> set_ftrace_filter
3031
  To enable the filter for modules matching a glob::
3033
3034   echo '*write*:mod:*snd*' >> set_ftrace_filter
3035
3036- traceon/traceoff:
3037  These commands turn tracing on and off when the specified
3038  functions are hit. The parameter determines how many times the
3039  tracing system is turned on and off. If unspecified, there is
3040  no limit. For example, to disable tracing when a schedule bug
  no limit. For example, to disable tracing the first 5 times
  the __schedule_bug() function is hit, run::
3043   echo '__schedule_bug:traceoff:5' > set_ftrace_filter
3044
3045  To always disable tracing when __schedule_bug is hit::
3046
3047   echo '__schedule_bug:traceoff' > set_ftrace_filter
3048
3049  These commands are cumulative whether or not they are appended
  to set_ftrace_filter. To remove a command, prepend it with '!';
  if the command was registered with a counter, use 0 as the parameter::
3052
3053   echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
3054
  The above removes the traceoff command for __schedule_bug
  that has a counter. To remove commands without counters::
3057
3058   echo '!__schedule_bug:traceoff' > set_ftrace_filter
3059
3060- snapshot:
  This command causes a snapshot to be triggered when the function is hit.
3062  ::
3063
3064   echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
3065
  To snapshot only once:
3067  ::
3068
3069   echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
3070
3071  To remove the above commands::
3072
3073   echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
3074   echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
3075
3076- enable_event/disable_event:
3077  These commands can enable or disable a trace event. Note, because
3078  function tracing callbacks are very sensitive, when these commands
  are registered, the tracepoint is activated, but disabled in
  a "soft" mode. That is, the tracepoint will be called, but
  it just will not be traced. The event tracepoint stays in this mode
3082  as long as there's a command that triggers it.
3083  ::
3084
3085   echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
3086   	 set_ftrace_filter
3087
3088  The format is::
3089
3090    <function>:enable_event:<system>:<event>[:count]
3091    <function>:disable_event:<system>:<event>[:count]
3092
3093  To remove the events commands::
3094
3095   echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
3096   	 set_ftrace_filter
3097   echo '!schedule:disable_event:sched:sched_switch' > \
3098   	 set_ftrace_filter
3099
3100- dump:
3101  When the function is hit, it will dump the contents of the ftrace
3102  ring buffer to the console. This is useful if you need to debug
3103  something, and want to dump the trace when a certain function
3104  is hit. Perhaps it's a function that is called before a triple
3105  fault happens and does not allow you to get a regular dump.
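
  For example, to dump the ring buffer to the console the moment a
  suspect function executes (the function name below is reused from
  the snapshot example above; substitute the one you are debugging)::

   echo 'native_flush_tlb_others:dump' > set_ftrace_filter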
3106
3107- cpudump:
3108  When the function is hit, it will dump the contents of the ftrace
3109  ring buffer for the current CPU to the console. Unlike the "dump"
3110  command, it only prints out the contents of the ring buffer for the
3111  CPU that executed the function that triggered the dump.
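
  The syntax is the same as for the "dump" command (again, the
  function name is only an example)::

   echo 'native_flush_tlb_others:cpudump' > set_ftrace_filter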
3112
3113- stacktrace:
3114  When the function is hit, a stack trace is recorded.
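
  For example, to record a kernel stack trace every time
  try_to_wake_up() is called (reusing a function from the
  enable_event example above; any function listed in
  available_filter_functions can be used)::

   echo 'try_to_wake_up:stacktrace' > set_ftrace_filter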
3115
3116trace_pipe
3117----------
3118
3119The trace_pipe outputs the same content as the trace file, but
3120the effect on the tracing is different. Every read from
3121trace_pipe is consumed. This means that subsequent reads will be
3122different. The trace is live.
3123::
3124
3125  # echo function > current_tracer
3126  # cat trace_pipe > /tmp/trace.out &
3127  [1] 4153
3128  # echo 1 > tracing_on
3129  # usleep 1
3130  # echo 0 > tracing_on
3131  # cat trace
3132  # tracer: function
3133  #
3134  # entries-in-buffer/entries-written: 0/0   #P:4
3135  #
3136  #                              _-----=> irqs-off
3137  #                             / _----=> need-resched
3138  #                            | / _---=> hardirq/softirq
3139  #                            || / _--=> preempt-depth
3140  #                            ||| /     delay
3141  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3142  #              | |       |   ||||       |         |
3143
3144  #
3145  # cat /tmp/trace.out
3146             bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
3147             bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
3148             bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
3149             bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
3150             bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
3151             bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
3152             bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
3153             bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
3154             bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
3155
3156
3157Note, reading the trace_pipe file will block until more input is
3158added. This is contrary to the trace file. If any process opened
3159the trace file for reading, it will actually disable tracing and
3160prevent new entries from being added. The trace_pipe file does
3161not have this limitation.
3162
3163trace entries
3164-------------
3165
3166Having too much or not enough data can be troublesome in
3167diagnosing an issue in the kernel. The file buffer_size_kb is
used to modify the size of the internal trace buffers. The
number listed is the number of kilobytes each per-CPU
buffer can hold. To know the full size, multiply the number of
possible CPUs by this number.
3172::
3173
3174  # cat buffer_size_kb
3175  1408 (units kilobytes)
3176
3177Or simply read buffer_total_size_kb
3178::
3179
3180  # cat buffer_total_size_kb
3181  5632
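
With the 4 possible CPUs shown in the trace headers above (#P:4),
this is simply 4 * 1408 kB = 5632 kB.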
3182
To modify the buffer size, simply echo in a number (in kilobytes,
i.e. 1024-byte units).
3184::
3185
3186  # echo 10000 > buffer_size_kb
3187  # cat buffer_size_kb
3188  10000 (units kilobytes)
3189
It will try to allocate as much as possible. If you ask for too
much, it can trigger the Out-Of-Memory (OOM) killer.
3192::
3193
3194  # echo 1000000000000 > buffer_size_kb
3195  -bash: echo: write error: Cannot allocate memory
3196  # cat buffer_size_kb
3197  85
3198
3199The per_cpu buffers can be changed individually as well:
3200::
3201
3202  # echo 10000 > per_cpu/cpu0/buffer_size_kb
3203  # echo 100 > per_cpu/cpu1/buffer_size_kb
3204
When the per_cpu buffers are not the same, the buffer_size_kb
at the top level will just show an X.
3207::
3208
3209  # cat buffer_size_kb
3210  X
3211
3212This is where the buffer_total_size_kb is useful:
3213::
3214
3215  # cat buffer_total_size_kb
3216  12916
3217
3218Writing to the top level buffer_size_kb will reset all the buffers
3219to be the same again.
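
For example (the size written here is only illustrative)::

  # echo 1408 > buffer_size_kb
  # cat buffer_size_kb
  1408 (units kilobytes)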
3220
3221Snapshot
3222--------
CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
available to all non-latency tracers. (Latency tracers which
3225record max latency, such as "irqsoff" or "wakeup", can't use
3226this feature, since those are already using the snapshot
3227mechanism internally.)
3228
3229Snapshot preserves a current trace buffer at a particular point
3230in time without stopping tracing. Ftrace swaps the current
3231buffer with a spare buffer, and tracing continues in the new
3232current (=previous spare) buffer.
3233
3234The following tracefs files in "tracing" are related to this
3235feature:
3236
3237  snapshot:
3238
3239	This is used to take a snapshot and to read the output
3240	of the snapshot. Echo 1 into this file to allocate a
3241	spare buffer and to take a snapshot (swap), then read
3242	the snapshot from this file in the same format as
3243	"trace" (described above in the section "The File
3244	System"). Both reads snapshot and tracing are executable
	System"). Reading the snapshot and tracing can be done
	in parallel. When the spare buffer is allocated, echoing
	0 frees it, and echoing any other (positive) value clears
	the snapshot contents.
3249
3250	+--------------+------------+------------+------------+
3251	|status\\input |     0      |     1      |    else    |
3252	+==============+============+============+============+
3253	|not allocated |(do nothing)| alloc+swap |(do nothing)|
3254	+--------------+------------+------------+------------+
3255	|allocated     |    free    |    swap    |   clear    |
3256	+--------------+------------+------------+------------+
3257
3258Here is an example of using the snapshot feature.
3259::
3260
3261  # echo 1 > events/sched/enable
3262  # echo 1 > snapshot
3263  # cat snapshot
3264  # tracer: nop
3265  #
3266  # entries-in-buffer/entries-written: 71/71   #P:8
3267  #
3268  #                              _-----=> irqs-off
3269  #                             / _----=> need-resched
3270  #                            | / _---=> hardirq/softirq
3271  #                            || / _--=> preempt-depth
3272  #                            ||| /     delay
3273  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3274  #              | |       |   ||||       |         |
3275            <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120   prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3276             sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120   prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3277  [...]
3278          <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3279
3280  # cat trace
3281  # tracer: nop
3282  #
3283  # entries-in-buffer/entries-written: 77/77   #P:8
3284  #
3285  #                              _-----=> irqs-off
3286  #                             / _----=> need-resched
3287  #                            | / _---=> hardirq/softirq
3288  #                            || / _--=> preempt-depth
3289  #                            ||| /     delay
3290  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3291  #              | |       |   ||||       |         |
3292            <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3293   snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3294  [...]
3295
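As the table above shows, once the spare buffer has been allocated,
echoing any other positive value clears the snapshot contents, and
echoing 0 frees the spare buffer again::

  # echo 2 > snapshot
  # echo 0 > snapshot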
3296
If you try to use this snapshot feature when the current tracer is
3298one of the latency tracers, you will get the following results.
3299::
3300
3301  # echo wakeup > current_tracer
3302  # echo 1 > snapshot
3303  bash: echo: write error: Device or resource busy
3304  # cat snapshot
3305  cat: snapshot: Device or resource busy
3306
3307
3308Instances
3309---------
3310In the tracefs tracing directory is a directory called "instances".
New directories can be created inside it using mkdir, and removed
again with rmdir. A directory created with mkdir here will already
contain files and other directories as soon as it is created.
3315::
3316
3317  # mkdir instances/foo
3318  # ls instances/foo
3319  buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
3320  set_event  snapshot  trace  trace_clock  trace_marker  trace_options
3321  trace_pipe  tracing_on
3322
3323As you can see, the new directory looks similar to the tracing directory
3324itself. In fact, it is very similar, except that the buffer and
events are independent of the main directory, and of any other
instances that are created.
3327
3328The files in the new directory work just like the files with the
same name in the tracing directory, except that they operate on a
separate, new buffer. The files affect that buffer but do not
3331affect the main buffer with the exception of trace_options. Currently,
3332the trace_options affect all instances and the top level buffer
3333the same, but this may change in future releases. That is, options
3334may become specific to the instance they reside in.
3335
Notice that none of the function tracer files are there, nor are
3337current_tracer and available_tracers. This is because the buffers
3338can currently only have events enabled for them.
3339::
3340
3341  # mkdir instances/foo
3342  # mkdir instances/bar
3343  # mkdir instances/zoot
3344  # echo 100000 > buffer_size_kb
3345  # echo 1000 > instances/foo/buffer_size_kb
3346  # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
  # echo function > current_tracer
3348  # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3349  # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3350  # echo 1 > instances/foo/events/sched/sched_switch/enable
3351  # echo 1 > instances/bar/events/irq/enable
3352  # echo 1 > instances/zoot/events/syscalls/enable
3353  # cat trace_pipe
3354  CPU:2 [LOST 11745 EVENTS]
3355              bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3356              bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3357              bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3358              bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3359              bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3360              bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3361              bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3362              bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3363              bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3364              bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3365              bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3366  [...]
3367
3368  # cat instances/foo/trace_pipe
3369              bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3370              bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3371            <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3372            <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3373       rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3374              bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3375              bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3376              bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3377       kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3378       kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3379  [...]
3380
3381  # cat instances/bar/trace_pipe
3382       migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
3383            <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
3384              bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
3385              bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
3386              bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
3387              bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
3388              bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
3389              bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
3390              sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3391              sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
3392              sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
3393              sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
3394  [...]
3395
3396  # cat instances/zoot/trace
3397  # tracer: nop
3398  #
3399  # entries-in-buffer/entries-written: 18996/18996   #P:4
3400  #
3401  #                              _-----=> irqs-off
3402  #                             / _----=> need-resched
3403  #                            | / _---=> hardirq/softirq
3404  #                            || / _--=> preempt-depth
3405  #                            ||| /     delay
3406  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3407  #              | |       |   ||||       |         |
3408              bash-1998  [000] d...   140.733501: sys_write -> 0x2
3409              bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
3410              bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
3411              bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3412              bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
3413              bash-1998  [000] d...   140.733510: sys_close(fd: a)
3414              bash-1998  [000] d...   140.733510: sys_close -> 0x0
3415              bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3416              bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
3417              bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3418              bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
3419
You can see that the trace of the topmost trace buffer shows only
the function tracing. The foo instance displays wakeups and task
switches, the bar instance shows the interrupt and softirq events,
and the zoot instance shows the system calls.
3423
3424To remove the instances, simply delete their directories:
3425::
3426
3427  # rmdir instances/foo
3428  # rmdir instances/bar
3429  # rmdir instances/zoot
3430
3431Note, if a process has a trace file open in one of the instance
3432directories, the rmdir will fail with EBUSY.
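
For example, had another process still held instances/foo/trace open,
the rmdir above would have failed with something like the following
(the exact message depends on your rmdir implementation)::

  # rmdir instances/foo
  rmdir: failed to remove 'instances/foo': Device or resource busy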
3433
3434
3435Stack trace
3436-----------
Since the kernel has a fixed-size stack, it is important not to
waste it in functions. A kernel developer must be conscious of
what they allocate on the stack. If they add too much, the system
3440can be in danger of a stack overflow, and corruption will occur,
3441usually leading to a system panic.
3442
There are some tools that check this, usually by periodically
checking usage from an interrupt. But being able to perform a check
at every function call is far more thorough. Since ftrace provides
a function tracer, it is convenient to check the stack size
at every function call. This is enabled via the stack tracer.
3448
3449CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3450To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3451::
3452
3453 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3454
You can also enable it at boot time, in order to trace the stack
size of the kernel during boot up, by adding "stacktrace" to the
kernel command line.
3458
3459After running it for a few minutes, the output looks like:
3460::
3461
3462  # cat stack_max_size
3463  2928
3464
3465  # cat stack_trace
3466          Depth    Size   Location    (18 entries)
3467          -----    ----   --------
3468    0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
3469    1)     2704     160   find_busiest_group+0x31/0x1f1
3470    2)     2544     256   load_balance+0xd9/0x662
3471    3)     2288      80   idle_balance+0xbb/0x130
3472    4)     2208     128   __schedule+0x26e/0x5b9
3473    5)     2080      16   schedule+0x64/0x66
3474    6)     2064     128   schedule_timeout+0x34/0xe0
3475    7)     1936     112   wait_for_common+0x97/0xf1
3476    8)     1824      16   wait_for_completion+0x1d/0x1f
3477    9)     1808     128   flush_work+0xfe/0x119
3478   10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
3479   11)     1664      48   input_available_p+0x1d/0x5c
3480   12)     1616      48   n_tty_poll+0x6d/0x134
3481   13)     1568      64   tty_poll+0x64/0x7f
3482   14)     1504     880   do_select+0x31e/0x511
3483   15)      624     400   core_sys_select+0x177/0x216
3484   16)      224      96   sys_select+0x91/0xb9
3485   17)      128     128   system_call_fastpath+0x16/0x1b
3486
3487Note, if -mfentry is being used by gcc, functions get traced before
they set up the stack frame. This means that leaf-level functions
3489are not tested by the stack tracer when -mfentry is used.
3490
3491Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3492
3493More
3494----
3495More details can be found in the source code, in the `kernel/trace/*.c` files.
3496