xref: /linux/Documentation/trace/ftrace.rst (revision 3e0bc2855b573bcffa2a52955a878f537f5ac0cd)
1========================
2ftrace - Function Tracer
3========================
4
5Copyright 2008 Red Hat Inc.
6
7:Author:   Steven Rostedt <srostedt@redhat.com>
8:License:  The GNU Free Documentation License, Version 1.2
9          (dual licensed under the GPL v2)
10:Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
11		      John Kacur, and David Teigland.
12
13- Written for: 2.6.28-rc2
14- Updated for: 3.10
15- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
16- Converted to rst format - Changbin Du <changbin.du@intel.com>
17
18Introduction
19------------
20
21Ftrace is an internal tracer designed to help out developers and
22designers of systems to find what is going on inside the kernel.
23It can be used for debugging or analyzing latencies and
24performance issues that take place outside of user-space.
25
26Although ftrace is typically considered the function tracer, it
27is really a framework of several assorted tracing utilities.
28There's latency tracing to examine what occurs between interrupts
29disabled and enabled, as well as for preemption and from a time
30a task is woken to the task is actually scheduled in.
31
32One of the most common uses of ftrace is the event tracing.
33Throughout the kernel is hundreds of static event points that
34can be enabled via the tracefs file system to see what is
35going on in certain parts of the kernel.
36
37See events.rst for more information.
38
39
40Implementation Details
41----------------------
42
43See Documentation/trace/ftrace-design.rst for details for arch porters and such.
44
45
46The File System
47---------------
48
49Ftrace uses the tracefs file system to hold the control files as
50well as the files to display output.
51
52When tracefs is configured into the kernel (which selecting any ftrace
53option will do) the directory /sys/kernel/tracing will be created. To mount
54this directory, you can add to your /etc/fstab file::
55
56 tracefs       /sys/kernel/tracing       tracefs defaults        0       0
57
58Or you can mount it at run time with::
59
60 mount -t tracefs nodev /sys/kernel/tracing
61
62For quicker access to that directory you may want to make a soft link to
63it::
64
65 ln -s /sys/kernel/tracing /tracing
66
67.. attention::
68
69  Before 4.1, all ftrace tracing control files were within the debugfs
70  file system, which is typically located at /sys/kernel/debug/tracing.
71  For backward compatibility, when mounting the debugfs file system,
72  the tracefs file system will be automatically mounted at:
73
74  /sys/kernel/debug/tracing
75
76  All files located in the tracefs file system will be located in that
77  debugfs file system directory as well.
78
79.. attention::
80
81  Any selected ftrace option will also create the tracefs file system.
82  The rest of the document will assume that you are in the ftrace directory
83  (cd /sys/kernel/tracing) and will only concentrate on the files within that
84  directory and not distract from the content with the extended
85  "/sys/kernel/tracing" path name.
86
87That's it! (assuming that you have ftrace configured into your kernel)
88
89After mounting tracefs you will have access to the control and output files
90of ftrace. Here is a list of some of the key files:
91
92
93 Note: all time values are in microseconds.
94
95  current_tracer:
96
97	This is used to set or display the current tracer
98	that is configured. Changing the current tracer clears
99	the ring buffer content as well as the "snapshot" buffer.
100
101  available_tracers:
102
103	This holds the different types of tracers that
104	have been compiled into the kernel. The
105	tracers listed here can be configured by
106	echoing their name into current_tracer.
107
108  tracing_on:
109
110	This sets or displays whether writing to the trace
111	ring buffer is enabled. Echo 0 into this file to disable
112	the tracer or 1 to enable it. Note, this only disables
113	writing to the ring buffer, the tracing overhead may
114	still be occurring.
115
116	The kernel function tracing_off() can be used within the
117	kernel to disable writing to the ring buffer, which will
118	set this file to "0". User space can re-enable tracing by
119	echoing "1" into the file.
120
121	Note, the function and event trigger "traceoff" will also
122	set this file to zero and stop tracing. Which can also
123	be re-enabled by user space using this file.
124
125  trace:
126
127	This file holds the output of the trace in a human
128	readable format (described below). Opening this file for
129	writing with the O_TRUNC flag clears the ring buffer content.
130        Note, this file is not a consumer. If tracing is off
131        (no tracer running, or tracing_on is zero), it will produce
132        the same output each time it is read. When tracing is on,
133        it may produce inconsistent results as it tries to read
134        the entire buffer without consuming it.
135
136  trace_pipe:
137
138	The output is the same as the "trace" file but this
139	file is meant to be streamed with live tracing.
140	Reads from this file will block until new data is
141	retrieved.  Unlike the "trace" file, this file is a
142	consumer. This means reading from this file causes
143	sequential reads to display more current data. Once
144	data is read from this file, it is consumed, and
145	will not be read again with a sequential read. The
146	"trace" file is static, and if the tracer is not
147	adding more data, it will display the same
148	information every time it is read.
149
150  trace_options:
151
152	This file lets the user control the amount of data
153	that is displayed in one of the above output
154	files. Options also exist to modify how a tracer
155	or events work (stack traces, timestamps, etc).
156
157  options:
158
159	This is a directory that has a file for every available
160	trace option (also in trace_options). Options may also be set
161	or cleared by writing a "1" or "0" respectively into the
162	corresponding file with the option name.
163
164  tracing_max_latency:
165
166	Some of the tracers record the max latency.
167	For example, the maximum time that interrupts are disabled.
168	The maximum time is saved in this file. The max trace will also be
169	stored,	and displayed by "trace". A new max trace will only be
170	recorded if the latency is greater than the value in this file
171	(in microseconds).
172
173	By echoing in a time into this file, no latency will be recorded
174	unless it is greater than the time in this file.
175
176  tracing_thresh:
177
178	Some latency tracers will record a trace whenever the
179	latency is greater than the number in this file.
180	Only active when the file contains a number greater than 0.
181	(in microseconds)
182
183  buffer_percent:
184
185	This is the watermark for how much the ring buffer needs to be filled
186	before a waiter is woken up. That is, if an application calls a
187	blocking read syscall on one of the per_cpu trace_pipe_raw files, it
188	will block until the given amount of data specified by buffer_percent
189	is in the ring buffer before it wakes the reader up. This also
190	controls how the splice system calls are blocked on this file::
191
192	  0   - means to wake up as soon as there is any data in the ring buffer.
193	  50  - means to wake up when roughly half of the ring buffer sub-buffers
194	        are full.
195	  100 - means to block until the ring buffer is totally full and is
196	        about to start overwriting the older data.
197
198  buffer_size_kb:
199
200	This sets or displays the number of kilobytes each CPU
201	buffer holds. By default, the trace buffers are the same size
202	for each CPU. The displayed number is the size of the
203	CPU buffer and not total size of all buffers. The
204	trace buffers are allocated in pages (blocks of memory
205	that the kernel uses for allocation, usually 4 KB in size).
206	A few extra pages may be allocated to accommodate buffer management
207	meta-data. If the last page allocated has room for more bytes
208	than requested, the rest of the page will be used,
209	making the actual allocation bigger than requested or shown.
210	( Note, the size may not be a multiple of the page size
211	due to buffer management meta-data. )
212
213	Buffer sizes for individual CPUs may vary
214	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
215	this file will show "X".
216
217  buffer_total_size_kb:
218
219	This displays the total combined size of all the trace buffers.
220
221  free_buffer:
222
223	If a process is performing tracing, and the ring buffer	should be
224	shrunk "freed" when the process is finished, even if it were to be
225	killed by a signal, this file can be used for that purpose. On close
226	of this file, the ring buffer will be resized to its minimum size.
227	Having a process that is tracing also open this file, when the process
228	exits its file descriptor for this file will be closed, and in doing so,
229	the ring buffer will be "freed".
230
231	It may also stop tracing if disable_on_free option is set.
232
233  tracing_cpumask:
234
235	This is a mask that lets the user only trace on specified CPUs.
236	The format is a hex string representing the CPUs.
237
238  set_ftrace_filter:
239
240	When dynamic ftrace is configured in (see the
241	section below "dynamic ftrace"), the code is dynamically
242	modified (code text rewrite) to disable calling of the
243	function profiler (mcount). This lets tracing be configured
244	in with practically no overhead in performance.  This also
245	has a side effect of enabling or disabling specific functions
246	to be traced. Echoing names of functions into this file
247	will limit the trace to only those functions.
248	This influences the tracers "function" and "function_graph"
249	and thus also function profiling (see "function_profile_enabled").
250
251	The functions listed in "available_filter_functions" are what
252	can be written into this file.
253
254	This interface also allows for commands to be used. See the
255	"Filter commands" section for more details.
256
257	As a speed up, since processing strings can be quite expensive
258	and requires a check of all functions registered to tracing, instead
259	an index can be written into this file. A number (starting with "1")
260	written will instead select the same corresponding at the line position
261	of the "available_filter_functions" file.
262
263  set_ftrace_notrace:
264
265	This has an effect opposite to that of
266	set_ftrace_filter. Any function that is added here will not
267	be traced. If a function exists in both set_ftrace_filter
268	and set_ftrace_notrace,	the function will _not_ be traced.
269
270  set_ftrace_pid:
271
272	Have the function tracer only trace the threads whose PID are
273	listed in this file.
274
275	If the "function-fork" option is set, then when a task whose
276	PID is listed in this file forks, the child's PID will
277	automatically be added to this file, and the child will be
278	traced by the function tracer as well. This option will also
279	cause PIDs of tasks that exit to be removed from the file.
280
281  set_ftrace_notrace_pid:
282
283        Have the function tracer ignore threads whose PID are listed in
284        this file.
285
286        If the "function-fork" option is set, then when a task whose
287	PID is listed in this file forks, the child's PID will
288	automatically be added to this file, and the child will not be
289	traced by the function tracer as well. This option will also
290	cause PIDs of tasks that exit to be removed from the file.
291
292        If a PID is in both this file and "set_ftrace_pid", then this
293        file takes precedence, and the thread will not be traced.
294
295  set_event_pid:
296
297	Have the events only trace a task with a PID listed in this file.
298	Note, sched_switch and sched_wake_up will also trace events
299	listed in this file.
300
301	To have the PIDs of children of tasks with their PID in this file
302	added on fork, enable the "event-fork" option. That option will also
303	cause the PIDs of tasks to be removed from this file when the task
304	exits.
305
306  set_event_notrace_pid:
307
308	Have the events not trace a task with a PID listed in this file.
309	Note, sched_switch and sched_wakeup will trace threads not listed
310	in this file, even if a thread's PID is in the file if the
311        sched_switch or sched_wakeup events also trace a thread that should
312        be traced.
313
314	To have the PIDs of children of tasks with their PID in this file
315	added on fork, enable the "event-fork" option. That option will also
316	cause the PIDs of tasks to be removed from this file when the task
317	exits.
318
319  set_graph_function:
320
321	Functions listed in this file will cause the function graph
322	tracer to only trace these functions and the functions that
323	they call. (See the section "dynamic ftrace" for more details).
324	Note, set_ftrace_filter and set_ftrace_notrace still affects
325	what functions are being traced.
326
327  set_graph_notrace:
328
329	Similar to set_graph_function, but will disable function graph
330	tracing when the function is hit until it exits the function.
331	This makes it possible to ignore tracing functions that are called
332	by a specific function.
333
334  available_filter_functions:
335
336	This lists the functions that ftrace has processed and can trace.
337	These are the function names that you can pass to
338	"set_ftrace_filter", "set_ftrace_notrace",
339	"set_graph_function", or "set_graph_notrace".
340	(See the section "dynamic ftrace" below for more details.)
341
342  available_filter_functions_addrs:
343
344	Similar to available_filter_functions, but with address displayed
345	for each function. The displayed address is the patch-site address
346	and can differ from /proc/kallsyms address.
347
348  dyn_ftrace_total_info:
349
350	This file is for debugging purposes. The number of functions that
351	have been converted to nops and are available to be traced.
352
353  enabled_functions:
354
355	This file is more for debugging ftrace, but can also be useful
356	in seeing if any function has a callback attached to it.
357	Not only does the trace infrastructure use ftrace function
358	trace utility, but other subsystems might too. This file
359	displays all functions that have a callback attached to them
360	as well as the number of callbacks that have been attached.
361	Note, a callback may also call multiple functions which will
362	not be listed in this count.
363
364	If the callback registered to be traced by a function with
365	the "save regs" attribute (thus even more overhead), a 'R'
366	will be displayed on the same line as the function that
367	is returning registers.
368
369	If the callback registered to be traced by a function with
370	the "ip modify" attribute (thus the regs->ip can be changed),
371	an 'I' will be displayed on the same line as the function that
372	can be overridden.
373
374	If a non ftrace trampoline is attached (BPF) a 'D' will be displayed.
375	Note, normal ftrace trampolines can also be attached, but only one
376	"direct" trampoline can be attached to a given function at a time.
377
378	Some architectures can not call direct trampolines, but instead have
379	the ftrace ops function located above the function entry point. In
380	such cases an 'O' will be displayed.
381
382	If a function had either the "ip modify" or a "direct" call attached to
383	it in the past, a 'M' will be shown. This flag is never cleared. It is
384	used to know if a function was every modified by the ftrace infrastructure,
385	and can be used for debugging.
386
387	If the architecture supports it, it will also show what callback
388	is being directly called by the function. If the count is greater
389	than 1 it most likely will be ftrace_ops_list_func().
390
391	If the callback of a function jumps to a trampoline that is
392	specific to the callback and which is not the standard trampoline,
393	its address will be printed as well as the function that the
394	trampoline calls.
395
396  touched_functions:
397
398	This file contains all the functions that ever had a function callback
399	to it via the ftrace infrastructure. It has the same format as
400	enabled_functions but shows all functions that have every been
401	traced.
402
403	To see any function that has every been modified by "ip modify" or a
404	direct trampoline, one can perform the following command:
405
406	grep ' M ' /sys/kernel/tracing/touched_functions
407
408  function_profile_enabled:
409
410	When set it will enable all functions with either the function
411	tracer, or if configured, the function graph tracer. It will
412	keep a histogram of the number of functions that were called
413	and if the function graph tracer was configured, it will also keep
414	track of the time spent in those functions. The histogram
415	content can be displayed in the files:
416
417	trace_stat/function<cpu> ( function0, function1, etc).
418
419  trace_stat:
420
421	A directory that holds different tracing stats.
422
423  kprobe_events:
424
425	Enable dynamic trace points. See kprobetrace.rst.
426
427  kprobe_profile:
428
429	Dynamic trace points stats. See kprobetrace.rst.
430
431  max_graph_depth:
432
433	Used with the function graph tracer. This is the max depth
434	it will trace into a function. Setting this to a value of
435	one will show only the first kernel function that is called
436	from user space.
437
438  printk_formats:
439
440	This is for tools that read the raw format files. If an event in
441	the ring buffer references a string, only a pointer to the string
442	is recorded into the buffer and not the string itself. This prevents
443	tools from knowing what that string was. This file displays the string
444	and address for	the string allowing tools to map the pointers to what
445	the strings were.
446
447  saved_cmdlines:
448
449	Only the pid of the task is recorded in a trace event unless
450	the event specifically saves the task comm as well. Ftrace
451	makes a cache of pid mappings to comms to try to display
452	comms for events. If a pid for a comm is not listed, then
453	"<...>" is displayed in the output.
454
455	If the option "record-cmd" is set to "0", then comms of tasks
456	will not be saved during recording. By default, it is enabled.
457
458  saved_cmdlines_size:
459
460	By default, 128 comms are saved (see "saved_cmdlines" above). To
461	increase or decrease the amount of comms that are cached, echo
462	the number of comms to cache into this file.
463
464  saved_tgids:
465
466	If the option "record-tgid" is set, on each scheduling context switch
467	the Task Group ID of a task is saved in a table mapping the PID of
468	the thread to its TGID. By default, the "record-tgid" option is
469	disabled.
470
471  snapshot:
472
473	This displays the "snapshot" buffer and also lets the user
474	take a snapshot of the current running trace.
475	See the "Snapshot" section below for more details.
476
477  stack_max_size:
478
479	When the stack tracer is activated, this will display the
480	maximum stack size it has encountered.
481	See the "Stack Trace" section below.
482
483  stack_trace:
484
485	This displays the stack back trace of the largest stack
486	that was encountered when the stack tracer is activated.
487	See the "Stack Trace" section below.
488
489  stack_trace_filter:
490
491	This is similar to "set_ftrace_filter" but it limits what
492	functions the stack tracer will check.
493
494  trace_clock:
495
496	Whenever an event is recorded into the ring buffer, a
497	"timestamp" is added. This stamp comes from a specified
498	clock. By default, ftrace uses the "local" clock. This
499	clock is very fast and strictly per cpu, but on some
500	systems it may not be monotonic with respect to other
501	CPUs. In other words, the local clocks may not be in sync
502	with local clocks on other CPUs.
503
504	Usual clocks for tracing::
505
506	  # cat trace_clock
507	  [local] global counter x86-tsc
508
509	The clock with the square brackets around it is the one in effect.
510
511	local:
512		Default clock, but may not be in sync across CPUs
513
514	global:
515		This clock is in sync with all CPUs but may
516		be a bit slower than the local clock.
517
518	counter:
519		This is not a clock at all, but literally an atomic
520		counter. It counts up one by one, but is in sync
521		with all CPUs. This is useful when you need to
522		know exactly the order events occurred with respect to
523		each other on different CPUs.
524
525	uptime:
526		This uses the jiffies counter and the time stamp
527		is relative to the time since boot up.
528
529	perf:
530		This makes ftrace use the same clock that perf uses.
531		Eventually perf will be able to read ftrace buffers
532		and this will help out in interleaving the data.
533
534	x86-tsc:
535		Architectures may define their own clocks. For
536		example, x86 uses its own TSC cycle clock here.
537
538	ppc-tb:
539		This uses the powerpc timebase register value.
540		This is in sync across CPUs and can also be used
541		to correlate events across hypervisor/guest if
542		tb_offset is known.
543
544	mono:
545		This uses the fast monotonic clock (CLOCK_MONOTONIC)
546		which is monotonic and is subject to NTP rate adjustments.
547
548	mono_raw:
549		This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
550		which is monotonic but is not subject to any rate adjustments
551		and ticks at the same rate as the hardware clocksource.
552
553	boot:
554		This is the boot clock (CLOCK_BOOTTIME) and is based on the
555		fast monotonic clock, but also accounts for time spent in
556		suspend. Since the clock access is designed for use in
557		tracing in the suspend path, some side effects are possible
558		if clock is accessed after the suspend time is accounted before
559		the fast mono clock is updated. In this case, the clock update
560		appears to happen slightly sooner than it normally would have.
561		Also on 32-bit systems, it's possible that the 64-bit boot offset
562		sees a partial update. These effects are rare and post
563		processing should be able to handle them. See comments in the
564		ktime_get_boot_fast_ns() function for more information.
565
566	tai:
567		This is the tai clock (CLOCK_TAI) and is derived from the wall-
568		clock time. However, this clock does not experience
569		discontinuities and backwards jumps caused by NTP inserting leap
570		seconds. Since the clock access is designed for use in tracing,
571		side effects are possible. The clock access may yield wrong
572		readouts in case the internal TAI offset is updated e.g., caused
573		by setting the system time or using adjtimex() with an offset.
574		These effects are rare and post processing should be able to
575		handle them. See comments in the ktime_get_tai_fast_ns()
576		function for more information.
577
578	To set a clock, simply echo the clock name into this file::
579
580	  # echo global > trace_clock
581
582	Setting a clock clears the ring buffer content as well as the
583	"snapshot" buffer.
584
585  trace_marker:
586
587	This is a very useful file for synchronizing user space
588	with events happening in the kernel. Writing strings into
589	this file will be written into the ftrace buffer.
590
591	It is useful in applications to open this file at the start
592	of the application and just reference the file descriptor
593	for the file::
594
595		void trace_write(const char *fmt, ...)
596		{
597			va_list ap;
598			char buf[256];
599			int n;
600
601			if (trace_fd < 0)
602				return;
603
604			va_start(ap, fmt);
605			n = vsnprintf(buf, 256, fmt, ap);
606			va_end(ap);
607
608			write(trace_fd, buf, n);
609		}
610
611	start::
612
613		trace_fd = open("trace_marker", O_WRONLY);
614
615	Note: Writing into the trace_marker file can also initiate triggers
616	      that are written into /sys/kernel/tracing/events/ftrace/print/trigger
617	      See "Event triggers" in Documentation/trace/events.rst and an
618              example in Documentation/trace/histogram.rst (Section 3.)
619
620  trace_marker_raw:
621
622	This is similar to trace_marker above, but is meant for binary data
623	to be written to it, where a tool can be used to parse the data
624	from trace_pipe_raw.
625
626  uprobe_events:
627
628	Add dynamic tracepoints in programs.
629	See uprobetracer.rst
630
631  uprobe_profile:
632
633	Uprobe statistics. See uprobetrace.txt
634
635  instances:
636
637	This is a way to make multiple trace buffers where different
638	events can be recorded in different buffers.
639	See "Instances" section below.
640
641  events:
642
643	This is the trace event directory. It holds event tracepoints
644	(also known as static tracepoints) that have been compiled
645	into the kernel. It shows what event tracepoints exist
646	and how they are grouped by system. There are "enable"
647	files at various levels that can enable the tracepoints
648	when a "1" is written to them.
649
650	See events.rst for more information.
651
652  set_event:
653
654	By echoing in the event into this file, will enable that event.
655
656	See events.rst for more information.
657
658  available_events:
659
660	A list of events that can be enabled in tracing.
661
662	See events.rst for more information.
663
664  timestamp_mode:
665
666	Certain tracers may change the timestamp mode used when
667	logging trace events into the event buffer.  Events with
668	different modes can coexist within a buffer but the mode in
669	effect when an event is logged determines which timestamp mode
670	is used for that event.  The default timestamp mode is
671	'delta'.
672
673	Usual timestamp modes for tracing:
674
675	  # cat timestamp_mode
676	  [delta] absolute
677
678	  The timestamp mode with the square brackets around it is the
679	  one in effect.
680
681	  delta: Default timestamp mode - timestamp is a delta against
682	         a per-buffer timestamp.
683
684	  absolute: The timestamp is a full timestamp, not a delta
685                 against some other value.  As such it takes up more
686                 space and is less efficient.
687
688  hwlat_detector:
689
690	Directory for the Hardware Latency Detector.
691	See "Hardware Latency Detector" section below.
692
693  per_cpu:
694
695	This is a directory that contains the trace per_cpu information.
696
697  per_cpu/cpu0/buffer_size_kb:
698
699	The ftrace buffer is defined per_cpu. That is, there's a separate
700	buffer for each CPU to allow writes to be done atomically,
701	and free from cache bouncing. These buffers may have different
702	size buffers. This file is similar to the buffer_size_kb
703	file, but it only displays or sets the buffer size for the
704	specific CPU. (here cpu0).
705
706  per_cpu/cpu0/trace:
707
708	This is similar to the "trace" file, but it will only display
709	the data specific for the CPU. If written to, it only clears
710	the specific CPU buffer.
711
712  per_cpu/cpu0/trace_pipe
713
714	This is similar to the "trace_pipe" file, and is a consuming
715	read, but it will only display (and consume) the data specific
716	for the CPU.
717
718  per_cpu/cpu0/trace_pipe_raw
719
720	For tools that can parse the ftrace ring buffer binary format,
721	the trace_pipe_raw file can be used to extract the data
722	from the ring buffer directly. With the use of the splice()
723	system call, the buffer data can be quickly transferred to
724	a file or to the network where a server is collecting the
725	data.
726
727	Like trace_pipe, this is a consuming reader, where multiple
728	reads will always produce different data.
729
730  per_cpu/cpu0/snapshot:
731
732	This is similar to the main "snapshot" file, but will only
733	snapshot the current CPU (if supported). It only displays
734	the content of the snapshot for a given CPU, and if
735	written to, only clears this CPU buffer.
736
737  per_cpu/cpu0/snapshot_raw:
738
739	Similar to the trace_pipe_raw, but will read the binary format
740	from the snapshot buffer for the given CPU.
741
742  per_cpu/cpu0/stats:
743
744	This displays certain stats about the ring buffer:
745
746	entries:
747		The number of events that are still in the buffer.
748
749	overrun:
750		The number of lost events due to overwriting when
751		the buffer was full.
752
753	commit overrun:
754		Should always be zero.
755		This gets set if so many events happened within a nested
756		event (ring buffer is re-entrant), that it fills the
757		buffer and starts dropping events.
758
759	bytes:
760		Bytes actually read (not overwritten).
761
762	oldest event ts:
763		The oldest timestamp in the buffer
764
765	now ts:
766		The current timestamp
767
768	dropped events:
769		Events lost due to overwrite option being off.
770
771	read events:
772		The number of events read.
773
774The Tracers
775-----------
776
777Here is the list of current tracers that may be configured.
778
779  "function"
780
781	Function call tracer to trace all kernel functions.
782
783  "function_graph"
784
785	Similar to the function tracer except that the
786	function tracer probes the functions on their entry
787	whereas the function graph tracer traces on both entry
788	and exit of the functions. It then provides the ability
789	to draw a graph of function calls similar to C code
790	source.
791
792  "blk"
793
794	The block tracer. The tracer used by the blktrace user
795	application.
796
797  "hwlat"
798
799	The Hardware Latency tracer is used to detect if the hardware
800	produces any latency. See "Hardware Latency Detector" section
801	below.
802
803  "irqsoff"
804
805	Traces the areas that disable interrupts and saves
806	the trace with the longest max latency.
807	See tracing_max_latency. When a new max is recorded,
808	it replaces the old trace. It is best to view this
809	trace with the latency-format option enabled, which
810	happens automatically when the tracer is selected.
811
812  "preemptoff"
813
814	Similar to irqsoff but traces and records the amount of
815	time for which preemption is disabled.
816
817  "preemptirqsoff"
818
819	Similar to irqsoff and preemptoff, but traces and
820	records the largest time for which irqs and/or preemption
821	is disabled.
822
823  "wakeup"
824
825	Traces and records the max latency that it takes for
826	the highest priority task to get scheduled after
827	it has been woken up.
828        Traces all tasks as an average developer would expect.
829
830  "wakeup_rt"
831
832        Traces and records the max latency that it takes for just
833        RT tasks (as the current "wakeup" does). This is useful
834        for those interested in wake up timings of RT tasks.
835
836  "wakeup_dl"
837
838	Traces and records the max latency that it takes for
839	a SCHED_DEADLINE task to be woken (as the "wakeup" and
840	"wakeup_rt" does).
841
842  "mmiotrace"
843
844	A special tracer that is used to trace binary module.
845	It will trace all the calls that a module makes to the
846	hardware. Everything it writes and reads from the I/O
847	as well.
848
849  "branch"
850
851	This tracer can be configured when tracing likely/unlikely
852	calls within the kernel. It will trace when a likely and
853	unlikely branch is hit and if it was correct in its prediction
854	of being correct.
855
856  "nop"
857
858	This is the "trace nothing" tracer. To remove all
859	tracers from tracing simply echo "nop" into
860	current_tracer.
861
862Error conditions
863----------------
864
865  For most ftrace commands, failure modes are obvious and communicated
866  using standard return codes.
867
868  For other more involved commands, extended error information may be
869  available via the tracing/error_log file.  For the commands that
870  support it, reading the tracing/error_log file after an error will
871  display more detailed information about what went wrong, if
872  information is available.  The tracing/error_log file is a circular
873  error log displaying a small number (currently, 8) of ftrace errors
874  for the last (8) failed commands.
875
876  The extended error information and usage takes the form shown in
877  this example::
878
879    # echo xxx > /sys/kernel/tracing/events/sched/sched_wakeup/trigger
880    echo: write error: Invalid argument
881
882    # cat /sys/kernel/tracing/error_log
883    [ 5348.887237] location: error: Couldn't yyy: zzz
884      Command: xxx
885               ^
886    [ 7517.023364] location: error: Bad rrr: sss
887      Command: ppp qqq
888                   ^
889
890  To clear the error log, echo the empty string into it::
891
892    # echo > /sys/kernel/tracing/error_log
893
894Examples of using the tracer
895----------------------------
896
897Here are typical examples of using the tracers when controlling
898them only with the tracefs interface (without using any
899user-land utilities).
900
901Output format:
902--------------
903
904Here is an example of the output format of the file "trace"::
905
906  # tracer: function
907  #
908  # entries-in-buffer/entries-written: 140080/250280   #P:4
909  #
910  #                              _-----=> irqs-off
911  #                             / _----=> need-resched
912  #                            | / _---=> hardirq/softirq
913  #                            || / _--=> preempt-depth
914  #                            ||| /     delay
915  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
916  #              | |       |   ||||       |         |
917              bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
918              bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
919              bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
920              sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
921              bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
922              bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
923              bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
924              bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
925              bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
926              sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
927              ....
928
929A header is printed with the tracer name that is represented by
930the trace. In this case the tracer is "function". Then it shows the
931number of events in the buffer as well as the total number of entries
932that were written. The difference is the number of entries that were
933lost due to the buffer filling up (250280 - 140080 = 110200 events
934lost).
935
936The header explains the content of the events. Task name "bash", the task
937PID "1977", the CPU that it was running on "000", the latency format
938(explained below), the timestamp in <secs>.<usecs> format, the
939function name that was traced "sys_close" and the parent function that
940called this function "system_call_fastpath". The timestamp is the time
941at which the function was entered.
942
943Latency trace format
944--------------------
945
946When the latency-format option is enabled or when one of the latency
947tracers is set, the trace file gives somewhat more information to see
948why a latency happened. Here is a typical trace::
949
950  # tracer: irqsoff
951  #
952  # irqsoff latency trace v1.1.5 on 3.8.0-test+
953  # --------------------------------------------------------------------
954  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
955  #    -----------------
956  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
957  #    -----------------
958  #  => started at: __lock_task_sighand
959  #  => ended at:   _raw_spin_unlock_irqrestore
960  #
961  #
962  #                  _------=> CPU#
963  #                 / _-----=> irqs-off
964  #                | / _----=> need-resched
965  #                || / _---=> hardirq/softirq
966  #                ||| / _--=> preempt-depth
967  #                |||| /     delay
968  #  cmd     pid   ||||| time  |   caller
969  #     \   /      |||||  \    |   /
970        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
971        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
972        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
973        ps-6143    2d..1  306us : <stack trace>
974   => trace_hardirqs_on_caller
975   => trace_hardirqs_on
976   => _raw_spin_unlock_irqrestore
977   => do_task_stat
978   => proc_tgid_stat
979   => proc_single_show
980   => seq_read
981   => vfs_read
982   => sys_read
983   => system_call_fastpath
984
985
986This shows that the current tracer is "irqsoff" tracing the time
987for which interrupts were disabled. It gives the trace version (which
988never changes) and the version of the kernel upon which this was executed on
989(3.8). Then it displays the max latency in microseconds (259 us). The number
990of trace entries displayed and the total number (both are four: #4/4).
991VP, KP, SP, and HP are always zero and are reserved for later use.
992#P is the number of online CPUs (#P:4).
993
994The task is the process that was running when the latency
995occurred. (ps pid: 6143).
996
997The start and stop (the functions in which the interrupts were
998disabled and enabled respectively) that caused the latencies:
999
1000  - __lock_task_sighand is where the interrupts were disabled.
1001  - _raw_spin_unlock_irqrestore is where they were enabled again.
1002
1003The next lines after the header are the trace itself. The header
1004explains which is which.
1005
1006  cmd: The name of the process in the trace.
1007
1008  pid: The PID of that process.
1009
1010  CPU#: The CPU which the process was running on.
1011
1012  irqs-off: 'd' interrupts are disabled. '.' otherwise.
1013	.. caution:: If the architecture does not support a way to
1014		read the irq flags variable, an 'X' will always
1015		be printed here.
1016
1017  need-resched:
1018	- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
1019	- 'n' only TIF_NEED_RESCHED is set,
1020	- 'p' only PREEMPT_NEED_RESCHED is set,
1021	- '.' otherwise.
1022
1023  hardirq/softirq:
1024	- 'Z' - NMI occurred inside a hardirq
1025	- 'z' - NMI is running
1026	- 'H' - hard irq occurred inside a softirq.
1027	- 'h' - hard irq is running
1028	- 's' - soft irq is running
1029	- '.' - normal context.
1030
1031  preempt-depth: The level of preempt_disabled
1032
1033The above is mostly meaningful for kernel developers.
1034
1035  time:
1036	When the latency-format option is enabled, the trace file
1037	output includes a timestamp relative to the start of the
1038	trace. This differs from the output when latency-format
1039	is disabled, which includes an absolute timestamp.
1040
1041  delay:
1042	This is just to help catch your eye a bit better. And
1043	needs to be fixed to be only relative to the same CPU.
1044	The marks are determined by the difference between this
1045	current trace and the next trace.
1046
1047	  - '$' - greater than 1 second
1048	  - '@' - greater than 100 millisecond
1049	  - '*' - greater than 10 millisecond
1050	  - '#' - greater than 1000 microsecond
1051	  - '!' - greater than 100 microsecond
1052	  - '+' - greater than 10 microsecond
1053	  - ' ' - less than or equal to 10 microsecond.
1054
1055  The rest is the same as the 'trace' file.
1056
1057  Note, the latency tracers will usually end with a back trace
1058  to easily find where the latency occurred.
1059
1060trace_options
1061-------------
1062
1063The trace_options file (or the options directory) is used to control
1064what gets printed in the trace output, or manipulate the tracers.
1065To see what is available, simply cat the file::
1066
1067  cat trace_options
1068	print-parent
1069	nosym-offset
1070	nosym-addr
1071	noverbose
1072	noraw
1073	nohex
1074	nobin
1075	noblock
1076	nofields
1077	trace_printk
1078	annotate
1079	nouserstacktrace
1080	nosym-userobj
1081	noprintk-msg-only
1082	context-info
1083	nolatency-format
1084	record-cmd
1085	norecord-tgid
1086	overwrite
1087	nodisable_on_free
1088	irq-info
1089	markers
1090	noevent-fork
1091	function-trace
1092	nofunction-fork
1093	nodisplay-graph
1094	nostacktrace
1095	nobranch
1096
1097To disable one of the options, echo in the option prepended with
1098"no"::
1099
1100  echo noprint-parent > trace_options
1101
1102To enable an option, leave off the "no"::
1103
1104  echo sym-offset > trace_options
1105
1106Here are the available options:
1107
1108  print-parent
1109	On function traces, display the calling (parent)
1110	function as well as the function being traced.
1111	::
1112
1113	  print-parent:
1114	   bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
1115
1116	  noprint-parent:
1117	   bash-4000  [01]  1477.606694: simple_strtoul
1118
1119
1120  sym-offset
1121	Display not only the function name, but also the
1122	offset in the function. For example, instead of
1123	seeing just "ktime_get", you will see
1124	"ktime_get+0xb/0x20".
1125	::
1126
1127	  sym-offset:
1128	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
1129
1130  sym-addr
1131	This will also display the function address as well
1132	as the function name.
1133	::
1134
1135	  sym-addr:
1136	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
1137
1138  verbose
1139	This deals with the trace file when the
1140        latency-format option is enabled.
1141	::
1142
1143	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
1144	    (+0.000ms): simple_strtoul (kstrtoul)
1145
1146  raw
1147	This will display raw numbers. This option is best for
1148	use with user applications that can translate the raw
1149	numbers better than having it done in the kernel.
1150
1151  hex
1152	Similar to raw, but the numbers will be in a hexadecimal format.
1153
1154  bin
1155	This will print out the formats in raw binary.
1156
1157  block
1158	When set, reading trace_pipe will not block when polled.
1159
1160  fields
1161	Print the fields as described by their types. This is a better
1162	option than using hex, bin or raw, as it gives a better parsing
1163	of the content of the event.
1164
1165  trace_printk
1166	Can disable trace_printk() from writing into the buffer.
1167
1168  annotate
1169	It is sometimes confusing when the CPU buffers are full
1170	and one CPU buffer had a lot of events recently, thus
1171	a shorter time frame, were another CPU may have only had
1172	a few events, which lets it have older events. When
1173	the trace is reported, it shows the oldest events first,
1174	and it may look like only one CPU ran (the one with the
1175	oldest events). When the annotate option is set, it will
1176	display when a new CPU buffer started::
1177
1178			  <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
1179			  <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
1180			  <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
1181		##### CPU 2 buffer started ####
1182			  <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
1183			  <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
1184			  <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
1185
1186  userstacktrace
1187	This option changes the trace. It records a
1188	stacktrace of the current user space thread after
1189	each trace event.
1190
1191  sym-userobj
1192	when user stacktrace are enabled, look up which
1193	object the address belongs to, and print a
1194	relative address. This is especially useful when
1195	ASLR is on, otherwise you don't get a chance to
1196	resolve the address to object/file/line after
1197	the app is no longer running
1198
1199	The lookup is performed when you read
1200	trace,trace_pipe. Example::
1201
1202		  a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
1203		  x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
1204
1205
1206  printk-msg-only
1207	When set, trace_printk()s will only show the format
1208	and not their parameters (if trace_bprintk() or
1209	trace_bputs() was used to save the trace_printk()).
1210
1211  context-info
1212	Show only the event data. Hides the comm, PID,
1213	timestamp, CPU, and other useful data.
1214
1215  latency-format
1216	This option changes the trace output. When it is enabled,
1217	the trace displays additional information about the
1218	latency, as described in "Latency trace format".
1219
1220  pause-on-trace
1221	When set, opening the trace file for read, will pause
1222	writing to the ring buffer (as if tracing_on was set to zero).
1223	This simulates the original behavior of the trace file.
1224	When the file is closed, tracing will be enabled again.
1225
1226  hash-ptr
1227        When set, "%p" in the event printk format displays the
1228        hashed pointer value instead of real address.
1229        This will be useful if you want to find out which hashed
1230        value is corresponding to the real value in trace log.
1231
1232  record-cmd
1233	When any event or tracer is enabled, a hook is enabled
1234	in the sched_switch trace point to fill comm cache
1235	with mapped pids and comms. But this may cause some
1236	overhead, and if you only care about pids, and not the
1237	name of the task, disabling this option can lower the
1238	impact of tracing. See "saved_cmdlines".
1239
1240  record-tgid
1241	When any event or tracer is enabled, a hook is enabled
1242	in the sched_switch trace point to fill the cache of
1243	mapped Thread Group IDs (TGID) mapping to pids. See
1244	"saved_tgids".
1245
1246  overwrite
1247	This controls what happens when the trace buffer is
1248	full. If "1" (default), the oldest events are
1249	discarded and overwritten. If "0", then the newest
1250	events are discarded.
1251	(see per_cpu/cpu0/stats for overrun and dropped)
1252
1253  disable_on_free
1254	When the free_buffer is closed, tracing will
1255	stop (tracing_on set to 0).
1256
1257  irq-info
1258	Shows the interrupt, preempt count, need resched data.
1259	When disabled, the trace looks like::
1260
1261		# tracer: function
1262		#
1263		# entries-in-buffer/entries-written: 144405/9452052   #P:4
1264		#
1265		#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
1266		#              | |       |          |         |
1267			  <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1268			  <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
1269			  <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
1270
1271
1272  markers
1273	When set, the trace_marker is writable (only by root).
1274	When disabled, the trace_marker will error with EINVAL
1275	on write.
1276
1277  event-fork
1278	When set, tasks with PIDs listed in set_event_pid will have
1279	the PIDs of their children added to set_event_pid when those
1280	tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1281	their PIDs will be removed from the file.
1282
1283        This affects PIDs listed in set_event_notrace_pid as well.
1284
1285  function-trace
1286	The latency tracers will enable function tracing
1287	if this option is enabled (default it is). When
1288	it is disabled, the latency tracers do not trace
1289	functions. This keeps the overhead of the tracer down
1290	when performing latency tests.
1291
1292  function-fork
1293	When set, tasks with PIDs listed in set_ftrace_pid will
1294	have the PIDs of their children added to set_ftrace_pid
1295	when those tasks fork. Also, when tasks with PIDs in
1296	set_ftrace_pid exit, their PIDs will be removed from the
1297	file.
1298
1299        This affects PIDs in set_ftrace_notrace_pid as well.
1300
1301  display-graph
1302	When set, the latency tracers (irqsoff, wakeup, etc) will
1303	use function graph tracing instead of function tracing.
1304
1305  stacktrace
1306	When set, a stack trace is recorded after any trace event
1307	is recorded.
1308
1309  branch
1310	Enable branch tracing with the tracer. This enables branch
1311	tracer along with the currently set tracer. Enabling this
1312	with the "nop" tracer is the same as just enabling the
1313	"branch" tracer.
1314
1315.. tip:: Some tracers have their own options. They only appear in this
1316       file when the tracer is active. They always appear in the
1317       options directory.
1318
1319
1320Here are the per tracer options:
1321
1322Options for function tracer:
1323
1324  func_stack_trace
1325	When set, a stack trace is recorded after every
1326	function that is recorded. NOTE! Limit the functions
1327	that are recorded before enabling this, with
1328	"set_ftrace_filter" otherwise the system performance
1329	will be critically degraded. Remember to disable
1330	this option before clearing the function filter.
1331
1332Options for function_graph tracer:
1333
1334 Since the function_graph tracer has a slightly different output
1335 it has its own options to control what is displayed.
1336
1337  funcgraph-overrun
1338	When set, the "overrun" of the graph stack is
1339	displayed after each function traced. The
1340	overrun, is when the stack depth of the calls
1341	is greater than what is reserved for each task.
1342	Each task has a fixed array of functions to
1343	trace in the call graph. If the depth of the
1344	calls exceeds that, the function is not traced.
1345	The overrun is the number of functions missed
1346	due to exceeding this array.
1347
1348  funcgraph-cpu
1349	When set, the CPU number of the CPU where the trace
1350	occurred is displayed.
1351
1352  funcgraph-overhead
1353	When set, if the function takes longer than
1354	A certain amount, then a delay marker is
1355	displayed. See "delay" above, under the
1356	header description.
1357
1358  funcgraph-proc
1359	Unlike other tracers, the process' command line
1360	is not displayed by default, but instead only
1361	when a task is traced in and out during a context
1362	switch. Enabling this options has the command
1363	of each process displayed at every line.
1364
1365  funcgraph-duration
1366	At the end of each function (the return)
1367	the duration of the amount of time in the
1368	function is displayed in microseconds.
1369
1370  funcgraph-abstime
1371	When set, the timestamp is displayed at each line.
1372
1373  funcgraph-irqs
1374	When disabled, functions that happen inside an
1375	interrupt will not be traced.
1376
1377  funcgraph-tail
1378	When set, the return event will include the function
1379	that it represents. By default this is off, and
1380	only a closing curly bracket "}" is displayed for
1381	the return of a function.
1382
1383  funcgraph-retval
1384	When set, the return value of each traced function
1385	will be printed after an equal sign "=". By default
1386	this is off.
1387
1388  funcgraph-retval-hex
1389	When set, the return value will always be printed
1390	in hexadecimal format. If the option is not set and
1391	the return value is an error code, it will be printed
1392	in signed decimal format; otherwise it will also be
1393	printed in hexadecimal format. By default, this option
1394	is off.
1395
1396  sleep-time
1397	When running function graph tracer, to include
1398	the time a task schedules out in its function.
1399	When enabled, it will account time the task has been
1400	scheduled out as part of the function call.
1401
1402  graph-time
1403	When running function profiler with function graph tracer,
1404	to include the time to call nested functions. When this is
1405	not set, the time reported for the function will only
1406	include the time the function itself executed for, not the
1407	time for functions that it called.
1408
1409Options for blk tracer:
1410
1411  blk_classic
1412	Shows a more minimalistic output.
1413
1414
1415irqsoff
1416-------
1417
1418When interrupts are disabled, the CPU can not react to any other
1419external event (besides NMIs and SMIs). This prevents the timer
1420interrupt from triggering or the mouse interrupt from letting
1421the kernel know of a new mouse event. The result is a latency
1422with the reaction time.
1423
1424The irqsoff tracer tracks the time for which interrupts are
1425disabled. When a new maximum latency is hit, the tracer saves
1426the trace leading up to that latency point so that every time a
1427new maximum is reached, the old saved trace is discarded and the
1428new trace is saved.
1429
1430To reset the maximum, echo 0 into tracing_max_latency. Here is
1431an example::
1432
1433  # echo 0 > options/function-trace
1434  # echo irqsoff > current_tracer
1435  # echo 1 > tracing_on
1436  # echo 0 > tracing_max_latency
1437  # ls -ltr
1438  [...]
1439  # echo 0 > tracing_on
1440  # cat trace
1441  # tracer: irqsoff
1442  #
1443  # irqsoff latency trace v1.1.5 on 3.8.0-test+
1444  # --------------------------------------------------------------------
1445  # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1446  #    -----------------
1447  #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1448  #    -----------------
1449  #  => started at: run_timer_softirq
1450  #  => ended at:   run_timer_softirq
1451  #
1452  #
1453  #                  _------=> CPU#
1454  #                 / _-----=> irqs-off
1455  #                | / _----=> need-resched
1456  #                || / _---=> hardirq/softirq
1457  #                ||| / _--=> preempt-depth
1458  #                |||| /     delay
1459  #  cmd     pid   ||||| time  |   caller
1460  #     \   /      |||||  \    |   /
1461    <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
1462    <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
1463    <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
1464    <idle>-0       0dNs3   25us : <stack trace>
1465   => _raw_spin_unlock_irq
1466   => run_timer_softirq
1467   => __do_softirq
1468   => call_softirq
1469   => do_softirq
1470   => irq_exit
1471   => smp_apic_timer_interrupt
1472   => apic_timer_interrupt
1473   => rcu_idle_exit
1474   => cpu_idle
1475   => rest_init
1476   => start_kernel
1477   => x86_64_start_reservations
1478   => x86_64_start_kernel
1479
1480Here we see that we had a latency of 16 microseconds (which is
1481very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1482interrupts. The difference between the 16 and the displayed
1483timestamp 25us occurred because the clock was incremented
1484between the time of recording the max latency and the time of
1485recording the function that had that latency.
1486
1487Note the above example had function-trace not set. If we set
1488function-trace, we get a much larger output::
1489
1490 with echo 1 > options/function-trace
1491
1492  # tracer: irqsoff
1493  #
1494  # irqsoff latency trace v1.1.5 on 3.8.0-test+
1495  # --------------------------------------------------------------------
1496  # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1497  #    -----------------
1498  #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1499  #    -----------------
1500  #  => started at: ata_scsi_queuecmd
1501  #  => ended at:   ata_scsi_queuecmd
1502  #
1503  #
1504  #                  _------=> CPU#
1505  #                 / _-----=> irqs-off
1506  #                | / _----=> need-resched
1507  #                || / _---=> hardirq/softirq
1508  #                ||| / _--=> preempt-depth
1509  #                |||| /     delay
1510  #  cmd     pid   ||||| time  |   caller
1511  #     \   /      |||||  \    |   /
1512      bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1513      bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
1514      bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1515      bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1516      bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1517      bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
1518      bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
1519      bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1520      bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1521  [...]
1522      bash-2042    3d..1   67us : delay_tsc <-__delay
1523      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1524      bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
1525      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1526      bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
1527      bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1528      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1529      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1530      bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1531      bash-2042    3d..1  120us : <stack trace>
1532   => _raw_spin_unlock_irqrestore
1533   => ata_scsi_queuecmd
1534   => scsi_dispatch_cmd
1535   => scsi_request_fn
1536   => __blk_run_queue_uncond
1537   => __blk_run_queue
1538   => blk_queue_bio
1539   => submit_bio_noacct
1540   => submit_bio
1541   => submit_bh
1542   => __ext3_get_inode_loc
1543   => ext3_iget
1544   => ext3_lookup
1545   => lookup_real
1546   => __lookup_hash
1547   => walk_component
1548   => lookup_last
1549   => path_lookupat
1550   => filename_lookup
1551   => user_path_at_empty
1552   => user_path_at
1553   => vfs_fstatat
1554   => vfs_stat
1555   => sys_newstat
1556   => system_call_fastpath
1557
1558
1559Here we traced a 71 microsecond latency. But we also see all the
1560functions that were called during that time. Note that by
1561enabling function tracing, we incur an added overhead. This
1562overhead may extend the latency times. But nevertheless, this
1563trace has provided some very helpful debugging information.
1564
1565If we prefer function graph output instead of function, we can set
1566display-graph option::
1567
1568 with echo 1 > options/display-graph
1569
1570  # tracer: irqsoff
1571  #
1572  # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
1573  # --------------------------------------------------------------------
1574  # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
1575  #    -----------------
1576  #    | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
1577  #    -----------------
1578  #  => started at: free_debug_processing
1579  #  => ended at:   return_to_handler
1580  #
1581  #
1582  #                                       _-----=> irqs-off
1583  #                                      / _----=> need-resched
1584  #                                     | / _---=> hardirq/softirq
1585  #                                     || / _--=> preempt-depth
1586  #                                     ||| /
1587  #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
1588  #      |          |     |    |        ||||      |   |                     |   |   |   |
1589          0 us |   0)   bash-1507    |  d... |   0.000 us    |  _raw_spin_lock_irqsave();
1590          0 us |   0)   bash-1507    |  d..1 |   0.378 us    |    do_raw_spin_trylock();
1591          1 us |   0)   bash-1507    |  d..2 |               |    set_track() {
1592          2 us |   0)   bash-1507    |  d..2 |               |      save_stack_trace() {
1593          2 us |   0)   bash-1507    |  d..2 |               |        __save_stack_trace() {
1594          3 us |   0)   bash-1507    |  d..2 |               |          __unwind_start() {
1595          3 us |   0)   bash-1507    |  d..2 |               |            get_stack_info() {
1596          3 us |   0)   bash-1507    |  d..2 |   0.351 us    |              in_task_stack();
1597          4 us |   0)   bash-1507    |  d..2 |   1.107 us    |            }
1598  [...]
1599       3750 us |   0)   bash-1507    |  d..1 |   0.516 us    |      do_raw_spin_unlock();
1600       3750 us |   0)   bash-1507    |  d..1 |   0.000 us    |  _raw_spin_unlock_irqrestore();
1601       3764 us |   0)   bash-1507    |  d..1 |   0.000 us    |  tracer_hardirqs_on();
1602      bash-1507    0d..1 3792us : <stack trace>
1603   => free_debug_processing
1604   => __slab_free
1605   => kmem_cache_free
1606   => vm_area_free
1607   => remove_vma
1608   => exit_mmap
1609   => mmput
1610   => begin_new_exec
1611   => load_elf_binary
1612   => search_binary_handler
1613   => __do_execve_file.isra.32
1614   => __x64_sys_execve
1615   => do_syscall_64
1616   => entry_SYSCALL_64_after_hwframe
1617
1618preemptoff
1619----------
1620
1621When preemption is disabled, we may be able to receive
1622interrupts but the task cannot be preempted and a higher
1623priority task must wait for preemption to be enabled again
1624before it can preempt a lower priority task.
1625
1626The preemptoff tracer traces the places that disable preemption.
1627Like the irqsoff tracer, it records the maximum latency for
1628which preemption was disabled. The control of preemptoff tracer
1629is much like the irqsoff tracer.
1630::
1631
1632  # echo 0 > options/function-trace
1633  # echo preemptoff > current_tracer
1634  # echo 1 > tracing_on
1635  # echo 0 > tracing_max_latency
1636  # ls -ltr
1637  [...]
1638  # echo 0 > tracing_on
1639  # cat trace
1640  # tracer: preemptoff
1641  #
1642  # preemptoff latency trace v1.1.5 on 3.8.0-test+
1643  # --------------------------------------------------------------------
1644  # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1645  #    -----------------
1646  #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1647  #    -----------------
1648  #  => started at: do_IRQ
1649  #  => ended at:   do_IRQ
1650  #
1651  #
1652  #                  _------=> CPU#
1653  #                 / _-----=> irqs-off
1654  #                | / _----=> need-resched
1655  #                || / _---=> hardirq/softirq
1656  #                ||| / _--=> preempt-depth
1657  #                |||| /     delay
1658  #  cmd     pid   ||||| time  |   caller
1659  #     \   /      |||||  \    |   /
1660      sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
1661      sshd-1991    1d..1   46us : irq_exit <-do_IRQ
1662      sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
1663      sshd-1991    1d..1   52us : <stack trace>
1664   => sub_preempt_count
1665   => irq_exit
1666   => do_IRQ
1667   => ret_from_intr
1668
1669
1670This has some more changes. Preemption was disabled when an
1671interrupt came in (notice the 'h'), and was enabled on exit.
1672But we also see that interrupts have been disabled when entering
1673the preempt off section and leaving it (the 'd'). We do not know if
1674interrupts were enabled in the mean time or shortly after this
1675was over.
1676::
1677
1678  # tracer: preemptoff
1679  #
1680  # preemptoff latency trace v1.1.5 on 3.8.0-test+
1681  # --------------------------------------------------------------------
1682  # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1683  #    -----------------
1684  #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1685  #    -----------------
1686  #  => started at: wake_up_new_task
1687  #  => ended at:   task_rq_unlock
1688  #
1689  #
1690  #                  _------=> CPU#
1691  #                 / _-----=> irqs-off
1692  #                | / _----=> need-resched
1693  #                || / _---=> hardirq/softirq
1694  #                ||| / _--=> preempt-depth
1695  #                |||| /     delay
1696  #  cmd     pid   ||||| time  |   caller
1697  #     \   /      |||||  \    |   /
1698      bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
1699      bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
1700      bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
1701      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1702      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1703  [...]
1704      bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
1705      bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
1706      bash-1994    1d..1   13us : add_preempt_count <-irq_enter
1707      bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
1708      bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1709      bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
1710      bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
1711      bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
1712  [...]
1713      bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
1714      bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
1715      bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
1716      bash-1994    1d..2   36us : do_softirq <-irq_exit
1717      bash-1994    1d..2   36us : __do_softirq <-call_softirq
1718      bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
1719      bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
1720      bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
1721      bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
1722      bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
1723  [...]
1724      bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
1725      bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
1726      bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
1727      bash-1994    1dN.2   82us : idle_cpu <-irq_exit
1728      bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
1729      bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
1730      bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1731      bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
1732      bash-1994    1.N.1  104us : <stack trace>
1733   => sub_preempt_count
1734   => _raw_spin_unlock_irqrestore
1735   => task_rq_unlock
1736   => wake_up_new_task
1737   => do_fork
1738   => sys_clone
1739   => stub_clone
1740
1741
1742The above is an example of the preemptoff trace with
1743function-trace set. Here we see that interrupts were not disabled
1744the entire time. The irq_enter code lets us know that we entered
1745an interrupt 'h'. Before that, the functions being traced still
1746show that it is not in an interrupt, but we can see from the
1747functions themselves that this is not the case.
1748
1749preemptirqsoff
1750--------------
1751
1752Knowing the locations that have interrupts disabled or
1753preemption disabled for the longest times is helpful. But
1754sometimes we would like to know when either preemption and/or
1755interrupts are disabled.
1756
1757Consider the following code::
1758
1759    local_irq_disable();
1760    call_function_with_irqs_off();
1761    preempt_disable();
1762    call_function_with_irqs_and_preemption_off();
1763    local_irq_enable();
1764    call_function_with_preemption_off();
1765    preempt_enable();
1766
1767The irqsoff tracer will record the total length of
1768call_function_with_irqs_off() and
1769call_function_with_irqs_and_preemption_off().
1770
1771The preemptoff tracer will record the total length of
1772call_function_with_irqs_and_preemption_off() and
1773call_function_with_preemption_off().
1774
1775But neither will trace the time that interrupts and/or
1776preemption is disabled. This total time is the time that we can
1777not schedule. To record this time, use the preemptirqsoff
1778tracer.
1779
1780Again, using this trace is much like the irqsoff and preemptoff
1781tracers.
1782::
1783
1784  # echo 0 > options/function-trace
1785  # echo preemptirqsoff > current_tracer
1786  # echo 1 > tracing_on
1787  # echo 0 > tracing_max_latency
1788  # ls -ltr
1789  [...]
1790  # echo 0 > tracing_on
1791  # cat trace
1792  # tracer: preemptirqsoff
1793  #
1794  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1795  # --------------------------------------------------------------------
1796  # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1797  #    -----------------
1798  #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1799  #    -----------------
1800  #  => started at: ata_scsi_queuecmd
1801  #  => ended at:   ata_scsi_queuecmd
1802  #
1803  #
1804  #                  _------=> CPU#
1805  #                 / _-----=> irqs-off
1806  #                | / _----=> need-resched
1807  #                || / _---=> hardirq/softirq
1808  #                ||| / _--=> preempt-depth
1809  #                |||| /     delay
1810  #  cmd     pid   ||||| time  |   caller
1811  #     \   /      |||||  \    |   /
1812        ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1813        ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1814        ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
1815        ls-2230    3...1  111us : <stack trace>
1816   => sub_preempt_count
1817   => _raw_spin_unlock_irqrestore
1818   => ata_scsi_queuecmd
1819   => scsi_dispatch_cmd
1820   => scsi_request_fn
1821   => __blk_run_queue_uncond
1822   => __blk_run_queue
1823   => blk_queue_bio
1824   => submit_bio_noacct
1825   => submit_bio
1826   => submit_bh
1827   => ext3_bread
1828   => ext3_dir_bread
1829   => htree_dirblock_to_tree
1830   => ext3_htree_fill_tree
1831   => ext3_readdir
1832   => vfs_readdir
1833   => sys_getdents
1834   => system_call_fastpath
1835
1836
1837The trace_hardirqs_off_thunk is called from assembly on x86 when
1838interrupts are disabled in the assembly code. Without the
1839function tracing, we do not know if interrupts were enabled
1840within the preemption points. We do see that it started with
1841preemption enabled.
1842
1843Here is a trace with function-trace set::
1844
1845  # tracer: preemptirqsoff
1846  #
1847  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1848  # --------------------------------------------------------------------
1849  # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1850  #    -----------------
1851  #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1852  #    -----------------
1853  #  => started at: schedule
1854  #  => ended at:   mutex_unlock
1855  #
1856  #
1857  #                  _------=> CPU#
1858  #                 / _-----=> irqs-off
1859  #                | / _----=> need-resched
1860  #                || / _---=> hardirq/softirq
1861  #                ||| / _--=> preempt-depth
1862  #                |||| /     delay
1863  #  cmd     pid   ||||| time  |   caller
1864  #     \   /      |||||  \    |   /
1865  kworker/-59      3...1    0us : __schedule <-schedule
1866  kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
1867  kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
1868  kworker/-59      3d..2    1us : deactivate_task <-__schedule
1869  kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
1870  kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
1871  kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
1872  kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
1873  kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
1874  kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
1875  kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
1876  kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
1877  kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1878  kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
1879  kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
1880  kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
1881  kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
1882  kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
1883  kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
1884  kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
1885  kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
1886  kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
1887  kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
1888  kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
1889  kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
1890        ls-2269    3d..2    7us : finish_task_switch <-__schedule
1891        ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
1892        ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
1893        ls-2269    3d..2    8us : irq_enter <-do_IRQ
1894        ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
1895        ls-2269    3d..2    9us : add_preempt_count <-irq_enter
1896        ls-2269    3d.h2    9us : exit_idle <-do_IRQ
1897  [...]
1898        ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
1899        ls-2269    3d.h2   20us : irq_exit <-do_IRQ
1900        ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
1901        ls-2269    3d..3   21us : do_softirq <-irq_exit
1902        ls-2269    3d..3   21us : __do_softirq <-call_softirq
1903        ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
1904        ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
1905        ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
1906        ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
1907        ls-2269    3d.s5   31us : irq_enter <-do_IRQ
1908        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1909  [...]
1910        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1911        ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
1912        ls-2269    3d.H5   32us : exit_idle <-do_IRQ
1913        ls-2269    3d.H5   32us : handle_irq <-do_IRQ
1914        ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
1915        ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
1916  [...]
1917        ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1918        ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1919        ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
1920        ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
1921        ls-2269    3d..3  159us : idle_cpu <-irq_exit
1922        ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
1923        ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
1924        ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
1925        ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
1926        ls-2269    3d...  186us : <stack trace>
1927   => __mutex_unlock_slowpath
1928   => mutex_unlock
1929   => process_output
1930   => n_tty_write
1931   => tty_write
1932   => vfs_write
1933   => sys_write
1934   => system_call_fastpath
1935
1936This is an interesting trace. It started with kworker running and
1937scheduling out and ls taking over. But as soon as ls released the
1938rq lock and enabled interrupts (but not preemption) an interrupt
1939triggered. When the interrupt finished, it started running softirqs.
1940But while the softirq was running, another interrupt triggered.
1941When an interrupt is running inside a softirq, the annotation is 'H'.
1942
1943
1944wakeup
1945------
1946
1947One common case that people are interested in tracing is the
1948time it takes for a task that is woken to actually wake up.
1949Now for non Real-Time tasks, this can be arbitrary. But tracing
1950it none the less can be interesting.
1951
1952Without function tracing::
1953
1954  # echo 0 > options/function-trace
1955  # echo wakeup > current_tracer
1956  # echo 1 > tracing_on
1957  # echo 0 > tracing_max_latency
1958  # chrt -f 5 sleep 1
1959  # echo 0 > tracing_on
1960  # cat trace
1961  # tracer: wakeup
1962  #
1963  # wakeup latency trace v1.1.5 on 3.8.0-test+
1964  # --------------------------------------------------------------------
1965  # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1966  #    -----------------
1967  #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1968  #    -----------------
1969  #
1970  #                  _------=> CPU#
1971  #                 / _-----=> irqs-off
1972  #                | / _----=> need-resched
1973  #                || / _---=> hardirq/softirq
1974  #                ||| / _--=> preempt-depth
1975  #                |||| /     delay
1976  #  cmd     pid   ||||| time  |   caller
1977  #     \   /      |||||  \    |   /
1978    <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
1979    <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1980    <idle>-0       3d..3   15us : __schedule <-schedule
1981    <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
1982
1983The tracer only traces the highest priority task in the system
1984to avoid tracing the normal circumstances. Here we see that
1985the kworker with a nice priority of -20 (not very nice), took
1986just 15 microseconds from the time it woke up, to the time it
1987ran.
1988
1989Non Real-Time tasks are not that interesting. A more interesting
1990trace is to concentrate only on Real-Time tasks.
1991
1992wakeup_rt
1993---------
1994
1995In a Real-Time environment it is very important to know the
1996wakeup time it takes for the highest priority task that is woken
1997up to the time that it executes. This is also known as "schedule
1998latency". I stress the point that this is about RT tasks. It is
1999also important to know the scheduling latency of non-RT tasks,
2000but the average schedule latency is better for non-RT tasks.
2001Tools like LatencyTop are more appropriate for such
2002measurements.
2003
2004Real-Time environments are interested in the worst case latency.
2005That is the longest latency it takes for something to happen,
2006and not the average. We can have a very fast scheduler that may
2007only have a large latency once in a while, but that would not
2008work well with Real-Time tasks.  The wakeup_rt tracer was designed
2009to record the worst case wakeups of RT tasks. Non-RT tasks are
2010not recorded because the tracer only records one worst case and
2011tracing non-RT tasks that are unpredictable will overwrite the
2012worst case latency of RT tasks (just run the normal wakeup
2013tracer for a while to see that effect).
2014
2015Since this tracer only deals with RT tasks, we will run this
2016slightly differently than we did with the previous tracers.
2017Instead of performing an 'ls', we will run 'sleep 1' under
2018'chrt' which changes the priority of the task.
2019::
2020
2021  # echo 0 > options/function-trace
2022  # echo wakeup_rt > current_tracer
2023  # echo 1 > tracing_on
2024  # echo 0 > tracing_max_latency
2025  # chrt -f 5 sleep 1
2026  # echo 0 > tracing_on
2027  # cat trace
2028  # tracer: wakeup
2029  #
2030  # tracer: wakeup_rt
2031  #
2032  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2033  # --------------------------------------------------------------------
2034  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2035  #    -----------------
2036  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
2037  #    -----------------
2038  #
2039  #                  _------=> CPU#
2040  #                 / _-----=> irqs-off
2041  #                | / _----=> need-resched
2042  #                || / _---=> hardirq/softirq
2043  #                ||| / _--=> preempt-depth
2044  #                |||| /     delay
2045  #  cmd     pid   ||||| time  |   caller
2046  #     \   /      |||||  \    |   /
2047    <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
2048    <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
2049    <idle>-0       3d..3    5us : __schedule <-schedule
2050    <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
2051
2052
2053Running this on an idle system, we see that it only took 5 microseconds
2054to perform the task switch.  Note, since the trace point in the schedule
2055is before the actual "switch", we stop the tracing when the recorded task
2056is about to schedule in. This may change if we add a new marker at the
2057end of the scheduler.
2058
2059Notice that the recorded task is 'sleep' with the PID of 2389
2060and it has an rt_prio of 5. This priority is user-space priority
2061and not the internal kernel priority. The policy is 1 for
2062SCHED_FIFO and 2 for SCHED_RR.
2063
2064Note, that the trace data shows the internal priority (99 - rtprio).
2065::
2066
2067  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
2068
2069The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
2070and in the running state 'R'. The sleep task was scheduled in with
20712389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
2072and it too is in the running state.
2073
2074Doing the same with chrt -r 5 and function-trace set.
2075::
2076
2077  echo 1 > options/function-trace
2078
2079  # tracer: wakeup_rt
2080  #
2081  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2082  # --------------------------------------------------------------------
2083  # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2084  #    -----------------
2085  #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
2086  #    -----------------
2087  #
2088  #                  _------=> CPU#
2089  #                 / _-----=> irqs-off
2090  #                | / _----=> need-resched
2091  #                || / _---=> hardirq/softirq
2092  #                ||| / _--=> preempt-depth
2093  #                |||| /     delay
2094  #  cmd     pid   ||||| time  |   caller
2095  #     \   /      |||||  \    |   /
2096    <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
2097    <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2098    <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
2099    <idle>-0       3d.h3    3us : resched_curr <-check_preempt_curr
2100    <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
2101    <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
2102    <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
2103    <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
2104    <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
2105    <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2106    <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
2107    <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
2108    <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
2109    <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
2110    <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
2111    <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
2112    <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
2113    <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
2114    <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
2115    <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
2116    <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
2117    <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
2118    <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
2119    <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
2120    <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
2121    <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
2122    <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
2123    <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
2124    <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
2125    <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
2126    <idle>-0       3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
2127    <idle>-0       3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
2128    <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
2129    <idle>-0       3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
2130    <idle>-0       3dN.2   14us : sched_avg_update <-__cpu_load_update
2131    <idle>-0       3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
2132    <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
2133    <idle>-0       3dN.1   15us : calc_load_nohz_stop <-tick_nohz_idle_exit
2134    <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
2135    <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
2136    <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
2137    <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
2138    <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2139    <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
2140    <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
2141    <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
2142    <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
2143    <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
2144    <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
2145    <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
2146    <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
2147    <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2148    <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
2149    <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
2150    <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
2151    <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2152    <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
2153    <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
2154    <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2155    <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
2156    <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
2157    <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
2158    <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
2159    <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
2160    <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
2161    <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
2162    <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
2163    <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2164    <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
2165    <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
2166    <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
2167    <idle>-0       3.N..   25us : schedule <-cpu_idle
2168    <idle>-0       3.N..   25us : __schedule <-preempt_schedule
2169    <idle>-0       3.N..   26us : add_preempt_count <-__schedule
2170    <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
2171    <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
2172    <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
2173    <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
2174    <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
2175    <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
2176    <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
2177    <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
2178    <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
2179    <idle>-0       3d..3   29us : __schedule <-preempt_schedule
2180    <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
2181
2182This isn't that big of a trace, even with function tracing enabled,
2183so I included the entire trace.
2184
2185The interrupt went off while when the system was idle. Somewhere
2186before task_woken_rt() was called, the NEED_RESCHED flag was set,
2187this is indicated by the first occurrence of the 'N' flag.
2188
2189Latency tracing and events
2190--------------------------
2191As function tracing can induce a much larger latency, but without
2192seeing what happens within the latency it is hard to know what
2193caused it. There is a middle ground, and that is with enabling
2194events.
2195::
2196
2197  # echo 0 > options/function-trace
2198  # echo wakeup_rt > current_tracer
2199  # echo 1 > events/enable
2200  # echo 1 > tracing_on
2201  # echo 0 > tracing_max_latency
2202  # chrt -f 5 sleep 1
2203  # echo 0 > tracing_on
2204  # cat trace
2205  # tracer: wakeup_rt
2206  #
2207  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2208  # --------------------------------------------------------------------
2209  # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2210  #    -----------------
2211  #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
2212  #    -----------------
2213  #
2214  #                  _------=> CPU#
2215  #                 / _-----=> irqs-off
2216  #                | / _----=> need-resched
2217  #                || / _---=> hardirq/softirq
2218  #                ||| / _--=> preempt-depth
2219  #                |||| /     delay
2220  #  cmd     pid   ||||| time  |   caller
2221  #     \   /      |||||  \    |   /
2222    <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
2223    <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2224    <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
2225    <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
2226    <idle>-0       2.N.2    2us : power_end: cpu_id=2
2227    <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
2228    <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
2229    <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
2230    <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
2231    <idle>-0       2.N.2    5us : rcu_utilization: End context switch
2232    <idle>-0       2d..3    6us : __schedule <-schedule
2233    <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
2234
2235
2236Hardware Latency Detector
2237-------------------------
2238
2239The hardware latency detector is executed by enabling the "hwlat" tracer.
2240
2241NOTE, this tracer will affect the performance of the system as it will
2242periodically make a CPU constantly busy with interrupts disabled.
2243::
2244
2245  # echo hwlat > current_tracer
2246  # sleep 100
2247  # cat trace
2248  # tracer: hwlat
2249  #
2250  # entries-in-buffer/entries-written: 13/13   #P:8
2251  #
2252  #                              _-----=> irqs-off
2253  #                             / _----=> need-resched
2254  #                            | / _---=> hardirq/softirq
2255  #                            || / _--=> preempt-depth
2256  #                            ||| /     delay
2257  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2258  #              | |       |   ||||       |         |
2259             <...>-1729  [001] d...   678.473449: #1     inner/outer(us):   11/12    ts:1581527483.343962693 count:6
2260             <...>-1729  [004] d...   689.556542: #2     inner/outer(us):   16/9     ts:1581527494.889008092 count:1
2261             <...>-1729  [005] d...   714.756290: #3     inner/outer(us):   16/16    ts:1581527519.678961629 count:5
2262             <...>-1729  [001] d...   718.788247: #4     inner/outer(us):    9/17    ts:1581527523.889012713 count:1
2263             <...>-1729  [002] d...   719.796341: #5     inner/outer(us):   13/9     ts:1581527524.912872606 count:1
2264             <...>-1729  [006] d...   844.787091: #6     inner/outer(us):    9/12    ts:1581527649.889048502 count:2
2265             <...>-1729  [003] d...   849.827033: #7     inner/outer(us):   18/9     ts:1581527654.889013793 count:1
2266             <...>-1729  [007] d...   853.859002: #8     inner/outer(us):    9/12    ts:1581527658.889065736 count:1
2267             <...>-1729  [001] d...   855.874978: #9     inner/outer(us):    9/11    ts:1581527660.861991877 count:1
2268             <...>-1729  [001] d...   863.938932: #10    inner/outer(us):    9/11    ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
2269             <...>-1729  [007] d...   878.050780: #11    inner/outer(us):    9/12    ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
2270             <...>-1729  [007] d...   886.114702: #12    inner/outer(us):    9/12    ts:1581527691.385001600 count:1
2271
2272
2273The above output is somewhat the same in the header. All events will have
2274interrupts disabled 'd'. Under the FUNCTION title there is:
2275
2276 #1
2277	This is the count of events recorded that were greater than the
2278	tracing_threshold (See below).
2279
2280 inner/outer(us):   11/11
2281
2282      This shows two numbers as "inner latency" and "outer latency". The test
2283      runs in a loop checking a timestamp twice. The latency detected within
2284      the two timestamps is the "inner latency" and the latency detected
2285      after the previous timestamp and the next timestamp in the loop is
2286      the "outer latency".
2287
2288 ts:1581527483.343962693
2289
2290      The absolute timestamp that the first latency was recorded in the window.
2291
2292 count:6
2293
2294      The number of times a latency was detected during the window.
2295
2296 nmi-total:7 nmi-count:1
2297
2298      On architectures that support it, if an NMI comes in during the
2299      test, the time spent in NMI is reported in "nmi-total" (in
2300      microseconds).
2301
2302      All architectures that have NMIs will show the "nmi-count" if an
2303      NMI comes in during the test.
2304
2305hwlat files:
2306
2307  tracing_threshold
2308	This gets automatically set to "10" to represent 10
2309	microseconds. This is the threshold of latency that
2310	needs to be detected before the trace will be recorded.
2311
2312	Note, when hwlat tracer is finished (another tracer is
2313	written into "current_tracer"), the original value for
2314	tracing_threshold is placed back into this file.
2315
2316  hwlat_detector/width
2317	The length of time the test runs with interrupts disabled.
2318
2319  hwlat_detector/window
2320	The length of time of the window which the test
2321	runs. That is, the test will run for "width"
2322	microseconds per "window" microseconds
2323
2324  tracing_cpumask
2325	When the test is started. A kernel thread is created that
2326	runs the test. This thread will alternate between CPUs
2327	listed in the tracing_cpumask between each period
2328	(one "window"). To limit the test to specific CPUs
2329	set the mask in this file to only the CPUs that the test
2330	should run on.
2331
2332function
2333--------
2334
2335This tracer is the function tracer. Enabling the function tracer
2336can be done from the debug file system. Make sure the
2337ftrace_enabled is set; otherwise this tracer is a nop.
2338See the "ftrace_enabled" section below.
2339::
2340
2341  # sysctl kernel.ftrace_enabled=1
2342  # echo function > current_tracer
2343  # echo 1 > tracing_on
2344  # usleep 1
2345  # echo 0 > tracing_on
2346  # cat trace
2347  # tracer: function
2348  #
2349  # entries-in-buffer/entries-written: 24799/24799   #P:4
2350  #
2351  #                              _-----=> irqs-off
2352  #                             / _----=> need-resched
2353  #                            | / _---=> hardirq/softirq
2354  #                            || / _--=> preempt-depth
2355  #                            ||| /     delay
2356  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2357  #              | |       |   ||||       |         |
2358              bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
2359              bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2360              bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
2361              bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
2362              bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
2363              bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
2364              bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
2365              bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
2366  [...]
2367
2368
2369Note: function tracer uses ring buffers to store the above
2370entries. The newest data may overwrite the oldest data.
2371Sometimes using echo to stop the trace is not sufficient because
2372the tracing could have overwritten the data that you wanted to
2373record. For this reason, it is sometimes better to disable
2374tracing directly from a program. This allows you to stop the
2375tracing at the point that you hit the part that you are
2376interested in. To disable the tracing directly from a C program,
2377something like following code snippet can be used::
2378
2379	int trace_fd;
2380	[...]
2381	int main(int argc, char *argv[]) {
2382		[...]
2383		trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2384		[...]
2385		if (condition_hit()) {
2386			write(trace_fd, "0", 1);
2387		}
2388		[...]
2389	}
2390
2391
2392Single thread tracing
2393---------------------
2394
2395By writing into set_ftrace_pid you can trace a
2396single thread. For example::
2397
2398  # cat set_ftrace_pid
2399  no pid
2400  # echo 3111 > set_ftrace_pid
2401  # cat set_ftrace_pid
2402  3111
2403  # echo function > current_tracer
2404  # cat trace | head
2405  # tracer: function
2406  #
2407  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2408  #              | |       |          |         |
2409      yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
2410      yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2411      yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2412      yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2413      yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
2414      yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
2415  # echo > set_ftrace_pid
2416  # cat trace |head
2417  # tracer: function
2418  #
2419  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2420  #              | |       |          |         |
2421  ##### CPU 3 buffer started ####
2422      yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
2423      yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
2424      yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
2425      yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
2426      yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
2427
2428If you want to trace a function when executing, you could use
2429something like this simple program.
2430::
2431
2432	#include <stdio.h>
2433	#include <stdlib.h>
2434	#include <sys/types.h>
2435	#include <sys/stat.h>
2436	#include <fcntl.h>
2437	#include <unistd.h>
2438	#include <string.h>
2439
2440	#define _STR(x) #x
2441	#define STR(x) _STR(x)
2442	#define MAX_PATH 256
2443
2444	const char *find_tracefs(void)
2445	{
2446	       static char tracefs[MAX_PATH+1];
2447	       static int tracefs_found;
2448	       char type[100];
2449	       FILE *fp;
2450
2451	       if (tracefs_found)
2452		       return tracefs;
2453
2454	       if ((fp = fopen("/proc/mounts","r")) == NULL) {
2455		       perror("/proc/mounts");
2456		       return NULL;
2457	       }
2458
2459	       while (fscanf(fp, "%*s %"
2460		             STR(MAX_PATH)
2461		             "s %99s %*s %*d %*d\n",
2462		             tracefs, type) == 2) {
2463		       if (strcmp(type, "tracefs") == 0)
2464		               break;
2465	       }
2466	       fclose(fp);
2467
2468	       if (strcmp(type, "tracefs") != 0) {
2469		       fprintf(stderr, "tracefs not mounted");
2470		       return NULL;
2471	       }
2472
2473	       strcat(tracefs, "/tracing/");
2474	       tracefs_found = 1;
2475
2476	       return tracefs;
2477	}
2478
2479	const char *tracing_file(const char *file_name)
2480	{
2481	       static char trace_file[MAX_PATH+1];
2482	       snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2483	       return trace_file;
2484	}
2485
2486	int main (int argc, char **argv)
2487	{
2488		if (argc < 1)
2489		        exit(-1);
2490
2491		if (fork() > 0) {
2492		        int fd, ffd;
2493		        char line[64];
2494		        int s;
2495
2496		        ffd = open(tracing_file("current_tracer"), O_WRONLY);
2497		        if (ffd < 0)
2498		                exit(-1);
2499		        write(ffd, "nop", 3);
2500
2501		        fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2502		        s = sprintf(line, "%d\n", getpid());
2503		        write(fd, line, s);
2504
2505		        write(ffd, "function", 8);
2506
2507		        close(fd);
2508		        close(ffd);
2509
2510		        execvp(argv[1], argv+1);
2511		}
2512
2513		return 0;
2514	}
2515
2516Or this simple script!
2517::
2518
2519  #!/bin/bash
2520
2521  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
2522  echo 0 > $tracefs/tracing_on
2523  echo $$ > $tracefs/set_ftrace_pid
2524  echo function > $tracefs/current_tracer
2525  echo 1 > $tracefs/tracing_on
2526  exec "$@"
2527
2528
2529function graph tracer
2530---------------------------
2531
2532This tracer is similar to the function tracer except that it
2533probes a function on its entry and its exit. This is done by
2534using a dynamically allocated stack of return addresses in each
2535task_struct. On function entry the tracer overwrites the return
2536address of each function traced to set a custom probe. Thus the
2537original return address is stored on the stack of return address
2538in the task_struct.
2539
2540Probing on both ends of a function leads to special features
2541such as:
2542
2543- measure of a function's time execution
2544- having a reliable call stack to draw function calls graph
2545
2546This tracer is useful in several situations:
2547
2548- you want to find the reason of a strange kernel behavior and
2549  need to see what happens in detail on any areas (or specific
2550  ones).
2551
2552- you are experiencing weird latencies but it's difficult to
2553  find its origin.
2554
2555- you want to find quickly which path is taken by a specific
2556  function
2557
2558- you just want to peek inside a working kernel and want to see
2559  what happens there.
2560
2561::
2562
2563  # tracer: function_graph
2564  #
2565  # CPU  DURATION                  FUNCTION CALLS
2566  # |     |   |                     |   |   |   |
2567
2568   0)               |  sys_open() {
2569   0)               |    do_sys_open() {
2570   0)               |      getname() {
2571   0)               |        kmem_cache_alloc() {
2572   0)   1.382 us    |          __might_sleep();
2573   0)   2.478 us    |        }
2574   0)               |        strncpy_from_user() {
2575   0)               |          might_fault() {
2576   0)   1.389 us    |            __might_sleep();
2577   0)   2.553 us    |          }
2578   0)   3.807 us    |        }
2579   0)   7.876 us    |      }
2580   0)               |      alloc_fd() {
2581   0)   0.668 us    |        _spin_lock();
2582   0)   0.570 us    |        expand_files();
2583   0)   0.586 us    |        _spin_unlock();
2584
2585
2586There are several columns that can be dynamically
2587enabled/disabled. You can use every combination of options you
2588want, depending on your needs.
2589
2590- The cpu number on which the function executed is default
2591  enabled.  It is sometimes better to only trace one cpu (see
2592  tracing_cpumask file) or you might sometimes see unordered
2593  function calls while cpu tracing switch.
2594
2595	- hide: echo nofuncgraph-cpu > trace_options
2596	- show: echo funcgraph-cpu > trace_options
2597
2598- The duration (function's time of execution) is displayed on
2599  the closing bracket line of a function or on the same line
2600  than the current function in case of a leaf one. It is default
2601  enabled.
2602
2603	- hide: echo nofuncgraph-duration > trace_options
2604	- show: echo funcgraph-duration > trace_options
2605
2606- The overhead field precedes the duration field in case of
2607  reached duration thresholds.
2608
2609	- hide: echo nofuncgraph-overhead > trace_options
2610	- show: echo funcgraph-overhead > trace_options
2611	- depends on: funcgraph-duration
2612
2613  ie::
2614
2615    3) # 1837.709 us |          } /* __switch_to */
2616    3)               |          finish_task_switch() {
2617    3)   0.313 us    |            _raw_spin_unlock_irq();
2618    3)   3.177 us    |          }
2619    3) # 1889.063 us |        } /* __schedule */
2620    3) ! 140.417 us  |      } /* __schedule */
2621    3) # 2034.948 us |    } /* schedule */
2622    3) * 33998.59 us |  } /* schedule_preempt_disabled */
2623
2624    [...]
2625
2626    1)   0.260 us    |              msecs_to_jiffies();
2627    1)   0.313 us    |              __rcu_read_unlock();
2628    1) + 61.770 us   |            }
2629    1) + 64.479 us   |          }
2630    1)   0.313 us    |          rcu_bh_qs();
2631    1)   0.313 us    |          __local_bh_enable();
2632    1) ! 217.240 us  |        }
2633    1)   0.365 us    |        idle_cpu();
2634    1)               |        rcu_irq_exit() {
2635    1)   0.417 us    |          rcu_eqs_enter_common.isra.47();
2636    1)   3.125 us    |        }
2637    1) ! 227.812 us  |      }
2638    1) ! 457.395 us  |    }
2639    1) @ 119760.2 us |  }
2640
2641    [...]
2642
2643    2)               |    handle_IPI() {
2644    1)   6.979 us    |                  }
2645    2)   0.417 us    |      scheduler_ipi();
2646    1)   9.791 us    |                }
2647    1) + 12.917 us   |              }
2648    2)   3.490 us    |    }
2649    1) + 15.729 us   |            }
2650    1) + 18.542 us   |          }
2651    2) $ 3594274 us  |  }
2652
2653Flags::
2654
2655  + means that the function exceeded 10 usecs.
2656  ! means that the function exceeded 100 usecs.
2657  # means that the function exceeded 1000 usecs.
2658  * means that the function exceeded 10 msecs.
2659  @ means that the function exceeded 100 msecs.
2660  $ means that the function exceeded 1 sec.
2661
2662
2663- The task/pid field displays the thread cmdline and pid which
2664  executed the function. It is default disabled.
2665
2666	- hide: echo nofuncgraph-proc > trace_options
2667	- show: echo funcgraph-proc > trace_options
2668
2669  ie::
2670
2671    # tracer: function_graph
2672    #
2673    # CPU  TASK/PID        DURATION                  FUNCTION CALLS
2674    # |    |    |           |   |                     |   |   |   |
2675    0)    sh-4802     |               |                  d_free() {
2676    0)    sh-4802     |               |                    call_rcu() {
2677    0)    sh-4802     |               |                      __call_rcu() {
2678    0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
2679    0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
2680    0)    sh-4802     |   2.899 us    |                      }
2681    0)    sh-4802     |   4.040 us    |                    }
2682    0)    sh-4802     |   5.151 us    |                  }
2683    0)    sh-4802     | + 49.370 us   |                }
2684
2685
2686- The absolute time field is an absolute timestamp given by the
2687  system clock since it started. A snapshot of this time is
2688  given on each entry/exit of functions
2689
2690	- hide: echo nofuncgraph-abstime > trace_options
2691	- show: echo funcgraph-abstime > trace_options
2692
2693  ie::
2694
2695    #
2696    #      TIME       CPU  DURATION                  FUNCTION CALLS
2697    #       |         |     |   |                     |   |   |   |
2698    360.774522 |   1)   0.541 us    |                                          }
2699    360.774522 |   1)   4.663 us    |                                        }
2700    360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
2701    360.774524 |   1)   6.796 us    |                                      }
2702    360.774524 |   1)   7.952 us    |                                    }
2703    360.774525 |   1)   9.063 us    |                                  }
2704    360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
2705    360.774527 |   1)   0.578 us    |                                  __brelse();
2706    360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
2707    360.774528 |   1)               |                                    unlock_buffer() {
2708    360.774529 |   1)               |                                      wake_up_bit() {
2709    360.774529 |   1)               |                                        bit_waitqueue() {
2710    360.774530 |   1)   0.594 us    |                                          __phys_addr();
2711
2712
2713The function name is always displayed after the closing bracket
2714for a function if the start of that function is not in the
2715trace buffer.
2716
2717Display of the function name after the closing bracket may be
2718enabled for functions whose start is in the trace buffer,
2719allowing easier searching with grep for function durations.
2720It is default disabled.
2721
2722	- hide: echo nofuncgraph-tail > trace_options
2723	- show: echo funcgraph-tail > trace_options
2724
2725  Example with nofuncgraph-tail (default)::
2726
2727    0)               |      putname() {
2728    0)               |        kmem_cache_free() {
2729    0)   0.518 us    |          __phys_addr();
2730    0)   1.757 us    |        }
2731    0)   2.861 us    |      }
2732
2733  Example with funcgraph-tail::
2734
2735    0)               |      putname() {
2736    0)               |        kmem_cache_free() {
2737    0)   0.518 us    |          __phys_addr();
2738    0)   1.757 us    |        } /* kmem_cache_free() */
2739    0)   2.861 us    |      } /* putname() */
2740
2741The return value of each traced function can be displayed after
2742an equal sign "=". When encountering system call failures, it
2743can be very helpful to quickly locate the function that first
2744returns an error code.
2745
2746	- hide: echo nofuncgraph-retval > trace_options
2747	- show: echo funcgraph-retval > trace_options
2748
2749  Example with funcgraph-retval::
2750
2751    1)               |    cgroup_migrate() {
2752    1)   0.651 us    |      cgroup_migrate_add_task(); /* = 0xffff93fcfd346c00 */
2753    1)               |      cgroup_migrate_execute() {
2754    1)               |        cpu_cgroup_can_attach() {
2755    1)               |          cgroup_taskset_first() {
2756    1)   0.732 us    |            cgroup_taskset_next(); /* = 0xffff93fc8fb20000 */
2757    1)   1.232 us    |          } /* cgroup_taskset_first = 0xffff93fc8fb20000 */
2758    1)   0.380 us    |          sched_rt_can_attach(); /* = 0x0 */
2759    1)   2.335 us    |        } /* cpu_cgroup_can_attach = -22 */
2760    1)   4.369 us    |      } /* cgroup_migrate_execute = -22 */
2761    1)   7.143 us    |    } /* cgroup_migrate = -22 */
2762
2763The above example shows that the function cpu_cgroup_can_attach
2764returned the error code -22 firstly, then we can read the code
2765of this function to get the root cause.
2766
2767When the option funcgraph-retval-hex is not set, the return value can
2768be displayed in a smart way. Specifically, if it is an error code,
2769it will be printed in signed decimal format, otherwise it will
2770printed in hexadecimal format.
2771
2772	- smart: echo nofuncgraph-retval-hex > trace_options
2773	- hexadecimal: echo funcgraph-retval-hex > trace_options
2774
2775  Example with funcgraph-retval-hex::
2776
2777    1)               |      cgroup_migrate() {
2778    1)   0.651 us    |        cgroup_migrate_add_task(); /* = 0xffff93fcfd346c00 */
2779    1)               |        cgroup_migrate_execute() {
2780    1)               |          cpu_cgroup_can_attach() {
2781    1)               |            cgroup_taskset_first() {
2782    1)   0.732 us    |              cgroup_taskset_next(); /* = 0xffff93fc8fb20000 */
2783    1)   1.232 us    |            } /* cgroup_taskset_first = 0xffff93fc8fb20000 */
2784    1)   0.380 us    |            sched_rt_can_attach(); /* = 0x0 */
2785    1)   2.335 us    |          } /* cpu_cgroup_can_attach = 0xffffffea */
2786    1)   4.369 us    |        } /* cgroup_migrate_execute = 0xffffffea */
2787    1)   7.143 us    |      } /* cgroup_migrate = 0xffffffea */
2788
2789At present, there are some limitations when using the funcgraph-retval
2790option, and these limitations will be eliminated in the future:
2791
2792- Even if the function return type is void, a return value will still
2793  be printed, and you can just ignore it.
2794
2795- Even if return values are stored in multiple registers, only the
2796  value contained in the first register will be recorded and printed.
2797  To illustrate, in the x86 architecture, eax and edx are used to store
2798  a 64-bit return value, with the lower 32 bits saved in eax and the
2799  upper 32 bits saved in edx. However, only the value stored in eax
2800  will be recorded and printed.
2801
2802- In certain procedure call standards, such as arm64's AAPCS64, when a
2803  type is smaller than a GPR, it is the responsibility of the consumer
2804  to perform the narrowing, and the upper bits may contain UNKNOWN values.
2805  Therefore, it is advisable to check the code for such cases. For instance,
2806  when using a u8 in a 64-bit GPR, bits [63:8] may contain arbitrary values,
2807  especially when larger types are truncated, whether explicitly or implicitly.
2808  Here are some specific cases to illustrate this point:
2809
2810  **Case One**:
2811
2812  The function narrow_to_u8 is defined as follows::
2813
2814	u8 narrow_to_u8(u64 val)
2815	{
2816		// implicitly truncated
2817		return val;
2818	}
2819
2820  It may be compiled to::
2821
2822	narrow_to_u8:
2823		< ... ftrace instrumentation ... >
2824		RET
2825
2826  If you pass 0x123456789abcdef to this function and want to narrow it,
2827  it may be recorded as 0x123456789abcdef instead of 0xef.
2828
2829  **Case Two**:
2830
2831  The function error_if_not_4g_aligned is defined as follows::
2832
2833	int error_if_not_4g_aligned(u64 val)
2834	{
2835		if (val & GENMASK(31, 0))
2836			return -EINVAL;
2837
2838		return 0;
2839	}
2840
2841  It could be compiled to::
2842
2843	error_if_not_4g_aligned:
2844		CBNZ    w0, .Lnot_aligned
2845		RET			// bits [31:0] are zero, bits
2846					// [63:32] are UNKNOWN
2847	.Lnot_aligned:
2848		MOV    x0, #-EINVAL
2849		RET
2850
2851  When passing 0x2_0000_0000 to it, the return value may be recorded as
2852  0x2_0000_0000 instead of 0.
2853
2854You can put some comments on specific functions by using
2855trace_printk() For example, if you want to put a comment inside
2856the __might_sleep() function, you just have to include
2857<linux/ftrace.h> and call trace_printk() inside __might_sleep()::
2858
2859	trace_printk("I'm a comment!\n")
2860
2861will produce::
2862
2863   1)               |             __might_sleep() {
2864   1)               |                /* I'm a comment! */
2865   1)   1.449 us    |             }
2866
2867
2868You might find other useful features for this tracer in the
2869following "dynamic ftrace" section such as tracing only specific
2870functions or tasks.
2871
2872dynamic ftrace
2873--------------
2874
2875If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2876virtually no overhead when function tracing is disabled. The way
2877this works is the mcount function call (placed at the start of
2878every kernel function, produced by the -pg switch in gcc),
2879starts of pointing to a simple return. (Enabling FTRACE will
2880include the -pg switch in the compiling of the kernel.)
2881
2882At compile time every C file object is run through the
2883recordmcount program (located in the scripts directory). This
2884program will parse the ELF headers in the C object to find all
2885the locations in the .text section that call mcount. Starting
2886with gcc version 4.6, the -mfentry has been added for x86, which
2887calls "__fentry__" instead of "mcount". Which is called before
2888the creation of the stack frame.
2889
2890Note, not all sections are traced. They may be prevented by either
2891a notrace, or blocked another way and all inline functions are not
2892traced. Check the "available_filter_functions" file to see what functions
2893can be traced.
2894
2895A section called "__mcount_loc" is created that holds
2896references to all the mcount/fentry call sites in the .text section.
2897The recordmcount program re-links this section back into the
2898original object. The final linking stage of the kernel will add all these
2899references into a single table.
2900
2901On boot up, before SMP is initialized, the dynamic ftrace code
2902scans this table and updates all the locations into nops. It
2903also records the locations, which are added to the
2904available_filter_functions list.  Modules are processed as they
2905are loaded and before they are executed.  When a module is
2906unloaded, it also removes its functions from the ftrace function
2907list. This is automatic in the module unload code, and the
2908module author does not need to worry about it.
2909
2910When tracing is enabled, the process of modifying the function
2911tracepoints is dependent on architecture. The old method is to use
2912kstop_machine to prevent races with the CPUs executing code being
2913modified (which can cause the CPU to do undesirable things, especially
2914if the modified code crosses cache (or page) boundaries), and the nops are
2915patched back to calls. But this time, they do not call mcount
2916(which is just a function stub). They now call into the ftrace
2917infrastructure.
2918
2919The new method of modifying the function tracepoints is to place
2920a breakpoint at the location to be modified, sync all CPUs, modify
2921the rest of the instruction not covered by the breakpoint. Sync
2922all CPUs again, and then remove the breakpoint with the finished
2923version to the ftrace call site.
2924
2925Some archs do not even need to monkey around with the synchronization,
2926and can just slap the new code on top of the old without any
2927problems with other CPUs executing it at the same time.
2928
2929One special side-effect to the recording of the functions being
2930traced is that we can now selectively choose which functions we
2931wish to trace and which ones we want the mcount calls to remain
2932as nops.
2933
2934Two files are used, one for enabling and one for disabling the
2935tracing of specified functions. They are:
2936
2937  set_ftrace_filter
2938
2939and
2940
2941  set_ftrace_notrace
2942
2943A list of available functions that you can add to these files is
2944listed in:
2945
2946   available_filter_functions
2947
2948::
2949
2950  # cat available_filter_functions
2951  put_prev_task_idle
2952  kmem_cache_create
2953  pick_next_task_rt
2954  cpus_read_lock
2955  pick_next_task_fair
2956  mutex_lock
2957  [...]
2958
2959If I am only interested in sys_nanosleep and hrtimer_interrupt::
2960
2961  # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2962  # echo function > current_tracer
2963  # echo 1 > tracing_on
2964  # usleep 1
2965  # echo 0 > tracing_on
2966  # cat trace
2967  # tracer: function
2968  #
2969  # entries-in-buffer/entries-written: 5/5   #P:4
2970  #
2971  #                              _-----=> irqs-off
2972  #                             / _----=> need-resched
2973  #                            | / _---=> hardirq/softirq
2974  #                            || / _--=> preempt-depth
2975  #                            ||| /     delay
2976  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2977  #              | |       |   ||||       |         |
2978            usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
2979            <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2980            usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2981            <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2982            <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2983
2984To see which functions are being traced, you can cat the file:
2985::
2986
2987  # cat set_ftrace_filter
2988  hrtimer_interrupt
2989  sys_nanosleep
2990
2991
2992Perhaps this is not enough. The filters also allow glob(7) matching.
2993
2994  ``<match>*``
2995	will match functions that begin with <match>
2996  ``*<match>``
2997	will match functions that end with <match>
2998  ``*<match>*``
2999	will match functions that have <match> in it
3000  ``<match1>*<match2>``
3001	will match functions that begin with <match1> and end with <match2>
3002
3003.. note::
3004      It is better to use quotes to enclose the wild cards,
3005      otherwise the shell may expand the parameters into names
3006      of files in the local directory.
3007
3008::
3009
3010  # echo 'hrtimer_*' > set_ftrace_filter
3011
3012Produces::
3013
3014  # tracer: function
3015  #
3016  # entries-in-buffer/entries-written: 897/897   #P:4
3017  #
3018  #                              _-----=> irqs-off
3019  #                             / _----=> need-resched
3020  #                            | / _---=> hardirq/softirq
3021  #                            || / _--=> preempt-depth
3022  #                            ||| /     delay
3023  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3024  #              | |       |   ||||       |         |
3025            <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
3026            <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
3027            <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
3028            <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
3029            <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
3030            <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
3031            <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
3032            <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
3033
3034Notice that we lost the sys_nanosleep.
3035::
3036
3037  # cat set_ftrace_filter
3038  hrtimer_run_queues
3039  hrtimer_run_pending
3040  hrtimer_init
3041  hrtimer_cancel
3042  hrtimer_try_to_cancel
3043  hrtimer_forward
3044  hrtimer_start
3045  hrtimer_reprogram
3046  hrtimer_force_reprogram
3047  hrtimer_get_next_event
3048  hrtimer_interrupt
3049  hrtimer_nanosleep
3050  hrtimer_wakeup
3051  hrtimer_get_remaining
3052  hrtimer_get_res
3053  hrtimer_init_sleeper
3054
3055
3056This is because the '>' and '>>' act just like they do in bash.
3057To rewrite the filters, use '>'
3058To append to the filters, use '>>'
3059
3060To clear out a filter so that all functions will be recorded
3061again::
3062
3063 # echo > set_ftrace_filter
3064 # cat set_ftrace_filter
3065 #
3066
3067Again, now we want to append.
3068
3069::
3070
3071  # echo sys_nanosleep > set_ftrace_filter
3072  # cat set_ftrace_filter
3073  sys_nanosleep
3074  # echo 'hrtimer_*' >> set_ftrace_filter
3075  # cat set_ftrace_filter
3076  hrtimer_run_queues
3077  hrtimer_run_pending
3078  hrtimer_init
3079  hrtimer_cancel
3080  hrtimer_try_to_cancel
3081  hrtimer_forward
3082  hrtimer_start
3083  hrtimer_reprogram
3084  hrtimer_force_reprogram
3085  hrtimer_get_next_event
3086  hrtimer_interrupt
3087  sys_nanosleep
3088  hrtimer_nanosleep
3089  hrtimer_wakeup
3090  hrtimer_get_remaining
3091  hrtimer_get_res
3092  hrtimer_init_sleeper
3093
3094
3095The set_ftrace_notrace prevents those functions from being
3096traced.
3097::
3098
3099  # echo '*preempt*' '*lock*' > set_ftrace_notrace
3100
3101Produces::
3102
3103  # tracer: function
3104  #
3105  # entries-in-buffer/entries-written: 39608/39608   #P:4
3106  #
3107  #                              _-----=> irqs-off
3108  #                             / _----=> need-resched
3109  #                            | / _---=> hardirq/softirq
3110  #                            || / _--=> preempt-depth
3111  #                            ||| /     delay
3112  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3113  #              | |       |   ||||       |         |
3114              bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
3115              bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
3116              bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
3117              bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
3118              bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
3119              bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
3120              bash-1994  [000] ....  4342.324899: do_truncate <-do_last
3121              bash-1994  [000] ....  4342.324899: setattr_should_drop_suidgid <-do_truncate
3122              bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
3123              bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
3124              bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
3125              bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
3126
3127We can see that there's no more lock or preempt tracing.
3128
3129Selecting function filters via index
3130------------------------------------
3131
3132Because processing of strings is expensive (the address of the function
3133needs to be looked up before comparing to the string being passed in),
3134an index can be used as well to enable functions. This is useful in the
3135case of setting thousands of specific functions at a time. By passing
3136in a list of numbers, no string processing will occur. Instead, the function
3137at the specific location in the internal array (which corresponds to the
3138functions in the "available_filter_functions" file), is selected.
3139
3140::
3141
3142  # echo 1 > set_ftrace_filter
3143
3144Will select the first function listed in "available_filter_functions"
3145
3146::
3147
3148  # head -1 available_filter_functions
3149  trace_initcall_finish_cb
3150
3151  # cat set_ftrace_filter
3152  trace_initcall_finish_cb
3153
3154  # head -50 available_filter_functions | tail -1
3155  x86_pmu_commit_txn
3156
3157  # echo 1 50 > set_ftrace_filter
3158  # cat set_ftrace_filter
3159  trace_initcall_finish_cb
3160  x86_pmu_commit_txn
3161
3162Dynamic ftrace with the function graph tracer
3163---------------------------------------------
3164
3165Although what has been explained above concerns both the
3166function tracer and the function-graph-tracer, there are some
3167special features only available in the function-graph tracer.
3168
3169If you want to trace only one function and all of its children,
3170you just have to echo its name into set_graph_function::
3171
3172 echo __do_fault > set_graph_function
3173
3174will produce the following "expanded" trace of the __do_fault()
3175function::
3176
3177   0)               |  __do_fault() {
3178   0)               |    filemap_fault() {
3179   0)               |      find_lock_page() {
3180   0)   0.804 us    |        find_get_page();
3181   0)               |        __might_sleep() {
3182   0)   1.329 us    |        }
3183   0)   3.904 us    |      }
3184   0)   4.979 us    |    }
3185   0)   0.653 us    |    _spin_lock();
3186   0)   0.578 us    |    page_add_file_rmap();
3187   0)   0.525 us    |    native_set_pte_at();
3188   0)   0.585 us    |    _spin_unlock();
3189   0)               |    unlock_page() {
3190   0)   0.541 us    |      page_waitqueue();
3191   0)   0.639 us    |      __wake_up_bit();
3192   0)   2.786 us    |    }
3193   0) + 14.237 us   |  }
3194   0)               |  __do_fault() {
3195   0)               |    filemap_fault() {
3196   0)               |      find_lock_page() {
3197   0)   0.698 us    |        find_get_page();
3198   0)               |        __might_sleep() {
3199   0)   1.412 us    |        }
3200   0)   3.950 us    |      }
3201   0)   5.098 us    |    }
3202   0)   0.631 us    |    _spin_lock();
3203   0)   0.571 us    |    page_add_file_rmap();
3204   0)   0.526 us    |    native_set_pte_at();
3205   0)   0.586 us    |    _spin_unlock();
3206   0)               |    unlock_page() {
3207   0)   0.533 us    |      page_waitqueue();
3208   0)   0.638 us    |      __wake_up_bit();
3209   0)   2.793 us    |    }
3210   0) + 14.012 us   |  }
3211
3212You can also expand several functions at once::
3213
3214 echo sys_open > set_graph_function
3215 echo sys_close >> set_graph_function
3216
3217Now if you want to go back to trace all functions you can clear
3218this special filter via::
3219
3220 echo > set_graph_function
3221
3222
3223ftrace_enabled
3224--------------
3225
3226Note, the proc sysctl ftrace_enable is a big on/off switch for the
3227function tracer. By default it is enabled (when function tracing is
3228enabled in the kernel). If it is disabled, all function tracing is
3229disabled. This includes not only the function tracers for ftrace, but
3230also for any other uses (perf, kprobes, stack tracing, profiling, etc). It
3231cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set
3232registered.
3233
3234Please disable this with care.
3235
3236This can be disable (and enabled) with::
3237
3238  sysctl kernel.ftrace_enabled=0
3239  sysctl kernel.ftrace_enabled=1
3240
3241 or
3242
3243  echo 0 > /proc/sys/kernel/ftrace_enabled
3244  echo 1 > /proc/sys/kernel/ftrace_enabled
3245
3246
3247Filter commands
3248---------------
3249
3250A few commands are supported by the set_ftrace_filter interface.
3251Trace commands have the following format::
3252
3253  <function>:<command>:<parameter>
3254
3255The following commands are supported:
3256
3257- mod:
3258  This command enables function filtering per module. The
3259  parameter defines the module. For example, if only the write*
3260  functions in the ext3 module are desired, run:
3261
3262   echo 'write*:mod:ext3' > set_ftrace_filter
3263
3264  This command interacts with the filter in the same way as
3265  filtering based on function names. Thus, adding more functions
3266  in a different module is accomplished by appending (>>) to the
3267  filter file. Remove specific module functions by prepending
3268  '!'::
3269
3270   echo '!writeback*:mod:ext3' >> set_ftrace_filter
3271
3272  Mod command supports module globbing. Disable tracing for all
3273  functions except a specific module::
3274
3275   echo '!*:mod:!ext3' >> set_ftrace_filter
3276
3277  Disable tracing for all modules, but still trace kernel::
3278
3279   echo '!*:mod:*' >> set_ftrace_filter
3280
3281  Enable filter only for kernel::
3282
3283   echo '*write*:mod:!*' >> set_ftrace_filter
3284
3285  Enable filter for module globbing::
3286
3287   echo '*write*:mod:*snd*' >> set_ftrace_filter
3288
3289- traceon/traceoff:
3290  These commands turn tracing on and off when the specified
3291  functions are hit. The parameter determines how many times the
3292  tracing system is turned on and off. If unspecified, there is
3293  no limit. For example, to disable tracing when a schedule bug
3294  is hit the first 5 times, run::
3295
3296   echo '__schedule_bug:traceoff:5' > set_ftrace_filter
3297
3298  To always disable tracing when __schedule_bug is hit::
3299
3300   echo '__schedule_bug:traceoff' > set_ftrace_filter
3301
3302  These commands are cumulative whether or not they are appended
3303  to set_ftrace_filter. To remove a command, prepend it by '!'
3304  and drop the parameter::
3305
3306   echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
3307
3308  The above removes the traceoff command for __schedule_bug
3309  that have a counter. To remove commands without counters::
3310
3311   echo '!__schedule_bug:traceoff' > set_ftrace_filter
3312
3313- snapshot:
3314  Will cause a snapshot to be triggered when the function is hit.
3315  ::
3316
3317   echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
3318
3319  To only snapshot once:
3320  ::
3321
3322   echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
3323
3324  To remove the above commands::
3325
3326   echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
3327   echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
3328
3329- enable_event/disable_event:
3330  These commands can enable or disable a trace event. Note, because
3331  function tracing callbacks are very sensitive, when these commands
3332  are registered, the trace point is activated, but disabled in
3333  a "soft" mode. That is, the tracepoint will be called, but
3334  just will not be traced. The event tracepoint stays in this mode
3335  as long as there's a command that triggers it.
3336  ::
3337
3338   echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
3339   	 set_ftrace_filter
3340
3341  The format is::
3342
3343    <function>:enable_event:<system>:<event>[:count]
3344    <function>:disable_event:<system>:<event>[:count]
3345
3346  To remove the events commands::
3347
3348   echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
3349   	 set_ftrace_filter
3350   echo '!schedule:disable_event:sched:sched_switch' > \
3351   	 set_ftrace_filter
3352
3353- dump:
3354  When the function is hit, it will dump the contents of the ftrace
3355  ring buffer to the console. This is useful if you need to debug
3356  something, and want to dump the trace when a certain function
3357  is hit. Perhaps it's a function that is called before a triple
3358  fault happens and does not allow you to get a regular dump.
3359
3360- cpudump:
3361  When the function is hit, it will dump the contents of the ftrace
3362  ring buffer for the current CPU to the console. Unlike the "dump"
3363  command, it only prints out the contents of the ring buffer for the
3364  CPU that executed the function that triggered the dump.
3365
3366- stacktrace:
3367  When the function is hit, a stack trace is recorded.
3368
3369trace_pipe
3370----------
3371
3372The trace_pipe outputs the same content as the trace file, but
3373the effect on the tracing is different. Every read from
3374trace_pipe is consumed. This means that subsequent reads will be
3375different. The trace is live.
3376::
3377
3378  # echo function > current_tracer
3379  # cat trace_pipe > /tmp/trace.out &
3380  [1] 4153
3381  # echo 1 > tracing_on
3382  # usleep 1
3383  # echo 0 > tracing_on
3384  # cat trace
3385  # tracer: function
3386  #
3387  # entries-in-buffer/entries-written: 0/0   #P:4
3388  #
3389  #                              _-----=> irqs-off
3390  #                             / _----=> need-resched
3391  #                            | / _---=> hardirq/softirq
3392  #                            || / _--=> preempt-depth
3393  #                            ||| /     delay
3394  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3395  #              | |       |   ||||       |         |
3396
3397  #
3398  # cat /tmp/trace.out
3399             bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
3400             bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
3401             bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
3402             bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
3403             bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
3404             bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
3405             bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
3406             bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
3407             bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
3408
3409
3410Note, reading the trace_pipe file will block until more input is
3411added. This is contrary to the trace file. If any process opened
3412the trace file for reading, it will actually disable tracing and
3413prevent new entries from being added. The trace_pipe file does
3414not have this limitation.
3415
3416trace entries
3417-------------
3418
3419Having too much or not enough data can be troublesome in
3420diagnosing an issue in the kernel. The file buffer_size_kb is
3421used to modify the size of the internal trace buffers. The
3422number listed is the number of entries that can be recorded per
3423CPU. To know the full size, multiply the number of possible CPUs
3424with the number of entries.
3425::
3426
3427  # cat buffer_size_kb
3428  1408 (units kilobytes)
3429
3430Or simply read buffer_total_size_kb
3431::
3432
3433  # cat buffer_total_size_kb
3434  5632
3435
3436To modify the buffer, simple echo in a number (in 1024 byte segments).
3437::
3438
3439  # echo 10000 > buffer_size_kb
3440  # cat buffer_size_kb
3441  10000 (units kilobytes)
3442
3443It will try to allocate as much as possible. If you allocate too
3444much, it can cause Out-Of-Memory to trigger.
3445::
3446
3447  # echo 1000000000000 > buffer_size_kb
3448  -bash: echo: write error: Cannot allocate memory
3449  # cat buffer_size_kb
3450  85
3451
3452The per_cpu buffers can be changed individually as well:
3453::
3454
3455  # echo 10000 > per_cpu/cpu0/buffer_size_kb
3456  # echo 100 > per_cpu/cpu1/buffer_size_kb
3457
3458When the per_cpu buffers are not the same, the buffer_size_kb
3459at the top level will just show an X
3460::
3461
3462  # cat buffer_size_kb
3463  X
3464
3465This is where the buffer_total_size_kb is useful:
3466::
3467
3468  # cat buffer_total_size_kb
3469  12916
3470
3471Writing to the top level buffer_size_kb will reset all the buffers
3472to be the same again.
3473
3474Snapshot
3475--------
3476CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
3477available to all non latency tracers. (Latency tracers which
3478record max latency, such as "irqsoff" or "wakeup", can't use
3479this feature, since those are already using the snapshot
3480mechanism internally.)
3481
3482Snapshot preserves a current trace buffer at a particular point
3483in time without stopping tracing. Ftrace swaps the current
3484buffer with a spare buffer, and tracing continues in the new
3485current (=previous spare) buffer.
3486
3487The following tracefs files in "tracing" are related to this
3488feature:
3489
3490  snapshot:
3491
3492	This is used to take a snapshot and to read the output
3493	of the snapshot. Echo 1 into this file to allocate a
3494	spare buffer and to take a snapshot (swap), then read
3495	the snapshot from this file in the same format as
3496	"trace" (described above in the section "The File
3497	System"). Both reads snapshot and tracing are executable
3498	in parallel. When the spare buffer is allocated, echoing
3499	0 frees it, and echoing else (positive) values clear the
3500	snapshot contents.
3501	More details are shown in the table below.
3502
3503	+--------------+------------+------------+------------+
3504	|status\\input |     0      |     1      |    else    |
3505	+==============+============+============+============+
3506	|not allocated |(do nothing)| alloc+swap |(do nothing)|
3507	+--------------+------------+------------+------------+
3508	|allocated     |    free    |    swap    |   clear    |
3509	+--------------+------------+------------+------------+
3510
3511Here is an example of using the snapshot feature.
3512::
3513
3514  # echo 1 > events/sched/enable
3515  # echo 1 > snapshot
3516  # cat snapshot
3517  # tracer: nop
3518  #
3519  # entries-in-buffer/entries-written: 71/71   #P:8
3520  #
3521  #                              _-----=> irqs-off
3522  #                             / _----=> need-resched
3523  #                            | / _---=> hardirq/softirq
3524  #                            || / _--=> preempt-depth
3525  #                            ||| /     delay
3526  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3527  #              | |       |   ||||       |         |
3528            <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120   prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3529             sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120   prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3530  [...]
3531          <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3532
3533  # cat trace
3534  # tracer: nop
3535  #
3536  # entries-in-buffer/entries-written: 77/77   #P:8
3537  #
3538  #                              _-----=> irqs-off
3539  #                             / _----=> need-resched
3540  #                            | / _---=> hardirq/softirq
3541  #                            || / _--=> preempt-depth
3542  #                            ||| /     delay
3543  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3544  #              | |       |   ||||       |         |
3545            <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3546   snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3547  [...]
3548
3549
3550If you try to use this snapshot feature when current tracer is
3551one of the latency tracers, you will get the following results.
3552::
3553
3554  # echo wakeup > current_tracer
3555  # echo 1 > snapshot
3556  bash: echo: write error: Device or resource busy
3557  # cat snapshot
3558  cat: snapshot: Device or resource busy
3559
3560
3561Instances
3562---------
3563In the tracefs tracing directory, there is a directory called "instances".
3564This directory can have new directories created inside of it using
3565mkdir, and removing directories with rmdir. The directory created
3566with mkdir in this directory will already contain files and other
3567directories after it is created.
3568::
3569
3570  # mkdir instances/foo
3571  # ls instances/foo
3572  buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
3573  set_event  snapshot  trace  trace_clock  trace_marker  trace_options
3574  trace_pipe  tracing_on
3575
3576As you can see, the new directory looks similar to the tracing directory
3577itself. In fact, it is very similar, except that the buffer and
3578events are agnostic from the main directory, or from any other
3579instances that are created.
3580
3581The files in the new directory work just like the files with the
3582same name in the tracing directory except the buffer that is used
3583is a separate and new buffer. The files affect that buffer but do not
3584affect the main buffer with the exception of trace_options. Currently,
3585the trace_options affect all instances and the top level buffer
3586the same, but this may change in future releases. That is, options
3587may become specific to the instance they reside in.
3588
3589Notice that none of the function tracer files are there, nor is
3590current_tracer and available_tracers. This is because the buffers
3591can currently only have events enabled for them.
3592::
3593
3594  # mkdir instances/foo
3595  # mkdir instances/bar
3596  # mkdir instances/zoot
3597  # echo 100000 > buffer_size_kb
3598  # echo 1000 > instances/foo/buffer_size_kb
3599  # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
3600  # echo function > current_trace
3601  # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3602  # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3603  # echo 1 > instances/foo/events/sched/sched_switch/enable
3604  # echo 1 > instances/bar/events/irq/enable
3605  # echo 1 > instances/zoot/events/syscalls/enable
3606  # cat trace_pipe
3607  CPU:2 [LOST 11745 EVENTS]
3608              bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3609              bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3610              bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3611              bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3612              bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3613              bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3614              bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3615              bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3616              bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3617              bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3618              bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3619  [...]
3620
3621  # cat instances/foo/trace_pipe
3622              bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3623              bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3624            <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3625            <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3626       rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3627              bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3628              bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3629              bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3630       kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3631       kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3632  [...]
3633
3634  # cat instances/bar/trace_pipe
3635       migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
3636            <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
3637              bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
3638              bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
3639              bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
3640              bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
3641              bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
3642              bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
3643              sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3644              sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
3645              sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
3646              sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
3647  [...]
3648
3649  # cat instances/zoot/trace
3650  # tracer: nop
3651  #
3652  # entries-in-buffer/entries-written: 18996/18996   #P:4
3653  #
3654  #                              _-----=> irqs-off
3655  #                             / _----=> need-resched
3656  #                            | / _---=> hardirq/softirq
3657  #                            || / _--=> preempt-depth
3658  #                            ||| /     delay
3659  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3660  #              | |       |   ||||       |         |
3661              bash-1998  [000] d...   140.733501: sys_write -> 0x2
3662              bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
3663              bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
3664              bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3665              bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
3666              bash-1998  [000] d...   140.733510: sys_close(fd: a)
3667              bash-1998  [000] d...   140.733510: sys_close -> 0x0
3668              bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3669              bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
3670              bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3671              bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
3672
3673You can see that the trace of the top most trace buffer shows only
3674the function tracing. The foo instance displays wakeups and task
3675switches.
3676
3677To remove the instances, simply delete their directories:
3678::
3679
3680  # rmdir instances/foo
3681  # rmdir instances/bar
3682  # rmdir instances/zoot
3683
3684Note, if a process has a trace file open in one of the instance
3685directories, the rmdir will fail with EBUSY.
3686
3687
3688Stack trace
3689-----------
3690Since the kernel has a fixed sized stack, it is important not to
3691waste it in functions. A kernel developer must be conscious of
3692what they allocate on the stack. If they add too much, the system
3693can be in danger of a stack overflow, and corruption will occur,
3694usually leading to a system panic.
3695
3696There are some tools that check this, usually with interrupts
3697periodically checking usage. But if you can perform a check
3698at every function call that will become very useful. As ftrace provides
3699a function tracer, it makes it convenient to check the stack size
3700at every function call. This is enabled via the stack tracer.
3701
3702CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3703To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3704::
3705
3706 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3707
3708You can also enable it from the kernel command line to trace
3709the stack size of the kernel during boot up, by adding "stacktrace"
3710to the kernel command line parameter.
3711
3712After running it for a few minutes, the output looks like:
3713::
3714
3715  # cat stack_max_size
3716  2928
3717
3718  # cat stack_trace
3719          Depth    Size   Location    (18 entries)
3720          -----    ----   --------
3721    0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
3722    1)     2704     160   find_busiest_group+0x31/0x1f1
3723    2)     2544     256   load_balance+0xd9/0x662
3724    3)     2288      80   idle_balance+0xbb/0x130
3725    4)     2208     128   __schedule+0x26e/0x5b9
3726    5)     2080      16   schedule+0x64/0x66
3727    6)     2064     128   schedule_timeout+0x34/0xe0
3728    7)     1936     112   wait_for_common+0x97/0xf1
3729    8)     1824      16   wait_for_completion+0x1d/0x1f
3730    9)     1808     128   flush_work+0xfe/0x119
3731   10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
3732   11)     1664      48   input_available_p+0x1d/0x5c
3733   12)     1616      48   n_tty_poll+0x6d/0x134
3734   13)     1568      64   tty_poll+0x64/0x7f
3735   14)     1504     880   do_select+0x31e/0x511
3736   15)      624     400   core_sys_select+0x177/0x216
3737   16)      224      96   sys_select+0x91/0xb9
3738   17)      128     128   system_call_fastpath+0x16/0x1b
3739
3740Note, if -mfentry is being used by gcc, functions get traced before
3741they set up the stack frame. This means that leaf level functions
3742are not tested by the stack tracer when -mfentry is used.
3743
3744Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3745
3746More
3747----
3748More details can be found in the source code, in the `kernel/trace/*.c` files.
3749