==============================
Using the tracer for debugging
==============================

Copyright 2024 Google LLC.

:Author: Steven Rostedt <rostedt@goodmis.org>
:License: The GNU Free Documentation License, Version 1.2
          (dual licensed under the GPL v2)

- Written for: 6.12

Introduction
------------
The tracing infrastructure can be very useful for debugging the Linux
kernel. This document is a place to add various methods of using the tracer
for debugging.

First, make sure that the tracefs file system is mounted::

 $ sudo mount -t tracefs tracefs /sys/kernel/tracing


Using trace_printk()
--------------------

trace_printk() is a very lightweight utility that can be used in any context
inside the kernel, with the exception of "noinstr" sections. It can be used
in normal, softirq, interrupt and even NMI context. The trace data is
written to the tracing ring buffer in a lockless way. To make it even
lighter weight, when possible, it only records the pointer to the format
string and saves the raw arguments into the buffer. The format and the
arguments are post-processed when the ring buffer is read. This way the
trace_printk() format conversions are not done in the hot path, where the
trace is being recorded.

trace_printk() is meant only for debugging, and should never be added to
a subsystem of the kernel. If you need debugging traces, add trace events
instead. If a trace_printk() is found in the kernel, the following will
appear in dmesg::

  **********************************************************
  **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
  **                                                      **
  ** trace_printk() being used. Allocating extra memory.  **
  **                                                      **
  ** This means that this is a DEBUG kernel and it is     **
  ** unsafe for production use.                           **
  **                                                      **
  ** If you see this message and you are not debugging    **
  ** the kernel, report this immediately to your vendor!  **
  **                                                      **
  **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
  **********************************************************

Debugging kernel crashes
------------------------
There are various methods of acquiring the state of the system when a kernel
crash occurs. This could be the oops message from printk, or one could
use kexec/kdump. But these only show what happened at the time of the crash.
It can be very useful to know what happened leading up to the crash.
The tracing ring buffer is, by default, a circular buffer that overwrites
older events with newer ones. When a crash happens, the ring buffer holds
the most recent events that led up to the crash.

There are several kernel command line parameters that can help with
this. The first is "ftrace_dump_on_oops". It dumps the tracing ring
buffer to the console when an oops occurs. This can be useful if the console
is being logged somewhere. If a serial console is used, it may be prudent to
keep the ring buffer relatively small, otherwise dumping the ring buffer
may take anywhere from several minutes to hours to finish. Here's an example
of the kernel command line::

  ftrace_dump_on_oops trace_buf_size=50K

Note that the tracing buffer is made up of per CPU buffers, where each of
these buffers is broken up into sub-buffers that are PAGE_SIZE by default.
The trace_buf_size option above sets each of the per CPU buffers to 50K, so
on a machine with 8 CPUs, that's actually 400K total.

Persistent buffers across boots
-------------------------------
If the system memory allows it, the tracing ring buffer can be placed at
a specific location in memory. If the location is the same across boots and
the memory is not modified, the tracing buffer can be retrieved on the
following boot. There are two ways to reserve memory for the use of the ring
buffer.

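Both the trace_buf_size value above and the regions reserved by the two methods below are divided among the per CPU buffers. The C sketch below works through that arithmetic under the same simplifications this document uses: an even split across CPUs, with metadata overhead ignored. All helper names are hypothetical. parse_size() is only a simplified user space stand-in for the K/M/G size suffixes used in command line values such as "50K" and "12M"; it is not the kernel's memparse().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse sizes like "50K", "12M" or "0x2000000".
 * Only the K, M and G suffixes (either case) are handled in this sketch. */
static unsigned long long parse_size(const char *s)
{
	char *end;
	/* Base 0 accepts decimal as well as 0x-prefixed hex values. */
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G':
	case 'g':
		v <<= 10;
		/* fall through */
	case 'M':
	case 'm':
		v <<= 10;
		/* fall through */
	case 'K':
	case 'k':
		v <<= 10;
		break;
	default:
		break;
	}
	return v;
}

/* Total memory used when every per CPU buffer gets per_cpu_bytes,
 * as with trace_buf_size. */
static unsigned long long total_buffer_bytes(unsigned long long per_cpu_bytes,
					     unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return per_cpu_bytes * nr_cpus;
}

/* Share of a reserved region (memmap or reserve_mem) that each per CPU
 * buffer receives, assuming an even split and ignoring metadata. */
static unsigned long long per_cpu_share_bytes(unsigned long long region_bytes,
					      unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return region_bytes / nr_cpus;
}
```

For example, total_buffer_bytes(parse_size("50K"), 8) is 409600 bytes (400K), and per_cpu_share_bytes(parse_size("12M"), 8) is 1572864 bytes (1.5M), matching the figures given in this document.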

The more reliable way (on x86) is to reserve memory with the "memmap" kernel
command line option and then use that memory for the trace_instance. This
requires a bit of knowledge of the physical memory layout of the system. The
advantage of using this method is that the memory for the ring buffer will
always be at the same location::

  memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M

The memmap option above reserves 12 megabytes of memory at the physical
memory location 0x284500000. Then the trace_instance option creates a trace
instance "boot_map" at that same location with the same amount of memory
reserved. As the ring buffer is broken up into per CPU buffers, the 12
megabytes are split evenly between those CPUs. If you have 8 CPUs, each
per CPU ring buffer will be 1.5 megabytes in size. Note that this also
includes metadata, so the amount of memory actually used by the ring buffer
will be slightly smaller.

Another, more generic but less robust, way to allocate a ring buffer mapping
at boot is with the "reserve_mem" option::

  reserve_mem=12M:4096:trace trace_instance=boot_map@trace

The reserve_mem option above finds 12 megabytes that are available at
boot up and aligns them to 4096 bytes. It labels this memory as "trace" so
that later command line options can refer to it.

The trace_instance option creates a "boot_map" instance and uses the
memory reserved by reserve_mem that was labeled as "trace". This method is
more generic but may not be as reliable. Due to KASLR, the memory reserved
by reserve_mem may not be at the same location on every boot. If this
happens, the ring buffer will not contain the data from the previous boot
and will be reset.

Using a larger alignment can sometimes keep KASLR from moving things
around in a way that changes the location of the reserve_mem region.
You may find that the buffer is then placed more consistently, for example
with a 32 megabyte (0x2000000) alignment::

  reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace

On boot up, the memory reserved for the ring buffer is validated. It goes
through a series of tests to make sure that the ring buffer contains valid
data. If it does, the buffer is set up so that it can be read from the
instance. If it fails any of the tests, the entire ring buffer is cleared
and initialized as new.

The layout of this mapped memory may not be consistent from kernel to
kernel, so preserving the mapping is only guaranteed to work when booting
the same kernel. Switching to a different kernel version may find a
different layout and mark the buffer as invalid.

Using trace_printk() in the boot instance
-----------------------------------------
By default, the content of trace_printk() goes into the top level tracing
instance. But this instance is never preserved across boots. To have the
trace_printk() content and some other internal tracing (like stack dumps)
go to the preserved buffer, either set the instance to be the trace_printk()
destination from the kernel command line, or set it after boot up via the
trace_printk_dest option.

After boot up::

  echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest

From the kernel command line::

  reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace

If setting it from the kernel command line, it is recommended to also
disable tracing with the "traceoff" flag, and enable tracing after boot up.
Otherwise the trace from the most recent boot will be mixed with the trace
from the previous boot, which may make it confusing to read.
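Routing trace_printk() to the boot instance after boot up, or turning tracing back on in an instance started with the "traceoff" flag, comes down to writing "1" to a control file, just as the echo command above does. The C sketch below shows this from user space. It assumes tracefs is mounted at /sys/kernel/tracing (the root is a parameter so it can be pointed elsewhere), that the instance is named boot_map, and that the program runs with root privileges. The helper names are hypothetical; tracing_on is the per instance file that starts and stops recording.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TRACEFS_ROOT "/sys/kernel/tracing"

/* Build "<root>/instances/<instance>/<file>" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
static int instance_path(char *buf, size_t len, const char *root,
			 const char *instance, const char *file)
{
	assert(buf && root && instance && file);
	int n = snprintf(buf, len, "%s/instances/%s/%s", root, instance, file);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a value such as "1" to a tracefs control file, like
 * "echo 1 > path". Returns 0 on success, -1 on failure. */
static int write_control(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	size_t len = strlen(value);
	int ok = fwrite(value, 1, len, f) == len;
	/* The buffered data reaches the file when the stream is closed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Send trace_printk() output to the instance (options/trace_printk_dest). */
static int set_trace_printk_dest(const char *root, const char *instance)
{
	char path[4096]; /* arbitrary size for this sketch */
	if (instance_path(path, sizeof(path), root, instance,
			  "options/trace_printk_dest") != 0)
		return -1;
	return write_control(path, "1");
}

/* Turn recording on in an instance that was booted with ^traceoff. */
static int enable_instance_tracing(const char *root, const char *instance)
{
	char path[4096]; /* arbitrary size for this sketch */
	if (instance_path(path, sizeof(path), root, instance, "tracing_on") != 0)
		return -1;
	return write_control(path, "1");
}
```

One possible order after boot up is to save the previous boot's data from instances/boot_map/trace first, and only then call enable_instance_tracing(TRACEFS_ROOT, "boot_map"), so that new events are not mixed into the saved copy.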