perf-sched(1)
=============

NAME
----
perf-sched - Tool to trace/measure scheduler properties (latencies)

SYNOPSIS
--------
[verse]
'perf sched' {record|latency|map|replay|script|timehist|stats}

DESCRIPTION
-----------
There are several variants of 'perf sched':

  'perf sched record <command>' to record the scheduling events
  of an arbitrary workload.

  'perf sched latency' to report the per task scheduling latencies
  and other scheduling properties of the workload.

     Example usage:
         perf sched record -- sleep 1
         perf sched latency

  -------------------------------------------------------------------------------------------------------------------------------------------
   Task                 | Runtime ms | Count | Avg delay ms  | Max delay ms  | Max delay start            | Max delay end            |
  -------------------------------------------------------------------------------------------------------------------------------------------
   perf:(2)             |   2.804 ms |    66 | avg: 0.524 ms | max: 1.069 ms | max start: 254752.314960 s | max end: 254752.316029 s
   NetworkManager:1343  |   0.372 ms |    13 | avg: 0.008 ms | max: 0.013 ms | max start: 254751.551153 s | max end: 254751.551166 s
   kworker/1:2-xfs:4649 |   0.012 ms |     1 | avg: 0.008 ms | max: 0.008 ms | max start: 254751.519807 s | max end: 254751.519815 s
   kworker/3:1-xfs:388  |   0.011 ms |     1 | avg: 0.006 ms | max: 0.006 ms | max start: 254751.519809 s | max end: 254751.519815 s
   sleep:147736         |   0.938 ms |     3 | avg: 0.006 ms | max: 0.007 ms | max start: 254751.313817 s | max end: 254751.313824 s

  It shows Runtime (the time a task spent actually running on the CPU),
  Count (the number of times a delay was calculated) and delay (the time
  a task was ready to run but was kept waiting).

  Tasks with the same command name are merged and the merge count is
  given within parentheses. However, if the -p option is used, the pid
  is shown instead.
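How the three delay columns relate can be sketched in a few lines. This is a hypothetical illustration, not perf's implementation: the `latency_columns` helper and the per-wakeup delay values are made up for the example.

```python
# Hypothetical sketch (not perf internals): how the latency columns relate.
# Each sample is the scheduling delay of one wakeup, i.e. the time between
# a task becoming runnable and actually getting the CPU.
def latency_columns(delays_ms):
    count = len(delays_ms)            # "Count": number of delays measured
    avg = sum(delays_ms) / count      # "Avg delay ms"
    return count, round(avg, 3), max(delays_ms)  # + "Max delay ms"

# made-up per-wakeup delays for a single task
print(latency_columns([0.524, 1.069, 0.101]))  # -> (3, 0.565, 1.069)
```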
  'perf sched script' to see a detailed trace of the workload that
  was recorded (aliased to 'perf script' for now).

  'perf sched replay' to simulate the workload that was recorded
  via perf sched record. (this is done by starting up mockup threads
  that mimic the workload based on the events in the trace. These
  threads can then replay the timings (CPU runtime and sleep patterns)
  of the workload as it occurred when it was recorded - and can repeat
  it a number of times, measuring its performance.)

  'perf sched map' to print a textual context-switching outline of
  workload captured via perf sched record. Columns stand for
  individual CPUs, and the two-letter shortcuts stand for tasks that
  are running on a CPU. A '*' denotes the CPU that had the event, and
  a dot signals an idle CPU.

  'perf sched timehist' provides an analysis of scheduling events.

     Example usage:
         perf sched record -- sleep 1
         perf sched timehist

  By default it shows the individual schedule events, including the wait
  time (time between sched-out and next sched-in events for the task), the
  task scheduling delay (time between runnable and actually running) and
  run time for the task:

             time    cpu  task name             wait time  sch delay   run time
                          [tid/pid]                (msec)     (msec)     (msec)
   -------------- ------  --------------------  ---------  ---------  ---------
     79371.874569 [0011]  gcc[31949]                0.014      0.000      1.148
     79371.874591 [0010]  gcc[31951]                0.000      0.000      0.024
     79371.874603 [0010]  migration/10[59]          3.350      0.004      0.011
     79371.874604 [0011]  <idle>                    1.148      0.000      0.035
     79371.874723 [0005]  <idle>                    0.016      0.000      1.383
     79371.874746 [0005]  gcc[31949]                0.153      0.078      0.022
     ...

  Times are in msec.usec.

  'perf sched stats {record | report | diff} <command>' to capture the
  schedstat counters, report them, and show the difference between two
  reports, respectively. The schedstat counters are present in the Linux
  kernel and are exposed through the file ``/proc/schedstat``. They are
  enabled or disabled via the sysctl governed by the file
  ``/proc/sys/kernel/sched_schedstats``. These counters account for many
  scheduler events such as ``schedule()`` calls, load-balancing events,
  and ``try_to_wake_up()`` calls, among others. This is useful in
  understanding the scheduler behavior for the workload.

  Note: The tool will not give correct results if CPUs are topologically
  reordered or taken online/offline between capturing the snapshots of
  `/proc/schedstat`.

     Example usage:
         perf sched stats record -- sleep 1
         perf sched stats report
         perf sched stats diff

  A detailed description of the schedstats can be found in the Kernel
  Documentation:
  https://www.kernel.org/doc/html/latest/scheduler/sched-stats.html

The result can be interpreted as follows:

The `perf sched stats report` starts with a description of the columns
present in the report. These column names are given before the CPU and
domain stats to improve the readability of the report.

----------------------------------------------------------------------------------------------------
DESC         ->  Description of the field
COUNT        ->  Value of the field
PCT_CHANGE   ->  Percent change with corresponding base value
AVG_JIFFIES  ->  Avg time in jiffies between two consecutive occurrences of the event
----------------------------------------------------------------------------------------------------

Next is the total profiling time in terms of jiffies:

----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                                    :        2323
----------------------------------------------------------------------------------------------------

Next are the CPU scheduling statistics.
These are simple diffs of the /proc/schedstat CPU lines,
along with descriptions. The report also prints the percent change relative to
the base stat.

In the example below, schedule() left CPU0 idle 36.58% of the time, 0.45% of
all try_to_wake_up() calls were to wake up the local CPU, and the total wait
time of tasks on CPU0 is 48.70% of the total runtime of tasks on the same CPU.

----------------------------------------------------------------------------------------------------
CPU 0
----------------------------------------------------------------------------------------------------
DESC                                                         COUNT       PCT_CHANGE
----------------------------------------------------------------------------------------------------
yld_count                                      :                 0
array_exp                                      :                 0
sched_count                                    :            402267
sched_goidle                                   :            147161    (  36.58% )
ttwu_count                                     :            236309
ttwu_local                                     :              1062    (   0.45% )
rq_cpu_time                                    :        7083791148
run_delay                                      :        3449973971    (  48.70% )
pcount                                         :            255035
----------------------------------------------------------------------------------------------------

Next are the load balancing statistics. For each of the sched domains
(e.g. `SMT`, `MC`, `DIE` ...), the scheduler computes statistics under
the following three categories:

1) Idle Load Balance: Load balancing performed on behalf of a long
   idling CPU by some other CPU.
2) Busy Load Balance: Load balancing performed when the CPU was busy.
3) New Idle Balance: Load balancing performed when a CPU just became
   idle.

Under each of these three categories, sched stats report provides
different load balancing statistics. Along with the direct stats, the
report also contains derived metrics prefixed with *.
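The derived metrics can be reconstructed from the direct counters. A minimal sketch, assuming (consistently with the numbers in the example that follows) that the success count is the number of balance attempts that were neither already balanced nor failed, and that avg_pulled is tasks gained per successful attempt; the `derived` helper is illustrative only, not part of perf:

```python
def derived(count, balanced, failed, gained):
    # *_lb_success_count: balance attempts that actually moved tasks
    success = count - balanced - failed
    # *_lb_avg_pulled: average number of tasks pulled per successful attempt
    avg_pulled = gained / success if success else 0.0
    return success, round(avg_pulled, 2)

# busy-category counters from the SMT-domain example
print(derived(136, 131, 0, 7))  # -> (5, 1.4)
```

The same relation reproduces the idle (64, 1.05) and newidle (1348, 1.00) derived values in the example.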
Example:

----------------------------------------------------------------------------------------------------
CPU 0, DOMAIN SMT CPUS 0,64
----------------------------------------------------------------------------------------------------
DESC                                                         COUNT       AVG_JIFFIES
----------------------------------------- <Category busy> ------------------------------------------
busy_lb_count                                  :               136    $     17.08 $
busy_lb_balanced                               :               131    $     17.73 $
busy_lb_failed                                 :                 0    $      0.00 $
busy_lb_imbalance_load                         :                58
busy_lb_imbalance_util                         :                 0
busy_lb_imbalance_task                         :                 0
busy_lb_imbalance_misfit                       :                 0
busy_lb_gained                                 :                 7
busy_lb_hot_gained                             :                 0
busy_lb_nobusyq                                :                 2    $   1161.50 $
busy_lb_nobusyg                                :               129    $     18.01 $
*busy_lb_success_count                         :                 5
*busy_lb_avg_pulled                            :              1.40
----------------------------------------- <Category idle> ------------------------------------------
idle_lb_count                                  :               449    $      5.17 $
idle_lb_balanced                               :               382    $      6.08 $
idle_lb_failed                                 :                 3    $    774.33 $
idle_lb_imbalance_load                         :                 0
idle_lb_imbalance_util                         :                 0
idle_lb_imbalance_task                         :                71
idle_lb_imbalance_misfit                       :                 0
idle_lb_gained                                 :                67
idle_lb_hot_gained                             :                 0
idle_lb_nobusyq                                :                 0    $      0.00 $
idle_lb_nobusyg                                :               382    $      6.08 $
*idle_lb_success_count                         :                64
*idle_lb_avg_pulled                            :              1.05
---------------------------------------- <Category newidle> ----------------------------------------
newidle_lb_count                               :             30471    $      0.08 $
newidle_lb_balanced                            :             28490    $      0.08 $
newidle_lb_failed                              :               633    $      3.67 $
newidle_lb_imbalance_load                      :                 0
newidle_lb_imbalance_util                      :                 0
newidle_lb_imbalance_task                      :              2040
newidle_lb_imbalance_misfit                    :                 0
newidle_lb_gained                              :              1348
newidle_lb_hot_gained                          :                 0
newidle_lb_nobusyq                             :                 6    $    387.17 $
newidle_lb_nobusyg                             :             26634    $      0.09 $
*newidle_lb_success_count                      :              1348
*newidle_lb_avg_pulled                         :              1.00
----------------------------------------------------------------------------------------------------
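The AVG_JIFFIES values enclosed in $ signs follow directly from the elapsed time and the event count. A small sketch (the `avg_jiffies` helper is illustrative only, not part of perf):

```python
def avg_jiffies(elapsed_jiffies, count):
    # Average number of jiffies between two consecutive occurrences of
    # the event; a count of 0 means the event never fired while profiling.
    return round(elapsed_jiffies / count, 2) if count else 0.0

# values from the report above: Time elapsed = 2323 jiffies
print(avg_jiffies(2323, 28490))  # newidle_lb_balanced -> 0.08
print(avg_jiffies(2323, 136))    # busy_lb_count       -> 17.08
```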
Consider the following line:

newidle_lb_balanced                            :             28490    $      0.08 $

While profiling was active, the load balancer found 28490 times that the
load needed to be balanced on a newly idle CPU 0. The value enclosed in
$ signs is the average number of jiffies between two such events
(2323 / 28490 = 0.08).

Next are the active_load_balance() stats. alb did not trigger while
profiling was active, hence all values are 0.

--------------------------------- <Category active_load_balance()> ---------------------------------
alb_count                                      :                 0
alb_failed                                     :                 0
alb_pushed                                     :                 0
----------------------------------------------------------------------------------------------------

Next are the sched_balance_exec() and sched_balance_fork() stats. They
are not used anymore but are kept in the RFC just for legacy purposes.
Unless opposed, we plan to remove them in the next revision.

Next are the wakeup statistics. For every domain, the report also shows
task-wakeup statistics. Example:

------------------------------------------ <Wakeup Info> -------------------------------------------
ttwu_wake_remote                               :              1590
ttwu_move_affine                               :                84
ttwu_move_balance                              :                 0
----------------------------------------------------------------------------------------------------

The same set of stats is reported for each CPU and each domain level.

How to interpret the diff
~~~~~~~~~~~~~~~~~~~~~~~~~

The `perf sched stats diff` will also start by explaining the columns
present in the diff. Then it will show the diff in time in terms of
jiffies. The order of the values depends on the order of the input data
files. It will take `perf.data.old` and `perf.data` respectively as the
defaults for comparison.
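The PCT_CHANGE column of the diff is ordinary percent-change arithmetic between the two captures. A small sketch (the `pct_change` helper is illustrative only, not part of perf; the sample values come from the diff example below):

```python
def pct_change(old, new):
    # PCT_CHANGE: change of the second capture relative to the first;
    # a zero base is reported as 0.00, matching the sample output.
    return round((new - old) / old * 100, 2) if old else 0.0

# sched_count went from 528533 to 412573 between the two runs
print(pct_change(528533, 412573))  # -> -21.94
```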
Example:

----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                                    :        2009,        2001
----------------------------------------------------------------------------------------------------

Below is a sample showing the difference in CPU and domain stats between
two runs. The values enclosed in `|...|` (the PCT_CHANGE column) show the
percent change between the two runs, while the COUNT1 and COUNT2 columns
show the corresponding fields from the two `perf sched stats report`
outputs side by side.

----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC                                                COUNT1       COUNT2                    PCT_CHANG>
----------------------------------------------------------------------------------------------------
yld_count                                      :         0,           0                  |    0.00>
array_exp                                      :         0,           0                  |    0.00>
sched_count                                    :    528533,      412573                  |  -21.94>
sched_goidle                                   :    193426,      146082                  |  -24.48>
ttwu_count                                     :    313134,      385975                  |   23.26>
ttwu_local                                     :      1126,        1282                  |   13.85>
rq_cpu_time                                    : 8257200244,  8301250047                 |    0.53>
run_delay                                      : 4728347053,  3997100703                 |  -15.47>
pcount                                         :    335031,      266396                  |  -20.49>
----------------------------------------------------------------------------------------------------

Below is a sample of the domain stats diff:

----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>, DOMAIN SMT
----------------------------------------------------------------------------------------------------
DESC                                                COUNT1       COUNT2                    PCT_CHANG>
----------------------------------------- <Category busy> ------------------------------------------
busy_lb_count                                  :       122,          80                  |  -34.43>
busy_lb_balanced                               :       115,          76                  |  -33.91>
busy_lb_failed                                 :         1,           3                  |  200.00>
busy_lb_imbalance_load                         :        35,          49                  |   40.00>
busy_lb_imbalance_util                         :         0,           0                  |    0.00>
busy_lb_imbalance_task                         :         0,           0                  |    0.00>
busy_lb_imbalance_misfit                       :         0,           0                  |    0.00>
busy_lb_gained                                 :         7,           2                  |  -71.43>
busy_lb_hot_gained                             :         0,           0                  |    0.00>
busy_lb_nobusyq                                :         0,           0                  |    0.00>
busy_lb_nobusyg                                :       115,          76                  |  -33.91>
*busy_lb_success_count                         :         6,           1                  |  -83.33>
*busy_lb_avg_pulled                            :      1.17,        2.00                  |   71.43>
----------------------------------------- <Category idle> ------------------------------------------
idle_lb_count                                  :       568,         620                  |    9.15>
idle_lb_balanced                               :       462,         449                  |   -2.81>
idle_lb_failed                                 :        11,          21                  |   90.91>
idle_lb_imbalance_load                         :         0,           0                  |    0.00>
idle_lb_imbalance_util                         :         0,           0                  |    0.00>
idle_lb_imbalance_task                         :       115,         189                  |   64.35>
idle_lb_imbalance_misfit                       :         0,           0                  |    0.00>
idle_lb_gained                                 :       103,         169                  |   64.08>
idle_lb_hot_gained                             :         0,           0                  |    0.00>
idle_lb_nobusyq                                :         0,           0                  |    0.00>
idle_lb_nobusyg                                :       462,         449                  |   -2.81>
*idle_lb_success_count                         :        95,         150                  |   57.89>
*idle_lb_avg_pulled                            :      1.08,        1.13                  |    3.92>
---------------------------------------- <Category newidle> ----------------------------------------
newidle_lb_count                               :     16961,        3155                  |  -81.40>
newidle_lb_balanced                            :     15646,        2556                  |  -83.66>
newidle_lb_failed                              :       397,         142                  |  -64.23>
newidle_lb_imbalance_load                      :         0,           0                  |    0.00>
newidle_lb_imbalance_util                      :         0,           0                  |    0.00>
newidle_lb_imbalance_task                      :      1376,         655                  |  -52.40>
newidle_lb_imbalance_misfit                    :         0,           0                  |    0.00>
newidle_lb_gained                              :       917,         457                  |  -50.16>
newidle_lb_hot_gained                          :         0,           0                  |    0.00>
newidle_lb_nobusyq                             :         3,           1                  |  -66.67>
newidle_lb_nobusyg                             :     14480,        2103                  |  -85.48>
*newidle_lb_success_count                      :       918,         457                  |  -50.22>
*newidle_lb_avg_pulled                         :      1.00,        1.00                  |    0.11>
--------------------------------- <Category active_load_balance()> ---------------------------------
alb_count                                      :         0,           1                  |    0.00>
alb_failed                                     :         0,           0                  |    0.00>
alb_pushed                                     :         0,           1                  |    0.00>
---------------------------------- <Category sched_balance_exec()> ---------------------------------
sbe_count                                      :         0,           0                  |    0.00>
sbe_balanced                                   :         0,           0                  |    0.00>
sbe_pushed                                     :         0,           0                  |    0.00>
---------------------------------- <Category sched_balance_fork()> ---------------------------------
sbf_count                                      :         0,           0                  |    0.00>
sbf_balanced                                   :         0,           0                  |    0.00>
sbf_pushed                                     :         0,           0                  |    0.00>
------------------------------------------ <Wakeup Info> -------------------------------------------
ttwu_wake_remote                               :      2031,        2914                  |   43.48>
ttwu_move_affine                               :        73,         124                  |   69.86>
ttwu_move_balance                              :         0,           0                  |    0.00>
----------------------------------------------------------------------------------------------------

OPTIONS
-------
Applicable to {record|latency|map|replay|script}

-i::
--input=<file>::
        Input file name. (default: perf.data unless stdin is a fifo)

-v::
--verbose::
        Be more verbose. (show symbol address, etc)

-D::
--dump-raw-trace=::
        Display verbose dump of the sched data.

-f::
--force::
        Don't complain, do it.

OPTIONS for 'perf sched latency'
--------------------------------

-C::
--CPU <n>::
        CPU to profile on.

-p::
--pids::
        latency stats per pid instead of per command name.

-s::
--sort <key[,key2...]>::
        sort by key(s): runtime, switch, avg, max
        by default it's sorted by "avg, max, switch, runtime".

OPTIONS for 'perf sched map'
----------------------------

--compact::
        Show only CPUs with activity. Helps visualizing on high
        core-count systems.

--cpus::
        Show just entries with activities for the given CPUs.

--color-cpus::
        Highlight the given CPUs.

--color-pids::
        Highlight the given pids.

--task-name <task>::
        Map output only for the given task name(s). Separate the
        task names with a comma (without whitespace). The sched-out
        time is printed and is represented by '*-' for the given
        task name(s).
        ('-' indicates other tasks while '.' is idle).

--fuzzy-name::
        Given task name(s) can be partially matched (fuzzy matching).

OPTIONS for 'perf sched timehist'
---------------------------------
-k::
--vmlinux=<file>::
        vmlinux pathname

--kallsyms=<file>::
        kallsyms pathname

-g::
--call-graph::
        Display call chains if present (default on).

--max-stack::
        Maximum number of functions to display in backtrace, default 5.

-C=::
--cpu=::
        Only show events for the given CPU(s) (comma separated list).

-p=::
--pid=::
        Only show events for given process ID (comma separated list).

-t=::
--tid=::
        Only show events for given thread ID (comma separated list).

-s::
--summary::
        Show only a summary of scheduling by thread with min, max, and average
        run times (in sec) and relative stddev.

-S::
--with-summary::
        Show all scheduling events followed by a summary by thread with min,
        max, and average run times (in sec) and relative stddev.

--symfs=<directory>::
        Look for files with symbols relative to this directory.

-V::
--cpu-visual::
        Show a visual aid for sched switches by CPU: 'i' marks idle time,
        's' marks scheduler events.

-w::
--wakeups::
        Show wakeup events.

-M::
--migrations::
        Show migration events.

-n::
--next::
        Show next task.

-I::
--idle-hist::
        Show idle-related events only.

--time::
        Only analyze samples within given time window: <start>,<stop>. Times
        have the format seconds.microseconds. If start is not given (i.e.,
        time string is ',x.y') then analysis starts at the beginning of the
        file. If stop time is not given (i.e., time string is 'x.y,') then
        analysis goes to the end of the file.

--state::
        Show task state when it switched out.

--show-prio::
        Show task priority.
--prio::
        Only show events for given task priority(ies). Multiple priorities
        can be provided as a comma-separated list with no spaces: 0,120.
        Ranges of priorities are specified with -: 120-129. A combination
        of both can also be provided: 0,120-129.

-P::
--pre-migrations::
        Show pre-migration wait time: the time spent by a task waiting on
        a runqueue without getting the chance to run there, before being
        migrated to a different runqueue where it finally runs. This time
        between sched_wakeup and migrate_task is the pre-migration wait
        time.

OPTIONS for 'perf sched replay'
-------------------------------

-r::
--repeat <n>::
        repeat the workload n times (0: infinite). Default is 10.

SEE ALSO
--------
linkperf:perf-record[1]