CPU and latency overheads
-------------------------
There are two notions of time: wall-clock time and CPU time.
For a single-threaded program, or a program running on a single-core machine,
these notions are the same. However, for a multi-threaded/multi-process program
running on a multi-core machine, these notions are significantly different:
for each second of wall-clock time there are number-of-cores seconds of CPU time.
Perf can measure overhead for both of these times (shown in the 'Overhead' and
'Latency' columns for CPU time and wall-clock time, respectively).

Optimizing CPU overhead is useful to improve 'throughput', while optimizing
latency overhead is useful to improve 'latency'. It's important to understand
which one matters in the concrete situation at hand. For example, the former
may be useful to improve the maximum throughput of a CI build server that runs
at 100% CPU utilization, while the latter may be useful to improve the
user-perceived latency of a single interactive program build.
These overheads may be significantly different in some cases. For example,
consider a program that executes function 'foo' for 9 seconds with 1 thread,
and then executes function 'bar' for 1 second with 128 threads (consuming
128 seconds of CPU time). In total the program uses 137 seconds of CPU time
but only 10 seconds of wall-clock time. The CPU overhead is therefore:
'foo' - 6.6% (9/137), 'bar' - 93.4% (128/137), while the latency overhead is:
'foo' - 90% (9/10), 'bar' - 10% (1/10). If we tried to optimize the running
time of the program by looking at the CPU overhead (the wrong metric in this
case), we would concentrate on the function 'bar', but that can yield only a
10% running-time improvement at best.

By default, perf shows only the CPU overhead. To show the latency overhead as
well, use 'perf record --latency' and 'perf report':

-----------------------------------
Overhead  Latency  Command
  93.88%   25.79%  cc1
   1.90%   39.87%  gzip
   0.99%   10.16%  dpkg-deb
   0.57%    1.00%  as
   0.40%    0.46%  sh
-----------------------------------

To sort by latency overhead, use 'perf report --latency':

-----------------------------------
Latency  Overhead  Command
 39.87%    1.90%   gzip
 25.79%   93.88%   cc1
 10.16%    0.99%   dpkg-deb
  4.17%    0.29%   git
  2.81%    0.11%   objtool
-----------------------------------

To get insight into the difference between the overheads, you may check the
parallelization histogram with the
'--sort=latency,parallelism,comm,symbol --hierarchy' flags. It shows the
fraction of (wall-clock) time the workload utilizes different numbers of cores
(the 'Parallelism' column). For example, in the following case the workload
utilizes only 1 core most of the time, but also has some highly-parallel
phases, which explains the significant difference between the CPU and
wall-clock overheads:

-----------------------------------
  Latency  Overhead  Parallelism / Command / Symbol
+  56.98%     2.29%  1
+  16.94%     1.36%  2
+   4.00%    20.13%  125
+   3.66%    18.25%  124
+   3.48%    17.66%  126
+   3.26%     0.39%  3
+   2.61%    12.93%  123
-----------------------------------

By expanding the corresponding lines, you may see which commands/functions run
at a given parallelism level:

-----------------------------------
  Latency  Overhead  Parallelism / Command / Symbol
-  56.98%     2.29%  1
      32.80%     1.32%   gzip
       4.46%     0.18%   cc1
       2.81%     0.11%   objtool
       2.43%     0.10%   dpkg-source
       2.22%     0.09%   ld
       2.10%     0.08%   dpkg-genchanges
-----------------------------------

To see the normal function-level profile for particular parallelism levels
(the number of threads actively running on CPUs), you may use the
'--parallelism' filter. For example, to see the profile only for the
low-parallelism phases of a workload, use the '--latency --parallelism=1-2'
flags.
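
Putting the pieces together, a typical session might look like the sketch
below, using only the flags described above. The 'make -j"$(nproc)"' build is
just a placeholder workload; substitute any multi-threaded or multi-process
command you want to profile:

-----------------------------------
# Record wall-clock (latency) data in addition to CPU samples.
# (placeholder workload: substitute your own command)
perf record --latency -- make -j"$(nproc)"

# Default view: sorted by CPU overhead, with the 'Latency' column shown:
perf report

# Sort by latency (wall-clock) overhead instead:
perf report --latency

# Parallelization histogram:
perf report --sort=latency,parallelism,comm,symbol --hierarchy

# Function-level profile restricted to low-parallelism phases:
perf report --latency --parallelism=1-2
-----------------------------------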