1*32ecca8dSDmitry VyukovCPU and latency overheads 2*32ecca8dSDmitry Vyukov------------------------- 3*32ecca8dSDmitry VyukovThere are two notions of time: wall-clock time and CPU time. 4*32ecca8dSDmitry VyukovFor a single-threaded program, or a program running on a single-core machine, 5*32ecca8dSDmitry Vyukovthese notions are the same. However, for a multi-threaded/multi-process program 6*32ecca8dSDmitry Vyukovrunning on a multi-core machine, these notions are significantly different. 7*32ecca8dSDmitry VyukovEach second of wall-clock time we have number-of-cores seconds of CPU time. 8*32ecca8dSDmitry VyukovPerf can measure overhead for both of these times (shown in 'overhead' and 9*32ecca8dSDmitry Vyukov'latency' columns for CPU and wall-clock time correspondingly). 10*32ecca8dSDmitry Vyukov 11*32ecca8dSDmitry VyukovOptimizing CPU overhead is useful to improve 'throughput', while optimizing 12*32ecca8dSDmitry Vyukovlatency overhead is useful to improve 'latency'. It's important to understand 13*32ecca8dSDmitry Vyukovwhich one is useful in a concrete situation at hand. For example, the former 14*32ecca8dSDmitry Vyukovmay be useful to improve max throughput of a CI build server that runs on 100% 15*32ecca8dSDmitry VyukovCPU utilization, while the latter may be useful to improve user-perceived 16*32ecca8dSDmitry Vyukovlatency of a single interactive program build. 17*32ecca8dSDmitry VyukovThese overheads may be significantly different in some cases. For example, 18*32ecca8dSDmitry Vyukovconsider a program that executes function 'foo' for 9 seconds with 1 thread, 19*32ecca8dSDmitry Vyukovand then executes function 'bar' for 1 second with 128 threads (consumes 20*32ecca8dSDmitry Vyukov128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%. 21*32ecca8dSDmitry VyukovWhile the latency overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize 22*32ecca8dSDmitry Vyukovrunning time of the program looking at the (wrong in this case) CPU overhead, 23*32ecca8dSDmitry Vyukovwe would concentrate on the function 'bar', but it can yield only 10% running 24*32ecca8dSDmitry Vyukovtime improvement at best. 25*32ecca8dSDmitry Vyukov 26*32ecca8dSDmitry VyukovBy default, perf shows only CPU overhead. To show latency overhead, use 27*32ecca8dSDmitry Vyukov'perf record --latency' and 'perf report': 28*32ecca8dSDmitry Vyukov 29*32ecca8dSDmitry Vyukov----------------------------------- 30*32ecca8dSDmitry VyukovOverhead Latency Command 31*32ecca8dSDmitry Vyukov 93.88% 25.79% cc1 32*32ecca8dSDmitry Vyukov 1.90% 39.87% gzip 33*32ecca8dSDmitry Vyukov 0.99% 10.16% dpkg-deb 34*32ecca8dSDmitry Vyukov 0.57% 1.00% as 35*32ecca8dSDmitry Vyukov 0.40% 0.46% sh 36*32ecca8dSDmitry Vyukov----------------------------------- 37*32ecca8dSDmitry Vyukov 38*32ecca8dSDmitry VyukovTo sort by latency overhead, use 'perf report --latency': 39*32ecca8dSDmitry Vyukov 40*32ecca8dSDmitry Vyukov----------------------------------- 41*32ecca8dSDmitry VyukovLatency Overhead Command 42*32ecca8dSDmitry Vyukov 39.87% 1.90% gzip 43*32ecca8dSDmitry Vyukov 25.79% 93.88% cc1 44*32ecca8dSDmitry Vyukov 10.16% 0.99% dpkg-deb 45*32ecca8dSDmitry Vyukov 4.17% 0.29% git 46*32ecca8dSDmitry Vyukov 2.81% 0.11% objtool 47*32ecca8dSDmitry Vyukov----------------------------------- 48*32ecca8dSDmitry Vyukov 49*32ecca8dSDmitry VyukovTo get insight into the difference between the overheads, you may check 50*32ecca8dSDmitry Vyukovparallelization histogram with '--sort=latency,parallelism,comm,symbol --hierarchy' 51*32ecca8dSDmitry Vyukovflags. It shows fraction of (wall-clock) time the workload utilizes different 52*32ecca8dSDmitry Vyukovnumbers of cores ('Parallelism' column). For example, in the following case 53*32ecca8dSDmitry Vyukovthe workload utilizes only 1 core most of the time, but also has some 54*32ecca8dSDmitry Vyukovhighly-parallel phases, which explains significant difference between 55*32ecca8dSDmitry VyukovCPU and wall-clock overheads: 56*32ecca8dSDmitry Vyukov 57*32ecca8dSDmitry Vyukov----------------------------------- 58*32ecca8dSDmitry Vyukov Latency Overhead Parallelism / Command / Symbol 59*32ecca8dSDmitry Vyukov+ 56.98% 2.29% 1 60*32ecca8dSDmitry Vyukov+ 16.94% 1.36% 2 61*32ecca8dSDmitry Vyukov+ 4.00% 20.13% 125 62*32ecca8dSDmitry Vyukov+ 3.66% 18.25% 124 63*32ecca8dSDmitry Vyukov+ 3.48% 17.66% 126 64*32ecca8dSDmitry Vyukov+ 3.26% 0.39% 3 65*32ecca8dSDmitry Vyukov+ 2.61% 12.93% 123 66*32ecca8dSDmitry Vyukov----------------------------------- 67*32ecca8dSDmitry Vyukov 68*32ecca8dSDmitry VyukovBy expanding corresponding lines, you may see what commands/functions run 69*32ecca8dSDmitry Vyukovat the given parallelism level: 70*32ecca8dSDmitry Vyukov 71*32ecca8dSDmitry Vyukov----------------------------------- 72*32ecca8dSDmitry Vyukov Latency Overhead Parallelism / Command / Symbol 73*32ecca8dSDmitry Vyukov- 56.98% 2.29% 1 74*32ecca8dSDmitry Vyukov 32.80% 1.32% gzip 75*32ecca8dSDmitry Vyukov 4.46% 0.18% cc1 76*32ecca8dSDmitry Vyukov 2.81% 0.11% objtool 77*32ecca8dSDmitry Vyukov 2.43% 0.10% dpkg-source 78*32ecca8dSDmitry Vyukov 2.22% 0.09% ld 79*32ecca8dSDmitry Vyukov 2.10% 0.08% dpkg-genchanges 80*32ecca8dSDmitry Vyukov----------------------------------- 81*32ecca8dSDmitry Vyukov 82*32ecca8dSDmitry VyukovTo see the normal function-level profile for particular parallelism levels 83*32ecca8dSDmitry Vyukov(number of threads actively running on CPUs), you may use '--parallelism' 84*32ecca8dSDmitry Vyukovfilter. For example, to see the profile only for low parallelism phases 85*32ecca8dSDmitry Vyukovof a workload use '--latency --parallelism=1-2' flags. 86