This document summarizes the common approaches for performance fine tuning with
jemalloc (as of 5.3.0). The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options. However, in order to cover a wide range of applications and avoid
pathological cases, the default setting is sometimes kept conservative and
suboptimal, even for many common workloads. When jemalloc is properly tuned for
a specific application / workload, it is common to improve system level metrics
by a few percent, or make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency for
    application threads, since unused memory purging is shifted to the dedicated
    background threads. In addition, unintended purging delay caused by
    application inactivity is avoided with background threads.

    Suggested: `background_thread:true` when jemalloc managed threads can be
    allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with a large memory footprint and frequent allocation / deallocation
    activities. Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage. Shorter decay time purges unused pages faster
    to reduce memory usage (usually at the cost of more CPU cycles spent on
    purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently. When a high degree of parallelism
    is not expected at the allocator level, a lower number of arenas often
    improves memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enable dynamic thread to arena association based on running CPU. This has
    the potential to improve locality, e.g. when thread to CPU affinity is
    present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.

Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true,tcache_max:4096` combined with shorter decay time
    (decreased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count
    (e.g. number of CPUs).

* Low resource consumption application:

    `narenas:1,tcache_max:1024` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable when
  allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine the options with `abort_conf:true`, which
aborts immediately on illegal options.

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages. For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality. In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory. One use
    case for performance purposes is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns. Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level.
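
One way to apply the example option strings from the runtime options section
is to export them through the `MALLOC_CONF` environment variable before
launching the target program. The sketch below is illustrative only: the
`workload` enum and the `tuning_preset()` / `export_preset()` helpers are
hypothetical names, not part of jemalloc, and the strings simply restate the
examples from this document with `abort_conf:true` added. It assumes a POSIX
system (for `setenv()`) and an unprefixed jemalloc build. Since jemalloc reads
the variable while initializing, the setting is expected to affect only
processes started afterwards.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical workload categories, mirroring the examples in this document. */
enum workload {
    WORKLOAD_CPU_PRIORITY,    /* high resource use, prioritize CPU */
    WORKLOAD_MEMORY_PRIORITY, /* high resource use, prioritize memory */
    WORKLOAD_LOW_RESOURCE,    /* low resource consumption */
    WORKLOAD_MINIMAL_MEMORY   /* extremely conservative */
};

/*
 * Return the option string for a workload, or NULL for an unknown value.
 * abort_conf:true is included so that a mistyped option is reported
 * immediately instead of being ignored.
 */
const char *tuning_preset(enum workload w) {
    switch (w) {
    case WORKLOAD_CPU_PRIORITY:
        return "abort_conf:true,background_thread:true,metadata_thp:auto,"
               "dirty_decay_ms:30000,muzzy_decay_ms:30000";
    case WORKLOAD_MEMORY_PRIORITY:
        /* A lower narenas value (e.g. the CPU count) would be added per machine. */
        return "abort_conf:true,background_thread:true,tcache_max:4096,"
               "dirty_decay_ms:5000,muzzy_decay_ms:5000";
    case WORKLOAD_LOW_RESOURCE:
        return "abort_conf:true,narenas:1,tcache_max:1024,"
               "dirty_decay_ms:1000,muzzy_decay_ms:0";
    case WORKLOAD_MINIMAL_MEMORY:
        return "abort_conf:true,narenas:1,tcache:false,"
               "dirty_decay_ms:0,muzzy_decay_ms:0";
    }
    return NULL;
}

/* Export the preset via MALLOC_CONF. Returns 0 on success, -1 on failure. */
int export_preset(enum workload w) {
    const char *conf = tuning_preset(w);
    if (conf == NULL) {
        return -1;
    }
    return setenv("MALLOC_CONF", conf, 1);
}
```

A small launcher could call `export_preset(WORKLOAD_LOW_RESOURCE)` and then
`exec` the real binary, which keeps the tuning choice outside the application
itself and makes it easy to compare presets while monitoring CPU and memory
usage.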