This document summarizes the common approaches for performance fine tuning with
jemalloc (as of 5.3.0). The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options. However, in order to cover a wide range of applications and avoid
pathological cases, the default setting is sometimes kept conservative and
suboptimal, even for many common workloads. When jemalloc is properly tuned for
a specific application / workload, it is common to improve system level metrics
by a few percent, or make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency for
    application threads, since unused memory purging is shifted to the dedicated
    background threads. In addition, unintended purging delay caused by
    application inactivity is avoided with background threads.

    Suggested: `background_thread:true` when jemalloc managed threads can be
    allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with a large memory footprint and frequent allocation / deallocation
    activities. Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage. A shorter decay time purges unused pages
    faster to reduce memory usage (usually at the cost of more CPU cycles spent
    on purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently. When a high degree of parallelism
    is not expected at the allocator level, a lower number of arenas often
    improves memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enable dynamic thread to arena association based on running CPU. This has
    the potential to improve locality, e.g. when thread to CPU affinity is
    present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.

Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true,tcache_max:4096` combined with shorter decay time
    (decreased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count
    (e.g. number of CPUs).

* Low resource consumption application:

    `narenas:1,tcache_max:1024` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable when
  allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine the options with `abort_conf:true`, which
aborts immediately on illegal options.

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages. For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality. In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory. One
    performance-oriented use case is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns. Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level.
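
As a rough illustration of how the runtime options from the first section can be
compiled into an application, the sketch below defines the `malloc_conf` global
string using the "prioritizing CPU utilization" example plus `abort_conf:true`.
It assumes an unprefixed jemalloc build; builds configured with a symbol prefix
expose the variable under a prefixed name (for example `je_malloc_conf`). The
`conf_has_option()` helper is hypothetical and not part of jemalloc; it is only
a sanity check on the string.

```c
#include <stddef.h>
#include <string.h>

/*
 * Read by jemalloc at startup when the program is linked against it.
 * Assumption: unprefixed build; a prefixed build uses a prefixed symbol name.
 * The option values are the CPU-oriented example from this document.
 */
const char *malloc_conf =
    "background_thread:true,"
    "metadata_thp:auto,"
    "dirty_decay_ms:30000,"
    "muzzy_decay_ms:30000,"
    "abort_conf:true";

/*
 * Hypothetical helper (not a jemalloc API): returns 1 if `opt`, such as
 * "abort_conf:true", appears as a complete comma-separated entry in `conf`,
 * and 0 otherwise.
 */
int conf_has_option(const char *conf, const char *opt)
{
    size_t len = strlen(opt);
    const char *p = conf;

    if (len == 0)
        return 0;

    while ((p = strstr(p, opt)) != NULL) {
        /* The match must start at the beginning or right after a comma... */
        int starts = (p == conf) || (p[-1] == ',');
        /* ...and end at the end of the string or right before a comma. */
        int ends = (p[len] == '\0') || (p[len] == ',');
        if (starts && ends)
            return 1;
        p += 1;
    }
    return 0;
}
```

The same option string can instead be supplied through the `MALLOC_CONF`
environment variable, which avoids rebuilding the application while
experimenting with different values.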