This document summarizes the common approaches for performance fine-tuning with
jemalloc (as of 5.3.0). The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options. However, in order to cover a wide range of applications and avoid
pathological cases, the default settings are sometimes kept conservative and
suboptimal, even for many common workloads. When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or to make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

  Enabling jemalloc background threads generally improves the tail latency of
  application threads, since unused-memory purging is shifted to the dedicated
  background threads. In addition, background threads avoid the unintended
  purging delays caused by application inactivity.

  Suggested: `background_thread:true` when jemalloc-managed threads can be
  allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

  Allowing jemalloc to utilize transparent huge pages for its internal
  metadata usually reduces TLB misses significantly, especially for programs
  with a large memory footprint and frequent allocation / deallocation
  activity. Metadata memory usage may increase due to the use of huge pages.

  Suggested for allocation-intensive programs: `metadata_thp:auto` or
  `metadata_thp:always`, which is expected to improve CPU utilization at a
  small memory cost.
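  As a concrete illustration, options such as these can be supplied at process
  startup through the `MALLOC_CONF` environment variable (a minimal sketch;
  `./your_app` is a placeholder binary name):

  ```shell
  # Enable background purging threads and THP-backed allocator metadata;
  # "auto" lets jemalloc decide when huge pages pay off for its metadata.
  MALLOC_CONF="background_thread:true,metadata_thp:auto" ./your_app
  ```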
* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

  Decay time determines how fast jemalloc returns unused pages back to the
  operating system, and therefore provides a fairly straightforward trade-off
  between CPU and memory usage. A shorter decay time purges unused pages
  faster, reducing memory usage (usually at the cost of more CPU cycles spent
  on purging), and vice versa.

  Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

  By default jemalloc uses multiple arenas to reduce internal lock contention.
  However, a high arena count may also increase overall memory fragmentation,
  since arenas manage memory independently. When a high degree of parallelism
  is not expected at the allocator level, a lower number of arenas often
  improves memory usage.

  Suggested: if low parallelism is expected, try a lower arena count while
  monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

  Enables dynamic thread-to-arena association based on the running CPU. This
  has the potential to improve locality, e.g. when thread-to-CPU affinity is
  present.

  Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
  thread migration between processors is expected to be infrequent.

Examples:

* High resource consumption application, prioritizing CPU utilization:

  `background_thread:true,metadata_thp:auto` combined with relaxed decay time
  (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
  e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

  `background_thread:true,tcache_max:4096` combined with shorter decay time
  (decreased `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
  `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and a lower arena count
  (e.g. the number of CPUs).

* Low resource consumption application:

  `narenas:1,tcache_max:1024` combined with shorter decay time (decreased
  `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
  `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable
  when allocation activity is very rare:

  `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine the options with `abort_conf:true`,
which causes the process to abort immediately on invalid options.

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

  Manually created arenas can help performance in various ways, e.g. by
  managing locality and contention for specific usages. For example,
  applications can explicitly allocate frequently accessed objects from a
  dedicated arena with
  [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
  locality. In addition, explicit arenas often benefit from individually
  tuned options, e.g. a relaxed [decay
  time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
  frequent reuse is expected.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

  Extent hooks allow customization of how the underlying memory is managed.
  One performance-oriented use case is to utilize huge pages -- for example,
  [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
  uses explicit arenas with customized extent hooks to manage 1GB huge pages
  for frequently accessed data, which reduces TLB misses significantly.
* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

  It is common for some threads in an application to have different memory
  access / allocation patterns. Threads with heavy workloads often benefit
  from explicit binding, e.g. binding very active threads to dedicated arenas
  may reduce contention at the allocator level.
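The programmatic facilities above can be combined. The following sketch
(assuming an unprefixed jemalloc build linked into the program; error handling
is abbreviated for brevity) creates an explicit arena via the `mallctl`
interface, relaxes that arena's decay time, allocates from it with
`mallocx()`, and binds the calling thread to it:

```c
#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    /* Create a new arena; "arenas.create" reads back its index. */
    unsigned arena;
    size_t sz = sizeof(arena);
    if (mallctl("arenas.create", &arena, &sz, NULL, 0) != 0) {
        fprintf(stderr, "arenas.create failed\n");
        return 1;
    }

    /* Per-arena tuning: relax this arena's dirty decay time to 30s,
     * anticipating frequent reuse of its pages. */
    char cmd[64];
    ssize_t decay_ms = 30000;
    snprintf(cmd, sizeof(cmd), "arena.%u.dirty_decay_ms", arena);
    mallctl(cmd, NULL, NULL, &decay_ms, sizeof(decay_ms));

    /* Allocate a frequently accessed object directly from that arena. */
    void *obj = mallocx(4096, MALLOCX_ARENA(arena));

    /* Optionally bind the calling thread to the arena, so that plain
     * malloc()/free() calls from this thread use it as well. */
    mallctl("thread.arena", NULL, NULL, &arena, sizeof(arena));

    dallocx(obj, 0);
    return 0;
}
```

A typical build command would be along the lines of `cc app.c -ljemalloc`;
consult `jemalloc-config` for the flags matching your installation.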