This document summarizes the common approaches for performance fine tuning with
jemalloc (as of 5.3.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default settings are sometimes kept conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or to make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

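For example, a fixed set of options can be compiled into the application by
defining the `malloc_conf` global variable, which jemalloc interprets during
initialization; the `MALLOC_CONF` environment variable is interpreted after it
and therefore takes precedence.  A minimal sketch (the option string here is
only an illustration):

```c
/* Interpreted by jemalloc at startup; settings in the MALLOC_CONF
 * environment variable are interpreted later and take precedence. */
const char *malloc_conf = "background_thread:true,metadata_thp:auto";
```
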
* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency of
    application threads, since unused memory purging is shifted to the dedicated
    background threads.  In addition, background threads avoid the unintended
    purging delay caused by application inactivity.

    Suggested: `background_thread:true` when jemalloc-managed threads can be
    allowed (the option can also be toggled at runtime; see the `mallctl()`
    sketch after this list).

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with a large memory footprint and frequent allocation / deallocation
    activity.  Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation-intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage.  A shorter decay time purges unused pages
    faster to reduce memory usage (usually at the cost of more CPU cycles spent
    on purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs; they can also
    be adjusted at runtime, as shown in the sketch after this list.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of parallelism
    is not expected at the allocator level, a lower arena count often improves
    memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enables dynamic thread-to-arena association based on the CPU a thread is
    running on.  This has the potential to improve locality, e.g. when
    thread-to-CPU affinity is present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.

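Several of these options can also be adjusted after startup through
[mallctl()](http://jemalloc.net/jemalloc.3.html#mallctl), as noted in the
`background_thread` and decay items above.  A minimal sketch, assuming an
unprefixed jemalloc such as the FreeBSD system allocator (a stock jemalloc
build would include `<jemalloc/jemalloc.h>` instead):

```c
#include <malloc_np.h>   /* declares mallctl() on FreeBSD */
#include <stdbool.h>
#include <sys/types.h>   /* ssize_t */

/* Apply runtime tweaks; return values are ignored for brevity. */
static void
tune_allocator(void)
{
	/* Start jemalloc's background purging threads. */
	bool enable = true;
	mallctl("background_thread", NULL, NULL, &enable, sizeof(enable));

	/* Shorten the default dirty decay time for arenas created from now
	 * on; existing arenas are reached via "arena.<i>.dirty_decay_ms". */
	ssize_t decay_ms = 5000;
	mallctl("arenas.dirty_decay_ms", NULL, NULL, &decay_ms,
	    sizeof(decay_ms));
}
```
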
Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true,tcache_max:4096` combined with shorter decay time
    (decreased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and a lower arena count
    (e.g. number of CPUs).

* Low resource consumption application:

    `narenas:1,tcache_max:1024` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs; only suitable
  when allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

It is recommended to combine these options with `abort_conf:true`, which makes
jemalloc abort immediately on invalid options.

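For instance, the low resource consumption profile above can be shipped in the
binary itself, combined with `abort_conf:true` (a sketch; tune the exact values
to the measured workload):

```c
/* Fail fast on unrecognized options, then apply the low-resource profile. */
const char *malloc_conf =
    "abort_conf:true,narenas:1,tcache_max:1024,"
    "dirty_decay_ms:1000,muzzy_decay_ms:0";
```
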
## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages.  For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality.  In addition, explicit arenas often benefit from individually
    tuned options, e.g. a relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.  See the first sketch after this list.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization of how the underlying memory is managed.
    One performance-oriented use case is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly (a
    delegating-hook sketch follows at the end of this section).

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns.  Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level (see the sketch below).
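
A minimal sketch combining the first and last items above: create a dedicated
arena, relax its decay time, allocate from it explicitly, and bind the calling
thread to it.  As before, this assumes an unprefixed jemalloc; the size and
decay value are illustrative:

```c
#include <malloc_np.h>
#include <stdio.h>
#include <sys/types.h>

int
main(void)
{
	/* Create a fresh arena with the default extent hooks. */
	unsigned arena_ind;
	size_t sz = sizeof(arena_ind);
	if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0)
		return (1);

	/* Give this arena its own, relaxed dirty decay time. */
	char name[64];
	snprintf(name, sizeof(name), "arena.%u.dirty_decay_ms", arena_ind);
	ssize_t decay_ms = 30000;
	mallctl(name, NULL, NULL, &decay_ms, sizeof(decay_ms));

	/* Allocate frequently accessed objects explicitly from the arena... */
	void *hot = mallocx(1 << 20, MALLOCX_ARENA(arena_ind));

	/* ...and/or bind the calling thread so its regular allocations use
	 * the arena from now on. */
	mallctl("thread.arena", NULL, NULL, &arena_ind, sizeof(arena_ind));

	dallocx(hot, 0);
	return (0);
}
```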
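
The extent hooks machinery can be exercised with a delegating hook.  A sketch
only: the custom hook merely logs each extent allocation and defers to the
default implementation, whereas a real huge-page hook (as in HHVM) would map
memory itself:

```c
#include <malloc_np.h>
#include <stdbool.h>
#include <stdio.h>

static extent_hooks_t *default_hooks;  /* the hooks we delegate to */
static extent_hooks_t custom_hooks;

/* Observe extent allocations, then defer to the default implementation. */
static void *
logging_alloc(extent_hooks_t *hooks, void *new_addr, size_t size,
    size_t alignment, bool *zero, bool *commit, unsigned arena_ind)
{
	(void)hooks;  /* unused; we delegate to the saved default hooks */
	fprintf(stderr, "extent alloc: %zu bytes, arena %u\n", size, arena_ind);
	return (default_hooks->alloc(default_hooks, new_addr, size, alignment,
	    zero, commit, arena_ind));
}

int
main(void)
{
	unsigned arena_ind;
	size_t sz = sizeof(arena_ind);
	mallctl("arenas.create", &arena_ind, &sz, NULL, 0);

	char name[64];
	snprintf(name, sizeof(name), "arena.%u.extent_hooks", arena_ind);

	/* Read the arena's current (default) hooks, then install a copy with
	 * the alloc hook overridden; the structs must outlive the arena. */
	sz = sizeof(default_hooks);
	mallctl(name, &default_hooks, &sz, NULL, 0);
	custom_hooks = *default_hooks;
	custom_hooks.alloc = logging_alloc;
	extent_hooks_t *hooks_ptr = &custom_hooks;
	mallctl(name, NULL, NULL, &hooks_ptr, sizeof(hooks_ptr));

	void *p = mallocx(1 << 20, MALLOCX_ARENA(arena_ind));
	dallocx(p, 0);
	return (0);
}
```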