This document summarizes the common approaches for performance fine tuning with
jemalloc (as of 5.3.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default setting is sometimes kept conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system level metrics
by a few percent, or make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

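For example, an application can compile in a default option string by defining
the `malloc_conf` global variable, which jemalloc reads during initialization.
A minimal sketch (the option string below is purely illustrative):

```c
/* jemalloc reads this global during initialization; the options shown are
 * illustrative, not a recommendation. */
const char *malloc_conf = "background_thread:true,metadata_thp:auto";
```

The same string can instead be supplied at run time via the `MALLOC_CONF`
environment variable, which requires no rebuild.
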
* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency for
    application threads, since unused memory purging is shifted to the dedicated
    background threads.  In addition, unintended purging delay caused by
    application inactivity is avoided with background threads.

    Suggested: `background_thread:true` when jemalloc-managed threads can be
    allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with a large memory footprint and frequent allocation / deallocation
    activity.  Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation-intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage.  Shorter decay time purges unused pages faster
    to reduce memory usage (usually at the cost of more CPU cycles spent on
    purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.  (A sketch for
    reading the effective values back at run time appears after this list.)

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of parallelism
    is not expected at the allocator level, a lower number of arenas often
    improves memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enables dynamic thread-to-arena association based on the running CPU.  This
    has the potential to improve locality, e.g. when thread-to-CPU affinity is
    present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.

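As a tuning aid, the effective values of the options above can be read back at
run time through the `mallctl()` `opt.*` namespace.  A minimal sketch, assuming
an unprefixed jemalloc build (as on FreeBSD) and omitting error handling:

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    ssize_t dirty_decay_ms;
    unsigned narenas;
    bool background_thread;
    size_t sz;

    /* Read the active option values via the read-only "opt.*" entries. */
    sz = sizeof(dirty_decay_ms);
    mallctl("opt.dirty_decay_ms", &dirty_decay_ms, &sz, NULL, 0);
    sz = sizeof(narenas);
    mallctl("opt.narenas", &narenas, &sz, NULL, 0);
    sz = sizeof(background_thread);
    mallctl("opt.background_thread", &background_thread, &sz, NULL, 0);

    printf("dirty_decay_ms=%zd narenas=%u background_thread=%d\n",
           dirty_decay_ms, narenas, (int)background_thread);
    return 0;
}
```

This can help confirm that a `malloc_conf` / `MALLOC_CONF` string was parsed as
intended while experimenting with the options above.
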
Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true,tcache_max:4096` combined with shorter decay time
    (decreased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and a lower arena count
    (e.g. the number of CPUs).

* Low resource consumption application:

    `narenas:1,tcache_max:1024` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable when
  allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine the options with `abort_conf:true`, which
aborts immediately on illegal options.
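For example, the extremely conservative settings above could be combined with
`abort_conf:true` like this (illustrative only):

```c
/* abort_conf:true makes jemalloc abort at startup if the option string
 * contains an invalid option, rather than continuing without it. */
const char *malloc_conf =
    "abort_conf:true,narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0";
```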

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages.  For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality.  In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.  See the sketch after this list.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory.  One use
    case for performance purposes is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns.  Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level, as shown in the sketch below.
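
Putting these together, here is a minimal sketch -- assuming an unprefixed
jemalloc build (as on FreeBSD, where the same interfaces are also declared in
`<malloc_np.h>`) and omitting error handling -- that creates an explicit arena,
relaxes its dirty decay time, allocates from it with `mallocx()`, and binds the
calling thread to it:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    unsigned arena_ind;
    size_t sz = sizeof(arena_ind);

    /* Create an explicit arena; custom extent hooks could be passed via the
     * newp argument instead of NULL. */
    mallctl("arenas.create", &arena_ind, &sz, NULL, 0);

    /* Relax the new arena's dirty decay time (30 s here, purely
     * illustrative), expecting frequent reuse of its memory. */
    char name[64];
    snprintf(name, sizeof(name), "arena.%u.dirty_decay_ms", arena_ind);
    ssize_t decay_ms = 30000;
    mallctl(name, NULL, NULL, &decay_ms, sizeof(decay_ms));

    /* Explicitly allocate a frequently accessed object from the arena. */
    void *hot = mallocx(4096, MALLOCX_ARENA(arena_ind));
    memset(hot, 0, 4096);

    /* Alternatively, bind the calling thread to the arena so that its
     * regular malloc()/free() traffic also lands there. */
    mallctl("thread.arena", NULL, NULL, &arena_ind, sizeof(arena_ind));

    dallocx(hot, 0);
    return 0;
}
```

Per-arena options such as the decay time above are exposed through the
corresponding `arena.<i>.*` mallctl entries; extent hooks, being substantially
more involved, are left to the manual page linked above.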