This document summarizes the common approaches for performance fine-tuning with
jemalloc (as of 5.3.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default setting is sometimes kept conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency for
    application threads, since unused memory purging is shifted to the dedicated
    background threads.  In addition, unintended purging delay caused by
    application inactivity is avoided with background threads.

    Suggested: `background_thread:true` when jemalloc-managed threads are
    acceptable.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with large memory footprint and frequent allocation / deallocation
    activities.  Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage.  A shorter decay time purges unused pages
    faster to reduce memory usage (usually at the cost of more CPU cycles spent
    on purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of
    parallelism is not expected at the allocator level, a lower number of
    arenas often improves memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enable dynamic thread-to-arena association based on the running CPU.  This
    has the potential to improve locality, e.g. when thread-to-CPU affinity is
    present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.

Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true,tcache_max:4096` combined with shorter decay time
    (decreased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and a lower arena count
    (e.g. number of CPUs).

* Low resource consumption application:

    `narenas:1,tcache_max:1024` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable
  when allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

It is recommended to combine these options with `abort_conf:true`, which
aborts immediately on illegal options.

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages.  For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality.  In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory.  One
    performance-oriented use case is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns.  Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level.