xref: /linux/Documentation/mm/multigen_lru.rst (revision 8be976a0937a18118424dd2505925081d9192fd5)
1*8be976a0SYu Zhao.. SPDX-License-Identifier: GPL-2.0
2*8be976a0SYu Zhao
3*8be976a0SYu Zhao=============
4*8be976a0SYu ZhaoMulti-Gen LRU
5*8be976a0SYu Zhao=============
6*8be976a0SYu ZhaoThe multi-gen LRU is an alternative LRU implementation that optimizes
7*8be976a0SYu Zhaopage reclaim and improves performance under memory pressure. Page
8*8be976a0SYu Zhaoreclaim decides the kernel's caching policy and ability to overcommit
9*8be976a0SYu Zhaomemory. It directly impacts the kswapd CPU usage and RAM efficiency.
10*8be976a0SYu Zhao
11*8be976a0SYu ZhaoDesign overview
12*8be976a0SYu Zhao===============
13*8be976a0SYu ZhaoObjectives
14*8be976a0SYu Zhao----------
15*8be976a0SYu ZhaoThe design objectives are:
16*8be976a0SYu Zhao
17*8be976a0SYu Zhao* Good representation of access recency
18*8be976a0SYu Zhao* Try to profit from spatial locality
19*8be976a0SYu Zhao* Fast paths to make obvious choices
20*8be976a0SYu Zhao* Simple self-correcting heuristics
21*8be976a0SYu Zhao
22*8be976a0SYu ZhaoThe representation of access recency is at the core of all LRU
23*8be976a0SYu Zhaoimplementations. In the multi-gen LRU, each generation represents a
24*8be976a0SYu Zhaogroup of pages with similar access recency. Generations establish a
25*8be976a0SYu Zhao(time-based) common frame of reference and therefore help make better
26*8be976a0SYu Zhaochoices, e.g., between different memcgs on a computer or different
27*8be976a0SYu Zhaocomputers in a data center (for job scheduling).
28*8be976a0SYu Zhao
29*8be976a0SYu ZhaoExploiting spatial locality improves efficiency when gathering the
30*8be976a0SYu Zhaoaccessed bit. A rmap walk targets a single page and does not try to
31*8be976a0SYu Zhaoprofit from discovering a young PTE. A page table walk can sweep all
32*8be976a0SYu Zhaothe young PTEs in an address space, but the address space can be too
33*8be976a0SYu Zhaosparse to make a profit. The key is to optimize both methods and use
34*8be976a0SYu Zhaothem in combination.
35*8be976a0SYu Zhao
36*8be976a0SYu ZhaoFast paths reduce code complexity and runtime overhead. Unmapped pages
37*8be976a0SYu Zhaodo not require TLB flushes; clean pages do not require writeback.
38*8be976a0SYu ZhaoThese facts are only helpful when other conditions, e.g., access
39*8be976a0SYu Zhaorecency, are similar. With generations as a common frame of reference,
40*8be976a0SYu Zhaoadditional factors stand out. But obvious choices might not be good
41*8be976a0SYu Zhaochoices; thus self-correction is necessary.
42*8be976a0SYu Zhao
43*8be976a0SYu ZhaoThe benefits of simple self-correcting heuristics are self-evident.
44*8be976a0SYu ZhaoAgain, with generations as a common frame of reference, this becomes
45*8be976a0SYu Zhaoattainable. Specifically, pages in the same generation can be
46*8be976a0SYu Zhaocategorized based on additional factors, and a feedback loop can
47*8be976a0SYu Zhaostatistically compare the refault percentages across those categories
48*8be976a0SYu Zhaoand infer which of them are better choices.
49*8be976a0SYu Zhao
50*8be976a0SYu ZhaoAssumptions
51*8be976a0SYu Zhao-----------
52*8be976a0SYu ZhaoThe protection of hot pages and the selection of cold pages are based
53*8be976a0SYu Zhaoon page access channels and patterns. There are two access channels:
54*8be976a0SYu Zhao
55*8be976a0SYu Zhao* Accesses through page tables
56*8be976a0SYu Zhao* Accesses through file descriptors
57*8be976a0SYu Zhao
58*8be976a0SYu ZhaoThe protection of the former channel is by design stronger because:
59*8be976a0SYu Zhao
60*8be976a0SYu Zhao1. The uncertainty in determining the access patterns of the former
61*8be976a0SYu Zhao   channel is higher due to the approximation of the accessed bit.
62*8be976a0SYu Zhao2. The cost of evicting the former channel is higher due to the TLB
63*8be976a0SYu Zhao   flushes required and the likelihood of encountering the dirty bit.
64*8be976a0SYu Zhao3. The penalty of underprotecting the former channel is higher because
65*8be976a0SYu Zhao   applications usually do not prepare themselves for major page
66*8be976a0SYu Zhao   faults like they do for blocked I/O. E.g., GUI applications
67*8be976a0SYu Zhao   commonly use dedicated I/O threads to avoid blocking rendering
68*8be976a0SYu Zhao   threads.
69*8be976a0SYu Zhao
70*8be976a0SYu ZhaoThere are also two access patterns:
71*8be976a0SYu Zhao
72*8be976a0SYu Zhao* Accesses exhibiting temporal locality
73*8be976a0SYu Zhao* Accesses not exhibiting temporal locality
74*8be976a0SYu Zhao
75*8be976a0SYu ZhaoFor the reasons listed above, the former channel is assumed to follow
76*8be976a0SYu Zhaothe former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
77*8be976a0SYu Zhaopresent, and the latter channel is assumed to follow the latter
78*8be976a0SYu Zhaopattern unless outlying refaults have been observed.
79*8be976a0SYu Zhao
80*8be976a0SYu ZhaoWorkflow overview
81*8be976a0SYu Zhao=================
82*8be976a0SYu ZhaoEvictable pages are divided into multiple generations for each
83*8be976a0SYu Zhao``lruvec``. The youngest generation number is stored in
84*8be976a0SYu Zhao``lrugen->max_seq`` for both anon and file types as they are aged on
85*8be976a0SYu Zhaoan equal footing. The oldest generation numbers are stored in
86*8be976a0SYu Zhao``lrugen->min_seq[]`` separately for anon and file types as clean file
87*8be976a0SYu Zhaopages can be evicted regardless of swap constraints. These three
88*8be976a0SYu Zhaovariables are monotonically increasing.
89*8be976a0SYu Zhao
90*8be976a0SYu ZhaoGeneration numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
91*8be976a0SYu Zhaobits in order to fit into the gen counter in ``folio->flags``. Each
92*8be976a0SYu Zhaotruncated generation number is an index to ``lrugen->lists[]``. The
93*8be976a0SYu Zhaosliding window technique is used to track at least ``MIN_NR_GENS`` and
94*8be976a0SYu Zhaoat most ``MAX_NR_GENS`` generations. The gen counter stores a value
95*8be976a0SYu Zhaowithin ``[1, MAX_NR_GENS]`` while a page is on one of
96*8be976a0SYu Zhao``lrugen->lists[]``; otherwise it stores zero.
97*8be976a0SYu Zhao
98*8be976a0SYu ZhaoEach generation is divided into multiple tiers. A page accessed ``N``
99*8be976a0SYu Zhaotimes through file descriptors is in tier ``order_base_2(N)``. Unlike
100*8be976a0SYu Zhaogenerations, tiers do not have dedicated ``lrugen->lists[]``. In
101*8be976a0SYu Zhaocontrast to moving across generations, which requires the LRU lock,
102*8be976a0SYu Zhaomoving across tiers only involves atomic operations on
103*8be976a0SYu Zhao``folio->flags`` and therefore has a negligible cost. A feedback loop
104*8be976a0SYu Zhaomodeled after the PID controller monitors refaults over all the tiers
105*8be976a0SYu Zhaofrom anon and file types and decides which tiers from which types to
106*8be976a0SYu Zhaoevict or protect.
107*8be976a0SYu Zhao
108*8be976a0SYu ZhaoThere are two conceptually independent procedures: the aging and the
109*8be976a0SYu Zhaoeviction. They form a closed-loop system, i.e., the page reclaim.
110*8be976a0SYu Zhao
111*8be976a0SYu ZhaoAging
112*8be976a0SYu Zhao-----
113*8be976a0SYu ZhaoThe aging produces young generations. Given an ``lruvec``, it
114*8be976a0SYu Zhaoincrements ``max_seq`` when ``max_seq-min_seq+1`` approaches
115*8be976a0SYu Zhao``MIN_NR_GENS``. The aging promotes hot pages to the youngest
116*8be976a0SYu Zhaogeneration when it finds them accessed through page tables; the
117*8be976a0SYu Zhaodemotion of cold pages happens consequently when it increments
118*8be976a0SYu Zhao``max_seq``. The aging uses page table walks and rmap walks to find
119*8be976a0SYu Zhaoyoung PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
120*8be976a0SYu Zhaoand calls ``walk_page_range()`` with each ``mm_struct`` on this list
121*8be976a0SYu Zhaoto scan PTEs, and after each iteration, it increments ``max_seq``. For
122*8be976a0SYu Zhaothe latter, when the eviction walks the rmap and finds a young PTE,
123*8be976a0SYu Zhaothe aging scans the adjacent PTEs. For both, on finding a young PTE,
124*8be976a0SYu Zhaothe aging clears the accessed bit and updates the gen counter of the
125*8be976a0SYu Zhaopage mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
126*8be976a0SYu Zhao
127*8be976a0SYu ZhaoEviction
128*8be976a0SYu Zhao--------
129*8be976a0SYu ZhaoThe eviction consumes old generations. Given an ``lruvec``, it
130*8be976a0SYu Zhaoincrements ``min_seq`` when ``lrugen->lists[]`` indexed by
131*8be976a0SYu Zhao``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
132*8be976a0SYu Zhaoevict from, it first compares ``min_seq[]`` to select the older type.
133*8be976a0SYu ZhaoIf both types are equally old, it selects the one whose first tier has
134*8be976a0SYu Zhaoa lower refault percentage. The first tier contains single-use
135*8be976a0SYu Zhaounmapped clean pages, which are the best bet. The eviction sorts a
136*8be976a0SYu Zhaopage according to its gen counter if the aging has found this page
137*8be976a0SYu Zhaoaccessed through page tables and updated its gen counter. It also
138*8be976a0SYu Zhaomoves a page to the next generation, i.e., ``min_seq+1``, if this page
139*8be976a0SYu Zhaowas accessed multiple times through file descriptors and the feedback
140*8be976a0SYu Zhaoloop has detected outlying refaults from the tier this page is in. To
141*8be976a0SYu Zhaothis end, the feedback loop uses the first tier as the baseline, for
142*8be976a0SYu Zhaothe reason stated earlier.
143*8be976a0SYu Zhao
144*8be976a0SYu ZhaoSummary
145*8be976a0SYu Zhao-------
146*8be976a0SYu ZhaoThe multi-gen LRU can be disassembled into the following parts:
147*8be976a0SYu Zhao
148*8be976a0SYu Zhao* Generations
149*8be976a0SYu Zhao* Rmap walks
150*8be976a0SYu Zhao* Page table walks
151*8be976a0SYu Zhao* Bloom filters
152*8be976a0SYu Zhao* PID controller
153*8be976a0SYu Zhao
154*8be976a0SYu ZhaoThe aging and the eviction form a producer-consumer model;
155*8be976a0SYu Zhaospecifically, the latter drives the former by the sliding window over
156*8be976a0SYu Zhaogenerations. Within the aging, rmap walks drive page table walks by
157*8be976a0SYu Zhaoinserting hot densely populated page tables to the Bloom filters.
158*8be976a0SYu ZhaoWithin the eviction, the PID controller uses refaults as the feedback
159*8be976a0SYu Zhaoto select types to evict and tiers to protect.
160