xref: /linux/Documentation/mm/multigen_lru.rst (revision 1ac731c529cd4d6adbce134754b51ff7d822b145)
18be976a0SYu Zhao.. SPDX-License-Identifier: GPL-2.0
28be976a0SYu Zhao
38be976a0SYu Zhao=============
48be976a0SYu ZhaoMulti-Gen LRU
58be976a0SYu Zhao=============
68be976a0SYu ZhaoThe multi-gen LRU is an alternative LRU implementation that optimizes
78be976a0SYu Zhaopage reclaim and improves performance under memory pressure. Page
88be976a0SYu Zhaoreclaim decides the kernel's caching policy and ability to overcommit
98be976a0SYu Zhaomemory. It directly impacts the kswapd CPU usage and RAM efficiency.
108be976a0SYu Zhao
118be976a0SYu ZhaoDesign overview
128be976a0SYu Zhao===============
138be976a0SYu ZhaoObjectives
148be976a0SYu Zhao----------
158be976a0SYu ZhaoThe design objectives are:
168be976a0SYu Zhao
178be976a0SYu Zhao* Good representation of access recency
188be976a0SYu Zhao* Try to profit from spatial locality
198be976a0SYu Zhao* Fast paths to make obvious choices
208be976a0SYu Zhao* Simple self-correcting heuristics
218be976a0SYu Zhao
228be976a0SYu ZhaoThe representation of access recency is at the core of all LRU
238be976a0SYu Zhaoimplementations. In the multi-gen LRU, each generation represents a
248be976a0SYu Zhaogroup of pages with similar access recency. Generations establish a
258be976a0SYu Zhao(time-based) common frame of reference and therefore help make better
268be976a0SYu Zhaochoices, e.g., between different memcgs on a computer or different
278be976a0SYu Zhaocomputers in a data center (for job scheduling).
288be976a0SYu Zhao
298be976a0SYu ZhaoExploiting spatial locality improves efficiency when gathering the
308be976a0SYu Zhaoaccessed bit. A rmap walk targets a single page and does not try to
318be976a0SYu Zhaoprofit from discovering a young PTE. A page table walk can sweep all
328be976a0SYu Zhaothe young PTEs in an address space, but the address space can be too
338be976a0SYu Zhaosparse to make a profit. The key is to optimize both methods and use
348be976a0SYu Zhaothem in combination.
358be976a0SYu Zhao
368be976a0SYu ZhaoFast paths reduce code complexity and runtime overhead. Unmapped pages
378be976a0SYu Zhaodo not require TLB flushes; clean pages do not require writeback.
388be976a0SYu ZhaoThese facts are only helpful when other conditions, e.g., access
398be976a0SYu Zhaorecency, are similar. With generations as a common frame of reference,
408be976a0SYu Zhaoadditional factors stand out. But obvious choices might not be good
418be976a0SYu Zhaochoices; thus self-correction is necessary.
428be976a0SYu Zhao
438be976a0SYu ZhaoThe benefits of simple self-correcting heuristics are self-evident.
448be976a0SYu ZhaoAgain, with generations as a common frame of reference, this becomes
458be976a0SYu Zhaoattainable. Specifically, pages in the same generation can be
468be976a0SYu Zhaocategorized based on additional factors, and a feedback loop can
478be976a0SYu Zhaostatistically compare the refault percentages across those categories
488be976a0SYu Zhaoand infer which of them are better choices.
498be976a0SYu Zhao
508be976a0SYu ZhaoAssumptions
518be976a0SYu Zhao-----------
528be976a0SYu ZhaoThe protection of hot pages and the selection of cold pages are based
538be976a0SYu Zhaoon page access channels and patterns. There are two access channels:
548be976a0SYu Zhao
558be976a0SYu Zhao* Accesses through page tables
568be976a0SYu Zhao* Accesses through file descriptors
578be976a0SYu Zhao
588be976a0SYu ZhaoThe protection of the former channel is by design stronger because:
598be976a0SYu Zhao
608be976a0SYu Zhao1. The uncertainty in determining the access patterns of the former
618be976a0SYu Zhao   channel is higher due to the approximation of the accessed bit.
628be976a0SYu Zhao2. The cost of evicting the former channel is higher due to the TLB
638be976a0SYu Zhao   flushes required and the likelihood of encountering the dirty bit.
648be976a0SYu Zhao3. The penalty of underprotecting the former channel is higher because
658be976a0SYu Zhao   applications usually do not prepare themselves for major page
668be976a0SYu Zhao   faults like they do for blocked I/O. E.g., GUI applications
678be976a0SYu Zhao   commonly use dedicated I/O threads to avoid blocking rendering
688be976a0SYu Zhao   threads.
698be976a0SYu Zhao
708be976a0SYu ZhaoThere are also two access patterns:
718be976a0SYu Zhao
728be976a0SYu Zhao* Accesses exhibiting temporal locality
738be976a0SYu Zhao* Accesses not exhibiting temporal locality
748be976a0SYu Zhao
758be976a0SYu ZhaoFor the reasons listed above, the former channel is assumed to follow
768be976a0SYu Zhaothe former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
778be976a0SYu Zhaopresent, and the latter channel is assumed to follow the latter
788be976a0SYu Zhaopattern unless outlying refaults have been observed.
798be976a0SYu Zhao
808be976a0SYu ZhaoWorkflow overview
818be976a0SYu Zhao=================
828be976a0SYu ZhaoEvictable pages are divided into multiple generations for each
838be976a0SYu Zhao``lruvec``. The youngest generation number is stored in
848be976a0SYu Zhao``lrugen->max_seq`` for both anon and file types as they are aged on
858be976a0SYu Zhaoan equal footing. The oldest generation numbers are stored in
868be976a0SYu Zhao``lrugen->min_seq[]`` separately for anon and file types as clean file
878be976a0SYu Zhaopages can be evicted regardless of swap constraints. These three
888be976a0SYu Zhaovariables are monotonically increasing.
898be976a0SYu Zhao
908be976a0SYu ZhaoGeneration numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
918be976a0SYu Zhaobits in order to fit into the gen counter in ``folio->flags``. Each
926df1b221SYu Zhaotruncated generation number is an index to ``lrugen->folios[]``. The
938be976a0SYu Zhaosliding window technique is used to track at least ``MIN_NR_GENS`` and
948be976a0SYu Zhaoat most ``MAX_NR_GENS`` generations. The gen counter stores a value
958be976a0SYu Zhaowithin ``[1, MAX_NR_GENS]`` while a page is on one of
966df1b221SYu Zhao``lrugen->folios[]``; otherwise it stores zero.
978be976a0SYu Zhao
988be976a0SYu ZhaoEach generation is divided into multiple tiers. A page accessed ``N``
998be976a0SYu Zhaotimes through file descriptors is in tier ``order_base_2(N)``. Unlike
1006df1b221SYu Zhaogenerations, tiers do not have dedicated ``lrugen->folios[]``. In
1018be976a0SYu Zhaocontrast to moving across generations, which requires the LRU lock,
1028be976a0SYu Zhaomoving across tiers only involves atomic operations on
1038be976a0SYu Zhao``folio->flags`` and therefore has a negligible cost. A feedback loop
1048be976a0SYu Zhaomodeled after the PID controller monitors refaults over all the tiers
1058be976a0SYu Zhaofrom anon and file types and decides which tiers from which types to
106*32d32ef1ST.J. Alumbaughevict or protect. The desired effect is to balance refault percentages
107*32d32ef1ST.J. Alumbaughbetween anon and file types proportional to the swappiness level.
1088be976a0SYu Zhao
1098be976a0SYu ZhaoThere are two conceptually independent procedures: the aging and the
1108be976a0SYu Zhaoeviction. They form a closed-loop system, i.e., the page reclaim.
1118be976a0SYu Zhao
1128be976a0SYu ZhaoAging
1138be976a0SYu Zhao-----
1148be976a0SYu ZhaoThe aging produces young generations. Given an ``lruvec``, it
1158be976a0SYu Zhaoincrements ``max_seq`` when ``max_seq-min_seq+1`` approaches
1168be976a0SYu Zhao``MIN_NR_GENS``. The aging promotes hot pages to the youngest
1178be976a0SYu Zhaogeneration when it finds them accessed through page tables; the
1188be976a0SYu Zhaodemotion of cold pages happens consequently when it increments
1198be976a0SYu Zhao``max_seq``. The aging uses page table walks and rmap walks to find
1208be976a0SYu Zhaoyoung PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
1218be976a0SYu Zhaoand calls ``walk_page_range()`` with each ``mm_struct`` on this list
1228be976a0SYu Zhaoto scan PTEs, and after each iteration, it increments ``max_seq``. For
1238be976a0SYu Zhaothe latter, when the eviction walks the rmap and finds a young PTE,
1248be976a0SYu Zhaothe aging scans the adjacent PTEs. For both, on finding a young PTE,
1258be976a0SYu Zhaothe aging clears the accessed bit and updates the gen counter of the
1268be976a0SYu Zhaopage mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
1278be976a0SYu Zhao
1288be976a0SYu ZhaoEviction
1298be976a0SYu Zhao--------
1308be976a0SYu ZhaoThe eviction consumes old generations. Given an ``lruvec``, it
1316df1b221SYu Zhaoincrements ``min_seq`` when ``lrugen->folios[]`` indexed by
1328be976a0SYu Zhao``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
1338be976a0SYu Zhaoevict from, it first compares ``min_seq[]`` to select the older type.
1348be976a0SYu ZhaoIf both types are equally old, it selects the one whose first tier has
1358be976a0SYu Zhaoa lower refault percentage. The first tier contains single-use
1368be976a0SYu Zhaounmapped clean pages, which are the best bet. The eviction sorts a
1378be976a0SYu Zhaopage according to its gen counter if the aging has found this page
1388be976a0SYu Zhaoaccessed through page tables and updated its gen counter. It also
1398be976a0SYu Zhaomoves a page to the next generation, i.e., ``min_seq+1``, if this page
1408be976a0SYu Zhaowas accessed multiple times through file descriptors and the feedback
1418be976a0SYu Zhaoloop has detected outlying refaults from the tier this page is in. To
1428be976a0SYu Zhaothis end, the feedback loop uses the first tier as the baseline, for
1438be976a0SYu Zhaothe reason stated earlier.
1448be976a0SYu Zhao
1457b8144e6ST.J. AlumbaughWorking set protection
1467b8144e6ST.J. Alumbaugh----------------------
1477b8144e6ST.J. AlumbaughEach generation is timestamped at birth. If ``lru_gen_min_ttl`` is
1487b8144e6ST.J. Alumbaughset, an ``lruvec`` is protected from the eviction when its oldest
1497b8144e6ST.J. Alumbaughgeneration was born within ``lru_gen_min_ttl`` milliseconds. In other
1507b8144e6ST.J. Alumbaughwords, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
1517b8144e6ST.J. Alumbaughfrom getting evicted. The OOM killer is triggered if this working set
1527b8144e6ST.J. Alumbaughcannot be kept in memory.
1537b8144e6ST.J. Alumbaugh
1547b8144e6ST.J. AlumbaughThis time-based approach has the following advantages:
1557b8144e6ST.J. Alumbaugh
1567b8144e6ST.J. Alumbaugh1. It is easier to configure because it is agnostic to applications
1577b8144e6ST.J. Alumbaugh   and memory sizes.
1587b8144e6ST.J. Alumbaugh2. It is more reliable because it is directly wired to the OOM killer.
1597b8144e6ST.J. Alumbaugh
160*32d32ef1ST.J. Alumbaugh``mm_struct`` list
161*32d32ef1ST.J. Alumbaugh------------------
162*32d32ef1ST.J. AlumbaughAn ``mm_struct`` list is maintained for each memcg, and an
163*32d32ef1ST.J. Alumbaugh``mm_struct`` follows its owner task to the new memcg when this task
164*32d32ef1ST.J. Alumbaughis migrated.
165*32d32ef1ST.J. Alumbaugh
166*32d32ef1ST.J. AlumbaughA page table walker iterates ``lruvec_memcg()->mm_list`` and calls
167*32d32ef1ST.J. Alumbaugh``walk_page_range()`` with each ``mm_struct`` on this list to scan
168*32d32ef1ST.J. AlumbaughPTEs. When multiple page table walkers iterate the same list, each of
169*32d32ef1ST.J. Alumbaughthem gets a unique ``mm_struct``, and therefore they can run in
170*32d32ef1ST.J. Alumbaughparallel.
171*32d32ef1ST.J. Alumbaugh
172*32d32ef1ST.J. AlumbaughPage table walkers ignore any misplaced pages, e.g., if an
173*32d32ef1ST.J. Alumbaugh``mm_struct`` was migrated, pages left in the previous memcg will be
174*32d32ef1ST.J. Alumbaughignored when the current memcg is under reclaim. Similarly, page table
175*32d32ef1ST.J. Alumbaughwalkers will ignore pages from nodes other than the one under reclaim.
176*32d32ef1ST.J. Alumbaugh
177*32d32ef1ST.J. AlumbaughThis infrastructure also tracks the usage of ``mm_struct`` between
178*32d32ef1ST.J. Alumbaughcontext switches so that page table walkers can skip processes that
179*32d32ef1ST.J. Alumbaughhave been sleeping since the last iteration.
180*32d32ef1ST.J. Alumbaugh
181db19a43dST.J. AlumbaughRmap/PT walk feedback
182db19a43dST.J. Alumbaugh---------------------
183db19a43dST.J. AlumbaughSearching the rmap for PTEs mapping each page on an LRU list (to test
184db19a43dST.J. Alumbaughand clear the accessed bit) can be expensive because pages from
185db19a43dST.J. Alumbaughdifferent VMAs (PA space) are not cache friendly to the rmap (VA
186db19a43dST.J. Alumbaughspace). For workloads mostly using mapped pages, searching the rmap
187db19a43dST.J. Alumbaughcan incur the highest CPU cost in the reclaim path.
188db19a43dST.J. Alumbaugh
189db19a43dST.J. Alumbaugh``lru_gen_look_around()`` exploits spatial locality to reduce the
190db19a43dST.J. Alumbaughtrips into the rmap. It scans the adjacent PTEs of a young PTE and
191db19a43dST.J. Alumbaughpromotes hot pages. If the scan was done cacheline efficiently, it
192db19a43dST.J. Alumbaughadds the PMD entry pointing to the PTE table to the Bloom filter. This
193db19a43dST.J. Alumbaughforms a feedback loop between the eviction and the aging.
194db19a43dST.J. Alumbaugh
195*32d32ef1ST.J. AlumbaughBloom filters
196ccbbbb85ST.J. Alumbaugh-------------
197ccbbbb85ST.J. AlumbaughBloom filters are a space and memory efficient data structure for set
198ccbbbb85ST.J. Alumbaughmembership test, i.e., test if an element is not in the set or may be
199ccbbbb85ST.J. Alumbaughin the set.
200ccbbbb85ST.J. Alumbaugh
201ccbbbb85ST.J. AlumbaughIn the eviction path, specifically, in ``lru_gen_look_around()``, if a
202ccbbbb85ST.J. AlumbaughPMD has a sufficient number of hot pages, its address is placed in the
203ccbbbb85ST.J. Alumbaughfilter. In the aging path, set membership means that the PTE range
204ccbbbb85ST.J. Alumbaughwill be scanned for young pages.
205ccbbbb85ST.J. Alumbaugh
206ccbbbb85ST.J. AlumbaughNote that Bloom filters are probabilistic on set membership. If a test
207ccbbbb85ST.J. Alumbaughis false positive, the cost is an additional scan of a range of PTEs,
208ccbbbb85ST.J. Alumbaughwhich may yield hot pages anyway. Parameters of the filter itself can
209ccbbbb85ST.J. Alumbaughcontrol the false positive rate in the limit.
210ccbbbb85ST.J. Alumbaugh
211*32d32ef1ST.J. AlumbaughPID controller
212*32d32ef1ST.J. Alumbaugh--------------
213*32d32ef1ST.J. AlumbaughA feedback loop modeled after the Proportional-Integral-Derivative
214*32d32ef1ST.J. Alumbaugh(PID) controller monitors refaults over anon and file types and
215*32d32ef1ST.J. Alumbaughdecides which type to evict when both types are available from the
216*32d32ef1ST.J. Alumbaughsame generation.
217*32d32ef1ST.J. Alumbaugh
218*32d32ef1ST.J. AlumbaughThe PID controller uses generations rather than the wall clock as the
219*32d32ef1ST.J. Alumbaughtime domain because a CPU can scan pages at different rates under
220*32d32ef1ST.J. Alumbaughvarying memory pressure. It calculates a moving average for each new
221*32d32ef1ST.J. Alumbaughgeneration to avoid being permanently locked in a suboptimal state.
222*32d32ef1ST.J. Alumbaugh
22336c7b4dbST.J. AlumbaughMemcg LRU
22436c7b4dbST.J. Alumbaugh---------
22536c7b4dbST.J. AlumbaughAn memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
22636c7b4dbST.J. Alumbaughsince each node and memcg combination has an LRU of folios (see
22736c7b4dbST.J. Alumbaugh``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
22836c7b4dbST.J. Alumbaughglobal reclaim, which is critical to system-wide memory overcommit in
22936c7b4dbST.J. Alumbaughdata centers. Note that memcg LRU only applies to global reclaim.
23036c7b4dbST.J. Alumbaugh
23136c7b4dbST.J. AlumbaughThe basic structure of an memcg LRU can be understood by an analogy to
23236c7b4dbST.J. Alumbaughthe active/inactive LRU (of folios):
23336c7b4dbST.J. Alumbaugh
23436c7b4dbST.J. Alumbaugh1. It has the young and the old (generations), i.e., the counterparts
23536c7b4dbST.J. Alumbaugh   to the active and the inactive;
23636c7b4dbST.J. Alumbaugh2. The increment of ``max_seq`` triggers promotion, i.e., the
23736c7b4dbST.J. Alumbaugh   counterpart to activation;
23836c7b4dbST.J. Alumbaugh3. Other events trigger similar operations, e.g., offlining an memcg
23936c7b4dbST.J. Alumbaugh   triggers demotion, i.e., the counterpart to deactivation.
24036c7b4dbST.J. Alumbaugh
24136c7b4dbST.J. AlumbaughIn terms of global reclaim, it has two distinct features:
24236c7b4dbST.J. Alumbaugh
24336c7b4dbST.J. Alumbaugh1. Sharding, which allows each thread to start at a random memcg (in
24436c7b4dbST.J. Alumbaugh   the old generation) and improves parallelism;
24536c7b4dbST.J. Alumbaugh2. Eventual fairness, which allows direct reclaim to bail out at will
24636c7b4dbST.J. Alumbaugh   and reduces latency without affecting fairness over some time.
24736c7b4dbST.J. Alumbaugh
24836c7b4dbST.J. AlumbaughIn terms of traversing memcgs during global reclaim, it improves the
24936c7b4dbST.J. Alumbaughbest-case complexity from O(n) to O(1) and does not affect the
25036c7b4dbST.J. Alumbaughworst-case complexity O(n). Therefore, on average, it has a sublinear
25136c7b4dbST.J. Alumbaughcomplexity.
25236c7b4dbST.J. Alumbaugh
2538be976a0SYu ZhaoSummary
2548be976a0SYu Zhao-------
25536c7b4dbST.J. AlumbaughThe multi-gen LRU (of folios) can be disassembled into the following
25636c7b4dbST.J. Alumbaughparts:
2578be976a0SYu Zhao
2588be976a0SYu Zhao* Generations
2598be976a0SYu Zhao* Rmap walks
260*32d32ef1ST.J. Alumbaugh* Page table walks via ``mm_struct`` list
261*32d32ef1ST.J. Alumbaugh* Bloom filters for rmap/PT walk feedback
262*32d32ef1ST.J. Alumbaugh* PID controller for refault feedback
2638be976a0SYu Zhao
2648be976a0SYu ZhaoThe aging and the eviction form a producer-consumer model;
2658be976a0SYu Zhaospecifically, the latter drives the former by the sliding window over
2668be976a0SYu Zhaogenerations. Within the aging, rmap walks drive page table walks by
2678be976a0SYu Zhaoinserting hot densely populated page tables to the Bloom filters.
2688be976a0SYu ZhaoWithin the eviction, the PID controller uses refaults as the feedback
2698be976a0SYu Zhaoto select types to evict and tiers to protect.
270