18be976a0SYu Zhao.. SPDX-License-Identifier: GPL-2.0 28be976a0SYu Zhao 38be976a0SYu Zhao============= 48be976a0SYu ZhaoMulti-Gen LRU 58be976a0SYu Zhao============= 68be976a0SYu ZhaoThe multi-gen LRU is an alternative LRU implementation that optimizes 78be976a0SYu Zhaopage reclaim and improves performance under memory pressure. Page 88be976a0SYu Zhaoreclaim decides the kernel's caching policy and ability to overcommit 98be976a0SYu Zhaomemory. It directly impacts the kswapd CPU usage and RAM efficiency. 108be976a0SYu Zhao 118be976a0SYu ZhaoDesign overview 128be976a0SYu Zhao=============== 138be976a0SYu ZhaoObjectives 148be976a0SYu Zhao---------- 158be976a0SYu ZhaoThe design objectives are: 168be976a0SYu Zhao 178be976a0SYu Zhao* Good representation of access recency 188be976a0SYu Zhao* Try to profit from spatial locality 198be976a0SYu Zhao* Fast paths to make obvious choices 208be976a0SYu Zhao* Simple self-correcting heuristics 218be976a0SYu Zhao 228be976a0SYu ZhaoThe representation of access recency is at the core of all LRU 238be976a0SYu Zhaoimplementations. In the multi-gen LRU, each generation represents a 248be976a0SYu Zhaogroup of pages with similar access recency. Generations establish a 258be976a0SYu Zhao(time-based) common frame of reference and therefore help make better 268be976a0SYu Zhaochoices, e.g., between different memcgs on a computer or different 278be976a0SYu Zhaocomputers in a data center (for job scheduling). 288be976a0SYu Zhao 298be976a0SYu ZhaoExploiting spatial locality improves efficiency when gathering the 308be976a0SYu Zhaoaccessed bit. A rmap walk targets a single page and does not try to 318be976a0SYu Zhaoprofit from discovering a young PTE. A page table walk can sweep all 328be976a0SYu Zhaothe young PTEs in an address space, but the address space can be too 338be976a0SYu Zhaosparse to make a profit. The key is to optimize both methods and use 348be976a0SYu Zhaothem in combination. 358be976a0SYu Zhao 368be976a0SYu ZhaoFast paths reduce code complexity and runtime overhead. Unmapped pages 378be976a0SYu Zhaodo not require TLB flushes; clean pages do not require writeback. 388be976a0SYu ZhaoThese facts are only helpful when other conditions, e.g., access 398be976a0SYu Zhaorecency, are similar. With generations as a common frame of reference, 408be976a0SYu Zhaoadditional factors stand out. But obvious choices might not be good 418be976a0SYu Zhaochoices; thus self-correction is necessary. 428be976a0SYu Zhao 438be976a0SYu ZhaoThe benefits of simple self-correcting heuristics are self-evident. 448be976a0SYu ZhaoAgain, with generations as a common frame of reference, this becomes 458be976a0SYu Zhaoattainable. Specifically, pages in the same generation can be 468be976a0SYu Zhaocategorized based on additional factors, and a feedback loop can 478be976a0SYu Zhaostatistically compare the refault percentages across those categories 488be976a0SYu Zhaoand infer which of them are better choices. 498be976a0SYu Zhao 508be976a0SYu ZhaoAssumptions 518be976a0SYu Zhao----------- 528be976a0SYu ZhaoThe protection of hot pages and the selection of cold pages are based 538be976a0SYu Zhaoon page access channels and patterns. There are two access channels: 548be976a0SYu Zhao 558be976a0SYu Zhao* Accesses through page tables 568be976a0SYu Zhao* Accesses through file descriptors 578be976a0SYu Zhao 588be976a0SYu ZhaoThe protection of the former channel is by design stronger because: 598be976a0SYu Zhao 608be976a0SYu Zhao1. The uncertainty in determining the access patterns of the former 618be976a0SYu Zhao channel is higher due to the approximation of the accessed bit. 628be976a0SYu Zhao2. The cost of evicting the former channel is higher due to the TLB 638be976a0SYu Zhao flushes required and the likelihood of encountering the dirty bit. 648be976a0SYu Zhao3. The penalty of underprotecting the former channel is higher because 658be976a0SYu Zhao applications usually do not prepare themselves for major page 668be976a0SYu Zhao faults like they do for blocked I/O. E.g., GUI applications 678be976a0SYu Zhao commonly use dedicated I/O threads to avoid blocking rendering 688be976a0SYu Zhao threads. 698be976a0SYu Zhao 708be976a0SYu ZhaoThere are also two access patterns: 718be976a0SYu Zhao 728be976a0SYu Zhao* Accesses exhibiting temporal locality 738be976a0SYu Zhao* Accesses not exhibiting temporal locality 748be976a0SYu Zhao 758be976a0SYu ZhaoFor the reasons listed above, the former channel is assumed to follow 768be976a0SYu Zhaothe former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is 778be976a0SYu Zhaopresent, and the latter channel is assumed to follow the latter 788be976a0SYu Zhaopattern unless outlying refaults have been observed. 798be976a0SYu Zhao 808be976a0SYu ZhaoWorkflow overview 818be976a0SYu Zhao================= 828be976a0SYu ZhaoEvictable pages are divided into multiple generations for each 838be976a0SYu Zhao``lruvec``. The youngest generation number is stored in 848be976a0SYu Zhao``lrugen->max_seq`` for both anon and file types as they are aged on 858be976a0SYu Zhaoan equal footing. The oldest generation numbers are stored in 868be976a0SYu Zhao``lrugen->min_seq[]`` separately for anon and file types as clean file 878be976a0SYu Zhaopages can be evicted regardless of swap constraints. These three 888be976a0SYu Zhaovariables are monotonically increasing. 898be976a0SYu Zhao 908be976a0SYu ZhaoGeneration numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` 918be976a0SYu Zhaobits in order to fit into the gen counter in ``folio->flags``. Each 926df1b221SYu Zhaotruncated generation number is an index to ``lrugen->folios[]``. The 938be976a0SYu Zhaosliding window technique is used to track at least ``MIN_NR_GENS`` and 948be976a0SYu Zhaoat most ``MAX_NR_GENS`` generations. The gen counter stores a value 958be976a0SYu Zhaowithin ``[1, MAX_NR_GENS]`` while a page is on one of 966df1b221SYu Zhao``lrugen->folios[]``; otherwise it stores zero. 978be976a0SYu Zhao 988be976a0SYu ZhaoEach generation is divided into multiple tiers. A page accessed ``N`` 998be976a0SYu Zhaotimes through file descriptors is in tier ``order_base_2(N)``. Unlike 1006df1b221SYu Zhaogenerations, tiers do not have dedicated ``lrugen->folios[]``. In 1018be976a0SYu Zhaocontrast to moving across generations, which requires the LRU lock, 1028be976a0SYu Zhaomoving across tiers only involves atomic operations on 1038be976a0SYu Zhao``folio->flags`` and therefore has a negligible cost. A feedback loop 1048be976a0SYu Zhaomodeled after the PID controller monitors refaults over all the tiers 1058be976a0SYu Zhaofrom anon and file types and decides which tiers from which types to 1068be976a0SYu Zhaoevict or protect. 1078be976a0SYu Zhao 1088be976a0SYu ZhaoThere are two conceptually independent procedures: the aging and the 1098be976a0SYu Zhaoeviction. They form a closed-loop system, i.e., the page reclaim. 1108be976a0SYu Zhao 1118be976a0SYu ZhaoAging 1128be976a0SYu Zhao----- 1138be976a0SYu ZhaoThe aging produces young generations. Given an ``lruvec``, it 1148be976a0SYu Zhaoincrements ``max_seq`` when ``max_seq-min_seq+1`` approaches 1158be976a0SYu Zhao``MIN_NR_GENS``. The aging promotes hot pages to the youngest 1168be976a0SYu Zhaogeneration when it finds them accessed through page tables; the 1178be976a0SYu Zhaodemotion of cold pages happens consequently when it increments 1188be976a0SYu Zhao``max_seq``. The aging uses page table walks and rmap walks to find 1198be976a0SYu Zhaoyoung PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` 1208be976a0SYu Zhaoand calls ``walk_page_range()`` with each ``mm_struct`` on this list 1218be976a0SYu Zhaoto scan PTEs, and after each iteration, it increments ``max_seq``. For 1228be976a0SYu Zhaothe latter, when the eviction walks the rmap and finds a young PTE, 1238be976a0SYu Zhaothe aging scans the adjacent PTEs. For both, on finding a young PTE, 1248be976a0SYu Zhaothe aging clears the accessed bit and updates the gen counter of the 1258be976a0SYu Zhaopage mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. 1268be976a0SYu Zhao 1278be976a0SYu ZhaoEviction 1288be976a0SYu Zhao-------- 1298be976a0SYu ZhaoThe eviction consumes old generations. Given an ``lruvec``, it 1306df1b221SYu Zhaoincrements ``min_seq`` when ``lrugen->folios[]`` indexed by 1318be976a0SYu Zhao``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to 1328be976a0SYu Zhaoevict from, it first compares ``min_seq[]`` to select the older type. 1338be976a0SYu ZhaoIf both types are equally old, it selects the one whose first tier has 1348be976a0SYu Zhaoa lower refault percentage. The first tier contains single-use 1358be976a0SYu Zhaounmapped clean pages, which are the best bet. The eviction sorts a 1368be976a0SYu Zhaopage according to its gen counter if the aging has found this page 1378be976a0SYu Zhaoaccessed through page tables and updated its gen counter. It also 1388be976a0SYu Zhaomoves a page to the next generation, i.e., ``min_seq+1``, if this page 1398be976a0SYu Zhaowas accessed multiple times through file descriptors and the feedback 1408be976a0SYu Zhaoloop has detected outlying refaults from the tier this page is in. To 1418be976a0SYu Zhaothis end, the feedback loop uses the first tier as the baseline, for 1428be976a0SYu Zhaothe reason stated earlier. 1438be976a0SYu Zhao 144*7b8144e6ST.J. AlumbaughWorking set protection 145*7b8144e6ST.J. Alumbaugh---------------------- 146*7b8144e6ST.J. AlumbaughEach generation is timestamped at birth. If ``lru_gen_min_ttl`` is 147*7b8144e6ST.J. Alumbaughset, an ``lruvec`` is protected from the eviction when its oldest 148*7b8144e6ST.J. Alumbaughgeneration was born within ``lru_gen_min_ttl`` milliseconds. In other 149*7b8144e6ST.J. Alumbaughwords, it prevents the working set of ``lru_gen_min_ttl`` milliseconds 150*7b8144e6ST.J. Alumbaughfrom getting evicted. The OOM killer is triggered if this working set 151*7b8144e6ST.J. Alumbaughcannot be kept in memory. 152*7b8144e6ST.J. Alumbaugh 153*7b8144e6ST.J. AlumbaughThis time-based approach has the following advantages: 154*7b8144e6ST.J. Alumbaugh 155*7b8144e6ST.J. Alumbaugh1. It is easier to configure because it is agnostic to applications 156*7b8144e6ST.J. Alumbaugh and memory sizes. 157*7b8144e6ST.J. Alumbaugh2. It is more reliable because it is directly wired to the OOM killer. 158*7b8144e6ST.J. Alumbaugh 1598be976a0SYu ZhaoSummary 1608be976a0SYu Zhao------- 1618be976a0SYu ZhaoThe multi-gen LRU can be disassembled into the following parts: 1628be976a0SYu Zhao 1638be976a0SYu Zhao* Generations 1648be976a0SYu Zhao* Rmap walks 1658be976a0SYu Zhao* Page table walks 1668be976a0SYu Zhao* Bloom filters 1678be976a0SYu Zhao* PID controller 1688be976a0SYu Zhao 1698be976a0SYu ZhaoThe aging and the eviction form a producer-consumer model; 1708be976a0SYu Zhaospecifically, the latter drives the former by the sliding window over 1718be976a0SYu Zhaogenerations. Within the aging, rmap walks drive page table walks by 1728be976a0SYu Zhaoinserting hot densely populated page tables to the Bloom filters. 1738be976a0SYu ZhaoWithin the eviction, the PID controller uses refaults as the feedback 1748be976a0SYu Zhaoto select types to evict and tiers to protect. 175