1*8be976a0SYu Zhao.. SPDX-License-Identifier: GPL-2.0 2*8be976a0SYu Zhao 3*8be976a0SYu Zhao============= 4*8be976a0SYu ZhaoMulti-Gen LRU 5*8be976a0SYu Zhao============= 6*8be976a0SYu ZhaoThe multi-gen LRU is an alternative LRU implementation that optimizes 7*8be976a0SYu Zhaopage reclaim and improves performance under memory pressure. Page 8*8be976a0SYu Zhaoreclaim decides the kernel's caching policy and ability to overcommit 9*8be976a0SYu Zhaomemory. It directly impacts the kswapd CPU usage and RAM efficiency. 10*8be976a0SYu Zhao 11*8be976a0SYu ZhaoDesign overview 12*8be976a0SYu Zhao=============== 13*8be976a0SYu ZhaoObjectives 14*8be976a0SYu Zhao---------- 15*8be976a0SYu ZhaoThe design objectives are: 16*8be976a0SYu Zhao 17*8be976a0SYu Zhao* Good representation of access recency 18*8be976a0SYu Zhao* Try to profit from spatial locality 19*8be976a0SYu Zhao* Fast paths to make obvious choices 20*8be976a0SYu Zhao* Simple self-correcting heuristics 21*8be976a0SYu Zhao 22*8be976a0SYu ZhaoThe representation of access recency is at the core of all LRU 23*8be976a0SYu Zhaoimplementations. In the multi-gen LRU, each generation represents a 24*8be976a0SYu Zhaogroup of pages with similar access recency. Generations establish a 25*8be976a0SYu Zhao(time-based) common frame of reference and therefore help make better 26*8be976a0SYu Zhaochoices, e.g., between different memcgs on a computer or different 27*8be976a0SYu Zhaocomputers in a data center (for job scheduling). 28*8be976a0SYu Zhao 29*8be976a0SYu ZhaoExploiting spatial locality improves efficiency when gathering the 30*8be976a0SYu Zhaoaccessed bit. A rmap walk targets a single page and does not try to 31*8be976a0SYu Zhaoprofit from discovering a young PTE. A page table walk can sweep all 32*8be976a0SYu Zhaothe young PTEs in an address space, but the address space can be too 33*8be976a0SYu Zhaosparse to make a profit. The key is to optimize both methods and use 34*8be976a0SYu Zhaothem in combination. 35*8be976a0SYu Zhao 36*8be976a0SYu ZhaoFast paths reduce code complexity and runtime overhead. Unmapped pages 37*8be976a0SYu Zhaodo not require TLB flushes; clean pages do not require writeback. 38*8be976a0SYu ZhaoThese facts are only helpful when other conditions, e.g., access 39*8be976a0SYu Zhaorecency, are similar. With generations as a common frame of reference, 40*8be976a0SYu Zhaoadditional factors stand out. But obvious choices might not be good 41*8be976a0SYu Zhaochoices; thus self-correction is necessary. 42*8be976a0SYu Zhao 43*8be976a0SYu ZhaoThe benefits of simple self-correcting heuristics are self-evident. 44*8be976a0SYu ZhaoAgain, with generations as a common frame of reference, this becomes 45*8be976a0SYu Zhaoattainable. Specifically, pages in the same generation can be 46*8be976a0SYu Zhaocategorized based on additional factors, and a feedback loop can 47*8be976a0SYu Zhaostatistically compare the refault percentages across those categories 48*8be976a0SYu Zhaoand infer which of them are better choices. 49*8be976a0SYu Zhao 50*8be976a0SYu ZhaoAssumptions 51*8be976a0SYu Zhao----------- 52*8be976a0SYu ZhaoThe protection of hot pages and the selection of cold pages are based 53*8be976a0SYu Zhaoon page access channels and patterns. There are two access channels: 54*8be976a0SYu Zhao 55*8be976a0SYu Zhao* Accesses through page tables 56*8be976a0SYu Zhao* Accesses through file descriptors 57*8be976a0SYu Zhao 58*8be976a0SYu ZhaoThe protection of the former channel is by design stronger because: 59*8be976a0SYu Zhao 60*8be976a0SYu Zhao1. The uncertainty in determining the access patterns of the former 61*8be976a0SYu Zhao channel is higher due to the approximation of the accessed bit. 62*8be976a0SYu Zhao2. The cost of evicting the former channel is higher due to the TLB 63*8be976a0SYu Zhao flushes required and the likelihood of encountering the dirty bit. 64*8be976a0SYu Zhao3. The penalty of underprotecting the former channel is higher because 65*8be976a0SYu Zhao applications usually do not prepare themselves for major page 66*8be976a0SYu Zhao faults like they do for blocked I/O. E.g., GUI applications 67*8be976a0SYu Zhao commonly use dedicated I/O threads to avoid blocking rendering 68*8be976a0SYu Zhao threads. 69*8be976a0SYu Zhao 70*8be976a0SYu ZhaoThere are also two access patterns: 71*8be976a0SYu Zhao 72*8be976a0SYu Zhao* Accesses exhibiting temporal locality 73*8be976a0SYu Zhao* Accesses not exhibiting temporal locality 74*8be976a0SYu Zhao 75*8be976a0SYu ZhaoFor the reasons listed above, the former channel is assumed to follow 76*8be976a0SYu Zhaothe former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is 77*8be976a0SYu Zhaopresent, and the latter channel is assumed to follow the latter 78*8be976a0SYu Zhaopattern unless outlying refaults have been observed. 79*8be976a0SYu Zhao 80*8be976a0SYu ZhaoWorkflow overview 81*8be976a0SYu Zhao================= 82*8be976a0SYu ZhaoEvictable pages are divided into multiple generations for each 83*8be976a0SYu Zhao``lruvec``. The youngest generation number is stored in 84*8be976a0SYu Zhao``lrugen->max_seq`` for both anon and file types as they are aged on 85*8be976a0SYu Zhaoan equal footing. The oldest generation numbers are stored in 86*8be976a0SYu Zhao``lrugen->min_seq[]`` separately for anon and file types as clean file 87*8be976a0SYu Zhaopages can be evicted regardless of swap constraints. These three 88*8be976a0SYu Zhaovariables are monotonically increasing. 89*8be976a0SYu Zhao 90*8be976a0SYu ZhaoGeneration numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` 91*8be976a0SYu Zhaobits in order to fit into the gen counter in ``folio->flags``. Each 92*8be976a0SYu Zhaotruncated generation number is an index to ``lrugen->lists[]``. The 93*8be976a0SYu Zhaosliding window technique is used to track at least ``MIN_NR_GENS`` and 94*8be976a0SYu Zhaoat most ``MAX_NR_GENS`` generations. The gen counter stores a value 95*8be976a0SYu Zhaowithin ``[1, MAX_NR_GENS]`` while a page is on one of 96*8be976a0SYu Zhao``lrugen->lists[]``; otherwise it stores zero. 97*8be976a0SYu Zhao 98*8be976a0SYu ZhaoEach generation is divided into multiple tiers. A page accessed ``N`` 99*8be976a0SYu Zhaotimes through file descriptors is in tier ``order_base_2(N)``. Unlike 100*8be976a0SYu Zhaogenerations, tiers do not have dedicated ``lrugen->lists[]``. In 101*8be976a0SYu Zhaocontrast to moving across generations, which requires the LRU lock, 102*8be976a0SYu Zhaomoving across tiers only involves atomic operations on 103*8be976a0SYu Zhao``folio->flags`` and therefore has a negligible cost. A feedback loop 104*8be976a0SYu Zhaomodeled after the PID controller monitors refaults over all the tiers 105*8be976a0SYu Zhaofrom anon and file types and decides which tiers from which types to 106*8be976a0SYu Zhaoevict or protect. 107*8be976a0SYu Zhao 108*8be976a0SYu ZhaoThere are two conceptually independent procedures: the aging and the 109*8be976a0SYu Zhaoeviction. They form a closed-loop system, i.e., the page reclaim. 110*8be976a0SYu Zhao 111*8be976a0SYu ZhaoAging 112*8be976a0SYu Zhao----- 113*8be976a0SYu ZhaoThe aging produces young generations. Given an ``lruvec``, it 114*8be976a0SYu Zhaoincrements ``max_seq`` when ``max_seq-min_seq+1`` approaches 115*8be976a0SYu Zhao``MIN_NR_GENS``. The aging promotes hot pages to the youngest 116*8be976a0SYu Zhaogeneration when it finds them accessed through page tables; the 117*8be976a0SYu Zhaodemotion of cold pages happens consequently when it increments 118*8be976a0SYu Zhao``max_seq``. The aging uses page table walks and rmap walks to find 119*8be976a0SYu Zhaoyoung PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` 120*8be976a0SYu Zhaoand calls ``walk_page_range()`` with each ``mm_struct`` on this list 121*8be976a0SYu Zhaoto scan PTEs, and after each iteration, it increments ``max_seq``. For 122*8be976a0SYu Zhaothe latter, when the eviction walks the rmap and finds a young PTE, 123*8be976a0SYu Zhaothe aging scans the adjacent PTEs. For both, on finding a young PTE, 124*8be976a0SYu Zhaothe aging clears the accessed bit and updates the gen counter of the 125*8be976a0SYu Zhaopage mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. 126*8be976a0SYu Zhao 127*8be976a0SYu ZhaoEviction 128*8be976a0SYu Zhao-------- 129*8be976a0SYu ZhaoThe eviction consumes old generations. Given an ``lruvec``, it 130*8be976a0SYu Zhaoincrements ``min_seq`` when ``lrugen->lists[]`` indexed by 131*8be976a0SYu Zhao``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to 132*8be976a0SYu Zhaoevict from, it first compares ``min_seq[]`` to select the older type. 133*8be976a0SYu ZhaoIf both types are equally old, it selects the one whose first tier has 134*8be976a0SYu Zhaoa lower refault percentage. The first tier contains single-use 135*8be976a0SYu Zhaounmapped clean pages, which are the best bet. The eviction sorts a 136*8be976a0SYu Zhaopage according to its gen counter if the aging has found this page 137*8be976a0SYu Zhaoaccessed through page tables and updated its gen counter. It also 138*8be976a0SYu Zhaomoves a page to the next generation, i.e., ``min_seq+1``, if this page 139*8be976a0SYu Zhaowas accessed multiple times through file descriptors and the feedback 140*8be976a0SYu Zhaoloop has detected outlying refaults from the tier this page is in. To 141*8be976a0SYu Zhaothis end, the feedback loop uses the first tier as the baseline, for 142*8be976a0SYu Zhaothe reason stated earlier. 143*8be976a0SYu Zhao 144*8be976a0SYu ZhaoSummary 145*8be976a0SYu Zhao------- 146*8be976a0SYu ZhaoThe multi-gen LRU can be disassembled into the following parts: 147*8be976a0SYu Zhao 148*8be976a0SYu Zhao* Generations 149*8be976a0SYu Zhao* Rmap walks 150*8be976a0SYu Zhao* Page table walks 151*8be976a0SYu Zhao* Bloom filters 152*8be976a0SYu Zhao* PID controller 153*8be976a0SYu Zhao 154*8be976a0SYu ZhaoThe aging and the eviction form a producer-consumer model; 155*8be976a0SYu Zhaospecifically, the latter drives the former by the sliding window over 156*8be976a0SYu Zhaogenerations. Within the aging, rmap walks drive page table walks by 157*8be976a0SYu Zhaoinserting hot densely populated page tables to the Bloom filters. 158*8be976a0SYu ZhaoWithin the eviction, the PID controller uses refaults as the feedback 159*8be976a0SYu Zhaoto select types to evict and tiers to protect. 160