================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
well as for non __GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller cannot
sleep, either because it holds a spinlock or because it is in interrupt
context. The second may be that the caller is willing to fail the allocation
without incurring the overhead of page reclaim. This may happen for
opportunistic high-order allocation requests that have order-0 fallback
options. In such cases, the caller may also wish to avoid waking kswapd.

Allocation requests that omit __GFP_IO are made to prevent file system
deadlocks.

In the absence of non-sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (aka zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there are a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly the sizes of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is that, while balancing, we do not need to look at
sizes of lower class zones; the bad part is that we might balance too
frequently because we ignore possibly lower usage in the lower class zones.
Also, with a slight change in the allocation routine, it is possible to
reduce the memclass() macro to be a simple equality.

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones.
This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.

Note that if the size of the regular zone is huge compared to the dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt context
and all process contexts are sleeping.
For 2.3, kswapd does not really
need to balance the highmem zone, since interrupt context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below watermark[WMARK_MIN], the hysteretic field
low_on_memory gets set. This stays set until the number of free pages reaches
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (providing GFP_WAIT is set in the request).
Orthogonal to this is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is done when the number of free
pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.


(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages.
   (lkd@tantalophile.demon.co.uk)
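
As a rough illustration of the second balancing solution described earlier
(balance only when the free memory of a zone and all its lower class zones
falls below 1/64th of their combined size), here is a minimal user-space
sketch in C. The struct zone_pages type and zone_class_needs_balance() are
hypothetical names invented for this example and are not kernel interfaces.
The sketch assumes the zones are stored in an array ordered from the lowest
class (dma) to the highest (highmem).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone page counts; not the kernel's struct zone. */
struct zone_pages {
	unsigned long free;   /* free pages currently in the zone */
	unsigned long total;  /* total pages managed by the zone */
};

/*
 * Return true if zone idx should be balanced under the "second solution":
 * sum free and total pages over the zone and every lower class zone
 * (indices 0..idx) and compare the free count against 1/64th of the total.
 *
 * Assumption: zones[] is ordered lowest class first and idx is in range.
 */
bool zone_class_needs_balance(const struct zone_pages *zones, size_t idx)
{
	unsigned long free = 0, total = 0;

	assert(zones != NULL);

	for (size_t i = 0; i <= idx; i++) {
		free += zones[i].free;
		total += zones[i].total;
	}

	return free < total / 64;
}
```

Called with the index of the highest zone, the same function degenerates into
the 2.2-style global check, which can report that no balancing is needed even
when the dma zone is completely empty. Called with the dma zone's own index, it
flags that zone for balancing.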
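
The interaction between the three watermarks, low_on_memory and
zone_wake_kswapd described earlier may be easier to follow as code. The
following is a simplified user-space sketch, not the kernel implementation:
struct zone_sketch, zone_update_state() and zone_should_direct_reclaim() are
hypothetical names. The text only says when zone_wake_kswapd gets set, so
recomputing it on every update is an assumption made here to show that it
carries no hysteresis.

```c
#include <assert.h>
#include <stdbool.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical, simplified stand-in for the per-zone fields. */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long watermark[NR_WMARK];
	bool low_on_memory;     /* hysteretic: set below MIN, cleared at HIGH */
	bool zone_wake_kswapd;  /* not hysteretic: follows the LOW watermark */
};

/* Re-evaluate both flags after free_pages has changed. */
void zone_update_state(struct zone_sketch *z)
{
	assert(z->watermark[WMARK_MIN] <= z->watermark[WMARK_LOW]);
	assert(z->watermark[WMARK_LOW] <= z->watermark[WMARK_HIGH]);

	if (z->free_pages < z->watermark[WMARK_MIN])
		z->low_on_memory = true;
	else if (z->free_pages >= z->watermark[WMARK_HIGH])
		z->low_on_memory = false;
	/* Between MIN and HIGH the previous value is kept (hysteresis). */

	/* Assumption: the flag is recomputed rather than latched. */
	z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}

/*
 * An allocation tries to free pages itself only when the zone is low on
 * memory and the request is allowed to wait (GFP_WAIT in the text above).
 */
bool zone_should_direct_reclaim(const struct zone_sketch *z, bool can_wait)
{
	return z->low_on_memory && can_wait;
}
```

Note how a zone that dropped below WMARK_MIN and then recovered to a level
between WMARK_LOW and WMARK_HIGH still has low_on_memory set, while
zone_wake_kswapd is already clear.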