================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
well as for non __GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller cannot
sleep because it holds a spinlock or is in interrupt context. The second may
be that the caller is willing to fail the allocation without incurring the
overhead of page reclaim. This may happen for opportunistic high-order
allocation requests that have order-0 fallback options. In such cases,
the caller may also wish to avoid waking kswapd.

Non __GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non-sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (when zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
On the other hand, if there are a lot of free dma pages, it is preferable
to satisfy regular memory requests by allocating from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.
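
To make the 2.2 problem concrete, here is a minimal sketch of a global
free-page check of the kind described above. The struct zone_sketch type and
the global_needs_balancing() helper are hypothetical names used only for
illustration; they are not kernel interfaces, and the page counts in the note
below are made up.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct zone_sketch {
	unsigned long size;	/* total pages in the zone */
	unsigned long free;	/* currently free pages in the zone */
};

/*
 * 2.2-style trigger: reclaim starts only when the total number of free
 * pages, summed over all zones, drops below 1/64th of total memory.
 */
bool global_needs_balancing(const struct zone_sketch *zones, size_t nr)
{
	unsigned long total = 0, free = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		total += zones[i].size;
		free += zones[i].free;
	}
	return free < total / 64;
}
```

With a 4096-page dma zone that has no free pages and a 61440-page regular
zone that has 2000 free pages, the threshold is 65536 / 64 = 1024 pages, so
this check reports that no balancing is needed even though the dma zone is
completely empty.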

In 2.3, zone balancing can be done in one of two ways. In the first,
depending on the zone size (and possibly on the sizes of lower class
zones), we can decide at init time how many free pages we should aim for
while balancing any zone. The good part is that, while balancing, we do
not need to look at the sizes of lower class zones; the bad part is that
we might balance too frequently because we ignore possibly lower usage in
the lower class zones. Also, with a slight change in the allocation
routine, it is possible to reduce the memclass() macro to be a simple
equality.
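
A rough sketch of this per-zone approach follows. The text does not say how
the free-page target is derived from the zone size, so the divisor parameter
is purely an assumption, as are the struct zone_sketch type and both helper
names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct zone_sketch {
	unsigned long size;	/* total pages in the zone */
	unsigned long free;	/* currently free pages in the zone */
	unsigned long target;	/* free pages to aim for, fixed at init time */
};

/* Fix the target once at init time. The divisor is an assumed policy knob. */
void zone_init_target(struct zone_sketch *z, unsigned long divisor)
{
	assert(divisor != 0);
	z->target = z->size / divisor;
}

/* Balancing decision: looks only at this zone, never at lower class zones. */
bool zone_below_target(const struct zone_sketch *z)
{
	return z->free < z->target;
}
```

For a 61440-page regular zone and an assumed divisor of 64, the target is 960
free pages. A zone with 500 free pages is then balanced regardless of how
many pages are free in the dma zone, which is the "too frequent balancing"
drawback mentioned above.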

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.
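
The cumulative rule can be sketched as follows. This is an illustration only:
struct zone_sketch and zone_needs_balancing() are hypothetical names, and
storing the zones in an array ordered from the lowest class to the highest is
an assumption about the data layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct zone_sketch {
	unsigned long size;	/* total pages in the zone */
	unsigned long free;	/* currently free pages in the zone */
};

/*
 * zones[] is assumed to be ordered from the lowest class to the highest
 * (for example dma, regular, highmem). Zone idx needs balancing when the
 * free pages of zones[0..idx] together fall below 1/64th of the combined
 * size of zones[0..idx].
 */
bool zone_needs_balancing(const struct zone_sketch *zones, size_t nr,
			  size_t idx)
{
	unsigned long total = 0, free = 0;
	size_t i;

	assert(idx < nr);
	for (i = 0; i <= idx; i++) {
		total += zones[i].size;
		free += zones[i].free;
	}
	return free < total / 64;
}
```

With an empty 4096-page dma zone and a 61440-page regular zone that has 2000
free pages, the dma zone (index 0) is flagged because 0 is below 4096 / 64 =
64, while the regular zone (index 1) is not, because 2000 free pages exceed
65536 / 64 = 1024.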

Note that if the size of the regular zone is huge compared to the dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since interrupt context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below watermark[WMARK_MIN], the hysteresis field
low_on_memory gets set. This stays set until the number of free pages reaches
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (provided GFP_WAIT is set in the request).
Orthogonal to this is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is made when the number of free
pages is below watermark[WMARK_LOW], in which case zone_wake_kswapd is also set.
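
The two decisions can be sketched as a small state update. The field names
follow the text, but struct zone_sketch and zone_update_flags() are
hypothetical, and clearing zone_wake_kswapd as soon as free pages are back at
or above watermark[WMARK_LOW] is an assumption: the text only says when the
flag is set.

```c
#include <stdbool.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long watermark[NR_WMARK];
	bool low_on_memory;	/* hysteresis flag */
	bool zone_wake_kswapd;	/* plain threshold flag */
};

void zone_update_flags(struct zone_sketch *z)
{
	/*
	 * Hysteresis: set below WMARK_MIN, cleared only once free pages
	 * have climbed back to WMARK_HIGH. In between, keep the old value.
	 */
	if (z->free_pages < z->watermark[WMARK_MIN])
		z->low_on_memory = true;
	else if (z->free_pages >= z->watermark[WMARK_HIGH])
		z->low_on_memory = false;

	/* No hysteresis: depends only on the current free page count. */
	z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}
```

With watermarks of 10, 20 and 30 pages, a zone that drops to 5 free pages and
then recovers to 25 still has low_on_memory set, because it has not yet
reached watermark[WMARK_HIGH], while zone_wake_kswapd is already clear again
under the assumption stated above.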


(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages. (lkd@tantalophile.demon.co.uk)