============================
Transparent Hugepage Support
============================

This document describes design principles for Transparent Hugepage (THP)
support and its interaction with other parts of the memory management
system.

Design principles
=================

- "graceful fallback": mm components which don't have transparent hugepage
  knowledge fall back to breaking a huge pmd mapping into a table of ptes
  and, if necessary, splitting the transparent hugepage. Therefore these
  components can continue working on regular pages or regular pte mappings.

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed into
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated onto hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to prevent unmovable pages from fragmenting all the memory, but such a
  tweak is not specific to transparent hugepage support and it's a
  generic feature that applies to all dynamic high order allocations in
  the kernel)

get_user_pages and follow_page
==============================

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would on hugetlbfs). Most
GUP users will only care about the actual physical address of the page
and its temporary pinning, to be released after the I/O is complete, so
they won't ever notice that the page is huge. But if any driver is going
to inspect the page structure of a tail page (for example to check
page->mapping or other fields that are relevant for the head page and
not the tail page), it should be updated to check the head page instead.
Taking a reference on any head/tail page prevents the page from being
split by anyone.

.. note::
   these aren't new constraints to the GUP API, and they match the
   same constraints that apply to hugetlbfs too, so any driver capable
   of handling GUP on hugetlbfs will also work fine on transparent
   hugepage backed mappings.

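As a minimal, hedged sketch of such a driver update (the helper and its
file-backed check are made up for illustration; only compound_head() and
the head/tail relationship come from the description above)::

	/* Hypothetical driver helper: per-folio state lives in the head page. */
	static bool my_driver_page_is_file_backed(struct page *page)
	{
		/* page may be a THP tail page: look at the head page instead */
		struct page *head = compound_head(page);

		return head->mapping && !PageAnon(head);
	}
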
Graceful fallback
=================

Code walking pagetables but unaware about huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one-liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage, for example. split_huge_page() can
fail if the page is pinned and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one-liner
change::

	diff --git a/mm/mremap.c b/mm/mremap.c
	--- a/mm/mremap.c
	+++ b/mm/mremap.c
	@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
			return NULL;

		pmd = pmd_offset(pud, addr);
	+	split_huge_pmd(vma, pmd, addr);
		if (pmd_none_or_clear_bad(pmd))
			return NULL;

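For the split_huge_page() case above, a hedged sketch of a caller that
splits a hugepage it cannot handle natively and falls back when the
split fails (the my_* helpers are made-up placeholders; split_huge_page()
expects the caller to hold a reference and the page lock)::

	/* Hypothetical caller that cannot handle a huge page natively. */
	static int my_process_page(struct page *page)
	{
		struct page *head = compound_head(page);

		if (PageTransHuge(head)) {
			lock_page(head);	/* split_huge_page() expects the page locked */
			if (split_huge_page(head)) {
				/* split failed, e.g. extra pins: take a fallback path */
				unlock_page(head);
				return my_process_huge_fallback(page);
			}
			unlock_page(head);	/* head is now a regular, small page */
		}
		return my_process_small_page(page);
	}
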
Locking in hugepage aware code
==============================

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_lock in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_lock in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise, you can proceed to process the huge pmd and the
hugepage natively. Once finished, you can drop the page table lock.

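Put together, a hedged sketch of that pattern (process_huge_pmd() and
process_ptes() are made-up placeholders for your huge and regular code
paths; the caller is assumed to hold the mmap_lock)::

	static void walk_one_pmd(struct vm_area_struct *vma, pud_t *pud,
				 unsigned long addr, unsigned long end)
	{
		pmd_t *pmd = pmd_offset(pud, addr);
		spinlock_t *ptl;

		if (pmd_trans_huge(*pmd)) {
			ptl = pmd_lock(vma->vm_mm, pmd);
			if (pmd_trans_huge(*pmd)) {
				/* still huge: handle the whole pmd/hugepage natively */
				process_huge_pmd(vma, pmd, addr);
				spin_unlock(ptl);
				return;
			}
			/* raced with split_huge_pmd(): fall back to the pte path */
			spin_unlock(ptl);
		}
		process_ptes(vma, pmd, addr, end);	/* the old, pte-based code path */
	}
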
Refcounts and transparent huge pages
====================================

Refcounting on THP is mostly consistent with refcounting on other compound
pages:

  - get_page()/put_page() and GUP operate on the folio->_refcount.

  - ->_refcount in tail pages is always zero: get_page_unless_zero() never
    succeeds on tail pages.

  - map/unmap of a PMD entry for the whole THP increments/decrements
    folio->_entire_mapcount, increments/decrements folio->_large_mapcount
    and also increments/decrements folio->_nr_pages_mapped by
    ENTIRELY_MAPPED when _entire_mapcount goes from -1 to 0 or 0 to -1.

  - map/unmap of individual pages with a PTE entry increments/decrements
    page->_mapcount, increments/decrements folio->_large_mapcount and also
    increments/decrements folio->_nr_pages_mapped when page->_mapcount
    goes from -1 to 0 or 0 to -1, as this counts the number of pages
    mapped by PTE.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries, but we don't have enough information on how to distribute any
additional pins (e.g. from get_user_pages). split_huge_page() fails any
request to split a pinned huge page: it expects the page count to be
equal to the sum of the mapcounts of all sub-pages plus one (the
split_huge_page caller must have a reference to the head page).

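As a conceptual, hedged illustration of that expectation (not the
in-tree check, which also has to account for extra references such as
the page cache)::

	/*
	 * Conceptual sketch only: refuse the split when there are pins we
	 * cannot attribute, i.e. when the folio's reference count exceeds
	 * what its mappings plus the caller's own reference account for.
	 */
	if (folio_ref_count(folio) != folio_mapcount(folio) + 1)
		return -EAGAIN;	/* pinned (e.g. by GUP): cannot split */
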
split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate way
a scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents the
scanner from getting a reference to a tail page up to that point. After the
atomic_add() we don't care about the ->_refcount value. We already know how
many references should be uncharged from the head page.

For the head page get_page_unless_zero() will succeed and we don't mind.
It's clear where references should go after the split: they will stay on
the head page.

Note that split_huge_pmd() doesn't have any limitations on refcounting:
the pmd can be split at any point and splitting never fails.

Partial unmap and deferred_split_folio()
========================================

Unmapping part of a THP (with munmap() or another way) is not going to
free memory immediately. Instead, we detect that a subpage of the THP is
not in use in folio_remove_rmap_*() and queue the THP for splitting in
case memory pressure comes. Splitting will free up the unused subpages.

Splitting the page right away is not an option due to the locking context
in the place where we can detect a partial unmap. It also might be
counterproductive, since in many cases a partial unmap happens during
exit(2) if a THP crosses a VMA boundary.

The function deferred_split_folio() is used to queue a folio for splitting.
The splitting itself will happen when we get memory pressure via the
shrinker interface.