.. _transhuge:

============================
Transparent Hugepage Support
============================

This document describes design principles for Transparent Hugepage (THP)
support and its interaction with other parts of the memory management
system.

Design principles
=================

- "graceful fallback": mm components which don't have transparent hugepage
  knowledge fall back to breaking a huge pmd mapping into a table of ptes and,
  if necessary, splitting a transparent hugepage. Therefore these components
  can continue working on the regular pages or regular pte mappings.

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed in
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated to hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to prevent unmovable pages from fragmenting all the memory, but such a
  tweak is not specific to transparent hugepage support and it's a
  generic feature that applies to all dynamic high order allocations in
  the kernel)

get_user_pages and follow_page
==============================

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most GUP users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice the fact that the page is huge. But
if any driver is going to mangle the page structure of the tail
page (like checking page->mapping or other bits that are relevant
for the head page and not the tail page), it should be updated to
check the head page instead. Taking a reference on any head/tail page
would prevent the page from being split by anyone.

.. note::
   These aren't new constraints to the GUP API, and they match the
   same constraints that apply to hugetlbfs too, so any driver capable
   of handling GUP on hugetlbfs will also work fine on transparent
   hugepage backed mappings.

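For example, a driver that wants to look at page->mapping for a page
obtained through GUP can simply go through the head page. A minimal
sketch, with a hypothetical helper name (not taken from any real driver)::

	/*
	 * Hypothetical driver helper: "page" may be a tail page of a
	 * THP, but ->mapping is only meaningful on the head page.
	 * compound_head() returns the page itself for non-tail pages.
	 */
	static bool my_driver_page_has_mapping(struct page *page)
	{
		struct page *head = compound_head(page);

		return head->mapping != NULL;
	}
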
Graceful fallback
=================

Code walking pagetables but unaware of huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one-liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage, for example. split_huge_page() can fail
if the page is pinned and you must handle this correctly.

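A minimal sketch of that pattern (illustrative only, not taken from the
kernel source; split_huge_page() expects the caller to hold a reference
on the page and to have it locked, and returns non-zero when the split
fails)::

	/* Try to turn the THP into regular pages before processing it. */
	lock_page(page);
	if (split_huge_page(page)) {
		/* pinned or otherwise unsplittable: use a fallback path */
		unlock_page(page);
		return -EBUSY;
	}
	unlock_page(page);
	/* "page" is now a regular, non-compound page */
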
Example to make mremap.c transparent hugepage aware with a one-liner
change::

	diff --git a/mm/mremap.c b/mm/mremap.c
	--- a/mm/mremap.c
	+++ b/mm/mremap.c
	@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
			return NULL;

		pmd = pmd_offset(pud, addr);
	+	split_huge_pmd(vma, pmd, addr);
		if (pmd_none_or_clear_bad(pmd))
			return NULL;

Locking in hugepage aware code
==============================

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_lock in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_lock in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise, you can proceed to process the huge pmd and the
hugepage natively. Once finished, you can drop the page table lock.

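Put together, a huge pmd aware pagetable walk following the scheme above
could look like this (a sketch only: it assumes mmap_lock is already held
and the actual processing is left out)::

	pmd = pmd_offset(pud, addr);
	if (pmd_trans_huge(*pmd)) {
		spinlock_t *ptl = pmd_lock(mm, pmd);

		/* re-check under the page table lock */
		if (pmd_trans_huge(*pmd)) {
			/* ... process the huge pmd and hugepage natively ... */
			spin_unlock(ptl);
			return;
		}
		/* raced with split_huge_pmd(): fall back to the pte path */
		spin_unlock(ptl);
	}
	/* ... regular pte-walking code ... */

In-tree code often uses the pmd_trans_huge_lock() helper, which performs
the locked re-check and returns the page table lock only when the pmd is
indeed huge.
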
Refcounts and transparent huge pages
====================================

Refcounting on THP is mostly consistent with refcounting on other compound
pages:

  - get_page()/put_page() and GUP operate on the folio->_refcount.

  - ->_refcount in tail pages is always zero: get_page_unless_zero() never
    succeeds on tail pages.

  - map/unmap of a PMD entry for the whole THP increments/decrements
    folio->_entire_mapcount and also increments/decrements
    folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
    goes from -1 to 0 or 0 to -1.

  - map/unmap of individual pages with PTE entry increments/decrements
    page->_mapcount and also increments/decrements folio->_nr_pages_mapped
    when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
    the number of pages mapped by PTE.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries, but we don't have enough information on how to distribute any
additional pins (i.e. from get_user_pages). split_huge_page() fails any
request to split a pinned huge page: it expects the page count to be equal to
the sum of the mapcounts of all sub-pages plus one (the split_huge_page caller
must have a reference to the head page).

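Expressed as a check, the rule above is roughly the following (an
illustration only, not the exact kernel test, which also has to account
for extra references such as the swap cache)::

	/*
	 * Illustration of the rule above: if the refcount is not fully
	 * explained by the page table mappings plus the caller's own
	 * reference, the page has extra pins and cannot be split.
	 */
	if (page_count(head) != total_mapcount(head) + 1)
		return -EBUSY;	/* what split_huge_page() reports on failure */
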
split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate way
a scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents the
scanner from getting a reference to the tail page up to that point. After the
atomic_add() we don't care about the ->_refcount value. We already know how
many references should be uncharged from the head page.

For the head page get_page_unless_zero() will succeed and we don't mind. It's
clear where references should go after split: they will stay on the head page.

Note that split_huge_pmd() doesn't have any limitations on refcounting:
the pmd can be split at any point and never fails.

Partial unmap and deferred_split_huge_page()
============================================

Unmapping part of a THP (with munmap() or other means) is not going to free
memory immediately. Instead, we detect that a subpage of the THP is not in use
in page_remove_rmap() and queue the THP for splitting if memory pressure
comes. Splitting will free up the unused subpages.

Splitting the page right away is not an option due to the locking context in
the place where we can detect partial unmap. It also might be
counterproductive since in many cases partial unmap happens during exit(2) if
a THP crosses a VMA boundary.

The function deferred_split_huge_page() is used to queue a page for splitting.
The splitting itself will happen when we get memory pressure via the shrinker
interface.
172