============================
Transparent Hugepage Support
============================

This document describes design principles for Transparent Hugepage (THP)
support and its interaction with other parts of the memory management
system.

Design principles
=================

- "graceful fallback": mm components which don't have transparent hugepage
  knowledge fall back to breaking a huge pmd mapping into a table of ptes
  and, if necessary, splitting the transparent hugepage. Therefore these
  components can continue working on regular pages or regular pte mappings.

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed into
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated onto hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to prevent unmovable pages from fragmenting all the memory, but such a
  tweak is not specific to transparent hugepage support and it's a
  generic feature that applies to all dynamic high order allocations in
  the kernel)

get_user_pages and follow_page
==============================

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would on hugetlbfs). Most
GUP users will only care about the actual physical address of the page
and its temporary pinning, to be released after the I/O is complete, so
they won't ever notice that the page is huge. But if any driver is going
to inspect the page structure of a tail page (for example to check
page->mapping or other fields that are relevant for the head page and
not the tail page), it should be updated to check the head page instead.
Taking a reference on any head/tail page prevents the page from being
split by anyone.

.. note::
   these aren't new constraints to the GUP API, and they match the
   same constraints that apply to hugetlbfs too, so any driver capable
   of handling GUP on hugetlbfs will also work fine on transparent
   hugepage backed mappings.

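As a minimal, hedged sketch of such a driver update (the helper and its
file-backed check are made up for illustration; only compound_head() and
the head/tail relationship come from the description above)::

	/* Hypothetical driver helper: per-folio state lives in the head page. */
	static bool my_driver_page_is_file_backed(struct page *page)
	{
		/* page may be a THP tail page: look at the head page instead */
		struct page *head = compound_head(page);

		return head->mapping && !PageAnon(head);
	}
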
Graceful fallback
=================

Code walking pagetables but unaware about huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one-liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage, for example. split_huge_page() can
fail if the page is pinned and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one-liner
change::

	diff --git a/mm/mremap.c b/mm/mremap.c
	--- a/mm/mremap.c
	+++ b/mm/mremap.c
	@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
			return NULL;

		pmd = pmd_offset(pud, addr);
	+	split_huge_pmd(vma, pmd, addr);
		if (pmd_none_or_clear_bad(pmd))
			return NULL;

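For the split_huge_page() case above, a hedged sketch of a caller that
splits a hugepage it cannot handle natively and falls back when the
split fails (the my_* helpers are made-up placeholders; split_huge_page()
expects the caller to hold a reference and the page lock)::

	/* Hypothetical caller that cannot handle a huge page natively. */
	static int my_process_page(struct page *page)
	{
		struct page *head = compound_head(page);

		if (PageTransHuge(head)) {
			lock_page(head);	/* split_huge_page() expects the page locked */
			if (split_huge_page(head)) {
				/* split failed, e.g. extra pins: take a fallback path */
				unlock_page(head);
				return my_process_huge_fallback(page);
			}
			unlock_page(head);	/* head is now a regular, small page */
		}
		return my_process_small_page(page);
	}
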
Locking in hugepage aware code
==============================

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_lock in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_lock in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise, you can proceed to process the huge pmd and the
hugepage natively. Once finished, you can drop the page table lock.

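Put together, a hedged sketch of that pattern (process_huge_pmd() and
process_ptes() are made-up placeholders for your huge and regular code
paths; the caller is assumed to hold the mmap_lock)::

	static void walk_one_pmd(struct vm_area_struct *vma, pud_t *pud,
				 unsigned long addr, unsigned long end)
	{
		pmd_t *pmd = pmd_offset(pud, addr);
		spinlock_t *ptl;

		if (pmd_trans_huge(*pmd)) {
			ptl = pmd_lock(vma->vm_mm, pmd);
			if (pmd_trans_huge(*pmd)) {
				/* still huge: handle the whole pmd/hugepage natively */
				process_huge_pmd(vma, pmd, addr);
				spin_unlock(ptl);
				return;
			}
			/* raced with split_huge_pmd(): fall back to the pte path */
			spin_unlock(ptl);
		}
		process_ptes(vma, pmd, addr, end);	/* the old, pte-based code path */
	}
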
Refcounts and transparent huge pages
====================================

Refcounting on THP is mostly consistent with refcounting on other compound
pages:

  - get_page()/put_page() and GUP operate on the folio->_refcount.

  - ->_refcount in tail pages is always zero: get_page_unless_zero() never
    succeeds on tail pages.

  - map/unmap of a PMD entry for the whole THP increments/decrements
    folio->_entire_mapcount, increments/decrements folio->_large_mapcount
    and also increments/decrements folio->_nr_pages_mapped by
    ENTIRELY_MAPPED when _entire_mapcount goes from -1 to 0 or 0 to -1.

  - map/unmap of individual pages with a PTE entry increments/decrements
    page->_mapcount, increments/decrements folio->_large_mapcount and also
    increments/decrements folio->_nr_pages_mapped when page->_mapcount
    goes from -1 to 0 or 0 to -1, as this counts the number of pages
    mapped by PTE.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries, but we don't have enough information on how to distribute any
additional pins (e.g. from get_user_pages). split_huge_page() fails any
request to split a pinned huge page: it expects the page count to be
equal to the sum of the mapcounts of all sub-pages plus one (the
split_huge_page caller must have a reference to the head page).

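As a conceptual, hedged illustration of that expectation (not the
in-tree check, which also has to account for extra references such as
the page cache)::

	/*
	 * Conceptual sketch only: refuse the split when there are pins we
	 * cannot attribute, i.e. when the folio's reference count exceeds
	 * what its mappings plus the caller's own reference account for.
	 */
	if (folio_ref_count(folio) != folio_mapcount(folio) + 1)
		return -EAGAIN;	/* pinned (e.g. by GUP): cannot split */
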
split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate way
a scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents the
scanner from getting a reference to a tail page up to that point. After the
atomic_add() we don't care about the ->_refcount value. We already know how
many references should be uncharged from the head page.

For the head page get_page_unless_zero() will succeed and we don't mind.
It's clear where references should go after the split: they will stay on
the head page.

Note that split_huge_pmd() doesn't have any limitations on refcounting:
the pmd can be split at any point and splitting never fails.

Partial unmap and deferred_split_folio()
========================================

Unmapping part of a THP (with munmap() or another way) is not going to
free memory immediately. Instead, we detect that a subpage of the THP is
not in use in folio_remove_rmap_*() and queue the THP for splitting in
case memory pressure comes. Splitting will free up the unused subpages.

Splitting the page right away is not an option due to the locking context
in the place where we can detect a partial unmap. It also might be
counterproductive, since in many cases a partial unmap happens during
exit(2) if a THP crosses a VMA boundary.

The function deferred_split_folio() is used to queue a folio for splitting.
The splitting itself will happen when we get memory pressure via the
shrinker interface.