.. SPDX-License-Identifier: GPL-2.0

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. There could also be several contiguous ranges at
completely distinct addresses. In addition, on NUMA systems different
memory banks are attached to different CPUs.

Linux abstracts this diversity using one of two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

Both memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there is a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. However, the `mem_map` array
is not usable until the call to :c:func:`memblock_free_all` that hands
all the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

`ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
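
The FLATMEM conversion can be modelled in a few lines of userspace C.
This is a minimal sketch, not the kernel implementation: `struct page`
is a one-field stand-in, the `flat_` helper names are hypothetical, and
the values of `ARCH_PFN_OFFSET` and `MAX_PAGES` are assumptions chosen
only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real structure is much larger. */
struct page {
	unsigned long flags;
};

/* Assumed for illustration: RAM starts at PFN 0x80000 and spans 1024 frames. */
#define ARCH_PFN_OFFSET	0x80000UL
#define MAX_PAGES	1024UL

/* One entry per page frame in the covered range, holes included. */
static struct page mem_map[MAX_PAGES];

/* PFN -> struct page: subtract the offset and index into mem_map. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
	assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < MAX_PAGES);
	return &mem_map[pfn - ARCH_PFN_OFFSET];
}

/* struct page -> PFN: the pointer difference is the index; add the offset back. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are plain pointer arithmetic on a single array, which is
why this model suits systems with mostly contiguous physical memory.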

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored together with additional
encoded information that aids section management. The section size and
the maximal number of sections are specified using the `SECTION_SIZE_BITS`
and `MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports,
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
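
As a worked example, assume `MAX_PHYSMEM_BITS` is 46 and
`SECTION_SIZE_BITS` is 27. These values are picked for illustration
only; the real constants depend on the architecture and its
configuration. Each section then covers 2^27 bytes (128 MiB) and

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(46 - 27)} = 2 ^ {19} = 524288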

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains `PAGE_SIZE` worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
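
The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` layout
can be sketched as follows. Everything here is an assumption made for
illustration: the constants, the 16-byte stand-in for
`struct mem_section`, and the helper and macro names `SECTIONS_PER_ROW`,
`NR_ROWS`, `section_nr_to_row` and `section_nr_to_col` are hypothetical
and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants for illustration; real values are architecture-specific. */
#define MAX_PHYSMEM_BITS	46
#define SECTION_SIZE_BITS	27
#define PAGE_SIZE		4096UL

/* 16-byte stand-in; this is not the kernel's definition of the structure. */
struct mem_section {
	uint64_t section_mem_map;
	uint64_t pad;
};

#define NR_MEM_SECTIONS	(1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROW	(PAGE_SIZE / sizeof(struct mem_section))

/* Number of rows needed to fit all sections, rounded up. */
#define NR_ROWS	((NR_MEM_SECTIONS + SECTIONS_PER_ROW - 1) / SECTIONS_PER_ROW)

/* Row of the two-dimensional array that holds section number nr. */
static unsigned long section_nr_to_row(unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS);
	return nr / SECTIONS_PER_ROW;
}

/* Position of section number nr inside its row. */
static unsigned long section_nr_to_col(unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS);
	return nr % SECTIONS_PER_ROW;
}
```

With `CONFIG_SPARSEMEM_EXTREME` disabled, the same arithmetic applies
with one object per row, so the row index equals the section number.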

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

Classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index into the array of pages.
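
A toy model of the classic sparse lookup is sketched below. It is a
simplification under stated assumptions: sections are tiny (16 pages),
only section 2 is populated, the section number lives in a dedicated
field rather than in page->flags, and the per-section array is indexed
by the low PFN bits instead of through an encoded `section_mem_map`.
All `sparse_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	16	/* assumed: tiny 64 KiB sections */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS		8UL	/* assumed: small physical address space */

struct page {
	unsigned long section_nr;	/* models the section number in page->flags */
};

struct mem_section {
	struct page *map;	/* NULL when the section is a hole */
};

static struct mem_section sections[NR_SECTIONS];
static struct page section2_pages[PAGES_PER_SECTION];

/* Populate only section 2 and leave every other section as a hole. */
static void sparse_sketch_init(void)
{
	for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
		section2_pages[i].section_nr = 2;
	sections[2].map = section2_pages;
}

/* High PFN bits select the section, low bits index its page array. */
static struct page *sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> PFN_SECTION_SHIFT;

	if (nr >= NR_SECTIONS || !sections[nr].map)
		return NULL;
	return &sections[nr].map[pfn & (PAGES_PER_SECTION - 1)];
}

/* The section number stored in the page recovers the high PFN bits. */
static unsigned long sparse_page_to_pfn(const struct page *page)
{
	unsigned long nr = page->section_nr;

	assert(nr < NR_SECTIONS && sections[nr].map != NULL);
	return (nr << PFN_SECTION_SHIFT) + (unsigned long)(page - sections[nr].map);
}
```

The reverse conversion shows why the section number has to be recorded
with the page: the page pointer alone does not identify which section's
array it belongs to.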

Sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index into that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
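
The vmemmap conversion reduces to pointer arithmetic, as the following
userspace sketch shows. It assumes a small static array in place of the
reserved virtual address range, a one-field stand-in for `struct page`,
and hypothetical `vmemmap_` helper names; sparsely populating the range
is outside the scope of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed: a static array models the virtually contiguous memory map. */
#define NR_FRAMES	4096UL
static struct page vmemmap_storage[NR_FRAMES];
static struct page *const vmemmap = vmemmap_storage;

/* PFN -> struct page: the PFN is an index from the vmemmap base. */
static struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	assert(pfn < NR_FRAMES);
	return vmemmap + pfn;
}

/* struct page -> PFN: the offset from the vmemmap base is the PFN. */
static unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

Compared with classic sparse, no section lookup is needed in either
direction.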

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device-driver-identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.
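
The 2MB granularity can be illustrated with a small alignment check that
a driver might run on a physical range before trying to map it. This is
a hypothetical sketch: the macro and function names are made up for the
example and are not a kernel API, and the only fact taken from the text
is the 2MB alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2MB granularity expressed as a shift; names are illustrative only. */
#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(UINT64_C(1) << SUBSECTION_SHIFT)
#define SUBSECTION_MASK		(SUBSECTION_SIZE - 1)

/* True when both the start and the size of the range are 2MB multiples. */
static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & SUBSECTION_MASK) == 0 &&
	       (size & SUBSECTION_MASK) == 0;
}

/* Round an address up to the next 2MB boundary. */
static uint64_t align_up_to_subsection(uint64_t addr)
{
	assert(addr <= UINT64_MAX - SUBSECTION_MASK);	/* guard against wrap */
	return (addr + SUBSECTION_MASK) & ~SUBSECTION_MASK;
}
```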

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->folio_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.