.. SPDX-License-Identifier: GPL-2.0

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. NUMA adds a further complication:
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. However, the array is not
usable until the call to :c:func:`memblock_free_all`, which hands all the
memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
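
The FLATMEM conversion can be illustrated with a minimal user-space
sketch. It is not the kernel implementation: the `struct page` below is a
one-field stand-in, and the values of `ARCH_PFN_OFFSET` and `NR_PAGES` are
hypothetical, chosen as if RAM started at physical address 0x80000000 with
4 KiB pages.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real one is much larger. */
struct page {
    unsigned long flags;
};

/* Hypothetical platform: RAM starts at 0x80000000, 4 KiB pages. */
#define ARCH_PFN_OFFSET 0x80000UL
#define NR_PAGES        16UL

/* A single flat array covering every page frame of the platform. */
static struct page mem_map[NR_PAGES];

/* PFN - ARCH_PFN_OFFSET is the index into mem_map. */
static struct page *pfn_to_page(unsigned long pfn)
{
    return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* The inverse: the index in mem_map plus ARCH_PFN_OFFSET is the PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Under these assumptions, PFN 0x80005 maps to `&mem_map[5]`, and applying
`page_to_pfn()` to that pointer returns 0x80005 again.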

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the management of sections. The section size and the maximal
number of sections are specified using the `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
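
As a worked example, the formula can be evaluated for one concrete
configuration. The values of `SECTION_SIZE_BITS`, `MAX_PHYSMEM_BITS` and
`PAGE_SHIFT` below are illustrative assumptions only; each architecture
defines its own. The `PAGES_PER_SECTION` computation is an addition that
follows from the section size and the page size.

```c
/* Illustrative values only; every architecture picks its own. */
#define SECTION_SIZE_BITS  27   /* 2^27 bytes = 128 MiB per section */
#define MAX_PHYSMEM_BITS   46   /* 2^46 bytes = 64 TiB addressable */
#define PAGE_SHIFT         12   /* 4 KiB pages (assumption) */

/* Number of address bits that select a section. */
#define SECTIONS_SHIFT     (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

/* 2^(46 - 27) = 2^19 sections. */
#define NR_MEM_SECTIONS    (1UL << SECTIONS_SHIFT)

/* 2^(27 - 12) = 2^15 page frames inside one section. */
#define PAGES_PER_SECTION  (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))
```

With these numbers there are 524288 possible sections, each covering
32768 page frames.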

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
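
The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` layout
can be sketched in user space as follows. This is an illustration under
stated assumptions, not kernel code: `struct mem_section` is a simplified
stand-in, `PAGE_SIZE` and `NR_MEM_SECTIONS` are hypothetical values, and
`roots` and `nr_to_section()` are names made up for this example.

```c
#include <stddef.h>

#define PAGE_SIZE        4096UL        /* assumed page size */
#define NR_MEM_SECTIONS  (1UL << 19)   /* hypothetical section count */

/* Simplified stand-in; the kernel structure carries more state. */
struct mem_section {
    unsigned long section_mem_map;
    void *usage;
};

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT  (PAGE_SIZE / sizeof(struct mem_section))

/* Enough rows to cover every possible section, rounded up. */
#define NR_SECTION_ROOTS \
    ((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Rows are filled in on demand; NULL marks a row never populated. */
static struct mem_section *roots[NR_SECTION_ROOTS];

/* Split the section number into a row and a column, then look it up. */
static struct mem_section *nr_to_section(unsigned long nr)
{
    struct mem_section *row;

    if (nr >= NR_MEM_SECTIONS)
        return NULL;

    row = roots[nr / SECTIONS_PER_ROOT];
    return row ? &row[nr % SECTIONS_PER_ROOT] : NULL;
}
```

The point of the second level is that only rows describing present
memory need backing storage, so a large `NR_MEM_SECTIONS` does not force a
large static array.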

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.
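
The classic sparse lookup can be sketched in user space as below. The
sketch makes several simplifying assumptions that differ from the kernel:
sections hold only 16 pages, `page->flags` stores nothing but the section
number, and `section_mem_map` is kept as a plain integer holding the
address of the section's page array biased by the section's first PFN, so
that the full PFN can be used as the index. `section_init()` and
`example_map` are hypothetical names used only here.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct page; flags holds only the section number here. */
struct page {
    unsigned long flags;
};

/* Hypothetical geometry: 16 pages per section, 8 sections. */
#define PFN_SECTION_SHIFT  4
#define PAGES_PER_SECTION  (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS    8UL

/* Encoded address of the section's page array, biased by its first PFN. */
struct mem_section {
    uintptr_t section_mem_map;
};

static struct mem_section sections[NR_MEM_SECTIONS];

/* Backing storage for one section, used by the usage note below. */
static struct page example_map[PAGES_PER_SECTION];

/* The high bits of a PFN select the section. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}

/* Record the section number in every page and store the biased map. */
static void section_init(unsigned long nr, struct page *map)
{
    unsigned long start_pfn = nr << PFN_SECTION_SHIFT;
    unsigned long i;

    for (i = 0; i < PAGES_PER_SECTION; i++)
        map[i].flags = nr;
    sections[nr].section_mem_map =
        (uintptr_t)map - start_pfn * sizeof(struct page);
}

static struct page *pfn_to_page(unsigned long pfn)
{
    uintptr_t base = sections[pfn_to_section_nr(pfn)].section_mem_map;

    return (struct page *)(base + pfn * sizeof(struct page));
}

/* The section number comes from the page itself, not from the PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
    uintptr_t base = sections[page->flags].section_mem_map;

    return (unsigned long)(((uintptr_t)page - base) / sizeof(struct page));
}
```

After `section_init(3, example_map)`, PFN `3 * PAGES_PER_SECTION + 5`
resolves to `&example_map[5]`, and `page_to_pfn()` on that pointer gives
the same PFN back.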

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page() and page_to_pfn() operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
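
With a virtually contiguous map, both conversions reduce to pointer
arithmetic. The following user-space sketch shows the idea; the
`struct page` stand-in and the small `virtual_map` array are assumptions
made for illustration. In the kernel, `vmemmap` points into a reserved
virtual address range that is only populated where memory exists.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct page. */
struct page {
    unsigned long flags;
};

/* Hypothetical backing store standing in for the virtual memory map. */
static struct page virtual_map[64];

/* Base of the virtually contiguous array; entry 0 describes PFN 0. */
static struct page *vmemmap = virtual_map;

/* The PFN is the index into the array. */
static struct page *pfn_to_page(unsigned long pfn)
{
    return vmemmap + pfn;
}

/* The offset of the page from vmemmap is its PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - vmemmap);
}
```

Compared with the classic sparse lookup, no section table access is
needed on either path.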

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device-driver-identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.
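
The 2MB granularity amounts to a simple alignment rule on the start and
the size of a range. The sketch below only illustrates that arithmetic;
`range_is_subsection_aligned()` is a hypothetical helper, and the checks
actually performed by :c:func:`devm_memremap_pages` are not reproduced
here.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2^21 bytes = 2MB, the common alignment granularity. */
#define SUBSECTION_SHIFT  21
#define SUBSECTION_SIZE   (UINT64_C(1) << SUBSECTION_SHIFT)
#define SUBSECTION_MASK   (SUBSECTION_SIZE - 1)

/* Hypothetical helper: true if start and size are non-empty 2MB multiples. */
static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
    return size != 0 &&
           (start & SUBSECTION_MASK) == 0 &&
           (size & SUBSECTION_MASK) == 0;
}
```

For instance, a 2MB range starting at physical address 0x100000000
passes, while the same range shifted by one 4 KiB page does not.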

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.