.. SPDX-License-Identifier: GPL-2.0

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. Yet, the mapping array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.
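
As a rough illustration of the global `mem_map` array, the following
userspace sketch models a flat memory map. It is not kernel code: the
`toy_` helper names, the trimmed-down `struct page`, the array size and
the chosen `ARCH_PFN_OFFSET` value are all assumptions made for the
example. The array is indexed with `PFN - ARCH_PFN_OFFSET`.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical, trimmed-down stand-in for the kernel's struct page. */
   struct page {
       unsigned long flags;
   };

   /* Assumed for the example: physical memory starts at PFN 0x100. */
   #define ARCH_PFN_OFFSET 0x100UL
   #define TOY_NR_PAGES    64UL

   /* One entry per page frame, including frames that fall into holes. */
   static struct page mem_map[TOY_NR_PAGES];

   /* PFN -> struct page: subtract the offset and index the array. */
   static struct page *toy_pfn_to_page(unsigned long pfn)
   {
       return mem_map + (pfn - ARCH_PFN_OFFSET);
   }

   /* struct page -> PFN: the array index plus the offset. */
   static unsigned long toy_page_to_pfn(const struct page *page)
   {
       return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
   }

In this model the two helpers are exact inverses of each other for any
PFN covered by the array, which mirrors the one-to-one mapping between
a PFN and its `struct page`.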

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids section management. The section size and the maximal number
of sections are specified using the `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`.
The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global
`struct page *vmemmap` pointer that points to a virtually contiguous
array of `struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.
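
The section arithmetic and the vmemmap conversion described above can be
sketched in userspace as follows. This is an illustration under stated
assumptions, not the kernel implementation: the values of `PAGE_SHIFT`,
`SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS` are example values, the
`toy_` helpers and the trimmed-down `struct page` are hypothetical, and
a small static array stands in for the virtually mapped range that
`vmemmap` would point to.

.. code-block:: c

   #include <assert.h>

   /* Example values, assumed for illustration; architectures pick their own. */
   #define PAGE_SHIFT          12
   #define SECTION_SIZE_BITS   27
   #define MAX_PHYSMEM_BITS    46

   /* NR_MEM_SECTIONS = 2 ^ (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) */
   #define NR_MEM_SECTIONS     (1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))
   #define PFN_SECTION_SHIFT   (SECTION_SIZE_BITS - PAGE_SHIFT)
   #define PAGES_PER_SECTION   (1UL << PFN_SECTION_SHIFT)

   /* Hypothetical, trimmed-down stand-in for the kernel's struct page. */
   struct page {
       unsigned long flags;
   };

   /* Classic sparse: the high bits of the PFN select the section. */
   static unsigned long toy_pfn_to_section_nr(unsigned long pfn)
   {
       return pfn >> PFN_SECTION_SHIFT;
   }

   /* Classic sparse: the low bits select the page within the section. */
   static unsigned long toy_pfn_in_section(unsigned long pfn)
   {
       return pfn & (PAGES_PER_SECTION - 1);
   }

   /* Sparse vmemmap: a toy array stands in for the virtual memory map. */
   static struct page toy_vmemmap_backing[1024];
   static struct page *vmemmap = toy_vmemmap_backing;

   /* The PFN is the index into the virtually contiguous array. */
   static struct page *toy_vmemmap_pfn_to_page(unsigned long pfn)
   {
       return vmemmap + pfn;
   }

   /* The offset of the struct page from vmemmap is the PFN. */
   static unsigned long toy_vmemmap_page_to_pfn(const struct page *page)
   {
       return (unsigned long)(page - vmemmap);
   }

With these example values a section spans 2^15 page frames, so PFN
0x12345 falls into section 2 at index 0x2345. The vmemmap variant needs
no section lookup at all, which is why it makes the conversion cheaper.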

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========
The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug.
Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->folio_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.
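
The 2MB sub-section granularity mentioned above for
:c:func:`devm_memremap_pages` can be illustrated with a small alignment
check. This is only a userspace sketch: the 4KB page size is an
assumption, the `toy_` helper is hypothetical, and it does not
reproduce the validation the kernel performs on a range.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   #define PAGE_SHIFT              12  /* assumed: 4KB pages */
   #define SUBSECTION_SHIFT        21  /* 2MB sub-section granularity */
   #define PAGES_PER_SUBSECTION    (1UL << (SUBSECTION_SHIFT - PAGE_SHIFT))

   /*
    * Return true if a non-empty PFN range starts and ends on a
    * sub-section (2MB) boundary.
    */
   static bool toy_range_is_subsection_aligned(unsigned long start_pfn,
                                               unsigned long nr_pages)
   {
       unsigned long mask = PAGES_PER_SUBSECTION - 1;

       return nr_pages != 0 && !(start_pfn & mask) && !(nr_pages & mask);
   }

With 4KB pages a sub-section covers 512 page frames, so a range that
starts at PFN 512 and spans 1024 pages passes the check, while one that
starts at PFN 256 does not.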