.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.
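
The FLATMEM conversion described above can be sketched in user-space C.
This is only an illustration, not the kernel's implementation: the
`struct page` stand-in, the `ARCH_PFN_OFFSET` value, the `NR_PAGES`
constant and the `flat_*` helper names are all assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page; the real one is much larger. */
struct page {
	unsigned long flags;
};

/* Hypothetical values chosen for illustration only. */
#define ARCH_PFN_OFFSET	0x80000UL	/* first PFN: RAM at 2 GiB, 4 KiB pages */
#define NR_PAGES	1024UL		/* page frames covered by mem_map */

/* One flat array covering the whole physical range, holes included. */
static struct page mem_map[NR_PAGES];

/* PFN -> struct page: subtract the offset and index the array. */
static inline struct page *flat_pfn_to_page(unsigned long pfn)
{
	assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < NR_PAGES);
	return &mem_map[pfn - ARCH_PFN_OFFSET];
}

/* struct page -> PFN: pointer difference plus the offset. */
static inline unsigned long flat_page_to_pfn(const struct page *page)
{
	assert(page >= mem_map && page < mem_map + NR_PAGES);
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are plain pointer arithmetic, which is why FLATMEM is the
cheapest model when physical memory is contiguous or nearly so.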

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the sections management. The section size and maximal number
of sections are specified using `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is an actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - a "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.
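
The classic sparse lookup and the `NR_MEM_SECTIONS` formula above can be
illustrated with a simplified user-space sketch. Several things here are
assumptions made for the example and differ from the kernel: the constants
are deliberately small, `mem_sections` is a one-dimensional static array,
`section_mem_map` is a plain pointer without the extra encoding, the whole
`flags` word holds only the section number, and the `sparse_*` helper names
and `demo_pages` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy struct page: here flags stores nothing but the section number. */
struct page {
	unsigned long flags;
};

/* Hypothetical, deliberately small values for the demonstration. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define MAX_PHYSMEM_BITS	32	/* 4 GiB of physical address space */

#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS		(1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* Simplified section descriptor: NULL map means "section not present". */
struct mem_section {
	struct page *section_mem_map;
};

static struct mem_section mem_sections[NR_MEM_SECTIONS];

/* Backing storage for one populated section, used by the usage example. */
static struct page demo_pages[PAGES_PER_SECTION];

/* The high bits of a PFN select the section. */
static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* Register a per-section memory map and stamp the section number. */
static inline void sparse_add_section(unsigned long nr, struct page *pages)
{
	assert(nr < NR_MEM_SECTIONS);
	mem_sections[nr].section_mem_map = pages;
	for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
		pages[i].flags = nr;
}

/* PFN -> struct page: pick the section, then index with the low bits. */
static inline struct page *sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	assert(nr < NR_MEM_SECTIONS);
	struct page *map = mem_sections[nr].section_mem_map;
	return map ? &map[pfn & (PAGES_PER_SECTION - 1)] : NULL;
}

/* struct page -> PFN: section number from flags plus offset in the map. */
static inline unsigned long sparse_page_to_pfn(const struct page *page)
{
	unsigned long nr = page->flags;

	assert(nr < NR_MEM_SECTIONS && mem_sections[nr].section_mem_map);
	const struct page *map = mem_sections[nr].section_mem_map;
	return (nr << PFN_SECTION_SHIFT) + (unsigned long)(page - map);
}
```

With these values, calling `sparse_add_section(2, demo_pages)` makes
section 2 present; PFNs that fall into any other section resolve to `NULL`
because their sections have no memory map.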

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.
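
The sparse vmemmap conversion described earlier in this section reduces to
pointer arithmetic on `vmemmap`. The sketch below illustrates that in
user-space C under loud assumptions: a small static array stands in for the
reserved virtual range that the kernel populates through
vmemmap_populate(), and `NR_DEMO_PFNS`, `vmemmap_backing` and the
`vmemmap_*` helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Hypothetical size of the demo memory map. */
#define NR_DEMO_PFNS	4096UL

/*
 * In the kernel, vmemmap points into a reserved virtual address range that
 * is backed by physical pages on demand. Here an ordinary static array
 * plays that role.
 */
static struct page vmemmap_backing[NR_DEMO_PFNS];
static struct page *const vmemmap = vmemmap_backing;

/* PFN -> struct page: the PFN is the index into the array. */
static inline struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	assert(pfn < NR_DEMO_PFNS);
	return vmemmap + pfn;
}

/* struct page -> PFN: the offset from vmemmap is the PFN. */
static inline unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	assert(page >= vmemmap && page < vmemmap + NR_DEMO_PFNS);
	return (unsigned long)(page - vmemmap);
}
```

Compared with the classic sparse lookup, no section table is consulted on
either path, which is the optimization the virtually mapped memory map is
meant to provide.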

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug api on memory block boundaries. The implementation relies on
this lack of user-api constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.
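
The 2MB sub-section granularity mentioned above for
:c:func:`devm_memremap_pages` amounts to a simple alignment rule on the
physical range. The following user-space sketch only illustrates that
arithmetic; the helper name `range_is_subsection_aligned` is hypothetical,
and the shift value is taken from the 2MB figure stated in the text rather
than from any particular kernel configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2 MiB sub-sections, per the alignment granularity stated in the text. */
#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(UINT64_C(1) << SUBSECTION_SHIFT)

/*
 * Return true when a physical range starts and ends on sub-section
 * boundaries, i.e. both its start address and its size are non-zero
 * multiples of 2 MiB (the start may also be 0).
 */
static inline bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & (SUBSECTION_SIZE - 1)) == 0 &&
	       (size & (SUBSECTION_SIZE - 1)) == 0;
}
```

For instance, a 4 MiB range starting at 1 GiB satisfies the rule, while a
range that starts 4 KiB later does not.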