.. SPDX-License-Identifier: GPL-2.0

===============
Physical Memory
===============

Linux is available for a wide range of architectures so there is a need for an
architecture-independent abstraction to represent the physical memory. This
chapter describes the structures used to manage physical memory in a running
system.

The first principal concept prevalent in memory management is
`Non-Uniform Memory Access (NUMA)
<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
With multi-core and multi-socket machines, memory may be arranged into banks
that incur a different cost to access depending on the “distance” from the
processor. For example, there might be a bank of memory assigned to each CPU or
a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
for a particular node can be referenced by the ``NODE_DATA(nid)`` macro where
``nid`` is the ID of that node.

For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`.

The entire physical address space is partitioned into one or more blocks
called zones which represent ranges within memory. These ranges are usually
determined by architectural constraints for accessing the physical memory.
The memory range within a node that corresponds to a particular zone is
described by a ``struct zone``. Each zone has one of the types described below.

* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
  DMA by peripheral devices that cannot access all of the addressable
  memory. For many years there have been better and more robust interfaces to
  get memory with DMA specific requirements
  (Documentation/core-api/dma-api.rst), but ``ZONE_DMA`` and ``ZONE_DMA32``
  still represent memory ranges that have restrictions on how they can be
  accessed.
  Depending on the architecture, either or even both of these zone types
  can be disabled at build time using ``CONFIG_ZONE_DMA`` and
  ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
  both zones as they support peripherals with different DMA addressing
  limitations.

* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
  the time. DMA operations can be performed on pages in this zone if the DMA
  devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
  always enabled.

* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
  permanent mapping in the kernel page tables. The memory in this zone is only
  accessible to the kernel using temporary mappings. This zone is available
  only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.

* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
  The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
  movable. That means that while virtual addresses of these pages do not
  change, their content may move between different physical pages. Often
  ``ZONE_MOVABLE`` is populated during memory hotplug, but it may be
  also populated on boot using one of ``kernelcore``, ``movablecore`` and
  ``movable_node`` kernel command line parameters. See
  Documentation/mm/page_migration.rst and
  Documentation/admin-guide/mm/memory-hotplug.rst for additional details.

* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
  It has different characteristics than RAM zone types and it exists to provide
  :ref:`struct page <Pages>` and memory map services for device driver
  identified physical address ranges. ``ZONE_DEVICE`` is enabled with
  configuration option ``CONFIG_ZONE_DEVICE``.

It is important to note that many kernel operations can only take place using
``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
discussed further in Section :ref:`Zones <zones>`.

The relation between node and zone extents is determined by the physical memory
map reported by the firmware, architectural constraints for memory addressing
and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM
the entire memory will be on node 0 and there will be three zones:
``ZONE_DMA``, ``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::

  0                                                            2G
  +-------------------------------------------------------------+
  |                            node 0                           |
  +-------------------------------------------------------------+

  0         16M                    896M                        2G
  +----------+-----------------------+--------------------------+
  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
  +----------+-----------------------+--------------------------+


With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
booted with the ``movablecore=80%`` parameter on an arm64 machine with
16 Gbytes of RAM equally split between two nodes, there will be
``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and
``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 1::


  1G                                9G                        17G
  +--------------------------------+ +--------------------------+
  |             node 0             | |          node 1          |
  +--------------------------------+ +--------------------------+

  1G       4G        4200M          9G          9320M         17G
  +---------+----------+-----------+ +------------+-------------+
  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
  +---------+----------+-----------+ +------------+-------------+


Memory banks may belong to interleaving nodes. In the example below an x86
machine has 16 Gbytes of RAM in 4 memory banks; even banks belong to node 0
and odd banks belong to node 1::


  0              4G              8G             12G           16G
  +-------------+ +-------------+ +-------------+ +-------------+
  |    node 0   | |    node 1   | |    node 0   | |    node 1   |
  +-------------+ +-------------+ +-------------+ +-------------+

  0    16M       4G
  +-----+-------+ +-------------+ +-------------+ +-------------+
  | DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
  +-----+-------+ +-------------+ +-------------+ +-------------+

In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from
4 to 16 Gbytes.

.. _nodes:

Nodes
=====

As we have mentioned, each node in memory is described by a ``pg_data_t`` which
is a typedef for a ``struct pglist_data``. When allocating a page, by default
Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
Documentation/admin-guide/mm/numa_memory_policy.rst.

Most NUMA architectures maintain an array of pointers to the node
structures. The actual structures are allocated early during boot when
architecture specific code parses the physical memory map reported by the
firmware. The bulk of the node initialization happens slightly later in the
boot process in the free_area_init() function, described later in Section
:ref:`Initialization <initialization>`.


Along with the node structures, the kernel maintains an array of ``nodemask_t``
bitmasks called ``node_states``.
Each bitmask in this array represents a set of
nodes with particular properties as defined by ``enum node_states``:

``N_POSSIBLE``
  The node could become online at some point.
``N_ONLINE``
  The node is online.
``N_NORMAL_MEMORY``
  The node has regular memory.
``N_HIGH_MEMORY``
  The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled,
  it is aliased to ``N_NORMAL_MEMORY``.
``N_MEMORY``
  The node has memory (regular, high, movable).
``N_CPU``
  The node has one or more CPUs.

For each node that has a property described above, the bit corresponding to the
node ID in the ``node_states[<property>]`` bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::

  node_states[N_POSSIBLE]
  node_states[N_ONLINE]
  node_states[N_NORMAL_MEMORY]
  node_states[N_HIGH_MEMORY]
  node_states[N_MEMORY]
  node_states[N_CPU]

For various operations possible with nodemasks please refer to
``include/linux/nodemask.h``.

Among other things, nodemasks are used to provide macros for node traversal,
namely ``for_each_node()`` and ``for_each_online_node()``.

For instance, to call a function foo() for each online node::

  int nid;

  /* foo() stands for any function that takes a node descriptor */
  for_each_online_node(nid) {
          pg_data_t *pgdat = NODE_DATA(nid);

          foo(pgdat);
  }

Node structure
--------------

The node structure ``struct pglist_data`` is declared in
``include/linux/mmzone.h``. Here we briefly describe fields of this
structure:

General
~~~~~~~

``node_zones``
  The zones for this node. Not all of the zones may be populated, but it is
  the full list. It is referenced by this node's node_zonelists as well as
  other nodes' node_zonelists.

``node_zonelists``
  The list of all zones in all nodes. This list defines the order of zones
  that allocations are preferred from.
  The ``node_zonelists`` is set up by
  ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
  core memory management structures.

``nr_zones``
  Number of populated zones in this node.

``node_mem_map``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_mem_map`` is an array of struct pages representing each physical
  frame.

``node_page_ext``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_page_ext`` is an array of extensions of struct pages. Available only
  in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.

``node_start_pfn``
  The page frame number of the starting page frame in this node.

``node_present_pages``
  Total number of physical pages present in this node.

``node_spanned_pages``
  Total size of physical page range, including holes.

``node_size_lock``
  A lock that protects the fields defining the node extents. Only defined when
  at least one of the ``CONFIG_MEMORY_HOTPLUG`` or
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
  ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
  manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
  or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.

``node_id``
  The Node ID (NID) of the node, starts at 0.

``totalreserve_pages``
  This is a per-node reserve of pages that are not available to userspace
  allocations.

``first_deferred_pfn``
  If memory initialization on large machines is deferred then this is the first
  PFN that needs to be initialized. Defined only when
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.

``deferred_split_queue``
  Per-node queue of huge pages whose split was deferred. Defined only when
  ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.

``__lruvec``
  Per-node lruvec holding LRU lists and related parameters.
  Used only when
  memory cgroups are disabled. It should not be accessed directly; use
  ``mem_cgroup_lruvec()`` to look up lruvecs instead.

Reclaim control
~~~~~~~~~~~~~~~

See also Documentation/mm/page_reclaim.rst.

``kswapd``
  Per-node instance of kswapd kernel thread.

``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
  Wait queues used to synchronize memory reclaim tasks.

``nr_writeback_throttled``
  Number of tasks that are throttled waiting on dirty pages to clean.

``nr_reclaim_start``
  Number of pages written while reclaim is throttled waiting for writeback.

``kswapd_order``
  Controls the order kswapd tries to reclaim.

``kswapd_highest_zoneidx``
  The highest zone index to be reclaimed by kswapd.

``kswapd_failures``
  Number of runs kswapd was unable to reclaim any pages.

``min_unmapped_pages``
  Minimal number of unmapped file backed pages that cannot be reclaimed.
  Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
  ``CONFIG_NUMA`` is enabled.

``min_slab_pages``
  Minimal number of SLAB pages that cannot be reclaimed. Determined by
  ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.

``flags``
  Flags controlling reclaim behavior.

Compaction control
~~~~~~~~~~~~~~~~~~

``kcompactd_max_order``
  Page order that kcompactd should try to achieve.

``kcompactd_highest_zoneidx``
  The highest zone index to be compacted by kcompactd.

``kcompactd_wait``
  Wait queue used to synchronize memory compaction tasks.

``kcompactd``
  Per-node instance of kcompactd kernel thread.

``proactive_compact_trigger``
  Determines if proactive compaction is enabled. Controlled by
  ``vm.compaction_proactiveness`` sysctl.

Statistics
~~~~~~~~~~

``per_cpu_nodestats``
  Per-CPU VM statistics for the node.

``vm_stat``
  VM statistics for the node.
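
As a rough illustration of how the extent fields listed under General relate to
each other, the userspace sketch below models ``node_start_pfn``,
``node_spanned_pages`` and ``node_present_pages`` for node 0 of the interleaved
example earlier in this chapter. ``struct node_extent_model``, the ``model_*``
helpers and the 4 KiB page size are assumptions made for this illustration;
they are not kernel definitions.

.. code-block:: c

  #include <assert.h>

  /* Assumption for this sketch: 4 KiB pages (not true for every config). */
  #define MODEL_PAGE_SHIFT 12

  /* Hypothetical userspace stand-in for the extent fields of pg_data_t. */
  struct node_extent_model {
      unsigned long node_start_pfn;     /* first page frame of the node */
      unsigned long node_present_pages; /* pages actually backed by memory */
      unsigned long node_spanned_pages; /* size of the range, holes included */
  };

  /* Convert a size in Gbytes to a number of page frames. */
  static unsigned long model_gb_to_pages(unsigned long gb)
  {
      return gb << (30 - MODEL_PAGE_SHIFT);
  }

  /* One past the last page frame covered by the node. */
  static unsigned long model_end_pfn(const struct node_extent_model *n)
  {
      return n->node_start_pfn + n->node_spanned_pages;
  }

  /* Pages inside the spanned range that this node does not provide. */
  static unsigned long model_hole_pages(const struct node_extent_model *n)
  {
      assert(n->node_present_pages <= n->node_spanned_pages);
      return n->node_spanned_pages - n->node_present_pages;
  }

  /* Node 0 of the interleaved example: banks at 0-4G and 8G-12G. */
  static struct node_extent_model model_interleaved_node0(void)
  {
      struct node_extent_model n = {
          .node_start_pfn = 0,
          .node_present_pages = model_gb_to_pages(8),
          .node_spanned_pages = model_gb_to_pages(12),
      };

      return n;
  }

With these values the modelled node spans 12 Gbytes while only 8 Gbytes are
present, so the remaining 4 Gbytes (the bank that belongs to node 1) are
counted as a hole.
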

.. _zones:

Zones
=====

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _pages:

Pages
=====

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _folios:

Folios
======

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _initialization:

Initialization
==============

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.