=================
Concepts overview
=================

The memory management in Linux is a complex system that evolved over the
years and included more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. The memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will
eventually be written. Yet, although some of the concepts are the same,
here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex, and
to avoid this complexity the concept of virtual memory was developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at the kernel build time by setting an
appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy with the next bits of the virtual address as the index to
that level page table. The lowest bits in the virtual address define
the offset inside the actual page.

Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called Translation Lookaside Buffer (or
TLB). The TLB is usually a scarce resource, and applications with a
large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit rate and thus improves overall system performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described at
Documentation/admin-guide/mm/hugetlbpage.rst.

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user, hence the
name. See Documentation/admin-guide/mm/transhuge.rst for more details
about THP.

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

Nodes
=====

Many multi-processor machines are NUMA (Non-Uniform Memory Access)
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
Documentation/mm/numa.rst and in
Documentation/admin-guide/mm/numa_memory_policy.rst.

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap or by explicit calls to the mmap(2) system
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.

Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA-able buffers for device driver use, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold (min watermark), an allocation
will trigger `direct reclaim`. In this case the allocation is stalled
until enough memory pages are reclaimed to satisfy the request.

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such a
need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and the
kernel will be unable to reclaim enough memory to continue to operate. In
order to save the rest of the system, it invokes the `OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it exits
enough memory will be freed to continue normal operation.
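
As a rough illustration of how user space can observe the OOM killer's
task selection, the sketch below reads the ``oom_score`` and
``oom_score_adj`` files that procfs exposes for each process. It assumes
procfs is mounted at ``/proc``; the helper names ``read_proc_int`` and
``oom_badness`` are hypothetical and exist only for this example. It is
not a description of how the kernel computes the score.

```python
from pathlib import Path


def read_proc_int(path):
    """Return the integer stored in a procfs-style file, or None if the
    file is missing, unreadable, or does not contain an integer."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def oom_badness(pid="self", proc_root="/proc"):
    """Collect the OOM-related values for one process.

    proc_root is a parameter (an assumption of this sketch) so that the
    function can be pointed at a directory other than /proc.
    """
    base = Path(proc_root) / str(pid)
    return {
        # Higher oom_score means the task is a more likely victim.
        "oom_score": read_proc_int(base / "oom_score"),
        # User-tunable adjustment in the range -1000..1000.
        "oom_score_adj": read_proc_int(base / "oom_score_adj"),
    }
```

Calling ``oom_badness()`` with no arguments on a Linux system reports the
values for the calling process; both entries are ``None`` when the files
are not available.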