============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
backing virtual memory with huge pages that supports the automatic
promotion and demotion of page sizes without the shortcomings of
hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem.
But in the future it can expand to other filesystems.

.. note::
   in the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster with huge pages because of two factors. The
first factor is almost completely irrelevant and not of significant
interest, because it also has the downside of requiring larger
clear-page and copy-page operations in page faults, which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland (so
reducing the enter/exit kernel frequency by a factor of 512). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping. The second, long lasting and much more important
factor affects all subsequent accesses to the memory for the whole
runtime of the application. The second factor consists of two
components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses.
   With virtualization and nested pagetables, a TLB entry can map the
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, simply because the TLB miss is going to
   run faster.

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: Page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous
and appropriately aligned. In this case, TLB misses will occur less
often.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into PMD-sized huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland.
It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, just as they have been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by no means mandatory, and khugepaged can already take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, an application
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it, in which case a 2M page might
be allocated instead of a 4k page for no benefit. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide.
This can be achieved per-supported-THP-size with one of::

    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

where <size> is the hugepage size being addressed, the available sizes
for which vary by system.

For example::

    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

For example::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled

By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.

It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fall back to regular
pages unless hugepages are immediately available. Clearly if we spend CPU
time to defrag memory, we would expect to gain even more by the fact we
use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

::

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
    means that an application requesting THP will stall on
    allocation failure and directly reclaim pages and compact
    memory in an effort to allocate a THP immediately. This may be
    desirable for virtual machines that benefit heavily from THP
    use and are willing to delay the VM start to utilise them.

defer
    means that an application will wake kswapd in the background
    to reclaim pages and wake kcompactd to compact memory so that
    THP is available in the near future. It's the responsibility
    of khugepaged to then install the THP pages later.

defer+madvise
    will enter direct reclaim and compaction like ``always``, but
    only for regions that have used madvise(MADV_HUGEPAGE); all
    other regions will wake kswapd in the background to reclaim
    pages and wake kcompactd to compact memory so that THP is
    available in the near future.

madvise
    will enter direct reclaim like ``always`` but only for regions
    that have used madvise(MADV_HUGEPAGE). This is the default
    behaviour.

never
    should be self-explanatory.

By default the kernel tries to use a huge, PMD-mappable zero page on a
read page fault to an anonymous mapping.
It's possible to disable the huge zero
page by writing 0 or enable it again by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

khugepaged will be automatically started when one or more hugepage
sizes are enabled (either by directly setting "always" or "madvise",
or by setting "inherit" while the top-level enabled is set to "always"
or "madvise"), and it'll be automatically shut down when the last
hugepage size is disabled (either by directly setting "never", or by
setting "inherit" while the top-level enabled is set to "never").

Khugepaged controls
-------------------

.. note::
   khugepaged currently only searches for opportunities to collapse to
   PMD-sized THP and no attempt is made to collapse to other THP
   sizes.

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged.
However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
one 2M hugepage. Each may happen independently, or together, depending on
the type of memory and the failures that occur. As such, this value should
be interpreted roughly as a sign of progress, and counters in /proc/vmstat
consulted for more accurate accounting)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

for each pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value makes programs use additional memory.
A lower value yields less THP performance gain.
The value of
max_ptes_none has very little effect on CPU time, so that cost
can be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding the number would block the collapse::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control the hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate a huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

There's also a sysfs knob to control the hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to the policies listed above, shmem_enabled allows two further
values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;

force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled and
transparent_hugepage/hugepages-<size>kB/enabled values and the tmpfs mount
option only affect future behavior. So to make them effective you need
to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.

Monitoring usage
================

.. note::
   Currently the below counters only record events relating to
   PMD-sized THP. Events relating to other THP sizes are not included.

The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using PMD-sized anonymous transparent huge
pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
AnonHugePmdMapped).

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.
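
The per-mapping accounting of ``/proc/PID/smaps`` fields can be sketched
with a small parser. This is only an illustration, not a kernel-provided
tool: the helper names ``sum_smaps_field`` and ``read_smaps`` are
hypothetical, and the sketch assumes each field appears on its own line as
``Name:   <value> kB``.

.. code-block:: python

    def sum_smaps_field(smaps_text, field="AnonHugePages"):
        """Return the total of `field` (in kB) across all mappings."""
        total_kb = 0
        prefix = field + ":"
        for line in smaps_text.splitlines():
            if line.startswith(prefix):
                # Assumed line shape: "AnonHugePages:      2048 kB".
                parts = line[len(prefix):].split()
                if parts:
                    total_kb += int(parts[0])
        return total_kb


    def read_smaps(pid):
        """Read /proc/<pid>/smaps; needs permission to inspect that process."""
        with open(f"/proc/{pid}/smaps") as f:
            return f.read()

For a hypothetical PID 1234, ``sum_smaps_field(read_smaps(1234))`` would
give the total of the AnonHugePages fields of that process in kB.
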

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated to handle a page fault.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate
    a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using small pages even though the
    allocation was successful.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a file huge page is successfully
    allocated.

thp_file_fallback
    is incremented if a file huge page is attempted to be allocated
    but fails and instead falls back to using small pages.

thp_file_fallback_charge
    is incremented if a file huge page cannot be charged and instead
    falls back to using small pages even though the allocation was
    successful.

thp_file_mapped
    is incremented every time a file huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs that map the page.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on the split queue
    are going to be split under memory pressure.

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls mprotect()
    or munmap() on part of a huge page. It doesn't split the huge page,
    only the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page used for thp is
    successfully allocated. Note, it doesn't count every map of
    the huge zero page, only its allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate contiguous swap
    space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
    is incremented every time a process stalls to run
    memory compaction so that a huge page is free for use.

compact_success
    is incremented if the system compacted memory and
    freed a huge page for use.

compact_fail
    is incremented if the system tries to compact memory
    but failed.

It is possible to establish how long the stalls were by using the function
tracer to record how much time was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.
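
The ``/proc/vmstat`` counters described in this section can be sampled with
a short script. The following is a minimal sketch, assuming each line of
``/proc/vmstat`` has the form ``<name> <value>``; the helper names
``parse_vmstat`` and ``fault_fallback_ratio`` are hypothetical, and the
ratio is only a rough indicator chosen for illustration, not a metric
defined by the kernel.

.. code-block:: python

    def parse_vmstat(vmstat_text, prefixes=("thp_", "compact_")):
        """Return {name: value} for counters starting with one of `prefixes`."""
        counters = {}
        for line in vmstat_text.splitlines():
            fields = line.split()
            # Assumed line shape: "<name> <value>".
            if len(fields) == 2 and fields[0].startswith(prefixes):
                counters[fields[0]] = int(fields[1])
        return counters


    def fault_fallback_ratio(counters):
        """Share of THP page faults that fell back to small pages."""
        allocated = counters.get("thp_fault_alloc", 0)
        fallback = counters.get("thp_fault_fallback", 0)
        attempts = allocated + fallback
        return fallback / attempts if attempts else 0.0

Since the counters are cumulative, taking two samples and comparing the
differences is likely more informative than reading a single snapshot.
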

Optimizing the applications
===========================

To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be naturally aligned to the
hugepage size. posix_memalign() can provide that guarantee.

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.