============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
backing virtual memory with huge pages that supports the automatic
promotion and demotion of page sizes and avoids the shortcomings of
hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem.
In the future it may expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists of taking a single page fault for each 2M virtual
region touched by userland (reducing the kernel enter/exit frequency by
a factor of 512). This only matters the first time the memory is
accessed for the lifetime of a memory mapping. The second, long-lasting
and much more important factor affects all subsequent accesses to the
memory for the whole runtime of the application. The second factor
consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses.
   With virtualization and nested pagetables the TLB can map the
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, simply because the TLB miss is
   going to run faster.

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: Page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous
and appropriately aligned. In this case, TLB misses will occur less
often.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages of either PMD size
or mTHP sizes, if the system is configured to do so.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities.
It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, just as they have been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by no means mandatory, and khugepaged can already take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, an application
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no benefit. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide.
This can be achieved per-supported-THP-size with one of::

    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

where <size> is the hugepage size being addressed, the available sizes
for which vary by system.

.. note::
   Setting "never" in all sysfs THP controls does **not** disable
   Transparent Huge Pages globally. This is because ``madvise(...,
   MADV_COLLAPSE)`` ignores these settings and collapses ranges to
   PMD-sized huge pages unconditionally.

For example::

    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

For example::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled

By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.

It's also possible to limit the VM's defrag efforts to generate
anonymous hugepages, when they are not immediately free, to madvise
regions only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available.
Clearly if we spend CPU
time to defrag memory, we would expect to gain even more by the fact we
use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

::

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
    means that an application requesting THP will stall on
    allocation failure and directly reclaim pages and compact
    memory in an effort to allocate a THP immediately. This may be
    desirable for virtual machines that benefit heavily from THP
    use and are willing to delay the VM start to utilise them.

defer
    means that an application will wake kswapd in the background
    to reclaim pages and wake kcompactd to compact memory so that
    THP is available in the near future. It's the responsibility
    of khugepaged to then install the THP pages later.

defer+madvise
    will enter direct reclaim and compaction like ``always``, but
    only for regions that have used madvise(MADV_HUGEPAGE); all
    other regions will wake kswapd in the background to reclaim
    pages and wake kcompactd to compact memory so that THP is
    available in the near future.

madvise
    will enter direct reclaim like ``always`` but only for regions
    that have used madvise(MADV_HUGEPAGE). This is the default
    behaviour.

never
    should be self-explanatory. Note that ``madvise(...,
    MADV_COLLAPSE)`` can still cause transparent huge pages to be
    obtained even if this mode is specified everywhere.

By default the kernel tries to use a huge, PMD-mappable zero page on read
page fault to an anonymous mapping.
It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

All THPs at fault and collapse time will be added to _deferred_list,
and will therefore be split under memory pressure if they are considered
"underused". A THP is underused if the number of zero-filled pages in
the THP is above max_ptes_none (see below). It is possible to disable
this behaviour by writing 0 to shrink_underused, and enable it by writing
1 to it::

    echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
    echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused

khugepaged will be automatically started when any THP size is enabled
(either the per-size anon control or the top-level control is set
to "always" or "madvise"), and it'll be automatically shut down when
all THP sizes are disabled (when both the per-size anon control and the
top-level control are "never").

Process THP controls
--------------------

A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using
``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls
support the following arguments::

    prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
        This will disable THPs completely for the process, irrespective
        of global THP controls or madvise(..., MADV_COLLAPSE) being used.

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
        This will disable THPs for the process except when the usage of THPs is
        advised.
        Consequently, THPs will only be used when:
        - Global THP controls are set to "always" or "madvise" and
          madvise(..., MADV_HUGEPAGE) or madvise(..., MADV_COLLAPSE) is used.
        - Global THP controls are set to "never" and madvise(..., MADV_COLLAPSE)
          is used. This is the same behavior as if THPs would not be disabled on
          a process level.
        Note that MADV_COLLAPSE is currently always rejected if
        madvise(..., MADV_NOHUGEPAGE) is set on an area.

    prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
        This will re-enable THPs for the process, as if they were never disabled.
        Whether THPs will actually be used depends on global THP controls and
        madvise() calls.

    prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
        This returns a value whose bits indicate how THP-disable is configured:
        Bits
         1 0  Value  Description
        |0|0|   0    No THP-disable behaviour specified.
        |0|1|   1    THP is entirely disabled for this process.
        |1|1|   3    THP-except-advised mode is set for this process.

Khugepaged controls
-------------------

.. note::
   khugepaged currently only searches for opportunities to collapse file/shmem
   to PMD-sized THP. Only anonymous memory will attempt to collapse to other THP
   sizes.

khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged.
However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
being replaced by a PMD mapping, or (2) physical pages replaced by one
hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
or together, depending on the type of memory and the failures that occur.
As such, this value should be interpreted roughly as a sign of progress,
and counters in /proc/vmstat consulted for more accurate accounting)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full scans, incremented for each pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
when collapsing a group of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

For PMD-sized THP collapse, this directly limits the number of empty pages
allowed in the 2MB region.
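
The khugepaged knobs above are plain text files, so their current values
can be collected in one go. The following is a minimal sketch rather than
an official tool: the function name ``thp_show_khugepaged`` and its
optional directory argument are hypothetical, introduced only so the
example can be pointed at a different directory when testing::

    # Print "name=value" for every khugepaged knob found in a directory.
    # Hypothetical helper; defaults to the sysfs path used in this document.
    thp_show_khugepaged() {
        dir="${1:-/sys/kernel/mm/transparent_hugepage/khugepaged}"
        for knob in defrag pages_to_scan scan_sleep_millisecs \
                    alloc_sleep_millisecs pages_collapsed full_scans \
                    max_ptes_none; do
            # Skip knobs that are missing or unreadable on this system.
            [ -r "$dir/$knob" ] || continue
            printf '%s=%s\n' "$knob" "$(cat "$dir/$knob")"
        done
    }

Calling ``thp_show_khugepaged`` without arguments reads the live values.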

For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At
HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermediate
value will emit a warning and mTHP collapse will default to max_ptes_none=0.

A higher value allows more empty pages, potentially leading to more memory
usage but better THP performance. A lower value is more conservative and
may result in fewer THP collapses.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. khugepaged might treat pages of THPs as shared if any page of
that THP is shared. Exceeding the number would block the collapse::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

.. note::
   For mTHP collapse, khugepaged does not support collapsing regions that
   contain shared or swapped out pages, as this could lead to continuous
   promotion to higher orders. The collapse will fail if any shared or
   swapped PTEs are encountered during the scan.

   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
   and does not attempt mTHP collapses.

Boot parameters
===============

You can change the sysfs boot time default for the top-level "enabled"
control by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
kernel command line.
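
Whether a boot-time default was requested, and what the top-level control
currently reports, can be checked from userspace. A minimal sketch,
assuming the kernel command line is available in ``/proc/cmdline`` and
that the ``enabled`` file marks the active value with square brackets
(for example ``always [madvise] never``); both helper names are
hypothetical::

    # Print the value of each transparent_hugepage= parameter in a
    # kernel command line string (prints nothing if there is none).
    thp_boot_default() {
        printf '%s\n' "$1" | tr ' ' '\n' |
            sed -n 's/^transparent_hugepage=//p'
    }

    # Print the bracketed (active) value from the contents of an
    # "enabled" file, e.g. "always [madvise] never" -> "madvise".
    thp_active_value() {
        printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
    }

For instance, ``thp_boot_default "$(cat /proc/cmdline)"`` and
``thp_active_value "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"``
can be compared to see whether the setting was changed after boot.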

Alternatively, each supported anonymous THP size can be controlled by
passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
where ``<size>`` is the THP size (must be a power-of-2 multiple of PAGE_SIZE
and a supported anonymous THP size) and ``<state>`` is one of ``always``,
``madvise``, ``never`` or ``inherit``.

For example, the following will set 16K, 32K, 64K THP to ``always``,
set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
to ``never``::

    thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never

``thp_anon=`` may be specified multiple times to configure all THP sizes as
required. If ``thp_anon=`` is specified at least once, any anon THP sizes
not explicitly configured on the command line are implicitly set to
``never``.

``transparent_hugepage`` setting only affects the global toggle. If
``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``.
However, if a valid ``thp_anon`` setting is provided by the user, the
PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER
is not defined within a valid ``thp_anon``, its policy will default to
``never``.

Similarly to ``transparent_hugepage``, you can control the hugepage
allocation policy for the internal shmem mount by using the kernel parameter
``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the
six valid policies for shmem (``always``, ``within_size``, ``advise``,
``never``, ``deny``, and ``force``).

Similarly to ``transparent_hugepage_shmem``, you can control the default
hugepage allocation policy for the tmpfs mount by using the kernel parameter
``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the
four valid policies for tmpfs (``always``, ``within_size``, ``advise``,
``never``). The tmpfs mount default policy is ``never``.
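
The policy names accepted by the two parameters differ, which is easy to
get wrong when generating a command line. A minimal sketch of a check that
encodes the lists above; the helper name ``thp_policy_ok`` is hypothetical
and the lists are copied from this document rather than queried from the
kernel::

    # Return 0 if <policy> ($2) is listed for the given mount kind ($1,
    # either "shmem" or "tmpfs"), and 1 otherwise.
    thp_policy_ok() {
        case "$1:$2" in
        shmem:always|shmem:within_size|shmem:advise|shmem:never)
            return 0 ;;
        shmem:deny|shmem:force)
            # Only the internal shmem mount lists deny and force.
            return 0 ;;
        tmpfs:always|tmpfs:within_size|tmpfs:advise|tmpfs:never)
            return 0 ;;
        *)
            return 1 ;;
        esac
    }

``thp_policy_ok tmpfs force`` returns non-zero, since ``force`` is only
listed for shmem.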

Additionally, Kconfig options are available to set the default hugepage
policies for shmem (``CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_*``) and tmpfs
(``CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_*``) at build time. Refer to the
Kconfig help for more details.

In the same manner as ``thp_anon`` controls each supported anonymous THP
size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
has the same format as ``thp_anon``, but also supports the policy
``within_size``.

``thp_shmem=`` may be specified multiple times to configure all THP sizes
as required. If ``thp_shmem=`` is specified at least once, any shmem THP
sizes not explicitly configured on the command line are implicitly set to
``never``.

``transparent_hugepage_shmem`` setting only affects the global toggle. If
``thp_shmem`` is not specified, PMD_ORDER hugepage will default to
``inherit``. However, if a valid ``thp_shmem`` setting is provided by the
user, the PMD_ORDER hugepage policy will be overridden. If the policy for
PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
default to ``never``.

Hugepages in tmpfs/shmem
========================

Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
it also supports smaller sizes just like anonymous memory, often referred
to as "multi-size THP" (mTHP). Huge pages of any size are commonly
represented in the kernel as "large folios".

While there is fine control over the huge page sizes to use for the internal
shmem mount (see below), ordinary tmpfs mounts will make use of all available
huge page sizes without any control over the exact sizes, behaving more like
other file systems.

tmpfs mounts
------------

The THP allocation policy for tmpfs mounts can be adjusted using the mount
option: ``huge=``.
It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;
    Always try PMD-sized huge pages first, and fall back to smaller-sized
    huge pages if the PMD-sized huge page allocation fails;

never
    Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)``
    can still cause transparent huge pages to be obtained even if this mode
    is specified everywhere;

within_size
    Only allocate huge page if it will be fully within i_size;
    Always try PMD-sized huge pages first, and fall back to smaller-sized
    huge pages if the PMD-sized huge page allocation fails;
    Also respect madvise() hints;

advise
    Only allocate huge pages if requested with madvise();

Remember that the kernel may use huge pages of all available sizes, and
that no fine control as for the internal tmpfs mount is available.

The default policy in the past was ``never``, but it can now be adjusted
using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

In addition to policies listed above, the sysfs knob
/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
allocation policy of tmpfs mounts, when set to the following values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;

force
    Force the huge option on for all - very useful for testing;

shmem / internal tmpfs
----------------------

The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

To control the THP allocation policy for this internal tmpfs mount, the
sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
per THP size in
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
can be used.

The global knob has the same semantics as the ``huge=`` mount options
for tmpfs mounts, except that the different huge page sizes can be controlled
individually, and will only use the setting of the global knob when the
per-size knob is set to 'inherit'.

The options 'force' and 'deny', which are rather testing artifacts from
earlier days, are dropped for the individual sizes.

always
    Attempt to allocate <size> huge pages every time we need a new page;

inherit
    Inherit the top-level "shmem_enabled" value. By default, PMD-sized
    hugepages have shmem_enabled="inherit" and all other hugepage sizes
    have shmem_enabled="never";

never
    Do not allocate <size> huge pages. Note that ``madvise(...,
    MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained
    even if this mode is specified everywhere;

within_size
    Only allocate <size> huge page if it will be fully within i_size.
    Also respect madvise() hints;

advise
    Only allocate <size> huge pages if requested with madvise();

Need of application restart
===========================

The transparent_hugepage/enabled and
transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
option only affect future behavior. So to make them effective you need
to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.

Monitoring usage
================

The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using PMD-sized anonymous transparent huge
pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
AnonHugePmdMapped).

The number of file transparent huge pages mapped to userspace is available
by reading the ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated and charged to handle a page fault.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate or charge
    a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using small pages even though the
    allocation was successful.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a shmem huge page is successfully
    allocated (Note that despite being named after "file", the counter
    measures only shmem).

thp_file_fallback
    is incremented if a shmem huge page is attempted to be allocated
    but fails and instead falls back to using small pages. (Note that
    despite being named after "file", the counter measures only shmem).

thp_file_fallback_charge
    is incremented if a shmem huge page cannot be charged and instead
    falls back to using small pages even though the allocation was
    successful. (Note that despite being named after "file", the
    counter measures only shmem).

thp_file_mapped
    is incremented every time a file or shmem huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on the split queue are
    going to be split under memory pressure.

thp_underused_split_page
    is incremented when a huge page on the split queue was split
    because it was underused. A THP is underused if the number of
    zero pages in the THP is above a certain threshold
    (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls mprotect() or
    munmap() on part of a huge page. It doesn't split the huge page, only
    the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page used for thp is
    successfully allocated.
    Note, it doesn't count every map of
    the huge zero page, only its allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.

In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, there are
also individual counters for each huge page size, which can be utilized to
monitor the system's effectiveness in providing huge pages for usage. Each
counter has its own corresponding file.

anon_fault_alloc
    is incremented every time a huge page is successfully
    allocated and charged to handle a page fault.

anon_fault_fallback
    is incremented if a page fault fails to allocate or charge
    a huge page and instead falls back to using huge pages with
    lower orders or small pages.

anon_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using huge pages with lower orders or
    small pages even though the allocation was successful.

collapse_alloc
    is incremented every time a huge page is successfully allocated for a
    khugepaged collapse.

collapse_alloc_failed
    is incremented every time a huge page allocation fails during a
    khugepaged collapse.

zswpout
    is incremented every time a huge page is swapped out to zswap in one
    piece without splitting.

swpin
    is incremented every time a huge page is swapped in from a non-zswap
    swap device in one piece.

swpin_fallback
    is incremented if swapin fails to allocate or charge a huge page
    and instead falls back to using huge pages with lower orders or
    small pages.

swpin_fallback_charge
    is incremented if swapin fails to charge a huge page and instead
    falls back to using huge pages with lower orders or small pages
    even though the allocation was successful.

swpout
    is incremented every time a huge page is swapped out to a non-zswap
    swap device in one piece without splitting.

swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.

shmem_alloc
    is incremented every time a shmem huge page is successfully
    allocated.

shmem_fallback
    is incremented if a shmem huge page is attempted to be allocated
    but fails and instead falls back to using small pages.

shmem_fallback_charge
    is incremented if a shmem huge page cannot be charged and instead
    falls back to using small pages even though the allocation was
    successful.

split
    is incremented every time a huge page is successfully split into
    smaller orders. This can happen for a variety of reasons but a
    common reason is that a huge page is old and is being reclaimed.

split_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

split_deferred
    is incremented when a huge page is put onto the split queue.
    This happens when a huge page is partially unmapped and splitting
    it would free up some memory. Pages on the split queue are going to
    be split under memory pressure, if splitting is possible.

nr_anon
    the number of anonymous THP we have in the whole system. These THPs
    might be currently entirely mapped or have partially unmapped/unused
    subpages.

nr_anon_partially_mapped
    the number of anonymous THP which are likely partially mapped, possibly
    wasting memory, and have been queued for deferred memory reclamation.
    Note that in some corner cases (e.g., failed migration), we might detect
    an anonymous THP as "partially mapped" and count it here, even though it
    is not actually partially mapped anymore.

collapse_exceed_none_pte
    The number of collapse attempts that failed due to exceeding the
    max_ptes_none threshold.

collapse_exceed_swap_pte
    The number of collapse attempts that failed due to exceeding the
    max_ptes_swap threshold. For non-PMD orders this occurs if an mTHP range
    contains at least one swap PTE.

collapse_exceed_shared_pte
    The number of collapse attempts that failed due to exceeding the
    max_ptes_shared threshold. For non-PMD orders this occurs if an mTHP range
    contains at least one shared PTE.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
    is incremented every time a process stalls to run
    memory compaction so that a huge page is free for use.

compact_success
    is incremented if the system compacted memory and
    freed a huge page for use.

compact_fail
    is incremented if the system tries to compact memory
    but failed.

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always.
No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.
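
Because hugetlbfs and THP coexist, it can help to look at both side by
side. A minimal sketch, assuming the usual hugetlbfs fields
``HugePages_Total``, ``HugePages_Free`` and ``Hugepagesize`` in
``/proc/meminfo`` (these names are not defined in this document); the
helper name ``thp_meminfo_summary`` is hypothetical::

    # Print the THP and hugetlbfs lines from a meminfo-style file.
    # Hypothetical helper; the file defaults to /proc/meminfo.
    thp_meminfo_summary() {
        file="${1:-/proc/meminfo}"
        # THP fields first, then the assumed hugetlbfs pool fields.
        grep -E '^(AnonHugePages|ShmemHugePages|ShmemPmdMapped|HugePages_Total|HugePages_Free|Hugepagesize):' "$file"
    }

Lines are printed in the order they appear in the file, so the output can
be compared directly with ``/proc/meminfo``.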