============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
backing virtual memory with huge pages that supports the automatic
promotion and demotion of page sizes, without the shortcomings of
hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem,
but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster with THP because of two factors. The first
factor is almost completely irrelevant and not of significant interest,
because it also has the downside of requiring larger clear-page and
copy-page operations in page faults, which is a potentially negative
effect. It consists of taking a single page fault for each 2M virtual
region touched by userland (so reducing the enter/exit kernel frequency
by a factor of 512). This only matters the first time the memory is
accessed for the lifetime of a memory mapping. The second, long-lasting
and much more important factor affects all subsequent accesses to the
memory for the whole runtime of the application. The second factor
consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses.
   With virtualization and nested pagetables the TLB can be mapped at a
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, simply because the TLB miss is going to
   run faster.

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: Page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous
and appropriately aligned. In this case, TLB misses will occur less
often.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into PMD-sized huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland.
It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications, however, can be further optimized to take advantage of
this feature, just as they have been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by no means mandatory, and khugepaged can already take care of
long-lived page allocations even for hugepage-unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no benefit. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide.
This can be achieved per-supported-THP-size with one of::

    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

where <size> is the hugepage size being addressed; the available sizes
vary by system.

For example::

    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

For example::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled

By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.

It's also possible to restrict the VM's defrag efforts (made to generate
anonymous hugepages when none are immediately free) to madvise regions
only, or to never try to defrag memory and simply fall back to regular
pages unless hugepages are immediately available. Clearly if we spend CPU
time to defrag memory, we would expect to gain even more by the fact we
use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

The defrag policy is selected by writing one of the following values::

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
    means that an application requesting THP will stall on
    allocation failure and directly reclaim pages and compact
    memory in an effort to allocate a THP immediately. This may be
    desirable for virtual machines that benefit heavily from THP
    use and are willing to delay the VM start to utilise them.

defer
    means that an application will wake kswapd in the background
    to reclaim pages and wake kcompactd to compact memory so that
    THP is available in the near future. It's the responsibility
    of khugepaged to then install the THP pages later.

defer+madvise
    will enter direct reclaim and compaction like ``always``, but
    only for regions that have used madvise(MADV_HUGEPAGE); all
    other regions will wake kswapd in the background to reclaim
    pages and wake kcompactd to compact memory so that THP is
    available in the near future.

madvise
    will enter direct reclaim like ``always`` but only for regions
    that have used madvise(MADV_HUGEPAGE). This is the default
    behaviour.

never
    should be self-explanatory.

By default the kernel tries to use a huge, PMD-mappable zero page on a
read page fault to an anonymous mapping.
It's possible to disable the huge zero
page by writing 0 or re-enable it by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

khugepaged will be automatically started when one or more hugepage
sizes are enabled (either by directly setting "always" or "madvise",
or by setting "inherit" while the top-level enabled is set to "always"
or "madvise"), and it will be automatically shut down when the last
hugepage size is disabled (either by directly setting "never", or by
setting "inherit" while the top-level enabled is set to "never").

Khugepaged controls
-------------------

.. note::
   khugepaged currently only searches for opportunities to collapse to
   PMD-sized THP and no attempt is made to collapse to other THP
   sizes.

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged.
However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) a PTE mapping
being replaced by a PMD mapping, or (2) all 4K physical pages being
replaced by one 2M hugepage. Each may happen independently, or together,
depending on the type of memory and the failures that occur. As such, this
value should be interpreted roughly as a sign of progress, and counters in
/proc/vmstat consulted for more accurate accounting)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of passes completed::

    /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value causes programs to use additional memory.
A lower value yields less THP performance gain.
The value of
max_ptes_none has very little effect on CPU time, so that cost can be
ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. khugepaged might treat pages of THPs as shared if any page of
that THP is shared. Exceeding the number would block the collapse::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control the hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

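
The remount case above can be sketched as a small shell helper. This is
only an illustration: it checks the policy name against the four values
listed above and prints the corresponding ``mount`` command instead of
running it. The function name ``tmpfs_huge_remount_cmd`` and the mount
point in the usage line are hypothetical and not part of any kernel
tooling.

.. code-block:: sh

    # Hypothetical helper: print (do not run) the remount command for a
    # tmpfs huge= policy, rejecting values not listed in this document.
    tmpfs_huge_remount_cmd() {
        case $1 in
            always|never|within_size|advise) ;;
            *) echo "unknown huge= policy: $1" >&2; return 1 ;;
        esac
        printf 'mount -o remount,huge=%s %s\n' "$1" "$2"
    }

    # Usage (the mount point is an assumption):
    # tmpfs_huge_remount_cmd never /mnt/mytmpfs

Printing the command rather than executing it keeps the sketch safe to
try without root privileges.
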

There's also a sysfs knob to control the hugepage allocation policy for
the internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;

force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled and
transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
option only affect future behavior. So to make them effective you need
to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.

Monitoring usage
================

.. note::
   Currently the below counters only record events relating to
   PMD-sized THP. Events relating to other THP sizes are not included.

The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using PMD-sized anonymous transparent huge
pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
AnonHugePmdMapped).

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
for each mapping.

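
As a rough sketch of the per-process accounting described above, the
following shell function sums the AnonHugePages fields across all
mappings of one process. The function name ``thp_anon_kb`` is
hypothetical and not part of any kernel tooling; it reads smaps-formatted
text from standard input so that it can be tried on saved output as well
as on a live ``/proc/PID/smaps`` file.

.. code-block:: sh

    # Hypothetical helper: sum the AnonHugePages fields (reported in kB)
    # found in smaps-formatted text on standard input.
    thp_anon_kb() {
        awk '$1 == "AnonHugePages:" { sum += $2 } END { print sum + 0 }'
    }

    # Usage: PMD-sized anonymous THP mapped by the current shell, in kB.
    # thp_anon_kb </proc/$$/smaps

The same pattern applies to the FileHugeMapped field by changing the
field name matched in the awk expression.
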

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated and charged to handle a page fault.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate or charge
    a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using small pages even though the
    allocation was successful.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a file huge page is successfully
    allocated.

thp_file_fallback
    is incremented if a file huge page is attempted to be allocated
    but fails and instead falls back to using small pages.

thp_file_fallback_charge
    is incremented if a file huge page cannot be charged and instead
    falls back to using small pages even though the allocation was
    successful.

thp_file_mapped
    is incremented every time a file huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page.
    This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on the split queue
    are going to be split under memory pressure.

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls mprotect()
    or munmap() on part of a huge page. It doesn't split the huge page,
    only the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page used for thp is
    successfully allocated. Note, it doesn't count every map of
    the huge zero page, only its allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.

In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, there are
also individual counters for each huge page size, which can be used to
monitor the system's effectiveness in providing huge pages for usage. Each
counter has its own corresponding file.

anon_fault_alloc
    is incremented every time a huge page is successfully
    allocated and charged to handle a page fault.

anon_fault_fallback
    is incremented if a page fault fails to allocate or charge
    a huge page and instead falls back to using huge pages with
    lower orders or small pages.

anon_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using huge pages with lower orders or
    small pages even though the allocation was successful.

swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
    is incremented every time a process stalls to run
    memory compaction so that a huge page is free for use.

compact_success
    is incremented if the system compacted memory and
    freed a huge page for use.

compact_fail
    is incremented if the system tries to compact memory
    but fails.

It is possible to establish how long the stalls were by using the function
tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.