.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <skhan@linuxfoundation.org>

.. contents:: :local:

Overview
--------

This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.

Introduction
------------

Kernel stack overflows are often hard to debug and make the kernel
susceptible to exploits. Problems may show up at a later time, making
them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing
difficult-to-diagnose corruption.

The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. This feature
causes reliable faults when the stack overflows. The usability of
the stack trace after an overflow and the response to the overflow itself
are architecture dependent.

.. note::
        As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
        support for VMAP_STACK.

HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support Virtually Mapped Kernel Stacks should
enable this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables or arch code (switch_to() and switch_mm(),
  most likely) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack.
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.

VMAP_STACK
----------

When enabled, the VMAP_STACK bool configuration option allocates virtually
mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::

        Using this feature with KASAN requires architecture support
        for backing virtual mappings with real shadow memory, and
        KASAN_VMALLOC must be enabled.

.. note::

        When VMAP_STACK is enabled, it is not possible to run DMA on
        stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

Allocation
----------

When a new kernel thread is created, a thread stack is allocated from
virtually contiguous memory pages from the page-level allocator. These
pages are mapped into contiguous kernel virtual space with PAGE_KERNEL
protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections.

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually when assigning stacks to tasks and
  releasing them. Hence, __vmalloc_node_range() is called without
  __GFP_ACCOUNT.
- The vm_struct is cached so that it can be found when thread free is
  initiated in interrupt context. free_thread_stack() can be called in
  interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- According to the original patch, this does not address interrupt stacks.

Thread stack allocation is initiated from clone(), fork(), vfork(), and
kernel_thread() via kernel_clone(). These are a few hints for searching
the code base to understand when and how a thread stack is allocated.

The bulk of the code is in
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-null stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

        struct vm_struct *stack_vm_area;

Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the stack
overflows into the guard pages, handlers have to be careful not to overflow
the stack again. When handlers are called, it is likely that very little
stack space is left.

On x86, this is done by handling the page fault indicating the kernel
stack overflow on the double-fault stack.

Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating with a leading
and trailing guard page? The following lkdtm tests can help detect any
regressions.

::

        void lkdtm_STACK_GUARD_PAGE_LEADING()
        void lkdtm_STACK_GUARD_PAGE_TRAILING()

Conclusions
-----------

- A percpu cache of vmalloced stacks appears to be a bit faster than a
  high-order stack allocation, at least when the cache hits.
- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
  simply embeds the thread_info (containing only flags) and 'int cpu' into
  task_struct.
- The thread stack can be freed as soon as the task is dead (without
  waiting for RCU) and then, if vmapped stacks are in use, the entire
  stack can be cached for reuse on the same CPU.
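
The stack caching mentioned in the conclusions above can be illustrated with
a small user-space sketch. This is not the kernel implementation: the names
``SKETCH_CACHE_SLOTS``, ``SKETCH_STACK_SIZE``, ``stack_alloc()``,
``stack_free()`` and ``stack_cache_count()`` are hypothetical, ``malloc()``
stands in for the vmalloc-backed allocation, and a single array stands in for
what would be per-CPU state. It only shows the general idea of reusing a
released stack when the cache hits and falling back to a fresh allocation
when it misses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical values chosen for this sketch only. */
#define SKETCH_CACHE_SLOTS 2
#define SKETCH_STACK_SIZE  (16 * 1024)

/* Stands in for one CPU's cache of released stacks. */
static void *cached_stacks[SKETCH_CACHE_SLOTS];

/* Reuse a cached stack on a hit; otherwise allocate a new one. */
void *stack_alloc(void)
{
	for (int i = 0; i < SKETCH_CACHE_SLOTS; i++) {
		void *stack = cached_stacks[i];

		if (stack) {
			cached_stacks[i] = NULL;
			return stack;
		}
	}
	/* Cache miss: malloc() stands in for the real stack allocation. */
	return malloc(SKETCH_STACK_SIZE);
}

/* Park the stack in an empty slot; release it if the cache is full. */
void stack_free(void *stack)
{
	assert(stack != NULL);

	for (int i = 0; i < SKETCH_CACHE_SLOTS; i++) {
		if (!cached_stacks[i]) {
			cached_stacks[i] = stack;
			return;
		}
	}
	free(stack);
}

/* Number of stacks currently parked in the cache. */
int stack_cache_count(void)
{
	int count = 0;

	for (int i = 0; i < SKETCH_CACHE_SLOTS; i++)
		if (cached_stacks[i])
			count++;
	return count;
}
```

In this sketch, a stack passed to ``stack_free()`` is handed back by the next
``stack_alloc()`` call without touching the allocator, which is the cache-hit
case the first conclusion refers to. Once every slot is occupied, further
released stacks are simply freed.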