.. SPDX-License-Identifier: GPL-2.0

=====================
Introduction of mseal
=====================

:Author: Jeff Xu <jeffxu@chromium.org>

Modern CPUs support memory permissions such as RW and NX bits. The memory
permission feature improves the security stance on memory corruption bugs,
i.e. an attacker can’t just write to arbitrary memory and point the code to
it; the memory has to be marked with the X bit, or else an exception will
happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages, and
applications can additionally seal security critical data at runtime.

A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].

SYSCALL
=======
mseal syscall signature
-----------------------
   ``int mseal(void \* addr, size_t len, unsigned long flags)``

   **addr**/**len**: virtual memory address range.
      The address range set by **addr**/**len** must meet:
         - The start address must be in an allocated VMA.
         - The start address must be page aligned.
         - The end address (**addr** + **len**) must be in an allocated VMA.
         - There must be no gap (unallocated memory) between the start and
           end address.

      The ``len`` will be page aligned implicitly by the kernel.

   **flags**: reserved for future use.

   **Return values**:
      - **0**: Success.
      - **-EINVAL**:
         * Invalid input ``flags``.
         * The start address (``addr``) is not page aligned.
         * Address range (``addr`` + ``len``) overflow.
      - **-ENOMEM**:
         * The start address (``addr``) is not allocated.
         * The end address (``addr`` + ``len``) is not allocated.
         * A gap (unallocated memory) between start and end address.
      - **-EPERM**:
         * Sealing is supported only on 64-bit CPUs; 32-bit is not supported.

   **Note about error return**:
      - For the above error cases, users can expect the given memory range
        to be unmodified, i.e. no partial update.
      - There might be other internal errors/cases not listed here, e.g.
        an error during merging/splitting VMAs, or the process reaching the
        maximum number of supported VMAs. In those cases, partial updates to
        the given memory range could happen. However, those cases should be
        rare.

   **Architecture support**:
      mseal only works on 64-bit CPUs, not 32-bit CPUs.

   **Idempotent**:
      Users can call mseal multiple times. mseal on already sealed memory
      is a no-op (not an error).

   **No munseal**:
      Once a mapping is sealed, it can't be unsealed. The kernel should never
      have munseal; this is consistent with other sealing features, e.g.
      F_SEAL_SEAL for files.

Blocked mm syscall for sealed mapping
-------------------------------------
   It might be important to note: **once the mapping is sealed, it will
   stay in the process's memory until the process terminates**.

   Example::

         void *ptr = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
         rc = mseal(ptr, 4096, 0);
         /* munmap will fail */
         rc = munmap(ptr, 4096);
         assert(rc < 0);

   Blocked mm syscalls:
      - munmap
      - mmap
      - mremap
      - mprotect and pkey_mprotect
      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
        MADV_DONTNEED_LOCKED, MADV_DONTFORK, MADV_WIPEONFORK

   The first set of syscalls to block is munmap, mremap, mmap.
   They can either leave an empty space in the address space, therefore
   allowing replacement with a new mapping with a new set of attributes, or
   can overwrite the existing mapping with another mapping.

   mprotect and pkey_mprotect are blocked because they change the
   protection bits (RWX) of the mapping.

   Certain destructive madvise behaviors, specifically MADV_DONTNEED,
   MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
   risks when applied to anonymous memory by threads lacking write
   permissions. Consequently, these operations are prohibited under such
   conditions. The aforementioned behaviors have the potential to modify
   region contents by discarding pages, effectively performing a memset(0)
   operation on the anonymous memory.

   The kernel will return -EPERM for blocked syscalls.

   When a blocked syscall returns -EPERM due to sealing, the memory regions
   may or may not have been changed, depending on the syscall being blocked:

      - munmap: munmap is atomic. If one of the VMAs in the given range is
        sealed, none of the VMAs are updated.
      - mprotect, pkey_mprotect, madvise: a partial update might happen, e.g.
        when mprotect spans multiple VMAs, it might update the beginning
        VMAs before reaching the sealed VMA and returning -EPERM.
      - mmap and mremap: undefined behavior.

Use cases
=========
- glibc:
  The dynamic linker, while loading ELF executables, can apply sealing to
  mapping segments.

- Chrome browser: protect some security sensitive data structures.

When not to use mseal
=====================
Applications can apply sealing to any virtual memory region from userspace,
but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
applying the sealing. This is because the sealed mapping *won’t be unmapped*
until the process terminates or the exec system call is invoked.
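
The lifetime constraint described above can be checked with a minimal sketch
like the one below. It seals one anonymous page and then verifies that
munmap is rejected with EPERM. It is written under assumptions: libc may not
provide an ``mseal()`` wrapper, so the raw syscall is used; the fallback
syscall number 462 is assumed to match the generic syscall table; and the
names ``mseal_raw()`` and ``sealed_page_outlives_munmap()`` are hypothetical
helpers, not kernel or libc APIs.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the mseal syscall number; used only when the
 * installed headers do not define __NR_mseal. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical helper: issue the raw syscall, since libc may lack mseal(). */
static int mseal_raw(void *addr, size_t len, unsigned long flags)
{
    return (int)syscall(__NR_mseal, addr, len, flags);
}

/*
 * Returns  1 if munmap on a sealed page was rejected with EPERM,
 *          0 if mseal is unavailable here (old kernel, 32-bit, sandbox),
 *         -1 on any unexpected result.
 */
static int sealed_page_outlives_munmap(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *ptr = mmap(NULL, len, PROT_READ,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    if (ptr == MAP_FAILED)
        return -1;

    if (mseal_raw(ptr, len, 0) != 0) {
        /* Sealing failed, so the mapping can still be cleaned up. */
        munmap(ptr, len);
        return 0;
    }

    /* From here on the page stays mapped until exit or exec. */
    if (munmap(ptr, len) == 0)
        return -1;
    return errno == EPERM ? 1 : -1;
}
```

A caller would treat a return value of 1 as confirmation that the sealed
page remains mapped for the rest of the process lifetime, and 0 as "sealing
is not available on this system".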

Examples of mappings that should not be sealed:
   - aio/shm
     aio/shm can call mmap and munmap on behalf of userspace, e.g.
     ksys_shmdt() in shm.c. The lifetimes of those mappings are not tied to
     the lifetime of the process. If that memory is sealed from userspace,
     then munmap will fail, causing leaks in the VMA address space during the
     lifetime of the process.

   - ptr allocated by malloc (heap)
     Don't use mseal on a memory ptr returned from malloc().
     malloc() is implemented by an allocator, e.g. by glibc. The heap manager
     might allocate a ptr from brk or from a mapping created by mmap.
     If an app calls mseal on a ptr returned from malloc(), this can affect
     the heap manager's ability to manage the mappings; the outcome is
     non-deterministic.

     Example::

        ptr = malloc(size);
        /* don't call mseal on a ptr returned from malloc. */
        mseal(ptr, size, 0);
        /* free will succeed, but the allocator can't shrink the heap below ptr */
        free(ptr);

mseal doesn't block
===================
In a nutshell, mseal blocks certain mm syscalls from modifying some of a
VMA's attributes, such as protection bits (RWX). A sealed mapping doesn't
mean the memory is immutable.

As Jann Horn pointed out in [3], there are still a few ways to write
to RO memory, which is, in a way, by design. And those could be blocked
by different security measures.

Those cases are:

   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
   - userfaultfd.

The idea that inspired this patch comes from Stephen Röttger’s work in V8
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
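
To illustrate the statement in this section that sealing protects a VMA's
attributes rather than its contents, the sketch below seals a page mapped
read-write, stores to it, and then checks that mprotect is rejected. This is
a sketch under assumptions, not a reference implementation: the fallback
syscall number 462 and the helper names ``mseal_raw()`` and
``sealed_rw_page_is_still_writable()`` are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the mseal syscall number; used only when the
 * installed headers do not define __NR_mseal. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical helper: issue the raw syscall, since libc may lack mseal(). */
static int mseal_raw(void *addr, size_t len, unsigned long flags)
{
    return (int)syscall(__NR_mseal, addr, len, flags);
}

/*
 * Returns  1 if a sealed RW page accepted a store and rejected mprotect,
 *          0 if mseal is unavailable here (old kernel, 32-bit, sandbox),
 *         -1 on any unexpected result.
 */
static int sealed_rw_page_is_still_writable(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    if (p == MAP_FAILED)
        return -1;

    if (mseal_raw((void *)p, len, 0) != 0) {
        /* Sealing failed, so the mapping can still be cleaned up. */
        munmap((void *)p, len);
        return 0;
    }

    /* An ordinary store still works: sealing does not freeze contents. */
    p[0] = 0x5a;
    if (p[0] != 0x5a)
        return -1;

    /* Changing the protection bits is what the seal rejects. */
    if (mprotect((void *)p, len, PROT_READ) == 0)
        return -1;
    return errno == EPERM ? 1 : -1;
}
```

A return value of 1 shows both halves of the point: the contents of the
sealed page changed, while its protection bits could not be altered.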

Reference
=========
- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
- [2] https://man.openbsd.org/mimmutable.2
- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc