.. SPDX-License-Identifier: GPL-2.0

=====================
Introduction of mseal
=====================

:Author: Jeff Xu <jeffxu@chromium.org>

Modern CPUs support memory permissions such as the RW and NX bits. Memory
permissions improve the security posture against memory corruption bugs: an
attacker cannot simply write to arbitrary memory and point the code to it,
because the memory has to be marked with the X bit, or else an exception is
raised.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can be applied
automatically by the runtime loader to seal .text and .rodata pages, and
applications can additionally seal security-critical data at runtime.

A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].

User API
========
mseal()
-----------
The mseal() syscall has the following signature:

``int mseal(void *addr, size_t len, unsigned long flags)``

**addr/len**: virtual memory address range.

The address range set by ``addr``/``len`` must meet:
   - The start address must be in an allocated VMA.
   - The start address must be page aligned.
   - The end address (``addr`` + ``len``) must be in an allocated VMA.
   - No gap (unallocated memory) between start and end address.

The ``len`` will be page aligned implicitly by the kernel.

**flags**: reserved for future use; must currently be 0.
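
As a minimal sketch of the call described above, the helper below invokes
the syscall directly through syscall(2), since a libc wrapper may not be
available. The names ``sys_mseal()`` and ``map_sealed_page()`` are
hypothetical, and the fallback syscall number 462 is an assumption that
should be checked against the installed kernel headers.

.. code-block:: c

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_mseal
    #define SYS_mseal 462   /* assumption: verify against your kernel headers */
    #endif

    /* Hypothetical helper: returns 0 on success, -1 with errno set on failure. */
    static int sys_mseal(void *addr, size_t len, unsigned long flags)
    {
        return (int)syscall(SYS_mseal, addr, len, flags);
    }

    /*
     * Map one anonymous read-only page and seal it. Returns the mapping, or
     * NULL if mmap() or mseal() failed (errno describes the failure).
     */
    static void *map_sealed_page(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, page, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;

        /* mmap() returns a page-aligned address; flags is passed as 0. */
        if (sys_mseal(p, page, 0) != 0) {
            int saved = errno;

            munmap(p, page);    /* not sealed, so unmapping still works */
            errno = saved;
            return NULL;
        }
        return p;
    }

A caller would typically treat a ``NULL`` return with ``errno == ENOSYS`` as
"sealing is not available on this kernel" and decide whether to continue
without it.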

**return values**:

- ``0``: Success.

- ``-EINVAL``:
    - Invalid input ``flags``.
    - The start address (``addr``) is not page aligned.
    - Address range (``addr`` + ``len``) overflow.

- ``-ENOMEM``:
    - The start address (``addr``) is not allocated.
    - The end address (``addr`` + ``len``) is not allocated.
    - A gap (unallocated memory) between start and end address.

- ``-EPERM``:
    - Sealing is supported only on 64-bit CPUs; 32-bit is not supported.

- For the above error cases, users can expect the given memory range to be
  unmodified, i.e. no partial update.

- There might be other internal errors/cases not listed here, e.g.
  an error during merging/splitting VMAs, or the process reaching the max
  number of supported VMAs. In those cases, partial updates to the given
  memory range could happen. However, those cases should be rare.
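
The error cases listed above can be probed from userspace. The following is
only a sketch: ``mseal_errno()``, ``struct mseal_probe`` and
``probe_mseal_errors()`` are hypothetical names, the syscall number 462 is an
assumption, and ``~0UL`` merely stands in for "some invalid flags value".

.. code-block:: c

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_mseal
    #define SYS_mseal 462   /* assumption: verify against your kernel headers */
    #endif

    /* Hypothetical helper: returns 0 on success, otherwise the errno value. */
    static int mseal_errno(void *addr, size_t len, unsigned long flags)
    {
        return syscall(SYS_mseal, addr, len, flags) == 0 ? 0 : errno;
    }

    struct mseal_probe {
        int unaligned;      /* start address not page aligned */
        int bad_flags;      /* non-zero flags */
        int gap;            /* range ends in unallocated memory */
        int unmap_after;    /* munmap() result for the first page afterwards */
    };

    /* Returns 0 if the probe ran, -1 if the test mapping could not be set up. */
    static int probe_mseal_errors(struct mseal_probe *out)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, 2 * page, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return -1;

        out->unaligned = mseal_errno(p + 1, page, 0);
        out->bad_flags = mseal_errno(p, page, ~0UL);

        /* Unmap the second page so the range [p, p + 2 * page) has a gap. */
        munmap(p + page, page);
        out->gap = mseal_errno(p, 2 * page, 0);

        /*
         * No partial update is expected for the failed call, so the first
         * page should still be unsealed and munmap() should succeed.
         */
        out->unmap_after = munmap(p, page);
        return 0;
    }

On a kernel that implements mseal(), the first two fields are expected to be
``EINVAL`` and ``gap`` is expected to be ``ENOMEM``; on a kernel without the
syscall, all three would be ``ENOSYS`` instead.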

**Blocked operations after sealing**:
    Unmapping, moving to another location, and shrinking the size,
    via munmap() and mremap(). These can leave an empty space that can
    then be replaced with a VMA with a new set of attributes.

    Moving or expanding a different VMA into the current location,
    via mremap().

    Modifying a VMA via mmap(MAP_FIXED).

    Size expansion, via mremap(). This does not appear to pose any
    specific risks to sealed VMAs. It is included anyway because
    the use case is unclear. In any case, users can rely on
    merging to expand a sealed VMA.

    mprotect() and pkey_mprotect().

    Some destructive madvise() behaviors (e.g. MADV_DONTNEED)
    for anonymous memory, when users don't have write permission to the
    memory. Those behaviors can alter region contents by discarding pages,
    effectively a memset(0) for anonymous memory.

    The kernel will return -EPERM for blocked operations.

    For blocked operations, one can expect the given address is unmodified,
    i.e. no partial update. Note that this is different from existing mm
    system call behaviors, where partial updates are made until an error is
    found and returned to userspace. To give an example:

    Assume the following code sequence:

    - ptr = mmap(NULL, 8192, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    - munmap(ptr + 4096, 4096);
    - ret1 = mprotect(ptr, 8192, PROT_READ);
    - mseal(ptr, 4096, 0);
    - ret2 = mprotect(ptr, 8192, PROT_NONE);

    ret1 will be -ENOMEM, and the page from ptr is updated to PROT_READ.

    ret2 will be -EPERM, and the page remains PROT_READ.
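
The sequence above can be written as a small C function. This is a sketch
under a few assumptions: the page size is queried at runtime instead of
hard-coding 4096, the syscall is invoked through syscall(2) with an assumed
number of 462, and ``struct seq_result`` and ``run_sequence()`` are
hypothetical names. Each field holds 0 or the errno value of the call.

.. code-block:: c

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_mseal
    #define SYS_mseal 462   /* assumption: verify against your kernel headers */
    #endif

    struct seq_result {
        int ret1;   /* mprotect() across the gap, before sealing */
        int seal;   /* mseal() of the first page */
        int ret2;   /* mprotect() across the gap, after sealing */
    };

    /* Returns 0 if the sequence ran, -1 if the initial mmap() failed. */
    static int run_sequence(struct seq_result *out)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *ptr = mmap(NULL, 2 * page, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (ptr == MAP_FAILED)
            return -1;

        /* Leave only the first page mapped. */
        munmap(ptr + page, page);

        /* Partially applied: the first page becomes PROT_READ. */
        out->ret1 = mprotect(ptr, 2 * page, PROT_READ) == 0 ? 0 : errno;

        out->seal = syscall(SYS_mseal, ptr, page, 0UL) == 0 ? 0 : errno;

        /* Rejected as a whole once the first page is sealed. */
        out->ret2 = mprotect(ptr, 2 * page, PROT_NONE) == 0 ? 0 : errno;
        return 0;
    }

With sealing in effect, ``ret1`` is expected to be ``ENOMEM`` and ``ret2`` to
be ``EPERM``. The sealed page stays mapped until the process exits, which is
the intended effect of the seal.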

**Note**:

- mseal() only works on 64-bit CPUs, not on 32-bit CPUs.

- Users can call mseal() multiple times; mseal() on already sealed memory
  is a no-op (not an error).

- munseal() is not supported.

Use cases:
==========
- glibc:
  The dynamic linker, while loading ELF executables, can apply sealing to
  non-writable memory segments.

- Chrome browser: protect some security-sensitive data structures.

Notes on which memory to seal:
==============================

It is important to note that sealing changes the lifetime of a mapping,
i.e. the sealed mapping won’t be unmapped until the process terminates or the
exec system call is invoked. Applications can apply sealing to any virtual
memory region from userspace, but it is crucial to thoroughly analyze the
mapping's lifetime before applying the sealing.

For example:

- aio/shm

  aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
  shm.c. The lifetimes of those mappings are not tied to the lifetime of the
  process. If such memory is sealed from userspace, then munmap() will fail,
  causing leaks in VMA address space during the lifetime of the process.

- Brk (heap)

  Currently, userspace applications can seal parts of the heap by calling
  malloc() and mseal().
  Assume the following calls from user space:

  - ptr = malloc(size);
  - mprotect(ptr, size, RO);
  - mseal(ptr, size, 0);
  - free(ptr);

  Technically, before mseal() was added, the user could change the protection
  of the heap by calling mprotect(RO). As long as the user changed the
  protection back to RW before free(), the memory range could be reused.

  With mseal() in the picture, however, the heap becomes partially sealed:
  the user can still free the memory, but it remains RO. If the address
  is re-used by the heap manager for another malloc, the process might crash
  soon after. Therefore, it is important not to apply sealing to any memory
  that might get recycled.

  Furthermore, even if the application never calls free() for the ptr,
  the heap manager may invoke the brk system call to shrink the size of the
  heap. In the kernel, the brk-shrink will call munmap(). Consequently,
  depending on the location of the ptr, the outcome of brk-shrink is
  nondeterministic.
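
One way to avoid the heap problems described above is to keep sealed data
out of malloc()-managed memory altogether. The sketch below copies the data
into a dedicated anonymous mapping that is never handed back to an
allocator. ``seal_copy()`` is a hypothetical helper name and the syscall
number 462 is an assumption; this is an illustration of the idea, not a
recommended API.

.. code-block:: c

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_mseal
    #define SYS_mseal 462   /* assumption: verify against your kernel headers */
    #endif

    /*
     * Copy len bytes from src into a private mapping, make it read-only and
     * try to seal it. *sealed is set to 1 if mseal() succeeded, else 0.
     * The mapping is intentionally never freed: it lives until exit or exec.
     */
    static const void *seal_copy(const void *src, size_t len, int *sealed)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t maplen = (len + page - 1) / page * page;
        void *p;

        *sealed = 0;
        if (len == 0)
            return NULL;

        p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;

        memcpy(p, src, len);

        if (mprotect(p, maplen, PROT_READ) != 0) {
            munmap(p, maplen);
            return NULL;
        }

        /* Sealing comes last: once sealed, the protection cannot change. */
        if (syscall(SYS_mseal, p, maplen, 0UL) == 0)
            *sealed = 1;

        return p;
    }

Because the mapping is separate from the brk heap and from the allocator's
own mmap() regions, neither free() nor a brk-shrink will ever try to reuse
or unmap it.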

Additional notes:
=================
As Jann Horn pointed out in [3], there are still a few ways to write
to RO memory, which is, in a way, by design. Those cases are not covered
by mseal(). If applications want to block such cases, sandbox tools (such as
seccomp, LSM, etc.) might be considered.

Those cases are:

- Write to read-only memory through /proc/self/mem interface.
- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
- userfaultfd.
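
To make the first case concrete, the sketch below attempts to overwrite one
byte of a read-only (and, where available, sealed) page through
/proc/self/mem. ``poke_readonly_page()`` is a hypothetical name and the
syscall number 462 is an assumption. Whether the write is accepted depends
on the kernel configuration and on any sandbox policy in place, so the
function only reports what happened.

.. code-block:: c

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_mseal
    #define SYS_mseal 462   /* assumption: verify against your kernel headers */
    #endif

    /*
     * Returns 1 if the byte changed, 0 if the write was refused,
     * -1 if the test could not be set up.
     */
    static int poke_readonly_page(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int fd, changed;

        if (p == MAP_FAILED)
            return -1;
        p[0] = 'A';

        if (mprotect(p, page, PROT_READ) != 0) {
            munmap(p, page);
            return -1;
        }

        /* Best effort: the result is ignored on kernels without mseal(). */
        (void)syscall(SYS_mseal, p, page, 0UL);

        fd = open("/proc/self/mem", O_RDWR);
        if (fd < 0)
            return -1;

        /* The file offset is the virtual address to write to. */
        changed = pwrite(fd, "B", 1, (off_t)(uintptr_t)p) == 1 &&
                  *(volatile char *)p == 'B';
        close(fd);
        return changed;
    }

A return value of 1 means the read-only page was modified despite its
protection, which is the behavior a seccomp or LSM policy would have to
block.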

The idea that inspired this patch comes from Stephen Röttger’s work in V8
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.

Reference:
==========
[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274

[2] https://man.openbsd.org/mimmutable.2

[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com

[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc