.. SPDX-License-Identifier: GPL-2.0
.. Copyright (C) 2020, Google LLC.

Kernel Electric-Fence (KFENCE)
==============================

Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
invalid-free errors.

KFENCE is designed to be enabled in production kernels, and has near zero
performance overhead. Compared to KASAN, KFENCE trades precision for
performance. The main motivation behind KFENCE's design is that with enough
total uptime KFENCE will detect bugs in code paths not typically exercised by
non-production test workloads. One way to quickly achieve a large enough total
uptime is to deploy the tool across a large fleet of machines.

Usage
-----

To enable KFENCE, configure the kernel with::

    CONFIG_KFENCE=y

To build a kernel with KFENCE support, but disabled by default (to enable, set
``kfence.sample_interval`` to a non-zero value), configure the kernel with::

    CONFIG_KFENCE=y
    CONFIG_KFENCE_SAMPLE_INTERVAL=0

KFENCE provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kfence`` for more info).

Tuning performance
~~~~~~~~~~~~~~~~~~

The most important parameter is KFENCE's sample interval, which can be set via
the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
sample interval determines the frequency with which heap allocations will be
guarded by KFENCE. The default is configurable via the Kconfig option
``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
disables KFENCE.

The sample interval controls a timer that sets up KFENCE allocations. By
default, to keep the real sample interval predictable, the normal timer also
causes CPU wake-ups when the system is completely idle. This may be undesirable
on power-constrained systems.
The boot parameter ``kfence.deferrable=1``
instead switches to a "deferrable" timer which does not force CPU wake-ups on
idle systems, at the risk of unpredictable sample intervals. The default is
configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``.

.. warning::
   The KUnit test suite is very likely to fail when using a deferrable timer
   since it currently causes very unpredictable sample intervals.

By default KFENCE will only sample 1 heap allocation within each sample
interval. *Burst mode* allows sampling successive heap allocations: the
kernel boot parameter ``kfence.burst`` can be set to a non-zero value which
denotes the *additional* successive allocations within a sample interval;
setting ``kfence.burst=N`` means that ``1 + N`` successive allocations are
attempted through KFENCE for each sample interval.

The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
255), the number of available guarded objects can be controlled. Each object
requires 2 pages, one for the object itself and the other one used as a guard
page; object pages are interleaved with guard pages, and every object page is
therefore surrounded by two guard pages.

The total memory dedicated to the KFENCE memory pool can be computed as::

    ( #objects + 1 ) * 2 * PAGE_SIZE

Using the default config (255 objects), and assuming a page size of 4 KiB,
this results in (255 + 1) * 2 * 4 KiB = 2 MiB being dedicated to the KFENCE
memory pool.

Note: On architectures that support huge pages, KFENCE will ensure that the
pool is using pages of size ``PAGE_SIZE``. This will result in additional page
tables being allocated.

Error reports
~~~~~~~~~~~~~

The boot parameter ``kfence.fault`` can be used to control the behavior when a
KFENCE error is detected:

- ``kfence.fault=report``: Print the error report and continue (default).
- ``kfence.fault=oops``: Print the error report and oops.
- ``kfence.fault=panic``: Print the error report and panic.

A typical out-of-bounds access looks like this::

    ==================================================================
    BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234

    Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
     test_out_of_bounds_read+0xa6/0x234
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32

    allocated by task 484 on cpu 0 at 32.919330s:
     test_alloc+0xfe/0x738
     test_out_of_bounds_read+0x9b/0x234
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

The header of the report provides a short summary of the function involved in
the access. It is followed by more detailed information about the access and
its origin. Note that real kernel addresses are only shown when using the
kernel command line option ``no_hash_pointers``.
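
When reports are collected from many machines, it can help to pull the key
fields out of a report mechanically. The following is a minimal sketch, not
part of KFENCE itself: it assumes the out-of-bounds report layout shown above,
and the function name ``summarize`` and the regular expressions are
illustrative choices only. The address arithmetic is only meaningful if the
kernel was booted with ``no_hash_pointers``.

```python
import re

# Lines copied from the out-of-bounds report shown above (stack traces trimmed).
REPORT = """\
BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234

Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
 test_out_of_bounds_read+0xa6/0x234

kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32
"""

# Hypothetical patterns, written against the example report only.
HEADER_RE = re.compile(r"^BUG: KFENCE: (?P<kind>.+?) in (?P<func>[\w.]+)\+", re.M)
ACCESS_RE = re.compile(r"^Out-of-bounds (?:read|write) at (?P<addr>0x[0-9a-f]+) ", re.M)
OBJECT_RE = re.compile(
    r"^kfence-#(?P<index>\d+): (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+), "
    r"size=(?P<size>\d+), cache=(?P<cache>\S+)$",
    re.M,
)


def summarize(report):
    """Extract the error kind, function, and access position from a report."""
    header = HEADER_RE.search(report)
    access = ACCESS_RE.search(report)
    obj = OBJECT_RE.search(report)

    addr = int(access["addr"], 16)
    start = int(obj["start"], 16)
    end = int(obj["end"], 16)  # address of the last byte of the object

    # Recompute where the access landed relative to the object bounds.
    if addr < start:
        position = f"{start - addr}B left of object"
    elif addr > end:
        position = f"{addr - end}B right of object"
    else:
        position = "inside object"

    return {
        "kind": header["kind"],
        "function": header["func"],
        "object": int(obj["index"]),
        "size": int(obj["size"]),
        "span": end - start + 1,
        "cache": obj["cache"],
        "position": position,
    }


if __name__ == "__main__":
    print(summarize(REPORT))
```

For the example report, the recomputed position agrees with the ``1B left of
kfence-#72`` annotation printed by KFENCE, and the address range spans exactly
``size=32`` bytes.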

Use-after-free accesses are reported as::

    ==================================================================
    BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143

    Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79):
     test_use_after_free_read+0xb3/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32

    allocated by task 488 on cpu 2 at 33.871326s:
     test_alloc+0xfe/0x738
     test_use_after_free_read+0x76/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    freed by task 488 on cpu 2 at 33.871358s:
     test_use_after_free_read+0xa8/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

KFENCE also reports on invalid frees, such as double-frees::

    ==================================================================
    BUG: KFENCE: invalid free in test_double_free+0xdc/0x171

    Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81):
     test_double_free+0xdc/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32

    allocated by task 490 on cpu 1 at 34.175321s:
     test_alloc+0xfe/0x738
     test_double_free+0x76/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    freed by task 490 on cpu 1 at 34.175348s:
     test_double_free+0xa8/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

KFENCE also uses pattern-based redzones on the other side of an object's guard
page, to detect out-of-bounds writes on the unprotected side of the object.
These are reported on frees::

    ==================================================================
    BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184

    Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . ] (in kfence-#156):
     test_kmalloc_aligned_oob_write+0xef/0x184
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96

    allocated by task 502 on cpu 7 at 42.159302s:
     test_alloc+0xfe/0x738
     test_kmalloc_aligned_oob_write+0x57/0x184
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

For such errors, the address where the corruption occurred as well as the
invalidly written bytes (offset from the address) are shown; in this
representation, '.' denotes an untouched byte. In the example above ``0xac`` is
the value written to the invalid address at offset 0, and the remaining '.'
denote that no following bytes have been touched.
Note that real values are
only shown if the kernel was booted with ``no_hash_pointers``; to avoid
information disclosure otherwise, '!' is used instead to denote invalidly
written bytes.

And finally, KFENCE may also report on invalid accesses to any protected page
where it was not possible to determine an associated object, e.g. if adjacent
object pages had not yet been allocated::

    ==================================================================
    BUG: KFENCE: invalid read in test_invalid_access+0x26/0xe0

    Invalid read at 0xffffffffb670b00a:
     test_invalid_access+0x26/0xe0
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

DebugFS interface
~~~~~~~~~~~~~~~~~

Some debugging information is exposed via debugfs:

* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.

* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
  allocated via KFENCE, including those already freed but protected.

Implementation Details
----------------------

Guarded allocations are set up based on the sample interval. After expiration
of the sample interval, the next allocation through the main allocator (SLAB or
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
the next allocation is set up after the expiration of the interval.

When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated"
through the main allocator's fast-path by relying on static branches via the
static keys infrastructure.
The static branch is toggled to redirect the
allocation to KFENCE. Depending on sample interval, target workloads, and
system architecture, this may perform better than the simple dynamic branch.
Careful benchmarking is recommended.

KFENCE objects each reside on a dedicated page, at either the left or right
page boundaries selected at random. The pages to the left and right of the
object page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access. Such page faults are then
intercepted by KFENCE, which handles the fault gracefully by reporting an
out-of-bounds access, and marking the page as accessible so that the faulting
code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead).

To detect out-of-bounds writes to memory within the object's page itself,
KFENCE also uses pattern-based redzones. For each object page, a redzone is set
up for all non-object memory. For typical alignments, the redzone is only
required on the unguarded side of an object. Because KFENCE must honor the
cache's requested alignment, special alignments may result in unprotected gaps
on either side of an object, all of which are redzoned.

The following figure illustrates the page layout::

    ---+-----------+-----------+-----------+-----------+-----------+---
       | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
       | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
       | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
       | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
       | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
       | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
    ---+-----------+-----------+-----------+-----------+-----------+---

Upon deallocation of a KFENCE object, the object's page is again protected and
the object is marked as freed.
Any further access to the object causes a fault
and KFENCE reports a use-after-free access. Freed objects are inserted at the
tail of KFENCE's freelist, so that the least recently freed objects are reused
first, and the chances of detecting use-after-frees of recently freed objects
are increased.

If pool utilization reaches 75% (default) or above, to reduce the risk of the
pool eventually being fully occupied by allocated objects yet ensure diverse
coverage of allocations, KFENCE limits currently covered allocations of the
same source from further filling up the pool. The "source" of an allocation is
based on its partial allocation stack trace. A side-effect is that this also
limits frequent long-lived allocations (e.g. pagecache) of the same source
filling up the pool permanently, which is the most common risk for the pool
becoming full and the sampled allocation rate dropping to zero. The threshold
at which to start limiting currently covered allocations can be configured via
the boot parameter ``kfence.skip_covered_thresh`` (pool usage%).

Interface
---------

The following describes the functions which are used by allocators as well as
page handling code to set up and deal with KFENCE allocations.

.. kernel-doc:: include/linux/kfence.h
   :functions: is_kfence_address
               kfence_shutdown_cache
               kfence_alloc kfence_free __kfence_free
               kfence_ksize kfence_object_start
               kfence_handle_page_fault

Related Tools
-------------

In userspace, a similar approach is taken by `GWP-ASan
<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
directly influenced by GWP-ASan, and can be seen as its kernel sibling.
Another
similar but non-sampling approach, which also inspired the name "KFENCE", can
be found in the userspace `Electric Fence Malloc Debugger
<https://linux.die.net/man/3/efence>`_.

In the kernel, several tools exist to debug memory access errors, and in
particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
is more precise, relying on compiler instrumentation, this comes at a
performance cost.

It is worth highlighting that KASAN and KFENCE are complementary, with
different target environments. For instance, KASAN is the better debugging aid
where test cases or reproducers exist: due to the lower chance of detecting
the error, debugging with KFENCE would require more effort. Deployments at
scale that cannot afford to enable KASAN, however, would benefit from using
KFENCE to discover bugs due to code paths not exercised by test cases or
fuzzers.