.. SPDX-License-Identifier: GPL-2.0

.. _kfuncs-header-label:

=============================
BPF Kernel Functions (kfuncs)
=============================

1. Introduction
===============

BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel.

2. Defining a kfunc
===================

There are two ways to expose a kernel function to BPF programs: either make an
existing function in the kernel visible, or add a new wrapper for BPF. In both
cases, care must be taken that BPF programs can only call such functions in a
valid context. To enforce this, the visibility of a kfunc can be per program
type.

If you are not creating a BPF wrapper for an existing kernel function, skip
ahead to :ref:`BPF_kfunc_nodef`.

2.1 Creating a wrapper kfunc
----------------------------

When defining a wrapper kfunc, the wrapper function should have extern linkage.
This prevents the compiler from optimizing away dead code, as this wrapper kfunc
is not invoked anywhere in the kernel itself. It is not necessary to provide a
prototype in a header for the wrapper kfunc.

An example is given below::

    /* Disables missing prototype warnings */
    __diag_push();
    __diag_ignore_all("-Wmissing-prototypes",
                      "Global kfuncs as their definitions will be in BTF");

    struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
    {
        return find_get_task_by_vpid(nr);
    }

    __diag_pop();

A wrapper kfunc is often needed when we need to annotate parameters of the
kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.
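From the BPF program's side, a wrapper kfunc is called like any other kfunc:
the program declares it with the ``__ksym`` attribute and BTF resolves the call
at load time. Below is a minimal, illustrative sketch, assuming the
``bpf_find_get_task_by_vpid()`` wrapper above has been registered for the
tracing program type; the release step is only valid if the kfunc was
registered with the appropriate flags:

.. code-block:: c

    /* Hypothetical BPF-side declaration; BTF supplies the real prototype. */
    struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) __ksym;

    SEC("tp_btf/task_newtask")
    int BPF_PROG(wrapper_kfunc_example, struct task_struct *task, u64 clone_flags)
    {
        struct task_struct *t;

        t = bpf_find_get_task_by_vpid(1);
        if (!t)
            return 0;

        /* If the wrapper were registered with KF_ACQUIRE, the reference
         * taken by find_get_task_by_vpid() would have to be released on
         * every path, e.g. with bpf_task_release().
         */
        bpf_task_release(t);
        return 0;
    }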
2.2 Annotating kfunc parameters
-------------------------------

Similar to BPF helpers, there is sometimes a need for additional context
required by the verifier to make the usage of kernel functions safer and more
useful. Hence, we can annotate a parameter by suffixing the name of the
argument of the kfunc with a __tag, where tag may be one of the supported
annotations.

2.2.1 __sz Annotation
---------------------

This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::

    void bpf_memzero(void *mem, int mem__sz)
    {
    ...
    }

Here, the verifier will treat the first argument as a PTR_TO_MEM, and the
second argument as its size. By default, without the __sz annotation, the size
of the type of the pointer is used. Without the __sz annotation, a kfunc cannot
accept a void pointer.

2.2.2 __k Annotation
--------------------

This annotation is only understood for scalar arguments, where it indicates that
the verifier must check the scalar argument to be a known constant, which does
not indicate a size parameter, and the value of the constant is relevant to the
safety of the program.

An example is given below::

    void *bpf_obj_new(u32 local_type_id__k, ...)
    {
    ...
    }

Here, bpf_obj_new uses the local_type_id argument to find out the size of that
type ID in the program's BTF and return a sized pointer to it. Each type ID
will have a distinct size, hence it is crucial to treat each such call as
distinct when values don't match during verifier state pruning checks.

Hence, whenever a constant scalar argument is accepted by a kfunc which is not
a size parameter, and the value of the constant matters for program safety, the
__k suffix should be used.

.. _BPF_kfunc_nodef:

2.3 Using an existing kernel function
-------------------------------------

When an existing function in the kernel is fit for consumption by BPF programs,
it can be directly registered with the BPF subsystem. However, care must still
be taken to review the context in which it will be invoked by the BPF program
and whether it is safe to do so.

2.4 Annotating kfuncs
---------------------

In addition to kfuncs' arguments, the verifier may need more information about
the type of kfunc(s) being registered with the BPF subsystem. To do so, we
define flags on a set of kfuncs as follows::

    BTF_SET8_START(bpf_task_set)
    BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
    BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
    BTF_SET8_END(bpf_task_set)

This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.

2.4.1 KF_ACQUIRE flag
---------------------

The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
refcounted object. The verifier will then ensure that the pointer to the object
is eventually released using a release kfunc, or transferred to a map using a
referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the
loading of the BPF program until no lingering references remain in all possible
explored states of the program.

2.4.2 KF_RET_NULL flag
----------------------

The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
may be NULL. Hence, it forces the user to do a NULL check on the pointer
returned from the kfunc before making use of it (dereferencing or passing to
another helper). This flag is often used in pairing with the KF_ACQUIRE flag,
but both are orthogonal to each other.
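From the BPF program's point of view, these two flags translate into
verifier-enforced obligations. The sketch below assumes the illustrative
``bpf_get_task_pid()``/``bpf_put_pid()`` kfuncs from the example set above
were registered with the flags shown there; the signatures are hypothetical:

.. code-block:: c

    /* Hypothetical declarations matching the example kfunc set. */
    struct pid *bpf_get_task_pid(s32 pid) __ksym;   /* KF_ACQUIRE | KF_RET_NULL */
    void bpf_put_pid(struct pid *pid) __ksym;       /* KF_RELEASE */

    SEC("tp_btf/task_newtask")
    int BPF_PROG(flags_example, struct task_struct *task, u64 clone_flags)
    {
        struct pid *pid;

        pid = bpf_get_task_pid(task->pid);
        if (!pid)
            /* KF_RET_NULL: using pid before this check is rejected. */
            return 0;

        /* KF_ACQUIRE: the reference must be released (or moved into a
         * map with bpf_kptr_xchg()) on every program path.
         */
        bpf_put_pid(pid);
        return 0;
    }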
2.4.3 KF_RELEASE flag
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
passed in to it. There can be only one referenced pointer that can be passed
in. All copies of the pointer being released are invalidated as a result of
invoking a kfunc with this flag.

2.4.4 KF_KPTR_GET flag
----------------------

The KF_KPTR_GET flag is used to indicate that the kfunc takes the first
argument as a pointer to a kptr, safely increments the refcount of the object
it points to, and returns a reference to the user. The rest of the arguments
may be normal arguments of a kfunc. The KF_KPTR_GET flag should be used in
conjunction with the KF_ACQUIRE and KF_RET_NULL flags.

2.4.5 KF_TRUSTED_ARGS flag
--------------------------

The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer, with one
exception described below).

There are two types of pointers to kernel objects which are considered "valid":

1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.

Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.

The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.

As mentioned above, a nested pointer obtained from walking a trusted pointer is
no longer trusted, with one exception.
If a struct type has a field that is
guaranteed to be valid as long as its parent pointer is trusted, the
``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
follows:

.. code-block:: c

    BTF_TYPE_SAFE_NESTED(struct task_struct) {
        const cpumask_t *cpus_ptr;
    };

In other words, you must:

1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.

2. Specify the type and name of the trusted nested field. This field must match
   the field in the original type definition exactly.

2.4.6 KF_SLEEPABLE flag
-----------------------

The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
be called by sleepable BPF programs (BPF_F_SLEEPABLE).

2.4.7 KF_DESTRUCTIVE flag
-------------------------

The KF_DESTRUCTIVE flag is used to indicate functions whose invocation is
destructive to the system. For example, such a call can result in the system
rebooting or panicking. Due to this, additional restrictions apply to these
calls. At the moment they only require the CAP_SYS_BOOT capability, but more
can be added later.

2.4.8 KF_RCU flag
-----------------

The KF_RCU flag is used for kfuncs which take an RCU pointer as an argument.
When used together with KF_ACQUIRE, it indicates the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have a reference count of 0, and the kfunc must take this
into consideration.

2.5 Registering the kfuncs
--------------------------

Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type.
An example is shown below::

    BTF_SET8_START(bpf_task_set)
    BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
    BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
    BTF_SET8_END(bpf_task_set)

    static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
        .owner = THIS_MODULE,
        .set = &bpf_task_set,
    };

    static int init_subsystem(void)
    {
        return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
    }
    late_initcall(init_subsystem);

2.6 Specifying no-cast aliases with ___init
-------------------------------------------

The verifier will always enforce that the BTF type of a pointer passed to a
kfunc by a BPF program matches the type of pointer specified in the kfunc
definition. The verifier does, however, allow types that are equivalent
according to the C standard to be passed to the same kfunc arg, even if their
BTF_IDs differ.

For example, for the following type definition:

.. code-block:: c

    struct bpf_cpumask {
        cpumask_t cpumask;
        refcount_t usage;
    };

the verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask *``). For
instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
to bpf_cpumask_test_cpu().

In some cases, this type-aliasing behavior is not desired. ``struct
nf_conn___init`` is one such example:

.. code-block:: c

    struct nf_conn___init {
        struct nf_conn ct;
    };

The C standard would consider these types to be equivalent, but it would not
always be safe to pass either type to a trusted kfunc. ``struct
nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
nf_conn *`` (e.g.
``bpf_ct_change_timeout()``).

In order to accommodate such requirements, the verifier will enforce strict
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
being suffixed with ``___init``.

3. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

3.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_acquire bpf_task_release

These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:

.. code-block:: c

    /**
     * A trivial example tracepoint program that shows how to
     * acquire and release a struct task_struct * pointer.
     */
    SEC("tp_btf/task_newtask")
    int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
    {
        struct task_struct *acquired;

        acquired = bpf_task_acquire(task);

        /*
         * In a typical program you'd do something like store
         * the task in a map, and the map will automatically
         * release it later. Here, we release it manually.
         */
        bpf_task_release(acquired);
        return 0;
    }

----

A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_from_pid

Here is an example of it being used:

.. code-block:: c

    SEC("tp_btf/task_newtask")
    int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
    {
        struct task_struct *lookup;

        lookup = bpf_task_from_pid(task->pid);
        if (!lookup)
            /* A task should always be found, as %task is a tracepoint arg. */
            return -ENOENT;

        if (lookup->pid != task->pid) {
            /* bpf_task_from_pid() looks up the task via its
             * globally-unique pid from the init_pid_ns. Thus,
             * the pid of the lookup task should always be the
             * same as the input task.
             */
            bpf_task_release(lookup);
            return -EINVAL;
        }

        /* bpf_task_from_pid() returns an acquired reference,
         * so it must be dropped before returning from the
         * tracepoint handler.
         */
        bpf_task_release(lookup);
        return 0;
    }

3.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_acquire bpf_cgroup_release

These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.

----

You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_kptr_get

Here's an example of how it can be used:

.. code-block:: c

    /* struct containing the struct cgroup kptr which is actually stored in the map. */
    struct __cgroups_kfunc_map_value {
        struct cgroup __kptr_ref *cgroup;
    };

    /* The map containing struct __cgroups_kfunc_map_value entries. */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __type(key, int);
        __type(value, struct __cgroups_kfunc_map_value);
        __uint(max_entries, 1);
    } __cgroups_kfunc_map SEC(".maps");

    /* ... */

    /**
     * A simple example tracepoint program showing how a
     * struct cgroup kptr that is stored in a map can
     * be acquired using the bpf_cgroup_kptr_get() kfunc.
     */
    SEC("tp_btf/cgroup_mkdir")
    int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
    {
        struct cgroup *kptr;
        struct __cgroups_kfunc_map_value *v;
        s32 id = cgrp->self.id;

        /* Assume a cgroup kptr was previously stored in the map. */
        v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
        if (!v)
            return -ENOENT;

        /* Acquire a reference to the cgroup kptr that's already stored in the map. */
        kptr = bpf_cgroup_kptr_get(&v->cgroup);
        if (!kptr)
            /* If no cgroup was present in the map, it's because
             * we're racing with another CPU that removed it with
             * bpf_kptr_xchg() between the bpf_map_lookup_elem()
             * above, and our call to bpf_cgroup_kptr_get().
             * bpf_cgroup_kptr_get() internally safely handles this
             * race, and will return NULL if the cgroup is no longer
             * present in the map by the time we invoke the kfunc.
             */
            return -EBUSY;

        /* Free the reference we just took above. Note that the
         * original struct cgroup kptr is still in the map. It will
         * be freed either at a later time if another context deletes
         * it from the map, or automatically by the BPF subsystem if
         * it's still present when the map is destroyed.
         */
        bpf_cgroup_release(kptr);

        return 0;
    }

----

Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_ancestor

Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier.
bpf_cgroup_ancestor() can be used as follows:

.. code-block:: c

    /**
     * Simple tracepoint example that illustrates how a cgroup's
     * ancestor can be accessed using bpf_cgroup_ancestor().
     */
    SEC("tp_btf/cgroup_mkdir")
    int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
    {
        struct cgroup *parent;

        /* The parent cgroup resides at the level before the current cgroup's level. */
        parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
        if (!parent)
            return -ENOENT;

        bpf_printk("Parent id is %d", parent->self.id);

        /* Release the parent cgroup reference that was acquired above. */
        bpf_cgroup_release(parent);
        return 0;
    }

3.3 struct cpumask * kfuncs
---------------------------

BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
for more details.
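As a brief taste of that interface, a program might allocate, mutate, query,
and destroy a cpumask as sketched below. The ``bpf_cpumask_*`` declarations
here are hand-written assumptions; see the cpumasks document for the
authoritative signatures and flags:

.. code-block:: c

    /* Assumed BPF-side declarations of the cpumask kfuncs. */
    struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
    void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
    void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
    bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;

    SEC("tp_btf/task_newtask")
    int BPF_PROG(cpumask_example, struct task_struct *task, u64 clone_flags)
    {
        struct bpf_cpumask *mask;

        /* Allocation is KF_ACQUIRE | KF_RET_NULL: check and release. */
        mask = bpf_cpumask_create();
        if (!mask)
            return -ENOMEM;

        bpf_cpumask_set_cpu(0, mask);
        if (bpf_cpumask_test_cpu(0, (const struct cpumask *)mask))
            bpf_printk("cpu 0 is set");

        bpf_cpumask_release(mask);
        return 0;
    }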