#
e56f416c |
| 02-Aug-2025 |
Alexei Starovoitov <ast@kernel.org> |
Merge branch 'task-local-data'
Amery Hung says:
==================== Task local data
v6 -> v7 - Fix typos and improve the clarity of the cover letter (Emil) - Some variable naming changes to m
Merge branch 'task-local-data'
Amery Hung says:
==================== Task local data
v6 -> v7 - Fix typos and improve the clarity of the cover letter (Emil) - Some variable naming changes to make it less confusing (Emil) - Add retry in tld_obj_init() as bpf_task_storage_get_recur() with BPF_LOCAL_STORAGE_GET_F_CREATE can fail if the task local storage is also being modified by other threads on the same CPU. This can be especially easy to trigger in CI VM that only has two CPUs.
Adding bpf_preempt_{disable,enable} around bpf_task_storage_get() does not solve the problem as other threads competing for the percpu counter lock in task local storage may not limit to bpf helpers. Some may be calling bpf_task_storage_free() when threads exit.
There is no good max retry under heavy contention. For the 1st selftest in this set, the max retry to guarantee success grows linearly with the thread count. A proposal is to remove the percpu counter by changing locks in bpf_local_storage to rqspinlock.
An alternative is to reduce the probability of failure by allowing bpf syscall and struct_ops programs to use bpf_task_storage_get() instead of the _recur version by modifying bpf_prog_check_recur(). This does not solve the problem in tracing programs though.
v6: https://lore.kernel.org/bpf/20250717164842.1848817-1-ameryhung@gmail.com/
* Overview *
This patchset introduces task local data, an abstract storage type shared between user space and bpf programs for storing data specific to each task, and implements a library to access it. The goal is to provide an abstraction + implementation that is efficient in data sharing and easy to use. The motivating use case is user space hinting with sched_ext.
* Motivating use case *
CPU schedulers can potentially make a better decision with hints from user space process. To support experimenting user space hinting with sched_ext, there needs a mechanism to pass a "per-task hint" from user space to the bpf scheduler "efficiently".
A bpf feature, UPTR [0], supported by task local storage is introduced to serve the needs. It allows pinning a user space page to the kernel through a special field in task local storage map. This patchset further builds an abstraction on top of it to allow user space and bpf programs to easily share per-task data.
* Design *
Task local data is built on top of task local storage map and UPTR to achieve fast per-task data sharing. UPTR is a type of special field supported in task local storage map value. A user page assigned to a UPTR will be pinned to the kernel when the map is updated. Therefore, user space programs can update data that will be seen by bpf programs without syscalls.
Additionally, unlike most bpf maps, task local data does not require a static map value definition. This design is driven by sched_ext, which would like to allow multiple developers to share a storage without the need to explicitly agree on the layout of it. While a centralized layout definition would have worked, the friction of synchronizing it across different repos is not desirable. This simplify code base management and makes experimenting easier.
* Introduction to task local data library *
Task local data library provides simple APIs for user space and bpf through two header files, task_local_data.h and task_local_data.bpf.h, respectively. The diagram below illustrates the basic usage.
First, an entry of data in task local data, we call it a TLD, needs to be defined in the user space through TLD_DEFINE_KEY() with information including the size and the name. The macro will define a global variable key of opaque type tld_key_t associated with the TLD and initialize it. Then, the user space program can locate the TLD by passing the key to tld_get_data() in different thread, where fd is the file descriptor of the underlying task local storage map. The API returns a pointer to the TLD specific to the calling thread and will remain valid until the thread exits. Finally, user space programs can directly read/write the TLD without bpf syscalls.
To use task local storage on the bpf side, struct tld_keys, needs to be defined first. The structure that will be used to create tld_key_map should contain the keys to the TLDs used in a bpf program. In the bpf program, tld_init_object() first needs to be called to initialize a struct tld_object variable on the stack. Then, tld_get_data() can be called to get a pointer to the TLD similar to the user space. The API will use the name to lookup task local data and cache the key in task local storage map, tld_key_map, to speed up subsequent access. The size of the TLD is also needed to prevent out-of-bound access in bpf.
┌─ Application ───────────────────────────────────────────────────────┐ │ TLD_DEFINE_KEY(kx, "X", 4); ┌─ library A ─────────────────────┐│ │ │ void func(...) ││ │ int main(...) │ { ││ │ { │ tld_key_t ky; ││ │ int *x; │ bool *y; ││ │ │ ││ │ x = tld_get_data(fd, kx); │ ky = tld_create_key("Y", 1);││ │ if (x) *x = 123; │ y = tld_get_data(fd, ky); ││ │ ┌────────┤ if (y) *y = true; ││ │ │ └─────────────────────────────────┘│ └───────┬─────────────────│───────────────────────────────────────────┘ V V + ─ Task local data ─ ─ ─ ─ ─ + ┌─ BPF program ──────────────────────┐ | ┌─ tld_data_map ──────────┐ | │ struct tld_object obj; │ | │ BPF Task local storage │ | │ bool *y; │ | │ │ | │ int *x; │ | │ tld_data_u __uptr *data │ | │ │ | │ tld_meta_u __uptr *meta │ | │ if (tld_init_object(task, &obj)) │ | └─────────────────────────┘ | │ return 0; │ | ┌─ tld_key_map ───────────┐ | │ │ | │ BPF Task local storage │ | │ x = tld_get_data(&obj, kx, "X", 4);│ | │ │ |<─┤ if (x) /* do something */ │ | │ tld_key_t kx; │ | │ │ | │ tld_key_t ky; │ | │ y = tld_get_data(&obj, ky, "Y", 1);│ | └─────────────────────────┘ | │ if (y) /* do something */ │ + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ + └────────────────────────────────────┘
* Implementation *
Task local data defines the storage to be a task local storage map with two UPTRs pointing to tld_data_u and tld_meta_u. tld_data_u, individual to each thread, contains TLD data and the starting offset of data in a page. tld_meta_u, shared by threads in a process, consists of the TLD counts, total size of TLDs and tld_metadata for each TLD. tld_metadata contains the name and the size of a TLD.
struct tld_data_u { u64 start; char data[PAGE_SIZE - sizeof(u64)]; };
struct tld_meta_u { u8 cnt; u16 size; struct metadata metadata[TLD_DATA_CNT]; };
Both user space and bpf API follow the same protocol when accessing task local data. A pointer to a TLD is located using a key. The key is effectively the offset of a TLD in tld_data_u->data. To define a new TLD, the user space API TLD_DEFINE_KEY() iterates tld_meta_u->metadata until an empty slot is found and then update it. It also adds up sizes of prior TLDs to derive the offset (i.e., key). Then, to locate a TLD, tld_get_data() can simply return tld_data_u->data + offset.
To locate a TLD in bpf programs, an API with the same name as the user space, tld_get_data() can be called. It takes more effort in the first invocation as it fetches the key by name. Internal helper, __tld_fetch_key() will iterate tld_meta_u->metadata until the name is found. Then, the offset is cached as key in another task local storage map, tld_key_map. When the search fails, the current TLD count is cached instead to skip searching metadata entries that has been searched in future invocation. The detail of task local data operations can be found in patch 1.
* Misc *
The metadata can potentially use run-length encoding for names to reduce memory wastage and support save more TLDs. I have a version that works, but the selftest takes a bit longer to finish. More investigation needed to find the root cause. I will save this for the future when there is a need to store more than 63 TLDs.
[0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/ ---
v5 -> v6 - Address Andrii's comment - Fix verification failure in no_alu32 - Some cleanup v5: https://lore.kernel.org/bpf/20250627233958.2602271-1-ameryhung@gmail.com/
v4 -> v5 - Add an option to free memory on thread exit to prevent memory leak - Add an option to reduce memory waste if the allocator can use just enough memory to fullfill aligned_alloc() (e.g., glibc) - Tweak bpf API - Remove tld_fetch_key() as it does not work in init_tasl - tld_get_data() now tries to fetch key if it is not cached yet - Optimize bpf side tld_get_data() - Faster fast path - Less code - Use stdatomic.h in user space library with seq_cst order - Introduce TLD_DEFINE_KEY() as the default TLD creation API for easier memory management. - TLD_DEFINE_KEY() can consume memory up to a page and no memory is wasted since their size is known before per-thread data allocation. - tld_create_key() can only use up to TLD_DYN_DATA_SIZE. Since tld_create_key can run any time even after per-thread data allocation, it is impossible to predict the total size. A configurable size of memory is allocated on top of the total size of TLD_DEFINE_KEY() to accommodate dynamic key creation. - Add tld prefix to all macros - Replace map_update(NO_EXIST) in __tld_init_data() with cmpxchg() - No more +1,-1 dance on the bpf side - Reduce printf from ASSERT in race test - Try implementing run-length encoding for name and decide to save it for the future v4: https://lore.kernel.org/bpf/20250515211606.2697271-1-ameryhung@gmail.com/
v3 -> v4 - API improvements - Simplify API - Drop string obfuscation - Use opaque type for key - Better documentation - Implementation - Switch to dynamic allocation for per-task data - Now offer as header-only libraries - No TLS map pinning; leave it to users - Drop pthread dependency - Add more invalid tld_create_key() test - Add a race test for tld_create_key() v3: https://lore.kernel.org/bpf/20250425214039.2919818-1-ameryhung@gmail.com/
--- ====================
Link: https://patch.msgid.link/20250730185903.3574598-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
31e838e1 |
| 30-Jul-2025 |
Amery Hung <ameryhung@gmail.com> |
selftests/bpf: Introduce task local data
Task local data defines an abstract storage type for storing task- specific data (TLD). This patch provides user space and bpf implementation as header-only
selftests/bpf: Introduce task local data
Task local data defines an abstract storage type for storing task- specific data (TLD). This patch provides user space and bpf implementation as header-only libraries for accessing task local data.
Task local data is a bpf task local storage map with two UPTRs:
- tld_meta_u, shared by all tasks of a process, consists of the total count and size of TLDs and an array of metadata of TLDs. A TLD metadata contains the size and name. The name is used to identify a specific TLD in bpf programs.
- u_tld_data points to a task-specific memory. It stores TLD data and the starting offset of data in a page.
Task local design decouple user space and bpf programs. Since bpf program does not know the size of TLDs in compile time, u_tld_data is declared as a page to accommodate TLDs up to a page. As a result, while user space will likely allocate memory smaller than a page for actual TLDs, it needs to pin a page to kernel. It will pin the page that contains enough memory if the allocated memory spans across the page boundary.
The library also creates another task local storage map, tld_key_map, to cache keys for bpf programs to speed up the access.
Below are the core task local data API:
User space BPF Define TLD TLD_DEFINE_KEY(), tld_create_key() - Init TLD object - tld_object_init() Get TLD data tld_get_data() tld_get_data()
- TLD_DEFINE_KEY(), tld_create_key()
A TLD is first defined by the user space with TLD_DEFINE_KEY() or tld_create_key(). TLD_DEFINE_KEY() defines a TLD statically and allocates just enough memory during initialization. tld_create_key() allows creating TLDs on the fly, but has a fix memory budget, TLD_DYN_DATA_SIZE.
Internally, they all call __tld_create_key(), which iterates tld_meta_u->metadata to check if a TLD can be added. The total TLD size needs to fit into a page (limit of UPTR), and no two TLDs can have the same name. If a TLD can be added, u_tld_meta->cnt is increased using cmpxchg as there may be other concurrent __tld_create_key(). After a successful cmpxchg, the last available tld_meta_u->metadata now belongs to the calling thread. To prevent other threads from reading incomplete metadata while it is being updated, tld_meta_u->metadata->size is used to signal the completion.
Finally, the offset, derived from adding up prior TLD sizes is then encapsulated as an opaque object key to prevent user misuse. The offset is guaranteed to be 8-byte aligned to prevent load/store tearing and allow atomic operations on it.
- tld_get_data()
User space programs can pass the key to tld_get_data() to get a pointer to the associated TLD. The pointer will remain valid for the lifetime of the thread.
tld_data_u is lazily allocated on the first call to tld_get_data(). Trying to read task local data from bpf will result in -ENODATA during tld_object_init(). The task-specific memory need to be freed manually by calling tld_free() on thread exit to prevent memory leak or use TLD_FREE_DATA_ON_THREAD_EXIT.
- tld_object_init() (BPF)
BPF programs need to call tld_object_init() before calling tld_get_data(). This is to avoid redundant map lookup in tld_get_data() by storing pointers to the map values on stack. The pointers are encapsulated as tld_object.
tld_key_map is also created on the first time tld_object_init() is called to cache TLD keys successfully fetched by tld_get_data().
bpf_task_storage_get(.., F_CREATE) needs to be retried since it may fail when another thread has already taken the percpu counter lock for the task local storage.
- tld_get_data() (BPF)
BPF programs can also get a pointer to a TLD with tld_get_data(). It uses the cached key in tld_key_map to locate the data in tld_data_u->data. If the cached key is not set yet (<= 0), __tld_fetch_key() will be called to iterate tld_meta_u->metadata and find the TLD by name. To prevent redundant string comparison in the future when the search fail, the tld_meta_u->cnt is stored in the non-positive range of the key. Next time, __tld_fetch_key() will be called only if there are new TLDs and the search will start from the newly added tld_meta_u->metadata using the old tld_meta_u-cnt.
Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20250730185903.3574598-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|