| 70390da5 | 04-Jun-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Make scx_bpf_kick_cid() return s32
Switch scx_bpf_kick_cid() from void to s32 so future cap enforcement can surface failures. cid interface is introduced in this cycle and has no external users, so the ABI change is safe. Subsequent patches will add -EPERM returns when the calling sub-sched lacks the required cap on the target cid.
v2: Return scx_cid_to_cpu()'s errno instead of -EINVAL. (Andrea)
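The return-value change can be pictured with a small user-space model. This is only a sketch: the `model_*` names, the translation table, and the choice of `-ERANGE` are assumptions made for illustration and are not taken from the kernel source. The one property it mirrors from the message above is that the cid-form kick forwards whatever error the cid -> cpu translation produced instead of substituting its own.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef int32_t s32;

#define NR_CIDS 4

/* Hypothetical cid -> cpu table; the real mapping is built by the kernel. */
static const s32 cid_to_cpu_tbl[NR_CIDS] = { 0, 2, 1, 3 };

static int nr_kicks;	/* counts kicks that reached the cpu path */

/* Stand-in for the cid -> cpu translation: negative errno on a bad cid.
 * The specific errno (-ERANGE) is an arbitrary choice for this model. */
static s32 model_cid_to_cpu(s32 cid)
{
	if (cid < 0 || cid >= NR_CIDS)
		return -ERANGE;
	return cid_to_cpu_tbl[cid];
}

/* Stand-in for the cpu-form kick, which has nothing to report. */
static void model_kick_cpu(s32 cpu)
{
	assert(cpu >= 0);
	nr_kicks++;
}

/* s32-returning cid-form kick: forwards the translation error as-is. */
static s32 model_kick_cid(s32 cid)
{
	s32 cpu = model_cid_to_cpu(cid);

	if (cpu < 0)
		return cpu;
	model_kick_cpu(cpu);
	return 0;
}
```

A caller that previously ignored the void result can now branch on a negative return, which is the hook the planned -EPERM cap checks would use.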
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| a83f9edf | 04-Jun-2026 |
Tejun Heo <tj@kernel.org> |
tools/sched_ext: Order single-cid cmask helpers as (cid, mask)
The BPF arena single-cid cmask helpers take the cmask first and the cid second. Reorder them to (cid, mask) to match the kernel-side helpers and the test_bit(nr, addr), cpumask_test_cpu(cpu, mask) convention. Range and iteration helpers keep (mask, start).
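The argument-order convention can be illustrated with a toy mask. The `toy_*` names and the single-word layout below are hypothetical and are not the real BPF helpers; the sketch only shows single-cid helpers taking (cid, mask) while a range-style helper keeps (mask, start).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy mask over cids [0, 64); not the real struct scx_cmask. */
struct toy_cmask {
	uint64_t bits;
};

/* Single-cid helpers: index first, container second, like test_bit(nr, addr). */
static void toy_cmask_set(uint32_t cid, struct toy_cmask *m)
{
	assert(cid < 64);
	m->bits |= 1ULL << cid;
}

static bool toy_cmask_test(uint32_t cid, const struct toy_cmask *m)
{
	assert(cid < 64);
	return (m->bits & (1ULL << cid)) != 0;
}

/* Range-style helper keeps the container first: (mask, start). */
static uint32_t toy_cmask_next_set(const struct toy_cmask *m, uint32_t start)
{
	for (uint32_t cid = start; cid < 64; cid++)
		if (toy_cmask_test(cid, m))
			return cid;
	return 64;	/* nothing set at or after start */
}
```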
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| a0b48fd7 | 19-May-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Track bits[] storage size in struct scx_cmask
scx_cmask carries @base and @nr_cids but not the bits[] allocation size, so helpers reshaping the active range have no way to check it fits and later kfuncs taking caller-provided storage can't validate it.
Add @alloc_words (u64 word count) annotated with __counted_by, and split the bit-range API into three helpers:
- SCX_CMASK_DEFINE() / __SCX_CMASK_DEFINE() define an on-stack cmask, the latter taking an explicit capacity for oversized storage. SCX_CMASK_DEFINE_SHARD() is a thin wrapper that always reserves SCX_CID_SHARD_MAX_CPUS bits of storage.
- scx_cmask_init() / __scx_cmask_init() initialize a cmask, with the same tight-vs-explicit split.
- scx_cmask_reframe() reshapes the active range without resizing storage.
The BPF mirror (cmask_init / __cmask_init / cmask_reframe) gets the same shape.
Add scx_cmask_clear() and scx_cmask_fill() to zero and set the active-range bits respectively. scx_cpumask_to_cmask() uses scx_cmask_clear(); scx_cmask_init() would otherwise re-write @alloc_words on every call.
A later patch uses @alloc_words in scx_cmask_ref_shard() to refuse output storage that can't hold the requested shard.
v2: Init per-CPU scx_set_cmask_scratch (was zero-init, emitted empty cmasks). Add nr_cids/alloc_cids check in BPF __cmask_init(). (sashiko AI) Widen SCX_CMASK_NR_WORDS()/CMASK_NR_WORDS() to compute in u64 so that @nr_cids near U32_MAX no longer wraps to a small value and bypasses the bounds check in cmask_reframe(). (Andrea)
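A user-space sketch of the capacity tracking described above follows. Everything named `toy_*` is hypothetical. The real struct uses a flexible bits[] array annotated with __counted_by, which is replaced here by a fixed array to keep the example self-contained, and whether the real reframe helper also clears bits is not stated in the message, so this one leaves them alone. It shows two points: the word count is computed in 64 bits so a huge @nr_cids cannot wrap, and reframing is refused when the new range does not fit the recorded capacity.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Words needed to cover [base, base + nr_cids) on the global 64-cid grid.
 * Computed in 64 bits so nr_cids near UINT32_MAX cannot wrap to a small value. */
static uint64_t toy_nr_words(uint32_t base, uint32_t nr_cids)
{
	return (((uint64_t)base & 63) + nr_cids + 63) / 64;
}

#define TOY_MAX_WORDS 4

struct toy_cmask {
	uint32_t base;
	uint32_t nr_cids;
	uint32_t alloc_words;		/* capacity of bits[], fixed at init */
	uint64_t bits[TOY_MAX_WORDS];
};

/* Set the range, record the capacity, and zero the storage. */
static void toy_cmask_init(struct toy_cmask *m, uint32_t base, uint32_t nr_cids)
{
	assert(toy_nr_words(base, nr_cids) <= TOY_MAX_WORDS);
	m->base = base;
	m->nr_cids = nr_cids;
	m->alloc_words = TOY_MAX_WORDS;
	memset(m->bits, 0, sizeof(m->bits));
}

/* Change the active range without touching alloc_words or the bits. */
static int toy_cmask_reframe(struct toy_cmask *m, uint32_t base, uint32_t nr_cids)
{
	if (toy_nr_words(base, nr_cids) > m->alloc_words)
		return -ERANGE;		/* new range would not fit the storage */
	m->base = base;
	m->nr_cids = nr_cids;
	return 0;
}
```

With 32-bit arithmetic, `toy_nr_words(0, UINT32_MAX)` would wrap to 0 and pass the capacity check; in 64 bits it is far larger than any real allocation and is rejected.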
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| 7e655ed7 | 29-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add bpf_sched_ext_ops_cid struct_ops type
cpumask is awkward from BPF and unusable from arena; cid/cmask work in both. Sub-sched enqueue will need cmask. Without a full cid interface, schedulers end up mixing forms - a subtle-bug factory.
Add sched_ext_ops_cid, which mirrors sched_ext_ops with cid/cmask replacing cpu/cpumask in the topology-carrying callbacks. cpu_acquire/cpu_release are deprecated and absent; a prior patch moved them past @priv so the cid-form can omit them without disturbing shared-field offsets.
The two structs share byte-identical layout up to @priv, so the existing bpf_scx init/check hooks, has_op bitmap, and scx_kf_allow_flags[] are offset-indexed and apply to both. BUILD_BUG_ON in scx_init() pins the shared-field and renamed-callback offsets so any future drift trips at build time.
The kernel<->BPF boundary translates between cpu and cid:
- A static key, enabled on cid-form sched load, gates the translation so cpu-form schedulers pay nothing.
- dispatch, update_idle, cpu_online/offline and dump_cpu translate the cpu arg at the callsite.
- select_cpu also translates the returned cid back to a cpu.
- set_cpumask is wrapped to synthesize a cmask in a per-cpu scratch before calling the cid-form callback.
All scheds in a hierarchy share one form. The static key drives the hot-path branch.
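The shared-layout idea can be sketched in plain C. The two toy structs below are hypothetical and far smaller than the real ops tables. They only illustrate how compile-time offset assertions let one piece of offset-indexed code serve both the cpu-form and the cid-form struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t s32;

/* Toy cpu-form ops: shared callbacks first, then priv, then cpu-only ones. */
struct toy_ops {
	s32 (*select_cpu)(s32 prev_cpu);
	void (*dispatch)(s32 cpu);
	void *priv;
	void (*cpu_release)(s32 cpu);	/* deprecated, placed past priv */
};

/* Toy cid-form ops: same slots up to priv, renamed where the argument changes. */
struct toy_ops_cid {
	s32 (*select_cid)(s32 prev_cid);
	void (*dispatch)(s32 cid);
	void *priv;
};

/* Pin the shared prefix so offset-indexed code can serve both forms. */
static_assert(offsetof(struct toy_ops, select_cpu) ==
	      offsetof(struct toy_ops_cid, select_cid), "renamed callback drifted");
static_assert(offsetof(struct toy_ops, dispatch) ==
	      offsetof(struct toy_ops_cid, dispatch), "shared callback drifted");
static_assert(offsetof(struct toy_ops, priv) ==
	      offsetof(struct toy_ops_cid, priv), "priv drifted");

/* Offset-indexed "is this op set?" check that works on either struct. */
static int toy_has_op(const void *ops, size_t off)
{
	void (*fn)(void);

	assert(ops != NULL);
	memcpy(&fn, (const char *)ops + off, sizeof(fn));
	return fn != NULL;
}

/* Placeholder callback used to populate a slot. */
static void toy_noop(s32 id)
{
	(void)id;
}
```

Because the offsets are pinned, `toy_has_op()` can be handed a `struct toy_ops_cid` together with an offset computed from `struct toy_ops` and still read the right slot.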
v2: Use struct_size() for the set_cmask_scratch percpu alloc. Move cid-shard fields and assertions into the later cid-shard patch.
v3: Drop `static` on scx_set_cmask_scratch; add extern in ext_internal.h.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| 5ba0a424 | 29-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add cid-form kfunc wrappers alongside cpu-form
cpumask is awkward from BPF and unusable from arena; cid/cmask work in both. Sub-sched enqueue will need cmask. Without full cid coverage a scheduler has to mix cid and cpu forms, which is a subtle-bug factory. Close the gap with a cid-native interface.
Pair every cpu-form kfunc that takes a cpu id with a cid-form equivalent (kick, task placement, cpuperf query/set, per-cpu current task, nr-cpu-ids). Add two cid-natives with no cpu-form sibling: scx_bpf_this_cid() (cid of the running cpu, scx equivalent of bpf_get_smp_processor_id) and scx_bpf_nr_online_cids().
scx_bpf_cpu_rq is deprecated; no cid-form counterpart. NUMA node info is reachable via scx_bpf_cid_topo() on the BPF side.
Each cid-form wrapper is a thin cid -> cpu translation that delegates to the cpu path, registered in the same context sets so usage constraints match.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| a58e6b79 | 29-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add cmask, a base-windowed bitmap over cid space
Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes most of its bits for a small window and is awkward in BPF.
scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global 64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two cmasks therefore address bits[] against the same global windows, so cross-cmask word ops reduce to
dest->bits[i] OP= operand->bits[i - delta]
with no bit-shifting, at the cost of up to one extra storage word for head misalignment. This alignment guarantee is the reason binary ops can stay word-level; every mutating helper preserves it.
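The word-level property can be checked with a small model. The `toy_*` names, the fixed four-word storage, and the choice to treat cids outside the operand's window as unset are all assumptions of this sketch, not the kernel helpers. What it mirrors is the grid alignment: because bits[0] of every mask starts on a multiple of 64, an AND across two differently based masks only needs a word index offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_WORDS 4

struct toy_cmask {
	uint32_t base;			/* first cid covered */
	uint32_t nr_bits;		/* number of cids covered */
	uint64_t bits[TOY_WORDS];	/* bits[0] starts at cid (base & ~63) */
};

/* Index of the global 64-cid word that bits[0] maps to. */
static uint32_t toy_first_word(const struct toy_cmask *m)
{
	return m->base / 64;
}

/* Storage words in use, including the possible extra head word. */
static uint32_t toy_nr_words(const struct toy_cmask *m)
{
	return ((m->base & 63) + m->nr_bits + 63) / 64;
}

static void toy_init(struct toy_cmask *m, uint32_t base, uint32_t nr_bits)
{
	m->base = base;
	m->nr_bits = nr_bits;
	memset(m->bits, 0, sizeof(m->bits));
	assert(toy_nr_words(m) <= TOY_WORDS);
}

static void toy_set(struct toy_cmask *m, uint32_t cid)
{
	assert(cid >= m->base && cid - m->base < m->nr_bits);
	m->bits[cid / 64 - toy_first_word(m)] |= 1ULL << (cid & 63);
}

static int toy_test(const struct toy_cmask *m, uint32_t cid)
{
	if (cid < m->base || cid - m->base >= m->nr_bits)
		return 0;
	return (m->bits[cid / 64 - toy_first_word(m)] >> (cid & 63)) & 1;
}

/* dst &= src, one word at a time. Cids outside src's window count as unset
 * in this sketch, so dst words that src does not cover are cleared. */
static void toy_and(struct toy_cmask *dst, const struct toy_cmask *src)
{
	uint32_t dw = toy_first_word(dst), sw = toy_first_word(src);
	uint32_t dn = toy_nr_words(dst), sn = toy_nr_words(src);

	for (uint32_t i = 0; i < dn; i++) {
		uint32_t g = dw + i;	/* global word index of dst->bits[i] */

		if (g >= sw && g - sw < sn)
			dst->bits[i] &= src->bits[g - sw];	/* i - delta */
		else
			dst->bits[i] = 0;
	}
}
```

For a mask covering cids [60, 70) and another covering [64, 128), cid 65 lands on bit 1 of a word in both, so the AND is a plain word operation with the index offset by one.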
Kernel side in ext_cid.[hc]; BPF side in tools/sched_ext/include/scx/cid.bpf.h. BPF side drops the scx_ prefix (redundant in BPF code) and adds the extra helpers that basic idle-cpu selection needs.
No callers yet.
v2: Narrow to helpers that will be used in the planned changes; set/bit/find/zero ops will be added as usage develops.
v3: cmask_copy_from_kernel: validate src->base == 0 via probe-read; bit-level nr_bits check instead of round-up word count. (Sashiko)
v4: Bump CMASK_CAS_TRIES to 1<<23 so abort fires only after seconds of real spinning, not on plausible contention. Switch __builtin_ctzll() to the ctzll() wrapper for clang compat (Changwoo).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| 32a54807 | 29-Apr-2026 |
Tejun Heo <tj@kernel.org> |
tools/sched_ext: Add struct_size() helpers to common.bpf.h
Add flex_array_size(), struct_size() and struct_size_t() to scx/common.bpf.h so BPF schedulers can size flex-array-containing structs the same way kernel code does. These are abbreviated forms of the <linux/overflow.h> macros.
v3: Use offsetof() instead of sizeof() in struct_size() to match kernel semantics (no inflation from trailing struct padding). (Sashiko)
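The padding point in the v3 note can be seen in a short sketch. The `toy_*` macros below are simplified stand-ins written for this example, not the contents of common.bpf.h, and `struct toy_msg` is hypothetical. They show why sizing from offsetof() of the flexible member avoids counting trailing struct padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; no overflow checking is done here. */
#define toy_flex_array_size(p, member, count) \
	((size_t)(count) * sizeof((p)->member[0]))

#define toy_struct_size(p, member, count) \
	(offsetof(__typeof__(*(p)), member) + \
	 toy_flex_array_size(p, member, count))

struct toy_msg {
	uint64_t seq;
	uint8_t len;
	uint8_t data[];		/* flexible array member, starts right after len */
};

/* sizeof(struct toy_msg) is rounded up to the struct's alignment, so it is
 * larger than the offset at which data[] actually begins. */
static size_t toy_msg_padding(void)
{
	return sizeof(struct toy_msg) - offsetof(struct toy_msg, data);
}
```

On a typical 64-bit ABI `data` starts at offset 9 while `sizeof(struct toy_msg)` is 16, so a sizeof()-based size for four payload bytes would overshoot by the seven padding bytes.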
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| df7b5ae0 | 29-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add scx_bpf_cid_override() kfunc
The auto-probed cid mapping reflects the kernel's view of topology (node -> LLC -> core), but a BPF scheduler may want a different layout - to align cid slices with its own partitioning, or to work around how the kernel reports a particular machine.
Add scx_bpf_cid_override(), callable from ops.init() of the root scheduler. It validates the caller-supplied cpu->cid array and replaces the in-place mapping; topo info is invalidated. A compat.bpf.h wrapper silently no-ops on kernels that lack the kfunc.
A new SCX_KF_ALLOW_INIT bit in the kfunc context filter restricts the kfunc to ops.init() at verifier load time.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
|
| e9b55af4 | 29-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add topological CPU IDs (cids)
Raw cpu numbers are clumsy for sharding and cross-sched communication, especially from BPF. The space is sparse, numerical closeness doesn't track topological closeness (x86 hyperthreading often scatters SMT siblings), and a range of cpu ids doesn't describe anything meaningful. Sub-sched support makes this acute: cpu allocation, revocation, and state constantly flow across sub-scheds. Passing whole cpumasks scales poorly (every op scans 4K bits) and cpumasks are awkward in BPF.
cids assign every cpu a dense, topology-ordered id. CPUs sharing a core, LLC, or NUMA node occupy contiguous cid ranges, so a topology unit becomes a (start, length) slice. Communication passes slices; BPF can process a u64 word of cids at a time.
Build the mapping once at root enable by walking online cpus node -> LLC -> core. Possible-but-not-online cpus tail the space with no-topo cids. Expose kfuncs to map cpu <-> cid in either direction and to query each cid's topology metadata.
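The effect of the node -> LLC -> core ordering can be reproduced in user space. The kernel walks its topology structures; this sketch substitutes a qsort() over a hypothetical `struct toy_cpu` table, which gives the same kind of dense, topology-ordered numbering for illustration. Offline cpus and topology metadata are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-cpu topology record for the example. */
struct toy_cpu {
	int cpu, node, llc, core;
};

/* Order by node, then LLC, then core, then cpu number as a tie-break. */
static int toy_topo_cmp(const void *a, const void *b)
{
	const struct toy_cpu *x = a, *y = b;

	if (x->node != y->node)
		return x->node < y->node ? -1 : 1;
	if (x->llc != y->llc)
		return x->llc < y->llc ? -1 : 1;
	if (x->core != y->core)
		return x->core < y->core ? -1 : 1;
	return (x->cpu > y->cpu) - (x->cpu < y->cpu);
}

/* Fill cpu_to_cid[] so topology neighbours get adjacent cids.
 * Sorts cpus[] in place; cpu numbers must be in [0, nr). */
static void toy_build_cids(struct toy_cpu *cpus, int nr, int *cpu_to_cid)
{
	assert(cpus != NULL && cpu_to_cid != NULL);
	qsort(cpus, (size_t)nr, sizeof(*cpus), toy_topo_cmp);
	for (int cid = 0; cid < nr; cid++)
		cpu_to_cid[cpus[cid].cpu] = cid;
}
```

With cpus 0 and 2 as SMT siblings on one core and cpus 1 and 3 on another, the siblings end up on adjacent cids (0, 1) and (2, 3), so each core is a (start, length) slice of length 2.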
v2: Use kzalloc_objs()/kmalloc_objs() for the three allocs in scx_cid_arrays_alloc() (Cheng-Yang Chou).
v3: scx_cid_init() failure path now drops cpus_read_lock(); BUILD_BUG_ON tightened to match BPF cmask helpers' NR_CPUS<=8192. (Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| 41e33128 | 19-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup
BPF schedulers that can't hold task_struct pointers (arena-backed ones in particular) key tasks by pid. During exit, pid is released before the task finishes passing through scheduler callbacks, so a dying task becomes invisible to the BPF side mid-schedule. scx_qmap hits this: an exiting task's dispatch callback can't recover its queue entry, stalling dispatch until SCX_EXIT_ERROR_STALL.
Add a unique non-zero u64 p->scx.tid assigned at fork that survives the full task lifetime including exit. scx_bpf_tid_to_task() looks up the task; unlike bpf_task_from_pid(), it handles exiting tasks.
The lookup costs an rhashtable insert/remove under scx_tasks_lock, so root schedulers opt in via SCX_OPS_TID_TO_TASK. Sub-schedulers that set the flag to declare a dependency are rejected at attach if root didn't opt in.
scx_qmap converted: keys tasks by tid and enables SCX_OPS_ENQ_EXITING. Pre-patch it stalls within seconds under a non-leader-exec workload; with the patch it runs cleanly.
v3: Warn on rhashtable_lookup_insert_fast() failure via new scx_tid_hash_insert() helper (Cheng-Yang Chou).
v2: Guard scx_root deref in scx_bpf_tid_to_task() error path. The kfunc is registered via scx_kfunc_set_any and reachable from tracing and syscall programs when no scheduler is attached (Cheng-Yang Chou).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| e613cc23 | 19-Apr-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Document the ops compat strategy in compat.h/compat.bpf.h
The comments around SCX_OPS_DEFINE() and SCX_OPS_OPEN() were vague about how backward compatibility actually works. Expand them to describe the two mechanisms: load-time BTF fix-up for additive changes, and multi-variant struct_ops for incompatible ones.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
|
| 7e311baf | 13-Apr-2026 |
Kuba Piecuch <jpiecuch@google.com> |
tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
This fixes the following compilation error when using the header from C++ code:
error: assigning to 'struct scx_flux__data_uei_dump *' from incompatible type 'void *'
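The underlying language difference is that C converts void * to any object pointer implicitly while C++ does not. The macro below is a hypothetical stand-in, not the real RESIZE_ARRAY(). It shows the shape of the fix: casting the reallocated pointer to the destination's own type so that the same header text compiles under both languages.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_item {
	int v;
};

/* Hypothetical resize macro: grows (or shrinks) arr to hold n elements.
 * The explicit cast is a no-op in C and required in C++. */
#define TOY_RESIZE(arr, n) do {						\
	void *p_ = realloc((arr), (size_t)(n) * sizeof(*(arr)));	\
	assert(p_ != NULL);	/* sketch only: abort on failure */	\
	(arr) = (__typeof__(arr))p_;					\
} while (0)
```

Usage is the same in either language: `TOY_RESIZE(items, 8);` on a `struct toy_item *items` keeps the existing elements and changes only the capacity.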
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| ea702393 | 26-Mar-2026 |
Tejun Heo <tj@kernel.org> |
tools/sched_ext: Remove redundant SCX_ENQ_IMMED compat definition
compat.bpf.h defined a fallback SCX_ENQ_IMMED macro using __COMPAT_ENUM_OR_ZERO(). After 6bf36c68b0a2 ("tools/sched_ext: Regenerate autogen enum headers") added SCX_ENQ_IMMED to the autogen headers, including both triggers -Wmacro-redefined warnings.
The autogen definition through const volatile __weak already resolves to 0 on older kernels, providing the same backward compatibility. Remove the now-redundant compat fallback.
Fixes: 6bf36c68b0a2 ("tools/sched_ext: Regenerate autogen enum headers")
Link: https://lore.kernel.org/r/20260326100313.338388-1-zhaomzhao@126.com
Reported-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| 6bf36c68 | 25-Mar-2026 |
Cheng-Yang Chou <yphbchou0911@gmail.com> |
tools/sched_ext: Regenerate autogen enum headers
Regenerate enum_defs.autogen.h, enums.autogen.h and enums.autogen.bpf.h using the upstream scripts [1][2] to sync with recent kernel enum additions.
[1] https://github.com/sched-ext/scx/blob/main/scripts/gen_enum_defs.py
[2] https://github.com/sched-ext/scx/blob/main/scripts/gen_enums.py
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| cb251eae | 23-Mar-2026 |
Cheng-Yang Chou <yphbchou0911@gmail.com> |
tools/sched_ext: Add scx_bpf_sub_dispatch() compat wrapper
Add a transparent compatibility wrapper for the scx_bpf_sub_dispatch() kfunc in compat.bpf.h. This allows BPF schedulers using the sub-sched dispatch feature to build and run on older kernels that lack the kfunc.
To avoid requiring code changes in individual schedulers, the transparent wrapper pattern is used instead of a __COMPAT prefix. The kfunc is declared with a ___compat suffix, while the static inline wrapper retains the original scx_bpf_sub_dispatch() name.
When the kfunc is unavailable, the wrapper safely falls back to returning false. This is acceptable because the dispatch path cannot do anything useful without underlying sub-sched support anyway.
Tested scx_qmap on v6.14 successfully.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| 3229ac4a | 13-Mar-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag
SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start running immediately. Otherwise, the task is re-enqueued through ops.enqueue(). This provides tighter control but requires specifying the flag on every insertion.
Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is automatically applied to all local DSQ enqueues including through scx_bpf_dsq_move_to_local().
scx_qmap is updated with -I option to test the feature and -F option for IMMED stress testing which forces every Nth enqueue to a busy local DSQ.
v2:
- Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
- scx_qmap: Remove sched_switch and cpu_release handlers (superseded by kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
| 98d709cb | 13-Mar-2026 |
Tejun Heo <tj@kernel.org> |
sched_ext: Implement SCX_ENQ_IMMED
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class.
rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task arrives. These capture all higher class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered.
SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ.
This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use.
v2:
- Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|