f203683c | 18-Apr-2025 | Honglei Wang <jameshongleiwang@126.com>
sched_ext: change the variable name for slice refill event

SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when tasks are enqueued, which is not accurate. What it actually counts is refills with the default slice, which can occur on either enqueue or pick_task. Let's rename the variable to SCX_EV_REFILL_SLICE_DFL.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

6d65f682 | 14-Apr-2025 | yangsonghua <jluyangsonghua@gmail.com>
sched_ext: Improve cross-compilation support in Makefile

Modify the tools/sched_ext/Makefile to better handle cross-compilation environments by:
1. Fix host tools build directory structure by separating obj/ from output
   (HOST_BUILD_DIR now points to $(OBJ_DIR)/host/obj)
2. Properly propagate CROSS_COMPILE to libbpf sub-make invocation
3. Add missing $(HOST_BPFOBJ) build rule with proper host toolchain flags
   (ARCH=, CROSS_COMPILE=, explicit HOSTCC/HOSTLD)
4. Consistently quote $(HOSTCC) in bpftool build rule
5. Change LDFLAGS assignment to += to allow external extensions
The changes ensure proper cross-compilation behavior while maintaining backward compatibility with native builds. Host tools are now correctly built with the host toolchain while target binaries use the cross-toolchain.
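For instance (a hypothetical invocation), a cross build can then be run as `make -C tools/sched_ext ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-`, while host tools such as bpftool keep using the native toolchain.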
Signed-off-by: yangsonghua <yangsonghua@lixiang.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

683d2d0f | 05-Apr-2025 | Andrea Righi <arighi@nvidia.com>
sched_ext: idle: Introduce scx_bpf_select_cpu_and()

Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of the allowed CPUs.
This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu().
It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic.
With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs.
Example usage
=============
Possible usage in ops.select_cpu():
  s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
          s32 cpu;

          cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
          if (cpu >= 0) {
                  scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                  return cpu;
          }

          return prev_cpu;
  }
Results
=======
Load distribution on a 4-socket system with 4 cores per socket, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:
  $ vng --cpu 16,sockets=4,cores=4,threads=1
  ...
  $ stress-ng -c 16
  ...
  $ htop
  ...
  0[          0.0%]   8[||||||||||||||||||||||||100.0%]
  1[          0.0%]   9[||||||||||||||||||||||||100.0%]
  2[          0.0%]  10[||||||||||||||||||||||||100.0%]
  3[          0.0%]  11[||||||||||||||||||||||||100.0%]
  4[          0.0%]  12[||||||||||||||||||||||||100.0%]
  5[          0.0%]  13[||||||||||||||||||||||||100.0%]
  6[          0.0%]  14[||||||||||||||||||||||||100.0%]
  7[          0.0%]  15[||||||||||||||||||||||||100.0%]
With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

8c6ee862 | 04-Apr-2025 | Tejun Heo <tj@kernel.org>
sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations
sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update scx_show_state.py accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>

a50c365f | 04-Apr-2025 | Tejun Heo <tj@kernel.org>
sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the
sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled. Update scx_show_state.py accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>

038730dc | 04-Mar-2025 | Changwoo Min <changwoo@igalia.com>
sched_ext: Change the event type from u64 to s64

The event count could be negative in the future, so change the event type from u64 to s64.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

b214b04d | 27-Feb-2025 | Andrea Righi <arighi@nvidia.com>
tools/sched_ext: Provide a compatible helper for scx_bpf_events()

Introduce __COMPAT_scx_bpf_events() so that scx_bpf_events() can be used in a compatible way even on kernels that don't provide this kfunc.
This also fixes the following error with scx_qmap when running on a kernel that does not provide scx_bpf_events():
  ; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777
  318: (b7) r2 = 72                      ; R2_w=72 async_cb
  319: <invalid kfunc call>
  kfunc 'scx_bpf_events' is referenced but wasn't resolved
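For reference, a minimal sketch of such a compat wrapper (assuming scx_bpf_events() is declared as a weak ksym; the actual helper in compat.bpf.h may differ in detail):

  static inline void __COMPAT_scx_bpf_events(struct scx_event_stats *events,
                                             size_t events__sz)
  {
          /* Only call the kfunc if the running kernel exports it. */
          if (bpf_ksym_exists(scx_bpf_events))
                  scx_bpf_events(events, events__sz);
          else
                  /* Older kernel: report all counters as zero. */
                  __builtin_memset(events, 0, events__sz);
  }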
Fixes: 9865f31d852a4 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

5e3b6424 | 24-Feb-2025 | Andrea Righi <arighi@nvidia.com>
tools/sched_ext: Provide consistent access to scx flags

Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the user-space part of the schedulers via the compat interface.
This allows schedulers / selftests to set all the ops flags in user-space, rather than having them split between BPF and user-space.
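For example (a sketch; the scheduler and skeleton names are hypothetical), the user-space loader can now set ops flags directly, with the flag values resolved through the compat interface:

  struct scx_foo *skel = SCX_OPS_OPEN(foo_ops, scx_foo);

  /* SCX_OPS_SWITCH_PARTIAL is resolved via compat.h at load time. */
  skel->struct_ops.foo_ops->flags |= SCX_OPS_SWITCH_PARTIAL;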
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

0e9b4c10 | 24-Feb-2025 | Andrea Righi <arighi@nvidia.com>
sched_ext: idle: Introduce scx_bpf_nr_node_ids()

Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system.
BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc.
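For example, an init callback could create one DSQ per node (a sketch with hypothetical names and minimal error handling, assuming the usual scx common headers):

  s32 BPF_STRUCT_OPS_SLEEPABLE(foo_init)
  {
          int node;

          /* One DSQ per NUMA node, using the node ID as the DSQ ID. */
          bpf_for(node, 0, scx_bpf_nr_node_ids()) {
                  s32 ret = scx_bpf_create_dsq(node, node);

                  if (ret)
                          return ret;
          }
          return 0;
  }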
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

01059219 | 18-Feb-2025 | Andrea Righi <arighi@nvidia.com>
sched_ext: idle: Introduce node-aware idle cpu kfunc helpers

Introduce a new kfunc to retrieve the node associated with a CPU:

  int scx_bpf_cpu_node(s32 cpu)

Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumask information:

  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
                                 int node, u64 flags)
  s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
                                int node, u64 flags)
Moreover, trigger an scx error when any of the non-node aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
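Possible usage (a sketch; the wrapper name is hypothetical):

  static s32 pick_idle_cpu_same_node(struct task_struct *p, s32 prev_cpu)
  {
          int node = scx_bpf_cpu_node(prev_cpu);

          /* Restrict the idle CPU search to prev_cpu's NUMA node. */
          return scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node,
                                            SCX_PICK_IDLE_IN_NODE);
  }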
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

48849271 | 14-Feb-2025 | Andrea Righi <arighi@nvidia.com>
sched_ext: idle: Per-node idle cpumasks

Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all CPUs can generate really intense read/write activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA-node cpumasks to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way, concurrent access to the idle cpumask is restricted within each NUMA node.
The split into multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.
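With per-node cpumasks enabled, idle state has to be queried through the node-aware helpers instead, e.g. (a sketch; cpu and node are assumed to be valid IDs):

  const struct cpumask *idle = scx_bpf_get_idle_cpumask_node(node);
  bool is_idle = bpf_cpumask_test_cpu(cpu, idle);

  scx_bpf_put_idle_cpumask(idle);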
= Test =
Hardware:
 - System: DGX B200
 - CPUs: 224 SMT threads (112 physical cores)
 - Processor: INTEL(R) XEON(R) PLATINUM 8570
 - 2 NUMA nodes
Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)
Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:
          avg time | stdev
          ---------+------
  before: 52.431s  | 2.895
  after:  50.342s  | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

0aaaf89d | 14-Feb-2025 | Andrea Righi <arighi@nvidia.com>
sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE

Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks.
This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior.
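A BPF scheduler can opt in when defining its ops, e.g. (a sketch; the ops and callback names are hypothetical):

  SCX_OPS_DEFINE(foo_ops,
                 .select_cpu  = (void *)foo_select_cpu,
                 .enqueue     = (void *)foo_enqueue,
                 .flags       = SCX_OPS_BUILTIN_IDLE_PER_NODE,
                 .name        = "foo");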
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2e2006c9 | 12-Feb-2025 | Chuyi Zhou <zhouchuyi@bytedance.com>
sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.

Currently, BPF only supports the bpf_list_push_{front,back}_impl kfuncs, not bpf_list_push_{front,back}.
This patch fixes this issue. Without it, if we use the bpf_list kfuncs in scx, the BPF verifier complains:
  libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or module BTFs
  libbpf: failed to load object 'scx_foo'
  libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL
With this patch, the bpf_list kfuncs work as expected.
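The fix follows the pattern used by bpf_experimental.h: declare the _impl kfunc and wrap it behind the documented name (a sketch of the approach):

  extern int bpf_list_push_front_impl(struct bpf_list_head *head,
                                      struct bpf_list_node *node,
                                      void *meta, __u64 off) __ksym;

  /* Convenience wrapper matching the documented bpf_list API. */
  #define bpf_list_push_front(head, node) \
          bpf_list_push_front_impl(head, node, NULL, 0)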
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Fixes: 2a52ca7c98960 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>

2e7df12b | 10-Feb-2025 | Changwoo Min <changwoo@igalia.com>
tools/sched_ext: Update enum_defs.autogen.h

Add the location of the generating script to the comment lines of the header file. This helps anyone re-generate the header file if required.
Note that this is a sync from the PR [1] in the scx repo.
[1] https://github.com/sched-ext/scx/pull/1322
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

372033ad | 09-Feb-2025 | Changwoo Min <changwoo@igalia.com>
tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED

This provides compatible testing of SCX_ENQ_CPU_SELECTED. More specifically, it handles two cases:
1. a BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not have SCX_ENQ_CPU_SELECTED. In this case, the test result of 'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That result is semantically incorrect because a kernel from before SCX_ENQ_CPU_SELECTED never skipped select_task_rq_scx(), so the result should be true.
2. a BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using SCX_ENQ_CPU_SELECTED causes compilation errors.
To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(), which checks at runtime whether SCX_ENQ_CPU_SELECTED exists, using BPF CO-RE. This consists of three parts:
1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting whether SCX enums are defined in vmlinux.h or not.
2. Implement __COMPAT_is_enq_cpu_selected(), which provides the test of SCX_ENQ_CPU_SELECTED in a compatible way (see the sketch after this list).
3. Use __COMPAT_is_enq_cpu_selected() in scx_qmap.
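A simplified sketch of the helper (the real implementation in compat.bpf.h handles additional macro-expansion details):

  static inline bool __COMPAT_is_enq_cpu_selected(u64 enq_flags)
  {
  #ifdef HAVE_SCX_ENQ_CPU_SELECTED
          /*
           * vmlinux.h defines the flag; use CO-RE to check whether the
           * running kernel defines it too before testing it.
           */
          if (bpf_core_enum_value_exists(enum scx_enq_flags,
                                         SCX_ENQ_CPU_SELECTED))
                  return enq_flags & SCX_ENQ_CPU_SELECTED;
  #endif
          /* Older kernels never skip select_task_rq_scx(). */
          return true;
  }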
Note that this is a sync of the relevant PR [1] in the scx repo.
[1] https://github.com/sched-ext/scx/pull/1314
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

46a0e161 | 09-Feb-2025 | Tejun Heo <tj@kernel.org>
tools/sched_ext: Event counter dumping updates

- There's no need to dump event counters from both scx_qmap and scx_central. Drop counter dumping from scx_central.
- bpf_printk() implies a trailing new line and the explicit new line leads to double new lines. Drop the explicit new lines.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>

29ef4a2f | 09-Feb-2025 | Tejun Heo <tj@kernel.org>
Merge branch 'for-6.14-fixes' into for-6.15

Pull to receive:
- 2fa0fbeb69ed ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")
as planned for-6.15 changes depend on them (e.g. adding an event counter for implicit migration-disabled task handling).

38d65cd6 | 07-Feb-2025 | Changwoo Min <changwoo@igalia.com>
sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central

Modify the scx_qmap and scx_central schedulers to print the SCX_EV_ENQ_SLICE_DFL event every second.
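The dump is along these lines (a sketch, assuming the kernel provides scx_bpf_events(); scx_read_event() comes from the commit referenced earlier in this log):

  struct scx_event_stats events;

  scx_bpf_events(&events, sizeof(events));
  bpf_printk("SCX_EV_ENQ_SLICE_DFL: %lld",
             scx_read_event(&events, SCX_EV_ENQ_SLICE_DFL));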
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>