e32cde8d | 30-Sep-2024 | Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- When sched_ext is in bypass mode (e.g. while disabling the BPF scheduler), it was using one DSQ to implement global FIFO scheduling as all it has to do is guarantee reasonable forward progress.
On multi-socket machines, this can lead to live-lock conditions under certain workloads. Fixed by splitting the queue used for FIFO scheduling per NUMA node. This required several preparation patches.
- Hotplug tests on powerpc could reliably trigger deadlock while enabling a BPF scheduler.
This was caused by cpu_hotplug_lock nesting inside scx_fork_rwsem and the CPU hotplug path then trying to fork a new thread while holding cpu_hotplug_lock.
Fixed by restructuring locking in enable and disable paths so that the two locks are not coupled. This required several preparation patches which also fixed a couple other issues in the enable path.
- A build fix for !CONFIG_SMP
- Userspace tooling sync and updates
* tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Remove redundant p->nr_cpus_allowed checker
  sched_ext: Decouple locks in scx_ops_enable()
  sched_ext: Decouple locks in scx_ops_disable_workfn()
  sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
  sched_ext: Enable scx_ops_init_task() separately
  sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
  sched_ext: Initialize in bypass mode
  sched_ext: Remove SCX_OPS_PREPPING
  sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
  sched_ext: Use shorter slice while bypassing
  sched_ext: Split the global DSQ per NUMA node
  sched_ext: Relocate find_user_dsq()
  sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
  scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
  tools/sched_ext: Receive misc updates from SCX repo
  sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle
  sched_ext: Build fix for !CONFIG_SMP

Revision tags: v6.12-rc1

95b87369 | 26-Sep-2024 | Zhang Qiao <zhangqiao22@huawei.com>
sched_ext: Remove redundant p->nr_cpus_allowed checker
select_rq_task() already checks that 'p->nr_cpus_allowed > 1', so the 'p->nr_cpus_allowed == 1' check in scx_select_cpu_dfl() is redundant.
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

efe231d9 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Decouple locks in scx_ops_enable()
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and cpus_read_lock. Currently, the locks are grabbed together which is prone to locking order problems.
For example, currently, there is a possible deadlock involving scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside scx_fork_rwsem due to locking order existing in other subsystems. However, there exists a dependency in the other direction during hotplug if hotplug needs to fork a new task, which happens in some cases. This leads to the following deadlock:
  scx_ops_enable()                        hotplug

                                          percpu_down_write(&cpu_hotplug_lock)
  percpu_down_write(&scx_fork_rwsem)
  block on cpu_hotplug_lock
                                          kthread_create() waits for kthreadd
                                          kthreadd blocks on scx_fork_rwsem
Note that this doesn't trigger lockdep because the hotplug side dependency bounces through kthreadd.
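
For illustration, here is a minimal userspace pthread model of the cycle. All names are stand-ins for the kernel primitives, the sleeps only force the interleaving, and running it hangs by design:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_rwlock_t hotplug_lock = PTHREAD_RWLOCK_INITIALIZER; /* cpu_hotplug_lock */
  static pthread_rwlock_t fork_rwsem   = PTHREAD_RWLOCK_INITIALIZER; /* scx_fork_rwsem  */

  /* "kthreadd": forking any new task needs fork_rwsem */
  static void *kthreadd_fn(void *arg)
  {
      pthread_rwlock_rdlock(&fork_rwsem);    /* blocks: enable path holds it for write */
      puts("kthreadd: task forked");
      pthread_rwlock_unlock(&fork_rwsem);
      return NULL;
  }

  /* "scx_ops_enable()": takes fork_rwsem, then needs the hotplug lock */
  static void *enable_fn(void *arg)
  {
      pthread_rwlock_wrlock(&fork_rwsem);
      sleep(1);                              /* let "hotplug" win hotplug_lock first */
      pthread_rwlock_rdlock(&hotplug_lock);  /* blocks: hotplug holds it for write */
      pthread_rwlock_unlock(&hotplug_lock);
      pthread_rwlock_unlock(&fork_rwsem);
      return NULL;
  }

  int main(void)
  {
      pthread_t enable, kthreadd;

      pthread_create(&enable, NULL, enable_fn, NULL);
      pthread_rwlock_wrlock(&hotplug_lock);  /* "hotplug" begins */
      sleep(2);
      /* hotplug needs a new kthread; the request bounces through kthreadd,
       * which is why lockdep never sees this dependency edge directly */
      pthread_create(&kthreadd, NULL, kthreadd_fn, NULL);
      pthread_join(kthreadd, NULL);          /* never returns: the cycle is closed */
      pthread_rwlock_unlock(&hotplug_lock);
      pthread_join(enable, NULL);
      return 0;
  }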
With the preceding scx_cgroup_enabled change, this can be solved by decoupling cpus_read_lock, which is needed for static_key manipulations, from the other two locks.
- Move the first block of static_key manipulations outside of scx_fork_rwsem and scx_cgroup_rwsem. This is now safe with the preceding scx_cgroup_enabled change.
- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration blocks so that __scx_ops_enabled static_key enabling is outside the two rwsems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com

16021656 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Decouple locks in scx_ops_disable_workfn()
The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and cpus_read_lock. Currently, the locks are grabbed together which is prone to locking order problems. With the preceding scx_cgroup_enabled change, we can decouple them:
- As cgroup disabling no longer requires modifying a static_key which requires cpus_read_lock(), no need to grab cpus_read_lock() before grabbing scx_cgroup_rwsem.
- cgroup can now be independently disabled before tasks are moved back to the fair class.
Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop now unnecessary cpus_read_lock() and move static_key operations out of scx_fork_rwsem. This decouples all three locks in the disable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com

568894ed | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online() didn't set SCX_TG_INITED, which meant that ops.cgroup_exit(), even if implemented, won't be called from scx_tg_offline(). This is because SCX_HAS_OP(cgroup_init) is used to test both whether SCX cgroup operations are enabled and whether ops.cgroup_init() exists.
Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup operations and use SCX_HAS_OP(cgroup_init) only to test whether ops.cgroup_init() exists. Make all cgroup operations consistently use scx_cgroup_enabled to test whether cgroup operations are enabled. scx_cgroup_enabled is added instead of using scx_enabled() to ease planned locking updates.
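
A userspace sketch of the described gating, with structure and names simplified from the kernel code:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  struct scx_ops {
      void (*cgroup_init)(void);
      void (*cgroup_exit)(void);
  };

  static void my_cgroup_exit(void) { puts("ops.cgroup_exit() called"); }

  /* A scheduler that implements only cgroup_exit(), not cgroup_init() */
  static struct scx_ops ops = { .cgroup_exit = my_cgroup_exit };

  static bool scx_cgroup_enabled = true;  /* gates cgroup operations as a whole */
  static bool tg_inited;

  #define SCX_HAS_OP(op) (ops.op != NULL) /* now only: "is this op implemented?" */

  static void scx_tg_online(void)
  {
      if (!scx_cgroup_enabled)
          return;
      if (SCX_HAS_OP(cgroup_init))
          ops.cgroup_init();
      tg_inited = true;   /* previously only set when cgroup_init existed */
  }

  static void scx_tg_offline(void)
  {
      if (scx_cgroup_enabled && tg_inited && SCX_HAS_OP(cgroup_exit))
          ops.cgroup_exit();  /* now reached even without cgroup_init */
      tg_inited = false;
  }

  int main(void)
  {
      scx_tg_online();
      scx_tg_offline();   /* prints; with the old gating it stayed silent */
      return 0;
  }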
Signed-off-by: Tejun Heo <tj@kernel.org>

4269c603 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Enable scx_ops_init_task() separately
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned on before the first scx_ops_init_task() loop in scx_ops_enable(). However, if an external entity causes sched_class switch before the loop is complete, tasks which are not initialized could be switched to SCX.
The following can be reproduced by running a program which keeps toggling a process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
  sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
  WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
  ...
  Sched_ext: simple (enabling)
  RIP: 0010:scx_ops_enable_task+0x1a1/0x200
  ...
  switching_to_scx+0x13/0xa0
  __sched_setscheduler+0x850/0xa50
  do_sched_setscheduler+0x104/0x1c0
  __x64_sys_sched_setscheduler+0x18/0x30
  do_syscall_64+0x7b/0x140
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by gating scx_ops_init_task() separately using scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are finished with scx_ops_init_task().
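
A userspace model of the ordering fix, with all names simplified: init is gated by its own flag, and __scx_ops_enabled only flips after every task has been through init, so nothing can be switched in half-ready:

  #include <stdbool.h>
  #include <stdio.h>

  enum { TASK_NONE, TASK_READY, TASK_ENABLED };

  static bool scx_ops_init_task_enabled;  /* gates init in the fork path */
  static bool __scx_ops_enabled;          /* gates switching tasks into SCX */
  static int task_state[3];

  static void try_setscheduler(int i)     /* external sched_setscheduler() */
  {
      if (!__scx_ops_enabled)
          return;                         /* too early: init still running */
      if (task_state[i] != TASK_READY)
          fprintf(stderr, "invalid transition %d -> %d\n",
                  task_state[i], TASK_ENABLED);
      task_state[i] = TASK_ENABLED;
  }

  int main(void)
  {
      scx_ops_init_task_enabled = true;   /* step 1: allow task init */
      for (int i = 0; i < 3; i++)
          task_state[i] = TASK_READY;     /* scx_ops_init_task() on each task */

      __scx_ops_enabled = true;           /* step 2: only now visible outside */
      for (int i = 0; i < 3; i++)
          try_setscheduler(i);
      return 0;
  }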
Signed-off-by: Tejun Heo <tj@kernel.org>

9753358a | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
scx_ops_enable() has two task iteration loops. The first one calls scx_ops_init_task() on every task and the latter switches the eligible ones into SCX. The first loop left tasks in the SCX_TASK_INIT state, and the second loop then switched them to READY before switching each task into SCX.
The distinction between INIT and READY is only meaningful in the fork path, where it's used to tell whether the task finished forking so that ops.exit_task() can be invoked accordingly. Leaving tasks in the INIT state between the two loops is inconsistent with the fork path and incorrect. The following can be triggered by running a program which keeps toggling a task between SCHED_OTHER and SCHED_EXT while a scheduler is being enabled:
  sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
  WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
  ...
  Sched_ext: qmap (enabling+all)
  RIP: 0010:scx_ops_enable_task+0x1a1/0x200
  ...
  switching_to_scx+0x13/0xa0
  __sched_setscheduler+0x850/0xa50
  do_sched_setscheduler+0x104/0x1c0
  __x64_sys_sched_setscheduler+0x18/0x30
  do_syscall_64+0x7b/0x140
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by transitioning to READY in the first loop right after scx_ops_init_task() succeeds.
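
A toy model of the state check, where the values match the warning text (INIT == 1, ENABLED == 3); transitioning INIT -> READY inside the first loop means a racing setscheduler can no longer observe state 1:

  #include <stdio.h>

  enum { TASK_NONE = 0, TASK_INIT = 1, TASK_READY = 2, TASK_ENABLED = 3 };
  static int state;

  static void scx_ops_enable_task(void)
  {
      if (state != TASK_READY) {
          fprintf(stderr, "Invalid task state transition %d -> %d\n",
                  state, TASK_ENABLED);
          return;
      }
      state = TASK_ENABLED;
  }

  int main(void)
  {
      state = TASK_INIT;      /* old behavior: left in INIT between loops */
      scx_ops_enable_task();  /* trips the warning, as in the splat above */

      state = TASK_READY;     /* fixed: READY right after init succeeds */
      scx_ops_enable_task();  /* clean */
      return 0;
  }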
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>

8c2090c5 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Initialize in bypass mode
scx_ops_enable() used preempt_disable() around the task iteration loop to switch tasks into SCX to guarantee forward progress of the task which is running scx_ops_enable(). However, in the gap between setting __scx_ops_enabled and preempt_disable(), an external entity can put tasks, including the enabling one, into SCX prematurely, which can lead to malfunctions including stalls.
The bypass mode can wrap the entire enabling operation and guarantee forward progress no matter what the BPF scheduler does. Use the bypass mode instead to guarantee forward progress while enabling.
While at it, release and regrab scx_tasks_lock between the two task iteration loops in scx_ops_enable() for clarity, as there is no reason to keep holding the lock between them.
Signed-off-by: Tejun Heo <tj@kernel.org>

fc1fcebe | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Remove SCX_OPS_PREPPING
The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
Signed-off-by: Tejun Heo <tj@kernel.org>

1bbcfe62 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
check_hotplug_seq() is used to detect CPU hotplug events which occurred while the BPF scheduler is being loaded, so that initialization can be retried if CPU hotplug events take place before the CPU hotplug callbacks are online.
As such, the best place to call it is in the same cpus_read_lock() section that enables the CPU hotplug ops. Currently, it is called in the next cpus_read_lock() block in scx_ops_enable(). The side effect of this placement is a small window in which hotplug sequence detection can trigger unnecessarily, which isn't critical.
Move check_hotplug_seq() invocation to the same cpus_read_lock() block as the hotplug operation enablement to close the window and get the invocation out of the way for planned locking updates.
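
A userspace model of the seq check, with stand-in names; the point is that the snapshot and the callback registration must sit in one critical section, so a later mismatch can only mean an event raced with loading:

  #include <stdbool.h>
  #include <stdio.h>

  static unsigned long global_hotplug_seq;  /* bumped on every hotplug event */
  static unsigned long loader_hotplug_seq;  /* snapshot taken by the loader */

  static void enable_hotplug_ops(void)
  {
      /* modeled critical section: register callbacks + snapshot together */
      loader_hotplug_seq = global_hotplug_seq;
  }

  static bool check_hotplug_seq(void)
  {
      if (loader_hotplug_seq != global_hotplug_seq) {
          fprintf(stderr, "hotplug raced with load, retrying init\n");
          return false;
      }
      return true;
  }

  int main(void)
  {
      enable_hotplug_ops();
      global_hotplug_seq++;   /* simulated CPU hotplug event */
      return check_hotplug_seq() ? 0 : 1;
  }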
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>

6f34d8d3 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Use shorter slice while bypassing
While bypassing, tasks are scheduled in FIFO order which favors tasks that hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While bypassing, guaranteeing timely forward progress is the main goal. There's no point in giving long slices. Shorten the time slice used while bypassing from 20ms to 5ms.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

b7b3b2db | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Split the global DSQ per NUMA node
In the bypass mode, the global DSQ is used to schedule all tasks in simple FIFO order. All tasks are queued into the global DSQ and all CPUs try to execute tasks from it. This creates a lot of cross-node cacheline accesses and scheduling across the node boundaries, and can lead to live-lock conditions where the system takes tens of minutes to disable the BPF scheduler while executing in the bypass mode.
Split the global DSQ per NUMA node. Each node has its own global DSQ. When a task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to the task's CPU and all CPUs in a node only consume its node-local global DSQ.
This resolves a livelock condition which could be reliably triggered on a 2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling and disabling a SCX scheduler.
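
A minimal userspace sketch of the split (queue layout and node mapping are simplified stand-ins): SCX_DSQ_GLOBAL dispatch lands in the node-local queue, and a CPU only ever consumes its own node's queue, so cachelines stop bouncing across sockets:

  #include <stdio.h>

  #define NR_NODES 2
  #define NR_CPUS  4

  /* One FIFO per NUMA node instead of a single machine-wide one. */
  struct dsq { int buf[64]; int head, tail; };
  static struct dsq global_dsq[NR_NODES];

  static int cpu_to_node(int cpu) { return cpu / (NR_CPUS / NR_NODES); }

  /* dispatch to SCX_DSQ_GLOBAL: enqueue on the task's CPU's node */
  static void dispatch_global(int task, int cpu)
  {
      struct dsq *q = &global_dsq[cpu_to_node(cpu)];
      q->buf[q->tail++ % 64] = task;
  }

  /* consume: FIFO order is preserved per node, never across nodes */
  static int consume_global(int cpu)
  {
      struct dsq *q = &global_dsq[cpu_to_node(cpu)];
      return q->head == q->tail ? -1 : q->buf[q->head++ % 64];
  }

  int main(void)
  {
      dispatch_global(101, 0);  /* node 0 */
      dispatch_global(202, 3);  /* node 1 */
      printf("cpu0 runs %d, cpu3 runs %d\n",
             consume_global(0), consume_global(3));
      return 0;
  }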
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

bba26bf3 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Relocate find_user_dsq()
To prepare for the addition of find_global_dsq(). No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

63fb3ec8 | 27-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and is consumed implicitly, but all BPF DSQ related kfuncs could be used on it. SCX_DSQ_GLOBAL will be split per-node for scalability and those operations won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.
Since scx_flatcg, which was using SCX_DSQ_GLOBAL as the fallback DSQ, has been updated, this shouldn't affect any schedulers.
This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq(). Open code and remove find_non_local_dsq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

42268ad0 | 24-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Build fix for !CONFIG_SMP
move_remote_task_to_local_dsq() is only defined on SMP configs but scx_dispatch_from_dsq() was calling move_remote_task_to_local_dsq() on UP configs too, causing build failures. Add a dummy move_remote_task_to_local_dsq() which triggers a warning.
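
A sketch of the pattern, not the kernel's exact code, with CONFIG_SMP and WARN_ON_ONCE modeled for userspace: provide a UP stub so callers compile, and warn if it is ever reached:

  #include <stdio.h>

  /* #define CONFIG_SMP 1 */   /* toggle to model an SMP build */

  #define WARN_ON_ONCE(cond) \
      do { if (cond) fprintf(stderr, "WARN: %s\n", #cond); } while (0)

  #ifdef CONFIG_SMP
  static void move_remote_task_to_local_dsq(int task)
  {
      /* real cross-CPU migration lives here on SMP */
      printf("migrated task %d\n", task);
  }
  #else
  /* UP stub: keeps callers compiling; reaching it is a logic error,
   * so it warns instead of silently doing nothing. */
  static void move_remote_task_to_local_dsq(int task)
  {
      (void)task;
      WARN_ON_ONCE(1);
  }
  #endif

  int main(void)
  {
      move_remote_task_to_local_dsq(42);
      return 0;
  }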
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/

6fa6588e | 24-Sep-2024 | Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'sched_ext-for-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Three build fixes
- The fix for a stall bug introduced by a recent optimization in sched core (SM_IDLE)
- Addition of /sys/kernel/sched_ext/enable_seq. While not a fix, it is a simple addition that lets distro people tell whether an SCX scheduler has ever been loaded on the system
* tag 'sched_ext-for-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Provide a sysfs enable_seq counter
  sched_ext: Fix build when !CONFIG_STACKTRACE
  sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled()
  sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT
  sched: Add dummy version of sched_group_set_idle()

431844b6 | 21-Sep-2024 | Andrea Righi <andrea.righi@linux.dev>
sched_ext: Provide a sysfs enable_seq counter
As discussed during the distro-centric session within the sched_ext Microconference at LPC 2024, introduce a sequence counter that is incremented every time a BPF scheduler is loaded.
This feature can help distributions in diagnosing potential performance regressions by identifying systems where users are running (or have run) custom BPF schedulers.
Example:
  arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
  0
  arighi@virtme-ng~> sudo scx_simple
  local=1 global=0
  ^CEXIT: unregistered from user space
  arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
  1
In this way user-space tools (such as Ubuntu's apport and similar) are able to gather and include this information in bug reports.
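
A minimal sketch of how such a collector might read the counter; only the sysfs path comes from the patch, everything else is glue:

  #include <stdio.h>

  int main(void)
  {
      unsigned long seq;
      FILE *f = fopen("/sys/kernel/sched_ext/enable_seq", "r");

      if (!f) { perror("enable_seq"); return 1; }
      if (fscanf(f, "%lu", &seq) != 1) { fclose(f); return 1; }
      fclose(f);
      /* non-zero: a BPF scheduler has been loaded at least once since boot */
      printf("BPF schedulers loaded since boot: %lu\n", seq);
      return 0;
  }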
Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Phil Auld <pauld@redhat.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>

62d3726d | 23-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Fix build when !CONFIG_STACKTRACE
a2f4b16e736d ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried fixing the build when !CONFIG_STACKTRACE but didn't do so fully. Also put stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix the build when !CONFIG_STACKTRACE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/

88264981 | 21-Sep-2024 | Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext support from Tejun Heo: "This implements a new scheduler class called 'ext_sched_class', or sched_ext, which allows scheduling policies to be implemented as BPF programs.
The goals of this are:
- Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies.
- Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers.
- Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments"
See individual commits for more documentation, but also the cover letter for the latest series:
Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/
* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
  sched: Move update_other_load_avgs() to kernel/sched/pelt.c
  sched_ext: Don't trigger ops.quiescent/runnable() on migrations
  sched_ext: Synchronize bypass state changes with rq lock
  scx_qmap: Implement highpri boosting
  sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
  sched_ext: Compact struct bpf_iter_scx_dsq_kern
  sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
  sched_ext: Move consume_local_task() upward
  sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
  sched_ext: Reorder args for consume_local/remote_task()
  sched_ext: Restructure dispatch_to_local_dsq()
  sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
  sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
  sched_ext: Refactor consume_remote_task()
  sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
  sched_ext: Add missing static to scx_dump_data
  sched_ext: Add missing static to scx_has_op[]
  sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
  sched_ext: Add a cgroup scheduler which uses flattened hierarchy
  sched_ext: Add cgroup support
  ...

Revision tags: v6.11

513ed0c7 | 10-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Don't trigger ops.quiescent/runnable() on migrations
A task moving across CPUs should not trigger quiescent/runnable task state events as the task is staying runnable the whole time and just stopping and then starting on different CPUs. Suppress quiescent/runnable task state events if task_on_rq_migrating().
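
A userspace sketch of the suppression, with the ops table and task state reduced to bare stand-ins:

  #include <stdbool.h>
  #include <stdio.h>

  struct task { bool on_rq_migrating; };

  static void ops_quiescent(void) { puts("ops.quiescent()"); }
  static void ops_runnable(void)  { puts("ops.runnable()"); }

  /* Dequeue/enqueue halves of a migration: the task never stops being
   * runnable, so the state callbacks are skipped while migrating. */
  static void dequeue_task(struct task *p)
  {
      if (!p->on_rq_migrating)
          ops_quiescent();
  }

  static void enqueue_task(struct task *p)
  {
      if (!p->on_rq_migrating)
          ops_runnable();
  }

  int main(void)
  {
      struct task p = { .on_rq_migrating = true };
      dequeue_task(&p);   /* silent: just moving CPUs */
      enqueue_task(&p);   /* silent */

      p.on_rq_migrating = false;
      dequeue_task(&p);   /* real sleep: ops.quiescent() fires */
      return 0;
  }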
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

750a40d8 | 10-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Synchronize bypass state changes with rq lock
While the BPF scheduler is being unloaded, the following warning messages trigger sometimes:
NOHZ tick-stop error: local softirq work is pending, handler #80!!!
This is caused by the CPU entering idle while there are pending softirqs. The main culprit is the bypassing state assertion not being synchronized with rq operations. As the BPF scheduler cannot be trusted in the disable path, the first step is entering the bypass mode where the BPF scheduler is ignored and scheduling becomes global FIFO.
This is implemented by turning scx_ops_bypassing() true. However, the transition isn't synchronized against anything and it's possible for enqueue and dispatch paths to have different ideas on whether bypass mode is on.
Make each rq track its own bypass state with SCX_RQ_BYPASSING which is modified while rq is locked.
This removes most of the NOHZ tick-stop messages but not completely. I believe the stragglers are from the sched core bug where pick_task_scx() can be called without preceding balance_scx(). Once that bug is fixed, we should verify that all occurrences of this error message are gone too.
v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that the per-cpu states are always synchronized with the global state.
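
A userspace model of the per-rq flag, with pthread mutexes standing in for rq locks: bypass is flipped per rq under that rq's lock, so enqueue and dispatch (which also run under the rq lock) always agree on the mode:

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define NR_CPUS 4
  #define SCX_RQ_BYPASSING (1u << 0)

  struct rq {
      pthread_mutex_t lock;
      unsigned int flags;
  };
  static struct rq rqs[NR_CPUS];

  static void scx_set_bypass(bool on)
  {
      for (int cpu = 0; cpu < NR_CPUS; cpu++) {
          pthread_mutex_lock(&rqs[cpu].lock);
          if (on)
              rqs[cpu].flags |= SCX_RQ_BYPASSING;
          else
              rqs[cpu].flags &= ~SCX_RQ_BYPASSING;
          pthread_mutex_unlock(&rqs[cpu].lock);
      }
  }

  static bool scx_rq_bypassing(int cpu)  /* caller holds rqs[cpu].lock */
  {
      return rqs[cpu].flags & SCX_RQ_BYPASSING;
  }

  int main(void)
  {
      for (int cpu = 0; cpu < NR_CPUS; cpu++)
          pthread_mutex_init(&rqs[cpu].lock, NULL);
      scx_set_bypass(true);
      pthread_mutex_lock(&rqs[0].lock);
      printf("cpu0 bypassing: %d\n", scx_rq_bypassing(0));
      pthread_mutex_unlock(&rqs[0].lock);
      scx_set_bypass(false);
      return 0;
  }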
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>

4c30f5ce | 10-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated - scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made.
Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which doesn't hold a rq lock, including BPF timers and SYSCALL programs.
This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ:
http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch.
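
A hedged BPF-side sketch of how a scheduler might use these kfuncs from ops.dispatch(). It assumes the scx tooling's scaffolding (scx_common.bpf.h, the bpf_for_each(scx_dsq, ...) iterator and its BPF_FOR_EACH_ITER cursor macro) and a hypothetical user DSQ id MY_DSQ; verify the signatures against the tree before relying on them:

  /* BPF fragment, not standalone: relies on scx_common.bpf.h scaffolding. */
  void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
  {
      struct task_struct *p;

      bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
          /* override the slice only for this move (1ms, hypothetical) */
          scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER, 1000000);
          /* move the task straight to this CPU's local DSQ */
          if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                        SCX_DSQ_LOCAL_ON | cpu, 0))
              break;
      }
  }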
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>

6462dd53 | 10-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Compact struct bpf_iter_scx_dsq_kern
struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was using 5 of them. We want to add two more u64 fields but it's better if we do so while staying within scx_iter_scx_dsq to maintain binary compatibility.
The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits freeing up 48 bits - 16 bits for flags, 32 bits for a u32 - for use by struct bpf_iter_scx_dsq_kern.
This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field reducing the struct size by a full u64.
No behavior changes intended.
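
A simplified model of the packing idea (field names and exact sizes differ from the kernel struct): the bool that padded out to a full u64 becomes 16 bits of flags, and the reclaimed space absorbs the iterator's own bookkeeping:

  #include <stdint.h>
  #include <stdio.h>

  /* Before: the embedded list node spends a whole u64 on one bool. */
  struct cursor_before {
      void *prev, *next;
      _Bool is_local;              /* 1 bit used, rest of the u64 wasted */
  };

  struct iter_before {
      struct cursor_before cursor; /* pads to 3 u64's */
      uint64_t dsq_seq;
      uint64_t flags;
  };

  /* After: 16 bits of flags + a u32 share the last u64 of the cursor,
   * so dsq_seq and flags fold into it and the extra fields disappear. */
  struct cursor_after {
      void *prev, *next;
      uint16_t flags;              /* lower 16 bits only */
      uint16_t pad;
      uint32_t dsq_seq;            /* the reclaimed u32 */
  };

  struct iter_after {
      struct cursor_after cursor;  /* still 3 u64's, nothing else needed */
  };

  int main(void)
  {
      printf("before: %zu bytes, after: %zu bytes\n",
             sizeof(struct iter_before), sizeof(struct iter_after));
      return 0;
  }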
Signed-off-by: Tejun Heo <tj@kernel.org>

cf3e9443 | 10-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
- Rename consume_local_task() to move_local_task_to_local_dsq() and remove task_unlink_from_dsq() and source DSQ unlocking from it.
This is to make the migration code easier to reuse.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

d434210e | 10-Sep-2024 | Tejun Heo <tj@kernel.org>
sched_ext: Move consume_local_task() upward
So that the local case comes first and two CONFIG_SMP blocks can be merged.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>