History log of /linux/kernel/sched/ext.c (Results 1 – 25 of 97)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# e32cde8d 30-Sep-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- When sched_ext is in bypass mode (e.g. while disabli

Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- When sched_ext is in bypass mode (e.g. while disabling the BPF
scheduler), it was using one DSQ to implement global FIFO scheduling
as all it has to do is guaranteeing reasonable forward progress.

On multi-socket machines, this can lead to live-lock conditions under
certain workloads. Fixed by splitting the queue used for FIFO
scheduling per NUMA node. This required several preparation patches.

- Hotplug tests on powerpc could reliably trigger deadlock while
enabling a BPF scheduler.

This was caused by cpu_hotplug_lock nesting inside scx_fork_rwsem and
then CPU hotplug path trying to fork a new thread while holding
cpu_hotplug_lock.

Fixed by restructuring locking in enable and disable paths so that
the two locks are not coupled. This required several preparation
patches which also fixed a couple other issues in the enable path.

- A build fix for !CONFIG_SMP

- Userspace tooling sync and updates

* tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove redundant p->nr_cpus_allowed checker
sched_ext: Decouple locks in scx_ops_enable()
sched_ext: Decouple locks in scx_ops_disable_workfn()
sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
sched_ext: Enable scx_ops_init_task() separately
sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
sched_ext: Initialize in bypass mode
sched_ext: Remove SCX_OPS_PREPPING
sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
sched_ext: Use shorter slice while bypassing
sched_ext: Split the global DSQ per NUMA node
sched_ext: Relocate find_user_dsq()
sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
tools/sched_ext: Receive misc updates from SCX repo
sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle
sched_ext: Build fix for !CONFIG_SMP

show more ...


Revision tags: v6.12-rc1
# 95b87369 26-Sep-2024 Zhang Qiao <zhangqiao22@huawei.com>

sched_ext: Remove redundant p->nr_cpus_allowed checker

select_rq_task() already checked that 'p->nr_cpus_allowed > 1',
'p->nr_cpus_allowed == 1' checker in scx_select_cpu_dfl() is redundant.

Signed

sched_ext: Remove redundant p->nr_cpus_allowed checker

select_rq_task() already checked that 'p->nr_cpus_allowed > 1',
'p->nr_cpus_allowed == 1' checker in scx_select_cpu_dfl() is redundant.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# efe231d9 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Decouple locks in scx_ops_enable()

The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to

sched_ext: Decouple locks in scx_ops_enable()

The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems.

For example, currently, there is a possible deadlock involving
scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
scx_fork_rwsem due to locking order existing in other subsystems. However,
there exists a dependency in the other direction during hotplug if hotplug
needs to fork a new task, which happens in some cases. This leads to the
following deadlock:

scx_ops_enable() hotplug

percpu_down_write(&cpu_hotplug_lock)
percpu_down_write(&scx_fork_rwsem)
block on cpu_hotplug_lock
kthread_create() waits for kthreadd
kthreadd blocks on scx_fork_rwsem

Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.

With the preceding scx_cgroup_enabled change, this can be solved by
decoupling cpus_read_lock, which is needed for static_key manipulations,
from the other two locks.

- Move the first block of static_key manipulations outside of scx_fork_rwsem
and scx_cgroup_rwsem. This is now safe with the preceding
scx_cgroup_enabled change.

- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
blocks so that __scx_ops_enabled static_key enabling is outside the two
rwsems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com

show more ...


# 16021656 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Decouple locks in scx_ops_disable_workfn()

The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is

sched_ext: Decouple locks in scx_ops_disable_workfn()

The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems. With the preceding scx_cgroup_enabled change, we can
decouple them:

- As cgroup disabling no longer requires modifying a static_key which
requires cpus_read_lock(), no need to grab cpus_read_lock() before
grabbing scx_cgroup_rwsem.

- cgroup can now be independently disabled before tasks are moved back to
the fair class.

Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
now unnecessary cpus_read_lock() and move static_key operations out of
scx_fork_rwsem. This decouples all three locks in the disable path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com

show more ...


# 568894ed 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()

If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
didn't set SCX_TG_INITED which meant th

sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()

If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
didn't set SCX_TG_INITED which meant that ops.cgroup_exit(), even if
implemented, won't be called from scx_tg_offline(). This is because
SCX_HAS_OP(cgroupt_init) is used to test both whether SCX cgroup operations
are enabled and ops.cgroup_init() exists.

Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
operations and use SCX_HAS_OP(cgroup_init) only to test whether
ops.cgroup_init() exists. Make all cgroup operations consistently use
scx_cgroup_enabled to test whether cgroup operations are enabled.
scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# 4269c603 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Enable scx_ops_init_task() separately

scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by scx_enabled() test and thus __scx_ops_enabled had to be tu

sched_ext: Enable scx_ops_init_task() separately

scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned
on before the first scx_ops_init_task() loop in scx_ops_enable(). However,
if an external entity causes sched_class switch before the loop is complete,
tasks which are not initialized could be switched to SCX.

The following can be reproduced by running a program which keeps toggling a
process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).

sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: simple (enabling)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by gating scx_ops_init_task() separately using
scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
finished with scx_ops_init_task().

Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# 9753358a 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()

scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the latter swi

sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()

scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the latter switches the eligible ones
into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the
second loop switched it into READY before switching the task into SCX.

The distinction between INIT and READY is only meaningful in the fork path
where it's used to tell whether the task finished forking so that we can
tell ops.exit_task() accordingly. Leaving task in INIT state between the two
loops is incosistent with the fork path and incorrect. The following can be
triggered by running a program which keeps toggling a task between
SCHED_OTHER and SCHED_SCX while enabling a task:

sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: qmap (enabling+all)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by transitioning to READY in the first loop right after
scx_ops_init_task() succeeds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>

show more ...


# 8c2090c5 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Initialize in bypass mode

scx_ops_enable() used preempt_disable() around the task iteration loop to
switch tasks into SCX to guarantee forward progress of the task which is
running scx_op

sched_ext: Initialize in bypass mode

scx_ops_enable() used preempt_disable() around the task iteration loop to
switch tasks into SCX to guarantee forward progress of the task which is
running scx_ops_enable(). However, in the gap between setting
__scx_ops_enabled and preeempt_disable(), an external entity can put tasks
including the enabling one into SCX prematurely, which can lead to
malfunctions including stalls.

The bypass mode can wrap the entire enabling operation and guarantee forward
progress no matter what the BPF scheduler does. Use the bypass mode instead
to guarantee forward progress while enabling.

While at it, release and regrab scx_tasks_lock between the two task
iteration locks in scx_ops_enable() for clarity as there is no reason to
keep holding the lock between them.

Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# fc1fcebe 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Remove SCX_OPS_PREPPING

The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
anywhere and only adds confusion. Drop SCX_OPS_PREPPING.

Signed-off-by: Tejun Heo <tj@ke

sched_ext: Remove SCX_OPS_PREPPING

The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
anywhere and only adds confusion. Drop SCX_OPS_PREPPING.

Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# 1bbcfe62 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()

check_hotplug_seq() is used to detect CPU hotplug event which occurred while
the BPF scheduler is being loaded so that initialization

sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()

check_hotplug_seq() is used to detect CPU hotplug event which occurred while
the BPF scheduler is being loaded so that initialization can be retried if
CPU hotplug events take place before the CPU hotplug callbacks are online.

As such, the best place to call it is in the same cpu_read_lock() section
that enables the CPU hotplug ops. Currently, it is called in the next
cpus_read_lock() block in scx_ops_enable(). The side effect of this
placement is a small window in which hotplug sequence detection can trigger
unnecessarily, which isn't critical.

Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
the hotplug operation enablement to close the window and get the invocation
out of the way for planned locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>

show more ...


# 6f34d8d3 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Use shorter slice while bypassing

While bypassing, tasks are scheduled in FIFO order which favors tasks that
hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
bypass

sched_ext: Use shorter slice while bypassing

While bypassing, tasks are scheduled in FIFO order which favors tasks that
hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
bypassing, guaranteeing timely forward progress is the main goal. There's no
point in giving long slices. Shorten the time slice used while bypassing
from 20ms to 5ms.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

show more ...


# b7b3b2db 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Split the global DSQ per NUMA node

In the bypass mode, the global DSQ is used to schedule all tasks in simple
FIFO order. All tasks are queued into the global DSQ and all CPUs try to
exec

sched_ext: Split the global DSQ per NUMA node

In the bypass mode, the global DSQ is used to schedule all tasks in simple
FIFO order. All tasks are queued into the global DSQ and all CPUs try to
execute tasks from it. This creates a lot of cross-node cacheline accesses
and scheduling across the node boundaries, and can lead to live-lock
conditions where the system takes tens of minutes to disable the BPF
scheduler while executing in the bypass mode.

Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
the task's CPU and all CPUs in a node only consume its node-local global
DSQ.

This resolves a livelock condition which could be reliably triggered on an
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
`stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
and disabling a SCX scheduler.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

show more ...


# bba26bf3 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Relocate find_user_dsq()

To prepare for the addition of find_global_dsq(). No functional changes.

Signed-off-by: tejun heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>


# 63fb3ec8 27-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()

SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
is consumed impli

sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()

SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
is consumed implicitly, but all BPF DSQ related kfuncs could be used on it.
SCX_DSQ_GLOBAL will be split per-node for scalability and those operations
won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(),
scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that
SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.

With scx_flatcg, which was using SCX_DSQ_GLOBAL as the fallback DSQ,
updated, this shouldn't affect any schedulers.

This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq().
Open code and remove find_non_local_dsq().

Signed-off-by: tejun heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

show more ...


# 42268ad0 24-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Build fix for !CONFIG_SMP

move_remote_task_to_local_dsq() is only defined on SMP configs but
scx_disaptch_from_dsq() was calling move_remote_task_to_local_dsq() on UP
configs too causing

sched_ext: Build fix for !CONFIG_SMP

move_remote_task_to_local_dsq() is only defined on SMP configs but
scx_disaptch_from_dsq() was calling move_remote_task_to_local_dsq() on UP
configs too causing build failures. Add a dummy
move_remote_task_to_local_dsq() which triggers a warning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/

show more ...


# 6fa6588e 24-Sep-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- Three build fixes

- The fix for a stall bug introduc

Merge tag 'sched_ext-for-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- Three build fixes

- The fix for a stall bug introduced by a recent optimization in sched
core (SM_IDLE)

- Addition of /sys/kernel/sched_ext/enable_seq. While not a fix, it is
a simple addition that distro people want to be able to tell whether
an SCX scheduler has ever been loaded on the system

* tag 'sched_ext-for-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Provide a sysfs enable_seq counter
sched_ext: Fix build when !CONFIG_STACKTRACE
sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled()
sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT
sched: Add dummy version of sched_group_set_idle()

show more ...


# 431844b6 21-Sep-2024 Andrea Righi <andrea.righi@linux.dev>

sched_ext: Provide a sysfs enable_seq counter

As discussed during the distro-centric session within the sched_ext
Microconference at LPC 2024, introduce a sequence counter that is
incremented every

sched_ext: Provide a sysfs enable_seq counter

As discussed during the distro-centric session within the sched_ext
Microconference at LPC 2024, introduce a sequence counter that is
incremented every time a BPF scheduler is loaded.

This feature can help distributions in diagnosing potential performance
regressions by identifying systems where users are running (or have ran)
custom BPF schedulers.

Example:

arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
0
arighi@virtme-ng~> sudo scx_simple
local=1 global=0
^CEXIT: unregistered from user space
arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
1

In this way user-space tools (such as Ubuntu's apport and similar) are
able to gather and include this information in bug reports.

Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Phil Auld <pauld@redhat.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# 62d3726d 23-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Fix build when !CONFIG_STACKTRACE

a2f4b16e736d ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried
fixing build when !CONFIG_STACKTRACE but didn't so fully. Also put
stack_trac

sched_ext: Fix build when !CONFIG_STACKTRACE

a2f4b16e736d ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried
fixing build when !CONFIG_STACKTRACE but didn't so fully. Also put
stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix
build when !CONFIG_STACKTRACE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/

show more ...


# 88264981 21-Sep-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
"This implements a new scheduler class called ‘ext_sched_class’,

Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
"This implements a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF
programs.

The goals of this are:

- Ease of experimentation and exploration: Enabling rapid iteration
of new scheduling policies.

- Customization: Building application-specific schedulers which
implement policies that are not applicable to general-purpose
schedulers.

- Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments"

See individual commits for more documentation, but also the cover letter
for the latest series:

Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/

* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
sched: Move update_other_load_avgs() to kernel/sched/pelt.c
sched_ext: Don't trigger ops.quiescent/runnable() on migrations
sched_ext: Synchronize bypass state changes with rq lock
scx_qmap: Implement highpri boosting
sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
sched_ext: Compact struct bpf_iter_scx_dsq_kern
sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
sched_ext: Move consume_local_task() upward
sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
sched_ext: Reorder args for consume_local/remote_task()
sched_ext: Restructure dispatch_to_local_dsq()
sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
sched_ext: Refactor consume_remote_task()
sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
sched_ext: Add missing static to scx_dump_data
sched_ext: Add missing static to scx_has_op[]
sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
sched_ext: Add a cgroup scheduler which uses flattened hierarchy
sched_ext: Add cgroup support
...

show more ...


Revision tags: v6.11
# 513ed0c7 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Don't trigger ops.quiescent/runnable() on migrations

A task moving across CPUs should not trigger quiescent/runnable task state
events as the task is staying runnable the whole time and j

sched_ext: Don't trigger ops.quiescent/runnable() on migrations

A task moving across CPUs should not trigger quiescent/runnable task state
events as the task is staying runnable the whole time and just stopping and
then starting on different CPUs. Suppress quiescent/runnable task state
events if task_on_rq_migrating().

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# 750a40d8 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Synchronize bypass state changes with rq lock

While the BPF scheduler is being unloaded, the following warning messages
trigger sometimes:

NOHZ tick-stop error: local softirq work is pe

sched_ext: Synchronize bypass state changes with rq lock

While the BPF scheduler is being unloaded, the following warning messages
trigger sometimes:

NOHZ tick-stop error: local softirq work is pending, handler #80!!!

This is caused by the CPU entering idle while there are pending softirqs.
The main culprit is the bypassing state assertion not being synchronized
with rq operations. As the BPF scheduler cannot be trusted in the disable
path, the first step is entering the bypass mode where the BPF scheduler is
ignored and scheduling becomes global FIFO.

This is implemented by turning scx_ops_bypassing() true. However, the
transition isn't synchronized against anything and it's possible for enqueue
and dispatch paths to have different ideas on whether bypass mode is on.

Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
modified while rq is locked.

This removes most of the NOHZ tick-stop messages but not completely. I
believe the stragglers are from the sched core bug where pick_task_scx() can
be called without preceding balance_scx(). Once that bug is fixed, we should
verify that all occurrences of this error message are gone too.

v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
the per-cpu states are always synchronized with the global state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>

show more ...


# 4c30f5ce 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()

Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatica

sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()

Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which dosen't hold a rq lock including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:

http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
count limit and often won't be needed anyway. Instead provide
scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
only when needed and override the specified parameter for the subsequent
dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>

show more ...


# 6462dd53 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Compact struct bpf_iter_scx_dsq_kern

struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
s

sched_ext: Compact struct bpf_iter_scx_dsq_kern

struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within scx_iter_scx_dsq to maintain binary compatibility.

The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn
the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# cf3e9443 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()

- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to

sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()

- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

show more ...


# d434210e 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Move consume_local_task() upward

So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Ack

sched_ext: Move consume_local_task() upward

So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>

show more ...


1234