History log of /linux/tools/sched_ext/include/scx/common.bpf.h (Results 1 – 22 of 22)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# daa9f66f 30-Oct-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- Instances of scx_ops_bypass() could race each other le

Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.

- selftest and userspace header fixes

* tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix enq_last_no_enq_fails selftest
sched_ext: Make cast_mask() inline
scx: Fix raciness in scx_ops_bypass()
scx: Fix exit selftest to use custom DSQ
sched_ext: Fix function pointer type mismatches in BPF selftests
selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ

show more ...


Revision tags: v6.12-rc5
# 7724abf0 26-Oct-2024 Tejun Heo <tj@kernel.org>

sched_ext: Make cast_mask() inline

cast_mask() doesn't do any actual work and is defined in a header file.
Force it to be inline. When it is not inlined and the function is not used,
it can cause ve

sched_ext: Make cast_mask() inline

cast_mask() doesn't do any actual work and is defined in a header file.
Force it to be inline. When it is not inlined and the function is not used,
it can cause verificaiton failures like the following:

# tools/testing/selftests/sched_ext/runner -t minimal
===== START =====
TEST: minimal
DESCRIPTION: Verify we can load a fully minimal scheduler
OUTPUT:
libbpf: prog 'cast_mask': missing BPF prog type, check ELF section name '.text'
libbpf: prog 'cast_mask': failed to load: -22
libbpf: failed to load object 'minimal'
libbpf: failed to load BPF skeleton 'minimal': -22
ERR: minimal.c:20
Failed to open and load skel
not ok 1 minimal #
===== END =====

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a748db0c8c6a ("tools/sched_ext: Receive misc updates from SCX repo")

show more ...


Revision tags: v6.12-rc4
# be602cde 17-Oct-2024 Ingo Molnar <mingo@kernel.org>

Merge branch 'linus' into sched/urgent, to resolve conflict

Conflicts:
kernel/sched/ext.c

There's a context conflict between this upstream commit:

3fdb9ebcec10 sched_ext: Start schedulers with

Merge branch 'linus' into sched/urgent, to resolve conflict

Conflicts:
kernel/sched/ext.c

There's a context conflict between this upstream commit:

3fdb9ebcec10 sched_ext: Start schedulers with consistent p->scx.slice values

... and this fix in sched/urgent:

98442f0ccd82 sched: Fix delayed_dequeue vs switched_from_fair()

Resolve it.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

show more ...


Revision tags: v6.12-rc3
# 75b607fa 08-Oct-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- ops.enqueue() didn't have a way to tell whether select

Merge tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- ops.enqueue() didn't have a way to tell whether select_task_rq_scx()
and thus ops.select() were skipped. Some schedulers were incorrectly
using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using
it.

- Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit()

- Fix error information clobbering during load

- Add missing __weak markers to BPF helper declarations

- Doc update

* tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Documentation: Update instructions for running example schedulers
sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
sched_ext: Improve error reporting during loading
sched_ext: Add __weak markers to BPF helper function decalarations

show more ...


Revision tags: v6.12-rc2
# fcbc4235 02-Oct-2024 Vishal Chourasia <vishalc@linux.ibm.com>

sched_ext: Add __weak markers to BPF helper function decalarations

Fix build errors by adding __weak markers to BPF helper function
declarations in header files. This resolves static assertion failu

sched_ext: Add __weak markers to BPF helper function decalarations

Fix build errors by adding __weak markers to BPF helper function
declarations in header files. This resolves static assertion failures
in scx_qmap.bpf.c and scx_flatcg.bpf.c where functions like
scx_bpf_dispatch_from_dsq_set_slice, scx_bpf_dispatch_from_dsq_set_vtime,
and scx_bpf_task_cgroup were missing the __weak attribute.

[1] https://lore.kernel.org/all/ZvvfUqRNM4-jYQzH@linux.ibm.com

Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# c8d430db 06-Oct-2024 Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.12, take #1

- Fix pKVM error path on init, making sure we do not chang

Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.12, take #1

- Fix pKVM error path on init, making sure we do not change critical
system registers as we're about to fail

- Make sure that the host's vector length is at capped by a value
common to all CPUs

- Fix kvm_has_feat*() handling of "negative" features, as the current
code is pretty broken

- Promote Joey to the status of official reviewer, while James steps
down -- hopefully only temporarly

show more ...


# 0c436dfe 02-Oct-2024 Takashi Iwai <tiwai@suse.de>

Merge tag 'asoc-fix-v6.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.12

A bunch of fixes here that came in during the merge window and t

Merge tag 'asoc-fix-v6.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.12

A bunch of fixes here that came in during the merge window and the first
week of release, plus some new quirks and device IDs. There's nothing
major here, it's a bit bigger than it might've been due to there being
no fixes sent during the merge window due to your vacation.

show more ...


# 2cd86f02 01-Oct-2024 Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Merge remote-tracking branch 'drm/drm-fixes' into drm-misc-fixes

Required for a panthor fix that broke when
FOP_UNSIGNED_OFFSET was added in place of FMODE_UNSIGNED_OFFSET.

Signed-off-by: Maarten L

Merge remote-tracking branch 'drm/drm-fixes' into drm-misc-fixes

Required for a panthor fix that broke when
FOP_UNSIGNED_OFFSET was added in place of FMODE_UNSIGNED_OFFSET.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

show more ...


Revision tags: v6.12-rc1
# 3a39d672 27-Sep-2024 Paolo Abeni <pabeni@redhat.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts and no adjacent changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# e32cde8d 30-Sep-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- When sched_ext is in bypass mode (e.g. while disabli

Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- When sched_ext is in bypass mode (e.g. while disabling the BPF
scheduler), it was using one DSQ to implement global FIFO scheduling
as all it has to do is guaranteeing reasonable forward progress.

On multi-socket machines, this can lead to live-lock conditions under
certain workloads. Fixed by splitting the queue used for FIFO
scheduling per NUMA node. This required several preparation patches.

- Hotplug tests on powerpc could reliably trigger deadlock while
enabling a BPF scheduler.

This was caused by cpu_hotplug_lock nesting inside scx_fork_rwsem and
then CPU hotplug path trying to fork a new thread while holding
cpu_hotplug_lock.

Fixed by restructuring locking in enable and disable paths so that
the two locks are not coupled. This required several preparation
patches which also fixed a couple other issues in the enable path.

- A build fix for !CONFIG_SMP

- Userspace tooling sync and updates

* tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove redundant p->nr_cpus_allowed checker
sched_ext: Decouple locks in scx_ops_enable()
sched_ext: Decouple locks in scx_ops_disable_workfn()
sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
sched_ext: Enable scx_ops_init_task() separately
sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
sched_ext: Initialize in bypass mode
sched_ext: Remove SCX_OPS_PREPPING
sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
sched_ext: Use shorter slice while bypassing
sched_ext: Split the global DSQ per NUMA node
sched_ext: Relocate find_user_dsq()
sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
tools/sched_ext: Receive misc updates from SCX repo
sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle
sched_ext: Build fix for !CONFIG_SMP

show more ...


# a748db0c 26-Sep-2024 Tejun Heo <tj@kernel.org>

tools/sched_ext: Receive misc updates from SCX repo

Receive misc tools/sched_ext updates from https://github.com/sched-ext/scx
to sync userspace bits.

- LSP macros to help language servers.

- bpf_

tools/sched_ext: Receive misc updates from SCX repo

Receive misc tools/sched_ext updates from https://github.com/sched-ext/scx
to sync userspace bits.

- LSP macros to help language servers.

- bpf_cpumask_weight() declaration and cast_mask() helper.

- Cosmetic updates to scx_flatcg.bpf.c.

Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


# 88264981 21-Sep-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
"This implements a new scheduler class called ‘ext_sched_class’,

Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
"This implements a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF
programs.

The goals of this are:

- Ease of experimentation and exploration: Enabling rapid iteration
of new scheduling policies.

- Customization: Building application-specific schedulers which
implement policies that are not applicable to general-purpose
schedulers.

- Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments"

See individual commits for more documentation, but also the cover letter
for the latest series:

Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/

* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
sched: Move update_other_load_avgs() to kernel/sched/pelt.c
sched_ext: Don't trigger ops.quiescent/runnable() on migrations
sched_ext: Synchronize bypass state changes with rq lock
scx_qmap: Implement highpri boosting
sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
sched_ext: Compact struct bpf_iter_scx_dsq_kern
sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
sched_ext: Move consume_local_task() upward
sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
sched_ext: Reorder args for consume_local/remote_task()
sched_ext: Restructure dispatch_to_local_dsq()
sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
sched_ext: Refactor consume_remote_task()
sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
sched_ext: Add missing static to scx_dump_data
sched_ext: Add missing static to scx_has_op[]
sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
sched_ext: Add a cgroup scheduler which uses flattened hierarchy
sched_ext: Add cgroup support
...

show more ...


Revision tags: v6.11
# 4c30f5ce 10-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()

Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatica

sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()

Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which dosen't hold a rq lock including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:

http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
count limit and often won't be needed anyway. Instead provide
scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
only when needed and override the specified parameter for the subsequent
dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>

show more ...


Revision tags: v6.11-rc7
# 81951366 04-Sep-2024 Tejun Heo <tj@kernel.org>

sched_ext: Add cgroup support

Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. A BPF scheduler may not implement or implement only
subset of cgroup fe

sched_ext: Add cgroup support

Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. A BPF scheduler may not implement or implement only
subset of cgroup features. The implemented features can be indicated using
%SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features
that are not implemented, a warning is triggered.

While a BPF scheduler is being enabled and disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.

v7: - cgroup interface file visibility toggling is dropped in favor just
warning messages. Dynamically changing interface visiblity caused more
confusion than helping.

v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.

- Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
!CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.

v5: - Flipped the locking order between scx_cgroup_rwsem and
cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
documentation around locking.

- sched_move_task() takes an early exit if the source and destination
are identical. This triggered the warning in scx_cgroup_can_attach()
as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
migration path so that ops.cgroup_prep_move() is skipped for identity
migrations so that its invocations always match ops.cgroup_move()
one-to-one.

v4: - Example schedulers moved into their own patches.

- Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.

v3: - Make scx_example_pair switch all tasks by default.

- Convert to BPF inline iterators.

- scx_bpf_task_cgroup() is added to determine the current cgroup from
CPU controller's POV. This allows BPF schedulers to accurately track
CPU cgroup membership.

- scx_example_flatcg added. This demonstrates flattened hierarchy
implementation of CPU cgroup control and shows significant performance
improvement when cgroups which are nested multiple levels are under
competition.

v2: - Build fixes for different CONFIG combinations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrea Righi <andrea.righi@canonical.com>

show more ...


Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10
# 650ba21b 09-Jul-2024 Tejun Heo <tj@kernel.org>

sched_ext: Implement DSQ iterator

DSQs are very opaque in the consumption path. The BPF scheduler has no way
of knowing which tasks are being considered and which is picked. This patch
adds BPF DSQ

sched_ext: Implement DSQ iterator

DSQs are very opaque in the consumption path. The BPF scheduler has no way
of knowing which tasks are being considered and which is picked. This patch
adds BPF DSQ iterator.

- Allows iterating tasks queued on a DSQ in the dispatch order or reverse
from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs
directly.

- Has ordering guarantee where only tasks which were already queued when the
iteration started are visible and consumable during the iteration.

v5: - Add a comment to the naked list_empty(&dsq->list) test in
consume_dispatch_q() to explain the reasoning behind the lockless test
and by extension why nldsq_next_task() isn't used there.

- scx_qmap changes separated into its own patch.

v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong
type for the last argument (bool rev instead of u64 flags). Fix it.

v3: - Alexei pointed out that the iterator is too big to allocate on stack.
Added a prep patch to reduce the size of the cursor. Now
bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on
64bit.

- u32_before() comparison factored out.

v2: - scx_bpf_consume_task() is separated out into a separate patch.

- DSQ seq and iter flags don't need to be u64. Use u32.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org

show more ...


# 6203ef73 08-Jul-2024 Hongyan Xia <hongyan.xia2@arm.com>

sched/ext: Add BPF function to fetch rq

rq contains many useful fields to implement a custom scheduler. For
example, various clock signals like clock_task and clock_pelt can be
used to track load. I

sched/ext: Add BPF function to fetch rq

rq contains many useful fields to implement a custom scheduler. For
example, various clock signals like clock_task and clock_pelt can be
used to track load. It also contains stats in other sched_classes, which
are useful to drive scheduling decisions in ext.

tj: Put the new helper below scx_bpf_task_*() helpers.

Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

show more ...


Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5
# d86adb4f 22-Jun-2024 Tejun Heo <tj@kernel.org>

sched_ext: Add cpuperf support

sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest perf

sched_ext: Add cpuperf support

sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest performance level from running RT tasks.

Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:

- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
different CPUs in the system.

- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
relative to its max performance.

- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

This gives direct control over CPU performance setting to the BPF scheduler.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling frequency holding heuristics as it may
not apply well to sched_ext schedulers which may have a lot weaker
connection between tasks and their current / last CPU.

With cpuperf support added, there is no reason to block uclamp. Enable while
at it.

A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.

v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
to avoid factoring in stale util metric. (Christian)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Loehle <christian.loehle@arm.com>

show more ...


# 06e51be3 18-Jun-2024 Tejun Heo <tj@kernel.org>

sched_ext: Add vtime-ordered priority queue to dispatch_q's

Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
consumed or executed earlier. While this is sufficient when dsq

sched_ext: Add vtime-ordered priority queue to dispatch_q's

Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
consumed or executed earlier. While this is sufficient when dsq's are used
for simple staging areas for tasks which are ready to execute, it'd make
dsq's a lot more useful if they can implement custom ordering.

This patch adds a vtime-ordered priority queue to dsq's. When the BPF
scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
can specify the vtime tha the task should be inserted at and the task is
inserted into the priority queue in the dsq which is ordered according to
time_before64() comparison of the vtime values.

A DSQ can either be a FIFO or priority queue and automatically switches
between the two depending on whether scx_bpf_dispatch() or
scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
already has the other type queued is not allowed and triggers an ops error.
Built-in DSQs must always be FIFOs.

This makes it very easy for the BPF schedulers to implement proper vtime
based scheduling within each dsq very easy and efficient at a negligible
cost in terms of code complexity and overhead.

scx_simple and scx_example_flatcg are updated to default to weighted
vtime scheduling (the latter within each cgroup). FIFO scheduling can be
selected with -f option.

v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
led to unexpected starvations, DSQs now error out if both modes are
used at the same time and the built-in DSQs are no longer allowed to
be priority queues.

- Explicit type struct scx_dsq_node added to contain fields needed to be
linked on DSQs. This will be used to implement stateful iterator.

- Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
PRIQ mode. This confines PRIQ related complexities to the enqueue and
dequeue paths. Other paths only need to look at dsq->list. This will
also ease implementing BPF iterator.

- Print p->scx.dsq_flags in debug dump.

v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
flags in p->scx.flags. This led to flag corruption in some cases.

- Add comments explaining the interaction between using consumption of
p->scx.slice to determine vtime progress and yielding.

v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
migrations leading to some tasks being stalled for extended period of
time depending on how saturated the machine is. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

show more ...


# 245254f7 18-Jun-2024 David Vernet <dvernet@meta.com>

sched_ext: Implement sched_ext_ops.cpu_acquire/release()

Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Bei

sched_ext: Implement sched_ext_ops.cpu_acquire/release()

Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Being able to
monitor and act on these events are necessary for use cases includling
strict core-scheduling and latency management.

This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
former is invoked when a CPU becomes available to the BPF scheduler and the
opposite for the latter. This patch also implements
scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
requeueing of all tasks in the local dsq of the CPU so that the tasks can be
reassigned to other available CPUs.

scx_pair is updated to use .cpu_acquire/release() along with
%SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
is preempted by a higher priority scheduler class.

scx_qmap is updated to use .cpu_acquire/release() to empty the local
dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
that want to have a tight control over latency.

v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.

v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
access control through the verifier, so the qualifier isn't actually
operative and only gets in the way when interacting with various
helpers.

v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
from ops.cpu_release() nested inside ops.init() and other sleepable
operations.

Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

show more ...


# 81aae789 18-Jun-2024 Tejun Heo <tj@kernel.org>

sched_ext: Implement scx_bpf_kick_cpu() and task preemption support

It's often useful to wake up and/or trigger reschedule on other CPUs. This
patch adds scx_bpf_kick_cpu() kfunc helper that BPF sch

sched_ext: Implement scx_bpf_kick_cpu() and task preemption support

It's often useful to wake up and/or trigger reschedule on other CPUs. This
patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to
kick the target CPU into the scheduling path.

As a sched_ext task relinquishes its CPU only after its slice is depleted,
this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the
slice of the target CPU's current task to guarantee that sched_ext's
scheduling path runs on the CPU.

If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle
to guarantee that the target CPU will go through at least one full sched_ext
scheduling cycle after the kicking. This can be used to wake up idle CPUs
without incurring unnecessary overhead if it isn't currently idle.

As a demonstration of how backward compatibility can be supported using BPF
CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
__COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or
becomes a regular kicking otherwise. This allows schedulers to use the new
SCX_KICK_IDLE while maintaining support for older kernels. The plan is to
temporarily use compat helpers to ease API updates and drop them after a few
kernel releases.

v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
schedulers so that they can support kernels without SCX_KICK_IDLE.
This is useful as a demonstration of how new feature flags can be
added in a backward compatible way.

- kick_cpus_irq_workfn() reimplemented so that it touches the pending
cpumasks only as necessary to reduce kicking overhead on machines with
a lot of CPUs.

- tools/sched_ext/include/scx/compat.bpf.h added.

v4: - Move example scheduler to its own patch.

v3: - Make scx_example_central switch all tasks by default.

- Convert to BPF inline iterators.

v2: - Julia Lawall reported that scx_example_central can overflow the
dispatch buffer and malfunction. As scheduling for other CPUs can't be
handled by the automatic retry mechanism, fix by implementing an
explicit overflow and retry handling.

- Updated to use generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

show more ...


# 07814a94 18-Jun-2024 Tejun Heo <tj@kernel.org>

sched_ext: Print debug dump after an error exit

If a BPF scheduler triggers an error, the scheduler is aborted and the
system is reverted to the built-in scheduler. In the process, a lot of
informat

sched_ext: Print debug dump after an error exit

If a BPF scheduler triggers an error, the scheduler is aborted and the
system is reverted to the built-in scheduler. In the process, a lot of
information which may be useful for figuring out what happened can be lost.

This patch adds debug dump which captures information which may be useful
for debugging including runqueue and runnable thread states at the time of
failure. The following shows a debug dump after triggering the watchdog:

root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
stats : enq=1 dsp=0 delta=1 deq=0
stats : enq=90 dsp=90 delta=0 deq=0
stats : enq=156 dsp=156 delta=0 deq=0
stats : enq=218 dsp=218 delta=0 deq=0
stats : enq=255 dsp=255 delta=0 deq=0
stats : enq=271 dsp=271 delta=0 deq=0
stats : enq=284 dsp=284 delta=0 deq=0
stats : enq=293 dsp=293 delta=0 deq=0

DEBUG DUMP
================================================================================

kworker/u32:12[320] triggered exit kind 1026:
runnable task stall (stress[1530] failed to run for 6.841s)

Backtrace:
scx_watchdog_workfn+0x136/0x1c0
process_scheduled_works+0x2b5/0x600
worker_thread+0x269/0x360
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30

QMAP FIFO[0]:
QMAP FIFO[1]:
QMAP FIFO[2]: 1436
QMAP FIFO[3]:
QMAP FIFO[4]:

CPU states
----------

CPU 0 : nr_run=1 ops_qseq=244
curr=swapper/0[0] class=idle_sched_class

QMAP: dsp_idx=1 dsp_cnt=0

R stress[1530] -6841ms
scx_state/flags=3/0x1 ops_state/qseq=2/20
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

QMAP: force_local=0

asm_sysvec_apic_timer_interrupt+0x16/0x20

CPU 2 : nr_run=2 ops_qseq=142
curr=swapper/2[0] class=idle_sched_class

QMAP: dsp_idx=1 dsp_cnt=0

R sshd[1703] -5905ms
scx_state/flags=3/0x9 ops_state/qseq=2/88
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

QMAP: force_local=1

__x64_sys_ppoll+0xf6/0x120
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e

R fish[1539] -4141ms
scx_state/flags=3/0x9 ops_state/qseq=2/124
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

QMAP: force_local=1

futex_wait+0x60/0xe0
do_futex+0x109/0x180
__x64_sys_futex+0x117/0x190
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e

CPU 3 : nr_run=2 ops_qseq=162
curr=kworker/u32:12[320] class=ext_sched_class

QMAP: dsp_idx=1 dsp_cnt=0

*R kworker/u32:12[320] +0ms
scx_state/flags=3/0xd ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

QMAP: force_local=0

scx_dump_state+0x613/0x6f0
scx_ops_error_irq_workfn+0x1f/0x40
irq_work_run_list+0x82/0xd0
irq_work_run+0x14/0x30
__sysvec_irq_work+0x40/0x140
sysvec_irq_work+0x60/0x70
asm_sysvec_irq_work+0x16/0x20
scx_watchdog_workfn+0x15f/0x1c0
process_scheduled_works+0x2b5/0x600
worker_thread+0x269/0x360
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30

R kworker/3:2[1436] +0ms
scx_state/flags=3/0x9 ops_state/qseq=2/160
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=08

QMAP: force_local=0

kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30

CPU 7 : nr_run=0 ops_qseq=76
curr=swapper/7[0] class=idle_sched_class


================================================================================

EXIT: runnable task stall (stress[1530] failed to run for 6.841s)

It shows that CPU 3 was running the watchdog when it triggered the error
condition and the scx_qmap thread has been queued on CPU 0 for over 5
seconds but failed to run. It also prints out scx_qmap specific information
- e.g. which tasks are queued on each FIFO and so on using the dump_*() ops.
This dump has proved pretty useful for developing and debugging BPF
schedulers.

Debug dump is generated automatically when the BPF scheduler exits due to an
error. The debug buffer used in such cases is determined by
sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns
the available buffer, the output is truncated and marked accordingly.

Debug dump output can also be read through the sched_ext_dump tracepoint.
When read through the tracepoint, there is no length limit.

SysRq-D can be used to trigger debug dump at any time while a BPF scheduler
is loaded. This is non-destructive - the scheduler keeps running afterwards.
The output can be read through the sched_ext_dump tracepoint.

v2: - The size of exit debug dump buffer can now be customized using
sched_ext_ops.exit_dump_len.

- sched_ext_ops.dump*() added to enable dumping of BPF scheduler
specific information.

- Tracpoint output and SysRq-D triggering added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

show more ...


# 2a52ca7c 18-Jun-2024 Tejun Heo <tj@kernel.org>

sched_ext: Add scx_simple and scx_example_qmap example schedulers

Add two simple example BPF schedulers - simple and qmap.

* simple: In terms of scheduling, it behaves identical to not having any

sched_ext: Add scx_simple and scx_example_qmap example schedulers

Add two simple example BPF schedulers - simple and qmap.

* simple: In terms of scheduling, it behaves identical to not having any
operation implemented at all. The two operations it implements are only to
improve visibility and exit handling. On certain homogeneous
configurations, this actually can perform pretty well.

* qmap: A fixed five level priority scheduler to demonstrate queueing PIDs
on BPF maps for scheduling. While not very practical, this is useful as a
simple example and will be used to demonstrate different features.

v7: - Compat helpers stripped out in prepartion of upstreaming as the
upstreamed patchset will be the baselinfe. Utility macros that can be
used to implement compat features are kept.

- Explicitly disable map autoattach on struct_ops to avoid trying to
attach twice while maintaining compatbility with older libbpf.

v6: - Common header files reorganized and cleaned up. Compat helpers are
added to demonstrate how schedulers can maintain backward
compatibility with older kernels while making use of newly added
features.

- simple_select_cpu() added to keep track of the number of local
dispatches. This is needed because the default ops.select_cpu()
implementation is updated to dispatch directly and won't call
ops.enqueue().

- Updated to reflect the sched_ext API changes. Switching all tasks is
the default behavior now and scx_qmap supports partial switching when
`-p` is specified.

- tools/sched_ext/Kconfig dropped. This will be included in the doc
instead.

v5: - Improve Makefile. Build artifects are now collected into a separate
dir which change be changed. Install and help targets are added and
clean actually cleans everything.

- MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR()
and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss.

- Add scx_common.h which provides common utilities to user code such as
SCX_BUG[_ON]() and RESIZE_ARRAY().

- Use SCX_BUG[_ON]() to simplify error handling.

v4: - Dropped _example prefix from scheduler names.

v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit
to ease later additions. Comment updates.

- Added declarations for BPF inline iterators. In the future, hopefully,
these will be consolidated into a generic BPF header so that they
don't need to be replicated here.

v2: - Updated with the generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

show more ...