d07be814 | 09-Jan-2025 |
Changwoo Min <changwoo@igalia.com> |
sched_ext: Add time helpers for BPF schedulers
The following functions are added for BPF schedulers: - time_delta(after, before) - time_after(a, b) - time_before(a, b) - time_after_eq(a, b) - time_b
sched_ext: Add time helpers for BPF schedulers
The following functions are added for BPF schedulers: - time_delta(after, before) - time_after(a, b) - time_before(a, b) - time_after_eq(a, b) - time_before_eq(a, b) - time_in_range(a, b, c) - time_in_range_open(a, b, c)
Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
5cbb3028 | 11-Nov-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "c
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume":
- ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the third set of renames. Compatibility is maintained by:
- The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds.
- scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT macros as they were introduced during v6.12 cycle. Wrap new API in __COMPAT macros too and trigger build errors on both __COMPAT prefixed and naked usages of the old names.
The compat features will be dropped after v6.15.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
show more ...
|
5209c03c | 11-Nov-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume":
- ops.dis
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume":
- ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the second rename. Compatibility is maintained by:
- The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds.
The compat features will be dropped after v6.15.
v2: Comment and documentation updates.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
show more ...
|
cc26abb1 | 11-Nov-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume":
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume":
- ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the first set of renames. Compatibility is maintained by:
- The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds.
The compat features will be dropped after v6.15.
v2: Documentation updates.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
show more ...
|
fcbc4235 | 02-Oct-2024 |
Vishal Chourasia <vishalc@linux.ibm.com> |
sched_ext: Add __weak markers to BPF helper function decalarations
Fix build errors by adding __weak markers to BPF helper function declarations in header files. This resolves static assertion failu
sched_ext: Add __weak markers to BPF helper function decalarations
Fix build errors by adding __weak markers to BPF helper function declarations in header files. This resolves static assertion failures in scx_qmap.bpf.c and scx_flatcg.bpf.c where functions like scx_bpf_dispatch_from_dsq_set_slice, scx_bpf_dispatch_from_dsq_set_vtime, and scx_bpf_task_cgroup were missing the __weak attribute.
[1] https://lore.kernel.org/all/ZvvfUqRNM4-jYQzH@linux.ibm.com
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
a748db0c | 26-Sep-2024 |
Tejun Heo <tj@kernel.org> |
tools/sched_ext: Receive misc updates from SCX repo
Receive misc tools/sched_ext updates from https://github.com/sched-ext/scx to sync userspace bits.
- LSP macros to help language servers.
- bpf_
tools/sched_ext: Receive misc updates from SCX repo
Receive misc tools/sched_ext updates from https://github.com/sched-ext/scx to sync userspace bits.
- LSP macros to help language servers.
- bpf_cpumask_weight() declaration and cast_mask() helper.
- Cosmetic updates to scx_flatcg.bpf.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
650ba21b | 09-Jul-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Implement DSQ iterator
DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ
sched_ext: Implement DSQ iterator
DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ iterator.
- Allows iterating tasks queued on a DSQ in the dispatch order or reverse from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs directly.
- Has ordering guarantee where only tasks which were already queued when the iteration started are visible and consumable during the iteration.
v5: - Add a comment to the naked list_empty(&dsq->list) test in consume_dispatch_q() to explain the reasoning behind the lockless test and by extension why nldsq_next_task() isn't used there.
- scx_qmap changes separated into its own patch.
v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong type for the last argument (bool rev instead of u64 flags). Fix it.
v3: - Alexei pointed out that the iterator is too big to allocate on stack. Added a prep patch to reduce the size of the cursor. Now bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on 64bit.
- u32_before() comparison factored out.
v2: - scx_bpf_consume_task() is separated out into a separate patch.
- DSQ seq and iter flags don't need to be u64. Use u32.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org
show more ...
|
6203ef73 | 08-Jul-2024 |
Hongyan Xia <hongyan.xia2@arm.com> |
sched/ext: Add BPF function to fetch rq
rq contains many useful fields to implement a custom scheduler. For example, various clock signals like clock_task and clock_pelt can be used to track load. I
sched/ext: Add BPF function to fetch rq
rq contains many useful fields to implement a custom scheduler. For example, various clock signals like clock_task and clock_pelt can be used to track load. It also contains stats in other sched_classes, which are useful to drive scheduling decisions in ext.
tj: Put the new helper below scx_bpf_task_*() helpers.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
06e51be3 | 18-Jun-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Add vtime-ordered priority queue to dispatch_q's
Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq
sched_ext: Add vtime-ordered priority queue to dispatch_q's
Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq's are used for simple staging areas for tasks which are ready to execute, it'd make dsq's a lot more useful if they can implement custom ordering.
This patch adds a vtime-ordered priority queue to dsq's. When the BPF scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it can specify the vtime tha the task should be inserted at and the task is inserted into the priority queue in the dsq which is ordered according to time_before64() comparison of the vtime values.
A DSQ can either be a FIFO or priority queue and automatically switches between the two depending on whether scx_bpf_dispatch() or scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ already has the other type queued is not allowed and triggers an ops error. Built-in DSQs must always be FIFOs.
This makes it very easy for the BPF schedulers to implement proper vtime based scheduling within each dsq very easy and efficient at a negligible cost in terms of code complexity and overhead.
scx_simple and scx_example_flatcg are updated to default to weighted vtime scheduling (the latter within each cgroup). FIFO scheduling can be selected with -f option.
v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes led to unexpected starvations, DSQs now error out if both modes are used at the same time and the built-in DSQs are no longer allowed to be priority queues.
- Explicit type struct scx_dsq_node added to contain fields needed to be linked on DSQs. This will be used to implement stateful iterator.
- Tasks are now always linked on dsq->list whether the DSQ is in FIFO or PRIQ mode. This confines PRIQ related complexities to the enqueue and dequeue paths. Other paths only need to look at dsq->list. This will also ease implementing BPF iterator.
- Print p->scx.dsq_flags in debug dump.
v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own p->scx.dsq_flags. The flag is protected with the dsq lock unlike other flags in p->scx.flags. This led to flag corruption in some cases.
- Add comments explaining the interaction between using consumption of p->scx.slice to determine vtime progress and yielding.
v2: - p->scx.dsq_vtime was not initialized on load or across cgroup migrations leading to some tasks being stalled for extended period of time depending on how saturated the machine is. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
show more ...
|
60c27fb5 | 18-Jun-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Implement sched_ext_ops.cpu_online/offline()
Add ops.cpu_online/offline() which are invoked when CPUs come online and offline respectively. As the enqueue path already automatically bypas
sched_ext: Implement sched_ext_ops.cpu_online/offline()
Add ops.cpu_online/offline() which are invoked when CPUs come online and offline respectively. As the enqueue path already automatically bypasses tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed to see tasks only on CPUs which are between online() and offline().
If the BPF scheduler doesn't implement ops.cpu_online/offline(), the scheduler is automatically exited with SCX_ECODE_RESTART | SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support trivially by simply reinitializing and reloading the scheduler.
scx_qmap is updated to print out online CPUs on hotplug events. Other schedulers are updated to restart based on ecode.
v3: - The previous implementation added @reason to sched_class.rq_on/offline() to distinguish between CPU hotplug events and topology updates. This was buggy and fragile as the methods are skipped if the current state equals the target state. Instead, add scx_rq_[de]activate() which are directly called from sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to sleep which can be useful.
- ops.dispatch() could be called on a CPU that the BPF scheduler was told to be offline. The dispatch patch is updated to bypass in such cases.
v2: - To accommodate lock ordering change between scx_cgroup_rwsem and cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI block and enabled eariler during scx_ope_enable() so that cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
- Auto exit with ECODE added.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
show more ...
|
245254f7 | 18-Jun-2024 |
David Vernet <dvernet@meta.com> |
sched_ext: Implement sched_ext_ops.cpu_acquire/release()
Scheduler classes are strictly ordered and when a higher priority class has tasks to run, the lower priority ones lose access to the CPU. Bei
sched_ext: Implement sched_ext_ops.cpu_acquire/release()
Scheduler classes are strictly ordered and when a higher priority class has tasks to run, the lower priority ones lose access to the CPU. Being able to monitor and act on these events are necessary for use cases includling strict core-scheduling and latency management.
This patch adds two operations ops.cpu_acquire() and .cpu_release(). The former is invoked when a CPU becomes available to the BPF scheduler and the opposite for the latter. This patch also implements scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger requeueing of all tasks in the local dsq of the CPU so that the tasks can be reassigned to other available CPUs.
scx_pair is updated to use .cpu_acquire/release() along with %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU is preempted by a higher priority scheduler class.
scx_qmap is updated to use .cpu_acquire/release() to empty the local dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers that want to have a tight control over latency.
v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.
v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers.
v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local() from ops.cpu_release() nested inside ops.init() and other sleepable operations.
Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
show more ...
|
81aae789 | 18-Jun-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Implement scx_bpf_kick_cpu() and task preemption support
It's often useful to wake up and/or trigger reschedule on other CPUs. This patch adds scx_bpf_kick_cpu() kfunc helper that BPF sch
sched_ext: Implement scx_bpf_kick_cpu() and task preemption support
It's often useful to wake up and/or trigger reschedule on other CPUs. This patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to kick the target CPU into the scheduling path.
As a sched_ext task relinquishes its CPU only after its slice is depleted, this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the slice of the target CPU's current task to guarantee that sched_ext's scheduling path runs on the CPU.
If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle to guarantee that the target CPU will go through at least one full sched_ext scheduling cycle after the kicking. This can be used to wake up idle CPUs without incurring unnecessary overhead if it isn't currently idle.
As a demonstration of how backward compatibility can be supported using BPF CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides __COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or becomes a regular kicking otherwise. This allows schedulers to use the new SCX_KICK_IDLE while maintaining support for older kernels. The plan is to temporarily use compat helpers to ease API updates and drop them after a few kernel releases.
v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for schedulers so that they can support kernels without SCX_KICK_IDLE. This is useful as a demonstration of how new feature flags can be added in a backward compatible way.
- kick_cpus_irq_workfn() reimplemented so that it touches the pending cpumasks only as necessary to reduce kicking overhead on machines with a lot of CPUs.
- tools/sched_ext/include/scx/compat.bpf.h added.
v4: - Move example scheduler to its own patch.
v3: - Make scx_example_central switch all tasks by default.
- Convert to BPF inline iterators.
v2: - Julia Lawall reported that scx_example_central can overflow the dispatch buffer and malfunction. As scheduling for other CPUs can't be handled by the automatic retry mechanism, fix by implementing an explicit overflow and retry handling.
- Updated to use generic BPF cpumask helpers.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
show more ...
|
07814a94 | 18-Jun-2024 |
Tejun Heo <tj@kernel.org> |
sched_ext: Print debug dump after an error exit
If a BPF scheduler triggers an error, the scheduler is aborted and the system is reverted to the built-in scheduler. In the process, a lot of informat
sched_ext: Print debug dump after an error exit
If a BPF scheduler triggers an error, the scheduler is aborted and the system is reverted to the built-in scheduler. In the process, a lot of information which may be useful for figuring out what happened can be lost.
This patch adds debug dump which captures information which may be useful for debugging including runqueue and runnable thread states at the time of failure. The following shows a debug dump after triggering the watchdog:
root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100 stats : enq=1 dsp=0 delta=1 deq=0 stats : enq=90 dsp=90 delta=0 deq=0 stats : enq=156 dsp=156 delta=0 deq=0 stats : enq=218 dsp=218 delta=0 deq=0 stats : enq=255 dsp=255 delta=0 deq=0 stats : enq=271 dsp=271 delta=0 deq=0 stats : enq=284 dsp=284 delta=0 deq=0 stats : enq=293 dsp=293 delta=0 deq=0
DEBUG DUMP ================================================================================
kworker/u32:12[320] triggered exit kind 1026: runnable task stall (stress[1530] failed to run for 6.841s)
Backtrace: scx_watchdog_workfn+0x136/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30
QMAP FIFO[0]: QMAP FIFO[1]: QMAP FIFO[2]: 1436 QMAP FIFO[3]: QMAP FIFO[4]:
CPU states ----------
CPU 0 : nr_run=1 ops_qseq=244 curr=swapper/0[0] class=idle_sched_class
QMAP: dsp_idx=1 dsp_cnt=0
R stress[1530] -6841ms scx_state/flags=3/0x1 ops_state/qseq=2/20 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff
QMAP: force_local=0
asm_sysvec_apic_timer_interrupt+0x16/0x20
CPU 2 : nr_run=2 ops_qseq=142 curr=swapper/2[0] class=idle_sched_class
QMAP: dsp_idx=1 dsp_cnt=0
R sshd[1703] -5905ms scx_state/flags=3/0x9 ops_state/qseq=2/88 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff
QMAP: force_local=1
__x64_sys_ppoll+0xf6/0x120 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e
R fish[1539] -4141ms scx_state/flags=3/0x9 ops_state/qseq=2/124 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff
QMAP: force_local=1
futex_wait+0x60/0xe0 do_futex+0x109/0x180 __x64_sys_futex+0x117/0x190 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e
CPU 3 : nr_run=2 ops_qseq=162 curr=kworker/u32:12[320] class=ext_sched_class
QMAP: dsp_idx=1 dsp_cnt=0
*R kworker/u32:12[320] +0ms scx_state/flags=3/0xd ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff
QMAP: force_local=0
scx_dump_state+0x613/0x6f0 scx_ops_error_irq_workfn+0x1f/0x40 irq_work_run_list+0x82/0xd0 irq_work_run+0x14/0x30 __sysvec_irq_work+0x40/0x140 sysvec_irq_work+0x60/0x70 asm_sysvec_irq_work+0x16/0x20 scx_watchdog_workfn+0x15f/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30
R kworker/3:2[1436] +0ms scx_state/flags=3/0x9 ops_state/qseq=2/160 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=08
QMAP: force_local=0
kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30
CPU 7 : nr_run=0 ops_qseq=76 curr=swapper/7[0] class=idle_sched_class
================================================================================
EXIT: runnable task stall (stress[1530] failed to run for 6.841s)
It shows that CPU 3 was running the watchdog when it triggered the error condition and the scx_qmap thread has been queued on CPU 0 for over 5 seconds but failed to run. It also prints out scx_qmap specific information - e.g. which tasks are queued on each FIFO and so on using the dump_*() ops. This dump has proved pretty useful for developing and debugging BPF schedulers.
Debug dump is generated automatically when the BPF scheduler exits due to an error. The debug buffer used in such cases is determined by sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns the available buffer, the output is truncated and marked accordingly.
Debug dump output can also be read through the sched_ext_dump tracepoint. When read through the tracepoint, there is no length limit.
SysRq-D can be used to trigger debug dump at any time while a BPF scheduler is loaded. This is non-destructive - the scheduler keeps running afterwards. The output can be read through the sched_ext_dump tracepoint.
v2: - The size of exit debug dump buffer can now be customized using sched_ext_ops.exit_dump_len.
- sched_ext_ops.dump*() added to enable dumping of BPF scheduler specific information.
- Tracpoint output and SysRq-D triggering added.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
show more ...
|