xref: /linux/Documentation/scheduler/sched-ext.rst (revision d86adb4fc0655a0867da811d000df75d2a325ef6)
1==========================
2Extensible Scheduler Class
3==========================
4
5sched_ext is a scheduler class whose behavior can be defined by a set of BPF
6programs - the BPF scheduler.
7
8* sched_ext exports a full scheduling interface so that any scheduling
9  algorithm can be implemented on top.
10
11* The BPF scheduler can group CPUs however it sees fit and schedule them
12  together, as tasks aren't tied to specific CPUs at the time of wakeup.
13
14* The BPF scheduler can be turned on and off dynamically anytime.
15
16* The system integrity is maintained no matter what the BPF scheduler does.
17  The default scheduling behavior is restored anytime an error is detected,
18  a runnable task stalls, or on invoking the SysRq key sequence
19  :kbd:`SysRq-S`.
20
21* When the BPF scheduler triggers an error, debug information is dumped to
22  aid debugging. The debug dump is passed to and printed out by the
23  scheduler binary. The debug dump can also be accessed through the
24  `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
25  triggers a debug dump. This doesn't terminate the BPF scheduler and can
26  only be read through the tracepoint.
27
28Switching to and from sched_ext
29===============================
30
31``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
32``tools/sched_ext`` contains the example schedulers. The following config
33options should be enabled to use sched_ext:
34
35.. code-block:: none
36
37    CONFIG_BPF=y
38    CONFIG_SCHED_CLASS_EXT=y
39    CONFIG_BPF_SYSCALL=y
40    CONFIG_BPF_JIT=y
41    CONFIG_DEBUG_INFO_BTF=y
42    CONFIG_BPF_JIT_ALWAYS_ON=y
43    CONFIG_BPF_JIT_DEFAULT_ON=y
44    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
45    CONFIG_PAHOLE_HAS_BTF_TAG=y
46
47sched_ext is used only when the BPF scheduler is loaded and running.
48
49If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
50treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
51loaded. On load, such tasks will be switched to and scheduled by sched_ext.
52
53The BPF scheduler can choose to schedule all normal and lower class tasks by
54calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this
55case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
56``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
57this mode can be selected with the ``-a`` option.
58
59Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
60detection of any internal error including stalled runnable tasks aborts the
61BPF scheduler and reverts all tasks back to CFS.
62
63.. code-block:: none
64
65    # make -j16 -C tools/sched_ext
66    # tools/sched_ext/scx_simple
67    local=0 global=3
68    local=5 global=24
69    local=9 global=44
70    local=13 global=56
71    local=17 global=72
72    ^CEXIT: BPF scheduler unregistered
73
74The current status of the BPF scheduler can be determined as follows:
75
76.. code-block:: none
77
78    # cat /sys/kernel/sched_ext/state
79    enabled
80    # cat /sys/kernel/sched_ext/root/ops
81    simple
82
83``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
84detailed information:
85
86.. code-block:: none
87
88    # tools/sched_ext/scx_show_state.py
89    ops           : simple
90    enabled       : 1
91    switching_all : 1
92    switched_all  : 1
93    enable_state  : enabled (2)
94    bypass_depth  : 0
95    nr_rejected   : 0
96
97If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
98be determined as follows:
99
100.. code-block:: none
101
102    # grep ext /proc/self/sched
103    ext.enabled                                  :                    1
104
105The Basics
106==========
107
108Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
109programs that implement ``struct sched_ext_ops``. The only mandatory field
110is ``ops.name`` which must be a valid BPF object name. All operations are
111optional. The following modified excerpt is from
112``tools/sched/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
113
114.. code-block:: c
115
116    /*
117     * Decide which CPU a task should be migrated to before being
118     * enqueued (either at wakeup, fork time, or exec time). If an
119     * idle core is found by the default ops.select_cpu() implementation,
120     * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
121     * ops.enqueue() callback.
122     *
123     * Note that this implementation has exactly the same behavior as the
124     * default ops.select_cpu implementation. The behavior of the scheduler
125     * would be exactly same if the implementation just didn't define the
126     * simple_select_cpu() struct_ops prog.
127     */
128    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
129                       s32 prev_cpu, u64 wake_flags)
130    {
131            s32 cpu;
132            /* Need to initialize or the BPF verifier will reject the program */
133            bool direct = false;
134
135            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
136
137            if (direct)
138                    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
139
140            return cpu;
141    }
142
143    /*
144     * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
145     * callback will only be invoked if we failed to find a core to dispatch
146     * to in ops.select_cpu() above.
147     *
148     * Note that this implementation has exactly the same behavior as the
149     * default ops.enqueue implementation, which just dispatches the task
150     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
151     * if the implementation just didn't define the simple_enqueue struct_ops
152     * prog.
153     */
154    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
155    {
156            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
157    }
158
159    s32 BPF_STRUCT_OPS(simple_init)
160    {
161            /*
162             * All SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH tasks should
163             * use sched_ext.
164             */
165            scx_bpf_switch_all();
166            return 0;
167    }
168
169    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
170    {
171            exit_type = ei->type;
172    }
173
174    SEC(".struct_ops")
175    struct sched_ext_ops simple_ops = {
176            .select_cpu             = (void *)simple_select_cpu,
177            .enqueue                = (void *)simple_enqueue,
178            .init                   = (void *)simple_init,
179            .exit                   = (void *)simple_exit,
180            .name                   = "simple",
181    };
182
183Dispatch Queues
184---------------
185
186To match the impedance between the scheduler core and the BPF scheduler,
187sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
188priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
189and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
190an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
191``scx_bpf_destroy_dsq()``.
192
193A CPU always executes a task from its local DSQ. A task is "dispatched" to a
194DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
195local DSQ.
196
197When a CPU is looking for the next task to run, if the local DSQ is not
198empty, the first task is picked. Otherwise, the CPU tries to consume the
199global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
200is invoked.
201
202Scheduling Cycle
203----------------
204
205The following briefly shows how a waking task is scheduled and executed.
206
2071. When a task is waking up, ``ops.select_cpu()`` is the first operation
208   invoked. This serves two purposes. First, CPU selection optimization
209   hint. Second, waking up the selected CPU if idle.
210
211   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
212   binding. The actual decision is made at the last step of scheduling.
213   However, there is a small performance gain if the CPU
214   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
215
216   A side-effect of selecting a CPU is waking it up from idle. While a BPF
217   scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
218   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
219
220   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
221   calling ``scx_bpf_dispatch()``. If the task is dispatched to
222   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
223   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
224   Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
225   ``ops.enqueue()`` callback to be skipped.
226
227   Note that the scheduler core will ignore an invalid CPU selection, for
228   example, if it's outside the allowed cpumask of the task.
229
2302. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
231   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
232   can make one of the following decisions:
233
234   * Immediately dispatch the task to either the global or local DSQ by
235     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
236     ``SCX_DSQ_LOCAL``, respectively.
237
238   * Immediately dispatch the task to a custom DSQ by calling
239     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
240
241   * Queue the task on the BPF side.
242
2433. When a CPU is ready to schedule, it first looks at its local DSQ. If
244   empty, it then looks at the global DSQ. If there still isn't a task to
245   run, ``ops.dispatch()`` is invoked which can use the following two
246   functions to populate the local DSQ.
247
248   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
249     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
250     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
251     currently can't be called with BPF locks held, this is being worked on
252     and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
253     rather than performing them immediately. There can be up to
254     ``ops.dispatch_max_batch`` pending tasks.
255
256   * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
257     to the dispatching DSQ. This function cannot be called with any BPF
258     locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
259     before trying to consume the specified DSQ.
260
2614. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
262   the CPU runs the first one. If empty, the following steps are taken:
263
264   * Try to consume the global DSQ. If successful, run the task.
265
266   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
267
268   * If the previous task is an SCX task and still runnable, keep executing
269     it (see ``SCX_OPS_ENQ_LAST``).
270
271   * Go idle.
272
273Note that the BPF scheduler can always choose to dispatch tasks immediately
274in ``ops.enqueue()`` as illustrated in the above simple example. If only the
275built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
276a task is never queued on the BPF scheduler and both the local and global
277DSQs are consumed automatically.
278
279``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
280``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
281``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
282dispatching, and must be dispatched to with ``scx_bpf_dispatch()``.  See the
283function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
284more information.
285
286Where to Look
287=============
288
289* ``include/linux/sched/ext.h`` defines the core data structures, ops table
290  and constants.
291
292* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
293  The functions prefixed with ``scx_bpf_`` can be called from the BPF
294  scheduler.
295
296* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
297
298  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
299    custom DSQ.
300
301  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
302    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
303
304ABI Instability
305===============
306
307The APIs provided by sched_ext to BPF schedulers programs have no stability
308guarantees. This includes the ops table callbacks and constants defined in
309``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
310``kernel/sched/ext.c``.
311
312While we will attempt to provide a relatively stable API surface when
313possible, they are subject to change without warning between kernel
314versions.
315