xref: /linux/Documentation/scheduler/sched-ext.rst (revision 55d0969c451159cff86949b38c39171cab962069)
1==========================
2Extensible Scheduler Class
3==========================
4
5sched_ext is a scheduler class whose behavior can be defined by a set of BPF
6programs - the BPF scheduler.
7
8* sched_ext exports a full scheduling interface so that any scheduling
9  algorithm can be implemented on top.
10
11* The BPF scheduler can group CPUs however it sees fit and schedule them
12  together, as tasks aren't tied to specific CPUs at the time of wakeup.
13
14* The BPF scheduler can be turned on and off dynamically anytime.
15
16* The system integrity is maintained no matter what the BPF scheduler does.
17  The default scheduling behavior is restored anytime an error is detected,
18  a runnable task stalls, or on invoking the SysRq key sequence
19  :kbd:`SysRq-S`.
20
21* When the BPF scheduler triggers an error, debug information is dumped to
22  aid debugging. The debug dump is passed to and printed out by the
23  scheduler binary. The debug dump can also be accessed through the
24  `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
25  triggers a debug dump. This doesn't terminate the BPF scheduler and can
26  only be read through the tracepoint.
27
28Switching to and from sched_ext
29===============================
30
31``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
32``tools/sched_ext`` contains the example schedulers. The following config
33options should be enabled to use sched_ext:
34
35.. code-block:: none
36
37    CONFIG_BPF=y
38    CONFIG_SCHED_CLASS_EXT=y
39    CONFIG_BPF_SYSCALL=y
40    CONFIG_BPF_JIT=y
41    CONFIG_DEBUG_INFO_BTF=y
42    CONFIG_BPF_JIT_ALWAYS_ON=y
43    CONFIG_BPF_JIT_DEFAULT_ON=y
44    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
45    CONFIG_PAHOLE_HAS_BTF_TAG=y
46
47sched_ext is used only when the BPF scheduler is loaded and running.
48
49If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
50treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
51loaded.
52
53When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
54in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
55``SCHED_EXT`` tasks are scheduled by sched_ext.
56
57However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
58set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
59by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
60``SCHED_IDLE`` policies are scheduled by CFS.
61
62Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
63detection of any internal error including stalled runnable tasks aborts the
64BPF scheduler and reverts all tasks back to CFS.
65
66.. code-block:: none
67
68    # make -j16 -C tools/sched_ext
69    # tools/sched_ext/build/bin/scx_simple
70    local=0 global=3
71    local=5 global=24
72    local=9 global=44
73    local=13 global=56
74    local=17 global=72
75    ^CEXIT: BPF scheduler unregistered
76
77The current status of the BPF scheduler can be determined as follows:
78
79.. code-block:: none
80
81    # cat /sys/kernel/sched_ext/state
82    enabled
83    # cat /sys/kernel/sched_ext/root/ops
84    simple
85
86You can check if any BPF scheduler has ever been loaded since boot by examining
87this monotonically incrementing counter (a value of zero indicates that no BPF
88scheduler has been loaded):
89
90.. code-block:: none
91
92    # cat /sys/kernel/sched_ext/enable_seq
93    1
94
95``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
96detailed information:
97
98.. code-block:: none
99
100    # tools/sched_ext/scx_show_state.py
101    ops           : simple
102    enabled       : 1
103    switching_all : 1
104    switched_all  : 1
105    enable_state  : enabled (2)
106    bypass_depth  : 0
107    nr_rejected   : 0
108    enable_seq    : 1
109
110If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
111be determined as follows:
112
113.. code-block:: none
114
115    # grep ext /proc/self/sched
116    ext.enabled                                  :                    1
117
118The Basics
119==========
120
121Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
122programs that implement ``struct sched_ext_ops``. The only mandatory field
123is ``ops.name`` which must be a valid BPF object name. All operations are
124optional. The following modified excerpt is from
125``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
126
127.. code-block:: c
128
129    /*
130     * Decide which CPU a task should be migrated to before being
131     * enqueued (either at wakeup, fork time, or exec time). If an
132     * idle core is found by the default ops.select_cpu() implementation,
133     * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
134     * ops.enqueue() callback.
135     *
136     * Note that this implementation has exactly the same behavior as the
137     * default ops.select_cpu implementation. The behavior of the scheduler
138     * would be exactly same if the implementation just didn't define the
139     * simple_select_cpu() struct_ops prog.
140     */
141    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
142                       s32 prev_cpu, u64 wake_flags)
143    {
144            s32 cpu;
145            /* Need to initialize or the BPF verifier will reject the program */
146            bool direct = false;
147
148            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
149
150            if (direct)
151                    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
152
153            return cpu;
154    }
155
156    /*
157     * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
158     * callback will only be invoked if we failed to find a core to dispatch
159     * to in ops.select_cpu() above.
160     *
161     * Note that this implementation has exactly the same behavior as the
162     * default ops.enqueue implementation, which just dispatches the task
163     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
164     * if the implementation just didn't define the simple_enqueue struct_ops
165     * prog.
166     */
167    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
168    {
169            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
170    }
171
172    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
173    {
174            /*
175             * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
176             * SCHED_BATCH tasks should use sched_ext.
177             */
178            return 0;
179    }
180
181    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
182    {
183            exit_type = ei->type;
184    }
185
186    SEC(".struct_ops")
187    struct sched_ext_ops simple_ops = {
188            .select_cpu             = (void *)simple_select_cpu,
189            .enqueue                = (void *)simple_enqueue,
190            .init                   = (void *)simple_init,
191            .exit                   = (void *)simple_exit,
192            .name                   = "simple",
193    };
194
195Dispatch Queues
196---------------
197
198To match the impedance between the scheduler core and the BPF scheduler,
199sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
200priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
201and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
202an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
203``scx_bpf_destroy_dsq()``.
204
205A CPU always executes a task from its local DSQ. A task is "dispatched" to a
206DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
207local DSQ.
208
209When a CPU is looking for the next task to run, if the local DSQ is not
210empty, the first task is picked. Otherwise, the CPU tries to consume the
211global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
212is invoked.
213
214Scheduling Cycle
215----------------
216
217The following briefly shows how a waking task is scheduled and executed.
218
2191. When a task is waking up, ``ops.select_cpu()`` is the first operation
220   invoked. This serves two purposes. First, CPU selection optimization
221   hint. Second, waking up the selected CPU if idle.
222
223   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
224   binding. The actual decision is made at the last step of scheduling.
225   However, there is a small performance gain if the CPU
226   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
227
228   A side-effect of selecting a CPU is waking it up from idle. While a BPF
229   scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
230   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
231
232   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
233   calling ``scx_bpf_dispatch()``. If the task is dispatched to
234   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
235   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
236   Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
237   ``ops.enqueue()`` callback to be skipped.
238
239   Note that the scheduler core will ignore an invalid CPU selection, for
240   example, if it's outside the allowed cpumask of the task.
241
2422. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
243   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
244   can make one of the following decisions:
245
246   * Immediately dispatch the task to either the global or local DSQ by
247     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
248     ``SCX_DSQ_LOCAL``, respectively.
249
250   * Immediately dispatch the task to a custom DSQ by calling
251     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
252
253   * Queue the task on the BPF side.
254
2553. When a CPU is ready to schedule, it first looks at its local DSQ. If
256   empty, it then looks at the global DSQ. If there still isn't a task to
257   run, ``ops.dispatch()`` is invoked which can use the following two
258   functions to populate the local DSQ.
259
260   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
261     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
262     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
263     currently can't be called with BPF locks held, this is being worked on
264     and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
265     rather than performing them immediately. There can be up to
266     ``ops.dispatch_max_batch`` pending tasks.
267
268   * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
269     to the dispatching DSQ. This function cannot be called with any BPF
270     locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
271     before trying to consume the specified DSQ.
272
2734. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
274   the CPU runs the first one. If empty, the following steps are taken:
275
276   * Try to consume the global DSQ. If successful, run the task.
277
278   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
279
280   * If the previous task is an SCX task and still runnable, keep executing
281     it (see ``SCX_OPS_ENQ_LAST``).
282
283   * Go idle.
284
285Note that the BPF scheduler can always choose to dispatch tasks immediately
286in ``ops.enqueue()`` as illustrated in the above simple example. If only the
287built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
288a task is never queued on the BPF scheduler and both the local and global
289DSQs are consumed automatically.
290
291``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
292``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
293``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
294dispatching, and must be dispatched to with ``scx_bpf_dispatch()``.  See the
295function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
296more information.
297
298Where to Look
299=============
300
301* ``include/linux/sched/ext.h`` defines the core data structures, ops table
302  and constants.
303
304* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
305  The functions prefixed with ``scx_bpf_`` can be called from the BPF
306  scheduler.
307
308* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
309
310  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
311    custom DSQ.
312
313  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
314    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
315
316ABI Instability
317===============
318
319The APIs provided by sched_ext to BPF schedulers programs have no stability
320guarantees. This includes the ops table callbacks and constants defined in
321``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
322``kernel/sched/ext.c``.
323
324While we will attempt to provide a relatively stable API surface when
325possible, they are subject to change without warning between kernel
326versions.
327