1========================== 2Extensible Scheduler Class 3========================== 4 5sched_ext is a scheduler class whose behavior can be defined by a set of BPF 6programs - the BPF scheduler. 7 8* sched_ext exports a full scheduling interface so that any scheduling 9 algorithm can be implemented on top. 10 11* The BPF scheduler can group CPUs however it sees fit and schedule them 12 together, as tasks aren't tied to specific CPUs at the time of wakeup. 13 14* The BPF scheduler can be turned on and off dynamically anytime. 15 16* The system integrity is maintained no matter what the BPF scheduler does. 17 The default scheduling behavior is restored anytime an error is detected, 18 a runnable task stalls, or on invoking the SysRq key sequence 19 :kbd:`SysRq-S`. 20 21* When the BPF scheduler triggers an error, debug information is dumped to 22 aid debugging. The debug dump is passed to and printed out by the 23 scheduler binary. The debug dump can also be accessed through the 24 `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D` 25 triggers a debug dump. This doesn't terminate the BPF scheduler and can 26 only be read through the tracepoint. 27 28Switching to and from sched_ext 29=============================== 30 31``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and 32``tools/sched_ext`` contains the example schedulers. The following config 33options should be enabled to use sched_ext: 34 35.. code-block:: none 36 37 CONFIG_BPF=y 38 CONFIG_SCHED_CLASS_EXT=y 39 CONFIG_BPF_SYSCALL=y 40 CONFIG_BPF_JIT=y 41 CONFIG_DEBUG_INFO_BTF=y 42 CONFIG_BPF_JIT_ALWAYS_ON=y 43 CONFIG_BPF_JIT_DEFAULT_ON=y 44 CONFIG_PAHOLE_HAS_SPLIT_BTF=y 45 CONFIG_PAHOLE_HAS_BTF_TAG=y 46 47sched_ext is used only when the BPF scheduler is loaded and running. 48 49If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be 50treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is 51loaded. On load, such tasks will be switched to and scheduled by sched_ext. 52 53The BPF scheduler can choose to schedule all normal and lower class tasks by 54calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this 55case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and 56``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers, 57this mode can be selected with the ``-a`` option. 58 59Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or 60detection of any internal error including stalled runnable tasks aborts the 61BPF scheduler and reverts all tasks back to CFS. 62 63.. code-block:: none 64 65 # make -j16 -C tools/sched_ext 66 # tools/sched_ext/scx_simple 67 local=0 global=3 68 local=5 global=24 69 local=9 global=44 70 local=13 global=56 71 local=17 global=72 72 ^CEXIT: BPF scheduler unregistered 73 74The current status of the BPF scheduler can be determined as follows: 75 76.. code-block:: none 77 78 # cat /sys/kernel/sched_ext/state 79 enabled 80 # cat /sys/kernel/sched_ext/root/ops 81 simple 82 83``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 84detailed information: 85 86.. code-block:: none 87 88 # tools/sched_ext/scx_show_state.py 89 ops : simple 90 enabled : 1 91 switching_all : 1 92 switched_all : 1 93 enable_state : enabled (2) 94 bypass_depth : 0 95 nr_rejected : 0 96 97If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can 98be determined as follows: 99 100.. code-block:: none 101 102 # grep ext /proc/self/sched 103 ext.enabled : 1 104 105The Basics 106========== 107 108Userspace can implement an arbitrary BPF scheduler by loading a set of BPF 109programs that implement ``struct sched_ext_ops``. The only mandatory field 110is ``ops.name`` which must be a valid BPF object name. All operations are 111optional. The following modified excerpt is from 112``tools/sched/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. 113 114.. code-block:: c 115 116 /* 117 * Decide which CPU a task should be migrated to before being 118 * enqueued (either at wakeup, fork time, or exec time). If an 119 * idle core is found by the default ops.select_cpu() implementation, 120 * then dispatch the task directly to SCX_DSQ_LOCAL and skip the 121 * ops.enqueue() callback. 122 * 123 * Note that this implementation has exactly the same behavior as the 124 * default ops.select_cpu implementation. The behavior of the scheduler 125 * would be exactly same if the implementation just didn't define the 126 * simple_select_cpu() struct_ops prog. 127 */ 128 s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, 129 s32 prev_cpu, u64 wake_flags) 130 { 131 s32 cpu; 132 /* Need to initialize or the BPF verifier will reject the program */ 133 bool direct = false; 134 135 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); 136 137 if (direct) 138 scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); 139 140 return cpu; 141 } 142 143 /* 144 * Do a direct dispatch of a task to the global DSQ. This ops.enqueue() 145 * callback will only be invoked if we failed to find a core to dispatch 146 * to in ops.select_cpu() above. 147 * 148 * Note that this implementation has exactly the same behavior as the 149 * default ops.enqueue implementation, which just dispatches the task 150 * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same 151 * if the implementation just didn't define the simple_enqueue struct_ops 152 * prog. 153 */ 154 void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) 155 { 156 scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 157 } 158 159 s32 BPF_STRUCT_OPS(simple_init) 160 { 161 /* 162 * All SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH tasks should 163 * use sched_ext. 164 */ 165 scx_bpf_switch_all(); 166 return 0; 167 } 168 169 void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) 170 { 171 exit_type = ei->type; 172 } 173 174 SEC(".struct_ops") 175 struct sched_ext_ops simple_ops = { 176 .select_cpu = (void *)simple_select_cpu, 177 .enqueue = (void *)simple_enqueue, 178 .init = (void *)simple_init, 179 .exit = (void *)simple_exit, 180 .name = "simple", 181 }; 182 183Dispatch Queues 184--------------- 185 186To match the impedance between the scheduler core and the BPF scheduler, 187sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a 188priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), 189and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage 190an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and 191``scx_bpf_destroy_dsq()``. 192 193A CPU always executes a task from its local DSQ. A task is "dispatched" to a 194DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's 195local DSQ. 196 197When a CPU is looking for the next task to run, if the local DSQ is not 198empty, the first task is picked. Otherwise, the CPU tries to consume the 199global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()`` 200is invoked. 201 202Scheduling Cycle 203---------------- 204 205The following briefly shows how a waking task is scheduled and executed. 206 2071. When a task is waking up, ``ops.select_cpu()`` is the first operation 208 invoked. This serves two purposes. First, CPU selection optimization 209 hint. Second, waking up the selected CPU if idle. 210 211 The CPU selected by ``ops.select_cpu()`` is an optimization hint and not 212 binding. The actual decision is made at the last step of scheduling. 213 However, there is a small performance gain if the CPU 214 ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. 215 216 A side-effect of selecting a CPU is waking it up from idle. While a BPF 217 scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 218 using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 219 220 A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by 221 calling ``scx_bpf_dispatch()``. If the task is dispatched to 222 ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the 223 local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 224 Additionally, dispatching directly from ``ops.select_cpu()`` will cause the 225 ``ops.enqueue()`` callback to be skipped. 226 227 Note that the scheduler core will ignore an invalid CPU selection, for 228 example, if it's outside the allowed cpumask of the task. 229 2302. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 231 task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()`` 232 can make one of the following decisions: 233 234 * Immediately dispatch the task to either the global or local DSQ by 235 calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or 236 ``SCX_DSQ_LOCAL``, respectively. 237 238 * Immediately dispatch the task to a custom DSQ by calling 239 ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63. 240 241 * Queue the task on the BPF side. 242 2433. When a CPU is ready to schedule, it first looks at its local DSQ. If 244 empty, it then looks at the global DSQ. If there still isn't a task to 245 run, ``ops.dispatch()`` is invoked which can use the following two 246 functions to populate the local DSQ. 247 248 * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can 249 be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``, 250 ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()`` 251 currently can't be called with BPF locks held, this is being worked on 252 and will be supported. ``scx_bpf_dispatch()`` schedules dispatching 253 rather than performing them immediately. There can be up to 254 ``ops.dispatch_max_batch`` pending tasks. 255 256 * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ 257 to the dispatching DSQ. This function cannot be called with any BPF 258 locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks 259 before trying to consume the specified DSQ. 260 2614. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, 262 the CPU runs the first one. If empty, the following steps are taken: 263 264 * Try to consume the global DSQ. If successful, run the task. 265 266 * If ``ops.dispatch()`` has dispatched any tasks, retry #3. 267 268 * If the previous task is an SCX task and still runnable, keep executing 269 it (see ``SCX_OPS_ENQ_LAST``). 270 271 * Go idle. 272 273Note that the BPF scheduler can always choose to dispatch tasks immediately 274in ``ops.enqueue()`` as illustrated in the above simple example. If only the 275built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as 276a task is never queued on the BPF scheduler and both the local and global 277DSQs are consumed automatically. 278 279``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use 280``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as 281``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue 282dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the 283function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for 284more information. 285 286Where to Look 287============= 288 289* ``include/linux/sched/ext.h`` defines the core data structures, ops table 290 and constants. 291 292* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers. 293 The functions prefixed with ``scx_bpf_`` can be called from the BPF 294 scheduler. 295 296* ``tools/sched_ext/`` hosts example BPF scheduler implementations. 297 298 * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a 299 custom DSQ. 300 301 * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five 302 levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``. 303 304ABI Instability 305=============== 306 307The APIs provided by sched_ext to BPF schedulers programs have no stability 308guarantees. This includes the ops table callbacks and constants defined in 309``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in 310``kernel/sched/ext.c``. 311 312While we will attempt to provide a relatively stable API surface when 313possible, they are subject to change without warning between kernel 314versions. 315