1.. _sched-ext: 2 3========================== 4Extensible Scheduler Class 5========================== 6 7sched_ext is a scheduler class whose behavior can be defined by a set of BPF 8programs - the BPF scheduler. 9 10* sched_ext exports a full scheduling interface so that any scheduling 11 algorithm can be implemented on top. 12 13* The BPF scheduler can group CPUs however it sees fit and schedule them 14 together, as tasks aren't tied to specific CPUs at the time of wakeup. 15 16* The BPF scheduler can be turned on and off dynamically anytime. 17 18* The system integrity is maintained no matter what the BPF scheduler does. 19 The default scheduling behavior is restored anytime an error is detected, 20 a runnable task stalls, or on invoking the SysRq key sequence 21 `SysRq-S`. 22 23* When the BPF scheduler triggers an error, debug information is dumped to 24 aid debugging. The debug dump is passed to and printed out by the 25 scheduler binary. The debug dump can also be accessed through the 26 `sched_ext_dump` tracepoint. The SysRq key sequence `SysRq-D` 27 triggers a debug dump. This doesn't terminate the BPF scheduler and can 28 only be read through the tracepoint. 29 30Switching to and from sched_ext 31=============================== 32 33``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and 34``tools/sched_ext`` contains the example schedulers. The following config 35options should be enabled to use sched_ext: 36 37.. code-block:: none 38 39 CONFIG_BPF=y 40 CONFIG_SCHED_CLASS_EXT=y 41 CONFIG_BPF_SYSCALL=y 42 CONFIG_BPF_JIT=y 43 CONFIG_DEBUG_INFO_BTF=y 44 CONFIG_BPF_JIT_ALWAYS_ON=y 45 CONFIG_BPF_JIT_DEFAULT_ON=y 46 CONFIG_PAHOLE_HAS_BTF_TAG=y 47 48sched_ext is used only when the BPF scheduler is loaded and running. 49 50If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be 51treated as ``SCHED_NORMAL`` and scheduled by the fair-class scheduler until the 52BPF scheduler is loaded. 53 54When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set 55in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and 56``SCHED_EXT`` tasks are scheduled by sched_ext. 57 58However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is 59set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled 60by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and 61``SCHED_IDLE`` policies are scheduled by the fair-class scheduler. 62 63Terminating the sched_ext scheduler program, triggering `SysRq-S`, or 64detection of any internal error including stalled runnable tasks aborts the 65BPF scheduler and reverts all tasks back to the fair-class scheduler. 66 67.. code-block:: none 68 69 # make -j16 -C tools/sched_ext 70 # tools/sched_ext/build/bin/scx_simple 71 local=0 global=3 72 local=5 global=24 73 local=9 global=44 74 local=13 global=56 75 local=17 global=72 76 ^CEXIT: BPF scheduler unregistered 77 78The current status of the BPF scheduler can be determined as follows: 79 80.. code-block:: none 81 82 # cat /sys/kernel/sched_ext/state 83 enabled 84 # cat /sys/kernel/sched_ext/root/ops 85 simple 86 87You can check if any BPF scheduler has ever been loaded since boot by examining 88this monotonically incrementing counter (a value of zero indicates that no BPF 89scheduler has been loaded): 90 91.. code-block:: none 92 93 # cat /sys/kernel/sched_ext/enable_seq 94 1 95 96``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 97detailed information: 98 99.. code-block:: none 100 101 # tools/sched_ext/scx_show_state.py 102 ops : simple 103 enabled : 1 104 switching_all : 1 105 switched_all : 1 106 enable_state : enabled (2) 107 bypass_depth : 0 108 nr_rejected : 0 109 enable_seq : 1 110 111Whether a given task is on sched_ext can be determined as follows: 112 113.. code-block:: none 114 115 # grep ext /proc/self/sched 116 ext.enabled : 1 117 118The Basics 119========== 120 121Userspace can implement an arbitrary BPF scheduler by loading a set of BPF 122programs that implement ``struct sched_ext_ops``. The only mandatory field 123is ``ops.name`` which must be a valid BPF object name. All operations are 124optional. The following modified excerpt is from 125``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. 126 127.. code-block:: c 128 129 /* 130 * Decide which CPU a task should be migrated to before being 131 * enqueued (either at wakeup, fork time, or exec time). If an 132 * idle core is found by the default ops.select_cpu() implementation, 133 * then insert the task directly into SCX_DSQ_LOCAL and skip the 134 * ops.enqueue() callback. 135 * 136 * Note that this implementation has exactly the same behavior as the 137 * default ops.select_cpu implementation. The behavior of the scheduler 138 * would be exactly same if the implementation just didn't define the 139 * simple_select_cpu() struct_ops prog. 140 */ 141 s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, 142 s32 prev_cpu, u64 wake_flags) 143 { 144 s32 cpu; 145 /* Need to initialize or the BPF verifier will reject the program */ 146 bool direct = false; 147 148 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); 149 150 if (direct) 151 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); 152 153 return cpu; 154 } 155 156 /* 157 * Do a direct insertion of a task to the global DSQ. This ops.enqueue() 158 * callback will only be invoked if we failed to find a core to insert 159 * into in ops.select_cpu() above. 160 * 161 * Note that this implementation has exactly the same behavior as the 162 * default ops.enqueue implementation, which just dispatches the task 163 * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same 164 * if the implementation just didn't define the simple_enqueue struct_ops 165 * prog. 166 */ 167 void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) 168 { 169 scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 170 } 171 172 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) 173 { 174 /* 175 * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and 176 * SCHED_BATCH tasks should use sched_ext. 177 */ 178 return 0; 179 } 180 181 void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) 182 { 183 exit_type = ei->type; 184 } 185 186 SEC(".struct_ops") 187 struct sched_ext_ops simple_ops = { 188 .select_cpu = (void *)simple_select_cpu, 189 .enqueue = (void *)simple_enqueue, 190 .init = (void *)simple_init, 191 .exit = (void *)simple_exit, 192 .name = "simple", 193 }; 194 195Dispatch Queues 196--------------- 197 198To match the impedance between the scheduler core and the BPF scheduler, 199sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a 200priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), 201and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage 202an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and 203``scx_bpf_destroy_dsq()``. 204 205A CPU always executes a task from its local DSQ. A task is "inserted" into a 206DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ. 207 208When a CPU is looking for the next task to run, if the local DSQ is not 209empty, the first task is picked. Otherwise, the CPU tries to move a task 210from the global DSQ. If that doesn't yield a runnable task either, 211``ops.dispatch()`` is invoked. 212 213Scheduling Cycle 214---------------- 215 216The following briefly shows how a waking task is scheduled and executed. 217 2181. When a task is waking up, ``ops.select_cpu()`` is the first operation 219 invoked. This serves two purposes. First, CPU selection optimization 220 hint. Second, waking up the selected CPU if idle. 221 222 The CPU selected by ``ops.select_cpu()`` is an optimization hint and not 223 binding. The actual decision is made at the last step of scheduling. 224 However, there is a small performance gain if the CPU 225 ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. 226 227 A side-effect of selecting a CPU is waking it up from idle. While a BPF 228 scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 229 using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 230 231 A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` 232 by calling ``scx_bpf_dsq_insert()``. If the task is inserted into 233 ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the 234 local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 235 Additionally, inserting directly from ``ops.select_cpu()`` will cause the 236 ``ops.enqueue()`` callback to be skipped. 237 238 Note that the scheduler core will ignore an invalid CPU selection, for 239 example, if it's outside the allowed cpumask of the task. 240 2412. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 242 task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()`` 243 can make one of the following decisions: 244 245 * Immediately insert the task into either the global or a local DSQ by 246 calling ``scx_bpf_dsq_insert()`` with one of the following options: 247 ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``. 248 249 * Immediately insert the task into a custom DSQ by calling 250 ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63. 251 252 * Queue the task on the BPF side. 253 2543. When a CPU is ready to schedule, it first looks at its local DSQ. If 255 empty, it then looks at the global DSQ. If there still isn't a task to 256 run, ``ops.dispatch()`` is invoked which can use the following two 257 functions to populate the local DSQ. 258 259 * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. Any target DSQ can be 260 used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``, 261 ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()`` 262 currently can't be called with BPF locks held, this is being worked on 263 and will be supported. ``scx_bpf_dsq_insert()`` schedules insertion 264 rather than performing them immediately. There can be up to 265 ``ops.dispatch_max_batch`` pending tasks. 266 267 * ``scx_bpf_move_to_local()`` moves a task from the specified non-local 268 DSQ to the dispatching DSQ. This function cannot be called with any BPF 269 locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions 270 tasks before trying to move from the specified DSQ. 271 2724. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, 273 the CPU runs the first one. If empty, the following steps are taken: 274 275 * Try to move from the global DSQ. If successful, run the task. 276 277 * If ``ops.dispatch()`` has dispatched any tasks, retry #3. 278 279 * If the previous task is an SCX task and still runnable, keep executing 280 it (see ``SCX_OPS_ENQ_LAST``). 281 282 * Go idle. 283 284Note that the BPF scheduler can always choose to dispatch tasks immediately 285in ``ops.enqueue()`` as illustrated in the above simple example. If only the 286built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as 287a task is never queued on the BPF scheduler and both the local and global 288DSQs are executed automatically. 289 290``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use 291``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as 292``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue 293dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See 294the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` 295for more information. 296 297Task Lifecycle 298-------------- 299 300The following pseudo-code summarizes the entire lifecycle of a task managed 301by a sched_ext scheduler: 302 303.. code-block:: c 304 305 ops.init_task(); /* A new task is created */ 306 ops.enable(); /* Enable BPF scheduling for the task */ 307 308 while (task in SCHED_EXT) { 309 if (task can migrate) 310 ops.select_cpu(); /* Called on wakeup (optimization) */ 311 312 ops.runnable(); /* Task becomes ready to run */ 313 314 while (task is runnable) { 315 if (task is not in a DSQ && task->scx.slice == 0) { 316 ops.enqueue(); /* Task can be added to a DSQ */ 317 318 /* Any usable CPU becomes available */ 319 320 ops.dispatch(); /* Task is moved to a local DSQ */ 321 } 322 ops.running(); /* Task starts running on its assigned CPU */ 323 while (task->scx.slice > 0 && task is runnable) 324 ops.tick(); /* Called every 1/HZ seconds */ 325 ops.stopping(); /* Task stops running (time slice expires or wait) */ 326 327 /* Task's CPU becomes available */ 328 329 ops.dispatch(); /* task->scx.slice can be refilled */ 330 } 331 332 ops.quiescent(); /* Task releases its assigned CPU (wait) */ 333 } 334 335 ops.disable(); /* Disable BPF scheduling for the task */ 336 ops.exit_task(); /* Task is destroyed */ 337 338Where to Look 339============= 340 341* ``include/linux/sched/ext.h`` defines the core data structures, ops table 342 and constants. 343 344* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers. 345 The functions prefixed with ``scx_bpf_`` can be called from the BPF 346 scheduler. 347 348* ``tools/sched_ext/`` hosts example BPF scheduler implementations. 349 350 * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a 351 custom DSQ. 352 353 * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five 354 levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``. 355 356ABI Instability 357=============== 358 359The APIs provided by sched_ext to BPF schedulers programs have no stability 360guarantees. This includes the ops table callbacks and constants defined in 361``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in 362``kernel/sched/ext.c``. 363 364While we will attempt to provide a relatively stable API surface when 365possible, they are subject to change without warning between kernel 366versions. 367