SCHED_EXT EXAMPLE SCHEDULERS
============================

# Introduction

This directory contains a number of example sched_ext schedulers. These
schedulers are meant to provide examples of different types of schedulers
that can be built using sched_ext, and illustrate how various features of
sched_ext can be used.

Some of the examples are performant, production-ready schedulers. That is, for
the correct workload and with the correct tuning, they may be deployed in a
production environment with acceptable or possibly even improved performance.
Others are just examples that, in practice, would not provide acceptable
performance (though they could be improved to get there).

This README describes these example schedulers, including the types of
workloads or scenarios they're designed to accommodate, and whether or not
they're production ready. For more details on any of these schedulers, please
see the header comment in their .bpf.c file.


# Compiling the examples

There are a few toolchain dependencies for compiling the example schedulers.

## Toolchain dependencies

1. clang >= 16.0.0

The schedulers are BPF programs, and therefore must be compiled with clang. gcc
is actively working on adding a BPF backend compiler as well, but is still
missing some features such as BTF type tags which are necessary for using
kptrs.

2. pahole >= 1.25

You may need pahole in order to generate BTF from DWARF.

3. rust >= 1.70.0

Rust schedulers use features present in the rust toolchain >= 1.70.0. You
should be able to use the stable build from rustup, but if that doesn't
work, try using the rustup nightly build.

There are other requirements as well, such as make, but these are the main /
non-trivial ones.
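
As a quick sanity check before building, the minimum versions listed above can
be compared against what is installed. The script below is only a sketch and is
not part of this repository: the `version_ge` and `check_tool` helpers are
hypothetical names, and the comparison assumes a `sort` that supports `-V`
(e.g. GNU coreutils).

```bash
#!/usr/bin/env bash
# Sketch: report whether clang, pahole and rustc meet the minimum versions.
# version_ge and check_tool are illustrative helpers, not repository tools.

# Succeeds if version $1 is >= version $2.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
	[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Usage: check_tool <command> <minimum version>
check_tool() {
	local tool="$1" min="$2" found

	if ! command -v "$tool" >/dev/null 2>&1; then
		echo "$tool: not found (need >= $min)"
		return 0
	fi

	# Take the first dotted number printed by `<tool> --version`.
	found="$("$tool" --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)"

	if [ -z "$found" ]; then
		echo "$tool: could not parse version (need >= $min)"
	elif version_ge "$found" "$min"; then
		echo "$tool: $found (ok, need >= $min)"
	else
		echo "$tool: $found (too old, need >= $min)"
	fi
}

check_tool clang 16.0.0
check_tool pahole 1.25
check_tool rustc 1.70.0
```

The script only reports; it never exits non-zero, so it is safe to run even
when one of the tools is missing.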

## Compiling the kernel

In order to run a sched_ext scheduler, you'll have to run a kernel compiled
with the patches in this repository, and with a minimum set of necessary
Kconfig options:

```
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
```

It's also recommended that you include the following Kconfig options:

```
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
```

There is a `Kconfig` file in this directory whose contents you can append to
your local `.config` file, as long as there are no conflicts with any existing
options in the file.

## Getting a vmlinux.h file

You may notice that most of the example schedulers include a "vmlinux.h" file.
This is a large, auto-generated header file that contains all of the types
defined in some vmlinux binary that was compiled with
[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
options specified above).

The header file is created using `bpftool`, by passing it a vmlinux binary
compiled with BTF as follows:

```bash
$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
```

`bpftool` analyzes all of the BTF encodings in the binary, and produces a
header file that can be included by BPF programs to access those types. For
example, using vmlinux.h allows a scheduler to access fields defined directly
in vmlinux as follows:

```c
#include "vmlinux.h"
// vmlinux.h is also implicitly included by scx_common.bpf.h.
#include "scx_common.bpf.h"

/*
 * vmlinux.h provides definitions for struct task_struct and
 * struct scx_enable_args.
 */
void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
		    struct scx_enable_args *args)
{
	bpf_printk("Task %s enabled in example scheduler", p->comm);
}

// vmlinux.h provides the definition for struct sched_ext_ops.
SEC(".struct_ops.link")
struct sched_ext_ops example_ops = {
	.enable	= (void *)example_enable,
	.name	= "example",
};
```

The scheduler build system will generate this vmlinux.h file as part of the
scheduler build pipeline. It looks for a vmlinux file in the following
dependency order:

1. If the O= environment variable is defined, at `$O/vmlinux`
2. If the KBUILD_OUTPUT= environment variable is defined, at
   `$KBUILD_OUTPUT/vmlinux`
3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
   compiling the schedulers)
4. `/sys/kernel/btf/vmlinux`
5. `/boot/vmlinux-$(uname -r)`

In other words, if you have compiled a kernel in your local repo, its vmlinux
file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
the kernel you're currently running on. This means that if you're running on a
kernel with sched_ext support, you may not need to compile a local kernel at
all.

### Aside on CO-RE

One of the cooler features of BPF is that it supports
[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
Everywhere). This feature allows you to reference fields inside of structs with
types defined internal to the kernel, and not have to recompile if you load the
BPF program on a different kernel with the field at a different offset. In our
example above, we print out a task name with `p->comm`. CO-RE would perform
relocations for that access when the program is loaded to ensure that it's
referencing the correct offset for the currently running kernel.

## Compiling the schedulers

Once you have your toolchain set up, and a vmlinux that can be used to generate
a full vmlinux.h file, you can compile the schedulers using `make`:

```bash
$ make -j$(nproc)
```

# Example schedulers

This directory contains the following example schedulers.
These schedulers are for testing and demonstrating different aspects of
sched_ext. While some may be useful in limited scenarios, they are not intended
to be practical.

For more scheduler implementations, tools and documentation, visit
https://github.com/sched-ext/scx.

## scx_simple

A simple scheduler that provides an example of a minimal sched_ext scheduler.
scx_simple can be run in either global weighted vtime mode, or FIFO mode.

Though very simple, in limited scenarios, this scheduler can perform reasonably
well on single-socket systems with a unified L3 cache.

## scx_qmap

Another simple, though slightly more complex, scheduler that provides an
example of a basic weighted FIFO queuing policy. It also provides examples of
some common useful BPF features, such as sleepable per-task storage allocation
in the `ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type
to enqueue tasks. It also illustrates how core-sched support could be
implemented.

## scx_central

A "central" scheduler where scheduling decisions are made from a single CPU.
This scheduler illustrates how scheduling decisions can be dispatched from a
single CPU, allowing other cores to run with infinite slices, without timer
ticks, and without having to incur the overhead of making scheduling decisions.

The approach demonstrated by this scheduler may be useful for any workload that
benefits from minimizing scheduling overhead and timer ticks. An example of
where this could be particularly useful is running VMs, where running with
infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
vmexits.

## scx_flatcg

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a single
layer, by compounding the active weight share at each level.
The effect of this is a much more performant CPU controller, which does not
need to descend down cgroup trees in order to properly compute a cgroup's share.

Similar to scx_simple, in limited scenarios, this scheduler can perform
reasonably well on single-socket systems with a unified L3 cache and show
significantly lowered hierarchical scheduling overhead.


# Troubleshooting

There are a number of common issues that you may run into when building the
schedulers. We'll go over some of the common ones here.

## Build Failures

### Old version of clang

```
error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^~~~~~~~~~~~~~~~~~~~
1 error generated.
```

This means you built the kernel or the schedulers with an older version of
clang than what's supported (i.e. older than 16.0.0). To remediate this:

1. Run `which clang` and `clang --version` to make sure you're using a
   sufficiently new version of clang.

2. Run `make fullclean` in the root path of the repository.

3. Rebuild the kernel, and then your example schedulers.

The schedulers are also cleaned if you invoke `make mrproper` in the root
directory of the tree.

### Stale kernel build / incomplete vmlinux.h file

As described above, you'll need a `vmlinux.h` file that was generated from a
vmlinux built with BTF, and with sched_ext support enabled.
If you don't, you'll see errors such as the following, which indicate that a
type being referenced in a scheduler is unknown:

```
/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'

const struct scx_exit_info *ei)

^
```

In order to resolve this, please follow the steps above in
[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) in order to ensure your
schedulers are using a vmlinux.h file that includes the requisite types.

## Misc

### llvm: [OFF]

You may see the following output when building the schedulers:

```
Auto-detecting system features:
...                         clang-bpf-co-re: [ on  ]
...                                    llvm: [ OFF ]
...                                  libcap: [ on  ]
...                                  libbfd: [ on  ]
```

Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.