SCHED_EXT EXAMPLE SCHEDULERS
============================

# Introduction

This directory contains a number of example sched_ext schedulers. These
schedulers are meant to provide examples of different types of schedulers
that can be built using sched_ext, and illustrate how various features of
sched_ext can be used.

Some of the examples are performant, production-ready schedulers. That is, for
the correct workload and with the correct tuning, they may be deployed in a
production environment with acceptable or possibly even improved performance.
Others are just examples that, in practice, would not provide acceptable
performance (though they could be improved to get there).

This README describes these example schedulers, including the types of
workloads or scenarios they're designed to accommodate, and whether or not
they're production ready. For more details on any of these schedulers, please
see the header comment in their .bpf.c file.


# Compiling the examples

There are a few toolchain dependencies for compiling the example schedulers.

## Toolchain dependencies

1. clang >= 16.0.0

The schedulers are BPF programs, and therefore must be compiled with clang. gcc
is actively working on adding a BPF backend compiler as well, but is still
missing some features such as BTF type tags which are necessary for using
kptrs.

2. pahole >= 1.25

You may need pahole in order to generate BTF from DWARF.

3. rust >= 1.70.0

Rust schedulers use features present in the rust toolchain >= 1.70.0. You
should be able to use the stable build from rustup, but if that doesn't
work, try using the rustup nightly build.

There are other requirements as well, such as make, but these are the main /
non-trivial ones.

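As a quick sanity check before building, the minimum versions listed above can
be compared against what is installed. The following is a minimal sketch and
not part of the build system: `version_ge` and `check_tool` are hypothetical
helper names, the parsing assumes each tool prints a dotted version number in
its `--version` output, and `sort -V` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Sketch: compare installed tool versions against the minimums listed above.
# version_ge and check_tool are illustrative helpers, not part of sched_ext.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Prints whether tool $1 is installed with at least version $2.
check_tool() {
    local tool=$1 min=$2 found
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found (need >= $min)"
        return 0
    fi
    # Assumption: the first dotted number in the output is the version.
    found=$("$tool" --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)
    if [ -n "$found" ] && version_ge "$found" "$min"; then
        echo "$tool: $found (ok, need >= $min)"
    else
        echo "$tool: ${found:-unknown} (need >= $min)"
    fi
}

check_tool clang 16.0.0
check_tool pahole 1.25
check_tool rustc 1.70.0
```

The script only reports what it finds; it does not modify your environment or
fail the build.
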
## Compiling the kernel

In order to run a sched_ext scheduler, you'll have to run a kernel compiled
with the patches in this repository, and with a minimum set of necessary
Kconfig options:

```
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
```

There is a `Kconfig` file in this directory whose contents you can append to
your local `.config` file, as long as there are no conflicts with any existing
options in the file.

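To confirm that a kernel configuration enables everything in the list above,
each option can be looked up in the config file. The sketch below is
illustrative only: `missing_opts` is a hypothetical helper, and the `.config`
path assumes you run it from the root of your kernel build directory.

```bash
#!/usr/bin/env bash
# Sketch: report which of the required Kconfig options are not set to 'y'.
# missing_opts is an illustrative helper, not part of the sched_ext tooling.

required_opts="CONFIG_BPF CONFIG_SCHED_CLASS_EXT CONFIG_BPF_SYSCALL
CONFIG_BPF_JIT CONFIG_DEBUG_INFO_BTF CONFIG_BPF_JIT_ALWAYS_ON
CONFIG_BPF_JIT_DEFAULT_ON"

# Prints one line per required option that is not enabled in config file $1.
missing_opts() {
    local opt
    for opt in $required_opts; do
        # -x matches the whole line, so CONFIG_BPF=y is not satisfied by
        # a CONFIG_BPF_JIT=y line.
        grep -qx "${opt}=y" "$1" || echo "$opt"
    done
}

# Assumption: the kernel config lives at ./.config.
if [ -f .config ]; then
    missing_opts .config
fi
```

Empty output means every listed option is enabled. Many distributions also ship
the running kernel's configuration as `/boot/config-$(uname -r)`, which can be
passed to the helper instead.
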
## Getting a vmlinux.h file

You may notice that most of the example schedulers include a "vmlinux.h" file.
This is a large, auto-generated header file that contains all of the types
defined in some vmlinux binary that was compiled with
[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
options specified above).

The header file is created using `bpftool`, by passing it a vmlinux binary
compiled with BTF as follows:

```bash
$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
```

`bpftool` analyzes all of the BTF encodings in the binary, and produces a
header file that can be included by BPF programs to access those types. For
example, using vmlinux.h allows a scheduler to access fields defined directly
in vmlinux as follows:

```c
#include "vmlinux.h"
// vmlinux.h is also implicitly included by scx_common.bpf.h.
#include "scx_common.bpf.h"

/*
 * vmlinux.h provides definitions for struct task_struct and
 * struct scx_enable_args.
 */
void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
                    struct scx_enable_args *args)
{
        bpf_printk("Task %s enabled in example scheduler", p->comm);
}

// vmlinux.h provides the definition for struct sched_ext_ops.
SEC(".struct_ops.link")
struct sched_ext_ops example_ops = {
        .enable = (void *)example_enable,
        .name   = "example",
};
```

The scheduler build system will generate this vmlinux.h file as part of the
scheduler build pipeline. It looks for a vmlinux file in the following order:

1. If the O= environment variable is defined, at `$O/vmlinux`
2. If the KBUILD_OUTPUT= environment variable is defined, at
   `$KBUILD_OUTPUT/vmlinux`
3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
   compiling the schedulers)
4. `/sys/kernel/btf/vmlinux`
5. `/boot/vmlinux-$(uname -r)`

In other words, if you have compiled a kernel in your local repo, its vmlinux
file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
the kernel you're currently running on. This means that if you're running on a
kernel with sched_ext support, you may not need to compile a local kernel at
all.

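The lookup order above can be sketched as a small shell function. This
illustrates the documented order rather than the logic the Makefile actually
uses; `find_vmlinux` is a hypothetical name, and the relative `../../vmlinux`
path assumes the function is run from this directory.

```bash
#!/usr/bin/env bash
# Sketch: find the first vmlinux candidate in the documented lookup order.
# find_vmlinux is an illustrative helper, not part of the build system.

# Prints the first existing candidate path; fails if none exists.
find_vmlinux() {
    local candidate
    # ${VAR:+...} expands to nothing when VAR is unset or empty, so the
    # O and KBUILD_OUTPUT candidates are skipped unless they are defined.
    for candidate in \
        ${O:+"$O/vmlinux"} \
        ${KBUILD_OUTPUT:+"$KBUILD_OUTPUT/vmlinux"} \
        ../../vmlinux \
        /sys/kernel/btf/vmlinux \
        "/boot/vmlinux-$(uname -r)"
    do
        if [ -e "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if path=$(find_vmlinux); then
    echo "vmlinux.h would be generated from: $path"
else
    echo "no vmlinux found in any of the documented locations" >&2
fi
```
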
### Aside on CO-RE

One of the cooler features of BPF is that it supports
[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
Everywhere). This feature allows you to reference fields inside of structs with
types defined internal to the kernel, and not have to recompile if you load the
BPF program on a different kernel with the field at a different offset. In our
example above, we print out a task name with `p->comm`. CO-RE would perform
relocations for that access when the program is loaded to ensure that it's
referencing the correct offset for the currently running kernel.

## Compiling the schedulers

Once you have your toolchain set up, and a vmlinux that can be used to generate
a full vmlinux.h file, you can compile the schedulers using `make`:

```bash
$ make -j$(nproc)
```

# Example schedulers

This directory contains the following example schedulers. These schedulers are
for testing and demonstrating different aspects of sched_ext. While some may be
useful in limited scenarios, they are not intended to be practical.

For more scheduler implementations, tools and documentation, visit
https://github.com/sched-ext/scx.

## scx_simple

A simple scheduler that provides an example of a minimal sched_ext scheduler.
scx_simple can be run in either global weighted vtime mode, or FIFO mode.

Though very simple, in limited scenarios, this scheduler can perform reasonably
well on single-socket systems with a unified L3 cache.

## scx_qmap

Another simple, yet slightly more complex scheduler that provides an example of
a basic weighted FIFO queuing policy. It also provides examples of some common
useful BPF features, such as sleepable per-task storage allocation in the
`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
enqueue tasks. It also illustrates how core-sched support could be implemented.

## scx_central

A "central" scheduler where scheduling decisions are made from a single CPU.
This scheduler illustrates how scheduling decisions can be dispatched from a
single CPU, allowing other cores to run with infinite slices, without timer
ticks, and without having to incur the overhead of making scheduling decisions.

The approach demonstrated by this scheduler may be useful for any workload that
benefits from minimizing scheduling overhead and timer ticks. An example of
where this could be particularly useful is running VMs, where running with
infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
vmexits.

## scx_flatcg

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a single
layer, by compounding the active weight share at each level. The effect of this
is a much more performant CPU controller, which does not need to descend down
cgroup trees in order to properly compute a cgroup's share.

Similar to scx_simple, in limited scenarios, this scheduler can perform
reasonably well on single-socket systems with a unified L3 cache and show
significantly lowered hierarchical scheduling overhead.


# Troubleshooting

There are a number of common issues that you may run into when building the
schedulers. We'll go over some of the common ones here.

## Build Failures

### Old version of clang

```
error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^~~~~~~~~~~~~~~~~~~~
1 error generated.
```

This means you built the kernel or the schedulers with an older version of
clang than what's supported (i.e. older than 16.0.0). To remediate this:

1. Run `which clang` and `clang --version` to make sure you're using a
   sufficiently new version of clang.

2. Run `make fullclean` in the root path of the repository.

3. Rebuild the kernel, and then your example schedulers.

The schedulers are also cleaned if you invoke `make mrproper` in the root
directory of the tree.

### Stale kernel build / incomplete vmlinux.h file

As described above, you'll need a `vmlinux.h` file that was generated from a
vmlinux built with BTF, and with sched_ext support enabled. If you don't,
you'll see errors such as the following which indicate that a type being
referenced in a scheduler is unknown:

```
/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'

const struct scx_exit_info *ei)

^
```

In order to resolve this, please follow the steps above in
[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) in order to ensure your
schedulers are using a vmlinux.h file that includes the requisite types.

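One way to check for this condition before building is to search the generated
header for a sched_ext type. The sketch below assumes that `bpftool` emits each
struct definition as `struct <name> {` at the start of a line; `has_struct` is
a hypothetical helper, and the `vmlinux.h` path is an assumption about where
your header is located.

```bash
#!/usr/bin/env bash
# Sketch: check whether a vmlinux.h defines a given struct.
# has_struct is an illustrative helper, not part of the sched_ext tooling.

# Succeeds when header $1 contains a definition (not just a forward
# declaration) of struct $2.
has_struct() {
    grep -qE "^struct $2 \{" "$1"
}

# Assumption: vmlinux.h is in the current directory.
if [ -f vmlinux.h ]; then
    if has_struct vmlinux.h scx_exit_info; then
        echo "vmlinux.h defines struct scx_exit_info"
    else
        echo "vmlinux.h lacks struct scx_exit_info; regenerate it from a sched_ext kernel"
    fi
fi
```
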
## Misc

### llvm: [OFF]

You may see the following output when building the schedulers:

```
Auto-detecting system features:
... clang-bpf-co-re: [ on ]
... llvm: [ OFF ]
... libcap: [ on ]
... libbfd: [ on ]
```

Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.
