SCHED_EXT EXAMPLE SCHEDULERS
============================

# Introduction

This directory contains a number of example sched_ext schedulers. These
schedulers are meant to provide examples of different types of schedulers
that can be built using sched_ext, and illustrate how various features of
sched_ext can be used.

Some of the examples are performant, production-ready schedulers. That is, for
the correct workload and with the correct tuning, they may be deployed in a
production environment with acceptable or possibly even improved performance.
Others are just examples that, in practice, would not provide acceptable
performance (though they could be improved to get there).

This README describes these example schedulers, including the types of
workloads or scenarios they're designed to accommodate, and whether or not
they're production ready. For more details on any of these schedulers, please
see the header comment in their .bpf.c file.


# Compiling the examples

There are a few toolchain dependencies for compiling the example schedulers.

## Toolchain dependencies

1. clang >= 16.0.0

The schedulers are BPF programs, and therefore must be compiled with clang. gcc
is actively working on adding a BPF backend compiler as well, but it is still
missing some features such as BTF type tags which are necessary for using
kptrs.

2. pahole >= 1.25

You may need pahole in order to generate BTF from DWARF.

3. rust >= 1.70.0

Rust schedulers use features present in the rust toolchain >= 1.70.0. You
should be able to use the stable build from rustup, but if that doesn't
work, try using the rustup nightly build.

There are other requirements as well, such as make, but these are the main /
non-trivial ones.

## Compiling the kernel

In order to run a sched_ext scheduler, you'll have to run a kernel compiled
with the patches in this repository, and with a minimum set of necessary
Kconfig options:

```
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
```

It's also recommended that you include the following Kconfig options:

```
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
```

There is a `Kconfig` file in this directory whose contents you can append to
your local `.config` file, as long as there are no conflicts with any existing
options in the file.
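
One way to confirm that a config file carries the required options listed
above is a small shell check. This is only a sketch: `missing_scx_options` is
a hypothetical helper that is not part of the kernel tree, and it assumes the
options appear as plain `NAME=y` lines.

```bash
#!/usr/bin/env bash
# Sketch only: missing_scx_options is a hypothetical helper, not part of the
# kernel tree. The option names are the required ones listed in this README.
missing_scx_options() {
	local config="$1" opt status=0
	for opt in CONFIG_BPF CONFIG_SCHED_CLASS_EXT CONFIG_BPF_SYSCALL \
		   CONFIG_BPF_JIT CONFIG_DEBUG_INFO_BTF; do
		# -x matches the whole line, -F treats the pattern as a fixed string.
		if ! grep -qxF "${opt}=y" "$config"; then
			echo "missing: ${opt}=y"
			status=1
		fi
	done
	return "$status"
}
```

For example, `missing_scx_options .config` prints one line per absent option
and returns non-zero if any of them are missing.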

## Getting a vmlinux.h file

You may notice that most of the example schedulers include a "vmlinux.h" file.
This is a large, auto-generated header file that contains all of the types
defined in some vmlinux binary that was compiled with
[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
options specified above).

The header file is created using `bpftool`, by passing it a vmlinux binary
compiled with BTF as follows:

```bash
$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
```

`bpftool` analyzes all of the BTF encodings in the binary, and produces a
header file that can be included by BPF programs to access those types. For
example, using vmlinux.h allows a scheduler to access fields defined directly
in vmlinux as follows:

```c
#include "vmlinux.h"
// vmlinux.h is also implicitly included by scx_common.bpf.h.
#include "scx_common.bpf.h"

/*
 * vmlinux.h provides definitions for struct task_struct and
 * struct scx_enable_args.
 */
void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
		    struct scx_enable_args *args)
{
	bpf_printk("Task %s enabled in example scheduler", p->comm);
}

// vmlinux.h provides the definition for struct sched_ext_ops.
SEC(".struct_ops.link")
struct sched_ext_ops example_ops = {
	.enable	= (void *)example_enable,
	.name	= "example",
};
```

The scheduler build system will generate this vmlinux.h file as part of the
scheduler build pipeline. It looks for a vmlinux file in the following
order:

1. If the O= environment variable is defined, at `$O/vmlinux`
2. If the KBUILD_OUTPUT= environment variable is defined, at
   `$KBUILD_OUTPUT/vmlinux`
3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
   compiling the schedulers)
4. `/sys/kernel/btf/vmlinux`
5. `/boot/vmlinux-$(uname -r)`

In other words, if you have compiled a kernel in your local repo, its vmlinux
file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
the kernel you're currently running on. This means that if you're running on a
kernel with sched_ext support, you may not need to compile a local kernel at
all.
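
The lookup order above can be sketched as a shell function. This is an
illustration only: `find_vmlinux` is a hypothetical name, the real logic lives
in the scheduler build system and may differ in detail, and the sketch assumes
it is run from this directory so that `../../vmlinux` resolves to the root of
the kernel tree.

```bash
#!/usr/bin/env bash
# Sketch only: find_vmlinux is a hypothetical helper that mirrors the lookup
# order described in this README. It is not the build system's own code.
find_vmlinux() {
	local candidates=() c
	# Assumption: an empty O or KBUILD_OUTPUT is treated as undefined.
	if [ -n "${O:-}" ]; then candidates+=("$O/vmlinux"); fi
	if [ -n "${KBUILD_OUTPUT:-}" ]; then candidates+=("$KBUILD_OUTPUT/vmlinux"); fi
	candidates+=(../../vmlinux /sys/kernel/btf/vmlinux "/boot/vmlinux-$(uname -r)")
	# Print the first candidate that exists and stop.
	for c in "${candidates[@]}"; do
		if [ -e "$c" ]; then
			echo "$c"
			return 0
		fi
	done
	return 1
}
```

Calling `find_vmlinux` prints the first candidate that exists, which can help
when checking which vmlinux a generated vmlinux.h is likely to come from.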

### Aside on CO-RE

One of the cooler features of BPF is that it supports
[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
Everywhere). This feature allows you to reference fields inside of structs with
types defined internal to the kernel, and not have to recompile if you load the
BPF program on a different kernel with the field at a different offset. In our
example above, we print out a task name with `p->comm`. CO-RE would perform
relocations for that access when the program is loaded to ensure that it's
referencing the correct offset for the currently running kernel.

## Compiling the schedulers

Once you have your toolchain set up, and a vmlinux that can be used to generate
a full vmlinux.h file, you can compile the schedulers using `make`:

```bash
$ make -j$(nproc)
```

# Example schedulers

This directory contains the following example schedulers. These schedulers are
for testing and demonstrating different aspects of sched_ext. While some may be
useful in limited scenarios, they are not intended to be practical.

For more scheduler implementations, tools and documentation, visit
https://github.com/sched-ext/scx.

## scx_simple

A simple scheduler that provides an example of a minimal sched_ext scheduler.
scx_simple can be run in either global weighted vtime mode, or FIFO mode.

Though very simple, in limited scenarios, this scheduler can perform reasonably
well on single-socket systems with a unified L3 cache.

## scx_qmap

Another simple, yet slightly more complex scheduler that provides an example of
a basic weighted FIFO queuing policy. It also provides examples of some common
useful BPF features, such as sleepable per-task storage allocation in the
`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
enqueue tasks. It also illustrates how core-sched support could be implemented.

## scx_central

A "central" scheduler where scheduling decisions are made from a single CPU.
This scheduler illustrates how scheduling decisions can be dispatched from a
single CPU, allowing other cores to run with infinite slices, without timer
ticks, and without having to incur the overhead of making scheduling decisions.

The approach demonstrated by this scheduler may be useful for any workload that
benefits from minimizing scheduling overhead and timer ticks. An example of
where this could be particularly useful is running VMs, where running with
infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
vmexits.

## scx_flatcg

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a single
layer, by compounding the active weight share at each level. The effect of this
is a much more performant CPU controller, which does not need to descend down
cgroup trees in order to properly compute a cgroup's share.

Similar to scx_simple, in limited scenarios, this scheduler can perform
reasonably well on single-socket systems with a unified L3 cache and show
significantly lowered hierarchical scheduling overhead.


# Troubleshooting

There are a number of common issues that you may run into when building the
schedulers. We'll go over some of the common ones here.

## Build Failures

### Old version of clang

```
error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^~~~~~~~~~~~~~~~~~~~
1 error generated.
```

This means you built the kernel or the schedulers with an older version of
clang than what's supported (i.e. older than 16.0.0). To remediate this:

1. Run `which clang` and `clang --version` to make sure you're using a
   sufficiently new version of clang.

2. Run `make fullclean` in the root path of the repository.

3. Rebuild the kernel, and then your example schedulers.

The schedulers are also cleaned if you invoke `make mrproper` in the root
directory of the tree.
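
The version requirement above can be checked with a short shell function. This
is a minimal sketch: `clang_major_ok` is a hypothetical helper, and it only
compares the major component of a dotted version string against 16.

```bash
#!/usr/bin/env bash
# Sketch only: clang_major_ok is a hypothetical helper. It succeeds when the
# major component of a dotted version string (e.g. "16.0.6") is at least 16,
# the minimum clang version named in this README.
clang_major_ok() {
	# Strip everything from the first dot onward, leaving the major number.
	local major="${1%%.*}"
	# A non-numeric argument makes the comparison fail, which counts as "too old".
	[ "$major" -ge 16 ] 2>/dev/null
}

# Example use, assuming clang is on PATH and supports -dumpversion:
#   clang_major_ok "$(clang -dumpversion)" || echo "clang is too old"
```

Only the major number is compared, which is enough here because the minimum
supported version is 16.0.0.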

### Stale kernel build / incomplete vmlinux.h file

As described above, you'll need a `vmlinux.h` file that was generated from a
vmlinux built with BTF, and with sched_ext support enabled. If you don't,
you'll see errors such as the following which indicate that a type being
referenced in a scheduler is unknown:

```
/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'

const struct scx_exit_info *ei)

^
```

To resolve this, please follow the steps above in
[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) to ensure your
schedulers are using a vmlinux.h file that includes the requisite types.

## Misc

### llvm: [OFF]

You may see the following output when building the schedulers:

```
Auto-detecting system features:
...                         clang-bpf-co-re: [ on  ]
...                                    llvm: [ OFF ]
...                                  libcap: [ on  ]
...                                  libbfd: [ on  ]
```

Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.