tools/sched_ext/scx_pair.bpf.c

*f0262b10SEmil Tsalapatis/* SPDX-License-Identifier: GPL-2.0 */
*f0262b10SEmil Tsalapatis/*
*f0262b10SEmil Tsalapatis * A demo sched_ext core-scheduler which always makes every sibling CPU pair
*f0262b10SEmil Tsalapatis * execute from the same CPU cgroup.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * This scheduler is a minimal implementation and would need some form of
*f0262b10SEmil Tsalapatis * priority handling both inside each cgroup and across the cgroups to be
*f0262b10SEmil Tsalapatis * practically useful.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Each CPU in the system is paired with exactly one other CPU, according to a
*f0262b10SEmil Tsalapatis * "stride" value that can be specified when the BPF scheduler program is first
*f0262b10SEmil Tsalapatis * loaded. Throughout the runtime of the scheduler, these CPU pairs guarantee
*f0262b10SEmil Tsalapatis * that they will only ever schedule tasks that belong to the same CPU cgroup.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Scheduler Initialization
*f0262b10SEmil Tsalapatis * ------------------------
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * The scheduler BPF program is first initialized from user space, before it is
*f0262b10SEmil Tsalapatis * enabled. During this initialization process, each CPU on the system is
*f0262b10SEmil Tsalapatis * assigned several values that are constant throughout its runtime:
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 1. *Pair CPU*: The CPU that it synchronizes with when making scheduling
*f0262b10SEmil Tsalapatis *		  decisions. Paired CPUs always schedule tasks from the same
*f0262b10SEmil Tsalapatis *		  CPU cgroup, and synchronize with each other to guarantee
*f0262b10SEmil Tsalapatis *		  that this constraint is not violated.
*f0262b10SEmil Tsalapatis * 2. *Pair ID*:  Each CPU pair is assigned a Pair ID, which is used to access
*f0262b10SEmil Tsalapatis *		  a struct pair_ctx object that is shared between the pair.
*f0262b10SEmil Tsalapatis * 3. *In-pair-index*: An index, 0 or 1, that is assigned to each core in the
*f0262b10SEmil Tsalapatis *		       pair. Each struct pair_ctx has an active_mask field,
*f0262b10SEmil Tsalapatis *		       which is a bitmap used to indicate whether each core
*f0262b10SEmil Tsalapatis *		       in the pair currently has an actively running task.
*f0262b10SEmil Tsalapatis *		       This index specifies which entry in the bitmap corresponds
*f0262b10SEmil Tsalapatis *		       to each CPU in the pair.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * During this initialization, the CPUs are paired according to a "stride" that
*f0262b10SEmil Tsalapatis * may be specified when invoking the user space program that initializes and
*f0262b10SEmil Tsalapatis * loads the scheduler. By default, the stride is 1/2 the total number of CPUs.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Tasks and cgroups
*f0262b10SEmil Tsalapatis * -----------------
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Every cgroup in the system is registered with the scheduler using the
*f0262b10SEmil Tsalapatis * pair_cgroup_init() callback, and every task in the system is associated with
*f0262b10SEmil Tsalapatis * exactly one cgroup. At a high level, the idea with the pair scheduler is to
*f0262b10SEmil Tsalapatis * always schedule tasks from the same cgroup within a given CPU pair. When a
*f0262b10SEmil Tsalapatis * task is enqueued (i.e. passed to the pair_enqueue() callback function), its
*f0262b10SEmil Tsalapatis * cgroup ID is read from its task struct, and then a corresponding queue map
*f0262b10SEmil Tsalapatis * is used to FIFO-enqueue the task for that cgroup.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * If you look through the implementation of the scheduler, you'll notice that
*f0262b10SEmil Tsalapatis * there is quite a bit of complexity involved with looking up the per-cgroup
*f0262b10SEmil Tsalapatis * FIFO queue that we enqueue tasks in. For example, there is a cgrp_q_idx_hash
*f0262b10SEmil Tsalapatis * BPF hash map that is used to map a cgroup ID to a globally unique ID that's
*f0262b10SEmil Tsalapatis * allocated in the BPF program. This is done because we use separate maps to
*f0262b10SEmil Tsalapatis * store the FIFO queue of tasks, and the length of that map, per cgroup. This
*f0262b10SEmil Tsalapatis * complexity is only present because of current deficiencies in BPF that will
*f0262b10SEmil Tsalapatis * soon be addressed. The main point to keep in mind is that newly enqueued
*f0262b10SEmil Tsalapatis * tasks are added to their cgroup's FIFO queue.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Dispatching tasks
*f0262b10SEmil Tsalapatis * -----------------
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * This section will describe how enqueued tasks are dispatched and scheduled.
*f0262b10SEmil Tsalapatis * Tasks are dispatched in pair_dispatch(), and at a high level the workflow is
*f0262b10SEmil Tsalapatis * as follows:
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 1. Fetch the struct pair_ctx for the current CPU. As mentioned above, this is
*f0262b10SEmil Tsalapatis *    the structure that's used to synchronize amongst the two pair CPUs in their
*f0262b10SEmil Tsalapatis *    scheduling decisions. After any of the following events have occurred:
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * - The cgroup's slice run has expired, or
*f0262b10SEmil Tsalapatis * - The cgroup becomes empty, or
*f0262b10SEmil Tsalapatis * - Either CPU in the pair is preempted by a higher priority scheduling class
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * The cgroup transitions to the draining state and stops executing new tasks
*f0262b10SEmil Tsalapatis * from the cgroup.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 2. If the pair is still executing a task, mark the pair_ctx as draining, and
*f0262b10SEmil Tsalapatis *    wait for the pair CPU to be preempted.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 3. Otherwise, if the pair CPU is not running a task, we can move onto
*f0262b10SEmil Tsalapatis *    scheduling new tasks. Pop the next cgroup id from the top_q queue.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 4. Pop a task from that cgroup's FIFO task queue, and begin executing it.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Note again that this scheduling behavior is simple, but the implementation
*f0262b10SEmil Tsalapatis * is complex mostly because this it hits several BPF shortcomings and has to
*f0262b10SEmil Tsalapatis * work around in often awkward ways. Most of the shortcomings are expected to
*f0262b10SEmil Tsalapatis * be resolved in the near future which should allow greatly simplifying this
*f0262b10SEmil Tsalapatis * scheduler.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Dealing with preemption
*f0262b10SEmil Tsalapatis * -----------------------
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * SCX is the lowest priority sched_class, and could be preempted by them at
*f0262b10SEmil Tsalapatis * any time. To address this, the scheduler implements pair_cpu_release() and
*f0262b10SEmil Tsalapatis * pair_cpu_acquire() callbacks which are invoked by the core scheduler when
*f0262b10SEmil Tsalapatis * the scheduler loses and gains control of the CPU respectively.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * In pair_cpu_release(), we mark the pair_ctx as having been preempted, and
*f0262b10SEmil Tsalapatis * then invoke:
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * scx_bpf_kick_cpu(pair_cpu, SCX_KICK_PREEMPT | SCX_KICK_WAIT);
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * This preempts the pair CPU, and waits until it has re-entered the scheduler
*f0262b10SEmil Tsalapatis * before returning. This is necessary to ensure that the higher priority
*f0262b10SEmil Tsalapatis * sched_class that preempted our scheduler does not schedule a task
*f0262b10SEmil Tsalapatis * concurrently with our pair CPU.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * When the CPU is re-acquired in pair_cpu_acquire(), we unmark the preemption
*f0262b10SEmil Tsalapatis * in the pair_ctx, and send another resched IPI to the pair CPU to re-enable
*f0262b10SEmil Tsalapatis * pair scheduling.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
*f0262b10SEmil Tsalapatis * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
*f0262b10SEmil Tsalapatis * Copyright (c) 2022 David Vernet <dvernet@meta.com>
*f0262b10SEmil Tsalapatis */
*f0262b10SEmil Tsalapatis#include <scx/common.bpf.h>
*f0262b10SEmil Tsalapatis#include "scx_pair.h"
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatischar _license[] SEC("license") = "GPL";
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* !0 for veristat, set during init */
*f0262b10SEmil Tsalapatisconst volatile u32 nr_cpu_ids = 1;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* a pair of CPUs stay on a cgroup for this duration */
*f0262b10SEmil Tsalapatisconst volatile u32 pair_batch_dur_ns;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* cpu ID -> pair cpu ID */
*f0262b10SEmil Tsalapatisconst volatile s32 RESIZABLE_ARRAY(rodata, pair_cpu);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* cpu ID -> pair_id */
*f0262b10SEmil Tsalapatisconst volatile u32 RESIZABLE_ARRAY(rodata, pair_id);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* CPU ID -> CPU # in the pair (0 or 1) */
*f0262b10SEmil Tsalapatisconst volatile u32 RESIZABLE_ARRAY(rodata, in_pair_idx);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisstruct pair_ctx {
*f0262b10SEmil Tsalapatis	struct bpf_spin_lock	lock;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* the cgroup the pair is currently executing */
*f0262b10SEmil Tsalapatis	u64			cgid;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* the pair started executing the current cgroup at */
*f0262b10SEmil Tsalapatis	u64			started_at;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* whether the current cgroup is draining */
*f0262b10SEmil Tsalapatis	bool			draining;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* the CPUs that are currently active on the cgroup */
*f0262b10SEmil Tsalapatis	u32			active_mask;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/*
*f0262b10SEmil Tsalapatis	 * the CPUs that are currently preempted and running tasks in a
*f0262b10SEmil Tsalapatis	 * different scheduler.
*f0262b10SEmil Tsalapatis	 */
*f0262b10SEmil Tsalapatis	u32			preempted_mask;
*f0262b10SEmil Tsalapatis};
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisstruct {
*f0262b10SEmil Tsalapatis	__uint(type, BPF_MAP_TYPE_ARRAY);
*f0262b10SEmil Tsalapatis	__type(key, u32);
*f0262b10SEmil Tsalapatis	__type(value, struct pair_ctx);
*f0262b10SEmil Tsalapatis} pair_ctx SEC(".maps");
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* queue of cgrp_q's possibly with tasks on them */
*f0262b10SEmil Tsalapatisstruct {
*f0262b10SEmil Tsalapatis	__uint(type, BPF_MAP_TYPE_QUEUE);
*f0262b10SEmil Tsalapatis	/*
*f0262b10SEmil Tsalapatis	 * Because it's difficult to build strong synchronization encompassing
*f0262b10SEmil Tsalapatis	 * multiple non-trivial operations in BPF, this queue is managed in an
*f0262b10SEmil Tsalapatis	 * opportunistic way so that we guarantee that a cgroup w/ active tasks
*f0262b10SEmil Tsalapatis	 * is always on it but possibly multiple times. Once we have more robust
*f0262b10SEmil Tsalapatis	 * synchronization constructs and e.g. linked list, we should be able to
*f0262b10SEmil Tsalapatis	 * do this in a prettier way but for now just size it big enough.
*f0262b10SEmil Tsalapatis	 */
*f0262b10SEmil Tsalapatis	__uint(max_entries, 4 * MAX_CGRPS);
*f0262b10SEmil Tsalapatis	__type(value, u64);
*f0262b10SEmil Tsalapatis} top_q SEC(".maps");
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* per-cgroup q which FIFOs the tasks from the cgroup */
*f0262b10SEmil Tsalapatisstruct cgrp_q {
*f0262b10SEmil Tsalapatis	__uint(type, BPF_MAP_TYPE_QUEUE);
*f0262b10SEmil Tsalapatis	__uint(max_entries, MAX_QUEUED);
*f0262b10SEmil Tsalapatis	__type(value, u32);
*f0262b10SEmil Tsalapatis};
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/*
*f0262b10SEmil Tsalapatis * Ideally, we want to allocate cgrp_q and cgrq_q_len in the cgroup local
*f0262b10SEmil Tsalapatis * storage; however, a cgroup local storage can only be accessed from the BPF
*f0262b10SEmil Tsalapatis * progs attached to the cgroup. For now, work around by allocating array of
*f0262b10SEmil Tsalapatis * cgrp_q's and then allocating per-cgroup indices.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * Another caveat: It's difficult to populate a large array of maps statically
*f0262b10SEmil Tsalapatis * or from BPF. Initialize it from userland.
*f0262b10SEmil Tsalapatis */
*f0262b10SEmil Tsalapatisstruct {
*f0262b10SEmil Tsalapatis	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
*f0262b10SEmil Tsalapatis	__uint(max_entries, MAX_CGRPS);
*f0262b10SEmil Tsalapatis	__type(key, s32);
*f0262b10SEmil Tsalapatis	__array(values, struct cgrp_q);
*f0262b10SEmil Tsalapatis} cgrp_q_arr SEC(".maps");
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisstatic u64 cgrp_q_len[MAX_CGRPS];
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/*
*f0262b10SEmil Tsalapatis * This and cgrp_q_idx_hash combine into a poor man's IDR. This likely would be
*f0262b10SEmil Tsalapatis * useful to have as a map type.
*f0262b10SEmil Tsalapatis */
*f0262b10SEmil Tsalapatisstatic u32 cgrp_q_idx_cursor;
*f0262b10SEmil Tsalapatisstatic u64 cgrp_q_idx_busy[MAX_CGRPS];
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/*
*f0262b10SEmil Tsalapatis * All added up, the following is what we do:
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 1. When a cgroup is enabled, RR cgroup_q_idx_busy array doing cmpxchg looking
*f0262b10SEmil Tsalapatis *    for a free ID. If not found, fail cgroup creation with -EBUSY.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 2. Hash the cgroup ID to the allocated cgrp_q_idx in the following
*f0262b10SEmil Tsalapatis *    cgrp_q_idx_hash.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * 3. Whenever a cgrp_q needs to be accessed, first look up the cgrp_q_idx from
*f0262b10SEmil Tsalapatis *    cgrp_q_idx_hash and then access the corresponding entry in cgrp_q_arr.
*f0262b10SEmil Tsalapatis *
*f0262b10SEmil Tsalapatis * This is sadly complicated for something pretty simple. Hopefully, we should
*f0262b10SEmil Tsalapatis * be able to simplify in the future.
*f0262b10SEmil Tsalapatis */
*f0262b10SEmil Tsalapatisstruct {
*f0262b10SEmil Tsalapatis	__uint(type, BPF_MAP_TYPE_HASH);
*f0262b10SEmil Tsalapatis	__uint(max_entries, MAX_CGRPS);
*f0262b10SEmil Tsalapatis	__uint(key_size, sizeof(u64));		/* cgrp ID */
*f0262b10SEmil Tsalapatis	__uint(value_size, sizeof(s32));	/* cgrp_q idx */
*f0262b10SEmil Tsalapatis} cgrp_q_idx_hash SEC(".maps");
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis/* statistics */
*f0262b10SEmil Tsalapatisu64 nr_total, nr_dispatched, nr_missing, nr_kicks, nr_preemptions;
*f0262b10SEmil Tsalapatisu64 nr_exps, nr_exp_waits, nr_exp_empty;
*f0262b10SEmil Tsalapatisu64 nr_cgrp_next, nr_cgrp_coll, nr_cgrp_empty;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil TsalapatisUEI_DEFINE(uei);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisvoid BPF_STRUCT_OPS(pair_enqueue, struct task_struct *p, u64 enq_flags)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	struct cgroup *cgrp;
*f0262b10SEmil Tsalapatis	struct cgrp_q *cgq;
*f0262b10SEmil Tsalapatis	s32 pid = p->pid;
*f0262b10SEmil Tsalapatis	u64 cgid;
*f0262b10SEmil Tsalapatis	u32 *q_idx;
*f0262b10SEmil Tsalapatis	u64 *cgq_len;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	__sync_fetch_and_add(&nr_total, 1);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	cgrp = scx_bpf_task_cgroup(p);
*f0262b10SEmil Tsalapatis	cgid = cgrp->kn->id;
*f0262b10SEmil Tsalapatis	bpf_cgroup_release(cgrp);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* find the cgroup's q and push @p into it */
*f0262b10SEmil Tsalapatis	q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
*f0262b10SEmil Tsalapatis	if (!q_idx) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("failed to lookup q_idx for cgroup[%llu]", cgid);
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	cgq = bpf_map_lookup_elem(&cgrp_q_arr, q_idx);
*f0262b10SEmil Tsalapatis	if (!cgq) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("failed to lookup q_arr for cgroup[%llu] q_idx[%u]",
*f0262b10SEmil Tsalapatis			      cgid, *q_idx);
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	if (bpf_map_push_elem(cgq, &pid, 0)) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("cgroup[%llu] queue overflow", cgid);
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* bump q len, if going 0 -> 1, queue cgroup into the top_q */
*f0262b10SEmil Tsalapatis	cgq_len = MEMBER_VPTR(cgrp_q_len, [*q_idx]);
*f0262b10SEmil Tsalapatis	if (!cgq_len) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("MEMBER_VTPR malfunction");
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	if (!__sync_fetch_and_add(cgq_len, 1) &&
*f0262b10SEmil Tsalapatis	    bpf_map_push_elem(&top_q, &cgid, 0)) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("top_q overflow");
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisstatic int lookup_pairc_and_mask(s32 cpu, struct pair_ctx **pairc, u32 *mask)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	u32 *vptr;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	vptr = (u32 *)ARRAY_ELEM_PTR(pair_id, cpu, nr_cpu_ids);
*f0262b10SEmil Tsalapatis	if (!vptr)
*f0262b10SEmil Tsalapatis		return -EINVAL;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	*pairc = bpf_map_lookup_elem(&pair_ctx, vptr);
*f0262b10SEmil Tsalapatis	if (!(*pairc))
*f0262b10SEmil Tsalapatis		return -EINVAL;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	vptr = (u32 *)ARRAY_ELEM_PTR(in_pair_idx, cpu, nr_cpu_ids);
*f0262b10SEmil Tsalapatis	if (!vptr)
*f0262b10SEmil Tsalapatis		return -EINVAL;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	*mask = 1U << *vptr;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	return 0;
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis__attribute__((noinline))
*f0262b10SEmil Tsalapatisstatic int try_dispatch(s32 cpu)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	struct pair_ctx *pairc;
*f0262b10SEmil Tsalapatis	struct bpf_map *cgq_map;
*f0262b10SEmil Tsalapatis	struct task_struct *p;
*f0262b10SEmil Tsalapatis	u64 now = scx_bpf_now();
*f0262b10SEmil Tsalapatis	bool kick_pair = false;
*f0262b10SEmil Tsalapatis	bool expired, pair_preempted;
*f0262b10SEmil Tsalapatis	u32 *vptr, in_pair_mask;
*f0262b10SEmil Tsalapatis	s32 pid, q_idx;
*f0262b10SEmil Tsalapatis	u64 cgid;
*f0262b10SEmil Tsalapatis	int ret;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
*f0262b10SEmil Tsalapatis	if (ret) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("failed to lookup pairc and in_pair_mask for cpu[%d]",
*f0262b10SEmil Tsalapatis			      cpu);
*f0262b10SEmil Tsalapatis		return -ENOENT;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	bpf_spin_lock(&pairc->lock);
*f0262b10SEmil Tsalapatis	pairc->active_mask &= ~in_pair_mask;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	expired = time_before(pairc->started_at + pair_batch_dur_ns, now);
*f0262b10SEmil Tsalapatis	if (expired || pairc->draining) {
*f0262b10SEmil Tsalapatis		u64 new_cgid = 0;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		__sync_fetch_and_add(&nr_exps, 1);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		/*
*f0262b10SEmil Tsalapatis		 * We're done with the current cgid. An obvious optimization
*f0262b10SEmil Tsalapatis		 * would be not draining if the next cgroup is the current one.
*f0262b10SEmil Tsalapatis		 * For now, be dumb and always expire.
*f0262b10SEmil Tsalapatis		 */
*f0262b10SEmil Tsalapatis		pairc->draining = true;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		pair_preempted = pairc->preempted_mask;
*f0262b10SEmil Tsalapatis		if (pairc->active_mask || pair_preempted) {
*f0262b10SEmil Tsalapatis			/*
*f0262b10SEmil Tsalapatis			 * The other CPU is still active, or is no longer under
*f0262b10SEmil Tsalapatis			 * our control due to e.g. being preempted by a higher
*f0262b10SEmil Tsalapatis			 * priority sched_class. We want to wait until this
*f0262b10SEmil Tsalapatis			 * cgroup expires, or until control of our pair CPU has
*f0262b10SEmil Tsalapatis			 * been returned to us.
*f0262b10SEmil Tsalapatis			 *
*f0262b10SEmil Tsalapatis			 * If the pair controls its CPU, and the time already
*f0262b10SEmil Tsalapatis			 * expired, kick.  When the other CPU arrives at
*f0262b10SEmil Tsalapatis			 * dispatch and clears its active mask, it'll push the
*f0262b10SEmil Tsalapatis			 * pair to the next cgroup and kick this CPU.
*f0262b10SEmil Tsalapatis			 */
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_exp_waits, 1);
*f0262b10SEmil Tsalapatis			bpf_spin_unlock(&pairc->lock);
*f0262b10SEmil Tsalapatis			if (expired && !pair_preempted)
*f0262b10SEmil Tsalapatis				kick_pair = true;
*f0262b10SEmil Tsalapatis			goto out_maybe_kick;
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		bpf_spin_unlock(&pairc->lock);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		/*
*f0262b10SEmil Tsalapatis		 * Pick the next cgroup. It'd be easier / cleaner to not drop
*f0262b10SEmil Tsalapatis		 * pairc->lock and use stronger synchronization here especially
*f0262b10SEmil Tsalapatis		 * given that we'll be switching cgroups significantly less
*f0262b10SEmil Tsalapatis		 * frequently than tasks. Unfortunately, bpf_spin_lock can't
*f0262b10SEmil Tsalapatis		 * really protect anything non-trivial. Let's do opportunistic
*f0262b10SEmil Tsalapatis		 * operations instead.
*f0262b10SEmil Tsalapatis		 */
*f0262b10SEmil Tsalapatis		bpf_repeat(BPF_MAX_LOOPS) {
*f0262b10SEmil Tsalapatis			u32 *q_idx;
*f0262b10SEmil Tsalapatis			u64 *cgq_len;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis			if (bpf_map_pop_elem(&top_q, &new_cgid)) {
*f0262b10SEmil Tsalapatis				/* no active cgroup, go idle */
*f0262b10SEmil Tsalapatis				__sync_fetch_and_add(&nr_exp_empty, 1);
*f0262b10SEmil Tsalapatis				return 0;
*f0262b10SEmil Tsalapatis			}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis			q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &new_cgid);
*f0262b10SEmil Tsalapatis			if (!q_idx)
*f0262b10SEmil Tsalapatis				continue;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis			/*
*f0262b10SEmil Tsalapatis			 * This is the only place where empty cgroups are taken
*f0262b10SEmil Tsalapatis			 * off the top_q.
*f0262b10SEmil Tsalapatis			 */
*f0262b10SEmil Tsalapatis			cgq_len = MEMBER_VPTR(cgrp_q_len, [*q_idx]);
*f0262b10SEmil Tsalapatis			if (!cgq_len || !*cgq_len)
*f0262b10SEmil Tsalapatis				continue;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis			/*
*f0262b10SEmil Tsalapatis			 * If it has any tasks, requeue as we may race and not
*f0262b10SEmil Tsalapatis			 * execute it.
*f0262b10SEmil Tsalapatis			 */
*f0262b10SEmil Tsalapatis			bpf_map_push_elem(&top_q, &new_cgid, 0);
*f0262b10SEmil Tsalapatis			break;
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		bpf_spin_lock(&pairc->lock);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		/*
*f0262b10SEmil Tsalapatis		 * The other CPU may already have started on a new cgroup while
*f0262b10SEmil Tsalapatis		 * we dropped the lock. Make sure that we're still draining and
*f0262b10SEmil Tsalapatis		 * start on the new cgroup.
*f0262b10SEmil Tsalapatis		 */
*f0262b10SEmil Tsalapatis		if (pairc->draining && !pairc->active_mask) {
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_cgrp_next, 1);
*f0262b10SEmil Tsalapatis			pairc->cgid = new_cgid;
*f0262b10SEmil Tsalapatis			pairc->started_at = now;
*f0262b10SEmil Tsalapatis			pairc->draining = false;
*f0262b10SEmil Tsalapatis			kick_pair = true;
*f0262b10SEmil Tsalapatis		} else {
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_cgrp_coll, 1);
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	cgid = pairc->cgid;
*f0262b10SEmil Tsalapatis	pairc->active_mask |= in_pair_mask;
*f0262b10SEmil Tsalapatis	bpf_spin_unlock(&pairc->lock);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* again, it'd be better to do all these with the lock held, oh well */
*f0262b10SEmil Tsalapatis	vptr = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
*f0262b10SEmil Tsalapatis	if (!vptr) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("failed to lookup q_idx for cgroup[%llu]", cgid);
*f0262b10SEmil Tsalapatis		return -ENOENT;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis	q_idx = *vptr;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	/* claim one task from cgrp_q w/ q_idx */
*f0262b10SEmil Tsalapatis	bpf_repeat(BPF_MAX_LOOPS) {
*f0262b10SEmil Tsalapatis		u64 *cgq_len, len;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		cgq_len = MEMBER_VPTR(cgrp_q_len, [q_idx]);
*f0262b10SEmil Tsalapatis		if (!cgq_len || !(len = *(volatile u64 *)cgq_len)) {
*f0262b10SEmil Tsalapatis			/* the cgroup must be empty, expire and repeat */
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_cgrp_empty, 1);
*f0262b10SEmil Tsalapatis			bpf_spin_lock(&pairc->lock);
*f0262b10SEmil Tsalapatis			pairc->draining = true;
*f0262b10SEmil Tsalapatis			pairc->active_mask &= ~in_pair_mask;
*f0262b10SEmil Tsalapatis			bpf_spin_unlock(&pairc->lock);
*f0262b10SEmil Tsalapatis			return -EAGAIN;
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		if (__sync_val_compare_and_swap(cgq_len, len, len - 1) != len)
*f0262b10SEmil Tsalapatis			continue;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		break;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	cgq_map = bpf_map_lookup_elem(&cgrp_q_arr, &q_idx);
*f0262b10SEmil Tsalapatis	if (!cgq_map) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("failed to lookup cgq_map for cgroup[%llu] q_idx[%d]",
*f0262b10SEmil Tsalapatis			      cgid, q_idx);
*f0262b10SEmil Tsalapatis		return -ENOENT;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	if (bpf_map_pop_elem(cgq_map, &pid)) {
*f0262b10SEmil Tsalapatis		scx_bpf_error("cgq_map is empty for cgroup[%llu] q_idx[%d]",
*f0262b10SEmil Tsalapatis			      cgid, q_idx);
*f0262b10SEmil Tsalapatis		return -ENOENT;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	p = bpf_task_from_pid(pid);
*f0262b10SEmil Tsalapatis	if (p) {
*f0262b10SEmil Tsalapatis		__sync_fetch_and_add(&nr_dispatched, 1);
*f0262b10SEmil Tsalapatis		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
*f0262b10SEmil Tsalapatis		bpf_task_release(p);
*f0262b10SEmil Tsalapatis	} else {
*f0262b10SEmil Tsalapatis		/* we don't handle dequeues, retry on lost tasks */
*f0262b10SEmil Tsalapatis		__sync_fetch_and_add(&nr_missing, 1);
*f0262b10SEmil Tsalapatis		return -EAGAIN;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisout_maybe_kick:
*f0262b10SEmil Tsalapatis	if (kick_pair) {
*f0262b10SEmil Tsalapatis		s32 *pair = (s32 *)ARRAY_ELEM_PTR(pair_cpu, cpu, nr_cpu_ids);
*f0262b10SEmil Tsalapatis		if (pair) {
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_kicks, 1);
*f0262b10SEmil Tsalapatis			scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT);
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis	return 0;
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisvoid BPF_STRUCT_OPS(pair_dispatch, s32 cpu, struct task_struct *prev)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	bpf_repeat(BPF_MAX_LOOPS) {
*f0262b10SEmil Tsalapatis		if (try_dispatch(cpu) != -EAGAIN)
*f0262b10SEmil Tsalapatis			break;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisvoid BPF_STRUCT_OPS(pair_cpu_acquire, s32 cpu, struct scx_cpu_acquire_args *args)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	int ret;
*f0262b10SEmil Tsalapatis	u32 in_pair_mask;
*f0262b10SEmil Tsalapatis	struct pair_ctx *pairc;
*f0262b10SEmil Tsalapatis	bool kick_pair;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
*f0262b10SEmil Tsalapatis	if (ret)
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	bpf_spin_lock(&pairc->lock);
*f0262b10SEmil Tsalapatis	pairc->preempted_mask &= ~in_pair_mask;
*f0262b10SEmil Tsalapatis	/* Kick the pair CPU, unless it was also preempted. */
*f0262b10SEmil Tsalapatis	kick_pair = !pairc->preempted_mask;
*f0262b10SEmil Tsalapatis	bpf_spin_unlock(&pairc->lock);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	if (kick_pair) {
*f0262b10SEmil Tsalapatis		s32 *pair = (s32 *)ARRAY_ELEM_PTR(pair_cpu, cpu, nr_cpu_ids);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		if (pair) {
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_kicks, 1);
*f0262b10SEmil Tsalapatis			scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT);
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisvoid BPF_STRUCT_OPS(pair_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	int ret;
*f0262b10SEmil Tsalapatis	u32 in_pair_mask;
*f0262b10SEmil Tsalapatis	struct pair_ctx *pairc;
*f0262b10SEmil Tsalapatis	bool kick_pair;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
*f0262b10SEmil Tsalapatis	if (ret)
*f0262b10SEmil Tsalapatis		return;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	bpf_spin_lock(&pairc->lock);
*f0262b10SEmil Tsalapatis	pairc->preempted_mask |= in_pair_mask;
*f0262b10SEmil Tsalapatis	pairc->active_mask &= ~in_pair_mask;
*f0262b10SEmil Tsalapatis	/* Kick the pair CPU if it's still running. */
*f0262b10SEmil Tsalapatis	kick_pair = pairc->active_mask;
*f0262b10SEmil Tsalapatis	pairc->draining = true;
*f0262b10SEmil Tsalapatis	bpf_spin_unlock(&pairc->lock);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	if (kick_pair) {
*f0262b10SEmil Tsalapatis		s32 *pair = (s32 *)ARRAY_ELEM_PTR(pair_cpu, cpu, nr_cpu_ids);
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis		if (pair) {
*f0262b10SEmil Tsalapatis			__sync_fetch_and_add(&nr_kicks, 1);
*f0262b10SEmil Tsalapatis			scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT | SCX_KICK_WAIT);
*f0262b10SEmil Tsalapatis		}
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis	__sync_fetch_and_add(&nr_preemptions, 1);
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatiss32 BPF_STRUCT_OPS(pair_cgroup_init, struct cgroup *cgrp)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	u64 cgid = cgrp->kn->id;
*f0262b10SEmil Tsalapatis	s32 i, q_idx;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	bpf_for(i, 0, MAX_CGRPS) {
*f0262b10SEmil Tsalapatis		q_idx = __sync_fetch_and_add(&cgrp_q_idx_cursor, 1) % MAX_CGRPS;
*f0262b10SEmil Tsalapatis		if (!__sync_val_compare_and_swap(&cgrp_q_idx_busy[q_idx], 0, 1))
*f0262b10SEmil Tsalapatis			break;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis	if (i == MAX_CGRPS)
*f0262b10SEmil Tsalapatis		return -EBUSY;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	if (bpf_map_update_elem(&cgrp_q_idx_hash, &cgid, &q_idx, BPF_ANY)) {
*f0262b10SEmil Tsalapatis		u64 *busy = MEMBER_VPTR(cgrp_q_idx_busy, [q_idx]);
*f0262b10SEmil Tsalapatis		if (busy)
*f0262b10SEmil Tsalapatis			*busy = 0;
*f0262b10SEmil Tsalapatis		return -EBUSY;
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	return 0;
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisvoid BPF_STRUCT_OPS(pair_cgroup_exit, struct cgroup *cgrp)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	u64 cgid = cgrp->kn->id;
*f0262b10SEmil Tsalapatis	s32 *q_idx;
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatis	q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
*f0262b10SEmil Tsalapatis	if (q_idx) {
*f0262b10SEmil Tsalapatis		u64 *busy = MEMBER_VPTR(cgrp_q_idx_busy, [*q_idx]);
*f0262b10SEmil Tsalapatis		if (busy)
*f0262b10SEmil Tsalapatis			*busy = 0;
*f0262b10SEmil Tsalapatis		bpf_map_delete_elem(&cgrp_q_idx_hash, &cgid);
*f0262b10SEmil Tsalapatis	}
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil Tsalapatisvoid BPF_STRUCT_OPS(pair_exit, struct scx_exit_info *ei)
*f0262b10SEmil Tsalapatis{
*f0262b10SEmil Tsalapatis	UEI_RECORD(uei, ei);
*f0262b10SEmil Tsalapatis}
*f0262b10SEmil Tsalapatis
*f0262b10SEmil TsalapatisSCX_OPS_DEFINE(pair_ops,
*f0262b10SEmil Tsalapatis	       .enqueue			= (void *)pair_enqueue,
*f0262b10SEmil Tsalapatis	       .dispatch		= (void *)pair_dispatch,
*f0262b10SEmil Tsalapatis	       .cpu_acquire		= (void *)pair_cpu_acquire,
*f0262b10SEmil Tsalapatis	       .cpu_release		= (void *)pair_cpu_release,
*f0262b10SEmil Tsalapatis	       .cgroup_init		= (void *)pair_cgroup_init,
*f0262b10SEmil Tsalapatis	       .cgroup_exit		= (void *)pair_cgroup_exit,
*f0262b10SEmil Tsalapatis	       .exit			= (void *)pair_exit,
*f0262b10SEmil Tsalapatis	       .name			= "pair");