985d0681 | 26-Mar-2024 | Andrii Nakryiko <andrii@kernel.org>
selftests/bpf: add batched tp/raw_tp/fmodret tests
Utilize the bpf_modify_return_test_tp() kfunc as a fast way to trigger tp/raw_tp/fmodret programs from another BPF program, which gives us batched benchmarks comparable to the (batched) kprobe/fentry benchmarks.
We don't switch the kprobe/fentry batched benchmarks to this kfunc, so that the bench tool remains usable on older kernels as well.
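As an editorial illustration, a minimal sketch of what such a kfunc-driven trigger program could look like (the kfunc's exact signature and the batch_iters variable are assumptions, not the literal selftest code):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  /* kfunc declaration; the exact signature is an assumption */
  extern int bpf_modify_return_test_tp(int nonzero_ret) __ksym;

  /* hypothetical batch size; rodata, so a known constant to the verifier */
  const volatile int batch_iters = 100;

  SEC("raw_tp")
  int trigger_driver_kfunc(void *ctx)
  {
          int i;

          /* each call fires the tp/raw_tp/fmod_ret attach point, so one
           * test_run of this driver yields batch_iters benchmarked runs
           */
          for (i = 0; i < batch_iters; i++)
                  bpf_modify_return_test_tp(0);
          return 0;
  }

  char _license[] SEC("license") = "GPL";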
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
b4ccf915 | 26-Mar-2024 | Andrii Nakryiko <andrii@kernel.org>
selftests/bpf: lazy-load trigger bench BPF programs
Instead of front-loading all possible benchmarking BPF programs for trigger benchmarks, explicitly specify which BPF program each benchmark uses and load only that one.
This allows us to be more flexible in supporting older kernels, where some program types might be impossible to load (e.g., those that rely on a newly added kfunc).
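A sketch of the lazy-loading approach using libbpf's autoload API (the skeleton and function names here are illustrative, not the literal bench code):

  #include <string.h>
  #include <bpf/libbpf.h>
  #include "trigger_bench.skel.h" /* hypothetical skeleton header */

  /* Disable autoload for everything except the one program a given
   * benchmark needs, so a program type unsupported on older kernels
   * cannot fail the whole object load.
   */
  static struct trigger_bench *load_single_prog(const char *prog_name)
  {
          struct trigger_bench *skel = trigger_bench__open();
          struct bpf_program *prog;

          if (!skel)
                  return NULL;

          bpf_object__for_each_program(prog, skel->obj)
                  bpf_program__set_autoload(prog,
                          strcmp(bpf_program__name(prog), prog_name) == 0);

          if (trigger_bench__load(skel)) {
                  trigger_bench__destroy(skel);
                  return NULL;
          }
          return skel;
  }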
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
208c4391 | 26-Mar-2024 | Andrii Nakryiko <andrii@kernel.org>
selftests/bpf: remove syscall-driven benchs, keep syscall-count only
Remove "legacy" benchmarks triggered by syscalls in favor of newly added in-kernel/batched benchmarks. Drop -batched suffix now a
selftests/bpf: remove syscall-driven benchs, keep syscall-count only
Remove "legacy" benchmarks triggered by syscalls in favor of newly added in-kernel/batched benchmarks. Drop -batched suffix now as well. Next patch will restore "feature parity" by adding back tp/raw_tp/fmodret benchmarks based on in-kernel kfunc approach.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7df4e597 | 26-Mar-2024 | Andrii Nakryiko <andrii@kernel.org>
selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks
Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping between one syscall execution and one BPF program run. While we use the fast get_pgid() syscall, syscall overhead can still be non-trivial.
This patch adds a set of kprobe/fentry benchmarks that significantly amortizes the cost of the syscall relative to the actual BPF triggering overhead. We do this by employing the BPF_PROG_TEST_RUN command to trigger a "driver" raw_tp program, which runs a tight parameterized loop calling a cheap BPF helper (bpf_get_numa_node_id()), to which the kprobe/fentry programs under benchmark are attached.
This way one bpf() syscall causes N executions of the BPF program being benchmarked. N defaults to 100, but can be adjusted with the --trig-batch-iters CLI argument.
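For illustration, a minimal sketch of such a BPF-side "driver" program (names and the batch_iters variable are assumptions, not the literal selftest code):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  /* hypothetical batch size, corresponding to --trig-batch-iters */
  const volatile int batch_iters = 100;

  SEC("raw_tp")
  int trigger_driver(void *ctx)
  {
          int i;

          /* benchmarked kprobe/fentry programs are attached to
           * bpf_get_numa_node_id() and run once per loop iteration
           */
          for (i = 0; i < batch_iters; i++)
                  (void)bpf_get_numa_node_id();
          return 0;
  }

  char _license[] SEC("license") = "GPL";

User space then issues one bpf_prog_test_run_opts() call on this driver's prog fd per batch, so a single bpf() syscall yields batch_iters runs of the benchmarked program.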
For comparison, we also implement a new baseline program that, instead of triggering another BPF program, just does N atomic per-CPU counter increments, establishing the upper limit for all other types of programs within this batched benchmarking setup.
Taking the final set of benchmarks added in this patch set (including tp/raw_tp/fmodret, added in a later patch), and keeping the "legacy" syscall-driven benchmarks for now, we can capture all triggering benchmarks in one place for comparison, before we remove the legacy ones (and rename xxx-batched to just xxx):
$ benchs/run_bench_trigger.sh
usermode-count        :   79.500 ± 0.024M/s
kernel-count          :   49.949 ± 0.081M/s
syscall-count         :    9.009 ± 0.007M/s

fentry-batch          :   31.002 ± 0.015M/s
fexit-batch           :   20.372 ± 0.028M/s
fmodret-batch         :   21.651 ± 0.659M/s
rawtp-batch           :   36.775 ± 0.264M/s
tp-batch              :   19.411 ± 0.248M/s
kprobe-batch          :   12.949 ± 0.220M/s
kprobe-multi-batch    :   15.400 ± 0.007M/s
kretprobe-batch       :    5.559 ± 0.011M/s
kretprobe-multi-batch :    5.861 ± 0.003M/s

fentry-legacy         :    8.329 ± 0.004M/s
fexit-legacy          :    6.239 ± 0.003M/s
fmodret-legacy        :    6.595 ± 0.001M/s
rawtp-legacy          :    8.305 ± 0.004M/s
tp-legacy             :    6.382 ± 0.001M/s
kprobe-legacy         :    5.528 ± 0.003M/s
kprobe-multi-legacy   :    5.864 ± 0.022M/s
kretprobe-legacy      :    3.081 ± 0.001M/s
kretprobe-multi-legacy:    3.193 ± 0.001M/s
Note how the xxx-batch variants measure significantly higher throughput, even though the in-kernel overhead is exactly the same. As such, results can be compared only between benchmarks of the same kind (syscall-driven vs batched):
fentry-legacy         :    8.329 ± 0.004M/s
fentry-batch          :   31.002 ± 0.015M/s

kprobe-multi-legacy   :    5.864 ± 0.022M/s
kprobe-multi-batch    :   15.400 ± 0.007M/s
Note also that syscall-count sets a theoretical limit for syscall-triggered benchmarks, while kernel-count sets a similar limit for the batched variants. usermode-count is the happy and unachievable case of user space counting without doing any syscalls, and is mostly a measure of CPU speed for such a trivial benchmark.
As mentioned, tp/raw_tp/fmodret require a kernel-side kfunc to produce a similar benchmark, which we address in a separate patch.
Note that run_bench_trigger.sh allows overriding the list of benchmarks to run, which is very useful for performance work.
Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
af8d27bf | 22-Mar-2024 | Jiri Olsa <jolsa@kernel.org>
selftests/bpf: Mark uprobe trigger functions with nocf_check attribute
Some distros seem to enable -fcf-protection=branch by default, which breaks our setup by placing an endbr64 instruction at the first instruction of the uprobe trigger functions.
Mark them with the nocf_check attribute to skip that.
Ignore the unknown-attribute warning in gcc for bench objects, because nocf_check can be used only when -fcf-protection=branch is enabled; otherwise we get a warning that breaks compilation.
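A sketch of the attribute usage (the function name is illustrative; the warning mentioned above is gcc's -Wattributes, silenced with -Wno-attributes for bench objects):

  /* nocf_check is only valid when -fcf-protection=branch is enabled;
   * gcc warns about an unknown/ignored attribute otherwise
   */
  #define __nocf_check __attribute__((nocf_check))

  /* keep the first instruction uprobe-friendly: no endbr64 at entry */
  __nocf_check void uprobe_trigger_func(void)
  {
          asm volatile ("");
  }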
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240322134936.1075395-1-jolsa@kernel.org
1684d6eb | 22-Mar-2024 | Alan Maguire <alan.maguire@oracle.com>
selftests/bpf: Use syscall(SYS_gettid) instead of gettid() wrapper in bench
With glibc 2.28, selftests compilation fails for benchs/bench_trigger.c:
benchs/bench_trigger.c: In function ‘inc_counter’:
benchs/bench_trigger.c:25:23: error: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
   25 |         tid = gettid();
      |               ^~~~~~
      |               getgid
cc1: all warnings being treated as errors
It appears that support for the gettid() wrapper varies across glibc versions, so it is safer to use syscall(SYS_gettid) instead.
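A minimal sketch of the replacement (the glibc gettid() wrapper only appeared in glibc 2.30):

  #include <sys/syscall.h>
  #include <unistd.h>

  /* call the syscall directly instead of relying on a glibc wrapper */
  static inline pid_t sys_gettid(void)
  {
          return (pid_t)syscall(SYS_gettid);
  }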
Fixes: 520fad2e3206 ("selftests/bpf: scale benchmark counting by using per-CPU counters")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240322095728.95671-1-alan.maguire@oracle.com
520fad2e | 15-Mar-2024 | Andrii Nakryiko <andrii@kernel.org>
selftests/bpf: scale benchmark counting by using per-CPU counters
When benchmarking with multiple threads (-pN, where N>1), we start contending on the single atomic counter that both the BPF trigger benchmarks and the "baseline" user-space tests (trig-base and trig-uprobe-base benchmarks) use. As such, we start bottlenecking on something completely irrelevant to the benchmark at hand.
Scale counting up by using per-CPU counters on the BPF side. On the user-space side we do the next best thing: hash the thread ID to approximate per-CPU behavior. It seems to work quite well in practice.
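A sketch of the user-space side (sizes and names are illustrative, not the literal bench code): each thread hashes its tid into a cache-line-padded counter slot, approximating the per-CPU counters used on the BPF side.

  #include <sys/syscall.h>
  #include <unistd.h>

  #define NR_CNTS 256

  /* pad each counter to its own cache line to avoid false sharing */
  struct counter {
          long value;
  } __attribute__((aligned(128)));

  static struct counter counts[NR_CNTS];

  static void inc_counter(void)
  {
          static __thread int slot = -1;

          if (slot < 0)
                  slot = syscall(SYS_gettid) % NR_CNTS;
          /* mostly uncontended: threads land in distinct slots */
          __atomic_add_fetch(&counts[slot].value, 1, __ATOMIC_RELAXED);
  }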
To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8, 16, and 32 threads:
  - trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
  - trig-base (get_pgid() syscall, atomic counter in user-space);
  - trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU counter on BPF side).
Command used:
for b in uprobe-base base fentry; do \
    for p in 1 2 4 8 16 32; do \
        printf "%-11s %2d: %s\n" $b $p \
            "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
    done; \
done
Before these changes, aggregate throughput across all threads doesn't scale well with the number of threads; it actually falls sharply for uprobe-base due to very high contention:
uprobe-base  1:  138.998 ± 0.650M/s
uprobe-base  2:   70.526 ± 1.147M/s
uprobe-base  4:   63.114 ± 0.302M/s
uprobe-base  8:   54.177 ± 0.138M/s
uprobe-base 16:   45.439 ± 0.057M/s
uprobe-base 32:   37.163 ± 0.242M/s
base         1:   16.940 ± 0.182M/s
base         2:   19.231 ± 0.105M/s
base         4:   21.479 ± 0.038M/s
base         8:   23.030 ± 0.037M/s
base        16:   22.034 ± 0.004M/s
base        32:   18.152 ± 0.013M/s
fentry       1:   14.794 ± 0.054M/s
fentry       2:   17.341 ± 0.055M/s
fentry       4:   23.792 ± 0.024M/s
fentry       8:   21.557 ± 0.047M/s
fentry      16:   21.121 ± 0.004M/s
fentry      32:   17.067 ± 0.023M/s
After these changes, we see almost perfect linear scaling, as expected. The sub-linear scaling when going from 8 to 16 threads is interesting and consistent on my test machine, but I haven't investigated what is causing this peculiar slowdown (it occurs across all benchmarks; could be due to hyperthreading effects, not sure).
uprobe-base  1:  139.980 ± 0.648M/s
uprobe-base  2:  270.244 ± 0.379M/s
uprobe-base  4:  532.044 ± 1.519M/s
uprobe-base  8: 1004.571 ± 3.174M/s
uprobe-base 16: 1720.098 ± 0.744M/s
uprobe-base 32: 3506.659 ± 8.549M/s
base         1:   16.869 ± 0.071M/s
base         2:   33.007 ± 0.092M/s
base         4:   64.670 ± 0.203M/s
base         8:  121.969 ± 0.210M/s
base        16:  207.832 ± 0.112M/s
base        32:  424.227 ± 1.477M/s
fentry       1:   14.777 ± 0.087M/s
fentry       2:   28.575 ± 0.146M/s
fentry       4:   56.234 ± 0.176M/s
fentry       8:  106.095 ± 0.385M/s
fentry      16:  181.440 ± 0.032M/s
fentry      32:  369.131 ± 0.693M/s
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
856fe03d | 07-Jul-2023 | Lu Hongfei <luhongfei@vivo.com>
selftests/bpf: Correct two typos
When wrapping code, using ';' is better than using ',', and is more in line with the coding habits of most engineers.
Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230707081253.34638-1-luhongfei@vivo.com
970308a7 | 13-Jun-2023 | Hou Tao <houtao1@huawei.com>
selftests/bpf: Set the default value of consumer_cnt as 0
Considering that only bench_ringbufs.c supports consumers, set the default value of consumer_cnt to 0. After that, update the validity check of consumer_cnt, remove the unused consumer_thread code snippets, and set consumer_cnt to 1 in run_bench_ringbufs.sh accordingly.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230613080921.1623219-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>