Home
last modified time | relevance | path

Searched hist:"520 fad2e3206b2476f201a283ecf096cfb671902" (Results 1 – 2 of 2) sorted by relevance

/linux/tools/testing/selftests/bpf/progs/
H A Dtrigger_bench.cdiff 520fad2e3206b2476f201a283ecf096cfb671902 Fri Mar 15 22:33:29 CET 2024 Andrii Nakryiko <andrii@kernel.org> selftests/bpf: scale benchmark counting by using per-CPU counters

When benchmarking with multiple threads (-pN, where N>1), we start
contending on single atomic counter that both BPF trigger benchmarks are
using, as well as "baseline" tests in user space (trig-base and
trig-uprobe-base benchmarks). As such, we start bottlenecking on
something completely irrelevant to benchmark at hand.

Scale counting up by using per-CPU counters on BPF side. On use space
side we do the next best thing: hash thread ID to approximate per-CPU
behavior. It seems to work quite well in practice.

To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
16, and 32 threads:
- trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
- trig-base (get_pgid() syscall, atomic counter in user-space);
- trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU
counter on BPF side).

Command used:

for b in uprobe-base base fentry; do \
for p in 1 2 4 8 16 32; do \
printf "%-11s %2d: %s\n" $b $p \
"$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
done; \
done

Before these changes, aggregate throughput across all threads doesn't
scale well with number of threads, it actually even falls sharply for
uprobe-base due to a very high contention:

uprobe-base 1: 138.998 ± 0.650M/s
uprobe-base 2: 70.526 ± 1.147M/s
uprobe-base 4: 63.114 ± 0.302M/s
uprobe-base 8: 54.177 ± 0.138M/s
uprobe-base 16: 45.439 ± 0.057M/s
uprobe-base 32: 37.163 ± 0.242M/s
base 1: 16.940 ± 0.182M/s
base 2: 19.231 ± 0.105M/s
base 4: 21.479 ± 0.038M/s
base 8: 23.030 ± 0.037M/s
base 16: 22.034 ± 0.004M/s
base 32: 18.152 ± 0.013M/s
fentry 1: 14.794 ± 0.054M/s
fentry 2: 17.341 ± 0.055M/s
fentry 4: 23.792 ± 0.024M/s
fentry 8: 21.557 ± 0.047M/s
fentry 16: 21.121 ± 0.004M/s
fentry 32: 17.067 ± 0.023M/s

After these changes, we see almost perfect linear scaling, as expected.
The sub-linear scaling when going from 8 to 16 threads is interesting
and consistent on my test machine, but I haven't investigated what is
causing it this peculiar slowdown (across all benchmarks, could be due
to hyperthreading effects, not sure).

uprobe-base 1: 139.980 ± 0.648M/s
uprobe-base 2: 270.244 ± 0.379M/s
uprobe-base 4: 532.044 ± 1.519M/s
uprobe-base 8: 1004.571 ± 3.174M/s
uprobe-base 16: 1720.098 ± 0.744M/s
uprobe-base 32: 3506.659 ± 8.549M/s
base 1: 16.869 ± 0.071M/s
base 2: 33.007 ± 0.092M/s
base 4: 64.670 ± 0.203M/s
base 8: 121.969 ± 0.210M/s
base 16: 207.832 ± 0.112M/s
base 32: 424.227 ± 1.477M/s
fentry 1: 14.777 ± 0.087M/s
fentry 2: 28.575 ± 0.146M/s
fentry 4: 56.234 ± 0.176M/s
fentry 8: 106.095 ± 0.385M/s
fentry 16: 181.440 ± 0.032M/s
fentry 32: 369.131 ± 0.693M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
/linux/tools/testing/selftests/bpf/benchs/
H A Dbench_trigger.cdiff 520fad2e3206b2476f201a283ecf096cfb671902 Fri Mar 15 22:33:29 CET 2024 Andrii Nakryiko <andrii@kernel.org> selftests/bpf: scale benchmark counting by using per-CPU counters

When benchmarking with multiple threads (-pN, where N>1), we start
contending on single atomic counter that both BPF trigger benchmarks are
using, as well as "baseline" tests in user space (trig-base and
trig-uprobe-base benchmarks). As such, we start bottlenecking on
something completely irrelevant to benchmark at hand.

Scale counting up by using per-CPU counters on BPF side. On use space
side we do the next best thing: hash thread ID to approximate per-CPU
behavior. It seems to work quite well in practice.

To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
16, and 32 threads:
- trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
- trig-base (get_pgid() syscall, atomic counter in user-space);
- trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU
counter on BPF side).

Command used:

for b in uprobe-base base fentry; do \
for p in 1 2 4 8 16 32; do \
printf "%-11s %2d: %s\n" $b $p \
"$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
done; \
done

Before these changes, aggregate throughput across all threads doesn't
scale well with number of threads, it actually even falls sharply for
uprobe-base due to a very high contention:

uprobe-base 1: 138.998 ± 0.650M/s
uprobe-base 2: 70.526 ± 1.147M/s
uprobe-base 4: 63.114 ± 0.302M/s
uprobe-base 8: 54.177 ± 0.138M/s
uprobe-base 16: 45.439 ± 0.057M/s
uprobe-base 32: 37.163 ± 0.242M/s
base 1: 16.940 ± 0.182M/s
base 2: 19.231 ± 0.105M/s
base 4: 21.479 ± 0.038M/s
base 8: 23.030 ± 0.037M/s
base 16: 22.034 ± 0.004M/s
base 32: 18.152 ± 0.013M/s
fentry 1: 14.794 ± 0.054M/s
fentry 2: 17.341 ± 0.055M/s
fentry 4: 23.792 ± 0.024M/s
fentry 8: 21.557 ± 0.047M/s
fentry 16: 21.121 ± 0.004M/s
fentry 32: 17.067 ± 0.023M/s

After these changes, we see almost perfect linear scaling, as expected.
The sub-linear scaling when going from 8 to 16 threads is interesting
and consistent on my test machine, but I haven't investigated what is
causing it this peculiar slowdown (across all benchmarks, could be due
to hyperthreading effects, not sure).

uprobe-base 1: 139.980 ± 0.648M/s
uprobe-base 2: 270.244 ± 0.379M/s
uprobe-base 4: 532.044 ± 1.519M/s
uprobe-base 8: 1004.571 ± 3.174M/s
uprobe-base 16: 1720.098 ± 0.744M/s
uprobe-base 32: 3506.659 ± 8.549M/s
base 1: 16.869 ± 0.071M/s
base 2: 33.007 ± 0.092M/s
base 4: 64.670 ± 0.203M/s
base 8: 121.969 ± 0.210M/s
base 16: 207.832 ± 0.112M/s
base 32: 424.227 ± 1.477M/s
fentry 1: 14.777 ± 0.087M/s
fentry 2: 28.575 ± 0.146M/s
fentry 4: 56.234 ± 0.176M/s
fentry 8: 106.095 ± 0.385M/s
fentry 16: 181.440 ± 0.032M/s
fentry 32: 369.131 ± 0.693M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>