.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

The ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
  ``getsockopt`` system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
  ``setsockopt`` system call.

The context (``struct bpf_sockopt``) has the associated socket (``sk``) and
all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handling of
sockopt and it has writable context: it can modify the supplied arguments
before passing them down to the kernel. This hook has access to the cgroup
and socket local storage.

If a BPF program sets ``optlen`` to -1, control is returned to
userspace after all other BPF programs in the cgroup chain finish
(i.e. kernel ``setsockopt`` handling will *not* be executed).

Note that ``optlen`` cannot be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value will
trigger ``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success, continue with next BPF program in the cgroup chain.

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it's interested in whatever the kernel has returned. The BPF hook can
override the values above, adjust ``optlen`` and reset ``retval`` to 0. If
``optlen`` has been increased above the initial ``getsockopt`` value (i.e.
the userspace buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note that the only acceptable values to set ``retval`` to are 0 and the
original value that the kernel returned. Any other value will trigger
``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
  ``retval`` from the syscall (note that this can be overwritten by
  the BPF program from the parent cgroup).

Cgroup Inheritance
==================

Suppose there is the following cgroup hierarchy where each cgroup
has ``BPF_CGROUP_GETSOCKOPT`` attached with the
``BPF_F_ALLOW_MULTI`` flag::

    A (root, parent)
     \
      B (child)

When the application calls the ``getsockopt`` syscall from the cgroup B,
the programs are executed from the bottom up: B, A. The first program
(B) sees the result of the kernel's ``getsockopt``. It can optionally
adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that,
control is passed to the second program (A), which sees the
same context as B, including any potential modifications.

The same applies to ``BPF_CGROUP_SETSOCKOPT``: if the program is attached
to A and B, the trigger order is B, then A. If B makes any changes
to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
then the next program in the chain (A) will see those changes,
*not* the original input ``setsockopt`` arguments. The potentially
modified values are then passed down to the kernel.

Large optval
============

When the ``optval`` is greater than ``PAGE_SIZE``, the BPF program
can access only the first ``PAGE_SIZE`` of that data. So it has two options:

* Set ``optlen`` to zero, which indicates that the kernel should
  use the original buffer from the userspace. Any modifications
  done by the BPF program to the ``optval`` are ignored.
* Set ``optlen`` to a value less than ``PAGE_SIZE``, which
  indicates that the kernel should use BPF's trimmed ``optval``.

When the BPF program returns with the ``optlen`` greater than
``PAGE_SIZE``, the userspace will receive original kernel
buffers without any modifications that the BPF program might have
applied.

Example
=======

The recommended way to handle BPF programs is as follows:

.. code-block:: c

    SEC("cgroup/getsockopt")
    int getsockopt(struct bpf_sockopt *ctx)
    {
            /* Custom socket option. */
            if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
                    ctx->retval = 0;
                    optval[0] = ...;
                    ctx->optlen = 1;
                    return 1;
            }

            /* Modify kernel's socket option. */
            if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
                    ctx->retval = 0;
                    optval[0] = ...;
                    ctx->optlen = 1;
                    return 1;
            }

            /* For optval larger than PAGE_SIZE, use the kernel's buffer. */
            if (ctx->optlen > PAGE_SIZE)
                    ctx->optlen = 0;

            return 1;
    }

    SEC("cgroup/setsockopt")
    int setsockopt(struct bpf_sockopt *ctx)
    {
            /* Custom socket option. */
            if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
                    /* do something */
                    ctx->optlen = -1;
                    return 1;
            }

            /* Modify kernel's socket option. */
            if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
                    optval[0] = ...;
                    return 1;
            }

            /* For optval larger than PAGE_SIZE, use the kernel's buffer. */
            if (ctx->optlen > PAGE_SIZE)
                    ctx->optlen = 0;

            return 1;
    }

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of a BPF program that handles socket options.
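
The ``optlen`` and ``retval`` rules described in the sections above can be
summarized as a small decision table. The following is a plain userspace
sketch of those rules, not kernel code: the ``sockopt_verdict`` enum and
the ``model_setsockopt`` and ``model_getsockopt_retval`` functions are
hypothetical names introduced here for illustration only, and the sketch
assumes the user-supplied ``optlen`` does not exceed ``PAGE_SIZE`` (the
large ``optval`` case is not modelled).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical verdicts for illustration; not part of any kernel API. */
enum sockopt_verdict {
	VERDICT_RUN_KERNEL,	/* continue into the kernel setsockopt handler */
	VERDICT_BYPASS_KERNEL,	/* optlen == -1: skip kernel handling */
	VERDICT_EFAULT,		/* optlen outside the allowed range */
	VERDICT_EPERM,		/* the BPF program returned 0 */
};

/*
 * Model of the documented BPF_CGROUP_SETSOCKOPT rules.
 *
 * prog_ret:    return value of the BPF program (0 or 1)
 * user_optlen: optlen supplied by userspace (assumed <= PAGE_SIZE)
 * bpf_optlen:  ctx->optlen as left by the BPF program chain
 */
static enum sockopt_verdict
model_setsockopt(int prog_ret, int user_optlen, int bpf_optlen)
{
	assert(prog_ret == 0 || prog_ret == 1);

	if (prog_ret == 0)
		return VERDICT_EPERM;
	if (bpf_optlen == -1)
		return VERDICT_BYPASS_KERNEL;
	/* optlen may only be decreased, never increased. */
	if (bpf_optlen < -1 || bpf_optlen > user_optlen)
		return VERDICT_EFAULT;
	return VERDICT_RUN_KERNEL;
}

/*
 * Model of the documented BPF_CGROUP_GETSOCKOPT retval rule: the program
 * may leave retval unchanged or reset it to 0; anything else is EFAULT.
 */
static int model_getsockopt_retval(int kernel_retval, int bpf_retval)
{
	if (bpf_retval != 0 && bpf_retval != kernel_retval)
		return -EFAULT;
	return bpf_retval;
}
```

For instance, under this model a program that shrinks a 16-byte option to
8 bytes lets the kernel handler run, while one that grows it to 32 bytes
yields ``EFAULT``. Likewise, a ``getsockopt`` program handling a custom
option can turn the kernel's ``-ENOPROTOOPT`` into success by resetting
``retval`` to 0, but cannot substitute a different error code.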