.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

The ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
  ``getsockopt`` system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
  ``setsockopt`` system call.

The context (``struct bpf_sockopt``) holds the associated socket (``sk``) and
all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handles the
socket option and has a writable context: it can modify the supplied
arguments before passing them down to the kernel. This hook has access to
the cgroup and socket local storage.

If a BPF program sets ``optlen`` to -1, control is returned to userspace
after all other BPF programs in the cgroup chain finish (i.e. kernel
``setsockopt`` handling is *not* executed).

Note that ``optlen`` cannot be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value triggers
``EFAULT``.
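
As a rough illustration of the ``optlen`` rule above, the following userspace
C sketch classifies the value that the BPF chain leaves in ``optlen``. It is
a model only, not kernel code: ``check_setsockopt_optlen`` and the
``VERDICT_*`` names are hypothetical, and the large ``optval`` case is
deliberately left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome codes for this sketch (not kernel identifiers). */
enum sockopt_verdict {
	VERDICT_RUN_KERNEL,	/* hand the (possibly trimmed) arguments to the kernel */
	VERDICT_BYPASS_KERNEL,	/* optlen == -1: skip kernel setsockopt handling */
};

/*
 * user_optlen: length supplied by userspace.
 * bpf_optlen:  value found in ctx->optlen after the BPF chain has run.
 * Returns one of the verdicts above, or -EFAULT for a disallowed value.
 */
int check_setsockopt_optlen(int user_optlen, int bpf_optlen)
{
	/* Negative user lengths are outside the scope of this sketch. */
	assert(user_optlen >= 0);

	if (bpf_optlen == -1)
		return VERDICT_BYPASS_KERNEL;

	/* Neither growing past the user value nor values below -1 are allowed. */
	if (bpf_optlen < -1 || bpf_optlen > user_optlen)
		return -EFAULT;

	return VERDICT_RUN_KERNEL;
}
```

With a 4-byte user buffer, an ``optlen`` of 1 or 4 reaches the kernel handler,
-1 skips it, and 8 yields ``-EFAULT``.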

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success, continue with the next BPF program in the cgroup chain.

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it is interested in what the kernel has returned. The BPF hook can
override the values above, adjust ``optlen`` and reset ``retval`` to 0. If
``optlen`` has been increased above the initial ``getsockopt`` value (i.e.
the userspace buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note that the only acceptable values for ``retval`` are 0 and the
original value that the kernel returned. Any other value triggers
``EFAULT``.
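
The ``retval`` and ``optlen`` constraints above can be sketched as a small
userspace C model. This is only an illustration under stated assumptions:
``getsockopt_result`` is a hypothetical helper, ``retval`` is assumed to hold
0 or a negative errno value, and the model covers nothing beyond the two
rules described in this section.

```c
#include <assert.h>
#include <errno.h>

/*
 * kernel_retval: what the kernel getsockopt handler returned.
 * bpf_retval:    ctx->retval after the BPF program has run.
 * user_optlen:   buffer length supplied by userspace.
 * bpf_optlen:    ctx->optlen after the BPF program has run.
 * Returns the value the syscall would report in this model.
 */
int getsockopt_result(int kernel_retval, int bpf_retval,
		      int user_optlen, int bpf_optlen)
{
	/* Negative lengths are outside the scope of this sketch. */
	assert(user_optlen >= 0 && bpf_optlen >= 0);

	/* optlen grew past the user buffer: the buffer is too small. */
	if (bpf_optlen > user_optlen)
		return -EFAULT;

	/* retval may only keep the kernel's value or be reset to 0. */
	if (bpf_retval != 0 && bpf_retval != kernel_retval)
		return -EFAULT;

	return bpf_retval;
}
```

For example, a program that implements a custom option may turn a kernel
error into success by resetting ``retval`` to 0, but it cannot substitute a
different error code.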

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
  ``retval`` from the syscall (note that this can be overwritten by
  the BPF program from the parent cgroup).

Cgroup Inheritance
==================

Suppose there is the following cgroup hierarchy, where each cgroup
has a ``BPF_CGROUP_GETSOCKOPT`` program attached with the
``BPF_F_ALLOW_MULTI`` flag::

  A (root, parent)
   \
    B (child)

When an application in cgroup B calls the ``getsockopt`` syscall,
the programs are executed from the bottom up: B, A. The first program
(B) sees the result of the kernel's ``getsockopt``. It can optionally
adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that,
control is passed to the second program (A), which sees the
same context as B, including any potential modifications.

The same applies to ``BPF_CGROUP_SETSOCKOPT``: if the program is attached
to A and B, the trigger order is B, then A. If B makes any changes
to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
then the next program in the chain (A) will see those changes,
*not* the original input ``setsockopt`` arguments. The potentially
modified values are then passed down to the kernel.
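
The bottom-up ordering described above can be mimicked in plain userspace C.
The sketch below is purely illustrative: ``struct sockopt_ctx`` is a
simplified stand-in for ``struct bpf_sockopt``, and ``prog_a``, ``prog_b``
and ``run_chain`` are hypothetical names. It also assumes that every program
in the chain runs and that the verdicts are combined; the text above does not
spell that detail out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct bpf_sockopt; only the fields used here. */
struct sockopt_ctx {
	int level;
	int optname;
	int optlen;
	int retval;
};

typedef int (*sockopt_prog)(struct sockopt_ctx *ctx);

/* What the parent's program observed; -1 means "not run yet". */
int optlen_seen_by_a = -1;

/* Hypothetical program on child cgroup B: trims optlen, resets retval. */
int prog_b(struct sockopt_ctx *ctx)
{
	ctx->optlen = 1;
	ctx->retval = 0;
	return 1;
}

/* Hypothetical program on parent cgroup A: records the optlen it sees. */
int prog_a(struct sockopt_ctx *ctx)
{
	optlen_seen_by_a = ctx->optlen;
	return 1;
}

/*
 * Run the programs in array order (child first, then parent) on one shared
 * context. Returns 1 if every program returned 1, otherwise 0.
 */
int run_chain(sockopt_prog *progs, size_t n, struct sockopt_ctx *ctx)
{
	int allow = 1;

	assert(progs != NULL && ctx != NULL);

	for (size_t i = 0; i < n; i++) {
		if (!progs[i](ctx))
			allow = 0;
	}
	return allow;
}
```

Running the chain ``{ prog_b, prog_a }`` on a context whose ``optlen`` starts
at 4 leaves ``optlen_seen_by_a`` at 1, because A observes B's modification
rather than the original value.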

Large optval
============
When the ``optval`` is larger than ``PAGE_SIZE``, the BPF program
can access only the first ``PAGE_SIZE`` bytes of that data. So it has two
options:

* Set ``optlen`` to zero, which indicates that the kernel should
  use the original buffer from the userspace. Any modifications
  done by the BPF program to the ``optval`` are ignored.
* Set ``optlen`` to a value less than ``PAGE_SIZE``, which
  indicates that the kernel should use BPF's trimmed ``optval``.

When the BPF program returns with ``optlen`` greater than
``PAGE_SIZE``, userspace receives the original kernel
buffers without any modifications that the BPF program might have
applied.

Example
=======

The recommended way to handle socket options in BPF programs is as follows:

.. code-block:: c

	SEC("cgroup/getsockopt")
	int getsockopt(struct bpf_sockopt *ctx)
	{
		__u8 *optval_end = ctx->optval_end;
		__u8 *optval = ctx->optval;

		/* Custom socket option. */
		if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
			/* Bounds check before touching optval. */
			if (optval + 1 > optval_end)
				return 0;
			ctx->retval = 0;
			optval[0] = ...;
			ctx->optlen = 1;
			return 1;
		}

		/* Modify kernel's socket option. */
		if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
			/* Bounds check before touching optval. */
			if (optval + 1 > optval_end)
				return 0;
			ctx->retval = 0;
			optval[0] = ...;
			ctx->optlen = 1;
			return 1;
		}

		/* For optval larger than PAGE_SIZE, use the kernel's buffer. */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;

		return 1;
	}

	SEC("cgroup/setsockopt")
	int setsockopt(struct bpf_sockopt *ctx)
	{
		__u8 *optval_end = ctx->optval_end;
		__u8 *optval = ctx->optval;

		/* Custom socket option. */
		if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
			/* do something */
			/* Bypass the kernel's setsockopt handling. */
			ctx->optlen = -1;
			return 1;
		}

		/* Modify kernel's socket option. */
		if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
			/* Bounds check before touching optval. */
			if (optval + 1 > optval_end)
				return 0;
			optval[0] = ...;
			return 1;
		}

		/* For optval larger than PAGE_SIZE, use the kernel's buffer. */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;

		return 1;
	}

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of a BPF program that handles socket options.