=====================
Restartable Sequences
=====================

Restartable Sequences allow a thread to register a per-thread userspace
memory area that serves as an ABI between kernel and userspace for three
purposes:

 * userspace restartable sequences

 * quick access to the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only documented in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows per-CPU data to be implemented efficiently. Documentation is
only available in the code and selftests.

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section. The extension avoids the contention on a resource that
would otherwise arise if the thread were scheduled out inside of the
critical section.

The prerequisites for this functionality are:

    * Enabled in Kconfig

    * Enabled at boot time (default is enabled)

    * An rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success. Otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

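The enable call and its error codes can be wrapped as in the following
minimal sketch. The helper names ``slice_ext_enable()`` and
``slice_ext_strerror()`` are hypothetical and not part of any ABI. The sketch
assumes the ``PR_RSEQ_SLICE_*`` constants come from the installed kernel
headers and reports the feature as unavailable when they are missing.
``ENOTSUPP`` is not handled by name, as userspace ``<errno.h>`` may not define
it, so that case falls through to the default branch.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Try to enable time slice extensions for the calling thread.
 * Returns 0 on success or a negative errno value on failure.
 */
static int slice_ext_enable(void)
{
#if defined(PR_RSEQ_SLICE_EXTENSION) && \
    defined(PR_RSEQ_SLICE_EXTENSION_SET) && \
    defined(PR_RSEQ_SLICE_EXT_ENABLE)
    /* arg4 and arg5 must be zero, otherwise the kernel returns EINVAL */
    if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
              PR_RSEQ_SLICE_EXT_ENABLE, 0, 0) == 0)
        return 0;
    return -errno;
#else
    /* Assumption: headers without the constants mean "not available" */
    return -EINVAL;
#endif
}

/* Map the result of slice_ext_enable() to a short diagnostic string. */
static const char *slice_ext_strerror(int err)
{
    switch (err) {
    case 0:
        return "enabled";
    case -EINVAL:
        return "not available or invalid arguments";
    case -ENXIO:
        return "no rseq area registered for this thread";
    default:
        /* Includes the ENOTSUPP case from the table above */
        return "disabled on the kernel command line or unexpected error";
    }
}
```

A caller would typically invoke ``slice_ext_enable()`` once per thread after
rseq registration and fall back to running without extensions on any error.
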
The state can also be queried via prctl(2)::

  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

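The query can be folded into a tri-state helper, sketched below. The name
``slice_ext_is_enabled()`` is hypothetical. As an assumption, the sketch treats
kernel headers that lack the ``PR_RSEQ_SLICE_*`` constants the same way as a
kernel that reports ``EINVAL``.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Query the time slice extension state of the calling thread.
 * Returns 1 if enabled, 0 if disabled, or a negative errno value.
 */
static int slice_ext_is_enabled(void)
{
#if defined(PR_RSEQ_SLICE_EXTENSION) && \
    defined(PR_RSEQ_SLICE_EXTENSION_GET) && \
    defined(PR_RSEQ_SLICE_EXT_ENABLE)
    /* arg3, arg4 and arg5 must all be zero for the GET operation */
    int ret = prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
                    0, 0, 0);

    if (ret < 0)
        return -errno;
    return ret == PR_RSEQ_SLICE_EXT_ENABLE;
#else
    /* Assumption: headers without the constants mean "not available" */
    return -EINVAL;
#endif
}
```
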
The availability and status are also exposed in the flags field of the rseq
ABI struct via ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for
userspace and serve informational purposes only.

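Judging by the ``_BIT`` suffix, these constants are presumably bit numbers
rather than masks, so a reader has to shift before testing. A minimal sketch
under that assumption, with the hypothetical helper ``rseq_flag_bit_is_set()``
taking the bit number as a parameter instead of hardcoding a value:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Test a flag that is defined as a bit number rather than as a mask.
 * The caller passes the bit number (must be below 32), for example one
 * of the RSEQ_CS_FLAG_SLICE_EXT_*_BIT constants from the rseq UAPI header.
 */
static bool rseq_flag_bit_is_set(uint32_t flags, unsigned int bit)
{
    return (flags & (UINT32_C(1) << bit)) != 0;
}
```

As the bits are informational only, such a helper must only ever be applied
to a value read from the flags field, never used to modify it.
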
If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 10 usec,
which is also the minimum value. It can be raised to 50 usec, but doing so
can, and likely will, affect the minimum scheduling latency.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If the thread is rescheduled after
the extension was granted, the kernel clears the granted bit to indicate
that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are clear when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
        rseq_slice_yield();

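The flow above can be exercised without a supporting kernel by using test
doubles, as in the following sketch. Everything here is a stand-in:
``struct slice_ctrl_mock`` only mimics the request and granted fields and is
not the real rseq layout, ``yield_stub()`` takes the place of
rseq_slice_yield(2), and ``cs_with_grant()`` imitates what the kernel does
when it grants an extension during the critical section.

```c
#include <stdint.h>

/* Hypothetical stand-in for the request/granted fields of rseq::slice_ctrl */
struct slice_ctrl_mock {
    volatile uint8_t request;
    volatile uint8_t granted;
};

/* Compiler barrier only; the state is CPU local, so no CPU fence is used */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int yield_calls;

/* Stand-in for rseq_slice_yield(2): only counts invocations */
static void yield_stub(void)
{
    yield_calls++;
}

/* Critical section during which no reschedule request happens */
static void cs_without_grant(struct slice_ctrl_mock *ctrl)
{
    (void)ctrl;
}

/* Critical section during which the "kernel" grants an extension */
static void cs_with_grant(struct slice_ctrl_mock *ctrl)
{
    ctrl->request = 0;
    ctrl->granted = 1;
}

/*
 * Run critical_section() with a slice extension request pending.
 * Returns 1 if the yield function was invoked, 0 otherwise.
 */
static int run_with_slice_request(struct slice_ctrl_mock *ctrl,
                                  void (*critical_section)(struct slice_ctrl_mock *),
                                  void (*yield_fn)(void))
{
    ctrl->request = 1;
    barrier();              /* keep the request ahead of the section */
    critical_section(ctrl);
    barrier();              /* keep the section ahead of the cleanup */
    ctrl->request = 0;
    if (ctrl->granted) {
        yield_fn();
        return 1;
    }
    return 0;
}
```

With ``cs_without_grant()`` the function returns 0 and never yields; with
``cs_with_grant()`` it returns 1 after calling the yield stub exactly once.
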
As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
        rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted time slice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required because
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY, and might
therefore exceed the grant by far.

The preferred solution for userspace is to use rseq_slice_yield(2), which
is side-effect free. Support for arbitrary syscalls is required for
applications with an onion-layer architecture, where the code handling the
critical section and requesting the time slice extension has no control
over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.