=====================
Restartable Sequences
=====================

Restartable Sequences allow a thread to register a per-thread userspace
memory area that is used as an ABI between the kernel and userspace for
three purposes:

 * userspace restartable sequences

 * quick access to the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows per-CPU data to be implemented efficiently. The documentation
is unfortunately only available in the code and selftests.

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

    * Enabled in Kconfig

    * Enabled at boot time (default is enabled)

    * An rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success. Otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

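The enable call and the error table above could be wrapped roughly as
follows. This is a minimal sketch, not kernel-provided code: the names
``slice_ext_enable()``, ``slice_ext_classify()`` and ``enum slice_ext_status``
are hypothetical, the prctl constants are only used when the installed
headers define them, and the fallback value 524 for ``ENOTSUPP`` is the
kernel-internal errno number, which libc headers usually do not export::

    #include <assert.h>
    #include <errno.h>
    #include <sys/prctl.h>

    /* ENOTSUPP is kernel-internal; most libc headers do not define it. */
    #ifndef ENOTSUPP
    #define ENOTSUPP 524
    #endif

    /* Hypothetical names for the outcomes listed in the table. */
    enum slice_ext_status {
        SLICE_EXT_OK,            /* enabled for this thread */
        SLICE_EXT_UNAVAILABLE,   /* EINVAL: not available or bad arguments */
        SLICE_EXT_BOOT_DISABLED, /* ENOTSUPP: disabled on the command line */
        SLICE_EXT_NO_RSEQ,       /* ENXIO: no rseq user struct registered */
        SLICE_EXT_OTHER,         /* anything not listed in the table */
    };

    /* Map 0 or a negative errno value to one of the documented outcomes. */
    enum slice_ext_status slice_ext_classify(int ret)
    {
        assert(ret <= 0);
        switch (-ret) {
        case 0:        return SLICE_EXT_OK;
        case EINVAL:   return SLICE_EXT_UNAVAILABLE;
        case ENOTSUPP: return SLICE_EXT_BOOT_DISABLED;
        case ENXIO:    return SLICE_EXT_NO_RSEQ;
        default:       return SLICE_EXT_OTHER;
        }
    }

    /* Try to enable slice extensions; returns 0 or a negative errno value. */
    int slice_ext_enable(void)
    {
    #ifdef PR_RSEQ_SLICE_EXTENSION
        if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0) == 0)
            return 0;
        return -errno;
    #else
        /* Assumption: headers without the constants count as unavailable. */
        return -EINVAL;
    #endif
    }

A caller would typically run ``slice_ext_classify(slice_ext_enable())`` once
per thread after rseq registration and fall back to running without
extensions for any result other than ``SLICE_EXT_OK``.
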
The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with the following error code:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed in the flags field of the rseq
ABI struct via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for
userspace and serve informational purposes only.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is also the minimum value. It can be increased up to 50 usec; however,
doing so can and most likely will affect the minimum scheduling latency.

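The bounds stated above can be checked from a test tool with something like
the following. This is a sketch under assumptions: the helper names are
hypothetical, the file is assumed to contain a single decimal number of
nanoseconds, and the path a caller passes in (for example
``/sys/kernel/debug/rseq/slice_ext_nsec`` when debugfs is mounted at
``/sys/kernel/debug``) is not confirmed by this document::

    #include <assert.h>
    #include <stdio.h>

    /* Bounds from the text: 5 usec minimum (and default), 50 usec maximum. */
    #define SLICE_EXT_NSEC_MIN  5000L
    #define SLICE_EXT_NSEC_MAX 50000L

    /* Returns 1 if @nsec lies within the documented range, 0 otherwise. */
    int slice_ext_nsec_in_range(long nsec)
    {
        return nsec >= SLICE_EXT_NSEC_MIN && nsec <= SLICE_EXT_NSEC_MAX;
    }

    /*
     * Read a single decimal value from @path (assumed file format).
     * Returns the value, or -1 if the file cannot be opened or parsed.
     */
    long read_slice_ext_nsec(const char *path)
    {
        long nsec = -1;
        FILE *f;

        assert(path != NULL);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%ld", &nsec) != 1)
            nsec = -1;
        fclose(f);
        return nsec;
    }

Reading debugfs files normally requires elevated privileges, so a return
value of -1 should be treated as "unknown" rather than as an error in the
feature itself.
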
Any proposed change to this default must come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If the thread is rescheduled after
the extension was granted, the kernel clears the granted bit to indicate
that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are clear when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
        rseq_slice_yield();

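The flow above can be tried out without kernel support by substituting
stand-ins for the kernel side. The following is only an illustrative sketch:
``struct mock_slice_ctrl`` is a hypothetical stand-in and not the UAPI
layout of rseq::slice_ctrl, ``barrier()`` is defined here as a GCC/Clang
compiler-only barrier, and the yield call is passed in as a function pointer
so that a counter can replace rseq_slice_yield(2)::

    #include <assert.h>
    #include <stdint.h>

    /* Stand-in for rseq::slice_ctrl; NOT the kernel UAPI layout. */
    struct mock_slice_ctrl {
        volatile uint8_t request;
        volatile uint8_t granted;
    };

    /* Compiler-only barrier (GCC/Clang syntax); emits no CPU instruction. */
    #define barrier() __asm__ __volatile__("" ::: "memory")

    typedef void (*section_fn)(void *arg);
    typedef void (*yield_fn)(void);

    /*
     * Run @section with a slice extension requested.
     * Returns 1 if a grant was observed and @yield was called, 0 otherwise.
     */
    int run_with_slice_request(struct mock_slice_ctrl *ctrl,
                               section_fn section, void *arg, yield_fn yield)
    {
        assert(ctrl && section && yield);

        ctrl->request = 1;
        barrier();              /* keep the store before the section */
        section(arg);
        barrier();              /* keep the section before the store */
        ctrl->request = 0;
        if (ctrl->granted) {    /* racy by design, see the text */
            yield();
            return 1;
        }
        return 0;
    }

    /* Test doubles: count yields and emulate the kernel granting a slice. */
    int yield_calls;

    void count_yield(void)
    {
        yield_calls++;
    }

    void section_plain(void *arg)
    {
        (void)arg;              /* no interrupt, no grant */
    }

    void section_granted(void *arg)
    {
        struct mock_slice_ctrl *ctrl = arg;

        ctrl->request = 0;      /* what the kernel does on a grant */
        ctrl->granted = 1;
    }

In real code the control block would be the thread's registered rseq area
and the yield hook would invoke rseq_slice_yield(2); the function pointer
indirection exists only to make the sketch testable.
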
As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
        rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted time slice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for userspace is to use rseq_slice_yield(2), which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion-layer architecture, where the code
handling the critical section and requesting the time slice extension has
no control over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.