=====================
Restartable Sequences
=====================

Restartable Sequences allow a thread to register a per-thread userspace
memory area that serves as an ABI between the kernel and userspace for
three purposes:

 * userspace restartable sequences

 * quick access to the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only documented in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows per-CPU data to be implemented efficiently. The documentation
unfortunately exists only in the code and selftests.

Optimized RSEQ V2
-----------------

On architectures which utilize the generic entry code and generic TIF bits,
the kernel supports runtime optimizations for RSEQ, which also enable
enhanced features like scheduler time slice extensions.

To enable them, a task has to register the RSEQ region with at least the
length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).

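The length check described above can be sketched as follows. This is a
minimal illustration, not the kernel's or any libc's registration code: the
``demo_*`` names are hypothetical, the fallback ``AT_*`` values are taken
from the kernel's auxvec UAPI header for the case that libc headers do not
provide them yet, and the selection rule is only one reading of the two
statements in this section (at least the advertised length, and not the
original 32 byte size).

.. code-block:: c

    #include <stdio.h>
    #include <sys/auxv.h>

    /* Fallbacks for older libc headers (values from linux/auxvec.h). */
    #ifndef AT_RSEQ_FEATURE_SIZE
    #define AT_RSEQ_FEATURE_SIZE 27
    #endif
    #ifndef AT_RSEQ_ALIGN
    #define AT_RSEQ_ALIGN 28
    #endif

    /* The RSEQ_ORIG_SIZE mentioned in the text; local name for the demo. */
    #define DEMO_RSEQ_ORIG_SIZE 32UL

    /* Hypothetical helper: advertised feature size, 0 if not advertised. */
    static unsigned long demo_rseq_advertised_len(void)
    {
        return getauxval(AT_RSEQ_FEATURE_SIZE);
    }

    /*
     * Hypothetical helper: would a registration of 'len' bytes qualify
     * for the optimized V2 mode under the reading stated in the lead-in?
     */
    static int demo_len_selects_v2(unsigned long len, unsigned long advertised)
    {
        if (advertised == 0 || len == DEMO_RSEQ_ORIG_SIZE)
            return 0;
        return len >= advertised;
    }

    /* Print what the running kernel advertises. */
    static void demo_print_rseq_auxv(void)
    {
        printf("feature size: %lu, alignment: %lu\n",
               demo_rseq_advertised_len(), getauxval(AT_RSEQ_ALIGN));
    }

A caller would typically size and align its rseq area from these two values
before registering it. Most applications inherit the registration done by
their libc, so whether V2 mode is active depends on the libc in use.
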
If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel
keeps the legacy low performance mode enabled to fulfil the expectations
of existing users regarding the original RSEQ implementation behaviour.

The following table documents the ABI and behavioral guarantees of the
legacy and the optimized V2 mode.

.. list-table:: RSEQ modes
   :header-rows: 1

   * - Nr
     - What
     - Legacy
     - Optimized V2

   * - 1
     - The cpu_id_start, cpu_id, node_id and mm_cid fields (read-only for
       user mode)

       .. Legacy

     - Updated by the kernel unconditionally after each context switch and
       before signal delivery

       .. Optimized V2

     - Updated by the kernel if and only if they change, i.e. if the task
       is migrated or mm_cid changes

   * - 2
     - The rseq_cs critical section field

       .. Legacy

     - Evaluated and handled unconditionally after each context switch and
       before signal delivery

       .. Optimized V2

     - Evaluated and handled conditionally, only when user space was
       interrupted and scheduled out, or before delivering a signal in
       the interrupted context.

   * - 3
     - Read-only fields

       .. Legacy

     - No strict enforcement except in debug mode

       .. Optimized V2

     - Strict enforcement

   * - 4
     - membarrier(...RSEQ)

       .. Legacy

     - All running threads of the process are interrupted, the ID fields
       are rewritten and any active critical sections are aborted before
       the threads return to user space. All threads which are scheduled
       out, whether voluntarily or not, are covered by #1/#2 above.

       .. Optimized V2

     - All running threads of the process are interrupted and any active
       critical sections are aborted before these threads return to user
       space. The ID fields are only updated if they changed as a
       consequence of the interrupt. All threads which are scheduled out,
       whether voluntarily or not, are covered by #1/#2 above.

   * - 5
     - Time slice extensions

       .. Legacy

     - Not supported

       .. Optimized V2

     - Supported

The legacy mode is less performant as it does unconditional updates and
critical section checks even if they are not strictly required by the ABI
contract. That can't be changed anymore as some users depend on that
observed behavior, which in turn enables them to violate the ABI and
overwrite the cpu_id_start field for their own purposes. This is
discouraged as it renders RSEQ incompatible with the intended usage and
breaks the expectations of other libraries in the same application.

The ABI compliant optimized V2 mode, which respects the read-only fields,
does not require unconditional updates and is therefore considerably more
performant. The kernel validates the read-only fields for compliance. If
user space modifies them, the process is killed. Compliant usage allows
multiple libraries in the same application to benefit from the RSEQ
functionality without disturbing each other. The ABI compliant optimized V2
mode also enables extended RSEQ features like time slice extensions.


Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section, to avoid contention on a resource when the thread would
otherwise be scheduled out inside of the critical section.

The prerequisites for this functionality are:

    * Enabled in Kconfig

    * Enabled at boot time (default is enabled)

    * A rseq userspace pointer has been registered for the thread in
      optimized V2 mode

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success. Otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

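A caller might wrap the enable step and the error table above as sketched
below. The ``demo_*`` helpers are hypothetical names for illustration. The
``PR_RSEQ_SLICE_*`` constants are only used when the installed headers
provide them, and the local ``ENOTSUPP`` definition is an assumption based
on the kernel-internal errno value, which libc headers normally do not
export.

.. code-block:: c

    #include <errno.h>
    #include <string.h>
    #include <sys/prctl.h>

    /* Kernel-internal errno value, usually absent from libc headers. */
    #ifndef ENOTSUPP
    #define ENOTSUPP 524
    #endif

    /* Map a result of the enable call to the meanings in the table. */
    static const char *demo_slice_ext_strerror(int err)
    {
        switch (err) {
        case 0:
            return "enabled";
        case EINVAL:
            return "not available or invalid arguments";
        case ENOTSUPP:
            return "disabled on the kernel command line";
        case ENXIO:
            return "no rseq area registered";
        default:
            return "unexpected error";
        }
    }

    /* Try to enable slice extensions; returns 0 or a positive errno. */
    static int demo_slice_ext_enable(void)
    {
    #if defined(PR_RSEQ_SLICE_EXTENSION) && \
        defined(PR_RSEQ_SLICE_EXTENSION_SET) && \
        defined(PR_RSEQ_SLICE_EXT_ENABLE)
        if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0) == 0)
            return 0;
        return errno;
    #else
        /* Headers predate the feature: report it as not available. */
        return EINVAL;
    #endif
    }

A thread would call ``demo_slice_ext_enable()`` once after its rseq area has
been registered and fall back to running without extensions on any non-zero
result.
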
The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed in the flags field of the rseq
ABI struct via ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and serve informational purposes only.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is also the minimum value. It can be raised up to 50 usec, however
doing so can/will affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

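For tooling that inspects the configured extension length, the range stated
above can be checked as in this minimal sketch. The ``demo_*`` names and
constants are hypothetical, and the bounds simply restate the 5 usec minimum
and 50 usec maximum from the text in nanoseconds. The kernel's own
validation may differ.

.. code-block:: c

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Bounds from the text above, expressed in nanoseconds. */
    #define DEMO_SLICE_EXT_MIN_NSEC  5000L
    #define DEMO_SLICE_EXT_MAX_NSEC 50000L

    /* Parse a decimal nanosecond value; 0 on success, -1 if invalid. */
    static int demo_parse_slice_ext_nsec(const char *text, long *nsec)
    {
        char *end;
        long val;

        errno = 0;
        val = strtol(text, &end, 10);
        if (errno || end == text)
            return -1;
        if (*end != '\0' && *end != '\n')
            return -1;
        if (val < DEMO_SLICE_EXT_MIN_NSEC || val > DEMO_SLICE_EXT_MAX_NSEC)
            return -1;
        *nsec = val;
        return 0;
    }

    /* Read and validate the value from 'path'; 0 on success, -1 on error. */
    static int demo_read_slice_ext_nsec(const char *path, long *nsec)
    {
        char buf[32];
        FILE *f = fopen(path, "r");
        int ret = -1;

        if (!f)
            return -1;
        if (fgets(buf, sizeof(buf), f))
            ret = demo_parse_slice_ext_nsec(buf, nsec);
        fclose(f);
        return ret;
    }

Assuming debugfs is mounted at its usual location, the path argument would
be ``/sys/kernel/debug/rseq/slice_ext_nsec``. Reading it normally requires
root privileges.
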
The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are clear when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
        rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
        rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

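The decisions userspace takes when leaving the critical section can be
sketched as a small helper. This is an illustration under stated
assumptions: ``struct demo_slice_ctrl`` is a hypothetical stand-in for
rseq::slice_ctrl (the real layout is defined by the kernel's UAPI header),
``demo_barrier()`` is a GCC/Clang style compiler barrier, and the helper
only reports whether a yield is due instead of invoking
rseq_slice_yield(2) itself.

.. code-block:: c

    #include <stdint.h>

    /* Compiler barrier only; no CPU fence is needed for CPU local data. */
    #define demo_barrier() __asm__ __volatile__("" ::: "memory")

    /* Hypothetical stand-in for rseq::slice_ctrl, for illustration only. */
    struct demo_slice_ctrl {
        volatile uint8_t request;
        volatile uint8_t granted;
    };

    enum demo_exit_action {
        DEMO_EXIT_NONE,     /* no grant, or the grant was already revoked */
        DEMO_EXIT_YIELD,    /* grant active: call rseq_slice_yield(2) */
    };

    /* Mark the start of a critical section by requesting an extension. */
    static void demo_enter_critical_section(struct demo_slice_ctrl *ctrl)
    {
        ctrl->request = 1;
        demo_barrier();
    }

    /* Withdraw the request and tell the caller whether to yield. */
    static enum demo_exit_action
    demo_leave_critical_section(struct demo_slice_ctrl *ctrl)
    {
        demo_barrier();
        ctrl->request = 0;  /* harmless if the kernel cleared it already */
        if (ctrl->granted)
            return DEMO_EXIT_YIELD;
        return DEMO_EXIT_NONE;
    }

The check of the granted state remains racy for the reason given above. A
stale ``DEMO_EXIT_YIELD`` result merely leads to a yield call after the
grant has already been revoked.
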
If the thread issues a syscall other than rseq_slice_yield(2) within the
granted time slice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2), which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion-layer architecture, where the code
handling the critical section and requesting the time slice extension has
no control over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.
