=====================
Restartable Sequences
=====================

Restartable Sequences allow a thread to register a per-thread userspace
memory area that is used as an ABI between kernel and userspace for three
purposes:

 * userspace restartable sequences

 * quick access to the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows per-CPU data to be implemented efficiently. Documentation is in
the code and selftests. :(

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section, to avoid the contention on a resource that arises when
the thread is scheduled out inside of the critical section.

The prerequisites for this functionality are:

    * Enabled in Kconfig

    * Enabled at boot time (default is enabled)

    * An rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success. Otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with the following error code:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed in the flags field of the rseq
ABI struct via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

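The two prctl(2) calls described above can be wrapped as in the following
minimal sketch. The helper names ``slice_ext_enable()`` and
``slice_ext_query()`` are hypothetical, and the sketch assumes that the
``PR_RSEQ_SLICE_*`` constants are provided by the installed UAPI headers via
``<sys/prctl.h>``. If they are not, the helpers report ``-ENOSYS`` instead of
guessing numeric values::

    #include <errno.h>
    #include <sys/prctl.h>

    /*
     * Hypothetical helper: returns 0 on success, a negative errno value on
     * failure, or -ENOSYS when the build headers lack the constants.
     */
    static int slice_ext_enable(void)
    {
    #ifdef PR_RSEQ_SLICE_EXTENSION
        /* arg4 and arg5 must be zero, see the error table above */
        if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                  PR_RSEQ_SLICE_EXT_ENABLE, 0UL, 0UL) == 0)
            return 0;
        return -errno;
    #else
        return -ENOSYS;
    #endif
    }

    /*
     * Hypothetical helper: returns 1 if enabled, 0 if disabled, or a
     * negative errno value on failure.
     */
    static int slice_ext_query(void)
    {
    #ifdef PR_RSEQ_SLICE_EXTENSION
        /* arg3, arg4 and arg5 must be zero */
        int ret = prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
                        0UL, 0UL, 0UL);

        if (ret < 0)
            return -errno;
        return ret == PR_RSEQ_SLICE_EXT_ENABLE;
    #else
        return -ENOSYS;
    #endif
    }

A caller would typically invoke ``slice_ext_enable()`` once per thread after
rseq registration and simply run without extensions when a negative value
is returned.
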
If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is also the minimum value. It can be increased up to 50 usec, but
doing so can and will affect the minimum scheduling latency.

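As an illustration, the configured length could be read and sanity checked
as in the following sketch. The helper names are hypothetical, the file is
assumed to contain a plain decimal nanosecond value, and the limits are
simply the 5 usec and 50 usec bounds mentioned above::

    #include <stdio.h>

    /* Bounds taken from the text above, expressed in nanoseconds */
    #define SLICE_EXT_MIN_NSEC   5000L
    #define SLICE_EXT_MAX_NSEC  50000L

    /* Returns 1 if nsec lies within the documented range, otherwise 0 */
    static int slice_ext_nsec_valid(long nsec)
    {
        return nsec >= SLICE_EXT_MIN_NSEC && nsec <= SLICE_EXT_MAX_NSEC;
    }

    /*
     * Reads the extension length from the given file. Returns the value
     * in nanoseconds, or -1 if the file cannot be opened or parsed.
     */
    static long slice_ext_read_nsec(const char *path)
    {
        long nsec = -1;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        if (fscanf(f, "%ld", &nsec) != 1)
            nsec = -1;
        fclose(f);
        return nsec;
    }

Assuming debugfs is mounted at ``/sys/kernel/debug``, which usually requires
root access, the call would be
``slice_ext_read_nsec("/sys/kernel/debug/rseq/slice_ext_nsec")``.
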
Any proposed change to this default has to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If the thread is rescheduled after
the extension was granted, the kernel clears the granted bit to indicate
that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are clear when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
        rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
        rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

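The flow can be exercised without kernel support against a mock, as in the
following sketch. Everything prefixed with ``mock_`` is a hypothetical
stand-in and not part of the rseq ABI: the structure only mimics the two
bits, the yield stub merely counts invocations, and ``mock_kernel_grant()``
plays the role of an interrupt that leads to a grant::

    /* Compiler-only barrier (GCC/Clang syntax) */
    #define barrier() __asm__ __volatile__("" ::: "memory")

    /* Hypothetical stand-in for the slice_ctrl part of the rseq struct */
    struct mock_slice_ctrl {
        volatile unsigned char request;
        volatile unsigned char granted;
    };

    static struct mock_slice_ctrl mock_ctrl;
    static int mock_yield_calls;

    /* Stub for rseq_slice_yield(2): counts calls and drops the grant */
    static void mock_rseq_slice_yield(void)
    {
        mock_yield_calls++;
        mock_ctrl.granted = 0;
    }

    /* Mimics the kernel granting an extension on a reschedule request */
    static void mock_kernel_grant(void)
    {
        if (mock_ctrl.request) {
            mock_ctrl.request = 0;
            mock_ctrl.granted = 1;
        }
    }

    /* The required code flow, applied to the mock state */
    static void run_critical_section(void (*critical_section)(void))
    {
        mock_ctrl.request = 1;
        barrier();  /* Prevent compiler reordering */
        critical_section();
        barrier();  /* Prevent compiler reordering */
        mock_ctrl.request = 0;
        if (mock_ctrl.granted)
            mock_rseq_slice_yield();
    }

    /* Critical section that runs to completion without interruption */
    static void cs_undisturbed(void) { }

    /* Critical section during which the mock kernel grants an extension */
    static void cs_interrupted(void) { mock_kernel_grant(); }

Running ``run_critical_section(cs_undisturbed)`` leaves the yield counter
untouched, while ``run_critical_section(cs_interrupted)`` ends with exactly
one yield call.
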
If the thread issues a syscall other than rseq_slice_yield(2) within the
granted time slice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required because
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY, and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2), which
is free of side effects. The support for arbitrary syscalls is required for
applications with an onion-layer architecture, where the code handling the
critical section and requesting the time slice extension has no control
over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.