=====================
Restartable Sequences
=====================

Restartable Sequences allow a thread to register a per-thread userspace
memory area to be used as an ABI between kernel and userspace for three
purposes:

 * userspace restartable sequences

 * quick access to the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

Allows per-CPU data to be implemented efficiently. Documentation is in code
and selftests. :(

Optimized RSEQ V2
-----------------

On architectures which utilize the generic entry code and generic TIF bits
the kernel supports runtime optimizations for RSEQ, which also enable
enhanced features like scheduler time slice extensions.

To enable them a task has to register the RSEQ region with at least the
length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).

If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel
keeps the legacy low performance mode enabled to fulfil the expectations
of existing users regarding the original RSEQ implementation behaviour.

The following table documents the ABI and behavioral guarantees of the
legacy and the optimized V2 mode.

.. list-table:: RSEQ modes
   :header-rows: 1

   * - Nr
     - What

     - Legacy
     - Optimized V2

   * - 1
     - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
       only)

       .. Legacy

     - Updated by the kernel unconditionally after each context switch and
       before signal delivery

       .. Optimized V2

     - Updated by the kernel if and only if they change, i.e. if the task
       is migrated or mm_cid changes

   * - 2
     - The rseq_cs critical section field

       .. Legacy

     - Evaluated and handled unconditionally after each context switch and
       before signal delivery

       .. Optimized V2

     - Evaluated and handled conditionally only when user space was
       interrupted and was scheduled out or before delivering a signal in
       the interrupted context.

   * - 3
     - Read only fields

       .. Legacy

     - No strict enforcement except in debug mode

       .. Optimized V2

     - Strict enforcement

   * - 4
     - membarrier(...RSEQ)

       .. Legacy

     - All running threads of the process are interrupted, the ID fields
       are rewritten and any active critical sections are aborted before
       the threads return to user space. All threads which are scheduled
       out, whether voluntarily or not, are covered by #1/#2 above.

       .. Optimized V2

     - All running threads of the process are interrupted and any active
       critical sections are aborted before these threads return to user
       space. The ID fields are only updated if changed as a consequence
       of the interrupt. All threads which are scheduled out, whether
       voluntarily or not, are covered by #1/#2 above.

   * - 5
     - Time slice extensions

       .. Legacy

     - Not supported

       .. Optimized V2

     - Supported

The legacy mode is obviously less performant as it does unconditional
updates and critical section checks even if not strictly required by the
ABI contract. That can't be changed anymore as some users depend on that
observed behavior, which in turn enables them to violate the ABI and
overwrite the cpu_id_start field for their own purposes. This is obviously
discouraged as it renders RSEQ incompatible with the intended usage and
breaks the expectation of other libraries in the same application.

The ABI compliant optimized V2 mode, which respects the read only fields,
does not require unconditional updates and therefore is significantly more
performant. The kernel validates the read only fields for compliance. If
user space modifies them, the process is killed. Compliant usage allows
multiple libraries in the same application to benefit from the RSEQ
functionality without disturbing each other. The ABI compliant optimized V2
mode also enables extended RSEQ features like time slice extensions.


Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

 * Enabled in Kconfig

 * Enabled at boot time (default is enabled)

 * An rseq userspace pointer has been registered for the thread in
   optimized V2 mode

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success; otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed in the flags field of the rseq
ABI struct via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is also the minimum value. It can be raised up to 50 usec, however
doing so can/will affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are clear when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
        -> Interrupt results in schedule and grant revocation
        rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2) which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion layer architecture, where the code
handling the critical section and requesting the time slice extension has
no control over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.
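
The three exit states described above (request still set, grant active,
grant revoked) can be modelled in plain C without kernel support. The
following is only a sketch under stated assumptions: ``struct
mock_slice_ctrl``, ``slice_enter()``, ``slice_leave()`` and the
``mock_kernel_*()`` helpers are hypothetical names invented for
illustration and are not part of the rseq ABI. The real field layout is
defined by the kernel UAPI header, and the barrier uses GCC/Clang inline
assembly syntax.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for rseq::slice_ctrl. Only the two fields named in
 * the text are modelled; the real layout is defined by the kernel UAPI.
 */
struct mock_slice_ctrl {
	volatile uint8_t request;
	volatile uint8_t granted;
};

/* What user space has to do when it leaves the critical section. */
enum slice_exit_action {
	SLICE_EXIT_NOTHING,	/* no grant, or the grant was already revoked */
	SLICE_EXIT_YIELD,	/* grant still active: call rseq_slice_yield() */
};

/* Compiler-only barrier (GCC/Clang syntax); no CPU fence is involved. */
static inline void compiler_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Entry side: ask for an extension before entering the critical section. */
static inline void slice_enter(struct mock_slice_ctrl *ctrl)
{
	ctrl->request = 1;
	compiler_barrier();
}

/* Exit side: drop the request and report whether a yield is due. */
static inline enum slice_exit_action slice_leave(struct mock_slice_ctrl *ctrl)
{
	compiler_barrier();
	ctrl->request = 0;
	return ctrl->granted ? SLICE_EXIT_YIELD : SLICE_EXIT_NOTHING;
}

/* Helper imitating the kernel granting an extension, as described above. */
static inline void mock_kernel_grant(struct mock_slice_ctrl *ctrl)
{
	ctrl->request = 0;
	ctrl->granted = 1;
}

/* Helper imitating the kernel revoking a grant after a reschedule. */
static inline void mock_kernel_revoke(struct mock_slice_ctrl *ctrl)
{
	ctrl->granted = 0;
}
```

A caller would invoke rseq_slice_yield() only when ``slice_leave()``
returns ``SLICE_EXIT_YIELD``. The return value is a racy snapshot of the
granted bit, for the reasons given above, so it must not be treated as a
guarantee that the grant is still active.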