=====================
Restartable Sequences
=====================

Restartable Sequences allow registering a per-thread userspace memory area
to be used as an ABI between kernel and userspace for three purposes:

 * userspace restartable sequences

 * quick access to the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows userspace to implement per-CPU data efficiently. Unfortunately,
the documentation is only available in the code and selftests.

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

 * Enabled in Kconfig

 * Enabled at boot time (default is enabled)

 * A rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success; otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled.
Otherwise it fails with one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

The availability and status are also exposed via the rseq ABI struct flags
field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec, which
is also the minimum value. It can be increased to 50 usec; however, doing so
can/will affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.
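
Because the debugfs knob is expressed in nanoseconds while the limits above
are given in microseconds, the bounds are easy to get wrong by a factor of
1000. The following is a minimal userspace sketch of a pre-check for a
proposed value. The macro and function names are hypothetical, the bounds
are assumed to be inclusive, and how the kernel itself treats out-of-range
writes is not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Documented limits, converted from usec to nsec (assumed inclusive). */
#define SLICE_EXT_MIN_NSEC  (5u * 1000u)    /* 5 usec: default and minimum */
#define SLICE_EXT_MAX_NSEC  (50u * 1000u)   /* 50 usec: documented upper limit */

/* Hypothetical helper: is @nsec within the documented range? */
static bool slice_ext_nsec_in_range(uint64_t nsec)
{
    return nsec >= SLICE_EXT_MIN_NSEC && nsec <= SLICE_EXT_MAX_NSEC;
}

/* Usage example: values around both documented limits. */
static void slice_ext_range_examples(void)
{
    assert(slice_ext_nsec_in_range(5000));      /* default / minimum */
    assert(!slice_ext_nsec_in_range(4999));     /* below the minimum */
    assert(slice_ext_nsec_in_range(50000));     /* documented upper limit */
    assert(!slice_ext_nsec_in_range(50001));    /* above the upper limit */
}
```

A check like this only guards against unit mistakes in tooling; it says
nothing about whether a larger value is a good idea for scheduling latency.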

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are clear when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
            rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2) which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion-layer architecture, where the code
handling the critical section and requesting the time slice extension has no
control over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.
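
The three exit states described earlier (request still set, granted, and
both clear after a revoked grant) can be exercised without kernel support
by modelling the control bits in plain memory. The following is a minimal
sketch under that assumption: ``struct mock_slice_ctrl``, ``slice_enter()``,
``slice_leave()`` and ``mock_yield()`` are hypothetical names, the struct is
a stand-in and not the real rseq ABI layout, and the yield hook replaces
rseq_slice_yield(2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the request/granted bits in the rseq area. */
struct mock_slice_ctrl {
    uint8_t request;
    uint8_t granted;
};

/* Compiler-only barrier, as in the documented flow (GCC/Clang syntax). */
#define barrier() __asm__ __volatile__("" ::: "memory")

typedef void (*yield_fn)(void);

/* Counts how often the stand-in for rseq_slice_yield(2) was invoked. */
static int mock_yield_calls;

static void mock_yield(void)
{
    mock_yield_calls++;
}

/* Entry side: request an extension before the critical section. */
static void slice_enter(volatile struct mock_slice_ctrl *ctrl)
{
    ctrl->request = 1;
    barrier();
}

/*
 * Exit side: clear the request and yield only if a grant is still
 * visible. Returns true if the yield hook was called.
 */
static bool slice_leave(volatile struct mock_slice_ctrl *ctrl, yield_fn yield)
{
    assert(yield != NULL);

    barrier();
    ctrl->request = 0;
    if (ctrl->granted) {
        yield();
        return true;
    }
    return false;
}
```

A caller brackets its critical section with ``slice_enter()`` and
``slice_leave()``. A test can emulate the kernel by flipping ``request`` and
``granted`` between the two calls. Plain stores are used on purpose, as the
text above explains that atomic operations would not close the race.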