====================
The robust futex ABI
====================

:Author: Started by Paul Jackson <pj@sgi.com>


Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention.  The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:

 1) a one time call, per thread, to tell the kernel where its list of
    held robust_futexes begins, and
 2) internal kernel code at exit, to handle any listed locks held
    by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel.  Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them.  If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call::

    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words.  Each word is 32 bits on 32 bit arch's, or 64
bits on 64 bit arch's, and local byte order.
Each thread should have its own thread private 'head'.

If a thread is running in 32 bit compatibility mode on a 64 bit native
arch kernel, then it can actually have two such structures - one using
32 bit words for 32 bit compatibility mode, and one using 64 bit words
for 64 bit native mode.  The kernel, if it is a 64 bit kernel supporting
32 bit compatibility mode, will attempt to process both lists on each
task exit, if the corresponding sys_set_robust_list() call has been made
to set up that list.

  The first word in the memory structure at 'head' contains a
  pointer to a single linked list of 'lock entries', one per lock,
  as described below.  If the list is empty, the pointer will point
  to itself, 'head'.  The last 'lock entry' points back to the 'head'.

  The second word, called 'offset', specifies the offset from the
  address of the associated 'lock entry', plus or minus, of what will
  be called the 'lock word', from that 'lock entry'.  The 'lock word'
  is always a 32 bit word, unlike the other words above.  The 'lock
  word' holds 2 flag bits in the upper 2 bits, and the thread id (TID)
  of the thread holding the lock in the bottom 30 bits.  See further
  below for a description of the flag bits.

  The third word, called 'list_op_pending', contains a transient copy
  of the address of the 'lock entry', during list insertion and removal,
  and is needed to correctly resolve races should a thread exit while
  in the middle of a locking or unlocking operation.

Each 'lock entry' on the single linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries.  In addition, nearby to each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.
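
The three-word layout described above can be sketched in plain C.  This
is only an illustration: the type, macro and function names below
(``lock_entry``, ``list_head3``, ``LOCK_*``, ``head_init()`` and so on)
are made up for this sketch and are not the kernel's own definitions.
Only the order of the words and the bit positions are taken from the
text.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical names; a 'lock entry' is a single pointer-sized word. */
    struct lock_entry {
        struct lock_entry *next;    /* next 'lock entry', or back to 'head' */
    };

    /* Hypothetical mirror of the three words found at 'head'. */
    struct list_head3 {
        struct lock_entry list;             /* first word: start of the list */
        long offset;                        /* second word: entry to 'lock word' */
        struct lock_entry *list_op_pending; /* third word: transient entry */
    };

    /* Bit layout of the 32 bit 'lock word', as described in the text. */
    #define LOCK_WAITERS    0x80000000u     /* bit 31 */
    #define LOCK_OWNER_DIED 0x40000000u     /* bit 30 */
    #define LOCK_TID_MASK   0x3fffffffu     /* bottom 30 bits: owner TID */

    /* An empty list is one whose first word points back at 'head' itself. */
    static void head_init(struct list_head3 *h, long offset)
    {
        h->list.next = &h->list;
        h->offset = offset;
        h->list_op_pending = NULL;
    }

    static int head_is_empty(const struct list_head3 *h)
    {
        return h->list.next == &h->list;
    }

    /* Extract the owner TID from a 'lock word' value. */
    static uint32_t lock_word_tid(uint32_t word)
    {
        return word & LOCK_TID_MASK;
    }

Because the first word sits at the start of the structure, the address of
the first 'lock entry' slot and the address of 'head' are the same, which
is what makes the "points to itself" test for an empty list work.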

The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes.  The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'.  Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wake up the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in 'head'
pointer for that task.  The task may retrieve that value later on by
using the system call::

    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock.  The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread.  The thread should link those locks
it currently holds using the 'lock entry' pointers.  It may also have
other links between the locks, such as the reverse side of a double
linked list, but that doesn't matter to the kernel.
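
As an illustration of embedding a robust_futex in a larger user level
structure, the sketch below uses a hypothetical ``struct my_lock`` with
an extra back link and some unrelated payload.  The names ``my_lock``,
``MY_LOCK_OFFSET`` and ``entry_to_lock_word()`` are assumptions made for
this example; the only point is that one fixed 'offset' leads from any
'lock entry' to its 'lock word'.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    struct lock_entry {
        struct lock_entry *next;    /* forward link, the part the kernel follows */
    };

    /* Hypothetical user level lock structure; layout is up to the library. */
    struct my_lock {
        struct my_lock *prev;       /* extra back link, ignored by the kernel */
        struct lock_entry entry;    /* the 'lock entry' */
        uint32_t word;              /* the 'lock word', always 32 bits */
        const char *name;           /* unrelated payload */
    };

    /* The single 'offset' value that would be stored in the second word. */
    #define MY_LOCK_OFFSET \
        ((long)offsetof(struct my_lock, word) - \
         (long)offsetof(struct my_lock, entry))

    /* Locate the 'lock word' from the address of a 'lock entry'. */
    static uint32_t *entry_to_lock_word(struct lock_entry *e, long offset)
    {
        return (uint32_t *)((char *)e + offset);
    }

If the 'lock word' were placed before the 'lock entry' in the structure,
the same macro would simply yield a negative offset, matching the "plus
or minus" wording above.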

By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for, and wake up, locks.  The kernel's
only essential involvement in robust_futexes is to remember where the
list 'head' is, and to walk the list on thread exit, handling locks
still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time.  Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region.  The
thread currently holding such a lock, if any, is marked with the thread's
TID in the lower 30 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread id (TID) in the bottom 30 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.
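
The insertion steps above might look roughly like the following for the
uncontended case.  This is a sketch under assumptions, not how any
particular library does it: the structure and function names are
hypothetical, only a try-lock is shown (a real lock would fall back to
waiting with the futex mechanism when the 'lock word' is non-zero), and
``atomic_signal_fence()`` is used here merely to keep the compiler from
reordering the plain stores.

.. code-block:: c

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    struct lock_entry {
        struct lock_entry *next;
    };

    /* Hypothetical mirror of the three words at 'head'. */
    struct list_head3 {
        struct lock_entry list;
        long offset;
        struct lock_entry *list_op_pending;
    };

    /* Hypothetical user level lock: 'lock entry' followed by 'lock word'. */
    struct my_lock {
        struct lock_entry entry;
        _Atomic uint32_t word;
    };

    /* Try-lock only: returns 1 if the lock was taken, 0 if already held. */
    static int robust_trylock(struct list_head3 *h, struct my_lock *l,
                              uint32_t tid)
    {
        uint32_t expected = 0;

        /* 1) announce the entry that is about to be inserted */
        h->list_op_pending = &l->entry;
        atomic_signal_fence(memory_order_seq_cst);

        /* 2) acquire: 0 -> TID in the bottom 30 bits of the 'lock word' */
        if (!atomic_compare_exchange_strong(&l->word, &expected, tid)) {
            h->list_op_pending = NULL;  /* not acquired, nothing pending */
            return 0;
        }

        /* 3) link the entry in right after 'head' */
        l->entry.next = h->list.next;
        h->list.next = &l->entry;
        atomic_signal_fence(memory_order_seq_cst);

        /* 4) the operation is complete */
        h->list_op_pending = NULL;
        return 1;
    }

Setting 'list_op_pending' before the acquire is what covers the window
in which the 'lock word' already holds the TID but the entry is not yet
on the list.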

On removal:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 3) release the futex lock, and
 4) clear the 'list_op_pending' word.

Please note that the removal of a robust futex purely in userspace is
racy. Refer to the next section to learn more and how to avoid this.

On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock entry' found by walking
the list starting at 'head'.  For each such address, if the bottom 30
bits of the 'lock word' at offset 'offset' from that address equal the
exiting thread's TID, then the kernel will do two things:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will waken the next thread that has
    used the futex mechanism to wait on that address, and
 2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.

The kernel exit code will silently stop scanning the list further if at
any point:

 1) the 'head' pointer or a subsequent linked list pointer
    is not a valid address of a user space word,
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valid address of a 32 bit user space
    word, or
 3) the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
current thread's TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.
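
The exit-time handling described above can be modelled in user space as
follows.  This is a simulation written for this document, not kernel
code: all names are hypothetical, the futex wakeup is replaced by a
counter, invalid addresses are reduced to a NULL check, and the element
limit is passed in as a parameter.  Handling the 'list_op_pending' entry
only once, even if it is also on the list, is a choice made by this
sketch.

.. code-block:: c

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    #define LOCK_WAITERS    0x80000000u     /* bit 31 */
    #define LOCK_OWNER_DIED 0x40000000u     /* bit 30 */
    #define LOCK_TID_MASK   0x3fffffffu     /* bottom 30 bits */

    struct lock_entry {
        struct lock_entry *next;
    };

    struct list_head3 {
        struct lock_entry list;
        long offset;
        struct lock_entry *list_op_pending;
    };

    /* Hypothetical user level lock used to build test lists. */
    struct my_lock {
        struct lock_entry entry;
        _Atomic uint32_t word;
    };

    /* Handle one entry; returns 1 if a wakeup would have been attempted. */
    static int handle_entry(struct lock_entry *e, long offset, uint32_t tid)
    {
        _Atomic uint32_t *word = (_Atomic uint32_t *)((char *)e + offset);
        uint32_t val = atomic_load(word);

        if ((val & LOCK_TID_MASK) != tid)
            return 0;                       /* not held by the exiting thread */
        atomic_fetch_or(word, LOCK_OWNER_DIED);
        return (val & LOCK_WAITERS) != 0;   /* stand-in for the futex wakeup */
    }

    /* Walk the list for 'tid'; returns the number of simulated wakeups. */
    static int simulate_exit(struct list_head3 *h, uint32_t tid, unsigned limit)
    {
        struct lock_entry *e;
        unsigned seen = 0;
        int wakeups = 0;

        for (e = h->list.next; e != NULL && e != &h->list; e = e->next) {
            if (++seen > limit)
                break;                      /* list too long, stop scanning */
            if (e != h->list_op_pending)
                wakeups += handle_entry(e, h->offset, tid);
        }
        if (h->list_op_pending != NULL)
            wakeups += handle_entry(h->list_op_pending, h->offset, tid);
        return wakeups;
    }

For example, with a list holding one lock owned by TID 7 with the
waiters bit set and one lock owned by TID 9, ``simulate_exit(&h, 7, ...)``
marks only the first lock and reports a single wakeup.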

Robust release is racy
----------------------

The removal of a robust futex from the list is racy when doing it solely in
userspace. Quoting Thomas Gleixner for the explanation:

  The robust futex unlock mechanism is racy in respect to the clearing of the
  robust_list_head::list_op_pending pointer because unlock and clearing the
  pointer are not atomic. The race window is between the unlock and clearing
  the pending op pointer. If the task is forced to exit in this window, exit
  will access a potentially invalid pending op pointer when cleaning up the
  robust list. That happens if another task manages to unmap the object
  containing the lock before the cleanup, which results in an UAF. In the
  worst case this UAF can lead to memory corruption when unrelated content
  has been mapped to the same address by the time the access happens.

A full in-depth analysis can be read at
https://lore.kernel.org/lkml/20260316162316.356674433@kernel.org/

To overcome that, the kernel needs to participate in the lock release operation.
This ensures that releasing the lock and removing the address from
``list_op_pending`` happen "atomically". If the release is interrupted by a
signal, the kernel will also check whether it interrupted the release
operation.

For the contended unlock case, where other threads are waiting for the lock
release, there's the ``FUTEX_ROBUST_UNLOCK`` operation feature flag for the
``futex()`` system call, which must be used with one of the following
operations: ``FUTEX_WAKE``, ``FUTEX_WAKE_BITSET`` or ``FUTEX_UNLOCK_PI``.
The kernel will release the lock (set the futex word to zero) and clear the
``list_op_pending`` field. Then it will proceed with the normal wake path.

For the non-contended path, there's still a race between checking the futex word
and clearing the ``list_op_pending`` field.
To solve this without needing a full system call, userspace should call the
virtual syscall ``__vdso_futex_robust_listXX_try_unlock()`` (where XX is
either 32 or 64, depending on the size of the pointer). If the vDSO call
succeeds, it means that it released the lock and cleared ``list_op_pending``.
If it fails, that means that there are waiters for this lock and a call to
the ``futex()`` syscall with ``FUTEX_ROBUST_UNLOCK`` is needed.