======================
Lightweight PI-futexes
======================

We call them lightweight for three reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all. No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - even in the slowpath, the system call and scheduling patterns are very
   similar to those of normal futexes.

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation
   relatively simple: only a single owner may own a lock (i.e. no
   read-write lock support), only the owner may unlock a lock, no
   recursive locking, etc.

Priority Inheritance - why?
---------------------------

The short reply: user-space PI helps achieve or improve determinism for
user-space applications. In the best case, it can help achieve
determinism and well-bounded latencies. Even in the worst case, PI will
improve the statistical distribution of locking-related application
delays.

The longer reply
----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms. As we
can see in the kernel [which is quite a complex program in itself],
lockless structures are the exception rather than the norm - the current
ratio of lockless vs. locky code for shared data structures is somewhere
between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
algorithms often endangers the ability to do robust reviews of said code.
I.e. critical RT apps often choose lock structures to protect critical
data structures, instead of lockless algorithms. Furthermore, there are
cases (like shared hardware, or other resource limits) where lockless
access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application
design with multiple tasks (with multiple priority levels) sharing
short-held locks: for example, a high-prio audio playback thread is
combined with medium-prio construct-audio-data threads and low-prio
display-colory-stuff threads. Add video and decoding to the mix and
we've got even more priority levels.

So once we accept that synchronization objects (locks) are an
unavoidable fact of life, and once we accept that multi-task userspace
apps have a very fair expectation of being able to use locks, we've got
to think about how to offer the option of a deterministic locking
implementation to user-space.

Most of the technical counter-arguments against doing priority
inheritance only apply to kernel-space locks. But user-space locks are
different: there we cannot disable interrupts or make the task
non-preemptible in a critical section, so the 'use spinlocks' argument
does not apply (user-space spinlocks have the same priority inversion
problems as other user-space locking constructs). Fact is, pretty much
the only technique that currently enables good determinism for userspace
locks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task share a lock
[this is quite a common scenario for most non-trivial RT applications],
even if all critical sections are coded carefully to be deterministic
(i.e. all critical sections are short in duration and only execute a
limited number of instructions), the kernel cannot guarantee any
deterministic execution of the high-prio task: any medium-priority task
could preempt the low-prio task while it holds the shared lock and
executes the critical section, and could delay it indefinitely. The
high-prio task, blocked on that lock, is then delayed indefinitely too.

Implementation
--------------

As mentioned before, the userspace fastpath of PI-enabled pthread
mutexes involves no kernel work at all - they behave quite similarly to
normal futex-based locks: a 0 value means unlocked, and a value==TID
means locked. (This is the same method as used by list-based robust
futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
entering the kernel.
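
The fastpath can be sketched in a few lines of C11. This is only an
illustration under stated assumptions, not the code of any particular C
library: the helper names pi_trylock_fast() and pi_unlock_fast() are
hypothetical, and the caller is assumed to pass in its own TID.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only. The futex word is a
 * 32-bit value shared by all users of the lock.
 */

/* Fastpath acquire: atomically change 0 (unlocked) into the caller's TID. */
bool pi_trylock_fast(_Atomic uint32_t *futex_word, uint32_t tid)
{
	uint32_t expected = 0;

	assert(tid != 0);	/* 0 is reserved for "unlocked" */
	return atomic_compare_exchange_strong_explicit(
		futex_word, &expected, tid,
		memory_order_acquire, memory_order_relaxed);
}

/*
 * Fastpath release: atomically change the caller's TID back into 0.
 * This fails if the word holds anything else, e.g. if the
 * FUTEX_WAITERS bit has been set in the meantime.
 */
bool pi_unlock_fast(_Atomic uint32_t *futex_word, uint32_t tid)
{
	uint32_t expected = tid;

	return atomic_compare_exchange_strong_explicit(
		futex_word, &expected, 0,
		memory_order_release, memory_order_relaxed);
}
```

When either helper returns false, the caller has to fall back to the
kernel slowpath.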

To handle the slowpath, we have added two new futex ops:

  - FUTEX_LOCK_PI
  - FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails [i.e. an atomic transition from 0 to
TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
remaining work: if there is no futex-queue attached to the futex address
yet, then the code looks up the task that owns the futex [it has put its
own TID into the futex value], and attaches a 'PI state' structure to
the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
kernel-based synchronization object. The 'other' task is made the owner
of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
futex value. Then this task tries to lock the rt-mutex, on which it
blocks. Once it returns, it has the mutex acquired, and it sets the
futex value to its own TID and returns. Userspace has no other work to
perform - it now owns the lock, and the futex value contains
FUTEX_WAITERS|TID.
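
To illustrate the FUTEX_WAITERS|TID encoding, the following sketch
splits a futex value into its owner TID and its waiters flag. It assumes
the FUTEX_WAITERS and FUTEX_TID_MASK constants exported by
<linux/futex.h>; the helper names are hypothetical.

```c
#include <assert.h>
#include <linux/futex.h>	/* FUTEX_WAITERS, FUTEX_TID_MASK */
#include <stdbool.h>
#include <stdint.h>

/* The waiters flag must not overlap the bits that hold the owner TID. */
static_assert((FUTEX_WAITERS & FUTEX_TID_MASK) == 0,
	      "FUTEX_WAITERS overlaps the TID field");

/* Hypothetical helper: TID of the current owner, 0 if unowned. */
uint32_t pi_futex_owner_tid(uint32_t val)
{
	return val & FUTEX_TID_MASK;
}

/* Hypothetical helper: true if the kernel has queued waiters on the lock. */
bool pi_futex_has_waiters(uint32_t val)
{
	return (val & FUTEX_WAITERS) != 0;
}
```

A value with the waiters flag set can never equal a bare TID, which is
why the unlock fastpath fails for such a value.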

If the unlock side fastpath succeeds [i.e. userspace manages to do a
TID -> 0 atomic transition of the futex value], then no kernel work is
triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on
behalf of userspace - and it also unlocks the attached
pi_state->rt_mutex and thus wakes up any potential waiters.
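
Both slowpaths can be combined with the fastpaths into a minimal
lock/unlock pair. This is a sketch under stated assumptions, not a
replacement for pthread mutexes: pi_lock(), pi_unlock() and futex_op()
are hypothetical names, the raw system call is issued through
syscall(2), and error handling is reduced to returning -1 with errno
set (a real implementation would, for example, retry on EAGAIN).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>	/* FUTEX_LOCK_PI, FUTEX_UNLOCK_PI */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>	/* SYS_futex, SYS_gettid */
#include <unistd.h>

/* Hypothetical wrapper: issue a futex op that needs no value or timeout. */
static long futex_op(_Atomic uint32_t *uaddr, int op)
{
	return syscall(SYS_futex, uaddr, op, 0, NULL, NULL, 0);
}

/* Returns 0 on success, -1 with errno set on failure. */
int pi_lock(_Atomic uint32_t *word)
{
	uint32_t tid = (uint32_t)syscall(SYS_gettid);
	uint32_t expected = 0;

	assert(word != NULL);

	/* Fastpath: 0 -> TID, no kernel involvement. */
	if (atomic_compare_exchange_strong(word, &expected, tid))
		return 0;

	/* Slowpath: the kernel blocks us and boosts the owner if needed. */
	return (int)futex_op(word, FUTEX_LOCK_PI);
}

/* Returns 0 on success, -1 with errno set (e.g. EPERM if not the owner). */
int pi_unlock(_Atomic uint32_t *word)
{
	uint32_t expected = (uint32_t)syscall(SYS_gettid);

	assert(word != NULL);

	/* Fastpath: TID -> 0, only possible while FUTEX_WAITERS is clear. */
	if (atomic_compare_exchange_strong(word, &expected, 0))
		return 0;

	/* Slowpath: the kernel releases the lock and wakes a waiter. */
	return (int)futex_op(word, FUTEX_UNLOCK_PI);
}
```

In the uncontended case neither function enters the kernel for the
futex itself; the only system call left in this sketch is the TID
lookup, which a real implementation would cache per thread.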

Note that under this approach, contrary to previous PI-futex approaches,
there is no prior 'registration' of a PI-futex [which is not quite
possible anyway, due to existing ABI properties of pthread mutexes].

Also, under this scheme, 'robustness' and 'PI' are two orthogonal
properties of futexes, and all four combinations are possible: futex,
robust-futex, PI-futex, robust+PI-futex.
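
From an application's point of view, the four combinations are usually
selected through POSIX mutex attributes rather than through raw futex
calls. The sketch below assumes a C library (such as glibc) that maps
PTHREAD_PRIO_INHERIT and PTHREAD_MUTEX_ROBUST onto PI and robust
futexes; init_mutex() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: initialize a mutex with any combination of the
 * two properties. Returns 0 on success or a pthread error number.
 */
int init_mutex(pthread_mutex_t *m, bool pi, bool robust)
{
	pthread_mutexattr_t attr;
	int err;

	assert(m != NULL);

	err = pthread_mutexattr_init(&attr);
	if (err)
		return err;

	if (pi)
		err = pthread_mutexattr_setprotocol(&attr,
						    PTHREAD_PRIO_INHERIT);
	if (!err && robust)
		err = pthread_mutexattr_setrobust(&attr,
						  PTHREAD_MUTEX_ROBUST);
	if (!err)
		err = pthread_mutex_init(m, &attr);

	pthread_mutexattr_destroy(&attr);
	return err;
}
```

Calling init_mutex(&m, true, true) requests a robust+PI mutex, while
init_mutex(&m, false, false) yields a plain one. On older C libraries
the program may need to be linked with -pthread.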

More details about priority inheritance can be found in
Documentation/locking/rt-mutex.rst.