.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into three categories:

 - Sleeping locks
 - CPU local locks
 - Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock().  Furthermore, it is also necessary to evaluate the debugging
versions of these primitives.  In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

 - mutex
 - rt_mutex
 - semaphore
 - rw_semaphore
 - ww_mutex
 - percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

 - local_lock
 - spinlock_t
 - rwlock_t


CPU local locks
---------------

 - local_lock

On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives.  Unlike other locking
mechanisms, disabling preemption or interrupts is a purely CPU-local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.


Spinning locks
--------------

 - raw_spinlock_t
 - bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

 - spinlock_t
 - rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================


Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts.  This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
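
As a rough sketch of that guidance, the following example uses a mutex for
serialization and a completion for waiting instead of a single semaphore.
The struct and function names (widget, widget_publish(), widget_wait())
are made up for illustration only::

  #include <linux/mutex.h>
  #include <linux/completion.h>

  struct widget {
          struct mutex lock;              /* serializes access to data */
          struct completion ready;        /* signals that data is valid */
          int data;
  };

  static void widget_init(struct widget *w)
  {
          mutex_init(&w->lock);
          init_completion(&w->ready);
  }

  /* Producer side: publish the data, then wake up one waiter. */
  static void widget_publish(struct widget *w, int value)
  {
          mutex_lock(&w->lock);
          w->data = value;
          mutex_unlock(&w->lock);
          complete(&w->ready);
  }

  /* Consumer side: sleep until the data has been published. */
  static int widget_wait(struct widget *w)
  {
          int value;

          wait_for_completion(&w->ready);
          mutex_lock(&w->lock);
          value = w->data;
          mutex_unlock(&w->lock);
          return value;
  }

Keeping the two roles separate leaves the serialization part owner-based,
so priority inheritance can work for it on PREEMPT_RT, while the
wait-for-event part is expressed explicitly by the completion.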

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores.  After all, an unknown
owner cannot be boosted.  As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

 Because an rw_semaphore writer cannot grant its priority to multiple
 readers, a preempted low-priority reader will continue holding its lock,
 thus starving even high-priority writers.  In contrast, because readers
 can grant their priority to a writer, a preempted low-priority writer will
 have its priority boosted until it releases the lock, thus preventing that
 writer from starving readers.


local_lock
==========

local_lock provides a named scope for critical sections which are protected
by disabling preemption or interrupts.

On non-PREEMPT_RT kernels local_lock operations map to the preemption and
interrupt disabling and enabling primitives:

 ===============================  ======================
 local_lock(&llock)               preempt_disable()
 local_unlock(&llock)             preempt_enable()
 local_lock_irq(&llock)           local_irq_disable()
 local_unlock_irq(&llock)         local_irq_enable()
 local_lock_irqsave(&llock)       local_irq_save()
 local_unlock_irqrestore(&llock)  local_irq_restore()
 ===============================  ======================

The named scope of local_lock has two advantages over the regular
primitives:

 - The lock name allows static analysis and also clearly documents the
   protection scope, while the regular primitives are scopeless and
   opaque.

 - If lockdep is enabled, the local_lock gains a lockmap which allows the
   correctness of the protection to be validated.  This can detect cases
   where, e.g., a function using preempt_disable() as its protection
   mechanism is invoked from interrupt or soft-interrupt context.  Apart
   from that, lockdep_assert_held(&llock) works as with any other locking
   primitive.

local_lock and PREEMPT_RT
-------------------------

PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
semantics:

 - All spinlock_t changes also apply to local_lock.

local_lock usage
----------------

local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.

local_lock is not suitable to protect against preemption or interrupts on a
PREEMPT_RT kernel due to the PREEMPT_RT-specific spinlock_t semantics.
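
Within those constraints, a minimal sketch of the intended usage looks as
follows.  The struct and function names (foo_stats, foo_stats_inc()) are
made up for illustration; the same code is correct on non-PREEMPT_RT
kernels, where local_lock() disables preemption, and on PREEMPT_RT kernels,
where it acquires the underlying per-CPU spinlock_t::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct foo_stats {
          local_lock_t lock;
          unsigned int count;
  };

  static DEFINE_PER_CPU(struct foo_stats, foo_stats) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void foo_stats_inc(void)
  {
          /*
           * Non-PREEMPT_RT: disables preemption.
           * PREEMPT_RT: acquires this CPU's spinlock_t, which keeps
           * preemption enabled but disables migration.
           */
          local_lock(&foo_stats.lock);
          this_cpu_inc(foo_stats.count);
          local_unlock(&foo_stats.lock);
  }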

CPU local scope and bottom-half
-------------------------------

Code that accesses per-CPU variables only from softirq context should not
rely on the assumption that this context is implicitly protected by being
non-preemptible.  On a PREEMPT_RT kernel, softirq context is preemptible,
and relying on implicit serialization of every bottom-half-disabled section
amounts to an implicit per-CPU "big kernel lock".

A local_lock_t, used via local_lock_nested_bh() and
local_unlock_nested_bh() for the locking operations, makes the locking
scope explicit.

When lockdep is enabled, these functions verify that data structure access
occurs within softirq context.  Unlike local_lock(),
local_lock_nested_bh() does not disable preemption and adds no overhead
when lockdep is disabled.

On a PREEMPT_RT kernel, local_lock_t behaves as a real lock and
local_lock_nested_bh() serializes access to the data structure, which
allows the serialization via local_bh_disable() to be removed.

raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

 - Preemption is not disabled.

 - The hard interrupt related suffixes for spin_lock / spin_unlock
   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
   interrupt disabled state.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.

   Non-PREEMPT_RT kernels disable preemption to get this effect.

   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled.  The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
   kernels leave task state untouched.  However, PREEMPT_RT must change
   task state if the task blocks during acquisition.
   Therefore, it saves the current task state before blocking, and the
   corresponding lock wakeup restores it, as shown below::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        lock wakeup
                                          task->state = task->saved_state

   Other types of wakeups would normally unconditionally set the task state
   to RUNNING, but that does not work here because the task must remain
   blocked until the lock becomes available.  Therefore, when a non-lock
   wakeup attempts to awaken a task blocked waiting for a spinlock, it
   instead sets the saved state to RUNNING.  Then, when the lock
   acquisition completes, the lock wakeup sets the task state to the saved
   state, in this case setting it to RUNNING::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        non lock wakeup
                                          task->saved_state = TASK_RUNNING

                                        lock wakeup
                                          task->state = task->saved_state

   This ensures that the real wakeup cannot be lost.


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly.  The implementation is fair,
thus preventing writer starvation.

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its lock,
   thus starving even high-priority writers.  In contrast, because readers
   can grant their priority to a writer, a preempted low-priority writer
   will have its priority boosted until it releases the lock, thus
   preventing that writer from starving readers.


PREEMPT_RT caveats
==================

local_lock on RT
----------------

The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
implications.  For example, on a non-PREEMPT_RT kernel the following code
sequence works as expected::

  local_lock_irq(&local_lock);
  raw_spin_lock(&lock);

and is fully equivalent to::

  raw_spin_lock_irq(&lock);

On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
is mapped to a per-CPU spinlock_t which neither disables interrupts nor
preemption.  The following code sequence works correctly on both
PREEMPT_RT and non-PREEMPT_RT kernels::

  local_lock_irq(&local_lock);
  spin_lock(&lock);

Another caveat with local locks is that each local_lock has a specific
protection scope.
So the following substitution is wrong::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
  }

  func3()
  {
    lockdep_assert_irqs_disabled();
    access_protected_data();
  }

On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
of func3().  The lockdep assert will also trigger on a PREEMPT_RT kernel
because local_lock_irqsave() does not disable interrupts due to the
PREEMPT_RT-specific semantics of spinlock_t.  The correct substitution is::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func3()
  {
    lockdep_assert_held(&local_lock);
    access_protected_data();
  }


spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications.  For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

  local_irq_disable();
  spin_lock(&lock);

and is fully equivalent to::

  spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because an RT-mutex
requires a fully preemptible context.  Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts.  In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism.  Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks.  The PREEMPT_RT-specific change of spinlock_t semantics does
not allow p->lock to be acquired because get_cpu_ptr() implicitly disables
preemption.  The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

migrate_disable() ensures that the task is pinned on the current CPU, which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on the
same CPU while the task remains preemptible.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();

This breaks because migrate_disable() does not protect against reentrancy
from a preempting task.
A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption.  On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t; for example, the critical section must avoid
allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts.  However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex.  Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site.  In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.


Lock type nesting rules
=======================

The most basic rules are:

 - Lock types of the same lock category (sleeping, CPU local, spinning)
   can nest arbitrarily as long as they respect the general lock ordering
   rules to prevent deadlocks.

 - Sleeping lock types cannot nest inside CPU local and spinning lock
   types.

 - CPU local and spinning lock types can nest inside sleeping lock types.

 - Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock.  This results in the following nesting ordering:

  1) Sleeping locks
  2) spinlock_t, rwlock_t, local_lock
  3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
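
As a rough illustration of this ordering (the lock variables below are
hypothetical and only sketch the acquisition order), the first sequence is
valid on all kernel configurations, while the second violates the ordering
because a spinlock_t, which sleeps on PREEMPT_RT, is acquired while a
raw_spinlock_t is held::

  /* Valid: locks are acquired in the documented nesting order,
   * outer to inner. */
  mutex_lock(&m);                   /* 1) sleeping lock   */
  spin_lock(&lock);                 /* 2) spinlock_t      */
  raw_spin_lock(&raw_lock);         /* 3) raw_spinlock_t  */
  /* ... critical section ... */
  raw_spin_unlock(&raw_lock);
  spin_unlock(&lock);
  mutex_unlock(&m);

  /* Invalid: spinlock_t must not be acquired with a raw_spinlock_t
   * held because it sleeps on PREEMPT_RT.  Lockdep complains on
   * all kernel configurations. */
  raw_spin_lock(&raw_lock);
  spin_lock(&lock);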