This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered.  For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule.  That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized.  Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores, as
        illustrated by the sketch following this list.  There are
        exceptions to this rule, including:

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE().

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().
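
The following sketch illustrates item 4; the variable name "flag" and the
polling loop are purely illustrative and do not come from any particular
kernel code:

        int flag;

        void CPU0(void)
        {
                /* Marked store: no store tearing or invented stores. */
                WRITE_ONCE(flag, 1);
        }

        void CPU1(void)
        {
                /*
                 * Marked load: the compiler may neither fuse successive
                 * loads nor hoist the load out of the loop, so this loop
                 * eventually observes CPU0()'s store.
                 */
                while (!READ_ONCE(flag))
                        cpu_relax();
        }

Had CPU1() instead spun on a plain "while (!flag)", the compiler would
have been within its rights to load flag only once, which is exactly the
load fusing that item 4 warns against.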


Locking
-------

[!] Note:
        locking.txt expands on this section, providing more detail on
        locklessly accessing lock-protected shared variables.

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.
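
For reference, here is a sketch of that unordered pattern, written in the
same CPU0()/CPU1() style used throughout this document rather than in
litmus-test format (see MP+poonceonces.litmus for the real test):

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }

Because nothing orders CPU0()'s stores or CPU1()'s loads, both the
compiler and the hardware are free to reorder them, so the outcome where
r0 ends up 1 and r1 ends up 0 is permitted.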


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
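
The following sketch shows each flavor of dependency in turn; the
structure, variables, and reader() function are hypothetical and serve
only to make the definitions above concrete:

        struct foo {
                int a;
        };
        struct foo *gp;         /* RCU-protected pointer. */
        int y;
        int z;

        void reader(void)
        {
                struct foo *p;
                int r0;

                rcu_read_lock();
                p = rcu_dereference(gp);
                if (p) {
                        /* Address dependency: p determines the address loaded from. */
                        r0 = READ_ONCE(p->a);
                        /* Data dependency: the value just loaded determines the value stored. */
                        WRITE_ONCE(y, r0);
                        /* Control dependency: "if (p)" determines whether this store executes. */
                        WRITE_ONCE(z, 1);
                }
                rcu_read_unlock();
        }

Later accesses that carry no dependency on the value returned by
rcu_dereference() receive no ordering from it.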

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

 *   kernel                             user
 *
 *   if (LOAD ->data_tail) {            LOAD ->data_head
 *                      (A)             smp_rmb()       (C)
 *      STORE $data                     LOAD $data
 *      smp_wmb()       (B)             smp_mb()        (D)
 *      STORE ->data_head               STORE ->data_tail
 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

 *   kernel                             user
 *
 *   if (LOAD ->data_tail) {            LOAD ->data_head
 *                      (A)             smp_rmb()       (C)
 *      STORE $data                     LOAD $data
 *      smp_wmb()       (B)             smp_mb()        (D)
 *      STORE ->data_head               STORE ->data_tail
 *   }
 *
 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
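
Here is a sketch of that weakened variant; it is simply the example above
with CPU1()'s acquire relaxed, not one of the litmus tests shipped with
the kernel:

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);              /* Was smp_load_acquire(&y). */
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

As stated above, final values of 1 for both r0 and r1 still guarantee a
final value of 1 for r2: CPU1()'s smp_store_release() is enough to keep
CPU2() within the chain.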

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 * CPU0 - waker                        CPU1 - waiter
 *
 *                                     for (;;) {
 *   @cond = true;                       prepare_to_wait(&wq_head, &wait, state);
 *   smp_mb();                           // smp_mb() from set_current_state()
 *   if (waitqueue_active(wq_head))      if (@cond)
 *     wake_up(wq_head);                   break;
 *                                       schedule();
 *                                     }
 *                                     finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.
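
For concreteness, here is the same pattern written out as C, directly
following the comment above; the wake_cond flag, the wait-queue
declaration, and the waker()/waiter() functions are illustrative rather
than taken from any particular caller:

        static bool wake_cond;
        static DECLARE_WAIT_QUEUE_HEAD(wq_head);

        /* CPU0: the waker. */
        static void waker(void)
        {
                WRITE_ONCE(wake_cond, true);
                /* Order the store to wake_cond before the waitqueue_active() load. */
                smp_mb();
                if (waitqueue_active(&wq_head))
                        wake_up(&wq_head);
        }

        /* CPU1: the waiter. */
        static void waiter(void)
        {
                DEFINE_WAIT(wait);

                for (;;) {
                        /* set_current_state() within supplies the smp_mb(). */
                        prepare_to_wait(&wq_head, &wait, TASK_UNINTERRUPTIBLE);
                        if (READ_ONCE(wake_cond))
                                break;
                        schedule();
                }
                finish_wait(&wq_head, &wait);
        }

Each side stores (to wake_cond or to the wait queue), executes a full
barrier, and only then loads the other side's state, which is exactly
the SB pattern.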

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote.  The LB litmus-test patterns contain only
        this type of relation.  In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read.  The SB litmus test contains only this type
        of relation.  In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU.  The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2().  In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co".  In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices.  For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices.  Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links.  This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
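
If you do create such a test, it will look something like the following
sketch of the SB pattern, written in the format used by the tests in
tools/memory-model/litmus-tests (the actual SB+fencembonceonces.litmus
file may differ in its comments and other details):

        C SB+fencembonceonces-sketch

        {}

        P0(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*x, 1);
                smp_mb();
                r0 = READ_ONCE(*y);
        }

        P1(int *x, int *y)
        {
                int r1;

                WRITE_ONCE(*y, 1);
                smp_mb();
                r1 = READ_ONCE(*x);
        }

        exists (0:r0=0 /\ 1:r1=0)

Feeding such a file to herd7 (for example, "herd7 -conf linux-kernel.cfg
<file>.litmus" from the tools/memory-model directory) reports whether the
"exists" clause can be satisfied; for this particular test the model says
that it cannot, in agreement with the store-buffering discussion above.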