This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed,
and the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.	Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.
4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores.
	There are exceptions to this rule (a sketch illustrating load
	fusing appears just after this list), including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
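
For example, consider the following sketch of a flag-based busy-wait
loop.  This is for illustration only and is not taken from the kernel:

	int flag;	/* Shared between CPUs. */

	void waiter(void)
	{
		/*
		 * Without READ_ONCE(), the compiler may fuse the repeated
		 * loads into a single load hoisted out of the loop, in
		 * which case the loop would never observe the other
		 * CPU's store.
		 */
		while (!READ_ONCE(flag))
			cpu_relax();
	}

	void updater(void)
	{
		/* WRITE_ONCE() prevents store tearing and invented stores. */
		WRITE_ONCE(flag, 1);
	}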


Locking
-------

[!] Note:
	locking.txt expands on this section, providing more detail on
	locklessly accessing lock-protected shared variables.

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack depot.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
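
For illustration only, here is a sketch (not taken from the kernel)
showing all three kinds of dependency flowing from a single marked
load; the names p, q, z, and the field a are made up for this sketch:

	r0 = READ_ONCE(p);	/* Load a pointer to a structure. */

	r1 = READ_ONCE(r0->a);		/* Address dependency: r0 supplies
					   the address of a later load. */
	WRITE_ONCE(q, r0);		/* Data dependency: r0 supplies the
					   value written by a later store. */
	if (r0)				/* Control dependency: r0 determines
					   whether a later store executes. */
		WRITE_ONCE(z, 1);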

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.
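
For illustration only, here is a rough sketch of that expansion idiom.
The names primes, struct prime_bits, grow_copy(), and the update-side
lock are assumptions for this sketch, not the actual code in
lib/math/prime_numbers.c:

	struct prime_bits *primes;	/* RCU-protected bit vector. */

	void expander(void)	/* Caller holds the update-side lock. */
	{
		struct prime_bits *old = primes;
		struct prime_bits *new = grow_copy(old); /* Fully initialized. */

		rcu_assign_pointer(primes, new);  /* Publish the new vector. */
		synchronize_rcu();	/* Wait for pre-existing readers. */
		kfree(old);
	}

	bool reader(unsigned long n)
	{
		bool r;

		rcu_read_lock();
		r = test_bit(n, rcu_dereference(primes)->bits);
		rcu_read_unlock();
		return r;
	}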


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }
	 *
	 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation were replaced with READ_ONCE(), as shown below.
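
Here is that weakened variant; the outcome in which r0 and r1 are both
1 but r2 is 0 remains forbidden, because the release store still orders
CPU1()'s prior load against that store:

	void CPU1(void)	/* Weakened, but still sufficient. */
	{
		r0 = READ_ONCE(y);
		smp_store_release(&z, 1);
	}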

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern, as shown
in the sketch below.
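
In the following sketch (for illustration only, not taken from the
kernel), both critical sections run under the same lock, so one must
complete before the other begins.  This guarantees that at least one
of r0 and r1 ends up equal to 1, with no explicit memory barriers:

	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}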


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.	In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".
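
For example, the links in the Z6.0 litmus test shown in the locking
section above can be annotated as follows:

	CPU0()'s store to y  ->rf  CPU1()'s load from y   (write-to-read)
	CPU1()'s store to z  ->co  CPU2()'s store to z    (write-to-write)
	CPU2()'s load from x ->fr  CPU0()'s store to x    (read-to-write)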

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
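
For example, the store-buffering pattern shown earlier corresponds to
the litmus test below.  Assuming the usual tools/memory-model setup,
it can be run with a command along the lines of
"herd7 -conf linux-kernel.cfg SB+fencembonceonces.litmus":

	C SB+fencembonceonces

	{}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		smp_mb();
		r0 = READ_ONCE(*y);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)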
575