This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed,
and the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.
4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores; see
	the sketch following this list.  There are exceptions to this
	rule, including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().

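The following sketch, using hypothetical variable and function names,
illustrates what READ_ONCE() and WRITE_ONCE() rule out.  Given a plain
access such as "shared_counter = 42", the compiler is permitted to tear
the store into several smaller stores, to fuse it with adjacent stores,
or to invent additional stores, but the *_ONCE() variants forbid these
transformations for properly aligned, machine-word-sized variables:

	/* Hypothetical sketch, not taken from the kernel source. */
	int shared_counter;	/* Shared among CPUs, properly aligned. */

	void update_counter(void)
	{
		/*
		 * Emitted as a single full-sized store: no tearing,
		 * fusing, or invented stores.
		 */
		WRITE_ONCE(shared_counter, 42);
	}

	int sample_counter(void)
	{
		/*
		 * Emitted as a single full-sized load: cannot be fused
		 * with other loads of shared_counter or refetched later.
		 */
		return READ_ONCE(shared_counter);
	}

In contrast, a plain load of shared_counter in a loop may be hoisted out
of that loop entirely, so that the loop never observes the variable
changing.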

Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.
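
For reference, the MP+polocks.litmus test cited above has roughly the
following shape.  This is a sketch: the header comments and the exact
syntax of the initialization block may differ from the actual file in
tools/memory-model/litmus-tests/.

	C MP+polocks

	{}

	P0(int *x, int *y, spinlock_t *mylock)
	{
		WRITE_ONCE(*x, 1);
		spin_lock(mylock);
		WRITE_ONCE(*y, 1);
		spin_unlock(mylock);
	}

	P1(int *x, int *y, spinlock_t *mylock)
	{
		int r0;
		int r1;

		spin_lock(mylock);
		r0 = READ_ONCE(*y);
		spin_unlock(mylock);
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 1:r1=0)

Running this test on the model shows that the "exists" clause is never
satisfied, which is a machine-checked statement of the basic rule.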

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.

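The heart of that pattern can be sketched as follows (hypothetical
names, not the actual lib/stackdepot.c code): the initializing CPU fills
in a structure using plain accesses, then publishes a pointer to it with
smp_store_release(), and any reader that picks up the pointer with
smp_load_acquire() is guaranteed to see the initialization:

	/* Hypothetical sketch, not taken from the kernel source. */
	struct slab {
		int data[SLAB_NENTRIES];	/* SLAB_NENTRIES is made up. */
	};
	static struct slab *published_slab;	/* NULL until published. */

	void publish_slab(struct slab *s)
	{
		initialize_slab(s);	/* Plain accesses suffice here. */
		smp_store_release(&published_slab, s);
	}

	void consume_slab(void)
	{
		struct slab *s = smp_load_acquire(&published_slab);

		if (s)
			use_slab(s);	/* Sees the full initialization. */
	}
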

Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

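The following sketch (hypothetical code) shows one dependency of each
type flowing from a single rcu_dereference(), with gp standing in for
some RCU-protected pointer to a structure with fields a and b:

	/* Hypothetical sketch, not taken from the kernel source. */
	void reader(void)
	{
		struct foo *p;

		rcu_read_lock();
		p = rcu_dereference(gp);
		r0 = READ_ONCE(p->a);	/* Address dependency: the value
					 * of p determines the address
					 * loaded from. */
		WRITE_ONCE(x, r0);	/* Data dependency: the value
					 * loaded determines the value
					 * stored. */
		if (p->b)		/* Control dependency: the value
					 * loaded determines whether the
					 * store executes at all. */
			WRITE_ONCE(y, 1);
		rcu_read_unlock();
	}
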
In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }
	 *
	 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().

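In that weakened version, CPU1() would look as follows (a sketch, with
no corresponding litmus test cited here); its smp_store_release() still
provides the ordering that the chain requires:

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_store_release(&z, 1);
	}
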
It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern, as shown
in the sketch below.
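
For example, the higher-level wait_event() and wake_up() APIs both
acquire the wait-queue lock internally, which allows the caller to
avoid open-coding the SB pattern entirely.  A hypothetical sketch:

	/* Hypothetical sketch, not taken from the kernel source. */
	static DECLARE_WAIT_QUEUE_HEAD(my_wq_head);
	static bool my_cond;

	void my_waker(void)
	{
		my_cond = true;		/* Must precede the wake_up(). */
		wake_up(&my_wq_head);
	}

	void my_waiter(void)
	{
		/* Sleeps until my_cond is true; the needed ordering is
		 * supplied internally by the wait-queue machinery. */
		wait_event(my_wq_head, my_cond);
	}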


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage, which are also
shown link by link in the sketch following this list:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.  In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".

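As an illustration, here are the links in the first Z6.0 locking example
above, for the outcome in which r0 is 1, z is 2, and r1 is 0:

	CPU0's WRITE_ONCE(y, 1)  ->rf->  CPU1's READ_ONCE(y)      (write-to-read)
	CPU1's WRITE_ONCE(z, 1)  ->co->  CPU2's WRITE_ONCE(z, 2)  (write-to-write)
	CPU2's READ_ONCE(x)      ->fr->  CPU0's WRITE_ONCE(x, 1)  (read-to-write)
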
The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
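
For example, assuming that the herd7 tool has been installed as
described in tools/memory-model/README, a litmus test can be checked
against the model from the tools/memory-model directory with a command
of the following form:

	$ herd7 -conf linux-kernel.cfg litmus-tests/SB+fencembonceonces.litmus

The tool's output reports whether the test's "exists" clause can be
satisfied, for example, "Never", "Sometimes", or "Always".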