============================
LINUX KERNEL MEMORY BARRIERS
============================
The formal memory consistency model and its accompanying documentation can be
found at tools/memory-model/.  Nevertheless, even this memory model should be
viewed as the collective opinion of its maintainers rather than as an
infallible oracle.

Note also that it is possible that a barrier may be a no-op for an
architecture because the way that architecture works renders an explicit
barrier unnecessary in that case.
CONTENTS
--------

 (*) Abstract memory access model.

     - Device operations.
     - Guarantees.

 (*) What are memory barriers?

     - Varieties of memory barrier.
     - What may not be assumed about memory barriers?
     - Address-dependency barriers (historical).
     - Control dependencies.
     - SMP barrier pairing.
     - Examples of memory barrier sequences.
     - Read memory barriers vs load speculation.
     - Multicopy atomicity.

 (*) Explicit kernel barriers.

     - Compiler barrier.
     - CPU memory barriers.

 (*) Implicit kernel memory barriers.

     - Lock acquisition functions.
     - Interrupt disabling functions.
     - Sleep and wake-up functions.
     - Miscellaneous functions.

 (*) Inter-CPU acquiring barrier effects.

     - Acquires vs memory accesses.

 (*) Where are memory barriers needed?

     - Interprocessor interaction.
     - Atomic operations.
     - Accessing devices.
     - Interrupts.

 (*) The effects of the CPU cache.

     - Cache coherency vs DMA.
     - Cache coherency vs MMIO.

 (*) The things CPUs get up to.

     - And then there's the Alpha.
     - Virtual Machine Guests.

 (*) Example uses.

     - Circular buffers.

 (*) References.
============================
ABSTRACT MEMORY ACCESS MODEL
============================

Consider the following abstract model of the system:

	+-------+   :   +--------+   :   +-------+
	|       |   :   |        |   :   |       |
	| CPU 1 |<----->| Memory |<----->| CPU 2 |
	|       |   :   |        |   :   |       |
	+-------+   :   +--------+   :   +-------+
	    ^       :       ^        :       ^
	    |       :   +--------+   :       |
	    +---------->| Device |<----------+
	            :   +--------+   :
Each CPU executes a program that generates memory access operations.  For
example, consider the following sequence of events:

	CPU 1		CPU 2
	===============	===============
	{ A == 1; B == 2 }
	A = 3;		x = B;
	B = 4;		y = A;

The set of accesses as seen by the memory system in the middle can be arranged
in 24 different combinations, for example:

	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
	...

and can thus result in four different combinations of values:

	x == 2, y == 1
	x == 2, y == 3
	x == 4, y == 1
	x == 4, y == 3
Furthermore, given:

	CPU 1		CPU 2
	===============	===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;		Q = P;
	P = &B;		D = *Q;

note that CPU 2 will never try to load C into D, because the CPU will load P
into Q before issuing the load of *Q.
DEVICE OPERATIONS
-----------------

Some devices present their control interfaces as collections of memory
locations, but the order in which the control registers are accessed is very
important.  For instance, imagine an ethernet card with a set of internal
registers that are accessed through an address port register (A) and a data
port register (D).  To read internal register 5, the following code might be
used:

	*A = 5;
	x = *D;

but this might show up as either of the following two sequences:

	STORE *A = 5, x = LOAD *D
	x = LOAD *D, STORE *A = 5

the second of which will almost certainly result in a malfunction, since it
set the address _after_ attempting to read the register.
GUARANTEES
----------

There are some minimal guarantees that may be expected of a CPU:

 (*) On any given CPU, dependent memory accesses will be issued in order,
     with respect to itself.  This means that for:

	Q = READ_ONCE(P); D = READ_ONCE(*Q);

     the CPU will issue the following memory operations:

	Q = LOAD P, D = LOAD *Q

     and always in that order.  However, on DEC Alpha, READ_ONCE() also
     emits a memory-barrier instruction, so that a DEC Alpha CPU will
     instead issue the following memory operations:

	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER

     Thus, on DEC Alpha, READ_ONCE() includes an address-dependency barrier.

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

	a = READ_ONCE(*X); WRITE_ONCE(*X, b);

     the CPU will only issue the following sequence of memory operations:

	a = LOAD *X, STORE *X = b

     And for:

	WRITE_ONCE(*X, c); d = READ_ONCE(*X);

     the CPU will only issue:

	STORE *X = c, d = LOAD *X
And there are a number of things that _must_ or _must_not_ be assumed:

 (*) It _must_not_ be assumed that independent loads and stores will be
     issued in the order given.  This means that for:

	X = *A; Y = *B; *D = Z;

     we may get any of the following sequences:

	X = LOAD *A, Y = LOAD *B, STORE *D = Z
	X = LOAD *A, STORE *D = Z, Y = LOAD *B
	Y = LOAD *B, X = LOAD *A, STORE *D = Z
	Y = LOAD *B, STORE *D = Z, X = LOAD *A
	STORE *D = Z, X = LOAD *A, Y = LOAD *B
	STORE *D = Z, Y = LOAD *B, X = LOAD *A

 (*) It _must_ be assumed that overlapping memory accesses may be merged or
     discarded.  This means that for:

	X = *A; Y = *(A + 4);

     we may get any one of the following sequences:

	X = LOAD *A; Y = LOAD *(A + 4);
	Y = LOAD *(A + 4); X = LOAD *A;
	{X, Y} = LOAD {*A, *(A + 4) };
And there are anti-guarantees:

 (*) These guarantees do not apply to bitfields, because compilers often
     generate code to modify these using non-atomic read-modify-write
     sequences.  Do not attempt to use bitfields to synchronize parallel
     algorithms.

 (*) Even in cases where bitfields are protected by locks, all fields
     in a given bitfield must be protected by one lock.  If two fields
     in a given bitfield are protected by different locks, the compiler's
     non-atomic read-modify-write sequences can cause an update to one
     field to corrupt the value of an adjacent field.

 (*) These guarantees apply only to properly aligned and sized scalar
     variables.  "Properly sized" currently means variables that are the
     same size as "char", "short", "int" and "long".  "Properly aligned"
     means the natural alignment, thus no constraints for "char", two-byte
     alignment for "short", four-byte alignment for "int", and either
     four-byte or eight-byte alignment for "long", on 32-bit and 64-bit
     systems, respectively.  Note that these guarantees were introduced
     into the C11 standard, so beware when using older pre-C11 compilers
     (for example, gcc 4.6).  The portion of the standard containing this
     guarantee is Section 3.14, which defines "memory location" as follows:

	memory location
		either an object of scalar type, or a maximal sequence
		of adjacent bit-fields all having nonzero width

		NOTE 1: Two threads of execution can update and access
		separate memory locations without interfering with
		each other.

		NOTE 2: A bit-field and an adjacent non-bit-field member
		are in separate memory locations.  The same applies
		to two bit-fields, if one is declared inside a nested
		structure declaration and the other is not, or if the two
		are separated by a zero-length bit-field declaration,
		or if they are separated by a non-bit-field member
		declaration.  It is not safe to concurrently update two
		bit-fields in the same structure if all members declared
		between them are also bit-fields, no matter what the
		sizes of those intervening bit-fields happen to be.
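As an illustration of the bitfield anti-guarantee, consider the following
sketch (the structure, locks and field names are invented for this example,
not taken from kernel source):

	struct outcome {
		int done:1;	/* protected by done_lock */
		int error:1;	/* protected by error_lock -- BUG */
	};

Because 'done' and 'error' share a memory location, a compiler may implement
an update to 'error' as a non-atomic read-modify-write of the whole word,
silently overwriting a concurrent update to 'done' made under the other lock.
Splitting the flags into separate "int" members (or protecting both with one
lock) avoids the problem.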
=========================
WHAT ARE MEMORY BARRIERS?
=========================

As can be seen above, independent memory operations are effectively performed
in random order, but this can be a problem for CPU-CPU interaction and for
I/O.  What is required is some way of intervening to instruct the compiler
and the CPU to restrict the order.

Memory barriers are such interventions.  They impose a perceived partial
ordering over the memory operations on either side of the barrier.


VARIETIES OF MEMORY BARRIER
---------------------------

Memory barriers come in four basic varieties:
 (1) Write (or store) memory barriers.

     A write memory barrier gives a guarantee that all the STORE operations
     specified before the barrier will appear to happen before all the STORE
     operations specified after the barrier with respect to the other
     components of the system.  A write barrier is a partial ordering on
     stores only; it is not required to have any effect on loads.

     [!] Note that write barriers should normally be paired with read or
     address-dependency barriers; see the "SMP barrier pairing" subsection.
 (2) Address-dependency barriers (historical).

     [!] This section is marked as HISTORICAL: it covers the long-obsolete
     smp_read_barrier_depends() macro, the semantics of which are now
     implicit in all marked accesses.  For more up-to-date information,
     including how compiler transformations can sometimes break address
     dependencies, see Documentation/RCU/rcu_dereference.rst.

     An address-dependency barrier is a weaker form of read barrier.  In the
     case where two loads are performed such that the second depends on the
     result of the first (eg: the first load retrieves the address to which
     the second load will be directed), an address-dependency barrier would
     be required to make sure that the target of the second load is updated
     after the address obtained by the first load is accessed.

     An address-dependency barrier is a partial ordering on interdependent
     loads only; it is not required to have any effect on stores, independent
     loads or overlapping loads.

     As mentioned in (1), the other CPUs in the system can be viewed as
     committing sequences of stores to the memory system that the CPU being
     considered can then perceive.  An address-dependency barrier issued by
     the CPU under consideration guarantees that for any load preceding it,
     if that load touches one of a sequence of stores from another CPU, then
     by the time the barrier completes, the effects of all the stores prior
     to that touched by the load will be perceptible to any loads issued
     after the address-dependency barrier.

     [!] Note that the first load really has to have an _address_ dependency
     and not a control dependency.  If the address for the second load is
     dependent on the first load, but the dependency is through a conditional
     rather than actually loading the address itself, then it's a _control_
     dependency and a full read barrier or better is required.  See the
     "Control dependencies" subsection for more information.

     [!] Note that address-dependency barriers should normally be paired with
     write barriers; see the "SMP barrier pairing" subsection.

     [!] Kernel release v5.9 removed kernel APIs for explicit address-
     dependency barriers.  Nowadays, APIs for marking loads from shared
     variables such as READ_ONCE() and rcu_dereference() provide implicit
     address-dependency barriers.
 (3) Read (or load) memory barriers.

     A read barrier is an address-dependency barrier plus a guarantee that
     all the LOAD operations specified before the barrier will appear to
     happen before all the LOAD operations specified after the barrier with
     respect to the other components of the system.  A read barrier is a
     partial ordering on loads only; it is not required to have any effect
     on stores.

     Read memory barriers imply address-dependency barriers, and so can
     substitute for them.

     [!] Note that read barriers should normally be paired with write
     barriers; see the "SMP barrier pairing" subsection.
 (4) General memory barriers.

     A general memory barrier gives a guarantee that all the LOAD and STORE
     operations specified before the barrier will appear to happen before all
     the LOAD and STORE operations specified after the barrier with respect
     to the other components of the system as a whole.

     A general memory barrier is a partial ordering over both loads and
     stores.  General memory barriers imply both read and write memory
     barriers, and so can substitute for either.
And a couple of implicit varieties:

 (5) ACQUIRE operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the ACQUIRE operation will appear to happen after the
     ACQUIRE operation with respect to the other components of the system.
     ACQUIRE operations include LOCK operations and both smp_load_acquire()
     and smp_cond_load_acquire() operations.

     Memory operations that occur before an ACQUIRE operation may appear to
     happen after it completes.

     An ACQUIRE operation should almost always be paired with a RELEASE
     operation.

 (6) RELEASE operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the RELEASE operation will appear to happen
     before the RELEASE operation with respect to the other components of
     the system.  RELEASE operations include UNLOCK operations and
     smp_store_release() operations.

     Memory operations that occur after a RELEASE operation may appear to
     happen before it completes.
     The use of ACQUIRE and RELEASE operations generally precludes the need
     for other sorts of memory barrier.  In addition, a RELEASE+ACQUIRE pair
     is -not- guaranteed to act as a full memory barrier.  However, after an
     ACQUIRE on a given variable, all memory accesses preceding any prior
     RELEASE on that same variable are guaranteed to be visible.  In other
     words, within a given variable's critical section, all accesses of all
     previous critical sections for that variable are guaranteed to have
     completed.

     This means that ACQUIRE acts as a minimal "acquire" operation and
     RELEASE acts as a minimal "release" operation.
A subset of the atomic operations described in atomic_t.txt have ACQUIRE and
RELEASE variants in addition to fully-ordered and relaxed (no barrier
semantics) definitions.  For compound atomics performing both a load and a
store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
only to the store portion of the operation.
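As an illustration of RELEASE/ACQUIRE pairing, consider the following sketch
of a producer and consumer (the variables and functions are invented for this
example, not taken from kernel source):

	int data;
	int flag;

	void producer(void)
	{
		data = 42;			/* plain store */
		smp_store_release(&flag, 1);	/* RELEASE: orders the store
						 * to data before the store
						 * to flag */
	}

	void consumer(void)
	{
		if (smp_load_acquire(&flag))	/* ACQUIRE: pairs with the
						 * RELEASE in producer() */
			BUG_ON(data != 42);	/* guaranteed to observe 42 */
	}

If consumer() observes flag == 1, the ACQUIRE/RELEASE pair guarantees that it
also observes the store to data; no stronger (and more expensive) general
barrier is needed for this message-passing pattern.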
WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
----------------------------------------------

There are certain things that the Linux kernel memory barriers do not
guarantee, amongst them:

 (*) There is no guarantee that some intervening piece of off-the-CPU
     machinery won't reorder the memory accesses.  CPU cache coherency
     mechanisms should propagate the indirect effects of a memory barrier
     between CPUs, but might not do so in order.
	[*] For information on bus mastering DMA and coherency please read:

	    Documentation/driver-api/pci/pci.rst
	    Documentation/core-api/dma-api-howto.rst
	    Documentation/core-api/dma-api.rst
ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
----------------------------------------

[!] This section is marked as HISTORICAL: it covers the long-obsolete
smp_read_barrier_depends() macro, the semantics of which are now implicit
in all marked accesses.  For more up-to-date information, including
how compiler transformations can sometimes break address dependencies,
see Documentation/RCU/rcu_dereference.rst.

As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
DEC Alpha, which means that about the only people who need to pay attention
to this section are those working on DEC Alpha architecture-specific code
and those working on READ_ONCE() itself.  For those who need it, and for
those who are interested in the history, here is the story of
address-dependency barriers.
[!] While address dependencies are observed in both load-to-load and
load-to-store relations, address-dependency barriers are not necessary
for load-to-store situations.
The requirement of address-dependency barriers is a little subtle, and
it's not always obvious that they're needed.  To illustrate, consider the
following sequence of events:

	CPU 1		      CPU 2
	===============       ===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	WRITE_ONCE(P, &B);
			      Q = READ_ONCE_OLD(P);
			      D = *Q;

[!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernel, which
doesn't imply an address-dependency barrier.

There's a clear address dependency here, and it would seem that by the end
of the sequence, Q must be either &A or &B, and that:

	(Q == &A) implies (D == 1)
	(Q == &B) implies (D == 4)

But!  CPU 2's perception of P may be updated _before_ its perception of B,
thus leading to the following situation:

	(Q == &B) and (D == 2) ????

To deal with this, READ_ONCE() provides an implicit address-dependency
barrier since kernel release v4.15:

	CPU 1		      CPU 2
	===============       ===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	WRITE_ONCE(P, &B);
			      Q = READ_ONCE(P);
			      <implicit address-dependency barrier>
			      D = *Q;

This enforces the occurrence of one of the two implications, and prevents
the third possibility from arising.
[!] Note that this extremely counterintuitive situation arises most easily
on machines with split caches, so that, for example, one cache bank processes
even-numbered cache lines and the other bank processes odd-numbered cache
lines.  The pointer P might be stored in an odd-numbered cache line, and the
variable B might be stored in an even-numbered cache line.  Then, if the
even-numbered bank of the reading CPU's cache is extremely busy while the
odd-numbered bank is idle, one can see the new value of the pointer P (&B),
but the old value of the variable B (2).
An address-dependency barrier is not required to order dependent writes
because the CPUs that the Linux kernel supports don't do writes until they
are certain (1) that the write will actually happen, (2) of the location of
the write, and (3) of the value to be written.  Consider this sequence of
events:

	CPU 1		      CPU 2
	===============       ===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	WRITE_ONCE(P, &B);
			      Q = READ_ONCE_OLD(P);
			      WRITE_ONCE(*Q, 5);

Therefore, no address-dependency barrier is required to order the read into
Q with the store into *Q.  In other words, this outcome is prohibited,
even without an implicit address-dependency barrier of modern READ_ONCE():

	(Q == &B) && (B == 4)

Please note that this pattern should be rare.  After all, the whole point
of dependency ordering is to -prevent- writes to the data structure, along
with the expensive cache misses associated with those writes.  This pattern
can be used to record rare error conditions and the like, and the CPUs'
naturally occurring ordering prevents such records from being lost.

The address-dependency barrier is very important to the RCU system,
for example.  See rcu_assign_pointer() and rcu_dereference() in
include/linux/rcupdate.h.  This permits the current target of an RCU'd
pointer to be replaced with a new modified target, without the replacement
target appearing to be incompletely initialised.
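As a sketch of how this is used in practice, consider the following
simplified RCU-style pointer publication (the structure, variable and helper
names are invented for this example; updater-side locking and the reader's
rcu_read_lock()/rcu_read_unlock() are omitted for brevity):

	struct foo {
		int a;
	};
	struct foo __rcu *gp;

	void publish(struct foo *p)
	{
		p->a = 1;			/* initialise first */
		rcu_assign_pointer(gp, p);	/* then publish; acts as a
						 * release for the store
						 * to p->a */
	}

	void reader(void)
	{
		struct foo *p = rcu_dereference(gp);	/* marked load with
							 * implicit address-
							 * dependency barrier */
		if (p)
			do_something_with(p->a);	/* sees p->a == 1 */
	}

The address dependency from the load of gp to the load of p->a is what lets
the reader see a fully-initialised structure without any explicit barrier on
the read side.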
CONTROL DEPENDENCIES
--------------------

Control dependencies can be a bit tricky because current compilers do not
understand them.  The purpose of this section is to help you prevent the
compiler's ignorance from breaking your code.

A load-load control dependency requires a full read memory barrier, not
simply an (implicit) address-dependency barrier to make it work correctly.
Consider the following bit of code:

	q = READ_ONCE(a);
	<implicit address-dependency barrier>
	if (q) {
		/* BUG: No address dependency!!! */
		p = READ_ONCE(b);
	}

This will not have the desired effect because there is no actual address
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from b as having happened before the load from a.  In such a case
what's actually required is:

	q = READ_ONCE(a);
	if (q) {
		<read barrier>
		p = READ_ONCE(b);
	}

However, stores are not speculated.  This means that ordering -is- provided
for load-store control dependencies, as in the following example:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
	}

Control dependencies pair normally with other types of barriers.  That said,
please note that neither READ_ONCE() nor WRITE_ONCE() are optional!  Without
the READ_ONCE(), the compiler might combine the load from 'a' with other
loads from 'a'.  Without the WRITE_ONCE(), the compiler might combine the
store to 'b' with other stores to 'b'.  Either can result in highly
counterintuitive effects on ordering.

Worse yet, if the compiler is able to prove (say) that the value of
variable 'a' is always non-zero, it would be well within its rights
to optimize the original example by eliminating the "if" statement
as follows:

	q = a;
	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */

It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:

	q = READ_ONCE(a);
	if (q) {
		barrier();
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		barrier();
		WRITE_ONCE(b, 1);
		do_something_else();
	}

Unfortunately, current compilers will transform this as follows at high
optimization levels:

	q = READ_ONCE(a);
	barrier();
	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
	if (q) {
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
		do_something();
	} else {
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
		do_something_else();
	}

Now there is no conditional between the load from 'a' and the store to
'b', which means that the CPU is within its rights to reorder them: the
conditional is absolutely required, and must be present in the assembly
code even after all compiler optimizations have been applied.
In contrast, without explicit memory barriers, two-legged-if control
ordering is guaranteed only when the stores differ, for example:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

The initial READ_ONCE() is still required to prevent the compiler from
proving the value of 'a'.

In addition, you need to be careful what you do with the local variable 'q',
otherwise the compiler might be able to guess the value and again remove the
needed conditional.  For example, if the conditional is (q % MAX) and MAX is
defined to be 1, then the compiler knows that (q % MAX) is always zero, and
is within its rights to drop the conditional entirely.  Given that
transformation, the CPU is not required to respect the ordering
between the load from variable 'a' and the store to variable 'b'.  It is
tempting to add a barrier(), but this does not help: the conditional is
gone, and the barrier won't bring it back.  Therefore, if you are relying
on this ordering, you should make sure that MAX is greater than one,
perhaps as follows:

	q = READ_ONCE(a);
	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
	if (q % MAX) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}
You must also be careful not to rely too much on boolean short-circuit
evaluation.  Consider this example:

	q = READ_ONCE(a);
	if (q || 1 > 0)
		WRITE_ONCE(b, 1);

Because the first condition cannot fault and the second condition is always
true, the compiler can transform this example as follows, defeating the
control dependency:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 1);

This example underscores the need to ensure that the compiler cannot
out-guess your code.  More generally, although READ_ONCE() does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the results.
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.  In particular, they do
not necessarily apply to code following the if-statement:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
	} else {
		WRITE_ONCE(b, 2);
	}
	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */

It is tempting to argue that there in fact is ordering because the compiler
cannot reorder volatile accesses and also cannot reorder the writes to 'b'
with the condition.  Unfortunately for this line of reasoning, the compiler
might compile the two writes to 'b' as conditional-move instructions, as in
this fanciful pseudo-assembly language:

	ld r1,a
	cmp r1,$0
	cmov,ne r4,$1
	cmov,eq r4,$2
	st r4,b
	st $1,c

A weakly ordered CPU would have no dependency of any sort between the load
from 'a' and the store to 'c'.  The control dependencies would extend only
to the pair of cmov instructions and the store depending on them.  In short,
control dependencies apply only to the stores in the then-clause and
else-clause of the if-statement in question (including functions invoked by
those two clauses), not to code following that if-statement.
In summary:

  (*) Control dependencies can order prior loads against later stores.
      However, they do -not- guarantee any other sort of ordering:
      not prior loads against later loads, nor prior stores against
      later anything.  If you need these other forms of ordering, use
      smp_rmb(), smp_wmb(), or, in the case of prior stores and later
      loads, smp_mb().

  (*) If both legs of the "if" statement begin with identical stores to
      the same variable, then those stores must be ordered, either by
      preceding both of them with smp_mb() or by using smp_store_release()
      to carry out the stores.  Please note that it is -not- sufficient
      to use barrier() at the beginning of each leg of the "if" statement
      because, as shown by the example above, optimizing compilers can
      destroy the control dependency while respecting the letter of the
      barrier() law.

  (*) Control dependencies require at least one run-time conditional
      between the prior load and the subsequent store, and this
      conditional must involve the prior load.  If the compiler is able
      to optimize the conditional away, it will have also optimized
      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
      can help to preserve the needed conditional.

  (*) Control dependencies apply only to the then-clause and else-clause
      of the if-statement containing the control dependency, including
      any functions that these two clauses call.  Control dependencies
      do -not- apply to code following the if-statement containing the
      control dependency.

  (*) Control dependencies pair normally with other types of barriers.

  (*) Control dependencies do -not- provide multicopy atomicity.  If you
      need all the CPUs to see a given store at the same time, use smp_mb().

  (*) Compilers do not understand control dependencies.  It is therefore
      your job to ensure that they do not break your code.
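To make the load-store case concrete, here is a sketch of a correctly-formed
control dependency (the variable names are invented for this example):

	int a, b;

	void writer_side(void)
	{
		int q = READ_ONCE(a);	/* marked load: the conditional
					 * below must involve this value */

		if (q)
			WRITE_ONCE(b, 1);	/* ordered after the load
						 * from 'a' by the
						 * conditional */
	}

Because stores are not speculated, the CPU cannot make the store to 'b'
visible before the load from 'a' resolves; the READ_ONCE()/WRITE_ONCE()
wrappers keep the compiler from optimizing the conditional (and thus the
ordering) away.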
SMP BARRIER PAIRING
-------------------

When dealing with CPU-CPU interactions, certain types of memory barrier
should always be paired.  A lack of appropriate pairing is almost certainly
an error.

General barriers pair with each other, though they also pair with most
other types of barriers, albeit without multicopy atomicity.  An acquire
barrier pairs with a release barrier, but both may also pair with other
barriers, including of course general barriers.  A write barrier pairs
with an address-dependency barrier, a control dependency, an acquire barrier,
a release barrier, a read barrier, or a general barrier.  Similarly a
read barrier, control dependency, or an address-dependency barrier pairs
with a write barrier, an acquire barrier, a release barrier, or a
general barrier:
	CPU 1		      CPU 2
	===============       ===============
	WRITE_ONCE(a, 1);
	<write barrier>
	WRITE_ONCE(b, 2);     x = READ_ONCE(b);
			      <read barrier>
			      y = READ_ONCE(a);

Or:

	CPU 1		      CPU 2
	===============       ===============
	a = 1;
	<write barrier>
	WRITE_ONCE(b, &a);    x = READ_ONCE(b);
			      <implicit address-dependency barrier>
			      y = *x;

Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.

[!] Note that the stores before the write barrier would normally be expected
to match the loads after the read barrier or the address-dependency barrier,
and vice versa:

	CPU 1                               CPU 2
	===================                 ===================
	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
	WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
	<write barrier>            \        <read barrier>
	WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
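For illustration, the first pairing above corresponds to the following C
sketch (the names are invented for this example):

	int a, b;

	void cpu1(void)
	{
		WRITE_ONCE(a, 1);
		smp_wmb();		/* write barrier ... */
		WRITE_ONCE(b, 2);
	}

	void cpu2(void)
	{
		int x, y;

		x = READ_ONCE(b);
		smp_rmb();		/* ... pairs with this read barrier */
		y = READ_ONCE(a);

		BUG_ON(x == 2 && y == 0);	/* forbidden outcome */
	}

If cpu2() observes b == 2, the paired barriers guarantee that it also
observes a == 1; with either barrier missing, the assertion can fire.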
EXAMPLES OF MEMORY BARRIER SEQUENCES
------------------------------------

Firstly, write barriers act as partial orderings on store operations.
Consider the following sequence of events:

	CPU 1
	=======================
	STORE A = 1
	STORE B = 2
	STORE C = 3
	<write barrier>
	STORE D = 4
	STORE E = 5

This sequence of events is committed to the memory coherence system in an
order that the rest of the system might perceive as the unordered set of
{ STORE A, STORE B, STORE C } all occurring before the unordered set of
{ STORE D, STORE E }:

	+-------+       :      :
	|       |       +------+
	|       |------>| C=3  |     }     /\
	|       |  :    +------+     }-----  \  -----> Events perceptible to
	|       |  :    | A=1  |     }        \/       the rest of the system
	|       |  :    +------+     }
	| CPU 1 |  :    | B=2  |     }
	|       |       +------+     }
	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
	|       |       +------+     }        requires all stores prior to the
	|       |  :    | E=5  |     }        barrier to be committed before
	|       |  :    +------+     }        further stores may take place
	|       |------>| D=4  |     }
	|       |       +------+
	+-------+       :      :
Secondly, address-dependency barriers act as partial orderings on
address-dependent loads.  Consider the following sequence of events:

	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				LOAD *C (reads B)
Without intervention, CPU 2 may perceive the events on CPU 1 in some
effectively random order, despite the write barrier issued by CPU 1:

	+-------+       :      :                :       :
	|       |       +------+                +-------+  | Sequence of update
	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
	|       |  :    +------+     \          +-------+  | CPU 2
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	    Apparently incorrect --->  |        | B->7  |------>|       |
	    perception of B (!)        |        +-------+       |       |
	                               |        :       :       |       |
	    The load of X holds --->    \       | X->9  |------>|       |
	    up the maintenance           \      +-------+       |       |
	    of coherence of B             ----->| B->2  |       +-------+
	                                        +-------+
	                                        :       :
In the above example, CPU 2 perceives that B is 7, despite the load of *C
(which would be B) coming after the LOAD of C.

If, however, an address-dependency barrier were to be placed between the
load of C and the load of *C (i.e. B) on CPU 2:
	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				<address-dependency barrier>
				LOAD *C (reads B)
then the following will occur:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| B=2  |-----       --->| Y->8  |
	|       |  :    +------+     \          +-------+
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	                               |        | X->9  |------>|       |
	                               |        +-------+       |       |
	  Makes sure all effects --->   \   aaaaaaaaaaaaaaaaa   |       |
	  prior to the store of C        \      +-------+       |       |
	  are perceptible to              ----->| B->2  |------>|       |
	  subsequent loads                      +-------+       |       |
	                                        :       :       +-------+
And thirdly, a read barrier acts as a partial order on loads.  Consider the
following sequence of events:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				LOAD A
Without intervention, CPU 2 may then choose to perceive the events on CPU 1
in some effectively random order, despite the write barrier issued by CPU 1:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       | A->0  |------>|       |
	                                |       +-------+       |       |
	                                |       :       :       +-------+
	                                 \      :       :
	                                  \     +-------+
	                                   ---->| A->1  |
	                                        +-------+
	                                        :       :
If, however, a read barrier were to be placed between the load of B and the
load of A on CPU 2:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				<read barrier>
				LOAD A
then the partial ordering imposed by CPU 1 will be perceived correctly by
CPU 2:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>|       |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+
To illustrate this more completely, consider what could happen if the code
contained a load of A either side of the read barrier:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				LOAD A [first load of A]
				<read barrier>
				LOAD A [second load of A]
Even though the two loads of A both occur after the load of B, they may both
come up with different values:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       +-------+       |       |
	                                |       | A->0  |------>| 1st   |
	                                |       +-------+       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+
But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
before the read barrier completes anyway:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                 \      :       :       |       |
	                                  \     +-------+       |       |
	                                   ---->| A->1  |------>| 1st   |
	                                        +-------+       |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	                                        | A->1  |------>| 2nd   |
	                                        +-------+       |       |
	                                        :       :       +-------+
The guarantee is that the second load will always come up with A == 1 if the
load of B came up with B == 2.  No such guarantee exists for the first load
of A; that may come up with either A == 0 or A == 1.
READ MEMORY BARRIERS VS LOAD SPECULATION
----------------------------------------

Many CPUs speculate with loads: that is, they see that they will need to
load an item from memory, and they find a time where they're not using the
bus for any other loads, and so do the load in advance - even though they
haven't actually got to that point in the instruction execution flow yet.
This permits the actual load instruction to potentially complete immediately
because the CPU already has the value to hand.

It may turn out that the CPU didn't actually need the value - perhaps
because a branch circumvented the load - in which case it can discard the
value or just cache it for later use.
Consider:

	CPU 1			CPU 2
	=======================	=======================
				LOAD B
				DIVIDE		} Divide instructions generally
				DIVIDE		} take a long time to perform
				LOAD A

Which might appear as this:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the             +-------+    ~   |       |
	LOAD of A                               :       :    ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :    ~   |       |
	Once the divisions are complete -->     :       :    ~-->|       |
	the CPU can then perform the            :       :       |       |
	LOAD with immediate effect              :       :       +-------+
Placing a read barrier or an address-dependency barrier just before the
second load:

	CPU 1			CPU 2
	=======================	=======================
				LOAD B
				DIVIDE
				DIVIDE
				<read barrier>
				LOAD A

will force any speculatively obtained value to be reconsidered to an extent
dependent on the type of barrier used.  If there was no change made to the
speculated memory location, then the speculated value will just be used:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the             +-------+    ~   |       |
	LOAD of A                               :       :    ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :    ~   |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        :       :    ~   |       |
	Once the divisions are complete -->     :       :    ~-->|       |
	the CPU can then perform the            :       :       |       |
	LOAD with immediate effect              :       :       +-------+
but if there was an update or an invalidation from another CPU pending, then
the speculation will be cancelled and the value reloaded:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the             +-------+    ~   |       |
	LOAD of A                               :       :    ~   |       |
	                                        :       :DIVIDE |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	The speculation is discarded --->   --->| A->1  |------>|       |
	and an updated value is                 +-------+       |       |
	retrieved                               :       :       +-------+
MULTICOPY ATOMICITY
-------------------

Multicopy atomicity is a deeply intuitive notion about ordering that is not
always provided by real computer systems, namely that a given store becomes
visible at the same time to all CPUs, or, alternatively, that all CPUs agree
on the order in which all stores become visible.  However, support of full
multicopy atomicity would rule out valuable hardware optimizations, so a
weaker form called ``other multicopy atomicity'' instead guarantees only that
a given store becomes visible at the same time to all -other- CPUs.  The
remainder of this document discusses this weaker form, but for brevity will
call it simply ``multicopy atomicity''.

The following example demonstrates multicopy atomicity:

	CPU 1			CPU 2			CPU 3
	=======================	=======================	=======================
		{ X = 0, Y = 0 }
	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
				<general barrier>	<read barrier>
				STORE Y=r1		LOAD X
Suppose that CPU 2's load from X returns 1, which it then stores to Y,
and CPU 3's load from Y returns 1.  This indicates that CPU 1's store
to X precedes CPU 2's load from X and that CPU 2's store to Y precedes
CPU 3's load from Y.  In addition, the memory barriers guarantee that
CPU 2 executes its load before its store, and CPU 3 loads from Y before
it loads from X.  The question is then "Can CPU 3's load from X return 0?"

Because CPU 3's load from X in some sense comes after CPU 2's load, it
is natural to expect that CPU 3's load from X must therefore return 1.
This expectation follows from multicopy atomicity: if a load executing
on CPU B follows a load from the same variable executing on CPU A (and
CPU A did not originally store the value which it read), then on
multicopy-atomic systems, CPU B's load must return either the same value
that CPU A's load did or some later value.  However, the Linux kernel
does not require systems to be multicopy atomic.

The use of a general memory barrier in the example above compensates
for any lack of multicopy atomicity.  In the example, if CPU 2's load
from X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load
from X must indeed also return 1.

However, dependencies, read barriers, and write barriers are not always
able to compensate for non-multicopy atomicity.  For example, suppose
that CPU 2's general barrier is removed from the above example, leaving
only the data dependency shown below:
1414 able to compensate for non-multicopy atomicity. For example, suppose
1421 STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1)
1423 STORE Y=r1 LOAD X (reads 0)
1425 This substitution allows non-multicopy atomicity to run rampant: in
1426 this example, it is perfectly legal for CPU 2's load from X to return 1,
1427 CPU 3's load from Y to return 1, and its load from X to return 0.
1429 The key point is that although CPU 2's data dependency orders its load
1431 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
General barriers can compensate not only for non-multicopy atomicity,
but can also generate additional ordering that can ensure that -all-
CPUs will perceive the same order of -all- operations.  In contrast, a
chain of release-acquire pairs does not provide this additional ordering,
which means that only those CPUs on the chain are guaranteed to agree
on the combined order of the accesses.  For example, switching to C code
in deference to the ghost of Herman Hollerith:

	int u, v, x, y, z;

	void cpu0(void)
	{
		r0 = smp_load_acquire(&x);
		WRITE_ONCE(u, 1);
		smp_store_release(&y, 1);
	}

	void cpu1(void)
	{
		r1 = smp_load_acquire(&y);
		r4 = READ_ONCE(v);
		r5 = READ_ONCE(u);
		smp_store_release(&z, 1);
	}

	void cpu2(void)
	{
		r2 = smp_load_acquire(&z);
		smp_store_release(&x, 1);
	}

	void cpu3(void)
	{
		WRITE_ONCE(v, 1);
		smp_mb();
		r3 = READ_ONCE(u);
	}

Because cpu0(), cpu1(), and cpu2() participate in a chain of
smp_store_release()/smp_load_acquire() pairs, the following outcome
is prohibited:

	r0 == 1 && r1 == 1 && r2 == 1

Furthermore, because of the release-acquire relationship between cpu0()
and cpu1(), cpu1() must see cpu0()'s writes, so that the following
outcome is prohibited:

	r1 == 1 && r5 == 0

However, the ordering provided by a release-acquire chain is local
to the CPUs participating in that chain and does not apply to cpu3().
Therefore, the following outcome is possible:

	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0

Although cpu0(), cpu1(), and cpu2() will see their respective reads and
writes in order, CPUs not involved in the release-acquire chain might
well disagree on the order.  This disagreement stems from the fact that
the weak memory-barrier instructions used to implement smp_load_acquire()
and smp_store_release() are not required to order prior stores against
subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
store to u as happening -after- cpu1()'s load from v, even though
both cpu0() and cpu1() agree that these two operations occurred in the
intended order.

However, please keep in mind that smp_load_acquire() is not magic.
In particular, it simply reads from its argument with ordering.  It does
-not- ensure that any particular value will be read.  Therefore, the
following outcome is possible:

	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0

Note that this outcome can happen even on a mythical sequentially
consistent system where nothing is ever reordered.
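To see why a general barrier is special here, consider this store-buffering
sketch (the names are invented for this example, not taken from this
document's litmus tests):

	int x, y;
	int r1, r2;

	void cpu_a(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();		/* orders the store to x before the
					 * load from y */
		r1 = READ_ONCE(y);
	}

	void cpu_b(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();		/* pairs with the smp_mb() above */
		r2 = READ_ONCE(x);
	}

With both smp_mb() calls in place, the outcome r1 == 0 && r2 == 0 is
forbidden; replace either smp_mb() with a release, an acquire, or a weaker
barrier and that outcome becomes possible, because only a general barrier
orders a prior store against a subsequent load.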
========================
EXPLICIT KERNEL BARRIERS
========================

The Linux kernel has a variety of different barriers that act at different
levels:

  (*) Compiler barrier.
  (*) CPU memory barriers.


COMPILER BARRIER
----------------

The Linux kernel has an explicit compiler barrier function that prevents the
compiler from moving the memory accesses either side of it to the other side:

	barrier();

This is a general barrier -- there are no read-read or write-write
variants of barrier().  However, READ_ONCE() and WRITE_ONCE() can be
thought of as weak forms of barrier() that affect only the specific
accesses flagged by the READ_ONCE() or WRITE_ONCE().

The barrier() function has the following effects:

 (*) Prevents the compiler from reordering accesses following the
     barrier() to precede any accesses preceding the barrier().
     One example use for this property is to ease communication between
     interrupt-handler code and the code that was interrupted.

 (*) Within a loop, forces the compiler to load the variables used
     in that loop's conditional on each pass through that loop.
The READ_ONCE() and WRITE_ONCE() functions can prevent any number of
optimizations that, while perfectly safe in single-threaded code, can
be fatal in concurrent code.  Here are some examples of these sorts of
optimizations:

 (*) The compiler is within its rights to merge successive loads from
     the same variable.  Such merging can cause the compiler to "optimize"
     the following code:

	while (tmp = a)
		do_something_with(tmp);

     into the following code, which, although in some sense legitimate
     for single-threaded code, is almost certainly not what the developer
     intended:

	if (tmp = a)
		for (;;)
			do_something_with(tmp);

     Use READ_ONCE() to prevent the compiler from doing this to you:

	while (tmp = READ_ONCE(a))
		do_something_with(tmp);

 (*) The compiler is within its rights to reload a variable, for example,
     in cases where high register pressure prevents it from keeping all
     data of interest in registers.  Such reloading is perfectly safe in
     single-threaded code, so you need to tell the compiler about cases
     where it is not safe, again by using READ_ONCE().

 (*) The compiler is within its rights to omit a load entirely if it knows
     what the value will be.  For example, if the compiler can prove that
     the value of variable 'a' is always zero, it can optimize this code:

	while (tmp = a)
		do_something_with(tmp);

     into this:

	do { } while (0);

     This transformation is a win for single-threaded code because it
     gets rid of a load and a branch.  The problem is that the compiler
     will carry out its proof assuming that the current CPU is the only
     one updating variable 'a'.  If variable 'a' is shared, then the
     compiler's proof will be erroneous.  Use READ_ONCE() to tell the
     compiler that it doesn't know as much as it thinks it does.  But
     please note that the compiler is also closely watching what you do
     with the value after the READ_ONCE(); it can still optimize
     the code into near-nonexistence.  (It will still load from the
     variable, though.)

 (*) The compiler is within its rights to reorder memory accesses unless
     you tell it not to.  Such reordering can break communication
     between process-level code and an interrupt handler: if the
     process-level code sets up data and then sets a flag that the
     interrupt handler tests, the compiler must not be allowed to move
     the flag update ahead of the data updates.  READ_ONCE() and
     WRITE_ONCE() prevent this sort of reordering of the marked accesses.

 (*) The compiler is within its rights to invent stores to a variable,
     as in the following example:

	if (a)
		b = a;
	else
		b = 42;

     The compiler might save a branch by optimizing this as follows,
     which is a win for single-threaded code:

	b = 42;
	if (a)
		b = a;

     In single-threaded code, this is not only safe, but also saves
     a branch.  Unfortunately, in concurrent code, this optimization
     could cause some other CPU to see a spurious value of 42 -- even
     if variable 'a' was never zero -- when loading variable 'b'.
     Use WRITE_ONCE() to prevent this:

	if (a)
		WRITE_ONCE(b, a);
	else
		WRITE_ONCE(b, 42);
 (*) The compiler is also within its rights to invent loads to a variable.
     Invented loads are usually less damaging than invented stores, but
     they can result in cache-line bouncing and thus in poor performance
     and scalability.  Use READ_ONCE() to prevent invented loads.

 (*) For aligned memory locations whose size allows them to be accessed
     with a single memory-reference instruction, prevents "load tearing"
     and "store tearing," in which a single large access is replaced by
     multiple smaller accesses.  For example, given an architecture having
     16-bit store instructions with 7-bit immediate fields, the compiler
     might be tempted to use two 16-bit store-immediate instructions to
     implement the following 32-bit store:

	p = 0x00010002;

     Please note that GCC really does use this sort of optimization,
     which is not surprising given that it would likely take more
     than two instructions to build the constant and then store it.
     This optimization can therefore be a win in single-threaded code.
     Use of WRITE_ONCE() prevents store tearing here:

	WRITE_ONCE(p, 0x00010002);

     Use of packed structures can also result in load and store tearing,
     as in this example:

	struct __attribute__((__packed__)) foo {
		short a;
		int b;
		short c;
	};
	struct foo foo1, foo2;
	...

	foo2.a = foo1.a;
	foo2.b = foo1.b;
	foo2.c = foo1.c;

     Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no
     volatile markings, the compiler would be well within its rights to
     implement these three assignment statements as a pair of 32-bit
     loads followed by a pair of 32-bit stores.  This would result in
     load tearing on 'foo1.b' and store tearing on 'foo2.b'.  READ_ONCE()
     and WRITE_ONCE() again prevent tearing in this example.
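Pulling several of the above hazards together, here is a sketch of a simple
flag-based wait loop written with the required markings (the names are
invented for this example):

	int ready;
	int payload;

	void wait_for_payload(void)
	{
		/* READ_ONCE() stops the compiler from hoisting the load
		 * out of the loop or inventing extra loads. */
		while (!READ_ONCE(ready))
			cpu_relax();

		smp_rmb();	/* pairs with smp_wmb() in the producer */
		do_something_with(payload);
	}

	void provide_payload(void)
	{
		payload = 42;
		smp_wmb();		/* order payload before the flag... */
		WRITE_ONCE(ready, 1);	/* ...and prevent store tearing */
	}

The READ_ONCE()/WRITE_ONCE() wrappers constrain only the compiler; the
smp_wmb()/smp_rmb() pair is still needed to constrain the CPU.
(cpu_relax() is the kernel's busy-wait hint; do_something_with() is a
stand-in for the consumer's real work.)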
CPU MEMORY BARRIERS
-------------------

The Linux kernel has the following basic CPU memory barriers:

	TYPE			MANDATORY	SMP CONDITIONAL
	=======================	===============	===============
	GENERAL			mb()		smp_mb()
	WRITE			wmb()		smp_wmb()
	READ			rmb()		smp_rmb()
	ADDRESS DEPENDENCY			READ_ONCE()

All memory barriers except the address-dependency barriers imply a compiler
barrier.  Address dependencies do not impose any additional compiler
ordering.

Aside: In the case of address dependencies, the compiler would be expected
to issue the loads in the correct order (eg. `a[b]` would have to load
the value of b before loading a[b]), however there is no guarantee in
the C specification that the compiler may not speculate the value of b
(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
tmp = a[b]; ).  There is also the problem of a compiler reloading b after
having loaded a[b], thus having a newer copy of b than a[b].  A consensus
has not yet been reached about these problems, however the READ_ONCE()
macro is a good place to start looking.
SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
systems because it is assumed that a CPU will appear to be self-consistent,
and will order overlapping accesses correctly with respect to itself.
However, see the subsection on "Virtual Machine Guests" below.

[!] Note that SMP memory barriers _must_ be used to control the ordering of
references to shared memory on SMP systems, though the use of locking
instead is sufficient.

Mandatory barriers should not be used to control SMP effects, since
mandatory barriers impose unnecessary overhead on both SMP and UP systems.
They may, however, be used to control MMIO effects on accesses through
relaxed memory I/O windows.  These barriers are required even on non-SMP
systems as they affect the order in which memory operations appear to a
device by prohibiting both the compiler and the CPU from reordering them.
There are also some more advanced barrier functions, for example
smp_mb__before_atomic() and smp_mb__after_atomic().  These are for use with
atomic RMW functions that do not imply a memory barrier, but where the code
needs one.  As an example, consider a piece of code that marks an object as
being dead and then decrements the object's reference count:

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);

This makes sure that the death mark on the object is perceived to be set
*before* the reference counter is decremented.
The dma_wmb() and dma_rmb() barriers are for use with consistent memory to
guarantee the ordering of writes or reads of shared memory accessible to
both the CPU and a DMA capable device.  See Documentation/core-api/dma-api.rst
for more information about consistent memory.

For example, consider a device driver that shares memory with a device and
uses a descriptor status value to indicate if the descriptor belongs to the
device or the CPU, and a doorbell to notify it when new descriptors are
available:

	if (desc->status != DEVICE_OWN) {
		/* do not read data until we own descriptor */
		dma_rmb();

		/* read/modify data */
		read_data = desc->data;
		desc->data = write_data;

		/* flush modifications before status update */
		dma_wmb();

		/* assign ownership */
		desc->status = DEVICE_OWN;

		/* Make descriptor status visible to the device followed by
		 * notify device of new descriptor
		 */
		writel(DESC_NOTIFY, doorbell);
	}
The pmem_wmb() barrier is for use with persistent memory.  For example,
after a non-temporal write to a pmem region, we use pmem_wmb() to ensure
that stores have reached a platform durability domain.  This ensures that
stores have updated persistent storage before any data access or data
transfer caused by subsequent instructions is initiated.  This is in
addition to the ordering done by wmb().

For loads from persistent memory, existing read memory barriers are
sufficient to ensure read ordering.

Finally, io_stop_wc() is for memory accesses with write-combining attributes
(e.g. those returned by ioremap_wc()), for which the CPU may wait for prior
accesses to be merged with subsequent ones.  io_stop_wc() prevents the
merging of write-combining memory accesses before this macro with those
after it when such a wait has performance implications.
===============================
IMPLICIT KERNEL MEMORY BARRIERS
===============================

Some of the other functions in the linux kernel imply memory barriers,
amongst which are locking and scheduling functions.


LOCK ACQUISITION FUNCTIONS
--------------------------

The Linux kernel has a number of locking constructs: spin locks, R/W spin
locks, mutexes, semaphores, and R/W semaphores.

In all cases there are variants on "ACQUIRE" operations and "RELEASE"
operations for each construct.  These operations all imply certain barriers:

 (1) ACQUIRE operation implication:

     Memory operations issued after the ACQUIRE will be completed after the
     ACQUIRE operation has completed.

     Memory operations issued before the ACQUIRE may be completed after
     the ACQUIRE operation has completed.
 (2) RELEASE operation implication:

     Memory operations issued before the RELEASE will be completed before
     the RELEASE operation has completed.

     Memory operations issued after the RELEASE may be completed before the
     RELEASE operation has completed.

 (3) ACQUIRE vs ACQUIRE implication:

     All ACQUIRE operations issued before another ACQUIRE operation will be
     completed before that ACQUIRE operation.

 (4) ACQUIRE vs RELEASE implication:

     All ACQUIRE operations issued before a RELEASE operation will be
     completed before the RELEASE operation.

 (5) Failed conditional ACQUIRE implication:

     Certain locking variants of the ACQUIRE operation may fail, either due
     to being unable to get the lock immediately, or due to receiving an
     unblocked signal while asleep waiting for the lock to become available.
     Failed locks do not imply any sort of barrier.
[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
one-way barriers is that the effects of instructions outside of a critical
section may seep into the inside of the critical section.

An ACQUIRE followed by a RELEASE may not be assumed to be a full memory
barrier because it is possible for an access preceding the ACQUIRE to happen
after the ACQUIRE, and an access following the RELEASE to happen before the
RELEASE, and the two accesses can themselves then cross:

	*A = a;
	ACQUIRE M
	RELEASE M
	*B = b;

may occur as:

	ACQUIRE M, STORE *B, STORE *A, RELEASE M

When the ACQUIRE and RELEASE are a lock acquisition and release,
respectively, this same reordering can occur if the lock's ACQUIRE and
RELEASE are to the same lock variable, but only from the perspective of
another CPU not holding that lock.  In short, an ACQUIRE followed by a
RELEASE may -not- be assumed to be a full memory barrier.
Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
imply a full memory barrier.  Again, the CPU's execution of the critical
sections corresponding to the RELEASE and the ACQUIRE can cross, so that:

	*A = a;
	RELEASE M
	ACQUIRE N
	*B = b;

could occur as:

	ACQUIRE N, STORE *B, STORE *A, RELEASE M

It might appear that this reordering could introduce a deadlock.  However,
this cannot happen because if such a deadlock threatened, the RELEASE would
simply complete, thereby avoiding the deadlock.

	Why does this work?

	One key point is that we are only talking about the CPU doing
	the reordering, not the compiler.  If the compiler (or, for
	that matter, the developer) switched the operations, deadlock
	-could- occur.

	But suppose the CPU reordered the operations.  In this case, the
	unlock precedes the lock in the assembly code, and the CPU simply
	elected to try executing the later lock operation first.  If there
	is a deadlock, this lock operation will simply spin (or try to
	sleep, but more on that later).  The CPU will eventually execute
	the lock operation (which precedes the unlock operation in the
	assembly code), which will unravel the potential deadlock.

	And if the lock is a sleeplock, the code will try to enter the
	scheduler, where it will eventually encounter a memory barrier,
	which will force the earlier unlock operation to complete, again
	unraveling the deadlock.  There might be a sleep-unlock race, but
	the locking primitive needs to resolve such races properly in
	any case.
Locks and semaphores may not provide any guarantee of ordering on UP compiled
systems, and so cannot be counted on in such a situation to actually achieve
anything at all - especially with respect to I/O accesses - unless combined
with interrupt disabling operations.

See also the section on "Inter-CPU acquiring barrier effects".
As an example, consider the following:

	*A = a;
	*B = b;
	ACQUIRE
	*C = c;
	*D = d;
	RELEASE
	*E = e;
	*F = f;

The following sequence of events is acceptable:

	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE

	[+] Note that {*F,*A} indicates a combined access.

But none of the following are:

	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
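Expressed with a real lock, the acceptable reordering above means that code
like the following sketch (the lock and variable names are invented for this
example) may have the accesses to A and F pulled inside the critical section,
but never has *C or *D pushed outside of it:

	DEFINE_SPINLOCK(mylock);
	int A, B, C, D, E, F;

	void example(void)
	{
		A = 1;
		B = 2;
		spin_lock(&mylock);	/* ACQUIRE: one-way barrier; the
					 * stores above may seep in */
		C = 3;
		D = 4;
		spin_unlock(&mylock);	/* RELEASE: the stores below may
					 * also seep in */
		E = 5;
		F = 6;
	}

This is exactly why ACQUIRE and RELEASE are described as one-way permeable:
they keep the critical section's accesses inside, but do not keep outside
accesses out.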
INTERRUPT DISABLING FUNCTIONS
-----------------------------

Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
(RELEASE equivalent) will act as compiler barriers only.  So if memory or
I/O barriers are required in such a situation, they must be provided from
some other means.
SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting
for the event and the global data used to indicate the event.  To make sure
that these appear to happen in the right order, the primitives to begin the
process of going to sleep, and the primitives to initiate a wake up imply
certain barriers.

Firstly, the sleeper normally follows something like this sequence of
events:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (event_indicated)
			break;
		schedule();
	}

A general memory barrier is interpolated automatically by
set_current_state() after it has altered the task state:

	CPU 1
	===============================
	set_current_state();
	  smp_store_mb();
	    STORE current->state
	    <general barrier>
	LOAD event_indicated

Secondly, the waker normally follows something like this:

	event_indicated = 1;
	wake_up(&event_wait_queue);

A general memory barrier is executed by wake_up() if it wakes something up;
if it doesn't, a memory barrier may or may not be executed and you must not
rely on it.  The barrier sits between the STORE to indicate the event and
the STORE to set TASK_RUNNING:

	CPU 1 (Sleeper)			CPU 2 (Waker)
	===============================	===============================
	set_current_state();		STORE event_indicated
	  smp_store_mb();		wake_up();
	    STORE current->state	  ...
	    <general barrier>		  <general barrier>
	LOAD event_indicated		if ((LOAD task->state) & TASK_NORMAL)
					  STORE task->state

where "task" is the thread being woken up.

To repeat, a general memory barrier is guaranteed to be executed by wake_up()
if anything is actually awakened, but otherwise there is no such guarantee.
To see this, consider the following sequence of events, where X and Y are
both initially zero:

	CPU 1				CPU 2
	===============================	===============================
	X = 1;				Y = 1;
	smp_mb();			wake_up();
	LOAD Y				LOAD X

If a wakeup does occur, one (at least) of the two loads must see 1.  If, on
the other hand, a wakeup does not occur, both loads might see 0.

[!] Note that the memory barriers implied by the sleeper and the waker do
_not_ order multiple stores before the wake-up with respect to loads of
those stored values after the sleeper has called set_current_state().
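As a sketch of the resulting idiom (the names are invented for this example),
the event flag must be written before the wake-up, and the sleeper must set
its task state before checking the flag:

	int event_indicated;
	DECLARE_WAIT_QUEUE_HEAD(event_wait_queue);

	void sleeper(void)
	{
		for (;;) {
			/* implies a general barrier after the state store */
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (READ_ONCE(event_indicated))
				break;
			schedule();
		}
		__set_current_state(TASK_RUNNING);
	}

	void waker(void)
	{
		WRITE_ONCE(event_indicated, 1);	/* store the event first */
		wake_up(&event_wait_queue);	/* implies a barrier only if
						 * it actually wakes a task */
	}

In most code the wait_event()/wake_up() family should be used instead, as it
packages this sequence (and its barriers) correctly.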
MISCELLANEOUS FUNCTIONS
-----------------------

Other functions that imply barriers include schedule() and similar, which
imply full memory barriers.
===================================
INTER-CPU ACQUIRING BARRIER EFFECTS
===================================

On SMP systems locking primitives give a more substantial form of barrier:
one that does affect memory access ordering on other CPUs, within the
context of conflict on any particular lock.
ACQUIRES VS MEMORY ACCESSES
---------------------------
Consider the following: the system has a pair of spinlocks (M) and (Q), and
three CPUs; then should the following sequence of events occur:

	CPU 1				CPU 2
	===============================	===============================
	WRITE_ONCE(*A, a);		WRITE_ONCE(*E, e);
	ACQUIRE M			ACQUIRE Q
	WRITE_ONCE(*B, b);		WRITE_ONCE(*F, f);
	WRITE_ONCE(*C, c);		WRITE_ONCE(*G, g);
	RELEASE M			RELEASE Q
	WRITE_ONCE(*D, d);		WRITE_ONCE(*H, h);

Then there is no guarantee as to what order CPU 3 will see the accesses to
*A through *H occur in, other than the constraints imposed by the separate
locks on the separate CPUs.  It might, for example, see:

	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M

But it won't see any of:

	*B, *C or *D preceding ACQUIRE M
	*A, *B or *C following RELEASE M
	*F, *G or *H preceding ACQUIRE Q
	*E, *F or *G following RELEASE Q
=================================
WHERE ARE MEMORY BARRIERS NEEDED?
=================================

Under normal operation, memory operation reordering is generally not going
to be a problem as a single-threaded linear piece of code will still appear
to work correctly, even if it's in an SMP kernel.  There are, however, four
circumstances in which reordering definitely _could_ be a problem:

 (*) Interprocessor interaction.

 (*) Atomic operations.

 (*) Accessing devices.

 (*) Interrupts.
INTERPROCESSOR INTERACTION
--------------------------

When there's a system with more than one processor, more than one CPU in the
system may be working on the same data set at the same time.  This can cause
synchronisation problems, and the usual way of dealing with them is to use
locks.  Locks, however, are quite expensive, and so it may be preferable to
operate without the use of a lock if at all possible.  In such a case
operations that affect both CPUs may have to be carefully ordered to prevent
a malfunction.

Consider, for example, the R/W semaphore slow path.  Here a waiting process
is queued on the semaphore, by virtue of it having a piece of its stack
linked to the semaphore's list of waiting processes.
To wake up a particular waiter, the up_read() or up_write() functions have
to read the next pointer from this waiter's record, read the pointer to the
waiter's task structure, clear the task pointer to tell the waiter it is
given the semaphore, call wake_up_process() on the task, and release the
reference held on the waiter's task struct.  In other words, it has to
perform this sequence of events:

	LOAD waiter->list.next;
	LOAD waiter->task;
	STORE waiter->task;
	CALL wakeup
	RELEASE task

and if any of these steps occur out of order, then the whole thing may
malfunction.

Once it has queued itself and dropped the semaphore lock, the waiter does
not get the lock again; it instead just waits for its task pointer to be
cleared before proceeding.  Since the record is on the waiter's stack, this
means that if the task pointer is cleared _before_ the next pointer in the
list is read, another CPU might start processing the waiter and might
clobber the waiter's stack before the up*() function has a chance to read
the next pointer.

Consider then what might happen to the above sequence of events:

	CPU 1				CPU 2
	===============================	===============================
					down_xxx()
					Queue waiter
					Sleep
	up_yyy()
	LOAD waiter->task;
	STORE waiter->task;
					Woken up by other event
	<preempt>
					Resume processing
					down_xxx() returns
					call foo()
					foo() clobbers *waiter
	</preempt>
	LOAD waiter->list.next;
	--- OOPS ---

This could be dealt with using the semaphore lock, but then the down_xxx()
function has to needlessly get the spinlock again after being woken up.

The way to deal with this is to insert a general SMP memory barrier:

	LOAD waiter->list.next;
	LOAD waiter->task;
	smp_mb();
	STORE waiter->task;
	CALL wakeup
	RELEASE task

In this case, the barrier makes a guarantee that all memory accesses before
the barrier will appear to happen before all the memory accesses after the
barrier with respect to the other CPUs on the system.
On a UP system - where this wouldn't be a problem - the smp_mb() is just a
compiler barrier, thus making sure the compiler emits the instructions in
the right order without actually intervening in the CPU.
ATOMIC OPERATIONS
-----------------

While they are technically interprocessor interaction considerations, atomic
operations are noted specially as some of them imply full memory barriers
and some don't, but they're very heavily relied on as a group throughout the
kernel.

See Documentation/atomic_t.txt for more information.
ACCESSING DEVICES
-----------------

Many devices can be memory mapped, and so appear to the CPU as if they're
just a set of memory locations.  To control such a device, the driver
usually has to make the right memory accesses in exactly the right order.

However, having a clever CPU or a clever compiler creates a potential
problem in that the carefully sequenced accesses in the driver code won't
reach the device in the requisite order if the CPU or the compiler thinks it
is more efficient to reorder, combine or merge accesses - something that
would cause the device to malfunction.

Inside of the Linux kernel, I/O should be done through the appropriate
accessor routines - such as inb() or writel() - which know how to make such
accesses appropriately sequential.  While this, for the most part, renders
the need for explicit use of memory barriers unnecessary, if the accessor
functions are used to refer to an I/O memory window with relaxed memory
access properties, then _mandatory_ memory barriers are required to enforce
ordering.

See Documentation/driver-api/device-io.rst for more information.
INTERRUPTS
----------

A driver may be interrupted by its own interrupt service routine, and thus
the two parts of the driver may interfere with each other's attempts to
control or access the device.

This may be alleviated - at least in part - by disabling local interrupts (a
form of locking), such that the critical operations are all contained within
the interrupt-disabled section in the driver.  While the driver's interrupt
routine is executing, the driver's core may not run on the same CPU, and its
interrupt is not permitted to happen again until the current interrupt has
been handled, thus the interrupt handler does not need to lock against that.

However, consider a driver that was talking to an ethernet card that sports
an address register and a data register.  If that driver's core talks to the
card under interrupt-disablement and then the driver's interrupt handler is
invoked:

	LOCAL IRQ DISABLE
	writew(ADDR, 3);
	writew(DATA, y);
	LOCAL IRQ ENABLE
	<interrupt>
	writew(ADDR, 4);
	q = readw(DATA);
	</interrupt>

The store to the data register might happen after the second store to the
address register if ordering rules are sufficiently relaxed:

	STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA

If ordering rules are relaxed, it must be assumed that accesses done inside
an interrupt disabled section may leak outside of it and may interleave with
accesses performed in an interrupt - and vice versa - unless implicit or
explicit barriers are used.

Normally this won't be a problem because the I/O accesses done inside such
sections will include synchronous load operations on strictly ordered I/O
registers that form implicit I/O barriers.

A similar situation may occur between an interrupt routine and two routines
running on separate CPUs that communicate with each other.  If such a case
is likely, then interrupt-disabling locks should be used to guarantee
ordering.
==========================
KERNEL I/O BARRIER EFFECTS
==========================

Interfacing with peripherals via I/O accesses is deeply architecture and
device specific.  Therefore, drivers which are inherently non-portable may
rely on specific behaviours of their target systems in order to achieve
synchronization in the most lightweight manner possible.  For drivers
intending to be portable between multiple architectures and bus
implementations, the kernel offers a series of accessor functions that
provide various degrees of ordering guarantees:

 (*) readX(), writeX():

     These provide strong ordering guarantees for accesses to a peripheral
     through an __iomem pointer mapped with the default I/O attributes.
     Note, however, that the ordering properties of __iomem pointers
     obtained with non-default attributes (e.g. those returned by
     ioremap_wc()) are specific to the underlying architecture and therefore
     the guarantees for default attributes cannot generally be relied upon
     for accesses to these types of mappings.

 (*) readX_relaxed(), writeX_relaxed():

     These are similar to readX() and writeX(), but provide weaker memory
     ordering guarantees.  Specifically, they do not guarantee ordering with
     respect to locking, normal memory accesses or delay() loops, but they
     are still guaranteed to be ordered with respect to other accesses from
     the same CPU thread to the same peripheral when operating on __iomem
     pointers mapped with the default I/O attributes.

 (*) readsX(), writesX():

     These are intended to be used with register-based, memory-mapped FIFOs
     residing on peripherals that are not capable of performing DMA.
     Consequently, they provide only the minimal relaxed ordering
     guarantees.

 (*) inX(), outX():

     The inX() and outX() accessors are intended to access legacy
     port-mapped I/O peripherals, which may require special instructions on
     some architectures (notably x86).

     Device drivers may expect outX() to emit a non-posted write transaction
     that waits for a completion response from the I/O peripheral before
     returning, but this is not guaranteed on all architectures.

All of these accessors assume that the underlying peripheral is
little-endian and will therefore perform byte-swapping operations on
big-endian architectures.
========================================
ASSUMED MINIMUM EXECUTION ORDERING MODEL
========================================

It has to be assumed that the conceptual CPU is weakly-ordered but that it
will maintain the appearance of program causality with respect to itself.
Some CPUs (such as i386 or x86_64) are more constrained than others (such as
powerpc or frv), and so the most relaxed case (namely DEC Alpha) must be
assumed outside of arch-specific code.

This means that it must be considered that the CPU will execute its
instruction stream in any order it feels like - or even in parallel -
provided that if an instruction in the stream depends on an earlier
instruction, then that earlier instruction must be sufficiently complete[*]
before the later instruction may proceed; in other words: provided that the
appearance of causality is maintained.

 [*] Some instructions have more than one effect - such as changing the
     condition codes, changing registers or changing memory - and different
     instructions may depend on different effects.

A CPU may also discard any instruction sequence that winds up having no
ultimate effect.  For example, if two adjacent instructions both load an
immediate value into the same register, the first may be discarded.
============================
THE EFFECTS OF THE CPU CACHE
============================

The way cached memory operations are perceived across the system is affected
to a certain extent by the caches that lie between CPUs and memory, and by
the memory coherence system that maintains the consistency of state in the
system.

As far as the way a CPU interacts with another part of the system through
the caches goes, the memory system has to include the CPU's caches, and
memory barriers for the most part act at the interface between the CPU and
its cache (memory barriers logically act on the dotted line in the following
diagram):

	    <--- CPU --->         :       <----------- Memory ----------->
	                          :
	+--------+    +--------+  :   +--------+    +-----------+
	|        |    |        |  :   |        |    |           |    +--------+
	|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
	|        |    | Queue  |  :   |        |    |           |--->| Memory |
	|        |    |        |  :   |        |    |           |    |        |
	+--------+    +--------+  :   +--------+    |           |    +--------+
	                          :                 | Cache     |
	                          :                 | Coherency |
	                          :                 | Mechanism |    +--------+
	+--------+    +--------+  :   +--------+    |           |    |        |
	|        |    |        |  :   |        |    |           |    |        |
	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
	|        |    | Queue  |  :   |        |    |           |    +--------+
	+--------+    +--------+  :   +--------+    +-----------+
	                          :
Although any particular load or store may not actually appear outside of the
CPU that issued it, since it may have been satisfied within the CPU's own
cache, it will still appear as if the full memory access had taken place as
far as the other CPUs are concerned, since the cache coherency mechanisms
will migrate the cacheline over to the accessing CPU and propagate the
effects upon conflict.

The CPU core may execute instructions in any order it deems fit, provided
the expected program causality appears to be maintained.  Some of the
instructions generate load and store operations which then go into the queue
of memory accesses to be performed.  The core may place these in the queue
in any order it wishes, and continue execution until it is forced to wait
for an instruction to complete.
CACHE COHERENCY VS DMA
----------------------

Not all systems maintain cache coherency with respect to devices doing DMA.
In such cases, a device attempting DMA may obtain stale data from RAM
because dirty cache lines may be resident in the caches of various CPUs, and
may not have been written back to RAM yet.  To deal with this, the
appropriate part of the kernel must flush the overlapping bits of cache on
each CPU (and maybe invalidate them as well).

See Documentation/core-api/cachetlb.rst for more information on cache
management.
CACHE COHERENCY VS MMIO
-----------------------

Memory mapped I/O usually takes place through memory locations that are part
of a window in the CPU's memory space that has different properties assigned
than the usual RAM directed window.  Amongst these properties is usually the
fact that such accesses bypass the caching entirely and go directly to the
device buses.  This means MMIO accesses may, in effect, overtake accesses to
cached memory that were issued earlier.  A memory barrier isn't sufficient
in such a case, but rather the cache must be flushed between the cached
memory write and the MMIO access if the two are in any way dependent.
=========================
THE THINGS CPUS GET UP TO
=========================

A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if the CPU is, for
example, given the following piece of code to execute:

	a = READ_ONCE(*A);
	WRITE_ONCE(*B, b);
	c = READ_ONCE(*C);
	d = READ_ONCE(*D);
	WRITE_ONCE(*E, e);

they would then expect that the CPU will complete the memory operation for
each instruction before moving on to the next one, leading to a definite
sequence of operations as seen by external observers in the system:

	LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
Reality is, of course, much messier.  With many CPUs and compilers, the
above assumption doesn't hold because:

 (*) loads are more likely to need to be completed immediately to permit
     execution progress, whereas stores can often be deferred without a
     problem;

 (*) loads may be done speculatively, and the result discarded should it
     prove to have been unnecessary;

 (*) the order of the memory accesses may be rearranged to promote better
     use of the CPU buses and caches; and

 (*) the CPU's data cache may affect the ordering, and while cache-coherency
     mechanisms may alleviate this - once the store has actually hit the
     cache - there's no guarantee that the coherency management will be
     propagated in order to other CPUs.

So what another CPU, say, might actually observe from the above piece of
code is:

	LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B

	(Where "LOAD {*C,*D}" is a combined load)

However, it is guaranteed that a CPU will be self-consistent: it will see
its own accesses appear to be correctly ordered, without the need for a
memory barrier.
For instance with the following code:

	U = READ_ONCE(*A);
	WRITE_ONCE(*A, V);
	WRITE_ONCE(*A, W);
	X = READ_ONCE(*A);
	WRITE_ONCE(*A, Y);
	Z = READ_ONCE(*A);

and assuming no intervention by an external influence, it can be assumed
that the final result will appear to be:

	U == the original value of *A
	X == W
	Z == Y
	*A == Y

The code above may cause the CPU to generate the full sequence of memory
accesses:

	U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A

in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view
of the world remains consistent.  Note that READ_ONCE() and WRITE_ONCE()
are -not- optional in the above example, as there are architectures where a
given CPU might reorder successive loads to the same location.  On such
architectures, READ_ONCE() and WRITE_ONCE() do whatever is necessary to
prevent this, for example, on Itanium the volatile casts used by READ_ONCE()
and WRITE_ONCE() cause GCC to emit the special ld.acq and st.rel
instructions (respectively) that prevent such reordering.

Similarly, without a memory barrier or a READ_ONCE() and WRITE_ONCE(), the
sequence:

	*A = Y;
	Z = *A;

may be reduced to:

	*A = Y;
	Z = Y;

so that the LOAD operation never appears outside of the CPU.
AND THEN THERE'S THE ALPHA
--------------------------

The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
some versions of the Alpha CPU have a split data cache, permitting them to
have two semantically-related cache lines updated at separate times.  This
is where the address-dependency barrier really becomes necessary, as this
synchronises both caches with the memory coherence system, thus making it
seem like pointer changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory model, although as of v4.15 the
addition of smp_mb() to READ_ONCE() on Alpha greatly reduced its impact on
the memory model.
VIRTUAL MACHINE GUESTS
----------------------

Guests running within virtual machines might be affected by SMP effects even
if the guest itself is compiled without SMP support.  This is an artifact of
interfacing with an SMP host while running an UP kernel.  Using mandatory
barriers for this use-case would be possible but is often suboptimal.

To handle this case optimally, low-level virt_mb() etc macros are available.
These have the same effect as smp_mb() etc when SMP is enabled, but generate
identical code for SMP and non-SMP systems.  For example, virtual machine
guests should use virt_mb() rather than smp_mb() when synchronizing against
a (possibly SMP) host.

These are equivalent to their smp_mb() etc counterparts in all other
respects; in particular, they do not control MMIO effects: to control MMIO
effects, use mandatory barriers.
============
EXAMPLE USES
============

CIRCULAR BUFFERS
----------------

Memory barriers can be used to implement circular buffering without the need
of a lock to serialise the producer with the consumer.  See:

	Documentation/core-api/circular-buffers.rst

for details.
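As a sketch in the spirit of that document (the type and function names are
invented for this example; CIRC_SPACE() and CIRC_CNT() come from
<linux/circ_buf.h>), the producer publishes the head with a RELEASE and the
consumer reads it with an ACQUIRE:

	#define BUF_SIZE 16		/* must be a power of two */

	struct ring {
		int buf[BUF_SIZE];
		unsigned int head;	/* written by producer, kept masked */
		unsigned int tail;	/* written by consumer, kept masked */
	};

	int produce(struct ring *r, int item)
	{
		unsigned int head = r->head;
		unsigned int tail = READ_ONCE(r->tail);

		if (CIRC_SPACE(head, tail, BUF_SIZE) < 1)
			return -EAGAIN;		/* full */

		r->buf[head] = item;
		/* RELEASE: commit the item before publishing the new head */
		smp_store_release(&r->head, (head + 1) & (BUF_SIZE - 1));
		return 0;
	}

	int consume(struct ring *r, int *item)
	{
		/* ACQUIRE: pairs with the producer's RELEASE of head */
		unsigned int head = smp_load_acquire(&r->head);
		unsigned int tail = r->tail;

		if (CIRC_CNT(head, tail, BUF_SIZE) < 1)
			return -EAGAIN;		/* empty */

		*item = r->buf[tail];
		/* RELEASE: finish reading the slot before freeing it */
		smp_store_release(&r->tail, (tail + 1) & (BUF_SIZE - 1));
		return 0;
	}

The RELEASE/ACQUIRE pairing on head makes the item visible before the slot
is published, and the RELEASE on tail keeps the consumer's read of the slot
from being reordered after the slot is handed back to the producer.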
==========
REFERENCES
==========

AMD64 Architecture Programmer's Manual Volume 2: System Programming
	Chapter 7.1: Memory-Access Ordering
	Chapter 7.4: Buffering and Combining Memory Writes

ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
	Chapter B2: The AArch64 Application Level Memory Model

IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide
	Chapter 7.1: Locked Atomic Operations
	Chapter 7.2: Memory Ordering
	Chapter 7.4: Serializing Instructions

UltraSPARC Programmer Reference Manual
	Chapter 5: Memory Accesses and Cacheability
	Chapter 15: Sparc-V9 Memory Models

Solaris Internals, Core Kernel Architecture, p63-68:
	R. McDougall, J. Mauro