Lines Matching +full:wait +full:- +full:on +full:- +full:read

10 `LWN <https://lwn.net/>`_ on those articles:
16 ------------
18 Read-copy update (RCU) is a synchronization mechanism that is often used
19 as a replacement for reader-writer locking. RCU is unusual in that
20 updaters do not block readers, which means that RCU's read-side
28 thought of as an informal, high-level specification for RCU. It is
40 #. `Fundamental Non-Requirements`_
42 #. `Quality-of-Implementation Requirements`_
44 #. `Software-Engineering Requirements`_
53 ------------------------
58 #. `Grace-Period Guarantee`_
60 #. `Memory-Barrier Guarantees`_
62 #. `Guaranteed Read-to-Write Upgrade`_
64 Grace-Period Guarantee
67 RCU's grace-period guarantee is unusual in being premeditated: Jack
69 on RCU (then called “rclock”) in the early 1990s. That said, the past
73 RCU's grace-period guarantee allows updaters to wait for the completion
74 of all pre-existing RCU read-side critical sections. An RCU read-side
77 RCU treats a nested set as one big RCU read-side critical section.
78 Production-quality implementations of rcu_read_lock() and
105 Because the synchronize_rcu() on line 14 waits for all pre-existing
119 +-----------------------------------------------------------------------+
121 +-----------------------------------------------------------------------+
122 | Wait a minute! You said that updaters can make useful forward |
123 | progress concurrently with readers, but pre-existing readers will |
126 +-----------------------------------------------------------------------+
128 +-----------------------------------------------------------------------+
131 | Second, even when using synchronize_rcu(), the other update-side |
132 | code does run concurrently with readers, whether pre-existing or not. |
133 +-----------------------------------------------------------------------+
173 The RCU read-side critical section in do_something_dlm() works with
178 +-----------------------------------------------------------------------+
180 +-----------------------------------------------------------------------+
181 | Why is the synchronize_rcu() on line 28 needed? |
182 +-----------------------------------------------------------------------+
184 +-----------------------------------------------------------------------+
188 +-----------------------------------------------------------------------+
190 In order to avoid fatal problems such as deadlocks, an RCU read-side
192 Similarly, an RCU read-side critical section must not contain anything
193 that waits, directly or indirectly, on completion of an invocation of
196 Although RCU's grace-period guarantee is useful in and of itself, with
198 be good to be able to use RCU to coordinate read-side access to linked
199 data structures. For this, the grace-period guarantee is not sufficient,
203 non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
211 5 return -ENOMEM;
217 11 p->a = a;
218 12 p->b = a;
233 5 return -ENOMEM;
240 12 p->a = a;
241 13 p->b = a;
247 executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
251 which brings us to the publish-subscribe guarantee discussed in the next
257 RCU's publish-subscribe guarantee allows data to be inserted into a
269 5 return -ENOMEM;
275 11 p->a = a;
276 12 p->b = a;
282 The rcu_assign_pointer() on line 13 is conceptually equivalent to a
289 +-----------------------------------------------------------------------+
291 +-----------------------------------------------------------------------+
293 | assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
295 +-----------------------------------------------------------------------+
297 +-----------------------------------------------------------------------+
300 | initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
302 +-----------------------------------------------------------------------+
305 control its accesses to the RCU-protected data, as shown in
315 6 do_something(p->a, p->b);
335 5 do_something(gp->a, gp->b);
344 the current structure with a new one, the fetches of ``gp->a`` and
345 ``gp->b`` might well come from two different structures, which could
356 6 do_something(p->a, p->b);
365 barriers in the Linux kernel. Should a |high-quality implementation of
370 outermost RCU read-side critical section containing that
376 .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implement…
377 .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulm…
383 Of course, it is also necessary to remove elements from RCU-protected
387 #. Wait for all pre-existing RCU read-side critical sections to complete
388 (because only pre-existing readers can possibly have a reference to
418 element referenced by ``p`` is freed. The rcu_access_pointer() on
427 read-side critical section or in a code segment where the pointer
429 update-side lock.
431 +-----------------------------------------------------------------------+
433 +-----------------------------------------------------------------------+
436 +-----------------------------------------------------------------------+
438 +-----------------------------------------------------------------------+
442 | in a byte-at-a-time manner, resulting in *load tearing*, in turn |
443 | resulting a bytewise mash-up of two distinct pointer values. It might |
444 | even use value-speculation optimizations, where it makes a wrong |
447 | about any dereferences that returned pre-initialization garbage in |
454 +-----------------------------------------------------------------------+
456 In short, RCU's publish-subscribe guarantee is provided by the
458 guarantee allows data elements to be safely added to RCU-protected
460 can be used in combination with the grace-period guarantee to also allow
461 data elements to be removed from RCU-protected linked data structures,
467 resembling the dependency-ordering barrier that was later subsumed
470 late-1990s meeting with the DEC Alpha architects, back in the days when
471 DEC was still a free-standing company. It took the Alpha architects a
475 and C++ standards committees have provided much education on tricks and
480 Memory-Barrier Guarantees
483 The previous section's simple linked-data-structure scenario clearly
484 demonstrates the need for RCU's stringent memory-ordering guarantees on
487 #. Each CPU that has an RCU read-side critical section that begins
489 memory barrier between the time that the RCU read-side critical
491 this guarantee, a pre-existing RCU read-side critical section might
493 kfree() on line 14 of remove_gp_synchronous().
494 #. Each CPU that has an RCU read-side critical section that ends after
497 time that the RCU read-side critical section begins. Without this
498 guarantee, a later RCU read-side critical section running after the
499 kfree() on line 14 of remove_gp_synchronous() might later run
501 #. If the task invoking synchronize_rcu() remains on a given CPU,
504 that the kfree() on line 14 of remove_gp_synchronous() really
505 does execute after the removal on line 11.
510 the kfree() on line 14 of remove_gp_synchronous() really does
511 execute after the removal on line 11, but also in the case where the
514 +-----------------------------------------------------------------------+
516 +-----------------------------------------------------------------------+
517 | Given that multiple CPUs can start RCU read-side critical sections at |
519 | whether or not a given RCU read-side critical section starts before a |
521 +-----------------------------------------------------------------------+
523 +-----------------------------------------------------------------------+
524 | If RCU cannot tell whether or not a given RCU read-side critical |
526 | it must assume that the RCU read-side critical section started first. |
528 | waiting on a given RCU read-side critical section only if it can |
534 | within the enclosed RCU read-side critical section to the code |
536 | then a given RCU read-side critical section begins before a given |
547 +-----------------------------------------------------------------------+
549 +-----------------------------------------------------------------------+
551 +-----------------------------------------------------------------------+
554 +-----------------------------------------------------------------------+
556 +-----------------------------------------------------------------------+
564 | #. CPU 1: ``do_something_with(q->a);`` |
571 | end of the RCU read-side critical section and the end of the grace |
584 | #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
588 | grace period and the beginning of the RCU read-side critical section, |
594 | believing that you have adhered to the as-if rule than it is to |
596 +-----------------------------------------------------------------------+
598 +-----------------------------------------------------------------------+
600 +-----------------------------------------------------------------------+
603 | compiler might arbitrarily rearrange consecutive RCU read-side |
604 | critical sections. Given such rearrangement, if a given RCU read-side |
606 | read-side critical sections are done? Won't the compiler |
608 +-----------------------------------------------------------------------+
610 +-----------------------------------------------------------------------+
614 | schedule() had better prevent calling-code accesses to shared |
616 | RCU detects the end of a given RCU read-side critical section, it |
617 | will necessarily detect the end of all prior RCU read-side critical |
621 | loop, into user-mode code, and so on. But if your kernel build allows |
623 +-----------------------------------------------------------------------+
625 Note that these memory-barrier requirements do not replace the
626 fundamental RCU requirement that a grace period wait for all
627 pre-existing readers. On the contrary, the memory barriers called out in
635 The common-case RCU primitives are unconditional. They are invoked, they
642 guarantee was reverse-engineered, not premeditated. The unconditional
647 primitive to RCU would need to be based on detailed and compelling use
650 Guaranteed Read-to-Write Upgrade
654 within an RCU read-side critical section. For example, that RCU
655 read-side critical section might search for a given data element, and
656 then might acquire the update-side spinlock in order to update that
657 element, all while remaining in that RCU read-side critical section. Of
658 course, it is necessary to exit the RCU read-side critical section
663 +-----------------------------------------------------------------------+
665 +-----------------------------------------------------------------------+
666 | But how does the upgrade-to-write operation exclude other readers? |
667 +-----------------------------------------------------------------------+
669 +-----------------------------------------------------------------------+
672 +-----------------------------------------------------------------------+
674 This guarantee allows lookup code to be shared between read-side and
675 update-side code, and was premeditated, appearing in the earliest
678 Fundamental Non-Requirements
679 ----------------------------
681 RCU provides extremely lightweight readers, and its read-side
686 non-guarantees that have caused confusion. Except where otherwise noted,
687 these non-guarantees were premeditated.
691 #. `Updaters Only Wait For Old Readers`_
692 #. `Grace Periods Don't Partition Read-Side Critical Sections`_
693 #. `Read-Side Critical Sections Don't Partition Grace Periods`_
698 Reader-side markers such as rcu_read_lock() and
700 through their interaction with the grace-period APIs such as
737 significant ordering constraints would slow down these fast-path APIs.
739 +-----------------------------------------------------------------------+
741 +-----------------------------------------------------------------------+
743 +-----------------------------------------------------------------------+
745 +-----------------------------------------------------------------------+
748 +-----------------------------------------------------------------------+
785 Updaters Only Wait For Old Readers
792 obligation to wait for these new readers.
794 +-----------------------------------------------------------------------+
796 +-----------------------------------------------------------------------+
797 | Suppose that synchronize_rcu() did wait until *all* readers had |
798 | completed instead of waiting only on pre-existing readers. For how |
799 | long would the updater be able to rely on there being no readers? |
800 +-----------------------------------------------------------------------+
802 +-----------------------------------------------------------------------+
803 | For no time at all. Even if synchronize_rcu() were to wait until |
806 | synchronize_rcu() can *never* rely on there being no readers. |
807 +-----------------------------------------------------------------------+
809 Grace Periods Don't Partition Read-Side Critical Sections
812 It is tempting to assume that if any part of one RCU read-side critical
814 read-side critical section follows that same grace period, then all of
815 the first RCU read-side critical section must precede all of the second.
817 partition the set of RCU read-side critical sections. An example of this
855 that the thread cannot be in the midst of an RCU read-side critical
858 .. kernel-figure:: GPpartitionReaders1.svg
860 If it is necessary to partition RCU read-side critical sections in this
898 ``(r4 == 1)``, then thread3()'s read from ``b`` must happen after
902 the two RCU read-side critical sections cannot overlap, guaranteeing
911 This non-requirement was also non-premeditated, but became apparent when
914 Read-Side Critical Sections Don't Partition Grace Periods
917 It is also tempting to assume that if an RCU read-side critical section
970 .. kernel-figure:: ReadersPartitionGP1.svg
972 Again, an RCU read-side critical section can overlap almost all of a
974 period. As a result, an RCU read-side critical section cannot partition
977 +-----------------------------------------------------------------------+
979 +-----------------------------------------------------------------------+
981 | read-side critical section, would be required to partition the RCU |
982 | read-side critical sections at the beginning and end of the chain? |
983 +-----------------------------------------------------------------------+
985 +-----------------------------------------------------------------------+
990 +-----------------------------------------------------------------------+
993 -------------------------
1000 completely futile. This is most obvious in preemptible user-level
1003 hypervisor), but can also happen in bare-metal environments due to
1007 but where “extremely long” is not long enough to allow wrap-around
1008 when incrementing a 64-bit counter.
1010 matters, RCU must use compiler directives and memory-barrier
1014 writes and more-frequent concurrent writes will result in more
1021 #. Counters are finite, especially on 32-bit systems. RCU's use of
1026 dyntick-idle nesting counter allows 54 bits for interrupt nesting
1027 level (this counter is 64 bits even on a 32-bit system). Overflowing
1028 this counter requires 2\ :sup:`54` half-interrupts on a given CPU
1029 without that CPU ever going idle. If a half-interrupt happened every
1033 kernel in a single shared-memory environment. RCU must therefore pay
1034 close attention to high-end scalability.
1042 Quality-of-Implementation Requirements
1043 --------------------------------------
1045 These sections list quality-of-implementation requirements. Although an
1048 inappropriate for industrial-strength production use. Classes of
1049 quality-of-implementation requirements are as follows:
1062 RCU is and always has been intended primarily for read-mostly
1063 situations, which means that RCU's read-side primitives are optimized,
1064 often at the expense of its update-side primitives. Experience thus far
1067 #. Read-mostly data, where stale and inconsistent data is not a problem:
1069 #. Read-mostly data, where data must be consistent: RCU works well.
1070 #. Read-write data, where data must be consistent: RCU *might* work OK.
1072 #. Write-mostly data, where data must be consistent: RCU is very
1076 a. Existence guarantees for update-friendly mechanisms.
1077 b. Wait-free read-side primitives for real-time use.
1079 This focus on read-mostly situations means that RCU must interoperate
1084 primitives be legal within RCU read-side critical sections, including
1088 +-----------------------------------------------------------------------+
1090 +-----------------------------------------------------------------------+
1092 +-----------------------------------------------------------------------+
1094 +-----------------------------------------------------------------------+
1095 | These are forbidden within Linux-kernel RCU read-side critical |
1097 | case, voluntary context switch) within an RCU read-side critical |
1099 | read-side critical sections, and also within Linux-kernel sleepable |
1100 | RCU `(SRCU) <Sleepable RCU_>`__ read-side critical sections. In |
1101 | addition, the -rt patchset turns spinlocks into a sleeping locks so |
1104 | locks!) may be acquire within -rt-Linux-kernel RCU read-side critical |
1106 | Note that it *is* legal for a normal RCU read-side critical section |
1114 +-----------------------------------------------------------------------+
1126 inconsistency must be tolerated due to speed-of-light delays if nothing
1131 whether or not a given cat was alive. But how long should they wait
1136 heart might stop for some period of time, so the exact wait period is a
1137 judgment call. One of our pair of veterinarians might wait 30 seconds
1138 before pronouncing the cat dead, while the other might insist on waiting
1139 a full minute. The two veterinarians would then disagree on the state of
1152 For example, the translation between a user-level SystemV semaphore ID
1153 to the corresponding in-kernel data structure is protected by RCU, but
1156 by acquiring spinlocks located in the in-kernel data structure from
1157 within the RCU read-side critical section, and this is indicated by the
1171 Linux-kernel RCU implementations must therefore avoid unnecessarily
1174 in which I was given “frank and open” feedback on the importance of
1175 energy efficiency in battery-powered systems and on specific
1176 energy-efficiency shortcomings of the Linux-kernel RCU implementation.
1177 In my experience, the battery-powered embedded community will consider
1179 mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
1184 `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
1185 footprint is critically important on single-CPU systems with
1186 non-preemptible (``CONFIG_PREEMPTION=n``) kernels, and thus `tiny
1188 was born. Josh Triplett has since taken over the small-memory banner
1194 unsurprising. For example, in keeping with RCU's read-side
1197 Similarly, in non-preemptible environments, rcu_read_lock() and
1200 In preemptible environments, in the case where the RCU read-side
1202 highest-priority real-time process), rcu_read_lock() and
1204 should not contain atomic read-modify-write operations, memory-barrier
1206 branches. However, in the case where the RCU read-side critical section
1208 interrupts. This is why it is better to nest an RCU read-side critical
1209 section within a preempt-disable region than vice versa, at least in
1211 degrading real-time latencies.
1213 The synchronize_rcu() grace-period-wait primitive is optimized for
1215 addition to the duration of the longest RCU read-side critical section.
1216 On the other hand, multiple concurrent invocations of
1218 they can be satisfied by a single underlying grace-period-wait
1220 single grace-period-wait operation to serve more than `1,000 separate
1221 … <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-s…
1222 of synchronize_rcu(), thus amortizing the per-invocation overhead
1223 down to nearly zero. However, the grace-period optimization is also
1224 required to avoid measurable degradation of real-time scheduling and
1227 In some cases, the multi-millisecond synchronize_rcu() latencies are
1229 used instead, reducing the grace-period latency down to a few tens of
1230 microseconds on small systems, at least in cases where the RCU read-side
1232 requirements for synchronize_rcu_expedited() on large systems, but,
1235 requirements: A storm of synchronize_rcu_expedited() invocations on
1238 to impose modest degradation of real-time latency on non-idle online
1240 scheduling-clock interrupt.
1243 synchronize_rcu_expedited()'s reduced grace-period latency is
1273 25 call_rcu(&p->rh, remove_gp_cb);
1278 A definition of ``struct foo`` is finally needed, and appears on
1279 lines 1-5. The function remove_gp_cb() is passed to call_rcu()
1280 on line 25, and will be invoked after the end of a subsequent grace
1282 without forcing the updater to wait for a grace period to elapse. The
1285 be legal, including within preempt-disable code, local_bh_disable()
1286 code, interrupt-disable code, and interrupt handlers. However, even
1293 takes too long. Long-running operations should be relegated to separate
1296 +-----------------------------------------------------------------------+
1298 +-----------------------------------------------------------------------+
1300 | call_rcu() on line 25 stores into the structure, which would |
1303 +-----------------------------------------------------------------------+
1305 +-----------------------------------------------------------------------+
1306 | Presumably the ``->gp_lock`` acquired on line 18 excludes any |
1309 | after ``->gp_lock`` is released on line 25, which in turn means that |
1311 +-----------------------------------------------------------------------+
1313 However, all that remove_gp_cb() is doing is invoking kfree() on
1350 open-coded it.
1352 +-----------------------------------------------------------------------+
1354 +-----------------------------------------------------------------------+
1358 | of the memory (respectively) must still wait for a grace period to |
1360 +-----------------------------------------------------------------------+
1362 +-----------------------------------------------------------------------+
1364 | definition would say that updates in garbage-collected languages |
1369 | kfree_rcu(), without having to wait for a subsequent grace |
1371 +-----------------------------------------------------------------------+
1373 But what if the updater must wait for the completion of code to be
1375 be carried out in the meantime? The polling-style
1401 On line 14, get_state_synchronize_rcu() obtains a “cookie” from RCU,
1414 In theory, delaying grace-period completion and callback invocation is
1421 example, an infinite loop in an RCU read-side critical section must by
1423 involved example, consider a 64-CPU system built with
1424 ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
1439 causes future invocations of cond_resched() on the holdout CPUs
1442 corresponding CPU's next scheduling-clock.
1444 indefinitely in the kernel without scheduling-clock interrupts, which
1446 invoke resched_cpu() on any ``nohz_full`` CPUs still holding out
1449 has been preempted within an RCU read-side critical section is
1453 will invoke resched_cpu() on it regardless of its ``nohz_full``
1460 caution when changing them. Note that these forward-progress measures
1465 invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
1476 #. Lifts callback-execution batch limits, which speeds up callback
1480 overridden. Again, these forward-progress measures are provided only for
1482 RCU`_. Even for RCU, callback-invocation forward
1483 progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
1487 additional forward-progress work will be required.
1493 part due to the collision of multicore hardware with object-oriented
1494 techniques designed in single-threaded environments for single-threaded
1495 use. And in theory, RCU read-side critical sections may be composed, and
1497 real-world implementations of composable constructs, there are
1501 rcu_read_unlock() generate no code, such as Linux-kernel RCU when
1516 the nesting-depth counter. For example, the Linux kernel's preemptible
1518 practical purposes. That said, a consecutive pair of RCU read-side
1520 grace period cannot be enclosed in another RCU read-side critical
1521 section. This is because it is not legal to wait for a grace period
1522 within an RCU read-side critical section: To do so would result either
1523 in deadlock or in RCU implicitly splitting the enclosing RCU read-side
1524 critical section, neither of which is conducive to a long-lived and
1528 example, many transactional-memory implementations prohibit composing a
1530 a network receive operation). For another example, lock-based critical
1534 In short, although RCU read-side critical sections are highly
1542 read-side critical sections, perhaps even so intense that there was
1544 read-side critical section in flight. RCU cannot allow this situation to
1545 block grace periods: As long as all the RCU read-side critical sections
1549 RCU read-side critical sections being preempted for long durations,
1550 which has the effect of creating a long-duration RCU read-side critical
1552 systems using real-time priorities are of course more vulnerable.
1554 case. That said, the exact requirements on RCU priority boosting will
1563 rates should not delay RCU read-side critical sections, although some
1564 small read-side delays can occur when using
1569 1990s, a simple user-level test consisting of ``close(open(path))`` in a
1571 appreciation of the high-update-rate corner case. This test also
1576 grace-period processing. This evasive action causes the grace period to
1581 Software-Engineering Requirements
1582 ---------------------------------
1589 splat if rcu_dereference() is used outside of an RCU read-side
1590 critical section. Update-side code can use
1602 that it has been invoked within an RCU read-side critical section. I
1605 #. A given function might wish to check for RCU-related preconditions
1611 assignment. To catch this sort of error, a given RCU-protected
1613 complain about simple-assignment accesses to that pointer. Arnd
1621 allocated on the stack must be initialized with
1624 non-stack ``rcu_head`` structures must be initialized with
1629 #. An infinite loop in an RCU read-side critical section will eventually
1635 waiting on that particular RCU read-side critical section.
1641 stall warnings are counter-productive during sysrq dumps and during
1654 read-side critical sections, there is currently no good way of doing
1658 #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
1660 #. Open-coded use of rcu_assign_pointer() and rcu_dereference()
1662 error-prone. Therefore, RCU-protected `linked
1664 more recently, RCU-protected `hash
1666 other special-purpose RCU-protected data structures are available in
1675 This not a hard-and-fast list: RCU's diagnostic capabilities will
1677 real-world RCU usage.
1680 --------------------------
1696 #. `Scheduling-Clock Interrupts and RCU`_
1701 notable Linux-kernel complications. Each of the following sections
1719 `remind <https://lore.kernel.org/r/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.c…
1731 used to do, it would create too many per-CPU kthreads. Although the
1736 RCU must therefore wait for a given CPU to actually come online before
1748 ``task_struct`` is available and the boot CPU's per-CPU variables are
1749 set up. The read-side primitives (rcu_read_lock(),
1751 rcu_access_pointer()) will operate normally very early on, as will
1762 panacea because there would be severe restrictions on what operations
1769 itself is a quiescent state and thus a grace period, so the early-boot
1770 implementation can be a no-op.
1775 reason is that an RCU read-side critical section might be preempted,
1777 wait for something, as opposed to simply returning immediately.
1785 +-----------------------------------------------------------------------+
1787 +-----------------------------------------------------------------------+
1790 +-----------------------------------------------------------------------+
1792 +-----------------------------------------------------------------------+
1797 | grace-period mechanism. At runtime, this expedited mechanism relies |
1798 | on workqueues, but during the dead zone the requesting task itself |
1799 | drives the desired expedited grace period. Because dead-zone |
1813 +-----------------------------------------------------------------------+
1815 I learned of these boot-time requirements as a result of a series of
1821 The Linux kernel has interrupts, and RCU read-side critical sections are
1822 legal within interrupt handlers and within interrupt-disabled regions of
1825 Some Linux-kernel architectures can enter an interrupt handler from
1826 non-idle process context, and then just never leave it, instead
1829 “half-interrupts” mean that RCU has to be very careful about how it
1831 way during a rewrite of RCU's dyntick-idle code.
1833 The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
1835 update-side primitives, including call_rcu(), are prohibited within
1838 The name notwithstanding, some Linux-kernel architectures can have
1840 me <https://lore.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
1858 one of its functions results in a segmentation fault. The module-unload
1859 functions must therefore cancel any delayed calls to loadable-module
1867 module unload request, we need some other way to deal with in-flight RCU
1871 in-flight RCU callbacks have been invoked. If a module uses
1874 the underlying module-unload code could invoke rcu_barrier()
1879 filesystem-unmount situation, and Dipankar Sarma incorporated
1886 *not*, obligated to wait for a grace period. It is instead only required
1887 to wait for RCU callbacks that have already been posted. Therefore, if
1891 to wait for a grace period.
1893 +-----------------------------------------------------------------------+
1895 +-----------------------------------------------------------------------+
1896 | Wait a minute! Each RCU callbacks must wait for a grace period to |
1897 | complete, and rcu_barrier() must wait for each pre-existing |
1899 | wait for a full grace period if there is even one callback posted |
1901 +-----------------------------------------------------------------------+
1903 +-----------------------------------------------------------------------+
1905 | Yes, each RCU callbacks must wait for a grace period to complete, but |
1908 | need only wait for the remaining portion of the grace period to |
1912 | So if you need to wait for a grace period as well as for all |
1913 | pre-existing callbacks, you will need to invoke both |
1916 +-----------------------------------------------------------------------+
1923 offline CPU, with the exception of `SRCU <Sleepable RCU_>`__ read-side
1925 DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
1928 The Linux-kernel CPU-hotplug implementation has notifiers that are used
1930 appropriately to a given CPU-hotplug operation. Most RCU operations may
1931 be invoked from CPU-hotplug notifiers, including even synchronous
1932 grace-period operations such as (synchronize_rcu() and
1938 In addition, all-callback-wait operations such as rcu_barrier() may
1939 not be invoked from any CPU-hotplug notifier. This restriction is due
1940 to the fact that there are phases of CPU-hotplug operations where the
1941 outgoing CPU's callbacks will not be invoked until after the CPU-hotplug
1943 rcu_barrier() blocks CPU-hotplug operations during its execution,
1944 which results in another type of deadlock when invoked from a CPU-hotplug
1952 for the force-quiescent-state loop (FQS) to report quiescent states for
1960 race either with CPU offlining or with a task unblocking on a leaf
1963 The CPU-online path (rcutree_report_cpu_starting()) should never need to report
1976 RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
1978 RCU's violation of it when running context-switch-heavy workloads when
1982 context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
1986 scheduler's runqueue or priority-inheritance spinlocks across an
1988 somewhere within the corresponding RCU read-side critical section.
1994 nesting. The fact that interrupt-disabled regions of code act as RCU
1995 read-side critical sections implicitly avoids earlier issues that used
2001 It is possible to use tracing on RCU code, but tracing itself uses RCU.
2012 The kernel needs to access user-space memory, for example, to access data
2013 referenced by system-call parameters. The get_user() macro does this job.
2015 However, user-space memory might well be paged out, which means that
2016 get_user() might well page-fault and thus block while waiting for the
2018 reorder a get_user() invocation into an RCU read-side critical section.
2026 3 v = p->value;
2039 4 v = p->value;
2045 state in the middle of an RCU read-side critical section. This misplaced
2046 quiescent state could result in line 4 being a use-after-free access,
2054 ``p->value`` is not volatile, so the compiler would not have any reason to keep
2057 Therefore, the Linux-kernel definitions of rcu_read_lock() and
2060 of RCU read-side critical sections.
2066 by people with battery-powered embedded systems. RCU therefore conserves
2069 energy-efficiency requirement, so I learned of this via an irate phone
2073 RCU read-side critical section on an idle CPU. (Kernels built with
2082 These energy-efficiency requirements have proven quite difficult to
2084 clean-sheet rewrites of RCU's energy-efficiency code, the last of which
2085 was finally able to demonstrate `real energy savings running on real
2089 phone calls: Flaming me on the Linux-kernel mailing list was apparently
2090 not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
2092 Scheduling-Clock Interrupts and RCU
2095 The kernel transitions between in-kernel non-idle execution, userspace
2096 execution, and the idle loop. Depending on kernel configuration, RCU
2099 +-----------------+------------------+------------------+-----------------+
2100 | ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
2102 | ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on |
2103 | | scheduling-clock | scheduling-clock | RCU's |
2104 | | interrupt. | interrupt and | dyntick-idle |
2108 +-----------------+------------------+------------------+-----------------+
2109 | ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on |
2110 | | scheduling-clock | scheduling-clock | RCU's |
2111 | | interrupt. | interrupt and | dyntick-idle |
2115 +-----------------+------------------+------------------+-----------------+
2116 | ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on |
2118 | | on | dyntick-idle | dyntick-idle |
2119 | | scheduling-clock | detection. | detection. |
2127 +-----------------+------------------+------------------+-----------------+
2129 +-----------------------------------------------------------------------+
2131 +-----------------------------------------------------------------------+
2132 | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
2133 | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
2135 +-----------------------------------------------------------------------+
2137 +-----------------------------------------------------------------------+
2139 | necessarily re-enable the scheduling-clock interrupt on entry to each |
2141 +-----------------------------------------------------------------------+
2147 scheduling-clock interrupt be enabled when RCU needs it to be:
2150 is non-idle, the scheduling-clock tick had better be running.
2152 (11-second) grace periods, with a pointless IPI waking the CPU from
2154 #. If a CPU is in a portion of the kernel that executes RCU read-side
2160 no-joking guaranteed to never execute any RCU read-side critical
2162 sort of thing is used by some architectures for light-weight
2169 fact joking about not doing RCU read-side critical sections.
2170 #. If a CPU is executing in the kernel with the scheduling-clock
2171 interrupt disabled and RCU believes this CPU to be non-idle, and if
2181 is usually OK) and the scheduling-clock interrupt is enabled, of
2186 +-----------------------------------------------------------------------+
2188 +-----------------------------------------------------------------------+
2192 +-----------------------------------------------------------------------+
2194 +-----------------------------------------------------------------------+
2196 | often. But given that long-running interrupt handlers can cause other |
2199 +-----------------------------------------------------------------------+
2202 between in-kernel execution, usermode execution, and idle, and as long
2203 as the scheduling-clock interrupt is enabled when RCU needs it to be,
2210 Although small-memory non-realtime systems can simply use Tiny RCU, code
2214 pair of pointers, it does appear in many RCU-protected data structures,
2219 This need for memory efficiency is one reason that RCU uses hand-crafted
2224 posted them. Although this information might appear in debug-only kernel
2225 builds at some point, in the meantime, the ``->func`` field will often
2233 conditions <https://lore.kernel.org/r/1439976106-137226-1-git-send-email-kirill.shutemov@linux.inte…
2234 the Linux kernel's memory-management subsystem needs a particular bit to
2235 remain zero during all phases of grace-period processing, and that bit
2237 ``->next`` field. RCU makes this guarantee as long as call_rcu() is
2240 energy-efficiency purposes.
2243 structure be aligned to a two-byte boundary, and passing a misaligned
2247 a four-byte or even eight-byte alignment requirement? Because the m68k
2248 architecture provides only two-byte alignment, and thus acts as
2254 potentially have energy-efficiency benefits, but only if the rate of
2255 non-lazy callbacks decreases significantly for some important workload.
2262 Expanding on the `earlier
2264 hot code paths in performance-critical portions of the Linux kernel's
2267 read-side primitives. To that end, it would be good if preemptible RCU's
2274 frequent acquisitions of global locks or frequent atomic operations on
2277 on the ``rcu_node`` structure. RCU is required to tolerate all CPUs
2279 minimal per-operation overhead. In fact, in many cases, increasing load
2280 must *decrease* the per-operation overhead, witness the batching
2286 The Linux kernel is used for real-time workloads, especially in
2287 conjunction with the `-rt
2289 real-time-latency response requirements are such that the traditional
2290 approach of disabling preemption across RCU read-side critical sections
2292 an RCU implementation that allows RCU read-side critical sections to be
2294 clear that an earlier `real-time
2298 encountered by a very early version of the -rt patchset.
2300 In addition, RCU must make do with a sub-100-microsecond real-time
2301 latency budget. In fact, on smaller systems with the -rt patchset, the
2302 Linux kernel provides sub-20-microsecond real-time latencies for the
2305 surprise, the sub-100-microsecond real-time latency budget `applies to
2308 up to and including systems with 4096 CPUs. This real-time requirement
2309 motivated the grace-period kthread, which also simplified handling of a
2312 RCU must avoid degrading real-time response for CPU-bound threads,
2314 ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
2322 stress-test suite. This stress-test suite is called ``rcutorture``.
2328 today, given Android smartphones, Linux-powered televisions, and
2332 Suppose that RCU contains a race condition that manifests on average
2338 jurisdictions, a successful multi-year test of a given mechanism, which
2340 safety-critical certifications. In fact, rumor has it that the Linux
2341 kernel is already being used in production for safety-critical
2343 bug in RCU killed someone. Which might explain my recent focus on
2347 -----------------
2352 implementations, non-preemptible and preemptible. The other four flavors
2356 #. `Bottom-Half Flavor (Historical)`_
2362 Bottom-Half Flavor (Historical)
2365 The RCU-bh flavor of RCU has since been expressed in terms of the other
2367 single flavor. The read-side API remains, and continues to disable
2371 The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
2372 flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
2373 flavor of RCU that could withstand the network-based denial-of-service
2375 networking load on the system that some of the CPUs never exited softirq
2378 grace periods from ever ending. The result was an out-of-memory
2381 The solution was the creation of RCU-bh, which does
2382 local_bh_disable() across its read-side critical sections, and which
2385 offline. This means that RCU-bh grace periods can complete even when
2387 algorithms based on RCU-bh to withstand network-based denial-of-service
2391 re-enable softirq handlers, any attempt to start a softirq handlers
2392 during the RCU-bh read-side critical section will be deferred. In this
2395 overhead should be associated with the code following the RCU-bh
2396 read-side critical section rather than rcu_read_unlock_bh(), but the
2398 of fine distinction. For example, suppose that a three-millisecond-long
2399 RCU-bh read-side critical section executes during a time of heavy
2406 The `RCU-bh
2407 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2410 old RCU-bh update-side APIs are now gone, replaced by synchronize_rcu(),
2412 anything that disables bottom halves also marks an RCU-bh read-side
2414 local_irq_save() and local_irq_restore(), and so on.
2419 The RCU-sched flavor of RCU has since been expressed in terms of the
2421 single flavor. The read-side API remains, and continues to disable
2426 effect of also waiting for all pre-existing interrupt and NMI handlers.
2427 However, there are legitimate preemptible-RCU implementations that do
2429 RCU read-side critical section can be a quiescent state. Therefore,
2430 *RCU-sched* was created, which follows “classic” RCU in that an
2431 RCU-sched grace period waits for pre-existing interrupt and NMI
2433 RCU-sched APIs have identical implementations, while kernels built with
2438 re-enable preemption, respectively. This means that if there was a
2439 preemption attempt during the RCU-sched read-side critical section,
2443 very slowly. However, the highest-priority task won't be preempted, so
2444 that task will enjoy low-overhead rcu_read_unlock_sched()
2447 The `RCU-sched
2448 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2452 rcu_read_lock_sched_held(). However, the old RCU-sched update-side APIs
2455 preemption also marks an RCU-sched read-side critical section,
2457 and local_irq_restore(), and so on.
2463 read-side critical section” was a reliable indication that this someone
2465 read-side critical section, you can probably afford to use a
2466 higher-overhead synchronization mechanism. However, that changed with
2467 the advent of the Linux kernel's notifiers, whose RCU read-side critical
2478 That said, one consequence of these domains is that read-side code must
2490 As noted above, it is legal to block within SRCU read-side critical
2492 block forever in one of a given domain's SRCU read-side critical
2495 happen if any operation in a given domain's SRCU read-side critical
2496 section can wait, either directly or indirectly, for that domain's grace
2497 period to elapse. For example, this results in a self-deadlock:
2512 ``ss1``-domain SRCU read-side critical section acquired another mutex
2513 that was held across as ``ss``-domain synchronize_srcu(), deadlock
2518 Unlike the other RCU flavors, SRCU read-side critical sections can run
2519 on idle and even offline CPUs. This ability requires that
2527 invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
2529 temporarily “stranded” on the outgoing CPU. This stranding of timers
2531 the CPU-hotplug process. The problem is that if a notifier is waiting on
2532 an SRCU grace period, that grace period is waiting on a timer, and that
2533 timer is stranded on the outgoing CPU, then the notifier will never be
2536 CPU-hotplug notifiers.
2539 non-expedited grace periods are implemented by the same mechanism. This
2549 As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
2551 allow users to put much heavier stress on call_srcu(), it is
2557 that sort of load. Of course, your mileage may vary based on the speed
2561 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2576 specified cookie corresponds to an already-completed
2584 certain types of buffer-cache algorithms having multi-stage age-out
2595 anywhere in the code, it is not possible to use read-side markers such
2605 trampoline would be pre-ordained a surprisingly long time before execution
2609 RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
2613 userspace execution also delimit tasks-RCU read-side critical sections.
2617 Note well that involuntary context switches are *not* Tasks-RCU quiescent
2619 trampoline might be preempted. In this case, the Tasks-RCU grace period
2625 The tasks-RCU API is quite compact, consisting only of
2637 Some forms of tracing need to wait for all preemption-disabled regions
2638 of code running on any online CPU, including those executed when RCU is
2641 forcing a workqueue to be scheduled on each online CPU, hence the "Rude"
2642 moniker. And this operation is considered to be quite rude by real-time
2644 by battery-powered systems that don't want their idle CPUs to be awakened.
2646 Once kernel entry/exit and deep-idle functions have been properly tagged
2651 The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
2658 SRCU's read-side overhead, which includes a full memory barrier in both
2661 readers. Real-time systems that cannot tolerate IPIs may build their
2663 the expense of adding full memory barriers to the read-side primitives.
2665 The tasks-trace-RCU API is also reasonably compact,
2671 -----------------------
2673 One of the tricks that RCU uses to attain update-side scalability is to
2674 increase grace-period latency with increasing numbers of CPUs. If this
2676 grace-period state machine so as to avoid the need for the additional
2681 rcu_barrier() in CPU-hotplug notifiers, it will be necessary to
2685 The tradeoff between grace-period latency on the one hand and
2686 interruptions of other CPUs on the other hand may need to be
2687 re-examined. The desire is of course for zero grace-period latency as
2697 be unnecessary because the hotpath read-side primitives do not access
2708 carefully run and realistic system-level workload.
2722 Additional work may be required to provide reasonable forward-progress
2727 -------
2731 be the last word on this subject, but at least it serves to get an
2735 ---------------