10 `LWN <https://lwn.net/>`_ in the following articles:
11 `part 1 <https://lwn.net/Articles/652156/>`_,
12 `part 2 <https://lwn.net/Articles/652677/>`_, and
13 `part 3 <https://lwn.net/Articles/653326/>`_.
15 Introduction
16 ------------
18 Read-copy update (RCU) is a synchronization mechanism that is often used
19 as a replacement for reader-writer locking. RCU is unusual in that
20 updaters do not block readers, which means that RCU's read-side
28 thought of as an informal, high-level specification for RCU. It is
40 #. `Fundamental Non-Requirements`_
42 #. `Quality-of-Implementation Requirements`_
44 #. `Software-Engineering Requirements`_
52 Fundamental Requirements
53 ------------------------
58 #. `Grace-Period Guarantee`_
60 #. `Memory-Barrier Guarantees`_
62 #. `Guaranteed Read-to-Write Upgrade`_
64 Grace-Period Guarantee
67 RCU's grace-period guarantee is unusual in being premeditated: Jack
73 RCU's grace-period guarantee allows updaters to wait for the completion
74 of all pre-existing RCU read-side critical sections. An RCU read-side
77 RCU treats a nested set as one big RCU read-side critical section.
78 Production-quality implementations of rcu_read_lock() and
105 Because the synchronize_rcu() on line 14 waits for all pre-existing
119 +-----------------------------------------------------------------------+
121 +-----------------------------------------------------------------------+
123 | progress concurrently with readers, but pre-existing readers will |
126 +-----------------------------------------------------------------------+
128 +-----------------------------------------------------------------------+
131 | Second, even when using synchronize_rcu(), the other update-side |
132 | code does run concurrently with readers, whether pre-existing or not. |
133 +-----------------------------------------------------------------------+
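To make the shape of this guarantee concrete, here is a minimal sketch that is
not taken from the Linux kernel; the ``struct mydata`` and ``curdata`` names are
invented for illustration, and the pointer-handling primitives used here are
covered in the publish-subscribe discussion below. The key point is that the
``kfree()`` cannot execute until every reader that might have obtained the old
pointer has exited its RCU read-side critical section::

   struct mydata {
           int a;
           int b;
   };

   static struct mydata *curdata;  /* hypothetical RCU-protected pointer */

   /* Reader: never blocked by update_data(). */
   static void reader(void)
   {
           struct mydata *p;

           rcu_read_lock();               /* Begin RCU read-side critical section. */
           p = rcu_dereference(curdata);  /* See the publish-subscribe section. */
           if (p)
                   pr_info("a=%d b=%d\n", p->a, p->b);
           rcu_read_unlock();             /* End RCU read-side critical section. */
   }

   /* Updater: assumed single-threaded for simplicity of this sketch. */
   static void update_data(struct mydata *newp)
   {
           struct mydata *oldp = curdata;

           rcu_assign_pointer(curdata, newp);  /* See the publish-subscribe section. */
           synchronize_rcu();                  /* Wait for all pre-existing readers. */
           kfree(oldp);                        /* No reader can still reference oldp. */
   }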
173 The RCU read-side critical section in do_something_dlm() works with
178 +-----------------------------------------------------------------------+
180 +-----------------------------------------------------------------------+
182 +-----------------------------------------------------------------------+
184 +-----------------------------------------------------------------------+
188 +-----------------------------------------------------------------------+
190 In order to avoid fatal problems such as deadlocks, an RCU read-side
192 Similarly, an RCU read-side critical section must not contain anything
196 Although RCU's grace-period guarantee is useful in and of itself, with
197 `quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
198 be good to be able to use RCU to coordinate read-side access to linked
199 data structures. For this, the grace-period guarantee is not sufficient,
203 non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
211 5 return -ENOMEM;
217 11 p->a = a;
218 12 p->b = b;
233 5 return -ENOMEM;
240 12 p->a = a;
241 13 p->b = b;
247 executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
251 which brings us to the publish-subscribe guarantee discussed in the next
257 RCU's publish-subscribe guarantee allows data to be inserted into a
269 5 return -ENOMEM;
275 11 p->a = a;
276 12 p->b = b;
289 +-----------------------------------------------------------------------+
291 +-----------------------------------------------------------------------+
293 | assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
295 +-----------------------------------------------------------------------+
297 +-----------------------------------------------------------------------+
300 | initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
302 +-----------------------------------------------------------------------+
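Although this document does not depend on any particular implementation, it can
help to know that rcu_assign_pointer() acts as a release store. A simplified
sketch, which omits the real macro's sparse ``__rcu`` checking and its
special-casing of constant ``NULL``, might look like this::

   /* Simplified sketch only; see rcu_assign_pointer() in the kernel sources. */
   #define sketch_rcu_assign_pointer(p, v) smp_store_release(&(p), (v))

The release semantics are what prevent both the compiler and the CPU from
reordering the initialization of ``->a`` and ``->b`` past the store that
publishes the pointer.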
305 control its accesses to the RCU-protected data, as shown in
315 6 do_something(p->a, p->b);
335 5 do_something(gp->a, gp->b);
344 the current structure with a new one, the fetches of ``gp->a`` and
345 ``gp->b`` might well come from two different structures, which could
356 6 do_something(p->a, p->b);
365 barriers in the Linux kernel. Should a |high-quality implementation of
370 outermost RCU read-side critical section containing that
376 .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implement…
377 .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulm…
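On the read side, and again ignoring the lockdep and sparse checking performed
by the real primitive, rcu_dereference() in recent kernels can be thought of as
a marked load whose returned value orders all later dereferences through that
pointer::

   /* Simplified sketch only; the real rcu_dereference() also performs lockdep
    * and sparse checking. (The extra barrier once needed on DEC Alpha is now
    * supplied by READ_ONCE() itself.) */
   #define sketch_rcu_dereference(p) READ_ONCE(p)

Because the reader's fetches of ``->a`` and ``->b`` depend on the pointer value
returned by rcu_dereference(), they are guaranteed to see the values stored
before the corresponding rcu_assign_pointer().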
383 Of course, it is also necessary to remove elements from RCU-protected
387 #. Wait for all pre-existing RCU read-side critical sections to complete
388 (because only pre-existing readers can possibly have a reference to
427 read-side critical section or in a code segment where the pointer
429 update-side lock.
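Putting the remove/wait/free steps together, a minimal sketch (assuming the
global ``gp`` pointer of type ``struct foo *`` from the earlier examples,
protected on the update side by a ``gp_lock`` spinlock) might look as follows;
the lockdep_is_held() expression documents that it is the update-side lock, not
rcu_read_lock(), that makes this particular pointer fetch safe::

   static bool remove_gp_sketch(void)
   {
           struct foo *p;

           spin_lock(&gp_lock);
           p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
           if (!p) {
                   spin_unlock(&gp_lock);
                   return false;
           }
           rcu_assign_pointer(gp, NULL);  /* 1. Remove the element. */
           spin_unlock(&gp_lock);
           synchronize_rcu();             /* 2. Wait for pre-existing readers. */
           kfree(p);                      /* 3. Free the element. */
           return true;
   }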
431 +-----------------------------------------------------------------------+
433 +-----------------------------------------------------------------------+
436 +-----------------------------------------------------------------------+
438 +-----------------------------------------------------------------------+
442 | in a byte-at-a-time manner, resulting in *load tearing*, in turn |
443 | resulting in a bytewise mash-up of two distinct pointer values. It might |
444 | even use value-speculation optimizations, where it makes a wrong |
447 | about any dereferences that returned pre-initialization garbage in |
454 +-----------------------------------------------------------------------+
456 In short, RCU's publish-subscribe guarantee is provided by the
458 guarantee allows data elements to be safely added to RCU-protected
460 can be used in combination with the grace-period guarantee to also allow
461 data elements to be removed from RCU-protected linked data structures,
467 resembling the dependency-ordering barrier that was later subsumed
470 late-1990s meeting with the DEC Alpha architects, back in the days when
471 DEC was still a free-standing company. It took the Alpha architects a
480 Memory-Barrier Guarantees
483 The previous section's simple linked-data-structure scenario clearly
484 demonstrates the need for RCU's stringent memory-ordering guarantees on
487 #. Each CPU that has an RCU read-side critical section that begins
489 memory barrier between the time that the RCU read-side critical
491 this guarantee, a pre-existing RCU read-side critical section might
494 #. Each CPU that has an RCU read-side critical section that ends after
497 time that the RCU read-side critical section begins. Without this
498 guarantee, a later RCU read-side critical section running after the
514 +-----------------------------------------------------------------------+
516 +-----------------------------------------------------------------------+
517 | Given that multiple CPUs can start RCU read-side critical sections at |
519 | whether or not a given RCU read-side critical section starts before a |
521 +-----------------------------------------------------------------------+
523 +-----------------------------------------------------------------------+
524 | If RCU cannot tell whether or not a given RCU read-side critical |
526 | it must assume that the RCU read-side critical section started first. |
528 | waiting on a given RCU read-side critical section only if it can |
534 | within the enclosed RCU read-side critical section to the code |
536 | then a given RCU read-side critical section begins before a given |
547 +-----------------------------------------------------------------------+
549 +-----------------------------------------------------------------------+
551 +-----------------------------------------------------------------------+
554 +-----------------------------------------------------------------------+
556 +-----------------------------------------------------------------------+
564 | #. CPU 1: ``do_something_with(q->a);`` |
571 | end of the RCU read-side critical section and the end of the grace |
584 | #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
588 | grace period and the beginning of the RCU read-side critical section, |
594 | believing that you have adhered to the as-if rule than it is to |
596 +-----------------------------------------------------------------------+
598 +-----------------------------------------------------------------------+
600 +-----------------------------------------------------------------------+
603 | compiler might arbitrarily rearrange consecutive RCU read-side |
604 | critical sections. Given such rearrangement, if a given RCU read-side |
606 | read-side critical sections are done? Won't the compiler |
608 +-----------------------------------------------------------------------+
610 +-----------------------------------------------------------------------+
614 | schedule() had better prevent calling-code accesses to shared |
616 | RCU detects the end of a given RCU read-side critical section, it |
617 | will necessarily detect the end of all prior RCU read-side critical |
621 | loop, into user-mode code, and so on. But if your kernel build allows |
623 +-----------------------------------------------------------------------+
625 Note that these memory-barrier requirements do not replace the
627 pre-existing readers. On the contrary, the memory barriers called out in
635 The common-case RCU primitives are unconditional. They are invoked, they
642 guarantee was reverse-engineered, not premeditated. The unconditional
650 Guaranteed Read-to-Write Upgrade
654 within an RCU read-side critical section. For example, that RCU
655 read-side critical section might search for a given data element, and
656 then might acquire the update-side spinlock in order to update that
657 element, all while remaining in that RCU read-side critical section. Of
658 course, it is necessary to exit the RCU read-side critical section
663 +-----------------------------------------------------------------------+
665 +-----------------------------------------------------------------------+
666 | But how does the upgrade-to-write operation exclude other readers? |
667 +-----------------------------------------------------------------------+
669 +-----------------------------------------------------------------------+
672 +-----------------------------------------------------------------------+
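A minimal sketch of such an upgrade follows; the ``struct elem`` layout and the
``lookup()`` helper are hypothetical, with ``lookup()`` assumed to traverse an
RCU-protected structure using rcu_dereference()::

   struct elem {
           spinlock_t lock;  /* per-element update-side lock */
           bool dead;        /* true once removed from the structure */
           int value;
   };

   struct elem *lookup(int key);  /* hypothetical RCU-protected lookup */

   static bool upgrade_and_update(int key, int newval)
   {
           struct elem *p;
           bool ret = false;

           rcu_read_lock();              /* Reader... */
           p = lookup(key);
           if (p) {
                   spin_lock(&p->lock);  /* ...upgraded to updater. */
                   if (!p->dead) {       /* Recheck now that updates are excluded. */
                           p->value = newval;
                           ret = true;
                   }
                   spin_unlock(&p->lock);
           }
           rcu_read_unlock();
           return ret;
   }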
674 This guarantee allows lookup code to be shared between read-side and
675 update-side code, and was premeditated, appearing in the earliest
678 Fundamental Non-Requirements
679 ----------------------------
681 RCU provides extremely lightweight readers, and its read-side
686 non-guarantees that have caused confusion. Except where otherwise noted,
687 these non-guarantees were premeditated.
692 #. `Grace Periods Don't Partition Read-Side Critical Sections`_
693 #. `Read-Side Critical Sections Don't Partition Grace Periods`_
698 Reader-side markers such as rcu_read_lock() and
700 through their interaction with the grace-period APIs such as
737 significant ordering constraints would slow down these fast-path APIs.
739 +-----------------------------------------------------------------------+
741 +-----------------------------------------------------------------------+
743 +-----------------------------------------------------------------------+
745 +-----------------------------------------------------------------------+
748 +-----------------------------------------------------------------------+
794 +-----------------------------------------------------------------------+
796 +-----------------------------------------------------------------------+
798 | completed instead of waiting only on pre-existing readers. For how |
800 +-----------------------------------------------------------------------+
802 +-----------------------------------------------------------------------+
807 +-----------------------------------------------------------------------+
809 Grace Periods Don't Partition Read-Side Critical Sections
812 It is tempting to assume that if any part of one RCU read-side critical
814 read-side critical section follows that same grace period, then all of
815 the first RCU read-side critical section must precede all of the second.
817 partition the set of RCU read-side critical sections. An example of this
855 that the thread cannot be in the midst of an RCU read-side critical
858 .. kernel-figure:: GPpartitionReaders1.svg
860 If it is necessary to partition RCU read-side critical sections in this
902 the two RCU read-side critical sections cannot overlap, guaranteeing
911 This non-requirement was also non-premeditated, but became apparent when
914 Read-Side Critical Sections Don't Partition Grace Periods
917 It is also tempting to assume that if an RCU read-side critical section
970 .. kernel-figure:: ReadersPartitionGP1.svg
972 Again, an RCU read-side critical section can overlap almost all of a
974 period. As a result, an RCU read-side critical section cannot partition
977 +-----------------------------------------------------------------------+
979 +-----------------------------------------------------------------------+
981 | read-side critical section, would be required to partition the RCU |
982 | read-side critical sections at the beginning and end of the chain? |
983 +-----------------------------------------------------------------------+
985 +-----------------------------------------------------------------------+
990 +-----------------------------------------------------------------------+
992 Parallelism Facts of Life
993 -------------------------
1000 completely futile. This is most obvious in preemptible user-level
1003 hypervisor), but can also happen in bare-metal environments due to
1007 but where “extremely long” is not long enough to allow wrap-around
1008 when incrementing a 64-bit counter.
1010 matters, RCU must use compiler directives and memory-barrier
1014 writes and more-frequent concurrent writes will result in more
1021 #. Counters are finite, especially on 32-bit systems. RCU's use of
1026 dyntick-idle nesting counter allows 54 bits for interrupt nesting
1027 level (this counter is 64 bits even on a 32-bit system). Overflowing
1028 this counter requires 2\ :sup:`54` half-interrupts on a given CPU
1029 without that CPU ever going idle. If a half-interrupt happened every
1033 kernel in a single shared-memory environment. RCU must therefore pay
1034 close attention to high-end scalability.
1042 Quality-of-Implementation Requirements
1043 --------------------------------------
1045 These sections list quality-of-implementation requirements. Although an
1048 inappropriate for industrial-strength production use. Classes of
1049 quality-of-implementation requirements are as follows:
1062 RCU is and always has been intended primarily for read-mostly
1063 situations, which means that RCU's read-side primitives are optimized,
1064 often at the expense of its update-side primitives. Experience thus far
1067 #. Read-mostly data, where stale and inconsistent data is not a problem:
1069 #. Read-mostly data, where data must be consistent: RCU works well.
1070 #. Read-write data, where data must be consistent: RCU *might* work OK.
1072 #. Write-mostly data, where data must be consistent: RCU is very
1076 a. Existence guarantees for update-friendly mechanisms.
1077 b. Wait-free read-side primitives for real-time use.
1079 This focus on read-mostly situations means that RCU must interoperate
1084 primitives be legal within RCU read-side critical sections, including
1088 +-----------------------------------------------------------------------+
1090 +-----------------------------------------------------------------------+
1092 +-----------------------------------------------------------------------+
1094 +-----------------------------------------------------------------------+
1095 | These are forbidden within Linux-kernel RCU read-side critical |
1097 | case, voluntary context switch) within an RCU read-side critical |
1099 | read-side critical sections, and also within Linux-kernel sleepable |
1100 | RCU `(SRCU) <Sleepable RCU_>`__ read-side critical sections. In |
1101 | addition, the -rt patchset turns spinlocks into sleeping locks so |
1104 | locks!) may be acquired within -rt-Linux-kernel RCU read-side critical |
1106 | Note that it *is* legal for a normal RCU read-side critical section |
1114 +-----------------------------------------------------------------------+
1126 inconsistency must be tolerated due to speed-of-light delays if nothing
1152 For example, the translation between a user-level SystemV semaphore ID
1153 to the corresponding in-kernel data structure is protected by RCU, but
1156 by acquiring spinlocks located in the in-kernel data structure from
1157 within the RCU read-side critical section, and this is indicated by the
1171 Linux-kernel RCU implementations must therefore avoid unnecessarily
1175 energy efficiency in battery-powered systems and on specific
1176 energy-efficiency shortcomings of the Linux-kernel RCU implementation.
1177 In my experience, the battery-powered embedded community will consider
1179 mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
1184 `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
1185 footprint is critically important on single-CPU systems with
1186 non-preemptible (``CONFIG_PREEMPTION=n``) kernels, and thus `tiny
1188 was born. Josh Triplett has since taken over the small-memory banner
1194 unsurprising. For example, in keeping with RCU's read-side
1197 Similarly, in non-preemptible environments, rcu_read_lock() and
1200 In preemptible environments, in the case where the RCU read-side
1202 highest-priority real-time process), rcu_read_lock() and
1204 should not contain atomic read-modify-write operations, memory-barrier
1206 branches. However, in the case where the RCU read-side critical section
1208 interrupts. This is why it is better to nest an RCU read-side critical
1209 section within a preempt-disable region than vice versa, at least in
1211 degrading real-time latencies.
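For the avoidance of doubt, the preferred nesting is sketched below, with the
actual accesses elided::

   static void preferred_nesting(void)
   {
           preempt_disable();
           rcu_read_lock();  /* Reader nested inside the preempt-disable region. */
           /* ...access RCU-protected data... */
           rcu_read_unlock();
           preempt_enable();
   }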
1213 The synchronize_rcu() grace-period-wait primitive is optimized for
1215 addition to the duration of the longest RCU read-side critical section.
1218 they can be satisfied by a single underlying grace-period-wait
1220 single grace-period-wait operation to serve more than `1,000 separate
1221 … <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-s…
1222 of synchronize_rcu(), thus amortizing the per-invocation overhead
1223 down to nearly zero. However, the grace-period optimization is also
1224 required to avoid measurable degradation of real-time scheduling and
1227 In some cases, the multi-millisecond synchronize_rcu() latencies are
1229 used instead, reducing the grace-period latency down to a few tens of
1230 microseconds on small systems, at least in cases where the RCU read-side
1238 to impose modest degradation of real-time latency on non-idle online
1240 scheduling-clock interrupt.
1243 synchronize_rcu_expedited()'s reduced grace-period latency is
1273 25 call_rcu(&p->rh, remove_gp_cb);
1279 lines 1-5. The function remove_gp_cb() is passed to call_rcu()
1285 be legal, including within preempt-disable code, local_bh_disable()
1286 code, interrupt-disable code, and interrupt handlers. However, even
1293 takes too long. Long-running operations should be relegated to separate
1296 +-----------------------------------------------------------------------+
1298 +-----------------------------------------------------------------------+
1303 +-----------------------------------------------------------------------+
1305 +-----------------------------------------------------------------------+
1306 | Presumably the ``->gp_lock`` acquired on line 18 excludes any |
1309 | after ``->gp_lock`` is released on line 25, which in turn means that |
1311 +-----------------------------------------------------------------------+
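For readers who have not seen a call_rcu() callback before, the following
sketch shows the usual shape of such a callback, along the lines of the
remove_gp_cb() mentioned above, assuming that ``struct foo`` embeds a
``struct rcu_head`` named ``rh``::

   static void remove_gp_cb_sketch(struct rcu_head *rhp)
   {
           struct foo *p = container_of(rhp, struct foo, rh);

           kfree(p);  /* Runs only after a subsequent grace period. */
   }

When the callback does nothing but invoke kfree(), the ``kfree_rcu(p, rh)``
shorthand avoids writing the callback at all.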
1350 open-coded it.
1352 +-----------------------------------------------------------------------+
1354 +-----------------------------------------------------------------------+
1360 +-----------------------------------------------------------------------+
1362 +-----------------------------------------------------------------------+
1364 | definition would say that updates in garbage-collected languages |
1371 +-----------------------------------------------------------------------+
1375 be carried out in the meantime? The polling-style
1414 In theory, delaying grace-period completion and callback invocation is
1421 example, an infinite loop in an RCU read-side critical section must by
1423 involved example, consider a 64-CPU system built with
1424 ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
1442 corresponding CPU's next scheduling-clock.
1444 indefinitely in the kernel without scheduling-clock interrupts, which
1449 has been preempted within an RCU read-side critical section is
1460 caution when changing them. Note that these forward-progress measures
1465 invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
1476 #. Lifts callback-execution batch limits, which speeds up callback
1480 overridden. Again, these forward-progress measures are provided only for
1482 RCU`_. Even for RCU, callback-invocation forward
1483 progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
1487 additional forward-progress work will be required.
1493 part due to the collision of multicore hardware with object-oriented
1494 techniques designed in single-threaded environments for single-threaded
1495 use. And in theory, RCU read-side critical sections may be composed, and
1497 real-world implementations of composable constructs, there are
1501 rcu_read_unlock() generate no code, such as Linux-kernel RCU when
1516 the nesting-depth counter. For example, the Linux kernel's preemptible
1518 practical purposes. That said, a consecutive pair of RCU read-side
1520 grace period cannot be enclosed in another RCU read-side critical
1522 within an RCU read-side critical section: To do so would result either
1523 in deadlock or in RCU implicitly splitting the enclosing RCU read-side
1524 critical section, neither of which is conducive to a long-lived and
1528 example, many transactional-memory implementations prohibit composing a
1530 a network receive operation). For another example, lock-based critical
1534 In short, although RCU read-side critical sections are highly
1542 read-side critical sections, perhaps even so intense that there was
1544 read-side critical section in flight. RCU cannot allow this situation to
1545 block grace periods: As long as all the RCU read-side critical sections
1549 RCU read-side critical sections being preempted for long durations,
1550 which has the effect of creating a long-duration RCU read-side critical
1552 systems using real-time priorities are of course more vulnerable.
1563 rates should not delay RCU read-side critical sections, although some
1564 small read-side delays can occur when using
1569 1990s, a simple user-level test consisting of ``close(open(path))`` in a
1571 appreciation of the high-update-rate corner case. This test also
1576 grace-period processing. This evasive action causes the grace period to
1581 Software-Engineering Requirements
1582 ---------------------------------
1589 splat if rcu_dereference() is used outside of an RCU read-side
1590 critical section. Update-side code can use
1592 expression <https://lwn.net/Articles/371986/>`__ to indicate what is
1602 that it has been invoked within an RCU read-side critical section. I
1605 #. A given function might wish to check for RCU-related preconditions
1611 assignment. To catch this sort of error, a given RCU-protected
1613 complain about simple-assignment accesses to that pointer. Arnd
1615 needed `patch series <https://lwn.net/Articles/376011/>`__.
1624 non-stack ``rcu_head`` structures must be initialized with
1629 #. An infinite loop in an RCU read-side critical section will eventually
1635 waiting on that particular RCU read-side critical section.
1641 stall warnings are counter-productive during sysrq dumps and during
1654 read-side critical sections, there is currently no good way of doing
1658 #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
1660 #. Open-coded use of rcu_assign_pointer() and rcu_dereference()
1662 error-prone. Therefore, RCU-protected `linked
1663 lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
1664 more recently, RCU-protected `hash
1665 tables <https://lwn.net/Articles/612100/>`__ are available. Many
1666 other special-purpose RCU-protected data structures are available in
1675 This is not a hard-and-fast list: RCU's diagnostic capabilities will
1677 real-world RCU usage.
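As an example of the special-purpose APIs called out in the list above, the
following sketch uses the RCU-protected list primitives rather than open-coded
rcu_assign_pointer() and rcu_dereference(); the ``struct item`` layout, the
``items`` list, and the ``items_lock`` spinlock are all hypothetical::

   struct item {
           struct list_head node;
           int key;
           int value;
           struct rcu_head rh;
   };

   static LIST_HEAD(items);             /* hypothetical RCU-protected list */
   static DEFINE_SPINLOCK(items_lock);  /* serializes updaters */

   /* Reader: lockless lookup under RCU protection. */
   static bool item_lookup(int key, int *value)
   {
           struct item *p;
           bool found = false;

           rcu_read_lock();
           list_for_each_entry_rcu(p, &items, node) {
                   if (p->key == key) {
                           *value = p->value;
                           found = true;
                           break;
                   }
           }
           rcu_read_unlock();
           return found;
   }

   /* Updaters: insertion and removal under the lock. */
   static void item_add(struct item *p)
   {
           spin_lock(&items_lock);
           list_add_rcu(&p->node, &items);
           spin_unlock(&items_lock);
   }

   static void item_del(struct item *p)
   {
           spin_lock(&items_lock);
           list_del_rcu(&p->node);
           spin_unlock(&items_lock);
           kfree_rcu(p, rh);  /* Freed only after a grace period. */
   }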
1679 Linux Kernel Complications
1680 --------------------------
1696 #. `Scheduling-Clock Interrupts and RCU`_
1701 notable Linux-kernel complications. Each of the following sections
1719 `remind <https://lore.kernel.org/r/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.c…
1731 used to do, it would create too many per-CPU kthreads. Although the
1748 ``task_struct`` is available and the boot CPU's per-CPU variables are
1749 set up. The read-side primitives (rcu_read_lock(),
1769 itself is a quiescent state and thus a grace period, so the early-boot
1770 implementation can be a no-op.
1775 reason is that an RCU read-side critical section might be preempted,
1785 +-----------------------------------------------------------------------+
1787 +-----------------------------------------------------------------------+
1790 +-----------------------------------------------------------------------+
1792 +-----------------------------------------------------------------------+
1797 | grace-period mechanism. At runtime, this expedited mechanism relies |
1799 | drives the desired expedited grace period. Because dead-zone |
1813 +-----------------------------------------------------------------------+
1815 I learned of these boot-time requirements as a result of a series of
1821 The Linux kernel has interrupts, and RCU read-side critical sections are
1822 legal within interrupt handlers and within interrupt-disabled regions of
1825 Some Linux-kernel architectures can enter an interrupt handler from
1826 non-idle process context, and then just never leave it, instead
1829 “half-interrupts” mean that RCU has to be very careful about how it
1831 way during a rewrite of RCU's dyntick-idle code.
1833 The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
1835 update-side primitives, including call_rcu(), are prohibited within
1838 The name notwithstanding, some Linux-kernel architectures can have
1840 me <https://lore.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
1858 one of its functions results in a segmentation fault. The module-unload
1859 functions must therefore cancel any delayed calls to loadable-module
1867 module unload request, we need some other way to deal with in-flight RCU
1871 in-flight RCU callbacks have been invoked. If a module uses
1874 the underlying module-unload code could invoke rcu_barrier()
1879 filesystem-unmount situation, and Dipankar Sarma incorporated
1893 +-----------------------------------------------------------------------+
1895 +-----------------------------------------------------------------------+
1897 | complete, and rcu_barrier() must wait for each pre-existing |
1901 +-----------------------------------------------------------------------+
1903 +-----------------------------------------------------------------------+
1913 | pre-existing callbacks, you will need to invoke both |
1916 +-----------------------------------------------------------------------+
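A module that uses call_rcu() therefore typically ends up with an exit function
shaped like the following sketch (the module and its hooks are hypothetical)::

   static void __exit mymod_exit(void)
   {
           /* First ensure that this module posts no further RCU callbacks. */

           /* Then wait for all of its already-posted callbacks to be invoked. */
           rcu_barrier();

           /* If call_srcu() was also used: srcu_barrier(&my_srcu); */
   }
   module_exit(mymod_exit);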
1923 offline CPU, with the exception of `SRCU <Sleepable RCU_>`__ read-side
1925 DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
1928 The Linux-kernel CPU-hotplug implementation has notifiers that are used
1930 appropriately to a given CPU-hotplug operation. Most RCU operations may
1931 be invoked from CPU-hotplug notifiers, including even synchronous
1932 grace-period operations such as (synchronize_rcu() and
1938 In addition, all-callback-wait operations such as rcu_barrier() may
1939 not be invoked from any CPU-hotplug notifier. This restriction is due
1940 to the fact that there are phases of CPU-hotplug operations where the
1941 outgoing CPU's callbacks will not be invoked until after the CPU-hotplug
1943 rcu_barrier() blocks CPU-hotplug operations during its execution,
1944 which results in another type of deadlock when invoked from a CPU-hotplug
1952 for the force-quiescent-state loop (FQS) to report quiescent states for
1963 The CPU-online path (rcutree_report_cpu_starting()) should never need to report
1976 RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
1978 RCU's violation of it when running context-switch-heavy workloads when
1982 context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
1986 scheduler's runqueue or priority-inheritance spinlocks across an
1988 somewhere within the corresponding RCU read-side critical section.
1994 nesting. The fact that interrupt-disabled regions of code act as RCU
1995 read-side critical sections implicitly avoids earlier issues that used
2012 The kernel needs to access user-space memory, for example, to access data
2013 referenced by system-call parameters. The get_user() macro does this job.
2015 However, user-space memory might well be paged out, which means that
2016 get_user() might well page-fault and thus block while waiting for the
2018 reorder a get_user() invocation into an RCU read-side critical section.
2026 3 v = p->value;
2039 4 v = p->value;
2045 state in the middle of an RCU read-side critical section. This misplaced
2046 quiescent state could result in line 4 being a use-after-free access,
2054 ``p->value`` is not volatile, so the compiler would not have any reason to keep
2057 Therefore, the Linux-kernel definitions of rcu_read_lock() and
2060 of RCU read-side critical sections.
2066 by people with battery-powered embedded systems. RCU therefore conserves
2069 energy-efficiency requirement, so I learned of this via an irate phone
2073 RCU read-side critical section on an idle CPU. (Kernels built with
2082 These energy-efficiency requirements have proven quite difficult to
2084 clean-sheet rewrites of RCU's energy-efficiency code, the last of which
2089 phone calls: Flaming me on the Linux-kernel mailing list was apparently
2090 not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
2092 Scheduling-Clock Interrupts and RCU
2095 The kernel transitions between in-kernel non-idle execution, userspace
2099 +-----------------+------------------+------------------+-----------------+
2100 | ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
2103 | | scheduling-clock | scheduling-clock | RCU's |
2104 | | interrupt. | interrupt and | dyntick-idle |
2108 +-----------------+------------------+------------------+-----------------+
2110 | | scheduling-clock | scheduling-clock | RCU's |
2111 | | interrupt. | interrupt and | dyntick-idle |
2115 +-----------------+------------------+------------------+-----------------+
2118 | | on | dyntick-idle | dyntick-idle |
2119 | | scheduling-clock | detection. | detection. |
2127 +-----------------+------------------+------------------+-----------------+
2129 +-----------------------------------------------------------------------+
2131 +-----------------------------------------------------------------------+
2132 | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
2133 | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
2135 +-----------------------------------------------------------------------+
2137 +-----------------------------------------------------------------------+
2139 | necessarily re-enable the scheduling-clock interrupt on entry to each |
2141 +-----------------------------------------------------------------------+
2147 scheduling-clock interrupt be enabled when RCU needs it to be:
2150 is non-idle, the scheduling-clock tick had better be running.
2152 (11-second) grace periods, with a pointless IPI waking the CPU from
2154 #. If a CPU is in a portion of the kernel that executes RCU read-side
2160 no-joking guaranteed to never execute any RCU read-side critical
2162 sort of thing is used by some architectures for light-weight
2169 fact joking about not doing RCU read-side critical sections.
2170 #. If a CPU is executing in the kernel with the scheduling-clock
2171 interrupt disabled and RCU believes this CPU to be non-idle, and if
2181 is usually OK) and the scheduling-clock interrupt is enabled, of
2186 +-----------------------------------------------------------------------+
2188 +-----------------------------------------------------------------------+
2192 +-----------------------------------------------------------------------+
2194 +-----------------------------------------------------------------------+
2196 | often. But given that long-running interrupt handlers can cause other |
2199 +-----------------------------------------------------------------------+
2202 between in-kernel execution, usermode execution, and idle, and as long
2203 as the scheduling-clock interrupt is enabled when RCU needs it to be,
2210 Although small-memory non-realtime systems can simply use Tiny RCU, code
2214 pair of pointers, it does appear in many RCU-protected data structures,
2219 This need for memory efficiency is one reason that RCU uses hand-crafted
2224 posted them. Although this information might appear in debug-only kernel
2225 builds at some point, in the meantime, the ``->func`` field will often
2233 conditions <https://lore.kernel.org/r/1439976106-137226-1-git-send-email-kirill.shutemov@linux.inte…
2234 the Linux kernel's memory-management subsystem needs a particular bit to
2235 remain zero during all phases of grace-period processing, and that bit
2237 ``->next`` field. RCU makes this guarantee as long as call_rcu() is
2240 energy-efficiency purposes.
2243 structure be aligned to a two-byte boundary, and passing a misaligned
2247 a four-byte or even eight-byte alignment requirement? Because the m68k
2248 architecture provides only two-byte alignment, and thus acts as
2254 potentially have energy-efficiency benefits, but only if the rate of
2255 non-lazy callbacks decreases significantly for some important workload.
2264 hot code paths in performance-critical portions of the Linux kernel's
2267 read-side primitives. To that end, it would be good if preemptible RCU's
2279 minimal per-operation overhead. In fact, in many cases, increasing load
2280 must *decrease* the per-operation overhead, witness the batching
2286 The Linux kernel is used for real-time workloads, especially in
2287 conjunction with the `-rt
2289 real-time-latency response requirements are such that the traditional
2290 approach of disabling preemption across RCU read-side critical sections
2292 an RCU implementation that allows RCU read-side critical sections to be
2294 clear that an earlier `real-time
2295 patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in
2298 encountered by a very early version of the -rt patchset.
2300 In addition, RCU must make do with a sub-100-microsecond real-time
2301 latency budget. In fact, on smaller systems with the -rt patchset, the
2302 Linux kernel provides sub-20-microsecond real-time latencies for the
2305 surprise, the sub-100-microsecond real-time latency budget `applies to
2308 up to and including systems with 4096 CPUs. This real-time requirement
2309 motivated the grace-period kthread, which also simplified handling of a
2312 RCU must avoid degrading real-time response for CPU-bound threads,
2314 ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
2322 stress-test suite. This stress-test suite is called ``rcutorture``.
2328 today, given Android smartphones, Linux-powered televisions, and
2338 jurisdictions, a successful multi-year test of a given mechanism, which
2340 safety-critical certifications. In fact, rumor has it that the Linux
2341 kernel is already being used in production for safety-critical
2346 Other RCU Flavors
2347 -----------------
2352 implementations, non-preemptible and preemptible. The other four flavors
2356 #. `Bottom-Half Flavor (Historical)`_
2362 Bottom-Half Flavor (Historical)
2365 The RCU-bh flavor of RCU has since been expressed in terms of the other
2367 single flavor. The read-side API remains, and continues to disable
2371 The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
2372 flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
2373 flavor of RCU that could withstand the network-based denial-of-service
2378 grace periods from ever ending. The result was an out-of-memory
2381 The solution was the creation of RCU-bh, which does
2382 local_bh_disable() across its read-side critical sections, and which
2385 offline. This means that RCU-bh grace periods can complete even when
2387 algorithms based on RCU-bh to withstand network-based denial-of-service
2391 re-enable softirq handlers, any attempt to start a softirq handler
2392 during the RCU-bh read-side critical section will be deferred. In this
2395 overhead should be associated with the code following the RCU-bh
2396 read-side critical section rather than rcu_read_unlock_bh(), but the
2398 of fine distinction. For example, suppose that a three-millisecond-long
2399 RCU-bh read-side critical section executes during a time of heavy
2406 The `RCU-bh
2407 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2410 old RCU-bh update-side APIs are now gone, replaced by synchronize_rcu(),
2412 anything that disables bottom halves also marks an RCU-bh read-side
2419 The RCU-sched flavor of RCU has since been expressed in terms of the
2421 single flavor. The read-side API remains, and continues to disable
2426 effect of also waiting for all pre-existing interrupt and NMI handlers.
2427 However, there are legitimate preemptible-RCU implementations that do
2429 RCU read-side critical section can be a quiescent state. Therefore,
2430 *RCU-sched* was created, which follows “classic” RCU in that an
2431 RCU-sched grace period waits for pre-existing interrupt and NMI
2433 RCU-sched APIs have identical implementations, while kernels built with
2438 re-enable preemption, respectively. This means that if there was a
2439 preemption attempt during the RCU-sched read-side critical section,
2443 very slowly. However, the highest-priority task won't be preempted, so
2444 that task will enjoy low-overhead rcu_read_unlock_sched()
2447 The `RCU-sched
2448 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2452 rcu_read_lock_sched_held(). However, the old RCU-sched update-side APIs
2455 preemption also marks an RCU-sched read-side critical section,
2463 read-side critical section” was a reliable indication that this someone
2465 read-side critical section, you can probably afford to use a
2466 higher-overhead synchronization mechanism. However, that changed with
2467 the advent of the Linux kernel's notifiers, whose RCU read-side critical
2469 introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or
2478 That said, one consequence of these domains is that read-side code must
2490 As noted above, it is legal to block within SRCU read-side critical
2492 block forever in one of a given domain's SRCU read-side critical
2495 happen if any operation in a given domain's SRCU read-side critical
2497 period to elapse. For example, this results in a self-deadlock:
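::

   /*
    * Sketch of the self-deadlock, assuming an SRCU domain "ss", for
    * example one created with DEFINE_SRCU(ss): the synchronize_srcu()
    * waits for the very SRCU reader that contains it.
    */
   int idx;

   idx = srcu_read_lock(&ss);
   /* ... */
   synchronize_srcu(&ss);  /* Never returns. */
   srcu_read_unlock(&ss, idx);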
2512 ``ss1``-domain SRCU read-side critical section acquired another mutex
2513 that was held across an ``ss``-domain synchronize_srcu(), deadlock
2518 Unlike the other RCU flavors, SRCU read-side critical sections can run
2527 invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
2531 the CPU-hotplug process. The problem is that if a notifier is waiting on
2536 CPU-hotplug notifiers.
2539 non-expedited grace periods are implemented by the same mechanism. This
2549 As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
2561 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2576 specified cookie corresponds to an already-completed
2584 certain types of buffer-cache algorithms having multi-stage age-out
2595 anywhere in the code, it is not possible to use read-side markers such
2605 trampoline would be pre-ordained a surprisingly long time before execution
2609 RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
2613 userspace execution also delimit tasks-RCU read-side critical sections.
2617 Note well that involuntary context switches are *not* Tasks-RCU quiescent
2619 trampoline might be preempted. In this case, the Tasks-RCU grace period
2625 The tasks-RCU API is quite compact, consisting only of
2637 Some forms of tracing need to wait for all preemption-disabled regions
2642 moniker. And this operation is considered to be quite rude by real-time
2644 by battery-powered systems that don't want their idle CPUs to be awakened.
2646 Once kernel entry/exit and deep-idle functions have been properly tagged
2651 The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
2658 SRCU's read-side overhead, which includes a full memory barrier in both
2661 readers. Real-time systems that cannot tolerate IPIs may build their
2663 the expense of adding full memory barriers to the read-side primitives.
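For orientation, the Tasks Trace RCU usage pattern mirrors that of the other
flavors; the sketch below shows only the read-side markers and the synchronous
grace-period wait, with the protected accesses elided::

   static void trace_reader_sketch(void)
   {
           rcu_read_lock_trace();  /* Tasks Trace RCU read-side marker. */
           /* ...access data protected by Tasks Trace RCU... */
           rcu_read_unlock_trace();
   }

   static void trace_updater_sketch(void)
   {
           /* ...make the data unreachable to new readers... */
           synchronize_rcu_tasks_trace();  /* Wait for pre-existing trace readers. */
           /* ...then free it... */
   }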
2665 The tasks-trace-RCU API is also reasonably compact,
2670 Possible Future Changes
2671 -----------------------
2673 One of the tricks that RCU uses to attain update-side scalability is to
2674 increase grace-period latency with increasing numbers of CPUs. If this
2676 grace-period state machine so as to avoid the need for the additional
2681 rcu_barrier() in CPU-hotplug notifiers, it will be necessary to
2685 The tradeoff between grace-period latency on the one hand and
2687 re-examined. The desire is of course for zero grace-period latency as
2697 be unnecessary because the hotpath read-side primitives do not access
2708 carefully run and realistic system-level workload.
2722 Additional work may be required to provide reasonable forward-progress
2726 Summary
2727 -------
2734 Acknowledgments
2735 ---------------