Lines Matching +full:a +full:- +full:side
2 A Tour Through RCU's Requirements
16 ------------
18 Read-copy update (RCU) is a synchronization mechanism that is often used
19 as a replacement for reader-writer locking. RCU is unusual in that
20 updaters do not block readers, which means that RCU's read-side
28 thought of as an informal, high-level specification for RCU. It is
32 has this learning process been a lot of fun, but it has also been a
40 #. `Fundamental Non-Requirements`_
42 #. `Quality-of-Implementation Requirements`_
44 #. `Software-Engineering Requirements`_
48 This is followed by a summary_; however, the answers to
53 ------------------------
58 #. `Grace-Period Guarantee`_
60 #. `Memory-Barrier Guarantees`_
62 #. `Guaranteed Read-to-Write Upgrade`_
64 Grace-Period Guarantee
67 RCU's grace-period guarantee is unusual in being premeditated: Jack
70 two decades of experience with RCU have produced a much more detailed
73 RCU's grace-period guarantee allows updaters to wait for the completion
74 of all pre-existing RCU read-side critical sections. An RCU read-side
77 RCU treats a nested set as one big RCU read-side critical section.
78 Production-quality implementations of rcu_read_lock() and
105 Because the synchronize_rcu() on line 14 waits for all pre-existing
106 readers, any instance of thread0() that loads a value of zero from
108 instance must also load a value of zero from ``y``. Similarly, any
109 instance of thread0() that loads a value of one from ``y`` must have
111 load a value of one from ``x``. Therefore, the outcome:
119 +-----------------------------------------------------------------------+
121 +-----------------------------------------------------------------------+
122 | Wait a minute! You said that updaters can make useful forward |
123 | progress concurrently with readers, but pre-existing readers will |
126 +-----------------------------------------------------------------------+
128 +-----------------------------------------------------------------------+
131 | Second, even when using synchronize_rcu(), the other update-side |
132 | code does run concurrently with readers, whether pre-existing or not. |
133 +-----------------------------------------------------------------------+
136 `DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
137 distributed lock manager's transition into a state suitable for handling
173 The RCU read-side critical section in do_something_dlm() works with
178 +-----------------------------------------------------------------------+
180 +-----------------------------------------------------------------------+
182 +-----------------------------------------------------------------------+
184 +-----------------------------------------------------------------------+
188 +-----------------------------------------------------------------------+
190 In order to avoid fatal problems such as deadlocks, an RCU read-side
192 Similarly, an RCU read-side critical section must not contain anything
196 Although RCU's grace-period guarantee is useful in and of itself, with
197 `quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
198 be good to be able to use RCU to coordinate read-side access to linked
199 data structures. For this, the grace-period guarantee is not sufficient,
203 non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
207 1 bool add_gp_buggy(int a, int b)
211 5 return -ENOMEM;
217 11 p->a = a;
218 12 p->b = b;
229 1 bool add_gp_buggy_optimized(int a, int b)
233 5 return -ENOMEM;
240 12 p->a = a;
241 13 p->b = b;
247 executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
251 which brings us to the publish-subscribe guarantee discussed in the next
257 RCU's publish-subscribe guarantee allows data to be inserted into a
265 1 bool add_gp(int a, int b)
269 5 return -ENOMEM;
275 11 p->a = a;
276 12 p->b = b;
282 The rcu_assign_pointer() on line 13 is conceptually equivalent to a
287 ``gp`` as a scratch location immediately preceding the assignment.
289 +-----------------------------------------------------------------------+
291 +-----------------------------------------------------------------------+
293 | assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
295 +-----------------------------------------------------------------------+
297 +-----------------------------------------------------------------------+
300 | initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
302 +-----------------------------------------------------------------------+
305 control its accesses to the RCU-protected data, as shown in
315 6 do_something(p->a, p->b);
323 However, this temptation must be resisted because there are a
327 ``gp`` rather than keeping a separate copy in ``p`` as follows:
335 5 do_something(gp->a, gp->b);
343 If this function ran concurrently with a series of updates that replaced
344 the current structure with a new one, the fetches of ``gp->a`` and
345 ``gp->b`` might well come from two different structures, which could
356 6 do_something(p->a, p->b);
365 barriers in the Linux kernel. Should a |high-quality implementation of
367 ever appear, then rcu_dereference() could be implemented as a
368 ``memory_order_consume`` load. Regardless of the exact implementation, a
370 outermost RCU read-side critical section containing that
376 .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implement…
377 .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulm…
381 ensure that readers have a consistent view of newly added data elements.
383 Of course, it is also necessary to remove elements from RCU-protected
387 #. Wait for all pre-existing RCU read-side critical sections to complete
388 (because only pre-existing readers can possibly have a reference to
390 #. At this point, only the updater has a reference to the newly removed
415 This function is straightforward, with line 13 waiting for a grace
427 read-side critical section or in a code segment where the pointer
429 update-side lock.
431 +-----------------------------------------------------------------------+
433 +-----------------------------------------------------------------------+
436 +-----------------------------------------------------------------------+
438 +-----------------------------------------------------------------------+
440 | use rcu_dereference(). It could reuse a value formerly fetched |
442 | in a byte-at-a-time manner, resulting in *load tearing*, in turn |
443 | resulting in a bytewise mash-up of two distinct pointer values. It might |
444 | even use value-speculation optimizations, where it makes a wrong |
447 | about any dereferences that returned pre-initialization garbage in |
454 +-----------------------------------------------------------------------+
456 In short, RCU's publish-subscribe guarantee is provided by the
458 guarantee allows data elements to be safely added to RCU-protected
460 can be used in combination with the grace-period guarantee to also allow
461 data elements to be removed from RCU-protected linked data structures,
467 resembling the dependency-ordering barrier that was later subsumed
469 need for these operations made itself known quite suddenly at a
470 late-1990s meeting with the DEC Alpha architects, back in the days when
471 DEC was still a free-standing company. It took the Alpha architects a
473 and it then took me a good *two* hours to convince them that their
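The publish-subscribe pattern described in this section can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions, not the kernel's implementation: ``toy_add_gp()`` and ``toy_read_gp()`` are hypothetical names, the release store stands in for rcu_assign_pointer(), and an acquire load (stronger than the ``memory_order_consume`` load mentioned above) stands in for rcu_dereference(). The sketch omits the update-side lock and any removal or grace-period handling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct foo {
	int a;
	int b;
};

/* Hypothetical user-space stand-in for the document's global pointer gp. */
static _Atomic(struct foo *) gp;

/*
 * Publish: fully initialize the element, then make it reachable.
 * The release store keeps the initialization of ->a and ->b ordered
 * before the pointer becomes visible to readers.
 */
bool toy_add_gp(int a, int b)
{
	struct foo *p = malloc(sizeof(*p));

	if (!p)
		return false;
	p->a = a;
	p->b = b;
	atomic_store_explicit(&gp, p, memory_order_release);
	return true;
}

/*
 * Subscribe: fetch the pointer exactly once into a local copy and use
 * only that copy, so both fields come from the same structure.
 */
bool toy_read_gp(int *a, int *b)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_acquire);

	if (!p)
		return false;
	*a = p->a;
	*b = p->b;
	return true;
}
```

A reader that calls ``toy_read_gp()`` before anything has been published gets ``false``; after ``toy_add_gp(1, 2)`` it sees both fields fully initialized. Fetching ``gp`` twice instead of keeping the local copy would reintroduce the two-different-structures problem described above.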
480 Memory-Barrier Guarantees
483 The previous section's simple linked-data-structure scenario clearly
484 demonstrates the need for RCU's stringent memory-ordering guarantees on
487 #. Each CPU that has an RCU read-side critical section that begins
488 before synchronize_rcu() starts is guaranteed to execute a full
489 memory barrier between the time that the RCU read-side critical
491 this guarantee, a pre-existing RCU read-side critical section might
492 hold a reference to the newly removed ``struct foo`` after the
494 #. Each CPU that has an RCU read-side critical section that ends after
495 synchronize_rcu() returns is guaranteed to execute a full memory
497 time that the RCU read-side critical section begins. Without this
498 guarantee, a later RCU read-side critical section running after the
501 #. If the task invoking synchronize_rcu() remains on a given CPU,
502 then that CPU is guaranteed to execute a full memory barrier sometime
506 #. If the task invoking synchronize_rcu() migrates among a group of
508 guaranteed to execute a full memory barrier sometime during the
514 +-----------------------------------------------------------------------+
516 +-----------------------------------------------------------------------+
517 | Given that multiple CPUs can start RCU read-side critical sections at |
519 | whether or not a given RCU read-side critical section starts before a |
521 +-----------------------------------------------------------------------+
523 +-----------------------------------------------------------------------+
524 | If RCU cannot tell whether or not a given RCU read-side critical |
525 | section starts before a given instance of synchronize_rcu(), then |
526 | it must assume that the RCU read-side critical section started first. |
527 | In other words, a given instance of synchronize_rcu() can avoid |
528 | waiting on a given RCU read-side critical section only if it can |
530 | A related question is “When rcu_read_lock() doesn't generate any |
531 | code, why does it matter how it relates to a grace period?” The |
534 | within the enclosed RCU read-side critical section to the code |
536 | then a given RCU read-side critical section begins before a given |
547 +-----------------------------------------------------------------------+
549 +-----------------------------------------------------------------------+
551 +-----------------------------------------------------------------------+
554 +-----------------------------------------------------------------------+
556 +-----------------------------------------------------------------------+
564 | #. CPU 1: ``do_something_with(q->a);`` |
570 | Therefore, there absolutely must be a full memory barrier between the |
571 | end of the RCU read-side critical section and the end of the grace |
584 | #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
587 | And similarly, without a memory barrier between the beginning of the |
588 | grace period and the beginning of the RCU read-side critical section, |
592 | acts as if the appropriate memory barriers were in place is a correct |
594 | believing that you have adhered to the as-if rule than it is to |
596 +-----------------------------------------------------------------------+
598 +-----------------------------------------------------------------------+
600 +-----------------------------------------------------------------------+
603 | compiler might arbitrarily rearrange consecutive RCU read-side |
604 | critical sections. Given such rearrangement, if a given RCU read-side |
606 | read-side critical sections are done? Won't the compiler |
608 +-----------------------------------------------------------------------+
610 +-----------------------------------------------------------------------+
614 | schedule() had better prevent calling-code accesses to shared |
616 | RCU detects the end of a given RCU read-side critical section, it |
617 | will necessarily detect the end of all prior RCU read-side critical |
621 | loop, into user-mode code, and so on. But if your kernel build allows |
623 +-----------------------------------------------------------------------+
625 Note that these memory-barrier requirements do not replace the
626 fundamental RCU requirement that a grace period wait for all
627 pre-existing readers. On the contrary, the memory barriers called out in
628 this section must operate in such a way as to *enforce* this fundamental
635 The common-case RCU primitives are unconditional. They are invoked, they
637 to retry. This is a key RCU design philosophy.
640 comes up with a good justification for a particular conditional RCU
642 guarantee was reverse-engineered, not premeditated. The unconditional
645 with conditional primitives caused me to elevate this accident to a
646 guarantee. Therefore, the justification for adding a conditional
650 Guaranteed Read-to-Write Upgrade
654 within an RCU read-side critical section. For example, that RCU
655 read-side critical section might search for a given data element, and
656 then might acquire the update-side spinlock in order to update that
657 element, all while remaining in that RCU read-side critical section. Of
658 course, it is necessary to exit the RCU read-side critical section
663 +-----------------------------------------------------------------------+
665 +-----------------------------------------------------------------------+
666 | But how does the upgrade-to-write operation exclude other readers? |
667 +-----------------------------------------------------------------------+
669 +-----------------------------------------------------------------------+
672 +-----------------------------------------------------------------------+
674 This guarantee allows lookup code to be shared between read-side and
675 update-side code, and was premeditated, appearing in the earliest
678 Fundamental Non-Requirements
679 ----------------------------
681 RCU provides extremely lightweight readers, and its read-side
685 infinitely long; however, the following sections list a few
686 non-guarantees that have caused confusion. Except where otherwise noted,
687 these non-guarantees were premeditated.
692 #. `Grace Periods Don't Partition Read-Side Critical Sections`_
693 #. `Read-Side Critical Sections Don't Partition Grace Periods`_
698 Reader-side markers such as rcu_read_lock() and
700 through their interaction with the grace-period APIs such as
737 significant ordering constraints would slow down these fast-path APIs.
739 +-----------------------------------------------------------------------+
741 +-----------------------------------------------------------------------+
743 +-----------------------------------------------------------------------+
745 +-----------------------------------------------------------------------+
748 +-----------------------------------------------------------------------+
794 +-----------------------------------------------------------------------+
796 +-----------------------------------------------------------------------+
798 | completed instead of waiting only on pre-existing readers. For how |
800 +-----------------------------------------------------------------------+
802 +-----------------------------------------------------------------------+
804 | all readers had completed, a new reader might start immediately after |
807 +-----------------------------------------------------------------------+
809 Grace Periods Don't Partition Read-Side Critical Sections
812 It is tempting to assume that if any part of one RCU read-side critical
813 section precedes a given grace period, and if any part of another RCU
814 read-side critical section follows that same grace period, then all of
815 the first RCU read-side critical section must precede all of the second.
816 However, this just isn't the case: A single grace period does not
817 partition the set of RCU read-side critical sections. An example of this
826 4 WRITE_ONCE(a, 1);
833 11 r1 = READ_ONCE(a);
853 with each circled ``QS`` indicating the point at which RCU recorded a
854 *quiescent state* for each thread, that is, a state in which RCU knows
855 that the thread cannot be in the midst of an RCU read-side critical
858 .. kernel-figure:: GPpartitionReaders1.svg
860 If it is necessary to partition RCU read-side critical sections in this
869 4 WRITE_ONCE(a, 1);
876 11 r1 = READ_ONCE(a);
902 the two RCU read-side critical sections cannot overlap, guaranteeing
903 that ``(r3 == 1)``. As a result, the outcome:
911 This non-requirement was also non-premeditated, but became apparent when
914 Read-Side Critical Sections Don't Partition Grace Periods
917 It is also tempting to assume that if an RCU read-side critical section
918 happens between a pair of grace periods, then those grace periods cannot
927 4 WRITE_ONCE(a, 1);
934 11 r1 = READ_ONCE(a);
970 .. kernel-figure:: ReadersPartitionGP1.svg
972 Again, an RCU read-side critical section can overlap almost all of a
974 period. As a result, an RCU read-side critical section cannot partition
975 a pair of RCU grace periods.
977 +-----------------------------------------------------------------------+
979 +-----------------------------------------------------------------------+
980 | How long a sequence of grace periods, each separated by an RCU |
981 | read-side critical section, would be required to partition the RCU |
982 | read-side critical sections at the beginning and end of the chain? |
983 +-----------------------------------------------------------------------+
985 +-----------------------------------------------------------------------+
990 +-----------------------------------------------------------------------+
993 -------------------------
1000 completely futile. This is most obvious in preemptible user-level
1001 environments and in virtualized environments (where a given guest
1003 hypervisor), but can also happen in bare-metal environments due to
1004 ECC errors, NMIs, and other hardware events. Although a delay of more
1007 but where “extremely long” is not long enough to allow wrap-around
1008 when incrementing a 64-bit counter.
1010 matters, RCU must use compiler directives and memory-barrier
1014 writes and more-frequent concurrent writes will result in more
1018 #. As a rough rule of thumb, only one CPU's worth of processing may be
1021 #. Counters are finite, especially on 32-bit systems. RCU's use of
1023 that counter wrap would take way more time than a single system is
1024 likely to run. An uptime of ten years is quite possible, a runtime of
1025 a century much less so. As an example of the latter, RCU's
1026 dyntick-idle nesting counter allows 54 bits for interrupt nesting
1027 level (this counter is 64 bits even on a 32-bit system). Overflowing
1028 this counter requires 2\ :sup:`54` half-interrupts on a given CPU
1029 without that CPU ever going idle. If a half-interrupt happened every
1032 #. Linux systems can have thousands of CPUs running a single Linux
1033 kernel in a single shared-memory environment. RCU must therefore pay
1034 close attention to high-end scalability.
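The counter-wrap argument in the list above is simple arithmetic, and can be checked with a small helper. This is a sketch: ``years_to_overflow()`` is a hypothetical function, and the event rate of one half-interrupt per microsecond used in the usage note is an assumption chosen for illustration, since this listing does not show the rate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Years of continuous operation needed for a counter with the given
 * number of bits to wrap, at a steady rate of events per second.
 * Only valid for bits < 64.
 */
double years_to_overflow(unsigned int bits, double events_per_second)
{
	const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;
	double events = (double)(UINT64_C(1) << bits);

	return events / events_per_second / seconds_per_year;
}
```

Under the assumed rate, ``years_to_overflow(54, 1e6)`` comes out at roughly 570 years, which is far beyond the ten-year uptimes that the text treats as plausible.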
1042 Quality-of-Implementation Requirements
1043 --------------------------------------
1045 These sections list quality-of-implementation requirements. Although an
1048 inappropriate for industrial-strength production use. Classes of
1049 quality-of-implementation requirements are as follows:
1062 RCU is and always has been intended primarily for read-mostly
1063 situations, which means that RCU's read-side primitives are optimized,
1064 often at the expense of its update-side primitives. Experience thus far
1067 #. Read-mostly data, where stale and inconsistent data is not a problem:
1069 #. Read-mostly data, where data must be consistent: RCU works well.
1070 #. Read-write data, where data must be consistent: RCU *might* work OK.
1072 #. Write-mostly data, where data must be consistent: RCU is very
1076 a. Existence guarantees for update-friendly mechanisms.
1077 b. Wait-free read-side primitives for real-time use.
1079 This focus on read-mostly situations means that RCU must interoperate
1083 extends much farther, requiring that a variety of synchronization
1084 primitives be legal within RCU read-side critical sections, including
1088 +-----------------------------------------------------------------------+
1090 +-----------------------------------------------------------------------+
1092 +-----------------------------------------------------------------------+
1094 +-----------------------------------------------------------------------+
1095 | These are forbidden within Linux-kernel RCU read-side critical |
1096 | sections because it is not legal to place a quiescent state (in this |
1097 | case, voluntary context switch) within an RCU read-side critical |
1099 | read-side critical sections, and also within Linux-kernel sleepable |
1100 | RCU `(SRCU) <Sleepable RCU_>`__ read-side critical sections. In |
1101 | addition, the -rt patchset turns spinlocks into sleeping locks so |
1104 | locks!) may be acquired within -rt-Linux-kernel RCU read-side critical |
1106 | Note that it *is* legal for a normal RCU read-side critical section |
1107 | to conditionally acquire a sleeping lock (as in |
1114 +-----------------------------------------------------------------------+
1116 It often comes as a surprise that many algorithms do not require a
1120 arrives at a given system, that system has been sending network traffic
1121 the wrong way for a considerable length of time. Having a few threads
1122 continue to send traffic the wrong way for a few more milliseconds is
1123 clearly not a problem: In the worst case, TCP retransmissions will
1126 inconsistency must be tolerated due to speed-of-light delays if nothing
1130 For example, a pair of veterinarians might use heartbeat to determine
1131 whether or not a given cat was alive. But how long should they wait
1133 less than 400 milliseconds makes no sense because this would mean that a
1135 than 100 times per minute. Moreover, just as with human beings, a cat's
1136 heart might stop for some period of time, so the exact wait period is a
1139 a full minute. The two veterinarians would then disagree on the state of
1146 don't receive a response within a given period of time. Policy decisions
1148 decided some time ago, and is only now being put into effect, so a few
1152 For example, the translation from a user-level SystemV semaphore ID
1153 to the corresponding in-kernel data structure is protected by RCU, but
1154 it is absolutely forbidden to update a semaphore that has just been
1156 by acquiring spinlocks located in the in-kernel data structure from
1157 within the RCU read-side critical section, and this is indicated by the
1165 right mix of synchronization tools to be used for a given job.
1170 Energy efficiency is a critical component of performance today, and
1171 Linux-kernel RCU implementations must therefore avoid unnecessarily
1173 premeditated. In fact, I learned of it during a telephone conversation
1175 energy efficiency in battery-powered systems and on specific
1176 energy-efficiency shortcomings of the Linux-kernel RCU implementation.
1177 In my experience, the battery-powered embedded community will consider
1179 mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
1184 `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
1185 footprint is critically important on single-CPU systems with
1186 non-preemptible (``CONFIG_PREEMPTION=n``) kernels, and thus `tiny
1188 was born. Josh Triplett has since taken over the small-memory banner
1194 unsurprising. For example, in keeping with RCU's read-side
1196 (for example, suppression of a few minor compiler optimizations).
1197 Similarly, in non-preemptible environments, rcu_read_lock() and
1200 In preemptible environments, in the case where the RCU read-side
1202 highest-priority real-time process), rcu_read_lock() and
1204 should not contain atomic read-modify-write operations, memory-barrier
1206 branches. However, in the case where the RCU read-side critical section
1208 interrupts. This is why it is better to nest an RCU read-side critical
1209 section within a preempt-disable region than vice versa, at least in
1211 degrading real-time latencies.
1213 The synchronize_rcu() grace-period-wait primitive is optimized for
1215 addition to the duration of the longest RCU read-side critical section.
1218 they can be satisfied by a single underlying grace-period-wait
1219 operation. For example, in the Linux kernel, it is not unusual for a
1220 single grace-period-wait operation to serve more than `1,000 separate
1221 … <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-s…
1222 of synchronize_rcu(), thus amortizing the per-invocation overhead
1223 down to nearly zero. However, the grace-period optimization is also
1224 required to avoid measurable degradation of real-time scheduling and
1227 In some cases, the multi-millisecond synchronize_rcu() latencies are
1229 used instead, reducing the grace-period latency down to a few tens of
1230 microseconds on small systems, at least in cases where the RCU read-side
1235 requirements: A storm of synchronize_rcu_expedited() invocations on
1238 to impose modest degradation of real-time latency on non-idle online
1239 CPUs. Here, “modest” means roughly the same latency degradation as a
1240 scheduling-clock interrupt.
1242 There are a number of situations where even
1243 synchronize_rcu_expedited()'s reduced grace-period latency is
1250 2 int a;
1273 25 call_rcu(&p->rh, remove_gp_cb);
1278 A definition of ``struct foo`` is finally needed, and appears on
1279 lines 1-5. The function remove_gp_cb() is passed to call_rcu()
1280 on line 25, and will be invoked after the end of a subsequent grace
1282 without forcing the updater to wait for a grace period to elapse. The
1283 call_rcu() function may be used in a number of situations where
1285 be legal, including within preempt-disable code, local_bh_disable()
1286 code, interrupt-disable code, and interrupt handlers. However, even
1290 Linux kernel, either within a real softirq handler or under the
1293 takes too long. Long-running operations should be relegated to separate
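The callback idiom described above, in which call_rcu() is handed a pointer to a field embedded in the element to be freed, can be modelled in user space. Everything prefixed ``toy_`` below is hypothetical: ``toy_call_rcu()`` merely queues the callback, and ``toy_end_grace_period()`` pretends that a grace period has just ended. It is a single-threaded sketch of the control flow only, with no real readers or grace-period detection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for the kernel's struct rcu_head. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *rhp);
};

struct foo {
	int a;
	int b;
	struct toy_rcu_head rh;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_rcu_head *toy_cb_list; /* callbacks awaiting a grace period */
static int toy_freed;                    /* elements freed so far */

/* Queue a callback and return immediately, without waiting. */
void toy_call_rcu(struct toy_rcu_head *rhp,
		  void (*func)(struct toy_rcu_head *rhp))
{
	rhp->func = func;
	rhp->next = toy_cb_list;
	toy_cb_list = rhp;
}

/* Pretend a grace period just ended: invoke and count queued callbacks. */
int toy_end_grace_period(void)
{
	struct toy_rcu_head *rhp = toy_cb_list;
	int invoked = 0;

	toy_cb_list = NULL;
	while (rhp) {
		/* Save next first: the callback frees the enclosing element. */
		struct toy_rcu_head *next = rhp->next;

		rhp->func(rhp);
		invoked++;
		rhp = next;
	}
	return invoked;
}

/* Callback in the style of remove_gp_cb(): free the enclosing struct foo. */
void toy_remove_gp_cb(struct toy_rcu_head *rhp)
{
	struct foo *p = toy_container_of(rhp, struct foo, rh);

	free(p);
	toy_freed++;
}
```

The point of the model is that the updater's call to ``toy_call_rcu(&p->rh, toy_remove_gp_cb)`` returns at once, and the element is freed only later, when the queued callback runs.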
1296 +-----------------------------------------------------------------------+
1298 +-----------------------------------------------------------------------+
1303 +-----------------------------------------------------------------------+
1305 +-----------------------------------------------------------------------+
1306 | Presumably the ``->gp_lock`` acquired on line 18 excludes any |
1309 | after ``->gp_lock`` is released on line 25, which in turn means that |
1311 +-----------------------------------------------------------------------+
1314 the data element. This is a common idiom, and is supported by
1321 2 int a;
1350 open-coded it.
1352 +-----------------------------------------------------------------------+
1354 +-----------------------------------------------------------------------+
1358 | of the memory (respectively) must still wait for a grace period to |
1360 +-----------------------------------------------------------------------+
1362 +-----------------------------------------------------------------------+
1364 | definition would say that updates in garbage-collected languages |
1369 | kfree_rcu(), without having to wait for a subsequent grace |
1371 +-----------------------------------------------------------------------+
1375 be carried out in the meantime? The polling-style
1401 On line 14, get_state_synchronize_rcu() obtains a “cookie” from RCU,
1403 immediately if a grace period has elapsed in the meantime, but otherwise
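The cookie idea behind get_state_synchronize_rcu() can be illustrated with a toy grace-period counter. This is a hypothetical single-threaded model, not the kernel's actual encoding: all ``toy_`` names are assumptions. The cookie records which grace period must complete before the caller's earlier update is safe, and a later check compares it against the number of grace periods completed so far.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct toy_gp_state {
	unsigned long completed; /* number of grace periods that have ended */
	bool in_progress;        /* is a grace period currently running? */
};

/*
 * Take a cookie. A grace period already in progress might have started
 * before the caller's update, so it does not count: in that case a
 * further, full grace period must also complete.
 */
unsigned long toy_get_state(const struct toy_gp_state *gp)
{
	return gp->completed + (gp->in_progress ? 2UL : 1UL);
}

/* Has a full grace period elapsed since the cookie was taken? */
bool toy_poll_state(const struct toy_gp_state *gp, unsigned long cookie)
{
	/* Wrap-tolerant form of "completed >= cookie". */
	return gp->completed - cookie <= ULONG_MAX / 2;
}

void toy_start_gp(struct toy_gp_state *gp)
{
	gp->in_progress = true;
}

void toy_end_gp(struct toy_gp_state *gp)
{
	gp->in_progress = false;
	gp->completed++;
}
```

An updater in this model would take a cookie after unlinking an element, do other work, and later block only if ``toy_poll_state()`` still returns ``false``.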
1408 RCU thus provides a range of tools to allow updaters to strike the
1414 In theory, delaying grace-period completion and callback invocation is
1418 a number of mechanisms to promote forward progress.
1421 example, an infinite loop in an RCU read-side critical section must by
1422 definition prevent later grace periods from ever completing. For a more
1423 involved example, consider a 64-CPU system built with
1424 ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
1438 #. If a grace period fails to complete within 100 milliseconds, RCU
1442 corresponding CPU's next scheduling-clock interrupt.
1444 indefinitely in the kernel without scheduling-clock interrupts, which
1448 #. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
1449 has been preempted within an RCU read-side critical section is
1452 #. If a CPU is still holding out 10 seconds into the grace period, RCU
1460 caution when changing them. Note that these forward-progress measures
1465 invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
1469 #. Starts a grace period, if one is not already in progress.
1476 #. Lifts callback-execution batch limits, which speeds up callback
1480 overridden. Again, these forward-progress measures are provided only for
1482 RCU`_. Even for RCU, callback-invocation forward
1483 progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
1487 additional forward-progress work will be required.
1493 part due to the collision of multicore hardware with object-oriented
1494 techniques designed in single-threaded environments for single-threaded
1495 use. And in theory, RCU read-side critical sections may be composed, and
1497 real-world implementations of composable constructs, there are
1501 rcu_read_unlock() generate no code, such as Linux-kernel RCU when
1516 the nesting-depth counter. For example, the Linux kernel's preemptible
1518 practical purposes. That said, a consecutive pair of RCU read-side
1519 critical sections between which there is an operation that waits for a
1520 grace period cannot be enclosed in another RCU read-side critical
1521 section. This is because it is not legal to wait for a grace period
1522 within an RCU read-side critical section: To do so would result either
1523 in deadlock or in RCU implicitly splitting the enclosing RCU read-side
1524 critical section, neither of which is conducive to a long-lived and
1528 example, many transactional-memory implementations prohibit composing a
1530 a network receive operation). For another example, lock-based critical
1534 In short, although RCU read-side critical sections are highly
1541 A given RCU workload might have an endless and intense stream of RCU
1542 read-side critical sections, perhaps even so intense that there was
1543 never a point in time during which there was not at least one RCU
1544 read-side critical section in flight. RCU cannot allow this situation to
1545 block grace periods: As long as all the RCU read-side critical sections
1549 RCU read-side critical sections being preempted for long durations,
1550 which has the effect of creating a long-duration RCU read-side critical
1552 systems using real-time priorities are of course more vulnerable.
1563 rates should not delay RCU read-side critical sections, although some
1564 small read-side delays can occur when using
1569 1990s, a simple user-level test consisting of ``close(open(path))`` in a
1570 tight loop in the early 2000s suddenly provided a much deeper
1571 appreciation of the high-update-rate corner case. This test also
1573 example, if a given CPU finds itself with more than 10,000 RCU callbacks
1576 grace-period processing. This evasive action causes the grace period to
1581 Software-Engineering Requirements
1582 ---------------------------------
1589 splat if rcu_dereference() is used outside of an RCU read-side
1590 critical section. Update-side code can use
1591 rcu_dereference_protected(), which takes a `lockdep
1594 provided, a lockdep splat is emitted.
1596 rcu_dereference_check(), which also takes a lockdep expression,
1597 and emits a lockdep splat if neither rcu_read_lock() nor the
1601 rcu_read_lock_held() is provided to allow a function to verify
1602 that it has been invoked within an RCU read-side critical section. I
1604 Gleixner audited a number of RCU uses.
1605 #. A given function might wish to check for RCU-related preconditions
1610 rcu_dereference(), perhaps (incorrectly) substituting a simple
1611 assignment. To catch this sort of error, a given RCU-protected
1613 complain about simple-assignment accesses to that pointer. Arnd
1617 a data element is passed to call_rcu() twice in a row, without a
1618 grace period in between. (This error is similar to a double free.)
1624 non-stack ``rcu_head`` structures must be initialized with
1629 #. An infinite loop in an RCU read-side critical section will eventually
1634 is not obligated to produce this splat unless there is a grace period
1635 waiting on that particular RCU read-side critical section.
1641 stall warnings are counter-productive during sysrq dumps and during
1645 automatically invoked at the beginning of a panic to suppress further
1649 the first time that it was necessary to debug a CPU stall. That said,
1654 read-side critical sections, there is currently no good way of doing
1658 #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
1660 #. Open-coded use of rcu_assign_pointer() and rcu_dereference()
1662 error-prone. Therefore, RCU-protected `linked
1664 more recently, RCU-protected `hash
1666 other special-purpose RCU-protected data structures are available in
1672 linked structures that are to be published via a single external
1675 This is not a hard-and-fast list: RCU's diagnostic capabilities will
1677 real-world RCU usage.
1680 --------------------------
1696 #. `Scheduling-Clock Interrupts and RCU`_
1700 This list is probably incomplete, but it does give a feel for the most
1701 notable Linux-kernel complications. Each of the following sections
1714 requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT``
1719 `remind <https://lore.kernel.org/r/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.c…
1730 sometimes by a large factor. If RCU naively believed the firmware, as it
1731 used to do, it would create too many per-CPU kthreads. Although the
1736 RCU must therefore wait for a given CPU to actually come online before
1738 resulting “ghost CPUs” (which are never going to come online) cause a
1746 used early, even before rcu_init() is invoked. In fact, a number of
1748 ``task_struct`` is available and the boot CPU's per-CPU variables are
1749 set up. The read-side primitives (rcu_read_lock(),
1761 it would be possible to invoke callbacks earlier; however, this is not a
1769 itself is a quiescent state and thus a grace period, so the early-boot
1770 implementation can be a no-op.
1775 reason is that an RCU read-side critical section might be preempted,
1776 which means that a subsequent synchronize_rcu() really does have to
1785 +-----------------------------------------------------------------------+
1787 +-----------------------------------------------------------------------+
1790 +-----------------------------------------------------------------------+
1792 +-----------------------------------------------------------------------+
1797 | grace-period mechanism. At runtime, this expedited mechanism relies |
1799 | drives the desired expedited grace period. Because dead-zone |
1802 | as is required to avoid problems that would otherwise occur when a |
1803 | user task received a POSIX signal while driving an expedited grace |
1809 | there ever turns out to be a good reason for sending POSIX signals |
1813 +-----------------------------------------------------------------------+
1815 I learned of these boot-time requirements as a result of a series of
1821 The Linux kernel has interrupts, and RCU read-side critical sections are
1822 legal within interrupt handlers and within interrupt-disabled regions of
1825 Some Linux-kernel architectures can enter an interrupt handler from
1826 non-idle process context, and then just never leave it, instead
1829 “half-interrupts” mean that RCU has to be very careful about how it
1831 way during a rewrite of RCU's dyntick-idle code.
1833 The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
1835 update-side primitives, including call_rcu(), are prohibited within
1838 The name notwithstanding, some Linux-kernel architectures can have
1840 me <https://lore.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
1857 unloaded. After a given module has been unloaded, any attempt to call
1858 one of its functions results in a segmentation fault. The module-unload
1859 functions must therefore cancel any delayed calls to loadable-module
1866 considered socially irresponsible to crash the system in response to a
1867 module unload request, we need some other way to deal with in-flight RCU
1871 in-flight RCU callbacks have been invoked. If a module uses
1874 the underlying module-unload code could invoke rcu_barrier()
1879 filesystem-unmount situation, and Dipankar Sarma incorporated
1886 *not*, obligated to wait for a grace period. It is instead only required
1891 to wait for a grace period.
1893 +-----------------------------------------------------------------------+
1895 +-----------------------------------------------------------------------+
1896 | Wait a minute! Each RCU callback must wait for a grace period to  |
1897 | complete, and rcu_barrier() must wait for each pre-existing |
1899 | wait for a full grace period if there is even one callback posted |
1901 +-----------------------------------------------------------------------+
1903 +-----------------------------------------------------------------------+
1905 | Yes, each RCU callback must wait for a grace period to complete, but  |
1909 | elapse. So even if there are quite a few callbacks posted, |
1912 | So if you need to wait for a grace period as well as for all |
1913 | pre-existing callbacks, you will need to invoke both |
1914 | synchronize_rcu() and rcu_barrier(). If latency is a concern, |
1916 +-----------------------------------------------------------------------+
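The distinction drawn in this quiz between waiting for a grace period and waiting for callbacks can be illustrated with a toy, single-threaded userspace model. This is only a sketch under loud assumptions: every ``toy_`` name below is hypothetical and is not a kernel API, grace periods are modeled as a bare counter, and nothing here reflects how the kernel actually implements call_rcu(), synchronize_rcu(), or rcu_barrier().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CBS 16

/* Hypothetical callback record: runs only after grace period ready_gp ends. */
struct toy_cb {
	unsigned long ready_gp;
	void (*func)(void);
};

static struct toy_cb toy_queue[TOY_MAX_CBS];
static size_t toy_head, toy_tail;       /* FIFO indices into toy_queue */
static unsigned long toy_gp_completed;  /* grace periods completed so far */
static int toy_invoked;                 /* callbacks invoked so far */

static void toy_count(void) { toy_invoked++; }

/* Post a callback that must wait for the next grace period to complete. */
static void toy_call_rcu(void (*func)(void))
{
	assert(toy_tail < TOY_MAX_CBS);
	toy_queue[toy_tail].ready_gp = toy_gp_completed + 1;
	toy_queue[toy_tail].func = func;
	toy_tail++;
}

/* Model a grace period ending; callbacks become ready but are not yet run. */
static void toy_gp_end(void) { toy_gp_completed++; }

/* Invoke every queued callback whose grace period has already completed. */
static void toy_invoke_ready(void)
{
	while (toy_head < toy_tail &&
	       toy_queue[toy_head].ready_gp <= toy_gp_completed)
		toy_queue[toy_head++].func();
}

/* Model of waiting for a grace period: it does not wait for callbacks. */
static void toy_synchronize_rcu(void) { toy_gp_end(); }

/* Model of waiting for all posted callbacks; returns grace periods waited. */
static unsigned long toy_rcu_barrier(void)
{
	unsigned long waited = 0;

	toy_invoke_ready();
	while (toy_head < toy_tail) {
		toy_gp_end();
		waited++;
		toy_invoke_ready();
	}
	return waited;
}

static void toy_cb_demo(void)
{
	assert(toy_rcu_barrier() == 0);  /* nothing posted: no waiting at all */

	toy_call_rcu(toy_count);
	toy_gp_end();                    /* a grace period ends elsewhere */
	assert(toy_rcu_barrier() == 0);  /* callback was ready: no new grace period */
	assert(toy_invoked == 1);

	toy_call_rcu(toy_count);
	toy_synchronize_rcu();           /* waits for the grace period only */
	assert(toy_invoked == 1);        /* the callback is still pending */
	assert(toy_rcu_barrier() == 0);  /* the barrier is what runs it */
	assert(toy_invoked == 2);
}
```

In this model, the barrier waits for a grace period only when a posted callback still needs one, and the grace-period wait never runs a callback. This is why code needing both guarantees has to invoke both operations.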
1923 offline CPU, with the exception of `SRCU <Sleepable RCU_>`__ read-side
1925 DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
1928 The Linux-kernel CPU-hotplug implementation has notifiers that are used
1930 appropriately to a given CPU-hotplug operation. Most RCU operations may
1931 be invoked from CPU-hotplug notifiers, including even synchronous
1932 grace-period operations such as (synchronize_rcu() and
1938 In addition, all-callback-wait operations such as rcu_barrier() may
1939 not be invoked from any CPU-hotplug notifier. This restriction is due
1940 to the fact that there are phases of CPU-hotplug operations where the
1941 outgoing CPU's callbacks will not be invoked until after the CPU-hotplug
1943 rcu_barrier() blocks CPU-hotplug operations during its execution,
1944 which results in another type of deadlock when invoked from a CPU-hotplug
1950 and also by reporting quiescent states explicitly when a CPU goes
1952 for the force-quiescent-state loop (FQS) to report quiescent states for
1953 offline CPUs. However, as a debugging measure, the FQS loop does splat
1959 2. When grace period initialization (rcu_gp_init()) detects a
1960 race either with CPU offlining or with a task unblocking on a leaf
1963 The CPU-online path (rcutree_report_cpu_starting()) should never need to report
1964 a quiescent state for an offline CPU. However, as a debugging measure,
1965 it does emit a warning if a quiescent state was not already reported
1976 RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
1978 RCU's violation of it when running context-switch-heavy workloads when
1979 built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
1982 context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
1986 scheduler's runqueue or priority-inheritance spinlocks across an
1988 somewhere within the corresponding RCU read-side critical section.
1994 nesting. The fact that interrupt-disabled regions of code act as RCU
1995 read-side critical sections implicitly avoids earlier issues that used
2012 The kernel needs to access user-space memory, for example, to access data
2013 referenced by system-call parameters. The get_user() macro does this job.
2015 However, user-space memory might well be paged out, which means that
2016 get_user() might well page-fault and thus block while waiting for the
2017 resulting I/O to complete. It would be a very bad thing for the compiler to
2018 reorder a get_user() invocation into an RCU read-side critical section.
2026 3 v = p->value;
2039 4 v = p->value;
2043 If the compiler did make this transformation in a ``CONFIG_PREEMPTION=n`` kernel
2044 build, and if get_user() did page fault, the result would be a quiescent
2045 state in the middle of an RCU read-side critical section. This misplaced
2046 quiescent state could result in line 4 being a use-after-free access,
2054 ``p->value`` is not volatile, so the compiler would not have any reason to keep
2057 Therefore, the Linux-kernel definitions of rcu_read_lock() and
2059 instances of rcu_read_lock() and rcu_read_unlock() within a nested set
2060 of RCU read-side critical sections.
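The compiler-barrier requirement on rcu_read_lock() and rcu_read_unlock() can be sketched in userspace C. This is a minimal sketch, assuming GCC or Clang: the ``toy_`` names are hypothetical stand-ins rather than kernel definitions, ``toy_get_user()`` merely copies a value instead of possibly page-faulting, and the plain pointer load stands in for rcu_dereference().

```c
#include <assert.h>

/*
 * Compiler-only barrier (GCC/Clang extended asm): emits no instructions,
 * but the "memory" clobber forbids the compiler from moving memory
 * accesses across it.
 */
#define toy_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical read-side markers that do nothing except act as barriers. */
#define toy_rcu_read_lock()   toy_barrier()
#define toy_rcu_read_unlock() toy_barrier()

struct toy_item {
	int value;
};

static struct toy_item toy_storage = { .value = 42 };
static struct toy_item *toy_gp = &toy_storage;

/* Hypothetical stand-in for get_user(): the real one might block. */
static int toy_get_user(int *dst, const int *user_src)
{
	*dst = *user_src;
	return 0;
}

static int toy_reader(const int *user_p)
{
	struct toy_item *p;
	int v;
	int user_v = 0;

	toy_rcu_read_lock();
	p = toy_gp;               /* the kernel would use rcu_dereference() */
	v = p->value;
	toy_rcu_read_unlock();    /* the load of p->value stays above this line */
	toy_get_user(&user_v, user_p);  /* its memory accesses stay below it */
	return v + user_v;
}

static void toy_reader_demo(void)
{
	int user_word = 8;

	assert(toy_reader(&user_word) == 50);
}
```

Without the barrier in ``toy_rcu_read_unlock()``, nothing in the language would stop the compiler from moving the user-memory access above the load of ``p->value``, which is the transformation this section warns against.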
2066 by people with battery-powered embedded systems. RCU therefore conserves
2068 have been interrupted from idle. This is a large part of the
2069 energy-efficiency requirement, so I learned of this via an irate phone
2073 RCU read-side critical section on an idle CPU. (Kernels built with
2082 These energy-efficiency requirements have proven quite difficult to
2084 clean-sheet rewrites of RCU's energy-efficiency code, the last of which
2087 [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__.
2089 phone calls: Flaming me on the Linux-kernel mailing list was apparently
2090 not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
2092 Scheduling-Clock Interrupts and RCU
2095 The kernel transitions between in-kernel non-idle execution, userspace
2099 +-----------------+------------------+------------------+-----------------+
2100 | ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
2103 | | scheduling-clock | scheduling-clock | RCU's |
2104 | | interrupt. | interrupt and | dyntick-idle |
2108 +-----------------+------------------+------------------+-----------------+
2110 | | scheduling-clock | scheduling-clock | RCU's |
2111 | | interrupt. | interrupt and | dyntick-idle |
2115 +-----------------+------------------+------------------+-----------------+
2118 | | on | dyntick-idle | dyntick-idle |
2119 | | scheduling-clock | detection. | detection. |
2127 +-----------------+------------------+------------------+-----------------+
2129 +-----------------------------------------------------------------------+
2131 +-----------------------------------------------------------------------+
2132 | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
2133 | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
2135 +-----------------------------------------------------------------------+
2137 +-----------------------------------------------------------------------+
2138 | Because, as a performance optimization, ``NO_HZ_FULL`` does not |
2139 | necessarily re-enable the scheduling-clock interrupt on entry to each |
2141 +-----------------------------------------------------------------------+
2147 scheduling-clock interrupt be enabled when RCU needs it to be:
2149 #. If a CPU is either idle or executing in usermode, and RCU believes it
2150 is non-idle, the scheduling-clock tick had better be running.
2152 (11-second) grace periods, with a pointless IPI waking the CPU from
2154 #. If a CPU is in a portion of the kernel that executes RCU read-side
2159 #. If a CPU is in a portion of the kernel that is absolutely positively
2160 no-joking guaranteed to never execute any RCU read-side critical
2162 sort of thing is used by some architectures for light-weight
2169 fact joking about not doing RCU read-side critical sections.
2170 #. If a CPU is executing in the kernel with the scheduling-clock
2171 interrupt disabled and RCU believes this CPU to be non-idle, and if
2174 idle periods of up to a second or so.
2176 #. If a CPU is either idle or executing in usermode, and RCU believes it
2178 #. If a CPU is executing in the kernel, the kernel code path is passing
2179 through quiescent states at a reasonable frequency (preferably about
2180 once per few jiffies, but the occasional excursion to a second or so
2181 is usually OK) and the scheduling-clock interrupt is enabled, of
2183 If the gap between a successive pair of quiescent states grows too
2186 +-----------------------------------------------------------------------+
2188 +-----------------------------------------------------------------------+
2189 | But what if my driver has a hardware interrupt handler that can run |
2192 +-----------------------------------------------------------------------+
2194 +-----------------------------------------------------------------------+
2196 | often. But given that long-running interrupt handlers can cause other |
2199 +-----------------------------------------------------------------------+
2202 between in-kernel execution, usermode execution, and idle, and as long
2203 as the scheduling-clock interrupt is enabled when RCU needs it to be,
2210 Although small-memory non-realtime systems can simply use Tiny RCU, code
2213 kfree_rcu(). Although this structure contains nothing more than a
2214 pair of pointers, it does appear in many RCU-protected data structures,
2215 including some that are size critical. The ``page`` structure is a case
2219 This need for memory efficiency is one reason that RCU uses hand-crafted
2221 waiting for a grace period to elapse. It is also the reason why
2224 posted them. Although this information might appear in debug-only kernel
2225 builds at some point, in the meantime, the ``->func`` field will often
2230 ``rcu_head`` field shares storage with a great many other structures
2233 conditions <https://lore.kernel.org/r/1439976106-137226-1-git-send-email-kirill.shutemov@linux.inte…
2234 the Linux kernel's memory-management subsystem needs a particular bit to
2235 remain zero during all phases of grace-period processing, and that bit
2237 ``->next`` field. RCU makes this guarantee as long as call_rcu() is
2240 energy-efficiency purposes.
2243 structure be aligned to a two-byte boundary, and passing a misaligned
2245 will result in a splat. It is therefore necessary to exercise caution
2247 a four-byte or even eight-byte alignment requirement? Because the m68k
2248 architecture provides only two-byte alignment, and thus acts as
2254 potentially have energy-efficiency benefits, but only if the rate of
2255 non-lazy callbacks decreases significantly for some important workload.
2264 hot code paths in performance-critical portions of the Linux kernel's
2267 read-side primitives. To that end, it would be good if preemptible RCU's
2276 implementation. RCU therefore makes heavy use of a combining tree based
2279 minimal per-operation overhead. In fact, in many cases, increasing load
2280 must *decrease* the per-operation overhead, witness the batching
2282 synchronize_rcu_expedited(), and rcu_barrier(). As a general
2286 The Linux kernel is used for real-time workloads, especially in
2287 conjunction with the `-rt
2289 real-time-latency response requirements are such that the traditional
2290 approach of disabling preemption across RCU read-side critical sections
2292 an RCU implementation that allows RCU read-side critical sections to be
2294 clear that an earlier `real-time
2298 encountered by a very early version of the -rt patchset.
2300 In addition, RCU must make do with a sub-100-microsecond real-time
2301 latency budget. In fact, on smaller systems with the -rt patchset, the
2302 Linux kernel provides sub-20-microsecond real-time latencies for the
2305 surprise, the sub-100-microsecond real-time latency budget `applies to
2307 [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__,
2308 up to and including systems with 4096 CPUs. This real-time requirement
2309 motivated the grace-period kthread, which also simplified handling of a
2312 RCU must avoid degrading real-time response for CPU-bound threads,
2314 ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
2318 Finally, RCU's status as a synchronization primitive means that any RCU
2322 stress-test suite. This stress-test suite is called ``rcutorture``.
2328 today, given Android smartphones, Linux-powered televisions, and
2332 Suppose that RCU contains a race condition that manifests on average
2336 their smartphone to last for a million years. However, anyone taking too
2338 jurisdictions, a successful multi-year test of a given mechanism, which
2339 might include a Linux kernel, suffices for a number of types of
2340 safety-critical certifications. In fact, rumor has it that the Linux
2341 kernel is already being used in production for safety-critical
2342 applications. I don't know about you, but I would feel quite bad if a
2347 -----------------
2352 implementations, non-preemptible and preemptible. The other four flavors
2353 are listed below, with requirements for each described in a separate
2356 #. `Bottom-Half Flavor (Historical)`_
2362 Bottom-Half Flavor (Historical)
2365 The RCU-bh flavor of RCU has since been expressed in terms of the other
2366 RCU flavors as part of a consolidation of the three flavors into a
2367 single flavor. The read-side API remains, and continues to disable
2371 The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
2372 flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
2373 flavor of RCU that could withstand the network-based denial-of-service
2376 execution, which in turn prevented those CPUs from ever executing a
2378 grace periods from ever ending. The result was an out-of-memory
2379 condition and a system hang.
2381 The solution was the creation of RCU-bh, which does
2382 local_bh_disable() across its read-side critical sections, and which
2383 uses the transition from one type of softirq processing to another as a
2385 offline. This means that RCU-bh grace periods can complete even when
2387 algorithms based on RCU-bh to withstand network-based denial-of-service
2391 re-enable softirq handlers, any attempt to start a softirq handler
2392 during the RCU-bh read-side critical section will be deferred. In this
2395 overhead should be associated with the code following the RCU-bh
2396 read-side critical section rather than rcu_read_unlock_bh(), but the
2398 of fine distinction. For example, suppose that a three-millisecond-long
2399 RCU-bh read-side critical section executes during a time of heavy
2406 The `RCU-bh
2407 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2410 old RCU-bh update-side APIs are now gone, replaced by synchronize_rcu(),
2412 anything that disables bottom halves also marks an RCU-bh read-side
2419 The RCU-sched flavor of RCU has since been expressed in terms of the
2420 other RCU flavors as part of a consolidation of the three flavors into a
2421 single flavor. The read-side API remains, and continues to disable
2425 Before preemptible RCU, waiting for an RCU grace period had the side
2426 effect of also waiting for all pre-existing interrupt and NMI handlers.
2427 However, there are legitimate preemptible-RCU implementations that do
2429 RCU read-side critical section can be a quiescent state. Therefore,
2430 *RCU-sched* was created, which follows “classic” RCU in that an
2431 RCU-sched grace period waits for pre-existing interrupt and NMI
2433 RCU-sched APIs have identical implementations, while kernels built with
2434 ``CONFIG_PREEMPTION=y`` provide a separate implementation for each.
2438 re-enable preemption, respectively. This means that if there was a
2439 preemption attempt during the RCU-sched read-side critical section,
2443 very slowly. However, the highest-priority task won't be preempted, so
2444 that task will enjoy low-overhead rcu_read_unlock_sched()
2447 The `RCU-sched
2448 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2452 rcu_read_lock_sched_held(). However, the old RCU-sched update-side APIs
2455 preemption also marks an RCU-sched read-side critical section,
2462 For well over a decade, someone saying “I need to block within an RCU
2463 read-side critical section” was a reliable indication that this someone
2465 read-side critical section, you can probably afford to use a
2466 higher-overhead synchronization mechanism. However, that changed with
2467 the advent of the Linux kernel's notifiers, whose RCU read-side critical
2473 defined by an instance of an ``srcu_struct`` structure. A pointer to
2476 structure. The key benefit of these domains is that a slow SRCU reader
2478 That said, one consequence of these domains is that read-side code must
2479 pass a “cookie” from srcu_read_lock() to srcu_read_unlock(), for
2490 As noted above, it is legal to block within SRCU read-side critical
2492 block forever in one of a given domain's SRCU read-side critical
2495 happen if any operation in a given domain's SRCU read-side critical
2497 period to elapse. For example, this results in a self-deadlock:
2508 However, if line 5 acquired a mutex that was held across a
2510 possible. Furthermore, if line 5 acquired a mutex that was held across a
2512 ``ss1``-domain SRCU read-side critical section acquired another mutex
2513 that was held across an ``ss``-domain synchronize_srcu(), deadlock
2514 would again be possible. Such a deadlock cycle could extend across an
2518 Unlike the other RCU flavors, SRCU read-side critical sections can run
2521 which means that SRCU readers will run a bit slower than would RCU
2523 which, in combination with srcu_read_unlock(), guarantees a full
2527 invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
2531 the CPU-hotplug process. The problem is that if a notifier is waiting on
2532 an SRCU grace period, that grace period is waiting on a timer, and that
2536 CPU-hotplug notifiers.
2539 non-expedited grace periods are implemented by the same mechanism. This
2540 means that in the current SRCU implementation, expediting a future grace
2541 period has the side effect of expediting all prior grace periods that
2542 have not yet completed. (But please note that this is a property of the
2546 by default), and if a synchronize_srcu() invocation ends this idle
2549 As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
2556 run some tests first. SRCU just might need a few adjustments to deal with
2561 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2572 #. start_poll_synchronize_srcu() returns a cookie identifying
2573 the completion of a future SRCU grace period and ensures
2576 specified cookie corresponds to an already-completed
2578 #. get_state_synchronize_srcu() returns a cookie just like
2584 certain types of buffer-cache algorithms having multi-stage age-out
2593 able to free old trampolines, which sounds like a job for some form of
2594 RCU. However, because it is necessary to be able to install a trace
2595 anywhere in the code, it is not possible to use read-side markers such
2602 the trampoline's protection must extend a few instructions *prior* to
2605 trampoline would be pre-ordained a surprisingly long time before execution
2609 RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
2613 userspace execution also delimit tasks-RCU read-side critical sections.
2617 Note well that involuntary context switches are *not* Tasks-RCU quiescent
2618 states. After all, in preemptible kernels, a task executing code in a
2619 trampoline might be preempted. In this case, the Tasks-RCU grace period
2622 not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs()
2625 The tasks-RCU API is quite compact, consisting only of
2637 Some forms of tracing need to wait for all preemption-disabled regions
2641 forcing a workqueue to be scheduled on each online CPU, hence the "Rude"
2642 moniker. And this operation is considered to be quite rude by real-time
2644 by battery-powered systems that don't want their idle CPUs to be awakened.
2646 Once kernel entry/exit and deep-idle functions have been properly tagged
2651 The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
2658 SRCU's read-side overhead, which includes a full memory barrier in both
2659 srcu_read_lock() and srcu_read_unlock(). This need is handled by a
2661 readers. Real-time systems that cannot tolerate IPIs may build their
2663 the expense of adding full memory barriers to the read-side primitives.
2665 The tasks-trace-RCU API is also reasonably compact,
2671 -----------------------
2673 One of the tricks that RCU uses to attain update-side scalability is to
2674 increase grace-period latency with increasing numbers of CPUs. If this
2675 becomes a serious problem, it will be necessary to rework the
2676 grace-period state machine so as to avoid the need for the additional
2679 RCU disables CPU hotplug in a few places, perhaps most notably in the
2680 rcu_barrier() operations. If there is a strong reason to use
2681 rcu_barrier() in CPU-hotplug notifiers, it will be necessary to
2683 there had better be a *very* good reason.
2685 The tradeoff between grace-period latency on the one hand and
2687 re-examined. The desire is of course for zero grace-period latency as
2692 The multiprocessor implementations of RCU use a combining tree that
2697 be unnecessary because the hotpath read-side primitives do not access
2702 number of CPUs in a socket, NUMA node, or whatever. If the number of
2703 CPUs is too large, use a fraction of the number of CPUs. If the number
2704 of CPUs is a large prime number, well, that certainly is an
2707 inadequate, and only if the inadequacy has been demonstrated by a
2708 carefully run and realistic system-level workload.
2722 Additional work may be required to provide reasonable forward-progress
2727 -------
2735 ---------------