xref: /linux/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst (revision 566ab427f827b0256d3e8ce0235d088e6a9c28bd)
1======================================================
2A Tour Through TREE_RCU's Grace-Period Memory Ordering
3======================================================
4
5August 8, 2017
6
7This article was contributed by Paul E. McKenney
8
9Introduction
10============
11
12This document gives a rough visual overview of how Tree RCU's
13grace-period memory ordering guarantee is provided.
14
15What Is Tree RCU's Grace Period Memory Ordering Guarantee?
16==========================================================
17
18RCU grace periods provide extremely strong memory-ordering guarantees
19for non-idle non-offline code.
20Any code that happens after the end of a given RCU grace period is guaranteed
21to see the effects of all accesses prior to the beginning of that grace
22period that are within RCU read-side critical sections.
23Similarly, any code that happens before the beginning of a given RCU grace
24period is guaranteed to not see the effects of all accesses following the end
25of that grace period that are within RCU read-side critical sections.
26
27Note well that RCU-sched read-side critical sections include any region
28of code for which preemption is disabled.
29Given that each individual machine instruction can be thought of as
30an extremely small region of preemption-disabled code, one can think of
31``synchronize_rcu()`` as ``smp_mb()`` on steroids.
32
33RCU updaters use this guarantee by splitting their updates into
34two phases, one of which is executed before the grace period and
35the other of which is executed after the grace period.
36In the most common use case, phase one removes an element from
37a linked RCU-protected data structure, and phase two frees that element.
38For this to work, any readers that have witnessed state prior to the
39phase-one update (in the common case, removal) must not witness state
40following the phase-two update (in the common case, freeing).
41
42The RCU implementation provides this guarantee using a network
43of lock-based critical sections, memory barriers, and per-CPU
44processing, as is described in the following sections.
45
46Tree RCU Grace Period Memory Ordering Building Blocks
47=====================================================
48
49The workhorse for RCU's grace-period memory ordering is the
50critical section for the ``rcu_node`` structure's
51``->lock``. These critical sections use helper functions for lock
52acquisition, including ``raw_spin_lock_rcu_node()``,
53``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
54Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
55``raw_spin_unlock_irq_rcu_node()``, and
56``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
57For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
58The key point is that the lock-acquisition functions, including
59``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
60immediately after successful acquisition of the lock.
61
62Therefore, for any given ``rcu_node`` structure, any access
63happening before one of the above lock-release functions will be seen
64by all CPUs as happening before any access happening after a later
65one of the above lock-acquisition functions.
66Furthermore, any access happening before one of the
67above lock-release function on any given CPU will be seen by all
68CPUs as happening before any access happening after a later one
69of the above lock-acquisition functions executing on that same CPU,
70even if the lock-release and lock-acquisition functions are operating
71on different ``rcu_node`` structures.
72Tree RCU uses these two ordering guarantees to form an ordering
73network among all CPUs that were in any way involved in the grace
74period, including any CPUs that came online or went offline during
75the grace period in question.
76
77The following litmus test exhibits the ordering effects of these
78lock-acquisition and lock-release functions::
79
80    1 int x, y, z;
81    2
82    3 void task0(void)
83    4 {
84    5   raw_spin_lock_rcu_node(rnp);
85    6   WRITE_ONCE(x, 1);
86    7   r1 = READ_ONCE(y);
87    8   raw_spin_unlock_rcu_node(rnp);
88    9 }
89   10
90   11 void task1(void)
91   12 {
92   13   raw_spin_lock_rcu_node(rnp);
93   14   WRITE_ONCE(y, 1);
94   15   r2 = READ_ONCE(z);
95   16   raw_spin_unlock_rcu_node(rnp);
96   17 }
97   18
98   19 void task2(void)
99   20 {
100   21   WRITE_ONCE(z, 1);
101   22   smp_mb();
102   23   r3 = READ_ONCE(x);
103   24 }
104   25
105   26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
106
107The ``WARN_ON()`` is evaluated at "the end of time",
108after all changes have propagated throughout the system.
109Without the ``smp_mb__after_unlock_lock()`` provided by the
110acquisition functions, this ``WARN_ON()`` could trigger, for example
111on PowerPC.
112The ``smp_mb__after_unlock_lock()`` invocations prevent this
113``WARN_ON()`` from triggering.
114
115+-----------------------------------------------------------------------+
116| **Quick Quiz**:                                                       |
117+-----------------------------------------------------------------------+
118| But the chain of rcu_node-structure lock acquisitions guarantees      |
119| that new readers will see all of the updater's pre-grace-period       |
120| accesses and also guarantees that the updater's post-grace-period     |
121| accesses will see all of the old reader's accesses.  So why do we     |
122| need all of those calls to smp_mb__after_unlock_lock()?               |
123+-----------------------------------------------------------------------+
124| **Answer**:                                                           |
125+-----------------------------------------------------------------------+
126| Because we must provide ordering for RCU's polling grace-period       |
127| primitives, for example, get_state_synchronize_rcu() and              |
128| poll_state_synchronize_rcu().  Consider this code::                   |
129|                                                                       |
130|  CPU 0                                     CPU 1                      |
131|  ----                                      ----                       |
132|  WRITE_ONCE(X, 1)                          WRITE_ONCE(Y, 1)           |
133|  g = get_state_synchronize_rcu()           smp_mb()                   |
134|  while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)          |
135|          continue;                                                    |
136|  r0 = READ_ONCE(Y)                                                    |
137|                                                                       |
138| RCU guarantees that the outcome r0 == 0 && r1 == 0 will not           |
139| happen, even if CPU 1 is in an RCU extended quiescent state           |
140| (idle or offline) and thus won't interact directly with the RCU       |
141| core processing at all.                                               |
142+-----------------------------------------------------------------------+
143
144This approach must be extended to include idle CPUs, which need
145RCU's grace-period memory ordering guarantee to extend to any
146RCU read-side critical sections preceding and following the current
147idle sojourn.
148This case is handled by calls to the strongly ordered
149``atomic_add_return()`` read-modify-write atomic operation that
150is invoked within ``ct_kernel_exit_state()`` at idle-entry
151time and within ``ct_kernel_enter_state()`` at idle-exit time.
152The grace-period kthread invokes first ``ct_rcu_watching_cpu_acquire()``
153(preceded by a full memory barrier) and ``rcu_watching_snap_stopped_since()``
154(both of which rely on acquire semantics) to detect idle CPUs.
155
156+-----------------------------------------------------------------------+
157| **Quick Quiz**:                                                       |
158+-----------------------------------------------------------------------+
159| But what about CPUs that remain offline for the entire grace period?  |
160+-----------------------------------------------------------------------+
161| **Answer**:                                                           |
162+-----------------------------------------------------------------------+
163| Such CPUs will be offline at the beginning of the grace period, so    |
164| the grace period won't expect quiescent states from them. Races       |
165| between grace-period start and CPU-hotplug operations are mediated    |
166| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described    |
167| above.                                                                |
168+-----------------------------------------------------------------------+
169
170The approach must be extended to handle one final case, that of waking a
171task blocked in ``synchronize_rcu()``. This task might be affined to
172a CPU that is not yet aware that the grace period has ended, and thus
173might not yet be subject to the grace period's memory ordering.
174Therefore, there is an ``smp_mb()`` after the return from
175``wait_for_completion()`` in the ``synchronize_rcu()`` code path.
176
177+-----------------------------------------------------------------------+
178| **Quick Quiz**:                                                       |
179+-----------------------------------------------------------------------+
180| What? Where??? I don't see any ``smp_mb()`` after the return from     |
181| ``wait_for_completion()``!!!                                          |
182+-----------------------------------------------------------------------+
183| **Answer**:                                                           |
184+-----------------------------------------------------------------------+
185| That would be because I spotted the need for that ``smp_mb()`` during |
186| the creation of this documentation, and it is therefore unlikely to   |
187| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter     |
188| Zijlstra, and Jonathan Cameron for asking questions that sensitized   |
189| me to the rather elaborate sequence of events that demonstrate the    |
190| need for this memory barrier.                                         |
191+-----------------------------------------------------------------------+
192
193Tree RCU's grace--period memory-ordering guarantees rely most heavily on
194the ``rcu_node`` structure's ``->lock`` field, so much so that it is
195necessary to abbreviate this pattern in the diagrams in the next
196section. For example, consider the ``rcu_prepare_for_idle()`` function
197shown below, which is one of several functions that enforce ordering of
198newly arrived RCU callbacks against future grace periods:
199
200::
201
202    1 static void rcu_prepare_for_idle(void)
203    2 {
204    3   bool needwake;
205    4   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
206    5   struct rcu_node *rnp;
207    6   int tne;
208    7
209    8   lockdep_assert_irqs_disabled();
210    9   if (rcu_rdp_is_offloaded(rdp))
211   10     return;
212   11
213   12   /* Handle nohz enablement switches conservatively. */
214   13   tne = READ_ONCE(tick_nohz_active);
215   14   if (tne != rdp->tick_nohz_enabled_snap) {
216   15     if (!rcu_segcblist_empty(&rdp->cblist))
217   16       invoke_rcu_core(); /* force nohz to see update. */
218   17     rdp->tick_nohz_enabled_snap = tne;
219   18     return;
220   19	}
221   20   if (!tne)
222   21     return;
223   22
224   23   /*
225   24    * If we have not yet accelerated this jiffy, accelerate all
226   25    * callbacks on this CPU.
227   26   */
228   27   if (rdp->last_accelerate == jiffies)
229   28     return;
230   29   rdp->last_accelerate = jiffies;
231   30   if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
232   31     rnp = rdp->mynode;
233   32     raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
234   33     needwake = rcu_accelerate_cbs(rnp, rdp);
235   34     raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
236   35     if (needwake)
237   36       rcu_gp_kthread_wake();
238   37   }
239   38 }
240
241But the only part of ``rcu_prepare_for_idle()`` that really matters for
242this discussion are lines 32–34. We will therefore abbreviate this
243function as follows:
244
245.. kernel-figure:: rcu_node-lock.svg
246
247The box represents the ``rcu_node`` structure's ``->lock`` critical
248section, with the double line on top representing the additional
249``smp_mb__after_unlock_lock()``.
250
251Tree RCU Grace Period Memory Ordering Components
252~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
253
254Tree RCU's grace-period memory-ordering guarantee is provided by a
255number of RCU components:
256
257#. `Callback Registry`_
258#. `Grace-Period Initialization`_
259#. `Self-Reported Quiescent States`_
260#. `Dynamic Tick Interface`_
261#. `CPU-Hotplug Interface`_
262#. `Forcing Quiescent States`_
263#. `Grace-Period Cleanup`_
264#. `Callback Invocation`_
265
266Each of the following section looks at the corresponding component in
267detail.
268
269Callback Registry
270^^^^^^^^^^^^^^^^^
271
272If RCU's grace-period guarantee is to mean anything at all, any access
273that happens before a given invocation of ``call_rcu()`` must also
274happen before the corresponding grace period. The implementation of this
275portion of RCU's grace period guarantee is shown in the following
276figure:
277
278.. kernel-figure:: TreeRCU-callback-registry.svg
279
280Because ``call_rcu()`` normally acts only on CPU-local state, it
281provides no ordering guarantees, either for itself or for phase one of
282the update (which again will usually be removal of an element from an
283RCU-protected data structure). It simply enqueues the ``rcu_head``
284structure on a per-CPU list, which cannot become associated with a grace
285period until a later call to ``rcu_accelerate_cbs()``, as shown in the
286diagram above.
287
288One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
289via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
290current CPU is inundated with queued ``rcu_head`` structures) or more
291likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
292is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
293invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
294final code path on the right is taken only in kernels built with
295``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
296``rcu_advance_cbs()``, ``rcu_migrate_callbacks``,
297``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
298is invoked on a surviving CPU after the outgoing CPU has been completely
299offlined.
300
301There are a few other code paths within grace-period processing that
302opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
303all of the CPU's recently queued ``rcu_head`` structures are associated
304with a future grace-period number under the protection of the CPU's lead
305``rcu_node`` structure's ``->lock``. In all cases, there is full
306ordering against any prior critical section for that same ``rcu_node``
307structure's ``->lock``, and also full ordering against any of the
308current task's or CPU's prior critical sections for any ``rcu_node``
309structure's ``->lock``.
310
311The next section will show how this ordering ensures that any accesses
312prior to the ``call_rcu()`` (particularly including phase one of the
313update) happen before the start of the corresponding grace period.
314
315+-----------------------------------------------------------------------+
316| **Quick Quiz**:                                                       |
317+-----------------------------------------------------------------------+
318| But what about ``synchronize_rcu()``?                                 |
319+-----------------------------------------------------------------------+
320| **Answer**:                                                           |
321+-----------------------------------------------------------------------+
322| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
323| which invokes it. So either way, it eventually comes down to          |
324| ``call_rcu()``.                                                       |
325+-----------------------------------------------------------------------+
326
327Grace-Period Initialization
328^^^^^^^^^^^^^^^^^^^^^^^^^^^
329
330Grace-period initialization is carried out by the grace-period kernel
331thread, which makes several passes over the ``rcu_node`` tree within the
332``rcu_gp_init()`` function. This means that showing the full flow of
333ordering through the grace-period computation will require duplicating
334this tree. If you find this confusing, please note that the state of the
335``rcu_node`` changes over time, just like Heraclitus's river. However,
336to keep the ``rcu_node`` river tractable, the grace-period kernel
337thread's traversals are presented in multiple parts, starting in this
338section with the various phases of grace-period initialization.
339
340The first ordering-related grace-period initialization action is to
341advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
342counter, as shown below:
343
344.. kernel-figure:: TreeRCU-gp-init-1.svg
345
346The actual increment is carried out using ``smp_store_release()``, which
347helps reject false-positive RCU CPU stall detection. Note that only the
348root ``rcu_node`` structure is touched.
349
350The first pass through the ``rcu_node`` tree updates bitmasks based on
351CPUs having come online or gone offline since the start of the previous
352grace period. In the common case where the number of online CPUs for
353this ``rcu_node`` structure has not transitioned to or from zero, this
354pass will scan only the leaf ``rcu_node`` structures. However, if the
355number of online CPUs for a given leaf ``rcu_node`` structure has
356transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
357first incoming CPU. Similarly, if the number of online CPUs for a given
358leaf ``rcu_node`` structure has transitioned to zero,
359``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
360The diagram below shows the path of ordering if the leftmost
361``rcu_node`` structure onlines its first CPU and if the next
362``rcu_node`` structure has no online CPUs (or, alternatively if the
363leftmost ``rcu_node`` structure offlines its last CPU and if the next
364``rcu_node`` structure has no online CPUs).
365
366.. kernel-figure:: TreeRCU-gp-init-2.svg
367
368The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
369breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
370to the newly advanced value from the ``rcu_state`` structure, as shown
371in the following diagram.
372
373.. kernel-figure:: TreeRCU-gp-init-3.svg
374
375This change will also cause each CPU's next call to
376``__note_gp_changes()`` to notice that a new grace period has started,
377as described in the next section. But because the grace-period kthread
378started the grace period at the root (with the advancing of the
379``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
380``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
381the start of the grace period will happen after the actual start of the
382grace period.
383
384+-----------------------------------------------------------------------+
385| **Quick Quiz**:                                                       |
386+-----------------------------------------------------------------------+
387| But what about the CPU that started the grace period? Why wouldn't it |
388| see the start of the grace period right when it started that grace    |
389| period?                                                               |
390+-----------------------------------------------------------------------+
391| **Answer**:                                                           |
392+-----------------------------------------------------------------------+
393| In some deep philosophical and overly anthromorphized sense, yes, the |
394| CPU starting the grace period is immediately aware of having done so. |
395| However, if we instead assume that RCU is not self-aware, then even   |
396| the CPU starting the grace period does not really become aware of the |
397| start of this grace period until its first call to                    |
398| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
399| early notification because it invokes ``__note_gp_changes()`` during  |
400| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         |
401| structure.                                                            |
402+-----------------------------------------------------------------------+
403
404Self-Reported Quiescent States
405^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
406
407When all entities that might block the grace period have reported
408quiescent states (or as described in a later section, had quiescent
409states reported on their behalf), the grace period can end. Online
410non-idle CPUs report their own quiescent states, as shown in the
411following diagram:
412
413.. kernel-figure:: TreeRCU-qs.svg
414
415This is for the last CPU to report a quiescent state, which signals the
416end of the grace period. Earlier quiescent states would push up the
417``rcu_node`` tree only until they encountered an ``rcu_node`` structure
418that is waiting for additional quiescent states. However, ordering is
419nevertheless preserved because some later quiescent state will acquire
420that ``rcu_node`` structure's ``->lock``.
421
422Any number of events can lead up to a CPU invoking ``note_gp_changes``
423(or alternatively, directly invoking ``__note_gp_changes()``), at which
424point that CPU will notice the start of a new grace period while holding
425its leaf ``rcu_node`` lock. Therefore, all execution shown in this
426diagram happens after the start of the grace period. In addition, this
427CPU will consider any RCU read-side critical section that started before
428the invocation of ``__note_gp_changes()`` to have started before the
429grace period, and thus a critical section that the grace period must
430wait on.
431
432+-----------------------------------------------------------------------+
433| **Quick Quiz**:                                                       |
434+-----------------------------------------------------------------------+
435| But a RCU read-side critical section might have started after the     |
436| beginning of the grace period (the advancing of ``->gp_seq`` from     |
437| earlier), so why should the grace period wait on such a critical      |
438| section?                                                              |
439+-----------------------------------------------------------------------+
440| **Answer**:                                                           |
441+-----------------------------------------------------------------------+
442| It is indeed not necessary for the grace period to wait on such a     |
443| critical section. However, it is permissible to wait on it. And it is |
444| furthermore important to wait on it, as this lazy approach is far     |
445| more scalable than a “big bang” all-at-once grace-period start could  |
446| possibly be.                                                          |
447+-----------------------------------------------------------------------+
448
449If the CPU does a context switch, a quiescent state will be noted by
450``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
451takes a scheduler-clock interrupt while executing in usermode, a
452quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
453Either way, the passage through a quiescent state will be noted in a
454per-CPU variable.
455
456The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
457example, after the next scheduler-clock interrupt), ``rcu_core()`` will
458invoke ``rcu_check_quiescent_state()``, which will notice the recorded
459quiescent state, and invoke ``rcu_report_qs_rdp()``. If
460``rcu_report_qs_rdp()`` verifies that the quiescent state really does
461apply to the current grace period, it invokes ``rcu_report_rnp()`` which
462traverses up the ``rcu_node`` tree as shown at the bottom of the
463diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
464field, and propagating up the tree when the result is zero.
465
466Note that traversal passes upwards out of a given ``rcu_node`` structure
467only if the current CPU is reporting the last quiescent state for the
468subtree headed by that ``rcu_node`` structure. A key point is that if a
469CPU's traversal stops at a given ``rcu_node`` structure, then there will
470be a later traversal by another CPU (or perhaps the same one) that
471proceeds upwards from that point, and the ``rcu_node`` ``->lock``
472guarantees that the first CPU's quiescent state happens before the
473remainder of the second CPU's traversal. Applying this line of thought
474repeatedly shows that all CPUs' quiescent states happen before the last
475CPU traverses through the root ``rcu_node`` structure, the “last CPU”
476being the one that clears the last bit in the root ``rcu_node``
477structure's ``->qsmask`` field.
478
479Dynamic Tick Interface
480^^^^^^^^^^^^^^^^^^^^^^
481
482Due to energy-efficiency considerations, RCU is forbidden from
483disturbing idle CPUs. CPUs are therefore required to notify RCU when
484entering or leaving idle state, which they do via fully ordered
485value-returning atomic operations on a per-CPU variable. The ordering
486effects are as shown below:
487
488.. kernel-figure:: TreeRCU-dyntick.svg
489
490The RCU grace-period kernel thread samples the per-CPU idleness variable
491while holding the corresponding CPU's leaf ``rcu_node`` structure's
492``->lock``. This means that any RCU read-side critical sections that
493precede the idle period (the oval near the top of the diagram above)
494will happen before the end of the current grace period. Similarly, the
495beginning of the current grace period will happen before any RCU
496read-side critical sections that follow the idle period (the oval near
497the bottom of the diagram above).
498
499Plumbing this into the full grace-period execution is described
500`below <Forcing Quiescent States_>`__.
501
502CPU-Hotplug Interface
503^^^^^^^^^^^^^^^^^^^^^
504
505RCU is also forbidden from disturbing offline CPUs, which might well be
506powered off and removed from the system completely. CPUs are therefore
507required to notify RCU of their comings and goings as part of the
508corresponding CPU hotplug operations. The ordering effects are shown
509below:
510
511.. kernel-figure:: TreeRCU-hotplug.svg
512
513Because CPU hotplug operations are much less frequent than idle
514transitions, they are heavier weight, and thus acquire the CPU's leaf
515``rcu_node`` structure's ``->lock`` and update this structure's
516``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
517mask to detect CPUs having gone offline since the beginning of this
518grace period.
519
520Plumbing this into the full grace-period execution is described
521`below <Forcing Quiescent States_>`__.
522
523Forcing Quiescent States
524^^^^^^^^^^^^^^^^^^^^^^^^
525
526As noted above, idle and offline CPUs cannot report their own quiescent
527states, and therefore the grace-period kernel thread must do the
528reporting on their behalf. This process is called “forcing quiescent
529states”, it is repeated every few jiffies, and its ordering effects are
530shown below:
531
532.. kernel-figure:: TreeRCU-gp-fqs.svg
533
534Each pass of quiescent state forcing is guaranteed to traverse the leaf
535``rcu_node`` structures, and if there are no new quiescent states due to
536recently idled and/or offlined CPUs, then only the leaves are traversed.
537However, if there is a newly offlined CPU as illustrated on the left or
538a newly idled CPU as illustrated on the right, the corresponding
539quiescent state will be driven up towards the root. As with
540self-reported quiescent states, the upwards driving stops once it
541reaches an ``rcu_node`` structure that has quiescent states outstanding
542from other CPUs.
543
544+-----------------------------------------------------------------------+
545| **Quick Quiz**:                                                       |
546+-----------------------------------------------------------------------+
547| The leftmost drive to root stopped before it reached the root         |
548| ``rcu_node`` structure, which means that there are still CPUs         |
549| subordinate to that structure on which the current grace period is    |
550| waiting. Given that, how is it possible that the rightmost drive to   |
551| root ended the grace period?                                          |
552+-----------------------------------------------------------------------+
553| **Answer**:                                                           |
554+-----------------------------------------------------------------------+
555| Good analysis! It is in fact impossible in the absence of bugs in     |
556| RCU. But this diagram is complex enough as it is, so simplicity       |
557| overrode accuracy. You can think of it as poetic license, or you can  |
558| think of it as misdirection that is resolved in the                   |
559| `stitched-together diagram <Putting It All Together_>`__.             |
560+-----------------------------------------------------------------------+
561
562Grace-Period Cleanup
563^^^^^^^^^^^^^^^^^^^^
564
565Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
566advancing all the ``->gp_seq`` fields, then it advances the
567``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
568shown below:
569
570.. kernel-figure:: TreeRCU-gp-cleanup.svg
571
572As indicated by the oval at the bottom of the diagram, once grace-period
573cleanup is complete, the next grace period can begin.
574
575+-----------------------------------------------------------------------+
576| **Quick Quiz**:                                                       |
577+-----------------------------------------------------------------------+
578| But when precisely does the grace period end?                         |
579+-----------------------------------------------------------------------+
580| **Answer**:                                                           |
581+-----------------------------------------------------------------------+
582| There is no useful single point at which the grace period can be said |
583| to end. The earliest reasonable candidate is as soon as the last CPU  |
584| has reported its quiescent state, but it may be some milliseconds     |
585| before RCU becomes aware of this. The latest reasonable candidate is  |
586| once the ``rcu_state`` structure's ``->gp_seq`` field has been        |
587| updated, but it is quite possible that some CPUs have already         |
588| completed phase two of their updates by that time. In short, if you   |
589| are going to work with RCU, you need to learn to embrace uncertainty. |
590+-----------------------------------------------------------------------+
591
592Callback Invocation
593^^^^^^^^^^^^^^^^^^^
594
595Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
596been updated, that CPU can begin invoking its RCU callbacks that were
597waiting for this grace period to end. These callbacks are identified by
598``rcu_advance_cbs()``, which is usually invoked by
599``__note_gp_changes()``. As shown in the diagram below, this invocation
600can be triggered by the scheduling-clock interrupt
601(``rcu_sched_clock_irq()`` on the left) or by idle entry
602(``rcu_cleanup_after_idle()`` on the right, but only for kernels build
603with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
604raised, which results in ``rcu_do_batch()`` invoking the callbacks,
605which in turn allows those callbacks to carry out (either directly or
606indirectly via wakeup) the needed phase-two processing for each update.
607
608.. kernel-figure:: TreeRCU-callback-invocation.svg
609
610Please note that callback invocation can also be prompted by any number
611of corner-case code paths, for example, when a CPU notes that it has
612excessive numbers of callbacks queued. In all cases, the CPU acquires
613its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
614which preserves the required ordering against the newly completed grace
615period.
616
617However, if the callback function communicates to other CPUs, for
618example, doing a wakeup, then it is that function's responsibility to
619maintain ordering. For example, if the callback function wakes up a task
620that runs on some other CPU, proper ordering must in place in both the
621callback function and the task being awakened. To see why this is
622important, consider the top half of the `grace-period
623cleanup`_ diagram. The callback might be
624running on a CPU corresponding to the leftmost leaf ``rcu_node``
625structure, and awaken a task that is to run on a CPU corresponding to
626the rightmost leaf ``rcu_node`` structure, and the grace-period kernel
627thread might not yet have reached the rightmost leaf. In this case, the
628grace period's memory ordering might not yet have reached that CPU, so
629again the callback function and the awakened task must supply proper
630ordering.
631
632Putting It All Together
633~~~~~~~~~~~~~~~~~~~~~~~
634
635A stitched-together diagram is here:
636
637.. kernel-figure:: TreeRCU-gp.svg
638
639Legal Statement
640~~~~~~~~~~~~~~~
641
642This work represents the view of the author and does not necessarily
643represent the view of IBM.
644
645Linux is a registered trademark of Linus Torvalds.
646
647Other company, product, and service names may be trademarks or service
648marks of others.
649