xref: /linux/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst (revision 0ea5c948cb64bab5bc7a5516774eb8536f05aa0d)
=================================================
A Tour Through TREE_RCU's Expedited Grace Periods
=================================================

Introduction
============

This document describes RCU's expedited grace periods.
Unlike RCU's normal grace periods, which accept long latencies to attain
high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
Each of the two implementations is covered in its own section.

Expedited Grace Period Design
=============================

The expedited RCU grace periods cannot be accused of being subtle,
given that they for all intents and purposes hammer every CPU that
has not yet provided a quiescent state for the current expedited
grace period.
The one saving grace is that the hammer has grown a bit smaller
over time:  The old call to ``try_stop_cpus()`` has been
replaced with a set of calls to ``smp_call_function_single()``,
each of which results in an IPI to the target CPU.
The corresponding handler function checks the CPU's state, motivating
a faster quiescent state where possible, and triggering a report
of that quiescent state.
As always for RCU, once everything has spent some time in a quiescent
state, the expedited grace period has completed.

The details of the ``smp_call_function_single()`` handler's
operation depend on the RCU flavor, as described in the following
sections.

RCU-preempt Expedited Grace Periods
===================================

``CONFIG_PREEMPTION=y`` kernels implement RCU-preempt.
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:

.. kernel-figure:: ExpRCUFlow.svg

The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI
or a state that is reached after some time.

If a given CPU is offline or idle, ``synchronize_rcu_expedited()``
will ignore it because idle and offline CPUs are already residing
in quiescent states.
Otherwise, the expedited grace period will use
``smp_call_function_single()`` to send the CPU an IPI, which
is handled by ``rcu_exp_handler()``.

However, because this is preemptible RCU, ``rcu_exp_handler()``
can check to see if the CPU is currently running in an RCU read-side
critical section.
If not, the handler can immediately report a quiescent state.
Otherwise, it sets flags so that the outermost ``rcu_read_unlock()``
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.

Again because this is preemptible RCU, an RCU read-side critical section
can be preempted.
When that happens, RCU will enqueue the task, which will then continue to
block the current expedited grace period until it resumes and finds its
outermost ``rcu_read_unlock()``.
The CPU will report a quiescent state just after enqueuing the task because
the CPU is no longer blocking the grace period.
It is instead the preempted task doing the blocking.
The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
which is called from ``rcu_preempt_note_context_switch()``, which
in turn is called from ``rcu_note_context_switch()``, which in
turn is called from the scheduler.
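
The handler's decision logic can be sketched as a small user-space model.
All structure and function names below are illustrative assumptions made
for this sketch, not the kernel's actual data structures:

```c
/*
 * Minimal user-space model of the RCU-preempt expedited IPI handler's
 * decision logic. All names and fields are illustrative assumptions.
 */
#include <assert.h>
#include <stdbool.h>

struct model_preempt_task {
	int rcu_read_nesting;	/* depth of rcu_read_lock() nesting */
	bool deferred_qs;	/* report QS at outermost rcu_read_unlock() */
};

struct model_preempt_cpu {
	bool qs_reported;	/* quiescent state reported for this CPU */
};

/* Models rcu_exp_handler(): runs in IPI context on the target CPU. */
static void model_preempt_exp_handler(struct model_preempt_cpu *cpu,
				      struct model_preempt_task *t)
{
	if (t->rcu_read_nesting == 0)
		cpu->qs_reported = true;	/* not in a critical section */
	else
		t->deferred_qs = true;		/* defer to rcu_read_unlock() */
}

/* Models the outermost rcu_read_unlock() seeing the deferred-QS flag. */
static void model_preempt_outermost_unlock(struct model_preempt_cpu *cpu,
					   struct model_preempt_task *t)
{
	t->rcu_read_nesting = 0;
	if (t->deferred_qs) {
		t->deferred_qs = false;
		cpu->qs_reported = true;
	}
}
```

The point of the two-function split is exactly the fastpath argument made
above: the common-case ``rcu_read_unlock()`` pays nothing unless the flag
was set by an expedited grace period.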


+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not just have the expedited grace period check the state of all   |
| the CPUs? After all, that would avoid all those real-time-unfriendly  |
| IPIs.                                                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we want the RCU read-side critical sections to run fast,      |
| which means no memory barriers. Therefore, it is not possible to      |
| safely check the state from some other CPU. And even if it was        |
| possible to safely check the state, it would still be necessary to    |
| IPI the CPU to safely interact with the upcoming                      |
| ``rcu_read_unlock()`` invocation, which means that the remote state   |
| testing would not help the worst-case latency that real-time          |
| applications care about.                                              |
|                                                                       |
| One way to prevent your real-time application from getting hit with   |
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  |
| would then perceive the CPU running your application as being idle,   |
| and it would be able to safely detect that state without needing to   |
| IPI the CPU.                                                          |
+-----------------------------------------------------------------------+

Please note that this is just the overall flow: Additional complications
can arise due to races with CPUs going idle or offline, among other
things.

RCU-sched Expedited Grace Periods
=================================

``CONFIG_PREEMPTION=n`` kernels implement RCU-sched. The overall flow of
the handling of a given CPU by an RCU-sched expedited grace period is
shown in the following diagram:

.. kernel-figure:: ExpSchedFlow.svg

As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
offline and idle CPUs, again because they are in remotely detectable
quiescent states. However, because ``rcu_read_lock_sched()`` and
``rcu_read_unlock_sched()`` leave no trace of their invocation, in
general it is not possible to tell whether or not the current CPU is in
an RCU read-side critical section. The best that RCU-sched's
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
that the CPU went idle while the IPI was in flight. If the CPU is idle,
then ``rcu_exp_handler()`` reports the quiescent state.

Otherwise, the handler forces a future context switch by setting the
NEED_RESCHED flag in the current task's thread-info flags and in the
CPU's preempt counter. At the time of the context switch, the CPU reports
the quiescent state. Should the CPU go offline first, it will report the
quiescent state at that time.
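
That two-step report (handler forces a reschedule, context switch reports
the quiescent state) can be modeled in a few lines. The names and the
single boolean standing in for the NEED_RESCHED/preempt-counter pair are
assumptions of this sketch, not kernel code:

```c
/*
 * Minimal user-space model of RCU-sched's expedited handler. Field
 * and function names are illustrative assumptions only.
 */
#include <assert.h>
#include <stdbool.h>

struct model_sched_cpu {
	bool idle;		/* IPI interrupted the idle loop */
	bool resched_pending;	/* models NEED_RESCHED + preempt-counter poke */
	bool qs_reported;
};

/* Models rcu_exp_handler() for RCU-sched, running in IPI context. */
static void model_sched_exp_handler(struct model_sched_cpu *cpu)
{
	if (cpu->idle)
		cpu->qs_reported = true;	/* idle is already a QS */
	else
		cpu->resched_pending = true;	/* force a context switch */
}

/* Models the eventual context switch reporting the quiescent state. */
static void model_sched_context_switch(struct model_sched_cpu *cpu)
{
	if (cpu->resched_pending) {
		cpu->resched_pending = false;
		cpu->qs_reported = true;
	}
}
```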

Expedited Grace Period and CPU Hotplug
======================================

The expedited nature of expedited grace periods requires a much tighter
interaction with CPU-hotplug operations than is required for normal
grace periods. In addition, attempting to IPI offline CPUs will result
in splats, but failing to IPI online CPUs can result in too-short grace
periods. Neither option is acceptable in production kernels.

The interaction between expedited grace periods and CPU-hotplug
operations is carried out at several levels:

#. The number of CPUs that have ever been online is tracked by the
   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
   structure's ``->ncpus_snap`` field tracks the number of CPUs that
   have ever been online at the beginning of an RCU expedited grace
   period. Note that this number never decreases, at least in the
   absence of a time machine.
#. The identities of the CPUs that have ever been online are tracked by
   the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
   identities of the CPUs that were online at least once at the
   beginning of the most recent RCU expedited grace period. The
   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
   used to detect when new CPUs have come online for the first time,
   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
   field has changed since the beginning of the last RCU expedited grace
   period, which triggers an update of each ``rcu_node`` structure's
   ``->expmaskinit`` field from its ``->expmaskinitnext`` field.
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
   initialize that structure's ``->expmask`` at the beginning of each
   RCU expedited grace period. This means that only those CPUs that have
   been online at least once will be considered for a given grace
   period.
#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
   clear can safely be ignored. However, it is possible for a CPU coming
   online or going offline to have this bit set for some time while
   ``cpu_online`` returns ``false``.
#. For each non-idle CPU that RCU believes is currently online, the
   grace period invokes ``smp_call_function_single()``. If this
   succeeds, the CPU was fully online. Failure indicates that the CPU is
   in the process of coming online or going offline, in which case it is
   necessary to wait for a short time period and try again. The purpose
   of this wait (or series of waits, as the case may be) is to permit a
   concurrent CPU-hotplug operation to complete.
#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
   to invoke ``rcutree_report_cpu_dead()``, which reports a quiescent state for
   that CPU. However, this is likely paranoia-induced redundancy.
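
The counter-and-mask bookkeeping in the first two items above can be
sketched as follows. The structure layouts are simplified assumptions
(only the fields named in the text appear, and a single ``rcu_node`` is
modeled rather than a tree):

```c
/*
 * Sketch of the ->ncpus/->ncpus_snap gating of ->expmaskinitnext
 * propagation into ->expmaskinit. Illustrative model, not kernel code.
 */
#include <assert.h>
#include <stdint.h>

struct model_rcu_node {
	uint64_t expmaskinitnext;	/* CPUs that have ever been online */
	uint64_t expmaskinit;		/* snapshot used by expedited GPs */
};

struct model_rcu_state {
	int ncpus;			/* CPUs that have ever been online */
	int ncpus_snap;			/* ->ncpus at the last expedited GP */
};

/* Models a CPU coming online for the first time. */
static void model_cpu_first_online(struct model_rcu_state *rsp,
				   struct model_rcu_node *rnp, int cpu)
{
	rnp->expmaskinitnext |= 1ULL << cpu;
	rsp->ncpus++;
}

/* Models expedited-GP start: propagate lazily recorded hotplug events. */
static void model_exp_gp_init(struct model_rcu_state *rsp,
			      struct model_rcu_node *rnp)
{
	if (rsp->ncpus != rsp->ncpus_snap) {	/* new CPU(s) since last GP */
		rsp->ncpus_snap = rsp->ncpus;
		rnp->expmaskinit = rnp->expmaskinitnext;
	}
}
```

Note how the comparison of the two counters lets the common case (no new
CPUs since the last expedited grace period) skip the mask update entirely.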

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why all the dancing around with multiple counters and masks tracking  |
| CPUs that were once online? Why not just have a single set of masks   |
| tracking the currently online CPUs and be done with it?               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Maintaining a single set of masks tracking the online CPUs *sounds*   |
| easier, at least until you try working out all the race conditions    |
| between grace-period initialization and CPU-hotplug operations. For   |
| example, suppose initialization is progressing down the tree while a  |
| CPU-offline operation is progressing up the tree. This situation can  |
| result in bits set at the top of the tree that have no counterparts   |
| at the bottom of the tree. Those bits will never be cleared, which    |
| will result in grace-period hangs. In short, that way lies madness,   |
| to say nothing of a great many bugs, hangs, and deadlocks.            |
| In contrast, the current multi-mask multi-counter scheme ensures that |
| grace-period initialization will always see consistent masks up and   |
| down the tree, which brings significant simplifications over the      |
| single-mask method.                                                   |
|                                                                       |
| This is an instance of `deferring work in order to avoid              |
| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
| ports/reports-1992/cucs-039-92.ps.gz>`__.                             |
| Lazily recording CPU-hotplug events at the beginning of the next      |
| grace period greatly simplifies maintenance of the CPU-tracking       |
| bitmasks in the ``rcu_node`` tree.                                    |
+-----------------------------------------------------------------------+

Expedited Grace Period Refinements
==================================

Idle-CPU Checks
~~~~~~~~~~~~~~~

Each expedited grace period checks for idle CPUs when initially forming
the mask of CPUs to be IPIed and again just before IPIing a CPU (both
checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
idle at any time between those two times, the CPU will not be IPIed.
Instead, the task pushing the grace period forward will include the idle
CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.

For RCU-sched, there is an additional check: If the IPI has interrupted
the idle loop, then ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state.

For RCU-preempt, there is no specific check for idle in the IPI handler
(``rcu_exp_handler()``), but because RCU read-side critical sections are
not permitted within the idle loop, if ``rcu_exp_handler()`` sees that
the CPU is within an RCU read-side critical section, the CPU cannot
possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
regardless of whether or not that quiescent state was due to the CPU
being idle.

In summary, RCU expedited grace periods check for idle when building the
bitmask of CPUs that must be IPIed, just before sending each IPI, and
(either explicitly or implicitly) within the IPI handler.

Batching via Sequence Counter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If each grace-period request were carried out separately, expedited grace
periods would have abysmal scalability and problematic high-load
characteristics. Because each grace-period operation can serve an
unlimited number of updates, it is important to *batch* requests, so
that a single expedited grace-period operation will cover all requests
in the corresponding batch.

This batching is controlled by a sequence counter named
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
has an odd value when there is an expedited grace period in progress and
an even value otherwise, so that dividing the counter value by two gives
the number of completed grace periods. During any given update request,
the counter must transition from even to odd and then back to even, thus
indicating that a grace period has elapsed. Therefore, if the initial
value of the counter is ``s``, the updater must wait until the counter
reaches at least the value ``(s+3)&~0x1``. This counter is managed by
the following access functions:

#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
   grace period.
#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
   period.
#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
   grace period has elapsed since the corresponding call to
   ``rcu_exp_gp_seq_snap()``.
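
The counter arithmetic can be modeled stand-alone. The function names
below echo the access functions listed above, but this is an illustrative
sketch of the described odd/even scheme, not the kernel implementation:

```c
/*
 * Stand-alone model of the ->expedited_sequence arithmetic described
 * in the text. Illustrative only.
 */
#include <assert.h>
#include <stdbool.h>

static unsigned long model_expedited_sequence; /* even: idle, odd: GP running */

static void model_exp_gp_seq_start(void)
{
	model_expedited_sequence++;	/* now odd */
}

static void model_exp_gp_seq_end(void)
{
	model_expedited_sequence++;	/* now even */
}

/* Smallest counter value proving a full GP elapsed after the snapshot. */
static unsigned long model_exp_gp_seq_snap(void)
{
	return (model_expedited_sequence + 3) & ~0x1UL;
}

static bool model_exp_gp_seq_done(unsigned long s)
{
	return model_expedited_sequence >= s;
}
```

Starting from an idle counter of zero, the snapshot is ``(0+3)&~0x1 = 2``,
which is reached only after one full start/end pair; a snapshot taken
while a grace period is already in progress yields a larger target, so
the updater waits out the in-flight grace period plus a complete new one.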

Again, only one request in a given batch need actually carry out a
grace-period operation, which means there must be an efficient way to
identify which of many concurrent requests will initiate the grace
period, and that there be an efficient way for the remaining requests to
wait for that grace period to complete. However, that is the topic of
the next section.

Funnel Locking and Wait/Wakeup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The natural way to sort out which of a batch of updaters will initiate
the expedited grace period is to use the ``rcu_node`` combining tree, as
implemented by the ``exp_funnel_lock()`` function. The first updater
corresponding to a given grace period arriving at a given ``rcu_node``
structure records its desired grace-period sequence number in the
``->exp_seq_rq`` field and moves up to the next level in the tree.
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
number for the desired grace period or some later one, the updater
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
second-from-bottom and third-from-bottom bits as an index. An
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
to these fields.
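
A minimal sketch of that wait-queue indexing, inferred from the bit
description above (the helper name is hypothetical and the expression is
an assumption of this sketch, not the kernel's exact code):

```c
/*
 * The second- and third-from-bottom bits of the desired sequence
 * number select one of the four wait queues; the bottom bit merely
 * flags a grace period in progress. Illustrative model only.
 */
#include <assert.h>

static int model_exp_wq_index(unsigned long s)
{
	return (s >> 1) & 0x3;
}
```

With only four queues the index wraps every four grace periods, which is
why (as discussed at the end of this section) the next grace period must
not start until the previous one's wakeups have completed.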

An empty ``rcu_node`` tree is shown in the following diagram, with the
white cells representing the ``->exp_seq_rq`` field and the red cells
representing the elements of the ``->exp_wq[]`` array.

.. kernel-figure:: Funnel0.svg

The next diagram shows the situation after the arrival of Task A and
Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
respectively. The current value of the ``rcu_state`` structure's
``->expedited_sequence`` field is zero, so adding three and clearing the
bottom bit results in the value two, which both tasks record in the
``->exp_seq_rq`` field of their respective ``rcu_node`` structures:

.. kernel-figure:: Funnel1.svg

Each of Tasks A and B will move up to the root ``rcu_node`` structure.
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:

.. kernel-figure:: Funnel2.svg

Task A now advances to initiate a new grace period, while Task B moves
up to the root ``rcu_node`` structure, and, seeing that its desired
sequence number is already recorded, blocks on ``->exp_wq[1]``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
| sequence number is two, shouldn't they instead block on               |
| ``->exp_wq[2]``?                                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No.                                                                   |
| Recall that the bottom bit of the desired sequence number indicates   |
| whether or not a grace period is currently in progress. It is         |
| therefore necessary to shift the sequence number right one bit        |
| position to obtain the number of the grace period. This results in    |
| ``->exp_wq[1]``.                                                      |
+-----------------------------------------------------------------------+

If Tasks C and D also arrive at this point, they will compute the same
desired grace-period sequence number, and see that both leaf
``rcu_node`` structures already have that value recorded. They will
therefore block on their respective ``rcu_node`` structures'
``->exp_wq[1]`` fields, as shown below:

.. kernel-figure:: Funnel3.svg

Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
initiates the grace period, which increments ``->expedited_sequence``.
Therefore, if Tasks E and F arrive, they will compute a desired sequence
number of 4 and will record this value as shown below:

.. kernel-figure:: Funnel4.svg

Tasks E and F will propagate up the ``rcu_node`` combining tree, with
Task F blocking on the root ``rcu_node`` structure and Task E waiting
for Task A to finish so that it can start the next grace period. The
resulting state is as shown below:

.. kernel-figure:: Funnel5.svg

Once the grace period completes, Task A starts waking up the tasks
waiting for this grace period to complete, increments
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
releases the ``->exp_mutex``. This results in the following state:

.. kernel-figure:: Funnel6.svg

Task E can then acquire ``->exp_mutex`` and increment
``->expedited_sequence`` to the value three. If new tasks G and H arrive
and move up the combining tree at the same time, the state will be as
follows:

.. kernel-figure:: Funnel7.svg
377ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel7.svg
378ccc9971eSMauro Carvalho Chehab
379ccc9971eSMauro Carvalho ChehabNote that three of the root ``rcu_node`` structure's waitqueues are now
380ccc9971eSMauro Carvalho Chehaboccupied. However, at some point, Task A will wake up the tasks blocked
381ccc9971eSMauro Carvalho Chehabon the ``->exp_wq`` waitqueues, resulting in the following state:
382ccc9971eSMauro Carvalho Chehab
383ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel8.svg
384ccc9971eSMauro Carvalho Chehab
385ccc9971eSMauro Carvalho ChehabExecution will continue with Tasks E and H completing their grace
386ccc9971eSMauro Carvalho Chehabperiods and carrying out their wakeups.
387ccc9971eSMauro Carvalho Chehab
388ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
389ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
390ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
391ccc9971eSMauro Carvalho Chehab| What happens if Task A takes so long to do its wakeups that Task E's  |
392ccc9971eSMauro Carvalho Chehab| grace period completes?                                               |
393ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
394ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
395ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
396ccc9971eSMauro Carvalho Chehab| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
397ccc9971eSMauro Carvalho Chehab| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
398ccc9971eSMauro Carvalho Chehab| the next grace period from starting. This last is important in        |
399ccc9971eSMauro Carvalho Chehab| preventing overflow of the ``->exp_wq[]`` array.                      |
400ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
401ccc9971eSMauro Carvalho Chehab
402ccc9971eSMauro Carvalho ChehabUse of Workqueues
403ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~
404ccc9971eSMauro Carvalho Chehab
405ccc9971eSMauro Carvalho ChehabIn earlier implementations, the task requesting the expedited grace
406ccc9971eSMauro Carvalho Chehabperiod also drove it to completion. This straightforward approach had
407ccc9971eSMauro Carvalho Chehabthe disadvantage of needing to account for POSIX signals sent to user
408c4af9e00SRandy Dunlaptasks, so more recent implementations use the Linux kernel's
409404147faSAkira Yokosawaworkqueues (see Documentation/core-api/workqueue.rst).
410ccc9971eSMauro Carvalho Chehab
411ccc9971eSMauro Carvalho ChehabThe requesting task still does counter snapshotting and funnel-lock
412ccc9971eSMauro Carvalho Chehabprocessing, but the task reaching the top of the funnel lock does a
413ccc9971eSMauro Carvalho Chehab``schedule_work()`` (from ``_synchronize_rcu_expedited()`` so that a
414ccc9971eSMauro Carvalho Chehabworkqueue kthread does the actual grace-period processing. Because
415ccc9971eSMauro Carvalho Chehabworkqueue kthreads do not accept POSIX signals, grace-period-wait
416ccc9971eSMauro Carvalho Chehabprocessing need not allow for POSIX signals. In addition, this approach
417ccc9971eSMauro Carvalho Chehaballows wakeups for the previous expedited grace period to be overlapped
418ccc9971eSMauro Carvalho Chehabwith processing for the next expedited grace period. Because there are
419ccc9971eSMauro Carvalho Chehabonly four sets of waitqueues, it is necessary to ensure that the
420ccc9971eSMauro Carvalho Chehabprevious grace period's wakeups complete before the next grace period's
421ccc9971eSMauro Carvalho Chehabwakeups start. This is handled by having the ``->exp_mutex`` guard
422ccc9971eSMauro Carvalho Chehabexpedited grace-period processing and the ``->exp_wake_mutex`` guard
423ccc9971eSMauro Carvalho Chehabwakeups. The key point is that the ``->exp_mutex`` is not released until
424ccc9971eSMauro Carvalho Chehabthe first wakeup is complete, which means that the ``->exp_wake_mutex``
425ccc9971eSMauro Carvalho Chehabhas already been acquired at that point. This approach ensures that the
426ccc9971eSMauro Carvalho Chehabprevious grace period's wakeups can be carried out while the current
427ccc9971eSMauro Carvalho Chehabgrace period is in process, but that these wakeups will complete before
428ccc9971eSMauro Carvalho Chehabthe next grace period starts. This means that only three waitqueues are
429ccc9971eSMauro Carvalho Chehabrequired, guaranteeing that the four that are provided are sufficient.
430ccc9971eSMauro Carvalho Chehab
431ccc9971eSMauro Carvalho ChehabStall Warnings
432ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~
433ccc9971eSMauro Carvalho Chehab
434ccc9971eSMauro Carvalho ChehabExpediting grace periods does nothing to speed things up when RCU
435ccc9971eSMauro Carvalho Chehabreaders take too long, and therefore expedited grace periods check for
436ccc9971eSMauro Carvalho Chehabstalls just as normal grace periods do.
437ccc9971eSMauro Carvalho Chehab
438ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
439ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
440ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
441ccc9971eSMauro Carvalho Chehab| But why not just let the normal grace-period machinery detect the     |
442ccc9971eSMauro Carvalho Chehab| stalls, given that a given reader must block both normal and          |
443ccc9971eSMauro Carvalho Chehab| expedited grace periods?                                              |
444ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
445ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
446ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
447ccc9971eSMauro Carvalho Chehab| Because it is quite possible that at a given time there is no normal  |
448ccc9971eSMauro Carvalho Chehab| grace period in progress, in which case the normal grace period       |
449ccc9971eSMauro Carvalho Chehab| cannot emit a stall warning.                                          |
450ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
451ccc9971eSMauro Carvalho Chehab
452ccc9971eSMauro Carvalho ChehabThe ``synchronize_sched_expedited_wait()`` function loops waiting for
453ccc9971eSMauro Carvalho Chehabthe expedited grace period to end, but with a timeout set to the current
454ccc9971eSMauro Carvalho ChehabRCU CPU stall-warning time. If this time is exceeded, any CPUs or
455ccc9971eSMauro Carvalho Chehab``rcu_node`` structures blocking the current grace period are printed.
456ccc9971eSMauro Carvalho ChehabEach stall warning results in another pass through the loop, but the
457ccc9971eSMauro Carvalho Chehabsecond and subsequent passes use longer stall times.
458ccc9971eSMauro Carvalho Chehab
459ccc9971eSMauro Carvalho ChehabMid-boot operation
460ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~
461ccc9971eSMauro Carvalho Chehab
462ccc9971eSMauro Carvalho ChehabThe use of workqueues has the advantage that the expedited grace-period
463ccc9971eSMauro Carvalho Chehabcode need not worry about POSIX signals. Unfortunately, it has the
464ccc9971eSMauro Carvalho Chehabcorresponding disadvantage that workqueues cannot be used until they are
465ccc9971eSMauro Carvalho Chehabinitialized, which does not happen until some time after the scheduler
466ccc9971eSMauro Carvalho Chehabspawns the first task. Given that there are parts of the kernel that
467ccc9971eSMauro Carvalho Chehabreally do want to execute grace periods during this mid-boot “dead
468c4af9e00SRandy Dunlapzone”, expedited grace periods must do something else during this time.
469ccc9971eSMauro Carvalho Chehab
470ccc9971eSMauro Carvalho ChehabWhat they do is to fall back to the old practice of requiring that the
471ccc9971eSMauro Carvalho Chehabrequesting task drive the expedited grace period, as was the case before
472ccc9971eSMauro Carvalho Chehabthe use of workqueues. However, the requesting task is only required to
473ccc9971eSMauro Carvalho Chehabdrive the grace period during the mid-boot dead zone. Before mid-boot, a
474ccc9971eSMauro Carvalho Chehabsynchronous grace period is a no-op. Some time after mid-boot,
475ccc9971eSMauro Carvalho Chehabworkqueues are used.
476ccc9971eSMauro Carvalho Chehab
477ccc9971eSMauro Carvalho ChehabNon-expedited non-SRCU synchronous grace periods must also operate
478ccc9971eSMauro Carvalho Chehabnormally during mid-boot. This is handled by causing non-expedited grace
479ccc9971eSMauro Carvalho Chehabperiods to take the expedited code path during mid-boot.
480ccc9971eSMauro Carvalho Chehab
481ccc9971eSMauro Carvalho ChehabThe current code assumes that there are no POSIX signals during the
482ccc9971eSMauro Carvalho Chehabmid-boot dead zone. However, if an overwhelming need for POSIX signals
483ccc9971eSMauro Carvalho Chehabsomehow arises, appropriate adjustments can be made to the expedited
484ccc9971eSMauro Carvalho Chehabstall-warning code. One such adjustment would reinstate the
485ccc9971eSMauro Carvalho Chehabpre-workqueue stall-warning checks, but only during the mid-boot dead
486ccc9971eSMauro Carvalho Chehabzone.
487ccc9971eSMauro Carvalho Chehab
488ccc9971eSMauro Carvalho ChehabWith this refinement, synchronous grace periods can now be used from
489ccc9971eSMauro Carvalho Chehabtask context pretty much any time during the life of the kernel. That
490ccc9971eSMauro Carvalho Chehabis, aside from some points in the suspend, hibernate, or shutdown code
491ccc9971eSMauro Carvalho Chehabpath.
492ccc9971eSMauro Carvalho Chehab
493ccc9971eSMauro Carvalho ChehabSummary
494ccc9971eSMauro Carvalho Chehab~~~~~~~
495ccc9971eSMauro Carvalho Chehab
496ccc9971eSMauro Carvalho ChehabExpedited grace periods use a sequence-number approach to promote
497ccc9971eSMauro Carvalho Chehabbatching, so that a single grace-period operation can serve numerous
498ccc9971eSMauro Carvalho Chehabrequests. A funnel lock is used to efficiently identify the one task out
499ccc9971eSMauro Carvalho Chehabof a concurrent group that will request the grace period. All members of
500ccc9971eSMauro Carvalho Chehabthe group will block on waitqueues provided in the ``rcu_node``
501ccc9971eSMauro Carvalho Chehabstructure. The actual grace-period processing is carried out by a
502ccc9971eSMauro Carvalho Chehabworkqueue.
503ccc9971eSMauro Carvalho Chehab
504ccc9971eSMauro Carvalho ChehabCPU-hotplug operations are noted lazily in order to prevent the need for
505ccc9971eSMauro Carvalho Chehabtight synchronization between expedited grace periods and CPU-hotplug
506ccc9971eSMauro Carvalho Chehaboperations. The dyntick-idle counters are used to avoid sending IPIs to
507ccc9971eSMauro Carvalho Chehabidle CPUs, at least in the common case. RCU-preempt and RCU-sched use
508ccc9971eSMauro Carvalho Chehabdifferent IPI handlers and different code to respond to the state
509ccc9971eSMauro Carvalho Chehabchanges carried out by those handlers, but otherwise use common code.
510ccc9971eSMauro Carvalho Chehab
511ccc9971eSMauro Carvalho ChehabQuiescent states are tracked using the ``rcu_node`` tree, and once all
512ccc9971eSMauro Carvalho Chehabnecessary quiescent states have been reported, all tasks waiting on this
513ccc9971eSMauro Carvalho Chehabexpedited grace period are awakened. A pair of mutexes are used to allow
514ccc9971eSMauro Carvalho Chehabone grace period's wakeups to proceed concurrently with the next grace
515ccc9971eSMauro Carvalho Chehabperiod's processing.
516ccc9971eSMauro Carvalho Chehab
517ccc9971eSMauro Carvalho ChehabThis combination of mechanisms allows expedited grace periods to run
518ccc9971eSMauro Carvalho Chehabreasonably efficiently. However, for non-time-critical tasks, normal
519ccc9971eSMauro Carvalho Chehabgrace periods should be used instead because their longer duration
520ccc9971eSMauro Carvalho Chehabpermits much higher degrees of batching, and thus much lower per-request
521ccc9971eSMauro Carvalho Chehaboverheads.
522