
Data-Structure Relationships
============================
.. kernel-figure:: BigTreeClassicRCU.svg

For example, a number of Linux distributions set ``NR_CPUS=4096``,
which results in a three-level ``rcu_node`` tree.

The purpose of this combining tree is to allow per-CPU events
such as quiescent states, dyntick-idle transitions,
and CPU-hotplug operations to be processed efficiently and scalably.
Quiescent states are recorded by the per-CPU ``rcu_data`` structures,
and other events are recorded by the leaf-level ``rcu_node``
structures. All of these events are combined at each level of the tree
until finally grace periods are completed at the tree's root
``rcu_node`` structure.

As can be seen from the diagram, on a 64-bit system
a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
of 64 at the root and a fanout of 16 at the leaves.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why isn't the fanout at the leaf level also 64?                       |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because there are more types of events that affect the leaf-level     |
| ``rcu_node`` structures than affect the root and internal levels.     |
| Therefore, if the leaf ``rcu_node`` structures have a fanout of 64,   |
| the contention on                                                     |
| these structures' ``->lock`` fields becomes excessive. Experimentation|
| on a wide variety of systems has shown that a fanout of 16 works      |
| well for the leaf level of the ``rcu_node`` tree. Of course, further  |
| experience with systems having hundreds or                            |
| thousands of CPUs may demonstrate that the fanout for the non-leaf    |
| ``rcu_node`` structures must also be reduced. Such reduction can be   |
| easily carried out when and if it proves necessary. In the meantime,  |
| if you are worried about such contention                              |
| on the non-leaf ``rcu_node`` structures, you may use the              |
| ``rcu_fanout`` kernel boot parameter to reduce the                    |
| non-leaf fanout as needed.                                            |
+-----------------------------------------------------------------------+

If your system has more than 1,024 CPUs (or more than 512 CPUs on a
32-bit system), then RCU will automatically add more levels to the tree.
For example, if you are crazy enough to build a 64-bit system with
65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows:

.. kernel-figure:: HugeTreeClassicRCU.svg

RCU currently permits up to a four-level tree, which on a 64-bit system
accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
32-bit systems. On the other hand, you can set both
``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as
2, which would result in a 16-CPU test using a 4-level tree. This can be
useful for testing large-system capabilities on small test machines.

This multi-level combining tree allows us to get most of the performance
and scalability benefits of partitioning, even though RCU grace-period
detection is inherently a global operation. The trick here is that only
the last CPU to report a quiescent state into a given ``rcu_node``
structure need advance to the ``rcu_node`` structure at the next level
up the tree. This means that at the leaf-level ``rcu_node`` structure,
only one access out of sixteen will progress up the tree. For the
internal ``rcu_node`` structures, the situation is even more extreme:
Only one access out of sixty-four will progress up the tree. Because the
vast majority of the CPUs do not progress up the tree, the lock
contention remains roughly constant up the tree. No matter how many CPUs
there are in the system, at most 64 quiescent-state reports per grace
period will progress all the way to the root ``rcu_node`` structure,
given the fanout of at most 64 at the root.

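The following is a minimal userspace sketch, not kernel code, of this
"only the last reporter advances" trick; the structure and function
names are illustrative assumptions, and the tree has just one leaf::

   /*
    * Toy model of quiescent-state propagation: each reporter clears
    * its bit in the leaf's mask, and only the report that clears the
    * last bit proceeds to the parent node.
    */
   #include <stdio.h>

   #define LEAF_FANOUT 16

   struct toy_node {
           unsigned long qsmask;   /* children/CPUs still owing a QS */
           unsigned long grpmask;  /* this node's bit in parent's qsmask */
           struct toy_node *parent;
   };

   static void toy_report_qs(struct toy_node *np, unsigned long mask)
   {
           for (; np; np = np->parent) {
                   np->qsmask &= ~mask;
                   if (np->qsmask)
                           return;         /* others still pending */
                   mask = np->grpmask;     /* last reporter: go up */
           }
           printf("all quiescent states seen at the root\n");
   }

   int main(void)
   {
           struct toy_node root = { .qsmask = 0x1 };
           struct toy_node leaf = {
                   .qsmask = (1UL << LEAF_FANOUT) - 1,
                   .grpmask = 0x1,
                   .parent = &root,
           };

           /* Only the sixteenth report reaches the root. */
           for (int cpu = 0; cpu < LEAF_FANOUT; cpu++)
                   toy_report_qs(&leaf, 1UL << cpu);
           return 0;
   }
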
.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg
Awareness of the start and end of each grace period propagates slowly
through the data structures, and this slow propagation is
absolutely necessary for RCU to have good read-side performance. If this
balkanized implementation seems foreign to you, one useful trick is to
consider each instance of these data structures to be a different
person, each having the usual slightly different view of reality.

#. ``rcu_state``: This structure forms the interconnected system of
   ``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
   serves as short-term repository for callbacks orphaned by CPU-hotplug
   events, maintains ``rcu_barrier()`` state, tracks expedited
   grace-period state, and maintains state used to force quiescent
   states when grace periods extend too long.
#. ``rcu_node``: This structure forms the combining tree that propagates
   quiescent-state information from the leaves to the root, and also
   propagates grace-period information from the root to the leaves. It
   provides local copies of the grace-period state in order to allow
   this information to be accessed in a synchronized manner without
   suffering the scalability limitations that would otherwise be imposed
   by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
   lists of tasks that have blocked while in their current RCU read-side
   critical section. In ``CONFIG_PREEMPT_RCU`` kernels with
   ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
   priority-boosting kernel threads (kthreads) and state. Finally, it
   records CPU-hotplug state in order to determine which CPUs should be
   ignored during a given grace period.
#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
   detection and RCU callback queuing. It also tracks its relationship
   to the corresponding leaf ``rcu_node`` structure to allow
   more-efficient propagation of quiescent states up the ``rcu_node``
   combining tree. Like the ``rcu_node`` structure, it provides a local
   copy of the grace-period information to allow for-free synchronized
   access to this information from the corresponding CPU. Finally, this
   structure records past dyntick-idle state for the corresponding CPU
   and also tracks statistics.
#. ``rcu_head``: This structure represents RCU callbacks, and is the
   only structure allocated and managed by RCU users. The ``rcu_head``
   structure is normally embedded within the RCU-protected data
   structure.

The ``rcu_state`` Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``rcu_state`` structure is the base structure that represents the
state of RCU in the system. Among other things, this structure tracks
grace
periods, contains the lock used to synchronize with CPU-hotplug events,
and provides the interconnections among the ``rcu_node`` and
``rcu_data`` structures.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Wait a minute! You said that the ``rcu_node`` structures formed a     |
| tree, but they are declared as a flat array! What gives?              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| The tree is laid out in the array. The first node in the array is     |
| the head, the next set of nodes in the array are children of the      |
| head node, and so on until the last set of nodes in the array are     |
| the leaves. See the following diagrams to see how this works.         |
+-----------------------------------------------------------------------+

The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
in the following figure:

.. kernel-figure:: TreeMapping.svg

One interesting consequence of this mapping is that a breadth-first
traversal of the tree is implemented as a simple linear scan of the
array, which is in fact what the ``rcu_for_each_node_breadth_first()``
macro does.

Each entry of the ``->level`` array references the first ``rcu_node``
structure on the corresponding level of the tree, for example, as shown
below:

.. kernel-figure:: TreeMappingLevel.svg

For whatever it is worth, if you draw the tree to be tree-shaped rather
than array-shaped, it is easy to draw a planar representation:

.. kernel-figure:: TreeLevel.svg

Finally, the ``->rda`` field references a per-CPU pointer to the
corresponding CPU's ``rcu_data`` structure.

Grace-Period Tracking
'''''''''''''''''''''
RCU grace periods are numbered, and the ``->gp_seq`` field contains the
current grace-period sequence number. The bottom two bits are the state
of the current grace period, which can be zero for not yet started or
one for in progress. In other words, if the bottom two bits of
``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom
two bits indicates that something is broken. This field is protected by
the root ``rcu_node`` structure's ``->lock`` field.

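This encoding can be picked apart with helpers along the lines of those
in ``kernel/rcu/rcu.h``, sketched here; treat the exact names and
values as assumptions about one particular kernel version::

   #define RCU_SEQ_CTR_SHIFT  2
   #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)

   /* The grace-period counter proper, without the state bits. */
   static inline unsigned long rcu_seq_ctr(unsigned long s)
   {
           return s >> RCU_SEQ_CTR_SHIFT;
   }

   /* The bottom two bits: zero means that RCU is idle. */
   static inline int rcu_seq_state(unsigned long s)
   {
           return s & RCU_SEQ_STATE_MASK;
   }
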
There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data``
structures as well. The fields in the ``rcu_state`` structure represent
the most current values, and those of the other structures are compared
in order to detect the beginnings and ends of grace periods in a
distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
(down the tree from the root to the leaves) to ``rcu_data``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that the root ``rcu_node`` structure also has a ``->gp_seq``    |
| field, why does the ``rcu_state`` structure need one of its own? Why  |
| not just update the root ``rcu_node`` structure's field directly at   |
| the start of each grace period?                                       |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| On single-node RCU trees (where the root node is also a leaf),        |
| immediately updating the root node's gp_seq at grace-period start     |
| would be counterproductive:                                           |
|                                                                       |
| 1. All CPUs would notice the mismatch with their rdp's                |
|    gp_seq, in rcu_pending(). They would all then invoke the RCU-core. |
| 2. Each CPU would then acquire the root/leaf node's lock in order to  |
|    update its local copies of the grace-period state.                 |
| 3. But rnp->qsmask isn't initialized yet (happens later in            |
|    rcu_gp_init()), so each CPU would see that it does not yet         |
|    need to report quiescent state (no qsmask), update rdp->gp_seq,    |
|    and drop the lock, having accomplished nothing.                    |
|                                                                       |
| The separate ``rcu_state`` field allows the grace-period kthread to   |
| advance the                                                           |
| grace period counter without immediately affecting what CPUs see in   |
| their ``rcu_node`` and ``rcu_data`` structures.                       |
+-----------------------------------------------------------------------+

The ``->gp_max`` field tracks the duration of the longest grace period
in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.

The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).

The ``rcu_node`` Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``rcu_node`` structures form the combining tree that propagates
quiescent-state information from the leaves to the root and also that
propagates grace-period information from the root down to the leaves.
They provide local copies of the grace-period state in order to allow
this information to be accessed in a synchronized manner without
suffering the scalability limitations that would otherwise be imposed
by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage lists
of tasks that have blocked while in their current RCU read-side critical
section. In ``CONFIG_PREEMPT_RCU`` kernels with ``CONFIG_RCU_BOOST``,
they
manage the per-\ ``rcu_node`` priority-boosting kernel threads
(kthreads) and state. Finally, they record CPU-hotplug state in order to
determine which CPUs should be ignored during a given grace period.

The ``->parent`` pointer references the ``rcu_node`` one level up in the
tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation
makes heavy use of this field to push quiescent states up the tree. The
``->level`` field gives the level in the tree, with the root being at
level zero, its children at level one, and so on. The ``->grpnum`` field
gives this node's position within the children of its parent, so this
number can range between 0 and 31 on 32-bit systems and between 0 and 63
on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only
during initialization and for tracing. The ``->grpmask`` field is the
bitmask counterpart of ``->grpnum``, and therefore always has exactly
one bit set. This mask is used to clear the bit corresponding to this
``rcu_node`` structure in its parent's bitmasks, which are described
later. Finally, the ``->grplo`` and ``->grphi`` fields contain the
lowest and highest numbered CPU served by this ``rcu_node`` structure,
respectively.

.. _grace-period-tracking-1:

Grace-Period Tracking
'''''''''''''''''''''
The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of
the field of the same name in the ``rcu_state`` structure. They each may
lag up to one step behind their ``rcu_state`` counterpart. If the bottom
two bits of a given ``rcu_node`` structure's ``->gp_seq`` field are
zero, then this ``rcu_node`` structure believes that RCU is idle.

The ``->gp_seq_needed`` fields record the furthest-in-the-future grace
period request seen by the corresponding ``rcu_node`` structure. The
request is considered fulfilled when the value of the ``->gp_seq`` field
equals or exceeds that of the ``->gp_seq_needed`` field.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Suppose that this kernel runs for a                                   |
| very long time. Won't wrapping of the ``->gp_seq`` field cause        |
| problems?                                                             |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, because if the ``->gp_seq_needed`` field lags behind the          |
| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at  |
| the end of the grace period. Modulo-arithmetic comparisons therefore  |
| will always get the correct answer, even with wrapping.               |
+-----------------------------------------------------------------------+

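Such wrap-tolerant comparisons can be written in the style of the
kernel's ``ULONG_CMP_GE()`` and ``ULONG_CMP_LT()`` macros; the usage
line below is an assumption based on the field descriptions above::

   #include <limits.h>

   /* True if a is at or after b, even if the counter has wrapped. */
   #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))
   #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

   /* A grace-period request is fulfilled once ->gp_seq catches up:  */
   /*     if (ULONG_CMP_GE(rnp->gp_seq, rnp->gp_seq_needed))         */
   /*             ... request fulfilled ...                          */
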
Quiescent-State Tracking
''''''''''''''''''''''''
The ``->qsmask`` field tracks which of this ``rcu_node`` structure's
children still need to report quiescent states for the current normal
grace period, with each bit corresponding to either one child or, for
leaf nodes, one CPU.
Similarly, the ``->expmask`` field tracks which of this ``rcu_node``
structure's children still need to report quiescent states for the
current expedited grace period. An expedited grace period has the same
conceptual properties as a normal grace period, but the expedited
implementation accepts extreme CPU overhead to obtain much lower
grace-period latency, for example, consuming a few tens of microseconds
worth of CPU time to reduce grace-period duration from milliseconds to
tens of microseconds. The ``->qsmaskinit`` field tracks which of this
``rcu_node`` structure's children cover for at least one online CPU.
This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
used to initialize ``->expmask``, at the beginning of the normal and
expedited grace periods, respectively.

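As a sketch of how these initialization masks might be applied at
grace-period start (the locking details are assumptions, and the
traversal macro is the one shown later in this document)::

   rcu_for_each_node_breadth_first(rsp, rnp) {
           raw_spin_lock_irq(&rnp->lock);
           rnp->qsmask = rnp->qsmaskinit;  /* wait only on online children */
           raw_spin_unlock_irq(&rnp->lock);
   }
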
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why are these bitmasks protected by locking? Come on, haven't you     |
| heard of atomic instructions???                                       |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Lockless grace-period computation! Such a tantalizing possibility!    |
| But consider the following sequence of events:                        |
|                                                                       |
| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
|    wakes up, it notices that the current RCU grace period needs it to |
|    report in, so it sets a flag where the scheduling-clock interrupt  |
|    will find it.                                                      |
| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and       |
|    notices that CPU 0 has been in dyntick-idle mode, which qualifies  |
|    as an extended quiescent state.                                    |
| #. CPU 0's scheduling-clock interrupt fires in the middle of an RCU   |
|    read-side critical section, and notices that the RCU core needs    |
|    something, so commences RCU softirq processing.                    |
| #. CPU 0's softirq handler executes and is just about ready to report |
|    its quiescent state up the ``rcu_node`` tree.                      |
| #. But CPU 1 beats it to the punch, completing the current grace      |
|    period and starting a new one.                                     |
| #. CPU 0 now reports its quiescent state for the wrong grace period.  |
|    That grace period might now end before the RCU read-side critical  |
|    section. If that happens, disaster will ensue.                     |
|                                                                       |
| So the locking is absolutely required in order to coordinate clearing |
| of the bits with updating of the grace-period sequence number in      |
| ``->gp_seq``.                                                         |
+-----------------------------------------------------------------------+

Blocked-Task Management
'''''''''''''''''''''''
``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU
read-side critical sections, and these tasks must be tracked explicitly.
The details of exactly why and how they are tracked will be covered in a
separate article on RCU read-side processing. For now, it is enough to
know that the ``rcu_node`` structure tracks them.

The ``->blkd_tasks`` field is a list header for the list of blocked and
preempted tasks. As tasks undergo context switches within RCU read-side
critical sections, their ``task_struct`` structures are enqueued (via
the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the
``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding
to the CPU on which the outgoing context switch executed. As these tasks
later exit their RCU read-side critical sections, they remove themselves
from the list. This list is therefore in reverse time order, so that if
one of the tasks is blocking the current grace period, all subsequent
tasks must also be blocking that same grace period. Therefore, a single
pointer into this list suffices to track all tasks blocking a given
grace period. That pointer is stored in ``->gp_tasks`` for normal grace
periods and in ``->exp_tasks`` for expedited grace periods. These last
two fields are ``NULL`` if their respective grace period is not waiting
on any tasks. If one of these two pointers is referencing a task that
removes itself from the ``->blkd_tasks`` list, then that task must
advance the pointer to the next task on the list, or set the pointer to
``NULL`` if there are no subsequent tasks on the list.

For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
to the largest-numbered CPU in the system. Then if task T1 blocked in an
RCU read-side critical section, then an expedited grace period started,
then task T2 blocked in an RCU read-side critical section, then a normal
grace period started, and finally task T3 blocked in an RCU read-side
critical section, then the state of the last leaf ``rcu_node``
structure's blocked-task list would be as shown below:

.. kernel-figure:: blkd_task.svg

In this figure, task T1 is blocking both grace periods, task T2 is
blocking only the normal grace period, and task T3 is blocking neither
grace period. Each task remains on the list until it exits its RCU
read-side critical section.

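The ordering in this example can be modeled with a short userspace
sketch; the names are illustrative only, and the real implementation
uses the kernel's doubly-linked lists and per-``rcu_node`` locking::

   #include <stdio.h>

   struct toy_task { const char *name; struct toy_task *next; };

   int main(void)
   {
           struct toy_task t1 = { "T1" }, t2 = { "T2" }, t3 = { "T3" };
           struct toy_task *blkd_tasks = NULL;     /* newest first */
           struct toy_task *gp_tasks = NULL, *exp_tasks = NULL;

           t1.next = blkd_tasks; blkd_tasks = &t1; /* T1 blocks */
           exp_tasks = blkd_tasks;                 /* expedited GP starts */

           t2.next = blkd_tasks; blkd_tasks = &t2; /* T2 blocks */
           gp_tasks = blkd_tasks;                  /* normal GP starts */

           t3.next = blkd_tasks; blkd_tasks = &t3; /* T3 blocks: neither
                                                      GP waits on it */

           printf("head=%s gp_tasks=%s exp_tasks=%s\n", blkd_tasks->name,
                  gp_tasks->name, exp_tasks->name); /* T3 T2 T1 */
           return 0;
   }
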
The ``->wait_blkd_tasks`` field indicates whether or not the current
grace period is waiting on a blocked task.

The ``rcu_node`` array is sized via a series of C-preprocessor
expressions in the kernel source.

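The full listing is not reproduced here (the line numbers cited below
refer to it), but its heart is the fanout computation, which looks
roughly like this excerpt based on ``kernel/rcu/tree.h`` of the era
this document describes::

   #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
   #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
   #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
   #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
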
The maximum number of levels in the ``rcu_node`` tree is currently
limited to four, as specified by lines 21-24 and the structure of the
subsequent “if” statement. For 32-bit systems, this allows
16*32*32*32=524,288 CPUs, which should be sufficient for the next few
years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
allowed, which should see us through the next decade or so. This
four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8``
to support up to 4096 CPUs, which might be useful in very large systems
having eight CPUs per socket (but please note that no one has yet shown
any measurable performance degradation due to misaligned socket and
``rcu_node`` boundaries). In addition, building kernels with a full four
levels of ``rcu_node`` tree permits better testing of RCU's
combining-tree code.

The ``RCU_FANOUT`` symbol controls how many children are permitted at
each non-leaf level of the ``rcu_node`` tree. If the
``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based
on the word size of the system, which is also the Kconfig default.

The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by
each leaf ``rcu_node`` structure. Experience has shown that allowing a
given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the
number of bits in the ``->qsmask`` field on a 64-bit system, results in
excessive contention for the leaf ``rcu_node`` structures' ``->lock``
fields. The number of CPUs per leaf was therefore limited to 16 given
the default value of ``CONFIG_RCU_FANOUT_LEAF``. If
``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based
on the word size of the system, just as for ``CONFIG_RCU_FANOUT``.
Lines 11-19 perform this computation.

Lines 21-24 compute the maximum number of CPUs supported by a
single-level (which contains a single ``rcu_node`` structure),
two-level, three-level, and four-level ``rcu_node`` tree, respectively,
given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
These maximum numbers of CPUs are retained in the ``RCU_FANOUT_1``,
``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor
variables, respectively.

These variables are used to control the C-preprocessor ``#if`` statement
spanning lines 26-66 that computes the number of ``rcu_node`` structures
required for each level of the tree, as well as the number of levels
required. The number of levels is placed in the ``NUM_RCU_LVLS``
C-preprocessor variable by lines 27, 35, 44, and 54. The number of
``rcu_node`` structures for the topmost level of the tree is computed by
lines 28, 36, 45, and 55, and the remaining levels (if any) by
lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
initializers for lockdep lock-class names. Finally, lines 64-66 produce
an error if the maximum number of CPUs is too large for the specified
fanout.

The ``rcu_segcblist`` Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``rcu_segcblist`` structure maintains a segmented list of callbacks,
with segments for callbacks that are ready to invoke, callbacks waiting
for the current grace period, callbacks waiting for the next grace
period, and callbacks not yet associated with any grace period. Note
that different CPUs can have different ideas about which
grace period is current, hence the ``->gp_seq`` field.

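The structure itself looks roughly as follows, based on
``include/linux/rcu_segcblist.h`` of the kernels this document
describes; consider the details a snapshot of one version rather than
a stable API::

   #define RCU_DONE_TAIL       0  /* GP elapsed; ready to invoke. */
   #define RCU_WAIT_TAIL       1  /* Waiting for current GP. */
   #define RCU_NEXT_READY_TAIL 2  /* Waiting for next GP. */
   #define RCU_NEXT_TAIL       3  /* Not yet assigned to a GP. */
   #define RCU_CBLIST_NSEGS    4

   struct rcu_segcblist {
           struct rcu_head *head;
           struct rcu_head **tails[RCU_CBLIST_NSEGS];
           unsigned long gp_seq[RCU_CBLIST_NSEGS];
           long len;
           long len_lazy;
   };
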
The ``->head`` pointer references the first callback or is ``NULL`` if
the list contains no callbacks (which is *not* the same as being empty).
Each element of the ``->tails[]`` array references the ``->next``
pointer of the last callback in the corresponding segment of the list,
or the list's ``->head`` pointer if that segment and all previous
segments are empty. If the corresponding segment is empty but some
previous segment is not, then the array element is identical to its
predecessor. Older callbacks are closer to the head of the list, and new
callbacks are added at the tail. This relationship between the
``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown
in this diagram:

.. kernel-figure:: nxtlist.svg
In this figure, the ``->head`` pointer references the first RCU callback
in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
``->head`` pointer itself, indicating that none of the callbacks is
ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
are both waiting on the current grace period, give or take possible
disagreements about exactly which grace period is the current one. The
``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
there are no callbacks waiting on the next RCU grace period. The
``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
pointer, indicating that all the remaining RCU callbacks have not yet
been assigned to an RCU grace period. Note that the
``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
callback's ``->next`` pointer unless the callback list is empty, in
which case it references the ``->head`` pointer.

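For example, enqueuing a new callback always targets the
``RCU_NEXT_TAIL`` segment, roughly as in this sketch; locking, memory
barriers, and the kernel's ``WRITE_ONCE()`` discipline are omitted::

   static void toy_segcblist_enqueue(struct rcu_segcblist *rsclp,
                                     struct rcu_head *rhp)
   {
           rsclp->len++;
           rhp->next = NULL;
           *rsclp->tails[RCU_NEXT_TAIL] = rhp;       /* link after last CB */
           rsclp->tails[RCU_NEXT_TAIL] = &rhp->next; /* segment ends at rhp */
   }
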
There is one additional important special case for the
``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this
list is *disabled*. Lists are disabled when the corresponding CPU is
offline or when the corresponding CPU's callbacks are offloaded to a
kthread, both of which are described elsewhere.

The ``->gp_seq[]`` array records grace-period numbers corresponding to
the list segments. This is what allows different CPUs to have different
ideas as to which is the current grace period while still avoiding
premature invocation of their callbacks.

The ``->len`` counter contains the number of callbacks in ``->head``,
and the ``->len_lazy`` contains the number of those callbacks that are
known to only free memory, and whose invocation can therefore be safely
deferred.

**Important note**: It is the ``->len`` field that determines whether or
not there are callbacks associated with this ``rcu_segcblist``
structure, *not* the ``->head`` pointer. The reason for this is that all
the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL``
segment) are extracted all at once at callback-invocation time
(``rcu_do_batch``), due to which ``->head`` may be set to NULL if there
are no not-done callbacks remaining in the ``rcu_segcblist``. If
callback invocation must be postponed, for example, because a
high-priority process just woke up on this CPU, then the remaining
callbacks are placed back on the ``RCU_DONE_TAIL`` segment and
``->head`` once again points to the start of the segment. In short, the
head field can briefly be ``NULL`` even though the CPU has callbacks
present the entire time. Therefore, it is not appropriate to test the
``->head`` pointer for ``NULL``.

In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only
after the corresponding callbacks have been invoked. This means that the
``->len`` count is zero only if the ``rcu_segcblist`` structure really
is devoid of callbacks. Of course, off-CPU sampling of the ``->len``
count requires careful use of appropriate synchronization, for example,
memory barriers.

The ``rcu_data`` Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``rcu_data`` structure maintains the per-CPU state for the RCU
subsystem. The fields in this structure may be accessed only from the
corresponding CPU (and from tracing) unless otherwise stated. This
structure is the focus
of quiescent-state detection and RCU callback queuing. It also tracks
its relationship to the corresponding leaf ``rcu_node`` structure to
allow
more-efficient propagation of quiescent states up the ``rcu_node``
combining tree. Like the ``rcu_node`` structure, it provides a local
copy of the grace-period information to allow for-free synchronized
access to this information from the corresponding CPU. Finally, this
structure records past dyntick-idle state for the corresponding CPU
and also tracks statistics.

The ``->cpu`` field contains the number of the corresponding CPU and the
``->mynode`` field references the corresponding ``rcu_node`` structure.
The ``->mynode`` field is used to propagate quiescent states up the
combining tree. These two fields are constant and therefore do not
require synchronization.

The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
and ``->mynode->expmask`` fields corresponding to this ``rcu_data``
structure's CPU, and is also used when
propagating quiescent states. The ``->beenonline`` flag is set whenever
the corresponding CPU comes online, which means that the debugfs tracing
need not dump out any ``rcu_data`` structure for which this flag is not
set.

Quiescent-State and Grace-Period Tracking
'''''''''''''''''''''''''''''''''''''''''
The ``->gp_seq`` field is the counterpart of the field of the same name
in the ``rcu_state`` and ``rcu_node`` structures. The
``->gp_seq_needed`` field is the counterpart of the field of the same
name in the ``rcu_node`` structure. They may each lag up to one step
behind their ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
dyntick-idle mode (but these counters will catch up upon exit from
dyntick-idle mode). If the lower two bits of a given ``rcu_data``
structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
believes that RCU is idle.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| All this replication of the grace-period numbering can surely cause   |
| massive confusion. Why not just keep a single global sequence number  |
| and be done with it???                                                |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because if there were only a single global sequence number, there     |
| would need to be a single global lock to allow safely accessing and   |
| updating it. And if we are not going to have a single global lock, we |
| need to carefully manage the numbers on a per-node basis. Recall from |
| the answer to a previous Quick Quiz that the consequences of applying |
| a quiescent state to the wrong grace period are quite severe.         |
+-----------------------------------------------------------------------+

The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
through a quiescent state, while the ``->core_needs_qs`` flag indicates
that the RCU core needs a quiescent state from the corresponding CPU.
The ``->gpwrap`` field indicates that the corresponding CPU has remained
idle for so long that the ``gp_seq`` counter is in danger of overflow,
which will cause the CPU to disregard the values of its counters on its
next exit from idle.

RCU Callback Handling
'''''''''''''''''''''

In the absence of CPU-hotplug events, RCU callbacks are invoked by the
same CPU that registered them. This is strictly a cache-locality
optimization: callbacks can and do get invoked on CPUs other than the
one that registered them. After all, if the CPU that registered a given
callback has gone offline before the callback can be invoked, there
really is no other choice.

The ``->cblist`` structure is the segmented callback list described
earlier. The CPU advances the callbacks in its ``rcu_data`` structure
whenever it notices that another RCU grace period has completed. The CPU
detects the completion of an RCU grace period by noticing that the value
of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
its leaf ``rcu_node`` structure. Recall that each ``rcu_node``
structure's ``->gp_seq`` field is updated at the beginnings and ends of
each grace period.

The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the
forcing of quiescent states from ``call_rcu()`` and friends when
callback lists grow excessively long.

The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
fields count the number of callbacks invoked, sent to other CPUs when
this CPU goes offline, and received from other CPUs when those other
CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's
callbacks are offloaded to a kthread.

Finally, the ``->blimit`` counter is the maximum number of RCU callbacks
that may be invoked at a given time.

Dyntick-Idle Handling
'''''''''''''''''''''
The ``->watching_snap`` field is used to take a snapshot of the
corresponding CPU's dyntick-idle state when forcing quiescent states,
and is therefore accessed from other CPUs. Finally, the
``->dynticks_fqs`` field is used to count the number of times this CPU
is determined to be in dyntick-idle state, and is used for tracing and
debugging purposes.

These fields in the ``rcu_data`` structure maintain the per-CPU
dyntick-idle state for the corresponding CPU.

The ``->nesting`` field counts the nesting depth of process
execution, so that in normal circumstances this counter has the value
zero or one. NMIs, irqs, and tracers are counted by the
``->nmi_nesting`` field. Because NMIs cannot be masked, changes
to this variable have to be undertaken carefully using an algorithm
provided by Andy Lutomirski. The initial transition from idle adds one,
and nested transitions add two, so that a nesting level of five is
represented by a ``->nmi_nesting`` value of nine. This counter
can therefore be thought of as counting the number of reasons why this
CPU cannot be permitted to enter dyntick-idle mode, aside from
process-level transitions.

However, it turns out that when running in non-idle kernel context, the
Linux kernel is fully capable of entering interrupt handlers that never
exit and perhaps also vice versa. Therefore, whenever the
``->nesting`` field is incremented up from zero, the
``->nmi_nesting`` field is set to a large positive number, and
whenever the ``->nesting`` field is decremented down to zero,
the ``->nmi_nesting`` field is set to zero. Assuming that
the number of misnested interrupts is not sufficient to overflow the
counter, this approach corrects the ``->nmi_nesting`` field
every time the corresponding CPU enters the idle loop from process
context.

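A sketch of this correction, using the field names from the text; the
constant name and its value are assumptions drawn from kernels of this
era::

   #include <limits.h>

   #define DYNTICK_IRQ_NONIDLE ((LONG_MAX / 2) + 1)

   struct toy_rcu_data {
           long nesting;
           long nmi_nesting;
   };

   static void toy_eqs_exit(struct toy_rcu_data *rdp)  /* idle -> process */
   {
           rdp->nesting = 1;
           rdp->nmi_nesting = DYNTICK_IRQ_NONIDLE;  /* forgive misnesting */
   }

   static void toy_eqs_enter(struct toy_rcu_data *rdp) /* process -> idle */
   {
           rdp->nesting = 0;
           rdp->nmi_nesting = 0;                    /* reset irq/NMI count */
   }
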
The ``->dynticks`` field counts the corresponding CPU's transitions to
and from either dyntick-idle or user mode, so that this counter has an
even value when the CPU is in dyntick-idle mode or user mode and an odd
value otherwise. The transitions to/from user mode need to be counted
for user mode adaptive-ticks support (see Documentation/timers/no_hz.rst).

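The parity rule makes remote sampling easy, as in this sketch; the
function names are illustrative::

   #include <stdbool.h>

   /* An even snapshot means the CPU was in dyntick-idle or user mode. */
   static bool toy_in_eqs(unsigned long snap)
   {
           return !(snap & 0x1);
   }

   /* A quiescent state has also passed if the counter moved at all. */
   static bool toy_in_eqs_since(unsigned long old_snap, unsigned long snap)
   {
           return toy_in_eqs(old_snap) || old_snap != snap;
   }
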
The ``->rcu_need_heavy_qs`` field is used to record the fact that the
RCU core code would really like to see a quiescent state from the
corresponding CPU, so much so that it is willing to call for
heavy-weight dyntick-counter operations. This flag is checked by RCU's
context-switch and ``cond_resched()`` code, which provide a momentary
idle sojourn in response.

Finally, the ``->rcu_urgent_qs`` field is used to record the fact that
the RCU core code would like to see a quiescent state from the
corresponding CPU, with the various other fields indicating just how
badly RCU wants this quiescent state. This flag is checked by RCU's
context-switch path (``rcu_note_context_switch``) and the
``cond_resched()`` code.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not simply combine the ``->nesting`` and                          |
| ``->nmi_nesting`` counters into a single counter that just            |
| counts the number of reasons that the corresponding CPU is non-idle?  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because this would fail in the presence of interrupt handlers that    |
| never return and of handlers that manage to return from a made-up     |
| interrupt.                                                            |
+-----------------------------------------------------------------------+

Additional fields are present for some special-purpose builds, and are
discussed separately.

The ``rcu_head`` Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~

Each ``rcu_head`` structure represents an RCU callback. These structures
are normally embedded within RCU-protected data structures whose
algorithms use asynchronous grace periods. In contrast, when using
algorithms that block waiting for RCU grace periods, RCU users need not
provide ``rcu_head`` structures.

The ``->next`` field is used to link the ``rcu_head`` structures
together in the lists within the ``rcu_data`` structures. The ``->func``
field is a pointer to the function to be called when the callback is
ready to be invoked, and this function is passed a pointer to the
``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func``
field to record the offset of the ``rcu_head`` structure within the
enclosing RCU-protected data structure.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that the callback function ``->func`` is passed a pointer to    |
| the ``rcu_head`` structure, how is that function supposed to find the |
| beginning of the enclosing RCU-protected data structure?              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In actual practice, there is a separate callback function for each    |
| type of                                                               |
| RCU-protected data structure. The callback function can therefore use |
| the ``container_of()`` macro in the Linux kernel (or other            |
| pointer-manipulation facilities in other software environments) to    |
| find the beginning of the enclosing structure.                        |
+-----------------------------------------------------------------------+

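Here is a self-contained userspace illustration of that pattern;
``container_of()`` is defined locally for the example (in the kernel it
comes from the kernel's own headers), and the callback would be
registered with ``call_rcu()`` rather than being invoked directly::

   #include <stddef.h>
   #include <stdio.h>
   #include <stdlib.h>

   #define container_of(ptr, type, member) \
           ((type *)((char *)(ptr) - offsetof(type, member)))

   struct rcu_head {
           struct rcu_head *next;
           void (*func)(struct rcu_head *);
   };

   struct foo {
           int a;
           struct rcu_head rh;     /* need not be the first field */
   };

   static void foo_reclaim(struct rcu_head *rhp)
   {
           struct foo *fp = container_of(rhp, struct foo, rh);

           printf("freeing foo with a=%d\n", fp->a);
           free(fp);
   }

   int main(void)
   {
           struct foo *fp = malloc(sizeof(*fp));

           fp->a = 42;
           /* The kernel equivalent: call_rcu(&fp->rh, foo_reclaim); */
           foo_reclaim(&fp->rh);
           return 0;
   }
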
RCU-Specific Fields in the ``task_struct`` Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``->rcu_read_lock_nesting`` field records the nesting level for RCU
read-side critical sections, and the ``->rcu_read_unlock_special`` field
is a bitmask that records special conditions that require
``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
field is used to form lists of tasks that have blocked within
preemptible-RCU read-side critical sections and the
``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
list this task is a member of, or ``NULL`` if it is not blocked within a
preemptible-RCU read-side critical section.

The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
switches that this task had undergone at the beginning of the current
tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
tasks-RCU grace period is waiting on this task,
``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
idle task is running on, but only if the task is currently running,
that is, if the CPU is currently idle.

Accessor Functions
~~~~~~~~~~~~~~~~~~

The following listing shows the ``rcu_get_root()``,
``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()``
function and macros::

    1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
    2 {
    3   return &rsp->node[0];
    4 }
    5
    6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
    7   for ((rnp) = &(rsp)->node[0]; \
    8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
    9
   10 #define rcu_for_each_leaf_node(rsp, rnp) \
   11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
   12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

The ``rcu_get_root()`` function simply returns a pointer to the first
element of
the specified ``rcu_state`` structure's ``->node[]`` array, which is the
root ``rcu_node`` structure.

As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro simply
traverses the specified
``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
traversal by simply scanning the array in order. Similarly, the
``rcu_for_each_leaf_node()`` macro traverses only the last part of the
array, thus traversing only the leaf ``rcu_node`` structures.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree    |
| contains only a single node?                                          |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the   |
| single node.                                                          |
+-----------------------------------------------------------------------+

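As a usage sketch, assuming a kernel context in which ``rsp`` is in
scope and the ``->grplo`` and ``->grphi`` fields are as described
earlier in this document::

   struct rcu_node *rnp;

   rcu_for_each_leaf_node(rsp, rnp)
           pr_info("leaf rcu_node covers CPUs %d-%d\n",
                   rnp->grplo, rnp->grphi);
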
Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
is tracked by dynticks-related fields in the ``rcu_data`` structure. If
you made it this far, you are well prepared to read the code
walkthroughs in the other articles in this series.

Acknowledgments
~~~~~~~~~~~~~~~

I owe thanks to all of the reviewers of this document's drafts for
helping me get this document into a more human-readable state.