xref: /linux/Documentation/RCU/whatisRCU.rst (revision df9c299371054cb725eef730fd0f1d0fe2ed6bb0)
1.. _whatisrcu_doc:
2
3What is RCU?  --  "Read, Copy, Update"
4======================================
5
6Please note that the "What is RCU?" LWN series is an excellent place
7to start learning about RCU:
8
9| 1.	What is RCU, Fundamentally?  https://lwn.net/Articles/262464/
10| 2.	What is RCU? Part 2: Usage   https://lwn.net/Articles/263130/
11| 3.	RCU part 3: the RCU API      https://lwn.net/Articles/264090/
12| 4.	The RCU API, 2010 Edition    https://lwn.net/Articles/418853/
13| 	2010 Big API Table           https://lwn.net/Articles/419086/
14| 5.	The RCU API, 2014 Edition    https://lwn.net/Articles/609904/
15|	2014 Big API Table           https://lwn.net/Articles/609973/
16| 6.	The RCU API, 2019 Edition    https://lwn.net/Articles/777036/
17|	2019 Big API Table           https://lwn.net/Articles/777165/
18| 7.	The RCU API, 2024 Edition    https://lwn.net/Articles/988638/
19|       2024 Background Information  https://lwn.net/Articles/988641/
20|	2024 Big API Table           https://lwn.net/Articles/988666/
21
22For those preferring video:
23
24| 1.	Unraveling RCU Mysteries: Fundamentals          https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries
25| 2.	Unraveling RCU Mysteries: Additional Use Cases  https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases
26
27
28What is RCU?
29
30RCU is a synchronization mechanism that was added to the Linux kernel
31during the 2.5 development effort that is optimized for read-mostly
32situations.  Although RCU is actually quite simple, making effective use
33of it requires you to think differently about your code.  Another part
34of the problem is the mistaken assumption that there is "one true way" to
35describe and to use RCU.  Instead, the experience has been that different
36people must take different paths to arrive at an understanding of RCU,
37depending on their experiences and use cases.  This document provides
38several different paths, as follows:
39
40:ref:`1.	RCU OVERVIEW <1_whatisRCU>`
41
42:ref:`2.	WHAT IS RCU'S CORE API? <2_whatisRCU>`
43
44:ref:`3.	WHAT ARE SOME EXAMPLE USES OF CORE RCU API? <3_whatisRCU>`
45
46:ref:`4.	WHAT IF MY UPDATING THREAD CANNOT BLOCK? <4_whatisRCU>`
47
48:ref:`5.	WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? <5_whatisRCU>`
49
50:ref:`6.	ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>`
51
52:ref:`7.	ANALOGY WITH REFERENCE COUNTING <7_whatisRCU>`
53
54:ref:`8.	FULL LIST OF RCU APIs <8_whatisRCU>`
55
56:ref:`9.	ANSWERS TO QUICK QUIZZES <9_whatisRCU>`
57
58People who prefer starting with a conceptual overview should focus on
59Section 1, though most readers will profit by reading this section at
60some point.  People who prefer to start with an API that they can then
61experiment with should focus on Section 2.  People who prefer to start
62with example uses should focus on Sections 3 and 4.  People who need to
63understand the RCU implementation should focus on Section 5, then dive
64into the kernel source code.  People who reason best by analogy should
65focus on Section 6 and 7.  Section 8 serves as an index to the docbook
66API documentation, and Section 9 is the traditional answer key.
67
68So, start with the section that makes the most sense to you and your
69preferred method of learning.  If you need to know everything about
70everything, feel free to read the whole thing -- but if you are really
71that type of person, you have perused the source code and will therefore
72never need this document anyway.  ;-)
73
74.. _1_whatisRCU:
75
761.  RCU OVERVIEW
77----------------
78
79The basic idea behind RCU is to split updates into "removal" and
80"reclamation" phases.  The removal phase removes references to data items
81within a data structure (possibly by replacing them with references to
82new versions of these data items), and can run concurrently with readers.
83The reason that it is safe to run the removal phase concurrently with
84readers is the semantics of modern CPUs guarantee that readers will see
85either the old or the new version of the data structure rather than a
86partially updated reference.  The reclamation phase does the work of reclaiming
87(e.g., freeing) the data items removed from the data structure during the
88removal phase.  Because reclaiming data items can disrupt any readers
89concurrently referencing those data items, the reclamation phase must
90not start until readers no longer hold references to those data items.
91
92Splitting the update into removal and reclamation phases permits the
93updater to perform the removal phase immediately, and to defer the
94reclamation phase until all readers active during the removal phase have
95completed, either by blocking until they finish or by registering a
96callback that is invoked after they finish.  Only readers that are active
97during the removal phase need be considered, because any reader starting
98after the removal phase will be unable to gain a reference to the removed
99data items, and therefore cannot be disrupted by the reclamation phase.
100
101So the typical RCU update sequence goes something like the following:
102
103a.	Remove pointers to a data structure, so that subsequent
104	readers cannot gain a reference to it.
105
106b.	Wait for all previous readers to complete their RCU read-side
107	critical sections.
108
109c.	At this point, there cannot be any readers who hold references
110	to the data structure, so it now may safely be reclaimed
111	(e.g., kfree()d).
112
113Step (b) above is the key idea underlying RCU's deferred destruction.
114The ability to wait until all readers are done allows RCU readers to
115use much lighter-weight synchronization, in some cases, absolutely no
116synchronization at all.  In contrast, in more conventional lock-based
117schemes, readers must use heavy-weight synchronization in order to
118prevent an updater from deleting the data structure out from under them.
119This is because lock-based updaters typically update data items in place,
120and must therefore exclude readers.  In contrast, RCU-based updaters
121typically take advantage of the fact that writes to single aligned
122pointers are atomic on modern CPUs, allowing atomic insertion, removal,
123and replacement of data items in a linked structure without disrupting
124readers.  Concurrent RCU readers can then continue accessing the old
125versions, and can dispense with the atomic operations, memory barriers,
126and communications cache misses that are so expensive on present-day
127SMP computer systems, even in absence of lock contention.
128
129In the three-step procedure shown above, the updater is performing both
130the removal and the reclamation step, but it is often helpful for an
131entirely different thread to do the reclamation, as is in fact the case
132in the Linux kernel's directory-entry cache (dcache).  Even if the same
133thread performs both the update step (step (a) above) and the reclamation
134step (step (c) above), it is often helpful to think of them separately.
135For example, RCU readers and updaters need not communicate at all,
136but RCU provides implicit low-overhead communication between readers
137and reclaimers, namely, in step (b) above.
138
139So how the heck can a reclaimer tell when a reader is done, given
140that readers are not doing any sort of synchronization operations???
141Read on to learn about how RCU's API makes this easy.
142
143.. _2_whatisRCU:
144
1452.  WHAT IS RCU'S CORE API?
146---------------------------
147
148The core RCU API is quite small:
149
150a.	rcu_read_lock()
151b.	rcu_read_unlock()
152c.	synchronize_rcu() / call_rcu()
153d.	rcu_assign_pointer()
154e.	rcu_dereference()
155
156There are many other members of the RCU API, but the rest can be
157expressed in terms of these five, though most implementations instead
158express synchronize_rcu() in terms of the call_rcu() callback API.
159
160The five core RCU APIs are described below, the other 18 will be enumerated
161later.  See the kernel docbook documentation for more info, or look directly
162at the function header comments.
163
164rcu_read_lock()
165^^^^^^^^^^^^^^^
166	void rcu_read_lock(void);
167
168	This temporal primitive is used by a reader to inform the
169	reclaimer that the reader is entering an RCU read-side critical
170	section.  It is illegal to block while in an RCU read-side
171	critical section, though kernels built with CONFIG_PREEMPT_RCU
172	can preempt RCU read-side critical sections.  Any RCU-protected
173	data structure accessed during an RCU read-side critical section
174	is guaranteed to remain unreclaimed for the full duration of that
175	critical section.  Reference counts may be used in conjunction
176	with RCU to maintain longer-term references to data structures.
177
178	Note that anything that disables bottom halves, preemption,
179	or interrupts also enters an RCU read-side critical section.
180	Acquiring a spinlock also enters an RCU read-side critical
181	sections, even for spinlocks that do not disable preemption,
182	as is the case in kernels built with CONFIG_PREEMPT_RT=y.
183	Sleeplocks do *not* enter RCU read-side critical sections.
184
185rcu_read_unlock()
186^^^^^^^^^^^^^^^^^
187	void rcu_read_unlock(void);
188
189	This temporal primitives is used by a reader to inform the
190	reclaimer that the reader is exiting an RCU read-side critical
191	section.  Anything that enables bottom halves, preemption,
192	or interrupts also exits an RCU read-side critical section.
193	Releasing a spinlock also exits an RCU read-side critical section.
194
195	Note that RCU read-side critical sections may be nested and/or
196	overlapping.
197
198synchronize_rcu()
199^^^^^^^^^^^^^^^^^
200	void synchronize_rcu(void);
201
202	This temporal primitive marks the end of updater code and the
203	beginning of reclaimer code.  It does this by blocking until
204	all pre-existing RCU read-side critical sections on all CPUs
205	have completed.  Note that synchronize_rcu() will **not**
206	necessarily wait for any subsequent RCU read-side critical
207	sections to complete.  For example, consider the following
208	sequence of events::
209
210	         CPU 0                  CPU 1                 CPU 2
211	     ----------------- ------------------------- ---------------
212	 1.  rcu_read_lock()
213	 2.                    enters synchronize_rcu()
214	 3.                                               rcu_read_lock()
215	 4.  rcu_read_unlock()
216	 5.                     exits synchronize_rcu()
217	 6.                                              rcu_read_unlock()
218
219	To reiterate, synchronize_rcu() waits only for ongoing RCU
220	read-side critical sections to complete, not necessarily for
221	any that begin after synchronize_rcu() is invoked.
222
223	Of course, synchronize_rcu() does not necessarily return
224	**immediately** after the last pre-existing RCU read-side critical
225	section completes.  For one thing, there might well be scheduling
226	delays.  For another thing, many RCU implementations process
227	requests in batches in order to improve efficiencies, which can
228	further delay synchronize_rcu().
229
230	Since synchronize_rcu() is the API that must figure out when
231	readers are done, its implementation is key to RCU.  For RCU
232	to be useful in all but the most read-intensive situations,
233	synchronize_rcu()'s overhead must also be quite small.
234
235	The call_rcu() API is an asynchronous callback form of
236	synchronize_rcu(), and is described in more detail in a later
237	section.  Instead of blocking, it registers a function and
238	argument which are invoked after all ongoing RCU read-side
239	critical sections have completed.  This callback variant is
240	particularly useful in situations where it is illegal to block
241	or where update-side performance is critically important.
242
243	However, the call_rcu() API should not be used lightly, as use
244	of the synchronize_rcu() API generally results in simpler code.
245	In addition, the synchronize_rcu() API has the nice property
246	of automatically limiting update rate should grace periods
247	be delayed.  This property results in system resilience in face
248	of denial-of-service attacks.  Code using call_rcu() should limit
249	update rate in order to gain this same sort of resilience.  See
250	checklist.rst for some approaches to limiting the update rate.
251
252rcu_assign_pointer()
253^^^^^^^^^^^^^^^^^^^^
254	void rcu_assign_pointer(p, typeof(p) v);
255
256	Yes, rcu_assign_pointer() **is** implemented as a macro, though
257	it would be cool to be able to declare a function in this manner.
258	(And there has been some discussion of adding overloaded functions
259	to the C language, so who knows?)
260
261	The updater uses this spatial macro to assign a new value to an
262	RCU-protected pointer, in order to safely communicate the change
263	in value from the updater to the reader.  This is a spatial (as
264	opposed to temporal) macro.  It does not evaluate to an rvalue,
265	but it does provide any compiler directives and memory-barrier
266	instructions required for a given compile or CPU architecture.
267	Its ordering properties are that of a store-release operation,
268	that is, any prior loads and stores required to initialize the
269	structure are ordered before the store that publishes the pointer
270	to that structure.
271
272	Perhaps just as important, rcu_assign_pointer() serves to document
273	(1) which pointers are protected by RCU and (2) the point at which
274	a given structure becomes accessible to other CPUs.  That said,
275	rcu_assign_pointer() is most frequently used indirectly, via
276	the _rcu list-manipulation primitives such as list_add_rcu().
277
278rcu_dereference()
279^^^^^^^^^^^^^^^^^
280	typeof(p) rcu_dereference(p);
281
282	Like rcu_assign_pointer(), rcu_dereference() must be implemented
283	as a macro.
284
285	The reader uses the spatial rcu_dereference() macro to fetch
286	an RCU-protected pointer, which returns a value that may
287	then be safely dereferenced.  Note that rcu_dereference()
288	does not actually dereference the pointer, instead, it
289	protects the pointer for later dereferencing.  It also
290	executes any needed memory-barrier instructions for a given
291	CPU architecture.  Currently, only Alpha needs memory barriers
292	within rcu_dereference() -- on other CPUs, it compiles to a
293	volatile load.	However, no mainstream C compilers respect
294	address dependencies, so rcu_dereference() uses volatile casts,
295	which, in combination with the coding guidelines listed in
296	rcu_dereference.rst, prevent current compilers from breaking
297	these dependencies.
298
299	Common coding practice uses rcu_dereference() to copy an
300	RCU-protected pointer to a local variable, then dereferences
301	this local variable, for example as follows::
302
303		p = rcu_dereference(head.next);
304		return p->data;
305
306	However, in this case, one could just as easily combine these
307	into one statement::
308
309		return rcu_dereference(head.next)->data;
310
311	If you are going to be fetching multiple fields from the
312	RCU-protected structure, using the local variable is of
313	course preferred.  Repeated rcu_dereference() calls look
314	ugly, do not guarantee that the same pointer will be returned
315	if an update happened while in the critical section, and incur
316	unnecessary overhead on Alpha CPUs.
317
318	Note that the value returned by rcu_dereference() is valid
319	only within the enclosing RCU read-side critical section [1]_.
320	For example, the following is **not** legal::
321
322		rcu_read_lock();
323		p = rcu_dereference(head.next);
324		rcu_read_unlock();
325		x = p->address;	/* BUG!!! */
326		rcu_read_lock();
327		y = p->data;	/* BUG!!! */
328		rcu_read_unlock();
329
330	Holding a reference from one RCU read-side critical section
331	to another is just as illegal as holding a reference from
332	one lock-based critical section to another!  Similarly,
333	using a reference outside of the critical section in which
334	it was acquired is just as illegal as doing so with normal
335	locking.
336
337	As with rcu_assign_pointer(), an important function of
338	rcu_dereference() is to document which pointers are protected by
339	RCU, in particular, flagging a pointer that is subject to changing
340	at any time, including immediately after the rcu_dereference().
341	And, again like rcu_assign_pointer(), rcu_dereference() is
342	typically used indirectly, via the _rcu list-manipulation
343	primitives, such as list_for_each_entry_rcu() [2]_.
344
345.. 	[1] The variant rcu_dereference_protected() can be used outside
346	of an RCU read-side critical section as long as the usage is
347	protected by locks acquired by the update-side code.  This variant
348	avoids the lockdep warning that would happen when using (for
349	example) rcu_dereference() without rcu_read_lock() protection.
350	Using rcu_dereference_protected() also has the advantage
351	of permitting compiler optimizations that rcu_dereference()
352	must prohibit.	The rcu_dereference_protected() variant takes
353	a lockdep expression to indicate which locks must be acquired
354	by the caller. If the indicated protection is not provided,
355	a lockdep splat is emitted.  See Design/Requirements/Requirements.rst
356	and the API's code comments for more details and example usage.
357
358.. 	[2] If the list_for_each_entry_rcu() instance might be used by
359	update-side code as well as by RCU readers, then an additional
360	lockdep expression can be added to its list of arguments.
361	For example, given an additional "lock_is_held(&mylock)" argument,
362	the RCU lockdep code would complain only if this instance was
363	invoked outside of an RCU read-side critical section and without
364	the protection of mylock.
365
366The following diagram shows how each API communicates among the
367reader, updater, and reclaimer.
368::
369
370
371	    rcu_assign_pointer()
372	                            +--------+
373	    +---------------------->| reader |---------+
374	    |                       +--------+         |
375	    |                           |              |
376	    |                           |              | Protect:
377	    |                           |              | rcu_read_lock()
378	    |                           |              | rcu_read_unlock()
379	    |        rcu_dereference()  |              |
380	    +---------+                 |              |
381	    | updater |<----------------+              |
382	    +---------+                                V
383	    |                                    +-----------+
384	    +----------------------------------->| reclaimer |
385	                                         +-----------+
386	      Defer:
387	      synchronize_rcu() & call_rcu()
388
389
390The RCU infrastructure observes the temporal sequence of rcu_read_lock(),
391rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
392order to determine when (1) synchronize_rcu() invocations may return
393to their callers and (2) call_rcu() callbacks may be invoked.  Efficient
394implementations of the RCU infrastructure make heavy use of batching in
395order to amortize their overhead over many uses of the corresponding APIs.
396The rcu_assign_pointer() and rcu_dereference() invocations communicate
397spatial changes via stores to and loads from the RCU-protected pointer in
398question.
399
400There are at least three flavors of RCU usage in the Linux kernel. The diagram
401above shows the most common one. On the updater side, the rcu_assign_pointer(),
402synchronize_rcu() and call_rcu() primitives used are the same for all three
403flavors. However for protection (on the reader side), the primitives used vary
404depending on the flavor:
405
406a.	rcu_read_lock() / rcu_read_unlock()
407	rcu_dereference()
408
409b.	rcu_read_lock_bh() / rcu_read_unlock_bh()
410	local_bh_disable() / local_bh_enable()
411	rcu_dereference_bh()
412
413c.	rcu_read_lock_sched() / rcu_read_unlock_sched()
414	preempt_disable() / preempt_enable()
415	local_irq_save() / local_irq_restore()
416	hardirq enter / hardirq exit
417	NMI enter / NMI exit
418	rcu_dereference_sched()
419
420These three flavors are used as follows:
421
422a.	RCU applied to normal data structures.
423
424b.	RCU applied to networking data structures that may be subjected
425	to remote denial-of-service attacks.
426
427c.	RCU applied to scheduler and interrupt/NMI-handler tasks.
428
429Again, most uses will be of (a).  The (b) and (c) cases are important
430for specialized uses, but are relatively uncommon.  The SRCU, RCU-Tasks,
431RCU-Tasks-Rude, and RCU-Tasks-Trace have similar relationships among
432their assorted primitives.
433
434.. _3_whatisRCU:
435
4363.  WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
437-----------------------------------------------
438
439This section shows a simple use of the core RCU API to protect a
440global pointer to a dynamically allocated structure.  More-typical
441uses of RCU may be found in listRCU.rst and NMI-RCU.rst.
442::
443
444	struct foo {
445		int a;
446		char b;
447		long c;
448	};
449	DEFINE_SPINLOCK(foo_mutex);
450
451	struct foo __rcu *gbl_foo;
452
453	/*
454	 * Create a new struct foo that is the same as the one currently
455	 * pointed to by gbl_foo, except that field "a" is replaced
456	 * with "new_a".  Points gbl_foo to the new structure, and
457	 * frees up the old structure after a grace period.
458	 *
459	 * Uses rcu_assign_pointer() to ensure that concurrent readers
460	 * see the initialized version of the new structure.
461	 *
462	 * Uses synchronize_rcu() to ensure that any readers that might
463	 * have references to the old structure complete before freeing
464	 * the old structure.
465	 */
466	void foo_update_a(int new_a)
467	{
468		struct foo *new_fp;
469		struct foo *old_fp;
470
471		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
472		spin_lock(&foo_mutex);
473		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
474		*new_fp = *old_fp;
475		new_fp->a = new_a;
476		rcu_assign_pointer(gbl_foo, new_fp);
477		spin_unlock(&foo_mutex);
478		synchronize_rcu();
479		kfree(old_fp);
480	}
481
482	/*
483	 * Return the value of field "a" of the current gbl_foo
484	 * structure.  Use rcu_read_lock() and rcu_read_unlock()
485	 * to ensure that the structure does not get deleted out
486	 * from under us, and use rcu_dereference() to ensure that
487	 * we see the initialized version of the structure (important
488	 * for DEC Alpha and for people reading the code).
489	 */
490	int foo_get_a(void)
491	{
492		int retval;
493
494		rcu_read_lock();
495		retval = rcu_dereference(gbl_foo)->a;
496		rcu_read_unlock();
497		return retval;
498	}
499
500So, to sum up:
501
502-	Use rcu_read_lock() and rcu_read_unlock() to guard RCU
503	read-side critical sections.
504
505-	Within an RCU read-side critical section, use rcu_dereference()
506	to dereference RCU-protected pointers.
507
508-	Use some solid design (such as locks or semaphores) to
509	keep concurrent updates from interfering with each other.
510
511-	Use rcu_assign_pointer() to update an RCU-protected pointer.
512	This primitive protects concurrent readers from the updater,
513	**not** concurrent updates from each other!  You therefore still
514	need to use locking (or something similar) to keep concurrent
515	rcu_assign_pointer() primitives from interfering with each other.
516
517-	Use synchronize_rcu() **after** removing a data element from an
518	RCU-protected data structure, but **before** reclaiming/freeing
519	the data element, in order to wait for the completion of all
520	RCU read-side critical sections that might be referencing that
521	data item.
522
523See checklist.rst for additional rules to follow when using RCU.
524And again, more-typical uses of RCU may be found in listRCU.rst
525and NMI-RCU.rst.
526
527.. _4_whatisRCU:
528
5294.  WHAT IF MY UPDATING THREAD CANNOT BLOCK?
530--------------------------------------------
531
532In the example above, foo_update_a() blocks until a grace period elapses.
533This is quite simple, but in some cases one cannot afford to wait so
534long -- there might be other high-priority work to be done.
535
536In such cases, one uses call_rcu() rather than synchronize_rcu().
537The call_rcu() API is as follows::
538
539	void call_rcu(struct rcu_head *head, rcu_callback_t func);
540
541This function invokes func(head) after a grace period has elapsed.
542This invocation might happen from either softirq or process context,
543so the function is not permitted to block.  The foo struct needs to
544have an rcu_head structure added, perhaps as follows::
545
546	struct foo {
547		int a;
548		char b;
549		long c;
550		struct rcu_head rcu;
551	};
552
553The foo_update_a() function might then be written as follows::
554
555	/*
556	 * Create a new struct foo that is the same as the one currently
557	 * pointed to by gbl_foo, except that field "a" is replaced
558	 * with "new_a".  Points gbl_foo to the new structure, and
559	 * frees up the old structure after a grace period.
560	 *
561	 * Uses rcu_assign_pointer() to ensure that concurrent readers
562	 * see the initialized version of the new structure.
563	 *
564	 * Uses call_rcu() to ensure that any readers that might have
565	 * references to the old structure complete before freeing the
566	 * old structure.
567	 */
568	void foo_update_a(int new_a)
569	{
570		struct foo *new_fp;
571		struct foo *old_fp;
572
573		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
574		spin_lock(&foo_mutex);
575		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
576		*new_fp = *old_fp;
577		new_fp->a = new_a;
578		rcu_assign_pointer(gbl_foo, new_fp);
579		spin_unlock(&foo_mutex);
580		call_rcu(&old_fp->rcu, foo_reclaim);
581	}
582
583The foo_reclaim() function might appear as follows::
584
585	void foo_reclaim(struct rcu_head *rp)
586	{
587		struct foo *fp = container_of(rp, struct foo, rcu);
588
589		foo_cleanup(fp->a);
590
591		kfree(fp);
592	}
593
594The container_of() primitive is a macro that, given a pointer into a
595struct, the type of the struct, and the pointed-to field within the
596struct, returns a pointer to the beginning of the struct.
597
598The use of call_rcu() permits the caller of foo_update_a() to
599immediately regain control, without needing to worry further about the
600old version of the newly updated element.  It also clearly shows the
601RCU distinction between updater, namely foo_update_a(), and reclaimer,
602namely foo_reclaim().
603
604The summary of advice is the same as for the previous section, except
605that we are now using call_rcu() rather than synchronize_rcu():
606
607-	Use call_rcu() **after** removing a data element from an
608	RCU-protected data structure in order to register a callback
609	function that will be invoked after the completion of all RCU
610	read-side critical sections that might be referencing that
611	data item.
612
613If the callback for call_rcu() is not doing anything more than calling
614kfree() on the structure, you can use kfree_rcu() instead of call_rcu()
615to avoid having to write your own callback::
616
617	kfree_rcu(old_fp, rcu);
618
619If the occasional sleep is permitted, the single-argument form may
620be used, omitting the rcu_head structure from struct foo.
621
622	kfree_rcu_mightsleep(old_fp);
623
624This variant almost never blocks, but might do so by invoking
625synchronize_rcu() in response to memory-allocation failure.
626
627Again, see checklist.rst for additional rules governing the use of RCU.
628
629.. _5_whatisRCU:
630
6315.  WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
632------------------------------------------------
633
634One of the nice things about RCU is that it has extremely simple "toy"
635implementations that are a good first step towards understanding the
636production-quality implementations in the Linux kernel.  This section
637presents two such "toy" implementations of RCU, one that is implemented
638in terms of familiar locking primitives, and another that more closely
639resembles "classic" RCU.  Both are way too simple for real-world use,
640lacking both functionality and performance.  However, they are useful
641in getting a feel for how RCU works.  See kernel/rcu/update.c for a
642production-quality implementation, and see:
643
644	https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit
645
646for papers describing the Linux kernel RCU implementation.  The OLS'01
647and OLS'02 papers are a good introduction, and the dissertation provides
648more details on the current implementation as of early 2004.
649
650
6515A.  "TOY" IMPLEMENTATION #1: LOCKING
652^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
653This section presents a "toy" RCU implementation that is based on
654familiar locking primitives.  Its overhead makes it a non-starter for
655real-life use, as does its lack of scalability.  It is also unsuitable
656for realtime use, since it allows scheduling latency to "bleed" from
657one read-side critical section to another.  It also assumes recursive
658reader-writer locks:  If you try this with non-recursive locks, and
659you allow nested rcu_read_lock() calls, you can deadlock.
660
661However, it is probably the easiest implementation to relate to, so is
662a good starting point.
663
664It is extremely simple::
665
666	static DEFINE_RWLOCK(rcu_gp_mutex);
667
668	void rcu_read_lock(void)
669	{
670		read_lock(&rcu_gp_mutex);
671	}
672
673	void rcu_read_unlock(void)
674	{
675		read_unlock(&rcu_gp_mutex);
676	}
677
678	void synchronize_rcu(void)
679	{
680		write_lock(&rcu_gp_mutex);
681		smp_mb__after_spinlock();
682		write_unlock(&rcu_gp_mutex);
683	}
684
685[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
686much.  But here are simplified versions anyway.  And whatever you do,
687don't forget about them when submitting patches making use of RCU!]::
688
689	#define rcu_assign_pointer(p, v) \
690	({ \
691		smp_store_release(&(p), (v)); \
692	})
693
694	#define rcu_dereference(p) \
695	({ \
696		typeof(p) _________p1 = READ_ONCE(p); \
697		(_________p1); \
698	})
699
700
701The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
702and release a global reader-writer lock.  The synchronize_rcu()
703primitive write-acquires this same lock, then releases it.  This means
704that once synchronize_rcu() exits, all RCU read-side critical sections
705that were in progress before synchronize_rcu() was called are guaranteed
706to have completed -- there is no way that synchronize_rcu() would have
707been able to write-acquire the lock otherwise.  The smp_mb__after_spinlock()
708promotes synchronize_rcu() to a full memory barrier in compliance with
709the "Memory-Barrier Guarantees" listed in:
710
711	Design/Requirements/Requirements.rst
712
713It is possible to nest rcu_read_lock(), since reader-writer locks may
714be recursively acquired.  Note also that rcu_read_lock() is immune
715from deadlock (an important property of RCU).  The reason for this is
716that the only thing that can block rcu_read_lock() is a synchronize_rcu().
717But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
718so there can be no deadlock cycle.
719
720.. _quiz_1:
721
722Quick Quiz #1:
723		Why is this argument naive?  How could a deadlock
724		occur when using this algorithm in a real-world Linux
725		kernel?  How could this deadlock be avoided?
726
727:ref:`Answers to Quick Quiz <9_whatisRCU>`
728
7295B.  "TOY" EXAMPLE #2: CLASSIC RCU
730^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
731This section presents a "toy" RCU implementation that is based on
732"classic RCU".  It is also short on performance (but only for updates) and
733on features such as hotplug CPU and the ability to run in CONFIG_PREEMPTION
734kernels.  The definitions of rcu_dereference() and rcu_assign_pointer()
735are the same as those shown in the preceding section, so they are omitted.
736::
737
738	void rcu_read_lock(void) { }
739
740	void rcu_read_unlock(void) { }
741
742	void synchronize_rcu(void)
743	{
744		int cpu;
745
746		for_each_possible_cpu(cpu)
747			run_on(cpu);
748	}
749
750Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
751This is the great strength of classic RCU in a non-preemptive kernel:
752read-side overhead is precisely zero, at least on non-Alpha CPUs.
753And there is absolutely no way that rcu_read_lock() can possibly
754participate in a deadlock cycle!
755
756The implementation of synchronize_rcu() simply schedules itself on each
757CPU in turn.  The run_on() primitive can be implemented straightforwardly
758in terms of the sched_setaffinity() primitive.  Of course, a somewhat less
759"toy" implementation would restore the affinity upon completion rather
760than just leaving all tasks running on the last CPU, but when I said
761"toy", I meant **toy**!
762
763So how the heck is this supposed to work???
764
765Remember that it is illegal to block while in an RCU read-side critical
766section.  Therefore, if a given CPU executes a context switch, we know
767that it must have completed all preceding RCU read-side critical sections.
768Once **all** CPUs have executed a context switch, then **all** preceding
769RCU read-side critical sections will have completed.
770
771So, suppose that we remove a data item from its structure and then invoke
772synchronize_rcu().  Once synchronize_rcu() returns, we are guaranteed
773that there are no RCU read-side critical sections holding a reference
774to that data item, so we can safely reclaim it.
775
776.. _quiz_2:
777
778Quick Quiz #2:
779		Give an example where Classic RCU's read-side
780		overhead is **negative**.
781
782:ref:`Answers to Quick Quiz <9_whatisRCU>`
783
784.. _quiz_3:
785
786Quick Quiz #3:
787		If it is illegal to block in an RCU read-side
788		critical section, what the heck do you do in
789		CONFIG_PREEMPT_RT, where normal spinlocks can block???
790
791:ref:`Answers to Quick Quiz <9_whatisRCU>`
792
793.. _6_whatisRCU:
794
7956.  ANALOGY WITH READER-WRITER LOCKING
796--------------------------------------
797
798Although RCU can be used in many different ways, a very common use of
799RCU is analogous to reader-writer locking.  The following unified
800diff shows how closely related RCU and reader-writer locking can be.
801::
802
803	@@ -5,5 +5,5 @@ struct el {
804	 	int data;
805	 	/* Other data fields */
806	 };
807	-rwlock_t listmutex;
808	+spinlock_t listmutex;
809	 struct el head;
810
811	@@ -13,15 +14,15 @@
812		struct list_head *lp;
813		struct el *p;
814
815	-	read_lock(&listmutex);
816	-	list_for_each_entry(p, head, lp) {
817	+	rcu_read_lock();
818	+	list_for_each_entry_rcu(p, head, lp) {
819			if (p->key == key) {
820				*result = p->data;
821	-			read_unlock(&listmutex);
822	+			rcu_read_unlock();
823				return 1;
824			}
825		}
826	-	read_unlock(&listmutex);
827	+	rcu_read_unlock();
828		return 0;
829	 }
830
831	@@ -29,15 +30,16 @@
832	 {
833		struct el *p;
834
835	-	write_lock(&listmutex);
836	+	spin_lock(&listmutex);
837		list_for_each_entry(p, head, lp) {
838			if (p->key == key) {
839	-			list_del(&p->list);
840	-			write_unlock(&listmutex);
841	+			list_del_rcu(&p->list);
842	+			spin_unlock(&listmutex);
843	+			synchronize_rcu();
844				kfree(p);
845				return 1;
846			}
847		}
848	-	write_unlock(&listmutex);
849	+	spin_unlock(&listmutex);
850		return 0;
851	 }
852
853Or, for those who prefer a side-by-side listing::
854
855 1 struct el {                          1 struct el {
856 2   struct list_head list;             2   struct list_head list;
857 3   long key;                          3   long key;
858 4   spinlock_t mutex;                  4   spinlock_t mutex;
859 5   int data;                          5   int data;
860 6   /* Other data fields */            6   /* Other data fields */
861 7 };                                   7 };
862 8 rwlock_t listmutex;                  8 spinlock_t listmutex;
863 9 struct el head;                      9 struct el head;
864
865::
866
867  1 int search(long key, int *result)    1 int search(long key, int *result)
868  2 {                                    2 {
869  3   struct list_head *lp;              3   struct list_head *lp;
870  4   struct el *p;                      4   struct el *p;
871  5                                      5
872  6   read_lock(&listmutex);             6   rcu_read_lock();
873  7   list_for_each_entry(p, head, lp) { 7   list_for_each_entry_rcu(p, head, lp) {
874  8     if (p->key == key) {             8     if (p->key == key) {
875  9       *result = p->data;             9       *result = p->data;
876 10       read_unlock(&listmutex);      10       rcu_read_unlock();
877 11       return 1;                     11       return 1;
878 12     }                               12     }
879 13   }                                 13   }
880 14   read_unlock(&listmutex);          14   rcu_read_unlock();
881 15   return 0;                         15   return 0;
882 16 }                                   16 }
883
884::
885
886  1 int delete(long key)                 1 int delete(long key)
887  2 {                                    2 {
888  3   struct el *p;                      3   struct el *p;
889  4                                      4
890  5   write_lock(&listmutex);            5   spin_lock(&listmutex);
891  6   list_for_each_entry(p, head, lp) { 6   list_for_each_entry(p, head, lp) {
892  7     if (p->key == key) {             7     if (p->key == key) {
893  8       list_del(&p->list);            8       list_del_rcu(&p->list);
894  9       write_unlock(&listmutex);      9       spin_unlock(&listmutex);
895                                        10       synchronize_rcu();
896 10       kfree(p);                     11       kfree(p);
897 11       return 1;                     12       return 1;
898 12     }                               13     }
899 13   }                                 14   }
900 14   write_unlock(&listmutex);         15   spin_unlock(&listmutex);
901 15   return 0;                         16   return 0;
902 16 }                                   17 }
903
904Either way, the differences are quite small.  Read-side locking moves
905to rcu_read_lock() and rcu_read_unlock, update-side locking moves from
906a reader-writer lock to a simple spinlock, and a synchronize_rcu()
907precedes the kfree().
908
909However, there is one potential catch: the read-side and update-side
910critical sections can now run concurrently.  In many cases, this will
911not be a problem, but it is necessary to check carefully regardless.
912For example, if multiple independent list updates must be seen as
913a single atomic update, converting to RCU will require special care.
914
915Also, the presence of synchronize_rcu() means that the RCU version of
916delete() can now block.  If this is a problem, there is a callback-based
917mechanism that never blocks, namely call_rcu() or kfree_rcu(), that can
918be used in place of synchronize_rcu().
919
920.. _7_whatisRCU:
921
9227.  ANALOGY WITH REFERENCE COUNTING
923-----------------------------------
924
925The reader-writer analogy (illustrated by the previous section) is not
926always the best way to think about using RCU.  Another helpful analogy
927considers RCU an effective reference count on everything which is
928protected by RCU.
929
930A reference count typically does not prevent the referenced object's
931values from changing, but does prevent changes to type -- particularly the
932gross change of type that happens when that object's memory is freed and
933re-allocated for some other purpose.  Once a type-safe reference to the
934object is obtained, some other mechanism is needed to ensure consistent
935access to the data in the object.  This could involve taking a spinlock,
936but with RCU the typical approach is to perform reads with SMP-aware
937operations such as smp_load_acquire(), to perform updates with atomic
938read-modify-write operations, and to provide the necessary ordering.
939RCU provides a number of support functions that embed the required
940operations and ordering, such as the list_for_each_entry_rcu() macro
941used in the previous section.
942
943A more focused view of the reference counting behavior is that,
944between rcu_read_lock() and rcu_read_unlock(), any reference taken with
945rcu_dereference() on a pointer marked as ``__rcu`` can be treated as
946though a reference-count on that object has been temporarily increased.
947This prevents the object from changing type.  Exactly what this means
948will depend on normal expectations of objects of that type, but it
949typically includes that spinlocks can still be safely locked, normal
950reference counters can be safely manipulated, and ``__rcu`` pointers
951can be safely dereferenced.
952
953Some operations that one might expect to see on an object for
954which an RCU reference is held include:
955
956 - Copying out data that is guaranteed to be stable by the object's type.
957 - Using kref_get_unless_zero() or similar to get a longer-term
958   reference.  This may fail of course.
959 - Acquiring a spinlock in the object, and checking if the object still
960   is the expected object and if so, manipulating it freely.
961
962The understanding that RCU provides a reference that only prevents a
963change of type is particularly visible with objects allocated from a
964slab cache marked ``SLAB_TYPESAFE_BY_RCU``.  RCU operations may yield a
965reference to an object from such a cache that has been concurrently freed
966and the memory reallocated to a completely different object, though of
967the same type.  In this case RCU doesn't even protect the identity of the
968object from changing, only its type.  So the object found may not be the
969one expected, but it will be one where it is safe to take a reference
970(and then potentially acquiring a spinlock), allowing subsequent code
971to check whether the identity matches expectations.  It is tempting
972to simply acquire the spinlock without first taking the reference, but
973unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
974initialized after each and every call to kmem_cache_alloc(), which renders
975reference-free spinlock acquisition completely unsafe.  Therefore, when
976using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
977If using refcount_t, the specialized refcount_{add|inc}_not_zero_acquire()
978and refcount_set_release() APIs should be used to ensure correct operation
979ordering when verifying object identity and when initializing newly
980allocated objects. Acquire fence in refcount_{add|inc}_not_zero_acquire()
981ensures that identity checks happen *after* reference count is taken.
982refcount_set_release() should be called after a newly allocated object is
983fully initialized and release fence ensures that new values are visible
984*before* refcount can be successfully taken by other users. Once
985refcount_set_release() is called, the object should be considered visible
986by other tasks.
987(Those willing to initialize their locks in a kmem_cache constructor
988may also use locking, including cache-friendly sequence locking.)
989
990With traditional reference counting -- such as that implemented by the
991kref library in Linux -- there is typically code that runs when the last
992reference to an object is dropped.  With kref, this is the function
993passed to kref_put().  When RCU is being used, such finalization code
994must not be run until all ``__rcu`` pointers referencing the object have
995been updated, and then a grace period has passed.  Every remaining
996globally visible pointer to the object must be considered to be a
997potential counted reference, and the finalization code is typically run
998using call_rcu() only after all those pointers have been changed.
999
1000To see how to choose between these two analogies -- of RCU as a
1001reader-writer lock and RCU as a reference counting system -- it is useful
1002to reflect on the scale of the thing being protected.  The reader-writer
1003lock analogy looks at larger multi-part objects such as a linked list
1004and shows how RCU can facilitate concurrency while elements are added
1005to, and removed from, the list.  The reference-count analogy looks at
1006the individual objects and looks at how they can be accessed safely
1007within whatever whole they are a part of.
1008
1009.. _8_whatisRCU:
1010
10118.  FULL LIST OF RCU APIs
1012-------------------------
1013
1014The RCU APIs are documented in docbook-format header comments in the
1015Linux-kernel source code, but it helps to have a full list of the
1016APIs, since there does not appear to be a way to categorize them
1017in docbook.  Here is the list, by category.
1018
1019RCU list traversal::
1020
1021	list_entry_rcu
1022	list_entry_lockless
1023	list_first_entry_rcu
1024	list_next_rcu
1025	list_for_each_entry_rcu
1026	list_for_each_entry_continue_rcu
1027	list_for_each_entry_from_rcu
1028	list_first_or_null_rcu
1029	list_next_or_null_rcu
1030	hlist_first_rcu
1031	hlist_next_rcu
1032	hlist_pprev_rcu
1033	hlist_for_each_entry_rcu
1034	hlist_for_each_entry_rcu_bh
1035	hlist_for_each_entry_from_rcu
1036	hlist_for_each_entry_continue_rcu
1037	hlist_for_each_entry_continue_rcu_bh
1038	hlist_nulls_first_rcu
1039	hlist_nulls_for_each_entry_rcu
1040	hlist_bl_first_rcu
1041	hlist_bl_for_each_entry_rcu
1042
1043RCU pointer/list update::
1044
1045	rcu_assign_pointer
1046	list_add_rcu
1047	list_add_tail_rcu
1048	list_del_rcu
1049	list_replace_rcu
1050	hlist_add_behind_rcu
1051	hlist_add_before_rcu
1052	hlist_add_head_rcu
1053	hlist_add_tail_rcu
1054	hlist_del_rcu
1055	hlist_del_init_rcu
1056	hlist_replace_rcu
1057	list_splice_init_rcu
1058	list_splice_tail_init_rcu
1059	hlist_nulls_del_init_rcu
1060	hlist_nulls_del_rcu
1061	hlist_nulls_add_head_rcu
1062	hlist_bl_add_head_rcu
1063	hlist_bl_del_init_rcu
1064	hlist_bl_del_rcu
1065	hlist_bl_set_first_rcu
1066
1067RCU::
1068
1069	Critical sections	Grace period		Barrier
1070
1071	rcu_read_lock		synchronize_net		rcu_barrier
1072	rcu_read_unlock		synchronize_rcu
1073	rcu_dereference		synchronize_rcu_expedited
1074	rcu_read_lock_held	call_rcu
1075	rcu_dereference_check	kfree_rcu
1076	rcu_dereference_protected
1077
1078bh::
1079
1080	Critical sections	Grace period		Barrier
1081
1082	rcu_read_lock_bh	call_rcu		rcu_barrier
1083	rcu_read_unlock_bh	synchronize_rcu
1084	[local_bh_disable]	synchronize_rcu_expedited
1085	[and friends]
1086	rcu_dereference_bh
1087	rcu_dereference_bh_check
1088	rcu_dereference_bh_protected
1089	rcu_read_lock_bh_held
1090
1091sched::
1092
1093	Critical sections	Grace period		Barrier
1094
1095	rcu_read_lock_sched	call_rcu		rcu_barrier
1096	rcu_read_unlock_sched	synchronize_rcu
1097	[preempt_disable]	synchronize_rcu_expedited
1098	[and friends]
1099	rcu_read_lock_sched_notrace
1100	rcu_read_unlock_sched_notrace
1101	rcu_dereference_sched
1102	rcu_dereference_sched_check
1103	rcu_dereference_sched_protected
1104	rcu_read_lock_sched_held
1105
1106
1107RCU-Tasks::
1108
1109	Critical sections	Grace period		Barrier
1110
1111	N/A			call_rcu_tasks		rcu_barrier_tasks
1112				synchronize_rcu_tasks
1113
1114
1115RCU-Tasks-Rude::
1116
1117	Critical sections	Grace period		Barrier
1118
1119	N/A						N/A
1120				synchronize_rcu_tasks_rude
1121
1122
1123RCU-Tasks-Trace::
1124
1125	Critical sections	Grace period		Barrier
1126
1127	rcu_read_lock_trace	call_rcu_tasks_trace	rcu_barrier_tasks_trace
1128	rcu_read_unlock_trace	synchronize_rcu_tasks_trace
1129
1130
1131SRCU::
1132
1133	Critical sections	Grace period		Barrier
1134
1135	srcu_read_lock		call_srcu		srcu_barrier
1136	srcu_read_unlock	synchronize_srcu
1137	srcu_dereference	synchronize_srcu_expedited
1138	srcu_dereference_check
1139	srcu_read_lock_held
1140
1141SRCU: Initialization/cleanup::
1142
1143	DEFINE_SRCU
1144	DEFINE_STATIC_SRCU
1145	init_srcu_struct
1146	cleanup_srcu_struct
1147
1148All: lockdep-checked RCU utility APIs::
1149
1150	RCU_LOCKDEP_WARN
1151	rcu_sleep_check
1152
1153All: Unchecked RCU-protected pointer access::
1154
1155	rcu_dereference_raw
1156
1157All: Unchecked RCU-protected pointer access with dereferencing prohibited::
1158
1159	rcu_access_pointer
1160
1161See the comment headers in the source code (or the docbook generated
1162from them) for more information.
1163
1164However, given that there are no fewer than four families of RCU APIs
1165in the Linux kernel, how do you choose which one to use?  The following
1166list can be helpful:
1167
1168a.	Will readers need to block?  If so, you need SRCU.
1169
1170b.	Will readers need to block and are you doing tracing, for
1171	example, ftrace or BPF?  If so, you need RCU-tasks,
1172	RCU-tasks-rude, and/or RCU-tasks-trace.
1173
1174c.	What about the -rt patchset?  If readers would need to block in
1175	an non-rt kernel, you need SRCU.  If readers would block when
1176	acquiring spinlocks in a -rt kernel, but not in a non-rt kernel,
1177	SRCU is not necessary.	(The -rt patchset turns spinlocks into
1178	sleeplocks, hence this distinction.)
1179
1180d.	Do you need to treat NMI handlers, hardirq handlers,
1181	and code segments with preemption disabled (whether
1182	via preempt_disable(), local_irq_save(), local_bh_disable(),
1183	or some other mechanism) as if they were explicit RCU readers?
1184	If so, RCU-sched readers are the only choice that will work
1185	for you, but since about v4.20 you use can use the vanilla RCU
1186	update primitives.
1187
1188e.	Do you need RCU grace periods to complete even in the face of
1189	softirq monopolization of one or more of the CPUs?  For example,
1190	is your code subject to network-based denial-of-service attacks?
1191	If so, you should disable softirq across your readers, for
1192	example, by using rcu_read_lock_bh().  Since about v4.20 you
1193	use can use the vanilla RCU update primitives.
1194
1195f.	Is your workload too update-intensive for normal use of
1196	RCU, but inappropriate for other synchronization mechanisms?
1197	If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
1198	named SLAB_DESTROY_BY_RCU).  But please be careful!
1199
1200g.	Do you need read-side critical sections that are respected even
1201	on CPUs that are deep in the idle loop, during entry to or exit
1202	from user-mode execution, or on an offlined CPU?  If so, SRCU
1203	and RCU Tasks Trace are the only choices that will work for you,
1204	with SRCU being strongly preferred in almost all cases.
1205
1206h.	Otherwise, use RCU.
1207
1208Of course, this all assumes that you have determined that RCU is in fact
1209the right tool for your job.
1210
1211.. _9_whatisRCU:
1212
12139.  ANSWERS TO QUICK QUIZZES
1214----------------------------
1215
1216Quick Quiz #1:
1217		Why is this argument naive?  How could a deadlock
1218		occur when using this algorithm in a real-world Linux
1219		kernel?  [Referring to the lock-based "toy" RCU
1220		algorithm.]
1221
1222Answer:
1223		Consider the following sequence of events:
1224
1225		1.	CPU 0 acquires some unrelated lock, call it
1226			"problematic_lock", disabling irq via
1227			spin_lock_irqsave().
1228
1229		2.	CPU 1 enters synchronize_rcu(), write-acquiring
1230			rcu_gp_mutex.
1231
1232		3.	CPU 0 enters rcu_read_lock(), but must wait
1233			because CPU 1 holds rcu_gp_mutex.
1234
1235		4.	CPU 1 is interrupted, and the irq handler
1236			attempts to acquire problematic_lock.
1237
1238		The system is now deadlocked.
1239
1240		One way to avoid this deadlock is to use an approach like
1241		that of CONFIG_PREEMPT_RT, where all normal spinlocks
1242		become blocking locks, and all irq handlers execute in
1243		the context of special tasks.  In this case, in step 4
1244		above, the irq handler would block, allowing CPU 1 to
1245		release rcu_gp_mutex, avoiding the deadlock.
1246
1247		Even in the absence of deadlock, this RCU implementation
1248		allows latency to "bleed" from readers to other
1249		readers through synchronize_rcu().  To see this,
1250		consider task A in an RCU read-side critical section
1251		(thus read-holding rcu_gp_mutex), task B blocked
1252		attempting to write-acquire rcu_gp_mutex, and
1253		task C blocked in rcu_read_lock() attempting to
1254		read_acquire rcu_gp_mutex.  Task A's RCU read-side
1255		latency is holding up task C, albeit indirectly via
1256		task B.
1257
1258		Realtime RCU implementations therefore use a counter-based
1259		approach where tasks in RCU read-side critical sections
1260		cannot be blocked by tasks executing synchronize_rcu().
1261
1262:ref:`Back to Quick Quiz #1 <quiz_1>`
1263
1264Quick Quiz #2:
1265		Give an example where Classic RCU's read-side
1266		overhead is **negative**.
1267
1268Answer:
1269		Imagine a single-CPU system with a non-CONFIG_PREEMPTION
1270		kernel where a routing table is used by process-context
1271		code, but can be updated by irq-context code (for example,
1272		by an "ICMP REDIRECT" packet).	The usual way of handling
1273		this would be to have the process-context code disable
1274		interrupts while searching the routing table.  Use of
1275		RCU allows such interrupt-disabling to be dispensed with.
1276		Thus, without RCU, you pay the cost of disabling interrupts,
1277		and with RCU you don't.
1278
1279		One can argue that the overhead of RCU in this
1280		case is negative with respect to the single-CPU
1281		interrupt-disabling approach.  Others might argue that
1282		the overhead of RCU is merely zero, and that replacing
1283		the positive overhead of the interrupt-disabling scheme
1284		with the zero-overhead RCU scheme does not constitute
1285		negative overhead.
1286
1287		In real life, of course, things are more complex.  But
1288		even the theoretical possibility of negative overhead for
1289		a synchronization primitive is a bit unexpected.  ;-)
1290
1291:ref:`Back to Quick Quiz #2 <quiz_2>`
1292
1293Quick Quiz #3:
1294		If it is illegal to block in an RCU read-side
1295		critical section, what the heck do you do in
1296		CONFIG_PREEMPT_RT, where normal spinlocks can block???
1297
1298Answer:
1299		Just as CONFIG_PREEMPT_RT permits preemption of spinlock
1300		critical sections, it permits preemption of RCU
1301		read-side critical sections.  It also permits
1302		spinlocks blocking while in RCU read-side critical
1303		sections.
1304
1305		Why the apparent inconsistency?  Because it is
1306		possible to use priority boosting to keep the RCU
1307		grace periods short if need be (for example, if running
1308		short of memory).  In contrast, if blocking waiting
1309		for (say) network reception, there is no way to know
1310		what should be boosted.  Especially given that the
1311		process we need to boost might well be a human being
1312		who just went out for a pizza or something.  And although
1313		a computer-operated cattle prod might arouse serious
1314		interest, it might also provoke serious objections.
1315		Besides, how does the computer know what pizza parlor
1316		the human being went to???
1317
1318:ref:`Back to Quick Quiz #3 <quiz_3>`
1319
1320ACKNOWLEDGEMENTS
1321
1322My thanks to the people who helped make this human-readable, including
1323Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.
1324
1325
1326For more information, see http://www.rdrop.com/users/paulmck/RCU.
1327