1.. _whatisrcu_doc: 2 3What is RCU? -- "Read, Copy, Update" 4====================================== 5 6Please note that the "What is RCU?" LWN series is an excellent place 7to start learning about RCU: 8 9| 1. What is RCU, Fundamentally? https://lwn.net/Articles/262464/ 10| 2. What is RCU? Part 2: Usage https://lwn.net/Articles/263130/ 11| 3. RCU part 3: the RCU API https://lwn.net/Articles/264090/ 12| 4. The RCU API, 2010 Edition https://lwn.net/Articles/418853/ 13| 2010 Big API Table https://lwn.net/Articles/419086/ 14| 5. The RCU API, 2014 Edition https://lwn.net/Articles/609904/ 15| 2014 Big API Table https://lwn.net/Articles/609973/ 16| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/ 17| 2019 Big API Table https://lwn.net/Articles/777165/ 18 19 20What is RCU? 21 22RCU is a synchronization mechanism that was added to the Linux kernel 23during the 2.5 development effort that is optimized for read-mostly 24situations. Although RCU is actually quite simple once you understand it, 25getting there can sometimes be a challenge. Part of the problem is that 26most of the past descriptions of RCU have been written with the mistaken 27assumption that there is "one true way" to describe RCU. Instead, 28the experience has been that different people must take different paths 29to arrive at an understanding of RCU. This document provides several 30different paths, as follows: 31 32:ref:`1. RCU OVERVIEW <1_whatisRCU>` 33 34:ref:`2. WHAT IS RCU'S CORE API? <2_whatisRCU>` 35 36:ref:`3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? <3_whatisRCU>` 37 38:ref:`4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? <4_whatisRCU>` 39 40:ref:`5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? <5_whatisRCU>` 41 42:ref:`6. ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>` 43 44:ref:`7. ANALOGY WITH REFERENCE COUNTING <7_whatisRCU>` 45 46:ref:`8. FULL LIST OF RCU APIs <8_whatisRCU>` 47 48:ref:`9. ANSWERS TO QUICK QUIZZES <9_whatisRCU>` 49 50People who prefer starting with a conceptual overview should focus on 51Section 1, though most readers will profit by reading this section at 52some point. People who prefer to start with an API that they can then 53experiment with should focus on Section 2. People who prefer to start 54with example uses should focus on Sections 3 and 4. People who need to 55understand the RCU implementation should focus on Section 5, then dive 56into the kernel source code. People who reason best by analogy should 57focus on Section 6. Section 7 serves as an index to the docbook API 58documentation, and Section 8 is the traditional answer key. 59 60So, start with the section that makes the most sense to you and your 61preferred method of learning. If you need to know everything about 62everything, feel free to read the whole thing -- but if you are really 63that type of person, you have perused the source code and will therefore 64never need this document anyway. ;-) 65 66.. _1_whatisRCU: 67 681. RCU OVERVIEW 69---------------- 70 71The basic idea behind RCU is to split updates into "removal" and 72"reclamation" phases. The removal phase removes references to data items 73within a data structure (possibly by replacing them with references to 74new versions of these data items), and can run concurrently with readers. 75The reason that it is safe to run the removal phase concurrently with 76readers is the semantics of modern CPUs guarantee that readers will see 77either the old or the new version of the data structure rather than a 78partially updated reference. The reclamation phase does the work of reclaiming 79(e.g., freeing) the data items removed from the data structure during the 80removal phase. Because reclaiming data items can disrupt any readers 81concurrently referencing those data items, the reclamation phase must 82not start until readers no longer hold references to those data items. 83 84Splitting the update into removal and reclamation phases permits the 85updater to perform the removal phase immediately, and to defer the 86reclamation phase until all readers active during the removal phase have 87completed, either by blocking until they finish or by registering a 88callback that is invoked after they finish. Only readers that are active 89during the removal phase need be considered, because any reader starting 90after the removal phase will be unable to gain a reference to the removed 91data items, and therefore cannot be disrupted by the reclamation phase. 92 93So the typical RCU update sequence goes something like the following: 94 95a. Remove pointers to a data structure, so that subsequent 96 readers cannot gain a reference to it. 97 98b. Wait for all previous readers to complete their RCU read-side 99 critical sections. 100 101c. At this point, there cannot be any readers who hold references 102 to the data structure, so it now may safely be reclaimed 103 (e.g., kfree()d). 104 105Step (b) above is the key idea underlying RCU's deferred destruction. 106The ability to wait until all readers are done allows RCU readers to 107use much lighter-weight synchronization, in some cases, absolutely no 108synchronization at all. In contrast, in more conventional lock-based 109schemes, readers must use heavy-weight synchronization in order to 110prevent an updater from deleting the data structure out from under them. 111This is because lock-based updaters typically update data items in place, 112and must therefore exclude readers. In contrast, RCU-based updaters 113typically take advantage of the fact that writes to single aligned 114pointers are atomic on modern CPUs, allowing atomic insertion, removal, 115and replacement of data items in a linked structure without disrupting 116readers. Concurrent RCU readers can then continue accessing the old 117versions, and can dispense with the atomic operations, memory barriers, 118and communications cache misses that are so expensive on present-day 119SMP computer systems, even in absence of lock contention. 120 121In the three-step procedure shown above, the updater is performing both 122the removal and the reclamation step, but it is often helpful for an 123entirely different thread to do the reclamation, as is in fact the case 124in the Linux kernel's directory-entry cache (dcache). Even if the same 125thread performs both the update step (step (a) above) and the reclamation 126step (step (c) above), it is often helpful to think of them separately. 127For example, RCU readers and updaters need not communicate at all, 128but RCU provides implicit low-overhead communication between readers 129and reclaimers, namely, in step (b) above. 130 131So how the heck can a reclaimer tell when a reader is done, given 132that readers are not doing any sort of synchronization operations??? 133Read on to learn about how RCU's API makes this easy. 134 135.. _2_whatisRCU: 136 1372. WHAT IS RCU'S CORE API? 138--------------------------- 139 140The core RCU API is quite small: 141 142a. rcu_read_lock() 143b. rcu_read_unlock() 144c. synchronize_rcu() / call_rcu() 145d. rcu_assign_pointer() 146e. rcu_dereference() 147 148There are many other members of the RCU API, but the rest can be 149expressed in terms of these five, though most implementations instead 150express synchronize_rcu() in terms of the call_rcu() callback API. 151 152The five core RCU APIs are described below, the other 18 will be enumerated 153later. See the kernel docbook documentation for more info, or look directly 154at the function header comments. 155 156rcu_read_lock() 157^^^^^^^^^^^^^^^ 158 void rcu_read_lock(void); 159 160 Used by a reader to inform the reclaimer that the reader is 161 entering an RCU read-side critical section. It is illegal 162 to block while in an RCU read-side critical section, though 163 kernels built with CONFIG_PREEMPT_RCU can preempt RCU 164 read-side critical sections. Any RCU-protected data structure 165 accessed during an RCU read-side critical section is guaranteed to 166 remain unreclaimed for the full duration of that critical section. 167 Reference counts may be used in conjunction with RCU to maintain 168 longer-term references to data structures. 169 170rcu_read_unlock() 171^^^^^^^^^^^^^^^^^ 172 void rcu_read_unlock(void); 173 174 Used by a reader to inform the reclaimer that the reader is 175 exiting an RCU read-side critical section. Note that RCU 176 read-side critical sections may be nested and/or overlapping. 177 178synchronize_rcu() 179^^^^^^^^^^^^^^^^^ 180 void synchronize_rcu(void); 181 182 Marks the end of updater code and the beginning of reclaimer 183 code. It does this by blocking until all pre-existing RCU 184 read-side critical sections on all CPUs have completed. 185 Note that synchronize_rcu() will **not** necessarily wait for 186 any subsequent RCU read-side critical sections to complete. 187 For example, consider the following sequence of events:: 188 189 CPU 0 CPU 1 CPU 2 190 ----------------- ------------------------- --------------- 191 1. rcu_read_lock() 192 2. enters synchronize_rcu() 193 3. rcu_read_lock() 194 4. rcu_read_unlock() 195 5. exits synchronize_rcu() 196 6. rcu_read_unlock() 197 198 To reiterate, synchronize_rcu() waits only for ongoing RCU 199 read-side critical sections to complete, not necessarily for 200 any that begin after synchronize_rcu() is invoked. 201 202 Of course, synchronize_rcu() does not necessarily return 203 **immediately** after the last pre-existing RCU read-side critical 204 section completes. For one thing, there might well be scheduling 205 delays. For another thing, many RCU implementations process 206 requests in batches in order to improve efficiencies, which can 207 further delay synchronize_rcu(). 208 209 Since synchronize_rcu() is the API that must figure out when 210 readers are done, its implementation is key to RCU. For RCU 211 to be useful in all but the most read-intensive situations, 212 synchronize_rcu()'s overhead must also be quite small. 213 214 The call_rcu() API is a callback form of synchronize_rcu(), 215 and is described in more detail in a later section. Instead of 216 blocking, it registers a function and argument which are invoked 217 after all ongoing RCU read-side critical sections have completed. 218 This callback variant is particularly useful in situations where 219 it is illegal to block or where update-side performance is 220 critically important. 221 222 However, the call_rcu() API should not be used lightly, as use 223 of the synchronize_rcu() API generally results in simpler code. 224 In addition, the synchronize_rcu() API has the nice property 225 of automatically limiting update rate should grace periods 226 be delayed. This property results in system resilience in face 227 of denial-of-service attacks. Code using call_rcu() should limit 228 update rate in order to gain this same sort of resilience. See 229 checklist.rst for some approaches to limiting the update rate. 230 231rcu_assign_pointer() 232^^^^^^^^^^^^^^^^^^^^ 233 void rcu_assign_pointer(p, typeof(p) v); 234 235 Yes, rcu_assign_pointer() **is** implemented as a macro, though it 236 would be cool to be able to declare a function in this manner. 237 (Compiler experts will no doubt disagree.) 238 239 The updater uses this function to assign a new value to an 240 RCU-protected pointer, in order to safely communicate the change 241 in value from the updater to the reader. This macro does not 242 evaluate to an rvalue, but it does execute any memory-barrier 243 instructions required for a given CPU architecture. 244 245 Perhaps just as important, it serves to document (1) which 246 pointers are protected by RCU and (2) the point at which a 247 given structure becomes accessible to other CPUs. That said, 248 rcu_assign_pointer() is most frequently used indirectly, via 249 the _rcu list-manipulation primitives such as list_add_rcu(). 250 251rcu_dereference() 252^^^^^^^^^^^^^^^^^ 253 typeof(p) rcu_dereference(p); 254 255 Like rcu_assign_pointer(), rcu_dereference() must be implemented 256 as a macro. 257 258 The reader uses rcu_dereference() to fetch an RCU-protected 259 pointer, which returns a value that may then be safely 260 dereferenced. Note that rcu_dereference() does not actually 261 dereference the pointer, instead, it protects the pointer for 262 later dereferencing. It also executes any needed memory-barrier 263 instructions for a given CPU architecture. Currently, only Alpha 264 needs memory barriers within rcu_dereference() -- on other CPUs, 265 it compiles to nothing, not even a compiler directive. 266 267 Common coding practice uses rcu_dereference() to copy an 268 RCU-protected pointer to a local variable, then dereferences 269 this local variable, for example as follows:: 270 271 p = rcu_dereference(head.next); 272 return p->data; 273 274 However, in this case, one could just as easily combine these 275 into one statement:: 276 277 return rcu_dereference(head.next)->data; 278 279 If you are going to be fetching multiple fields from the 280 RCU-protected structure, using the local variable is of 281 course preferred. Repeated rcu_dereference() calls look 282 ugly, do not guarantee that the same pointer will be returned 283 if an update happened while in the critical section, and incur 284 unnecessary overhead on Alpha CPUs. 285 286 Note that the value returned by rcu_dereference() is valid 287 only within the enclosing RCU read-side critical section [1]_. 288 For example, the following is **not** legal:: 289 290 rcu_read_lock(); 291 p = rcu_dereference(head.next); 292 rcu_read_unlock(); 293 x = p->address; /* BUG!!! */ 294 rcu_read_lock(); 295 y = p->data; /* BUG!!! */ 296 rcu_read_unlock(); 297 298 Holding a reference from one RCU read-side critical section 299 to another is just as illegal as holding a reference from 300 one lock-based critical section to another! Similarly, 301 using a reference outside of the critical section in which 302 it was acquired is just as illegal as doing so with normal 303 locking. 304 305 As with rcu_assign_pointer(), an important function of 306 rcu_dereference() is to document which pointers are protected by 307 RCU, in particular, flagging a pointer that is subject to changing 308 at any time, including immediately after the rcu_dereference(). 309 And, again like rcu_assign_pointer(), rcu_dereference() is 310 typically used indirectly, via the _rcu list-manipulation 311 primitives, such as list_for_each_entry_rcu() [2]_. 312 313.. [1] The variant rcu_dereference_protected() can be used outside 314 of an RCU read-side critical section as long as the usage is 315 protected by locks acquired by the update-side code. This variant 316 avoids the lockdep warning that would happen when using (for 317 example) rcu_dereference() without rcu_read_lock() protection. 318 Using rcu_dereference_protected() also has the advantage 319 of permitting compiler optimizations that rcu_dereference() 320 must prohibit. The rcu_dereference_protected() variant takes 321 a lockdep expression to indicate which locks must be acquired 322 by the caller. If the indicated protection is not provided, 323 a lockdep splat is emitted. See Design/Requirements/Requirements.rst 324 and the API's code comments for more details and example usage. 325 326.. [2] If the list_for_each_entry_rcu() instance might be used by 327 update-side code as well as by RCU readers, then an additional 328 lockdep expression can be added to its list of arguments. 329 For example, given an additional "lock_is_held(&mylock)" argument, 330 the RCU lockdep code would complain only if this instance was 331 invoked outside of an RCU read-side critical section and without 332 the protection of mylock. 333 334The following diagram shows how each API communicates among the 335reader, updater, and reclaimer. 336:: 337 338 339 rcu_assign_pointer() 340 +--------+ 341 +---------------------->| reader |---------+ 342 | +--------+ | 343 | | | 344 | | | Protect: 345 | | | rcu_read_lock() 346 | | | rcu_read_unlock() 347 | rcu_dereference() | | 348 +---------+ | | 349 | updater |<----------------+ | 350 +---------+ V 351 | +-----------+ 352 +----------------------------------->| reclaimer | 353 +-----------+ 354 Defer: 355 synchronize_rcu() & call_rcu() 356 357 358The RCU infrastructure observes the time sequence of rcu_read_lock(), 359rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in 360order to determine when (1) synchronize_rcu() invocations may return 361to their callers and (2) call_rcu() callbacks may be invoked. Efficient 362implementations of the RCU infrastructure make heavy use of batching in 363order to amortize their overhead over many uses of the corresponding APIs. 364 365There are at least three flavors of RCU usage in the Linux kernel. The diagram 366above shows the most common one. On the updater side, the rcu_assign_pointer(), 367synchronize_rcu() and call_rcu() primitives used are the same for all three 368flavors. However for protection (on the reader side), the primitives used vary 369depending on the flavor: 370 371a. rcu_read_lock() / rcu_read_unlock() 372 rcu_dereference() 373 374b. rcu_read_lock_bh() / rcu_read_unlock_bh() 375 local_bh_disable() / local_bh_enable() 376 rcu_dereference_bh() 377 378c. rcu_read_lock_sched() / rcu_read_unlock_sched() 379 preempt_disable() / preempt_enable() 380 local_irq_save() / local_irq_restore() 381 hardirq enter / hardirq exit 382 NMI enter / NMI exit 383 rcu_dereference_sched() 384 385These three flavors are used as follows: 386 387a. RCU applied to normal data structures. 388 389b. RCU applied to networking data structures that may be subjected 390 to remote denial-of-service attacks. 391 392c. RCU applied to scheduler and interrupt/NMI-handler tasks. 393 394Again, most uses will be of (a). The (b) and (c) cases are important 395for specialized uses, but are relatively uncommon. 396 397.. _3_whatisRCU: 398 3993. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? 400----------------------------------------------- 401 402This section shows a simple use of the core RCU API to protect a 403global pointer to a dynamically allocated structure. More-typical 404uses of RCU may be found in listRCU.rst, arrayRCU.rst, and NMI-RCU.rst. 405:: 406 407 struct foo { 408 int a; 409 char b; 410 long c; 411 }; 412 DEFINE_SPINLOCK(foo_mutex); 413 414 struct foo __rcu *gbl_foo; 415 416 /* 417 * Create a new struct foo that is the same as the one currently 418 * pointed to by gbl_foo, except that field "a" is replaced 419 * with "new_a". Points gbl_foo to the new structure, and 420 * frees up the old structure after a grace period. 421 * 422 * Uses rcu_assign_pointer() to ensure that concurrent readers 423 * see the initialized version of the new structure. 424 * 425 * Uses synchronize_rcu() to ensure that any readers that might 426 * have references to the old structure complete before freeing 427 * the old structure. 428 */ 429 void foo_update_a(int new_a) 430 { 431 struct foo *new_fp; 432 struct foo *old_fp; 433 434 new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); 435 spin_lock(&foo_mutex); 436 old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex)); 437 *new_fp = *old_fp; 438 new_fp->a = new_a; 439 rcu_assign_pointer(gbl_foo, new_fp); 440 spin_unlock(&foo_mutex); 441 synchronize_rcu(); 442 kfree(old_fp); 443 } 444 445 /* 446 * Return the value of field "a" of the current gbl_foo 447 * structure. Use rcu_read_lock() and rcu_read_unlock() 448 * to ensure that the structure does not get deleted out 449 * from under us, and use rcu_dereference() to ensure that 450 * we see the initialized version of the structure (important 451 * for DEC Alpha and for people reading the code). 452 */ 453 int foo_get_a(void) 454 { 455 int retval; 456 457 rcu_read_lock(); 458 retval = rcu_dereference(gbl_foo)->a; 459 rcu_read_unlock(); 460 return retval; 461 } 462 463So, to sum up: 464 465- Use rcu_read_lock() and rcu_read_unlock() to guard RCU 466 read-side critical sections. 467 468- Within an RCU read-side critical section, use rcu_dereference() 469 to dereference RCU-protected pointers. 470 471- Use some solid scheme (such as locks or semaphores) to 472 keep concurrent updates from interfering with each other. 473 474- Use rcu_assign_pointer() to update an RCU-protected pointer. 475 This primitive protects concurrent readers from the updater, 476 **not** concurrent updates from each other! You therefore still 477 need to use locking (or something similar) to keep concurrent 478 rcu_assign_pointer() primitives from interfering with each other. 479 480- Use synchronize_rcu() **after** removing a data element from an 481 RCU-protected data structure, but **before** reclaiming/freeing 482 the data element, in order to wait for the completion of all 483 RCU read-side critical sections that might be referencing that 484 data item. 485 486See checklist.rst for additional rules to follow when using RCU. 487And again, more-typical uses of RCU may be found in listRCU.rst, 488arrayRCU.rst, and NMI-RCU.rst. 489 490.. _4_whatisRCU: 491 4924. WHAT IF MY UPDATING THREAD CANNOT BLOCK? 493-------------------------------------------- 494 495In the example above, foo_update_a() blocks until a grace period elapses. 496This is quite simple, but in some cases one cannot afford to wait so 497long -- there might be other high-priority work to be done. 498 499In such cases, one uses call_rcu() rather than synchronize_rcu(). 500The call_rcu() API is as follows:: 501 502 void call_rcu(struct rcu_head *head, rcu_callback_t func); 503 504This function invokes func(head) after a grace period has elapsed. 505This invocation might happen from either softirq or process context, 506so the function is not permitted to block. The foo struct needs to 507have an rcu_head structure added, perhaps as follows:: 508 509 struct foo { 510 int a; 511 char b; 512 long c; 513 struct rcu_head rcu; 514 }; 515 516The foo_update_a() function might then be written as follows:: 517 518 /* 519 * Create a new struct foo that is the same as the one currently 520 * pointed to by gbl_foo, except that field "a" is replaced 521 * with "new_a". Points gbl_foo to the new structure, and 522 * frees up the old structure after a grace period. 523 * 524 * Uses rcu_assign_pointer() to ensure that concurrent readers 525 * see the initialized version of the new structure. 526 * 527 * Uses call_rcu() to ensure that any readers that might have 528 * references to the old structure complete before freeing the 529 * old structure. 530 */ 531 void foo_update_a(int new_a) 532 { 533 struct foo *new_fp; 534 struct foo *old_fp; 535 536 new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); 537 spin_lock(&foo_mutex); 538 old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex)); 539 *new_fp = *old_fp; 540 new_fp->a = new_a; 541 rcu_assign_pointer(gbl_foo, new_fp); 542 spin_unlock(&foo_mutex); 543 call_rcu(&old_fp->rcu, foo_reclaim); 544 } 545 546The foo_reclaim() function might appear as follows:: 547 548 void foo_reclaim(struct rcu_head *rp) 549 { 550 struct foo *fp = container_of(rp, struct foo, rcu); 551 552 foo_cleanup(fp->a); 553 554 kfree(fp); 555 } 556 557The container_of() primitive is a macro that, given a pointer into a 558struct, the type of the struct, and the pointed-to field within the 559struct, returns a pointer to the beginning of the struct. 560 561The use of call_rcu() permits the caller of foo_update_a() to 562immediately regain control, without needing to worry further about the 563old version of the newly updated element. It also clearly shows the 564RCU distinction between updater, namely foo_update_a(), and reclaimer, 565namely foo_reclaim(). 566 567The summary of advice is the same as for the previous section, except 568that we are now using call_rcu() rather than synchronize_rcu(): 569 570- Use call_rcu() **after** removing a data element from an 571 RCU-protected data structure in order to register a callback 572 function that will be invoked after the completion of all RCU 573 read-side critical sections that might be referencing that 574 data item. 575 576If the callback for call_rcu() is not doing anything more than calling 577kfree() on the structure, you can use kfree_rcu() instead of call_rcu() 578to avoid having to write your own callback:: 579 580 kfree_rcu(old_fp, rcu); 581 582Again, see checklist.rst for additional rules governing the use of RCU. 583 584.. _5_whatisRCU: 585 5865. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? 587------------------------------------------------ 588 589One of the nice things about RCU is that it has extremely simple "toy" 590implementations that are a good first step towards understanding the 591production-quality implementations in the Linux kernel. This section 592presents two such "toy" implementations of RCU, one that is implemented 593in terms of familiar locking primitives, and another that more closely 594resembles "classic" RCU. Both are way too simple for real-world use, 595lacking both functionality and performance. However, they are useful 596in getting a feel for how RCU works. See kernel/rcu/update.c for a 597production-quality implementation, and see: 598 599 http://www.rdrop.com/users/paulmck/RCU 600 601for papers describing the Linux kernel RCU implementation. The OLS'01 602and OLS'02 papers are a good introduction, and the dissertation provides 603more details on the current implementation as of early 2004. 604 605 6065A. "TOY" IMPLEMENTATION #1: LOCKING 607^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 608This section presents a "toy" RCU implementation that is based on 609familiar locking primitives. Its overhead makes it a non-starter for 610real-life use, as does its lack of scalability. It is also unsuitable 611for realtime use, since it allows scheduling latency to "bleed" from 612one read-side critical section to another. It also assumes recursive 613reader-writer locks: If you try this with non-recursive locks, and 614you allow nested rcu_read_lock() calls, you can deadlock. 615 616However, it is probably the easiest implementation to relate to, so is 617a good starting point. 618 619It is extremely simple:: 620 621 static DEFINE_RWLOCK(rcu_gp_mutex); 622 623 void rcu_read_lock(void) 624 { 625 read_lock(&rcu_gp_mutex); 626 } 627 628 void rcu_read_unlock(void) 629 { 630 read_unlock(&rcu_gp_mutex); 631 } 632 633 void synchronize_rcu(void) 634 { 635 write_lock(&rcu_gp_mutex); 636 smp_mb__after_spinlock(); 637 write_unlock(&rcu_gp_mutex); 638 } 639 640[You can ignore rcu_assign_pointer() and rcu_dereference() without missing 641much. But here are simplified versions anyway. And whatever you do, 642don't forget about them when submitting patches making use of RCU!]:: 643 644 #define rcu_assign_pointer(p, v) \ 645 ({ \ 646 smp_store_release(&(p), (v)); \ 647 }) 648 649 #define rcu_dereference(p) \ 650 ({ \ 651 typeof(p) _________p1 = READ_ONCE(p); \ 652 (_________p1); \ 653 }) 654 655 656The rcu_read_lock() and rcu_read_unlock() primitive read-acquire 657and release a global reader-writer lock. The synchronize_rcu() 658primitive write-acquires this same lock, then releases it. This means 659that once synchronize_rcu() exits, all RCU read-side critical sections 660that were in progress before synchronize_rcu() was called are guaranteed 661to have completed -- there is no way that synchronize_rcu() would have 662been able to write-acquire the lock otherwise. The smp_mb__after_spinlock() 663promotes synchronize_rcu() to a full memory barrier in compliance with 664the "Memory-Barrier Guarantees" listed in: 665 666 Design/Requirements/Requirements.rst 667 668It is possible to nest rcu_read_lock(), since reader-writer locks may 669be recursively acquired. Note also that rcu_read_lock() is immune 670from deadlock (an important property of RCU). The reason for this is 671that the only thing that can block rcu_read_lock() is a synchronize_rcu(). 672But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex, 673so there can be no deadlock cycle. 674 675.. _quiz_1: 676 677Quick Quiz #1: 678 Why is this argument naive? How could a deadlock 679 occur when using this algorithm in a real-world Linux 680 kernel? How could this deadlock be avoided? 681 682:ref:`Answers to Quick Quiz <9_whatisRCU>` 683 6845B. "TOY" EXAMPLE #2: CLASSIC RCU 685^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 686This section presents a "toy" RCU implementation that is based on 687"classic RCU". It is also short on performance (but only for updates) and 688on features such as hotplug CPU and the ability to run in CONFIG_PREEMPTION 689kernels. The definitions of rcu_dereference() and rcu_assign_pointer() 690are the same as those shown in the preceding section, so they are omitted. 691:: 692 693 void rcu_read_lock(void) { } 694 695 void rcu_read_unlock(void) { } 696 697 void synchronize_rcu(void) 698 { 699 int cpu; 700 701 for_each_possible_cpu(cpu) 702 run_on(cpu); 703 } 704 705Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. 706This is the great strength of classic RCU in a non-preemptive kernel: 707read-side overhead is precisely zero, at least on non-Alpha CPUs. 708And there is absolutely no way that rcu_read_lock() can possibly 709participate in a deadlock cycle! 710 711The implementation of synchronize_rcu() simply schedules itself on each 712CPU in turn. The run_on() primitive can be implemented straightforwardly 713in terms of the sched_setaffinity() primitive. Of course, a somewhat less 714"toy" implementation would restore the affinity upon completion rather 715than just leaving all tasks running on the last CPU, but when I said 716"toy", I meant **toy**! 717 718So how the heck is this supposed to work??? 719 720Remember that it is illegal to block while in an RCU read-side critical 721section. Therefore, if a given CPU executes a context switch, we know 722that it must have completed all preceding RCU read-side critical sections. 723Once **all** CPUs have executed a context switch, then **all** preceding 724RCU read-side critical sections will have completed. 725 726So, suppose that we remove a data item from its structure and then invoke 727synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed 728that there are no RCU read-side critical sections holding a reference 729to that data item, so we can safely reclaim it. 730 731.. _quiz_2: 732 733Quick Quiz #2: 734 Give an example where Classic RCU's read-side 735 overhead is **negative**. 736 737:ref:`Answers to Quick Quiz <9_whatisRCU>` 738 739.. _quiz_3: 740 741Quick Quiz #3: 742 If it is illegal to block in an RCU read-side 743 critical section, what the heck do you do in 744 CONFIG_PREEMPT_RT, where normal spinlocks can block??? 745 746:ref:`Answers to Quick Quiz <9_whatisRCU>` 747 748.. _6_whatisRCU: 749 7506. ANALOGY WITH READER-WRITER LOCKING 751-------------------------------------- 752 753Although RCU can be used in many different ways, a very common use of 754RCU is analogous to reader-writer locking. The following unified 755diff shows how closely related RCU and reader-writer locking can be. 756:: 757 758 @@ -5,5 +5,5 @@ struct el { 759 int data; 760 /* Other data fields */ 761 }; 762 -rwlock_t listmutex; 763 +spinlock_t listmutex; 764 struct el head; 765 766 @@ -13,15 +14,15 @@ 767 struct list_head *lp; 768 struct el *p; 769 770 - read_lock(&listmutex); 771 - list_for_each_entry(p, head, lp) { 772 + rcu_read_lock(); 773 + list_for_each_entry_rcu(p, head, lp) { 774 if (p->key == key) { 775 *result = p->data; 776 - read_unlock(&listmutex); 777 + rcu_read_unlock(); 778 return 1; 779 } 780 } 781 - read_unlock(&listmutex); 782 + rcu_read_unlock(); 783 return 0; 784 } 785 786 @@ -29,15 +30,16 @@ 787 { 788 struct el *p; 789 790 - write_lock(&listmutex); 791 + spin_lock(&listmutex); 792 list_for_each_entry(p, head, lp) { 793 if (p->key == key) { 794 - list_del(&p->list); 795 - write_unlock(&listmutex); 796 + list_del_rcu(&p->list); 797 + spin_unlock(&listmutex); 798 + synchronize_rcu(); 799 kfree(p); 800 return 1; 801 } 802 } 803 - write_unlock(&listmutex); 804 + spin_unlock(&listmutex); 805 return 0; 806 } 807 808Or, for those who prefer a side-by-side listing:: 809 810 1 struct el { 1 struct el { 811 2 struct list_head list; 2 struct list_head list; 812 3 long key; 3 long key; 813 4 spinlock_t mutex; 4 spinlock_t mutex; 814 5 int data; 5 int data; 815 6 /* Other data fields */ 6 /* Other data fields */ 816 7 }; 7 }; 817 8 rwlock_t listmutex; 8 spinlock_t listmutex; 818 9 struct el head; 9 struct el head; 819 820:: 821 822 1 int search(long key, int *result) 1 int search(long key, int *result) 823 2 { 2 { 824 3 struct list_head *lp; 3 struct list_head *lp; 825 4 struct el *p; 4 struct el *p; 826 5 5 827 6 read_lock(&listmutex); 6 rcu_read_lock(); 828 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) { 829 8 if (p->key == key) { 8 if (p->key == key) { 830 9 *result = p->data; 9 *result = p->data; 831 10 read_unlock(&listmutex); 10 rcu_read_unlock(); 832 11 return 1; 11 return 1; 833 12 } 12 } 834 13 } 13 } 835 14 read_unlock(&listmutex); 14 rcu_read_unlock(); 836 15 return 0; 15 return 0; 837 16 } 16 } 838 839:: 840 841 1 int delete(long key) 1 int delete(long key) 842 2 { 2 { 843 3 struct el *p; 3 struct el *p; 844 4 4 845 5 write_lock(&listmutex); 5 spin_lock(&listmutex); 846 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { 847 7 if (p->key == key) { 7 if (p->key == key) { 848 8 list_del(&p->list); 8 list_del_rcu(&p->list); 849 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); 850 10 synchronize_rcu(); 851 10 kfree(p); 11 kfree(p); 852 11 return 1; 12 return 1; 853 12 } 13 } 854 13 } 14 } 855 14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); 856 15 return 0; 16 return 0; 857 16 } 17 } 858 859Either way, the differences are quite small. Read-side locking moves 860to rcu_read_lock() and rcu_read_unlock, update-side locking moves from 861a reader-writer lock to a simple spinlock, and a synchronize_rcu() 862precedes the kfree(). 863 864However, there is one potential catch: the read-side and update-side 865critical sections can now run concurrently. In many cases, this will 866not be a problem, but it is necessary to check carefully regardless. 867For example, if multiple independent list updates must be seen as 868a single atomic update, converting to RCU will require special care. 869 870Also, the presence of synchronize_rcu() means that the RCU version of 871delete() can now block. If this is a problem, there is a callback-based 872mechanism that never blocks, namely call_rcu() or kfree_rcu(), that can 873be used in place of synchronize_rcu(). 874 875.. _7_whatisRCU: 876 8777. ANALOGY WITH REFERENCE COUNTING 878----------------------------------- 879 880The reader-writer analogy (illustrated by the previous section) is not 881always the best way to think about using RCU. Another helpful analogy 882considers RCU an effective reference count on everything which is 883protected by RCU. 884 885A reference count typically does not prevent the referenced object's 886values from changing, but does prevent changes to type -- particularly the 887gross change of type that happens when that object's memory is freed and 888re-allocated for some other purpose. Once a type-safe reference to the 889object is obtained, some other mechanism is needed to ensure consistent 890access to the data in the object. This could involve taking a spinlock, 891but with RCU the typical approach is to perform reads with SMP-aware 892operations such as smp_load_acquire(), to perform updates with atomic 893read-modify-write operations, and to provide the necessary ordering. 894RCU provides a number of support functions that embed the required 895operations and ordering, such as the list_for_each_entry_rcu() macro 896used in the previous section. 897 898A more focused view of the reference counting behavior is that, 899between rcu_read_lock() and rcu_read_unlock(), any reference taken with 900rcu_dereference() on a pointer marked as ``__rcu`` can be treated as 901though a reference-count on that object has been temporarily increased. 902This prevents the object from changing type. Exactly what this means 903will depend on normal expectations of objects of that type, but it 904typically includes that spinlocks can still be safely locked, normal 905reference counters can be safely manipulated, and ``__rcu`` pointers 906can be safely dereferenced. 907 908Some operations that one might expect to see on an object for 909which an RCU reference is held include: 910 911 - Copying out data that is guaranteed to be stable by the object's type. 912 - Using kref_get_unless_zero() or similar to get a longer-term 913 reference. This may fail of course. 914 - Acquiring a spinlock in the object, and checking if the object still 915 is the expected object and if so, manipulating it freely. 916 917The understanding that RCU provides a reference that only prevents a 918change of type is particularly visible with objects allocated from a 919slab cache marked ``SLAB_TYPESAFE_BY_RCU``. RCU operations may yield a 920reference to an object from such a cache that has been concurrently freed 921and the memory reallocated to a completely different object, though of 922the same type. In this case RCU doesn't even protect the identity of the 923object from changing, only its type. So the object found may not be the 924one expected, but it will be one where it is safe to take a reference 925(and then potentially acquiring a spinlock), allowing subsequent code 926to check whether the identity matches expectations. It is tempting 927to simply acquire the spinlock without first taking the reference, but 928unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be 929initialized after each and every call to kmem_cache_alloc(), which renders 930reference-free spinlock acquisition completely unsafe. Therefore, when 931using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter. 932 933With traditional reference counting -- such as that implemented by the 934kref library in Linux -- there is typically code that runs when the last 935reference to an object is dropped. With kref, this is the function 936passed to kref_put(). When RCU is being used, such finalization code 937must not be run until all ``__rcu`` pointers referencing the object have 938been updated, and then a grace period has passed. Every remaining 939globally visible pointer to the object must be considered to be a 940potential counted reference, and the finalization code is typically run 941using call_rcu() only after all those pointers have been changed. 942 943To see how to choose between these two analogies -- of RCU as a 944reader-writer lock and RCU as a reference counting system -- it is useful 945to reflect on the scale of the thing being protected. The reader-writer 946lock analogy looks at larger multi-part objects such as a linked list 947and shows how RCU can facilitate concurrency while elements are added 948to, and removed from, the list. The reference-count analogy looks at 949the individual objects and looks at how they can be accessed safely 950within whatever whole they are a part of. 951 952.. _8_whatisRCU: 953 9548. FULL LIST OF RCU APIs 955------------------------- 956 957The RCU APIs are documented in docbook-format header comments in the 958Linux-kernel source code, but it helps to have a full list of the 959APIs, since there does not appear to be a way to categorize them 960in docbook. Here is the list, by category. 961 962RCU list traversal:: 963 964 list_entry_rcu 965 list_entry_lockless 966 list_first_entry_rcu 967 list_next_rcu 968 list_for_each_entry_rcu 969 list_for_each_entry_continue_rcu 970 list_for_each_entry_from_rcu 971 list_first_or_null_rcu 972 list_next_or_null_rcu 973 hlist_first_rcu 974 hlist_next_rcu 975 hlist_pprev_rcu 976 hlist_for_each_entry_rcu 977 hlist_for_each_entry_rcu_bh 978 hlist_for_each_entry_from_rcu 979 hlist_for_each_entry_continue_rcu 980 hlist_for_each_entry_continue_rcu_bh 981 hlist_nulls_first_rcu 982 hlist_nulls_for_each_entry_rcu 983 hlist_bl_first_rcu 984 hlist_bl_for_each_entry_rcu 985 986RCU pointer/list update:: 987 988 rcu_assign_pointer 989 list_add_rcu 990 list_add_tail_rcu 991 list_del_rcu 992 list_replace_rcu 993 hlist_add_behind_rcu 994 hlist_add_before_rcu 995 hlist_add_head_rcu 996 hlist_add_tail_rcu 997 hlist_del_rcu 998 hlist_del_init_rcu 999 hlist_replace_rcu 1000 list_splice_init_rcu 1001 list_splice_tail_init_rcu 1002 hlist_nulls_del_init_rcu 1003 hlist_nulls_del_rcu 1004 hlist_nulls_add_head_rcu 1005 hlist_bl_add_head_rcu 1006 hlist_bl_del_init_rcu 1007 hlist_bl_del_rcu 1008 hlist_bl_set_first_rcu 1009 1010RCU:: 1011 1012 Critical sections Grace period Barrier 1013 1014 rcu_read_lock synchronize_net rcu_barrier 1015 rcu_read_unlock synchronize_rcu 1016 rcu_dereference synchronize_rcu_expedited 1017 rcu_read_lock_held call_rcu 1018 rcu_dereference_check kfree_rcu 1019 rcu_dereference_protected 1020 1021bh:: 1022 1023 Critical sections Grace period Barrier 1024 1025 rcu_read_lock_bh call_rcu rcu_barrier 1026 rcu_read_unlock_bh synchronize_rcu 1027 [local_bh_disable] synchronize_rcu_expedited 1028 [and friends] 1029 rcu_dereference_bh 1030 rcu_dereference_bh_check 1031 rcu_dereference_bh_protected 1032 rcu_read_lock_bh_held 1033 1034sched:: 1035 1036 Critical sections Grace period Barrier 1037 1038 rcu_read_lock_sched call_rcu rcu_barrier 1039 rcu_read_unlock_sched synchronize_rcu 1040 [preempt_disable] synchronize_rcu_expedited 1041 [and friends] 1042 rcu_read_lock_sched_notrace 1043 rcu_read_unlock_sched_notrace 1044 rcu_dereference_sched 1045 rcu_dereference_sched_check 1046 rcu_dereference_sched_protected 1047 rcu_read_lock_sched_held 1048 1049 1050SRCU:: 1051 1052 Critical sections Grace period Barrier 1053 1054 srcu_read_lock call_srcu srcu_barrier 1055 srcu_read_unlock synchronize_srcu 1056 srcu_dereference synchronize_srcu_expedited 1057 srcu_dereference_check 1058 srcu_read_lock_held 1059 1060SRCU: Initialization/cleanup:: 1061 1062 DEFINE_SRCU 1063 DEFINE_STATIC_SRCU 1064 init_srcu_struct 1065 cleanup_srcu_struct 1066 1067All: lockdep-checked RCU utility APIs:: 1068 1069 RCU_LOCKDEP_WARN 1070 rcu_sleep_check 1071 RCU_NONIDLE 1072 1073All: Unchecked RCU-protected pointer access:: 1074 1075 rcu_dereference_raw 1076 1077All: Unchecked RCU-protected pointer access with dereferencing prohibited:: 1078 1079 rcu_access_pointer 1080 1081See the comment headers in the source code (or the docbook generated 1082from them) for more information. 1083 1084However, given that there are no fewer than four families of RCU APIs 1085in the Linux kernel, how do you choose which one to use? The following 1086list can be helpful: 1087 1088a. Will readers need to block? If so, you need SRCU. 1089 1090b. What about the -rt patchset? If readers would need to block 1091 in an non-rt kernel, you need SRCU. If readers would block 1092 in a -rt kernel, but not in a non-rt kernel, SRCU is not 1093 necessary. (The -rt patchset turns spinlocks into sleeplocks, 1094 hence this distinction.) 1095 1096c. Do you need to treat NMI handlers, hardirq handlers, 1097 and code segments with preemption disabled (whether 1098 via preempt_disable(), local_irq_save(), local_bh_disable(), 1099 or some other mechanism) as if they were explicit RCU readers? 1100 If so, RCU-sched is the only choice that will work for you. 1101 1102d. Do you need RCU grace periods to complete even in the face 1103 of softirq monopolization of one or more of the CPUs? For 1104 example, is your code subject to network-based denial-of-service 1105 attacks? If so, you should disable softirq across your readers, 1106 for example, by using rcu_read_lock_bh(). 1107 1108e. Is your workload too update-intensive for normal use of 1109 RCU, but inappropriate for other synchronization mechanisms? 1110 If so, consider SLAB_TYPESAFE_BY_RCU (which was originally 1111 named SLAB_DESTROY_BY_RCU). But please be careful! 1112 1113f. Do you need read-side critical sections that are respected 1114 even though they are in the middle of the idle loop, during 1115 user-mode execution, or on an offlined CPU? If so, SRCU is the 1116 only choice that will work for you. 1117 1118g. Otherwise, use RCU. 1119 1120Of course, this all assumes that you have determined that RCU is in fact 1121the right tool for your job. 1122 1123.. _9_whatisRCU: 1124 11259. ANSWERS TO QUICK QUIZZES 1126---------------------------- 1127 1128Quick Quiz #1: 1129 Why is this argument naive? How could a deadlock 1130 occur when using this algorithm in a real-world Linux 1131 kernel? [Referring to the lock-based "toy" RCU 1132 algorithm.] 1133 1134Answer: 1135 Consider the following sequence of events: 1136 1137 1. CPU 0 acquires some unrelated lock, call it 1138 "problematic_lock", disabling irq via 1139 spin_lock_irqsave(). 1140 1141 2. CPU 1 enters synchronize_rcu(), write-acquiring 1142 rcu_gp_mutex. 1143 1144 3. CPU 0 enters rcu_read_lock(), but must wait 1145 because CPU 1 holds rcu_gp_mutex. 1146 1147 4. CPU 1 is interrupted, and the irq handler 1148 attempts to acquire problematic_lock. 1149 1150 The system is now deadlocked. 1151 1152 One way to avoid this deadlock is to use an approach like 1153 that of CONFIG_PREEMPT_RT, where all normal spinlocks 1154 become blocking locks, and all irq handlers execute in 1155 the context of special tasks. In this case, in step 4 1156 above, the irq handler would block, allowing CPU 1 to 1157 release rcu_gp_mutex, avoiding the deadlock. 1158 1159 Even in the absence of deadlock, this RCU implementation 1160 allows latency to "bleed" from readers to other 1161 readers through synchronize_rcu(). To see this, 1162 consider task A in an RCU read-side critical section 1163 (thus read-holding rcu_gp_mutex), task B blocked 1164 attempting to write-acquire rcu_gp_mutex, and 1165 task C blocked in rcu_read_lock() attempting to 1166 read_acquire rcu_gp_mutex. Task A's RCU read-side 1167 latency is holding up task C, albeit indirectly via 1168 task B. 1169 1170 Realtime RCU implementations therefore use a counter-based 1171 approach where tasks in RCU read-side critical sections 1172 cannot be blocked by tasks executing synchronize_rcu(). 1173 1174:ref:`Back to Quick Quiz #1 <quiz_1>` 1175 1176Quick Quiz #2: 1177 Give an example where Classic RCU's read-side 1178 overhead is **negative**. 1179 1180Answer: 1181 Imagine a single-CPU system with a non-CONFIG_PREEMPTION 1182 kernel where a routing table is used by process-context 1183 code, but can be updated by irq-context code (for example, 1184 by an "ICMP REDIRECT" packet). The usual way of handling 1185 this would be to have the process-context code disable 1186 interrupts while searching the routing table. Use of 1187 RCU allows such interrupt-disabling to be dispensed with. 1188 Thus, without RCU, you pay the cost of disabling interrupts, 1189 and with RCU you don't. 1190 1191 One can argue that the overhead of RCU in this 1192 case is negative with respect to the single-CPU 1193 interrupt-disabling approach. Others might argue that 1194 the overhead of RCU is merely zero, and that replacing 1195 the positive overhead of the interrupt-disabling scheme 1196 with the zero-overhead RCU scheme does not constitute 1197 negative overhead. 1198 1199 In real life, of course, things are more complex. But 1200 even the theoretical possibility of negative overhead for 1201 a synchronization primitive is a bit unexpected. ;-) 1202 1203:ref:`Back to Quick Quiz #2 <quiz_2>` 1204 1205Quick Quiz #3: 1206 If it is illegal to block in an RCU read-side 1207 critical section, what the heck do you do in 1208 CONFIG_PREEMPT_RT, where normal spinlocks can block??? 1209 1210Answer: 1211 Just as CONFIG_PREEMPT_RT permits preemption of spinlock 1212 critical sections, it permits preemption of RCU 1213 read-side critical sections. It also permits 1214 spinlocks blocking while in RCU read-side critical 1215 sections. 1216 1217 Why the apparent inconsistency? Because it is 1218 possible to use priority boosting to keep the RCU 1219 grace periods short if need be (for example, if running 1220 short of memory). In contrast, if blocking waiting 1221 for (say) network reception, there is no way to know 1222 what should be boosted. Especially given that the 1223 process we need to boost might well be a human being 1224 who just went out for a pizza or something. And although 1225 a computer-operated cattle prod might arouse serious 1226 interest, it might also provoke serious objections. 1227 Besides, how does the computer know what pizza parlor 1228 the human being went to??? 1229 1230:ref:`Back to Quick Quiz #3 <quiz_3>` 1231 1232ACKNOWLEDGEMENTS 1233 1234My thanks to the people who helped make this human-readable, including 1235Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern. 1236 1237 1238For more information, see http://www.rdrop.com/users/paulmck/RCU. 1239