1.. SPDX-License-Identifier: GPL-2.0
2
3===============
4Shared Subtrees
5===============
6
7.. Contents:
8	1) Overview
9	2) Features
10	3) Setting mount states
	4) Use cases
12	5) Detailed semantics
13	6) Quiz
14	7) FAQ
15	8) Implementation
16
17
181) Overview
19-----------
20
21Consider the following situation:
22
23A process wants to clone its own namespace, but still wants to access the CD
24that got mounted recently.  Shared subtree semantics provide the necessary
25mechanism to accomplish the above.
26
They also provide the building blocks for features such as per-user namespaces
and versioned filesystems.
29
302) Features
31-----------
32
Shared subtrees provide four different flavors of mounts (struct vfsmount, to
be precise):
35
36
a) A **shared mount** can be replicated to as many mountpoints as needed, and
   all the replicas remain exactly the same.
39
40   Here is an example:
41
42   Let's say /mnt has a mount that is shared::
43
44     # mount --make-shared /mnt
45
46   .. note::
47      mount(8) command now supports the --make-shared flag,
48      so the sample 'smount' program is no longer needed and has been
49      removed.
50
51   ::
52
53     # mount --bind /mnt /tmp
54
55   The above command replicates the mount at /mnt to the mountpoint /tmp
56   and the contents of both the mounts remain identical.
57
58   ::
59
     # ls /mnt
     a b c

     # ls /tmp
     a b c
65
66   Now let's say we mount a device at /tmp/a::
67
68     # mount /dev/sd0  /tmp/a
69
70     # ls /tmp/a
71     t1 t2 t3
72
73     # ls /mnt/a
74     t1 t2 t3
75
76   Note that the mount has propagated to the mount at /mnt as well.
77
78   And the same is true even when /dev/sd0 is mounted on /mnt/a. The
79   contents will be visible under /tmp/a too.
80
81
82b) A **slave mount** is like a shared mount except that mount and umount events
83   only propagate towards it.
84
   All slave mounts have a master mount which is shared.
86
87   Here is an example:
88
89   Let's say /mnt has a mount which is shared::
90
91     # mount --make-shared /mnt
92
93   Let's bind mount /mnt to /tmp::
94
95     # mount --bind /mnt /tmp
96
   The new mount at /tmp becomes a shared mount and it is a replica of
   the mount at /mnt.

   Now let's make the mount at /tmp a slave of /mnt::
101
102     # mount --make-slave /tmp
103
   Let's mount /dev/sd0 on /mnt/a::
105
106     # mount /dev/sd0 /mnt/a
107
108     # ls /mnt/a
109     t1 t2 t3
110
111     # ls /tmp/a
112     t1 t2 t3
113
   Note that the mount event has propagated to the mount at /tmp.

   However, let's see what happens if we mount something on the mount at
   /tmp::
118
119     # mount /dev/sd1 /tmp/b
120
121     # ls /tmp/b
122     s1 s2 s3
123
124     # ls /mnt/b
125
   Note how the mount event has not propagated to the mount at
   /mnt.
128
129
130c) A **private mount** does not forward or receive propagation.
131
   This is the mount we are familiar with. It's the default type.
133
134
135d) An **unbindable mount** is, as the name suggests, an unbindable private
136   mount.
137
   Let's say we have a mount at /mnt and we make it unbindable::
139
140     # mount --make-unbindable /mnt
141
142   Let's try to bind mount this mount somewhere else::
143
     # mount --bind /mnt /tmp
     mount: wrong fs type, bad option, bad superblock on /mnt,
            or too many mounted file systems

   Binding an unbindable mount is an invalid operation.
148
149
1503) Setting mount states
151-----------------------
152
153The mount command (util-linux package) can be used to set mount
154states::
155
156    mount --make-shared mountpoint
157    mount --make-slave mountpoint
158    mount --make-private mountpoint
159    mount --make-unbindable mountpoint
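
The resulting state can be read back from the optional fields of
/proc/self/mountinfo, which carry the tags shared:N, master:N and
unbindable (see proc(5)). The following is a minimal Python sketch under
that assumption; it is not part of util-linux, and the function name
'propagation_state' and the sample lines are made up for illustration.

.. code-block:: python

   def propagation_state(mountinfo_line):
       """Classify one /proc/self/mountinfo line by its propagation tags."""
       fields = mountinfo_line.split()
       # Optional fields sit between the sixth field and the "-" separator.
       optional = fields[6:fields.index("-")]
       tags = {field.split(":")[0] for field in optional}
       if "unbindable" in tags:
           return "unbindable"
       shared = "shared" in tags
       slave = "master" in tags
       if shared and slave:
           return "shared and slave"
       if shared:
           return "shared"
       if slave:
           return "slave"
       return "private"


   # Hypothetical sample lines, not taken from a real system.
   samples = [
       "36 25 8:1 / /mnt rw,relatime shared:1 - ext4 /dev/sda1 rw",
       "37 25 8:1 / /tmp rw,relatime master:1 - ext4 /dev/sda1 rw",
       "38 25 8:2 / /srv rw,relatime - ext4 /dev/sda2 rw",
   ]
   for line in samples:
       # The fifth field is the mount point.
       print(line.split()[4], propagation_state(line))

On a live system, 'findmnt -o TARGET,PROPAGATION' reports the same
information without any parsing.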
160
161
1624) Use cases
163------------
164
165A) A process wants to clone its own namespace, but still wants to
166   access the CD that got mounted recently.
167
168   Solution:
169
170   The system administrator can make the mount at /cdrom shared::
171
172     mount --bind /cdrom /cdrom
173     mount --make-shared /cdrom
174
175   Now any process that clones off a new namespace will have a
176   mount at /cdrom which is a replica of the same mount in the
177   parent namespace.
178
179   So when a CD is inserted and mounted at /cdrom that mount gets
180   propagated to the other mount at /cdrom in all the other clone
181   namespaces.
182
B) A process wants its mounts to be invisible to any other process, but
   still wants to see the other system mounts.
185
186   Solution:
187
188   To begin with, the administrator can mark the entire mount tree
189   as shareable::
190
191     mount --make-rshared /
192
   A new process can clone off a new namespace and mark some part
   of its namespace as slave::
195
196     mount --make-rslave /myprivatetree
197
   Henceforth, any mounts within /myprivatetree done by the
   process will not show up in any other namespace. However, mounts
   done in the parent namespace under /myprivatetree still show
   up in the process's namespace.
202
203
204Apart from the above semantics this feature provides the
205building blocks to solve the following problems:
206
207C)  Per-user namespace
208
    The above semantics allow mounts to be shared across
    namespaces.  But namespaces are associated with processes. If
    namespaces are made first-class objects with a user API to
    associate/disassociate a namespace with a user ID, then each user
    could have his/her own namespace and tailor it to his/her
    requirements. This needs to be supported in PAM.
215
216D)  Versioned files
217
218    If the entire mount tree is visible at multiple locations, then
219    an underlying versioning file system can return different
220    versions of the file depending on the path used to access that
221    file.
222
223    An example is::
224
225       mount --make-shared /
226       mount --rbind / /view/v1
227       mount --rbind / /view/v2
228       mount --rbind / /view/v3
229       mount --rbind / /view/v4
230
    and if /usr has a versioning filesystem mounted, then that
    mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
    /view/v4/usr too.

    A user can request the v3 version of the file /usr/fs/namespace.c
    by accessing /view/v3/usr/fs/namespace.c. The underlying
    versioning filesystem can then decipher that the v3 version of the
    filesystem is being requested and return the corresponding
    inode.
240
2415) Detailed semantics
242---------------------
243The section below explains the detailed semantics of
244bind, rbind, move, mount, umount and clone-namespace operations.
245
246.. Note::
   The word 'vfsmount' and the noun 'mount' are used
   to mean the same thing throughout this document.
249
250a) Mount states
251
   A **propagation event** is defined as an event generated on a vfsmount
   that leads to mount or unmount actions in other vfsmounts.
254
255   A **peer group** is defined as a group of vfsmounts that propagate
256   events to each other.
257
258   A given mount can be in one of the following states:
259
260   (1) Shared mounts
261
262       A **shared mount** is defined as a vfsmount that belongs to a
263       peer group.
264
265       For example::
266
267         mount --make-shared /mnt
268         mount --bind /mnt /tmp
269
       The mount at /mnt and that at /tmp are both shared and belong
       to the same peer group. Anything mounted or unmounted under
       /mnt or /tmp is reflected in all the other mounts of the peer
       group.
274
275
276   (2) Slave mounts
277
278       A **slave mount** is defined as a vfsmount that receives
279       propagation events and does not forward propagation events.
280
       A slave mount, as the name implies, has a master mount from which
       mount/unmount events are received. Events do not propagate from
       the slave mount to the master.  Only a shared mount can be made
       a slave by executing the following command::
285
286         mount --make-slave mount
287
       A shared mount that is made a slave is no longer shared unless
       it is modified to become shared again.
290
291   (3) Shared and Slave
292
293       A vfsmount can be both **shared** as well as **slave**.  This state
294       indicates that the mount is a slave of some vfsmount, and
295       has its own peer group too.  This vfsmount receives propagation
296       events from its master vfsmount, and also forwards propagation
297       events to its 'peer group' and to its slave vfsmounts.
298
299       Strictly speaking, the vfsmount is shared having its own
300       peer group, and this peer-group is a slave of some other
301       peer group.
302
       Only a slave vfsmount can be made 'shared and slave', either by
       executing the following command::
305
306         mount --make-shared mount
307
308       or by moving the slave vfsmount under a shared vfsmount.
309
310   (4) Private mount
311
       A **private mount** is defined as a vfsmount that does not
       receive or forward any propagation events.

   (5) Unbindable mount

       An **unbindable mount** is defined as a vfsmount that does not
       receive or forward any propagation events and cannot
       be bind mounted.
320
321
322       State diagram:
323
324       The state diagram below explains the state transition of a mount,
325       in response to various commands::
326
            --------------------------------------------------------------------------
            |             |make-shared |  make-slave  | make-private |make-unbindable|
            |-------------|------------|--------------|--------------|---------------|
            |shared       |shared      |*slave/private|   private    | unbindable    |
            |             |            |              |              |               |
            |-------------|------------|--------------|--------------|---------------|
            |slave        |shared      | **slave      |   private    | unbindable    |
            |             |and slave   |              |              |               |
            |-------------|------------|--------------|--------------|---------------|
            |shared       |shared      | slave        |   private    | unbindable    |
            |and slave    |and slave   |              |              |               |
            |-------------|------------|--------------|--------------|---------------|
            |private      |shared      |  **private   |   private    | unbindable    |
            |-------------|------------|--------------|--------------|---------------|
            |unbindable   |shared      |**unbindable  |   private    | unbindable    |
            --------------------------------------------------------------------------

            * If the shared mount is the only mount in its peer group, making it a
              slave makes it private automatically, because there is no master to
              which it can be slaved.

            ** Slaving a non-shared mount has no effect on the mount.
349
       Apart from the commands listed above, the 'move' operation also changes
       the state of a mount depending on the type of the destination mount. This
       is explained in section 5d.
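
   The state diagram can also be read as a lookup function. The sketch below
   is an illustrative Python model of the table only, not kernel code; the
   names 'next_state' and 'has_peers' are assumptions introduced here.

   .. code-block:: python

      STATES = ("shared", "slave", "shared and slave", "private", "unbindable")
      COMMANDS = ("make-shared", "make-slave", "make-private", "make-unbindable")


      def next_state(state, command, has_peers=True):
          """Return the state of a mount after one 'mount --make-*' command.

          has_peers is only consulted when a shared mount is made a slave: it
          tells whether other mounts remain in its peer group to act as master.
          """
          if state not in STATES or command not in COMMANDS:
              raise ValueError(f"unknown state or command: {state!r}, {command!r}")
          if command == "make-private":
              return "private"
          if command == "make-unbindable":
              return "unbindable"
          if command == "make-shared":
              # A mount that already has a master keeps it.
              return "shared and slave" if "slave" in state else "shared"
          # Remaining case: command == "make-slave".
          if state == "shared":
              return "slave" if has_peers else "private"
          if state == "shared and slave":
              return "slave"
          return state  # slaving a non-shared mount has no effect


      print(next_state("shared", "make-slave", has_peers=False))

   The last line exercises the footnote marked '*': a shared mount that is
   alone in its peer group has no master to be slaved to.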
353
354b) Bind semantics
355
356   Consider the following command::
357
358     mount --bind A/a  B/b
359
360   where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
361   is the destination mount and 'b' is the dentry in the destination mount.
362
363   The outcome depends on the type of mount of 'A' and 'B'. The table
364   below contains quick reference::
365
            --------------------------------------------------------------------------
            |                          BIND MOUNT OPERATION                          |
            |************************************************************************|
            |source(A)->|   shared    |    private     |     slave      | unbindable |
            | dest(B)   |             |                |                |            |
            |   |       |             |                |                |            |
            |   v       |             |                |                |            |
            |************************************************************************|
            |  shared   |   shared    |     shared     | shared & slave |  invalid   |
            |           |             |                |                |            |
            |non-shared |   shared    |    private     |     slave      |  invalid   |
            **************************************************************************
378
379   Details:
380
   1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C',
      which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
      mounted on mount 'B' at dentry 'b'. Also, new mounts 'C1', 'C2', 'C3' ...
      are created and mounted at the dentry 'b' on all mounts to which 'B'
      propagates. A new propagation tree containing 'C1',..,'Cn' is
      created. This propagation tree is identical to the propagation tree of
      'B'.  Finally, the peer group of 'C' is merged with the peer group
      of 'A'.

   2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C',
      which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
      mounted on mount 'B' at dentry 'b'. Also, new mounts 'C1', 'C2', 'C3' ...
      are created and mounted at the dentry 'b' on all mounts to which 'B'
      propagates. A new propagation tree is set up containing all new mounts
      'C', 'C1', .., 'Cn' with exactly the same configuration as the
      propagation tree for 'B'.

   3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
      mount 'C', which is a clone of 'A', is created. Its root dentry is 'a'.
      'C' is mounted on mount 'B' at dentry 'b'. Also, new mounts 'C1', 'C2',
      'C3' ... are created and mounted at the dentry 'b' on all mounts to
      which 'B' propagates. A new propagation tree containing the new mounts
      'C', 'C1', .., 'Cn' is created. This propagation tree is identical to the
      propagation tree for 'B'. Finally, the mount 'C' and its peer group
      are made the slave of mount 'Z'.  In other words, mount 'C' is in the
      state 'slave and shared'.

   4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
      invalid operation.

   5. 'A' is a private mount and 'B' is a non-shared (private, slave or
      unbindable) mount. A new mount 'C', which is a clone of 'A', is created.
      Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.

   6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C',
      which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
      mounted on mount 'B' at dentry 'b'.  'C' is made a member of the
      peer group of 'A'.

   7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
      new mount 'C', which is a clone of 'A', is created. Its root dentry is
      'a'.  'C' is mounted on mount 'B' at dentry 'b'. Also, 'C' is set as a
      slave mount of 'Z'. In other words, 'A' and 'C' are both slave mounts of
      'Z'.  All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
      mount/unmount events on 'A' do not propagate anywhere else. Similarly,
      mount/unmount events on 'C' do not propagate anywhere else.

   8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
      invalid operation. An unbindable mount cannot be bind mounted.
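
   The eight cases reduce to a small decision function. The following Python
   sketch models only the resulting state of the new mount 'C', not the
   propagation trees that get created; 'bind_result' and its parameters are
   hypothetical names chosen for this illustration.

   .. code-block:: python

      def bind_result(source, dest_shared):
          """Return the state of the mount 'C' created by 'mount --bind A/a B/b'.

          source is the state of 'A' ("shared", "private", "slave" or
          "unbindable"); dest_shared tells whether 'B' is a shared mount.
          """
          if source == "unbindable":
              # Cases 4 and 8: invalid whatever the destination is.
              raise ValueError("an unbindable mount cannot be bind mounted")
          if source == "shared":
              # Cases 1 and 6: 'C' joins the peer group of 'A'.
              return "shared"
          if source == "slave":
              # Cases 3 and 7: 'C' stays a slave of the master of 'A'.
              return "shared and slave" if dest_shared else "slave"
          if source == "private":
              # Cases 2 and 5: only a shared destination makes 'C' shared.
              return "shared" if dest_shared else "private"
          raise ValueError(f"unknown source state: {source!r}")


      print(bind_result("slave", dest_shared=True))

   Since a freshly mounted device behaves like a private source (see section
   5e), the "private" branch also covers a plain 'mount device B/b'.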
430
431c) Rbind semantics
432
   rbind is the same as bind, except that where bind replicates only the
   specified mount, rbind replicates all the mounts in the tree belonging to
   the specified mount. An rbind mount is a bind mount applied to all the
   mounts in the tree.

   If the source tree that is rbound has some unbindable mounts,
   then the subtree under the unbindable mount is pruned in the new
   location.

   For example, let's say we have the following mount tree::
444
445                A
446              /   \
447              B   C
448             / \ / \
449             D E F G
450
   Let's say all the mounts except the mount C in the tree are
   of a type other than unbindable.

   If this tree is rbound to, say, Z, we will have the following tree at
   the new location::
457
458                Z
459                |
460                A'
461               /
462              B'                Note how the tree under C is pruned
463             / \                in the new location.
464            D' E'
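
   The pruning rule can be sketched as a recursive copy that skips unbindable
   mounts together with everything below them. This is an illustrative Python
   model, not the kernel algorithm; the function 'rbind' and the dictionary
   representation of the mount tree are assumptions made for the example, and
   the root of the copied tree is assumed to be bindable.

   .. code-block:: python

      def rbind(tree, root, unbindable):
          """Return a copy of the mount tree under root, pruned at unbindable mounts.

          tree maps each mount to the list of its child mounts; the copies are
          named after their originals with a trailing apostrophe.
          """
          copy = {}

          def clone(mount):
              # Drop unbindable children; their subtrees are never visited.
              kept = [child for child in tree.get(mount, []) if child not in unbindable]
              copy[mount + "'"] = [child + "'" for child in kept]
              for child in kept:
                  clone(child)

          clone(root)
          return copy


      # The example tree from above, with 'C' unbindable.
      tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
      print(rbind(tree, "A", unbindable={"C"}))

   The copy contains A', B', D' and E' only; C and its children F and G
   are absent, as in the diagram.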
465
466
467
468d) Move semantics
469
470   Consider the following command::
471
472     mount --move A  B/b
473
474   where 'A' is the source mount, 'B' is the destination mount and 'b' is
475   the dentry in the destination mount.
476
477   The outcome depends on the type of the mount of 'A' and 'B'. The table
478   below is a quick reference::
479
            --------------------------------------------------------------------------
            |                          MOVE MOUNT OPERATION                          |
            |************************************************************************|
            |source(A)->|   shared    |    private     |     slave      | unbindable |
            | dest(B)   |             |                |                |            |
            |   |       |             |                |                |            |
            |   v       |             |                |                |            |
            |************************************************************************|
            |  shared   |   shared    |     shared     |shared and slave|  invalid   |
            |           |             |                |                |            |
            |non-shared |   shared    |    private     |     slave      | unbindable |
            **************************************************************************
492
493   .. Note:: moving a mount residing under a shared mount is invalid.
494
495   Details follow:
496
497   1. 'A' is a shared mount and 'B' is a shared mount.  The mount 'A' is
498      mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1', 'A2'...'An'
499      are created and mounted at dentry 'b' on all mounts that receive
500      propagation from mount 'B'. A new propagation tree is created in the
501      exact same configuration as that of 'B'. This new propagation tree
502      contains all the new mounts 'A1', 'A2'...  'An'.  And this new
503      propagation tree is appended to the already existing propagation tree
504      of 'A'.
505
   2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
      mounted on mount 'B' at dentry 'b'. Also, new mounts 'A1', 'A2'... 'An'
      are created and mounted at dentry 'b' on all mounts that receive
      propagation from mount 'B'. The mount 'A' becomes a shared mount and a
      propagation tree is created which is identical to that of
      'B'. This new propagation tree contains all the new mounts 'A1',
      'A2'...  'An'.
513
514   3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount.  The
515      mount 'A' is mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1',
516      'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
517      receive propagation from mount 'B'. A new propagation tree is created
518      in the exact same configuration as that of 'B'. This new propagation
519      tree contains all the new mounts 'A1', 'A2'...  'An'.  And this new
520      propagation tree is appended to the already existing propagation tree of
521      'A'.  Mount 'A' continues to be the slave mount of 'Z' but it also
522      becomes 'shared'.
523
   4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
      is invalid, because mounting anything on the shared mount 'B' can
      create new mounts that get mounted on the mounts that receive
      propagation from 'B'.  Since the mount 'A' is unbindable, cloning
      it to mount at other mountpoints is not possible.

   5. 'A' is a private mount and 'B' is a non-shared (private, slave or
      unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
532
533   6. 'A' is a shared mount and 'B' is a non-shared mount.  The mount 'A'
534      is mounted on mount 'B' at dentry 'b'.  Mount 'A' continues to be a
535      shared mount.
536
537   7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
538      The mount 'A' is mounted on mount 'B' at dentry 'b'.  Mount 'A'
539      continues to be a slave mount of mount 'Z'.
540
   8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
      'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
      unbindable mount.
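
   The move table differs from the bind table in one cell: an unbindable
   mount may be moved onto a non-shared destination, because no clone is
   needed. The Python sketch below encodes the table as a dictionary; it is
   an illustration of the table only, and 'MOVE_TABLE' and 'move_result' are
   hypothetical names.

   .. code-block:: python

      # (state of source 'A', destination 'B' is shared) -> state of 'A' afterwards
      MOVE_TABLE = {
          ("shared", True): "shared",
          ("private", True): "shared",
          ("slave", True): "shared and slave",
          ("shared", False): "shared",
          ("private", False): "private",
          ("slave", False): "slave",
          ("unbindable", False): "unbindable",
      }


      def move_result(source, dest_shared):
          """Return the state of 'A' after 'mount --move A B/b'."""
          try:
              return MOVE_TABLE[(source, dest_shared)]
          except KeyError:
              # Covers the one invalid cell (unbindable onto shared) and typos.
              raise ValueError(f"invalid move: {source!r} onto "
                               f"{'shared' if dest_shared else 'non-shared'}") from None


      print(move_result("unbindable", dest_shared=False))

   The sketch does not model the separate restriction that a mount residing
   under a shared mount cannot be moved at all.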
544
545e) Mount semantics
546
547   Consider the following command::
548
549     mount device  B/b
550
551   'B' is the destination mount and 'b' is the dentry in the destination
552   mount.
553
   The above operation is the same as the bind operation, with the exception
   that the source mount is always a private mount.
556
557
558f) Unmount semantics
559
560   Consider the following command::
561
562     umount A
563
564   where 'A' is a mount mounted on mount 'B' at dentry 'b'.
565
   If mount 'B' is shared, then all most-recently-mounted mounts at dentry
   'b' on mounts that receive propagation from mount 'B' and that do not have
   sub-mounts within them are unmounted.
569
570   Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
571   each other.
572
   Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
   'B1', 'B2' and 'B3' respectively.

   Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
   mount 'B1', 'B2' and 'B3' respectively.

   If 'C1' is unmounted, all the mounts that are most-recently-mounted on
   'B1' and on the mounts that 'B1' propagates to are unmounted.
581
582   'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
583   on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
584
585   So all 'C1', 'C2' and 'C3' should be unmounted.
586
   If either 'C2' or 'C3' has some child mounts, then that mount is not
   unmounted, but all other mounts are unmounted. However, if 'C1' is told
   to be unmounted and 'C1' has some sub-mounts, the umount operation
   fails entirely.
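
   The example can be replayed with a small Python model. It is a sketch
   under simplifying assumptions: every mount in 'stacks' receives
   propagation from the others, the target is the most recently mounted
   entry on its own stack, and the names 'umount', 'stacks' and
   'has_submounts' are invented for the illustration.

   .. code-block:: python

      def umount(target, stacks, has_submounts):
          """Unmount target and propagate the event to the other stacks.

          stacks maps each mount ('B1', ...) to the list of mounts stacked at
          dentry 'b', oldest first. has_submounts is the set of mounts that
          have child mounts. Returns the mounts that were unmounted.
          """
          if target in has_submounts:
              raise OSError("target has sub-mounts, umount fails entirely")
          removed = []
          for stack in stacks.values():
              # Only the most recently mounted entry is a candidate, and it is
              # skipped when it has sub-mounts of its own.
              if stack and stack[-1] not in has_submounts:
                  removed.append(stack.pop())
          return removed


      stacks = {"B1": ["A1", "C1"], "B2": ["A2", "C2"], "B3": ["A3", "C3"]}
      print(umount("C1", stacks, has_submounts={"C2"}))

   With 'C2' holding a child mount, only 'C1' and 'C3' are removed and 'C2'
   stays on top of the stack of 'B2'.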
591
592g) Clone Namespace
593
   A cloned namespace contains all the same mounts as the parent
   namespace.
596
597   Let's say 'A' and 'B' are the corresponding mounts in the parent and the
598   child namespace.
599
600   If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
601   each other.
602
603   If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
604   'Z'.
605
606   If 'A' is a private mount, then 'B' is a private mount too.
607
   If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
609
610
6116) Quiz
612-------
613
614A. What is the result of the following command sequence?
615
616   ::
617
618       mount --bind /mnt /mnt
619       mount --make-shared /mnt
620       mount --bind /mnt /tmp
621       mount --move /tmp /mnt/1
622
   What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
   Should they all be identical, or should only /mnt and /mnt/1 be
   identical?
626
627
628B. What is the result of the following command sequence?
629
630   ::
631
632       mount --make-rshared /
633       mkdir -p /v/1
634       mount --rbind / /v/1
635
   What should the content of /v/1/v/1 be?
637
638
639C. What is the result of the following command sequence?
640
641   ::
642
643       mount --bind /mnt /mnt
644       mount --make-shared /mnt
645       mkdir -p /mnt/1/2/3 /mnt/1/test
646       mount --bind /mnt/1 /tmp
647       mount --make-slave /mnt
648       mount --make-shared /mnt
649       mount --bind /mnt/1/2 /tmp1
650       mount --make-slave /mnt
651
   At this point we have the first mount at /tmp and
   its root dentry is 1. Let's call this mount 'A'.
   Then we have a second mount at /tmp1 with root
   dentry 2. Let's call this mount 'B'.
   Next we have a third mount at /mnt with root dentry
   mnt. Let's call this mount 'C'.

   'B' is the slave of 'A' and 'C' is a slave of 'B'::

     A -> B -> C

   At this point, if we execute the following command::
663
664     mount --bind /bin /tmp/test
665
   The mount is attempted on 'A'.

   Will the mount propagate to 'B' and 'C'?

   What would the contents of /mnt/1/test be?
672
6737) FAQ
674------
675
6761. Why is bind mount needed? How is it different from symbolic links?
677
   Symbolic links can get stale if the destination mount gets
   unmounted or moved. Bind mounts continue to exist even if the
   other mount is unmounted or moved.
681
6822. Why can't the shared subtree be implemented using exportfs?
683
   exportfs is a heavyweight way of accomplishing part of what
   shared subtree can do. I cannot imagine a way to implement the
   semantics of slave mount using exportfs.
687
6883. Why is unbindable mount needed?
689
690   Let's say we want to replicate the mount tree at multiple
691   locations within the same subtree.
692
   If one rbind mounts a tree within the same subtree 'n' times,
   the number of mounts created grows at least exponentially with 'n'.
   Having unbindable mounts can help prune the unneeded bind
   mounts. Here is an example.
697
698   step 1:
699      let's say the root tree has just two directories with
700      one vfsmount::
701
702                                    root
703                                   /    \
704                                  tmp    usr
705
      And we want to replicate the tree at multiple
      mountpoints under /root/tmp.
708
709   step 2:
710      ::
711
712
713                        mount --make-shared /root
714
715                        mkdir -p /tmp/m1
716
717                        mount --rbind /root /tmp/m1
718
719      the new tree now looks like this::
720
721                                    root
722                                   /    \
723                                 tmp    usr
724                                /
725                               m1
726                              /  \
727                             tmp  usr
728                             /
729                            m1
730
731      it has two vfsmounts
732
733   step 3:
734      ::
735
736                            mkdir -p /tmp/m2
737                            mount --rbind /root /tmp/m2
738
739      the new tree now looks like this::
740
741                                      root
742                                     /    \
743                                   tmp     usr
744                                  /    \
745                                m1       m2
746                               / \       /  \
747                             tmp  usr   tmp  usr
748                             / \          /
749                            m1  m2      m1
750                                / \     /  \
751                              tmp usr  tmp   usr
752                              /        / \
753                             m1       m1  m2
754                            /  \
755                          tmp   usr
756                          /  \
757                         m1   m2
758
759                    it has 6 vfsmounts
760
761   step 4:
762      ::
763
764                          mkdir -p /tmp/m3
765                          mount --rbind /root /tmp/m3
766
      I won't draw the tree, but it has 24 vfsmounts.
768
769
   At step i the number of vfsmounts is V[i] = i*V[i-1], with V[1] = 1.
   This is a factorial, which grows even faster than an exponential
   function. And this tree has way more mounts than what we really
   needed in the first place.
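
   The recurrence is easy to evaluate. The short Python sketch below simply
   iterates V[i] = i*V[i-1] starting from the single vfsmount of step 1; the
   function name 'vfsmount_counts' is made up for the illustration.

   .. code-block:: python

      def vfsmount_counts(steps):
          """Return [V[1], ..., V[steps]] for V[1] = 1 and V[i] = i * V[i-1]."""
          counts = [1]
          for i in range(2, steps + 1):
              counts.append(i * counts[-1])
          return counts


      print(vfsmount_counts(6))

   The first four values, 1, 2, 6 and 24, match the counts given for steps 1
   to 4 above; steps 5 and 6 would already need 120 and 720 vfsmounts.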
773
   One could use a series of umount at each step to prune
   out the unneeded mounts. But there is a better solution.
   Unbindable (unclonable) mounts come in handy here.
777
778   step 1:
779      let's say the root tree has just two directories with
780      one vfsmount::
781
782                                    root
783                                   /    \
784                                  tmp    usr
785
      How do we set up the same tree at multiple locations under
      /root/tmp?
788
789   step 2:
790      ::
791
792
793                        mount --bind /root/tmp /root/tmp
794
795                        mount --make-rshared /root
796                        mount --make-unbindable /root/tmp
797
798                        mkdir -p /tmp/m1
799
800                        mount --rbind /root /tmp/m1
801
802      the new tree now looks like this::
803
804                                    root
805                                   /    \
806                                 tmp    usr
807                                /
808                               m1
809                              /  \
810                             tmp  usr
811
812   step 3:
813      ::
814
815                            mkdir -p /tmp/m2
816                            mount --rbind /root /tmp/m2
817
818      the new tree now looks like this::
819
820                                    root
821                                   /    \
822                                 tmp    usr
823                                /   \
824                               m1     m2
825                              /  \     / \
826                             tmp  usr tmp usr
827
828   step 4:
829      ::
830
831                            mkdir -p /tmp/m3
832                            mount --rbind /root /tmp/m3
833
834      the new tree now looks like this::
835
836                                          root
837                                      /           \
838                                     tmp           usr
839                                 /    \    \
840                               m1     m2     m3
841                              /  \     / \    /  \
842                             tmp  usr tmp usr tmp usr
843
8448) Implementation
845-----------------
846
A) Data structure
848
849   Several new fields are introduced to struct vfsmount:
850
851   ->mnt_share
           Links together all the mounts to/from which this vfsmount
           sends/receives propagation events.

   ->mnt_slave_list
           Links all the mounts to which this vfsmount
           propagates.
858
859   ->mnt_slave
860           Links together all the slaves that its master vfsmount
861           propagates to.
862
863   ->mnt_master
864           Points to the master vfsmount from which this vfsmount
865           receives propagation.
866
867   ->mnt_flags
868           Takes two more flags to indicate the propagation status of
869           the vfsmount.  MNT_SHARE indicates that the vfsmount is a shared
870           vfsmount.  MNT_UNCLONABLE indicates that the vfsmount cannot be
871           replicated.
872
873   All the shared vfsmounts in a peer group form a cyclic list through
874   ->mnt_share.
875
   All vfsmounts with the same ->mnt_master form a cyclic list anchored
   in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
878
879   ->mnt_master can point to arbitrary (and possibly different) members
880   of master peer group.  To find all immediate slaves of a peer group
881   you need to go through _all_ ->mnt_slave_list of its members.
882   Conceptually it's just a single set - distribution among the
883   individual lists does not affect propagation or the way propagation
884   tree is modified by operations.
885
886   All vfsmounts in a peer group have the same ->mnt_master.  If it is
887   non-NULL, they form a contiguous (ordered) segment of slave list.
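
   The list relationships described above can be modelled in user space.
   What follows is a minimal sketch, not the kernel's code: only the four
   field names come from the text; ``struct list_node``, ``struct mnt_model``
   and every ``mnt_model_*()`` helper are hypothetical names invented for
   illustration.  It shows why finding all immediate slaves of a peer group
   means visiting the ->mnt_slave_list of every member.  As a simplification,
   the sketch does not place a new peer into its master's slave list.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list (hypothetical user-space helper). */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical model of the propagation fields; not the kernel structure. */
struct mnt_model {
	struct list_node mnt_share;      /* ring of peers in the same group */
	struct list_node mnt_slave_list; /* anchor for this mount's slaves */
	struct list_node mnt_slave;      /* entry in the master's mnt_slave_list */
	struct mnt_model *mnt_master;    /* NULL if the mount has no master */
};

/* Recover the mount from a pointer to its embedded mnt_share node. */
#define share_to_mnt(p) \
	((struct mnt_model *)((char *)(p) - offsetof(struct mnt_model, mnt_share)))

static void node_init(struct list_node *n)
{
	n->next = n->prev = n;
}

/* Insert 'n' just before 'head', i.e. at the tail of the ring. */
static void node_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void mnt_model_init(struct mnt_model *m)
{
	node_init(&m->mnt_share);
	node_init(&m->mnt_slave_list);
	node_init(&m->mnt_slave);
	m->mnt_master = NULL;
}

/* Put 'm' into the peer group of 'peer'; peers have the same master. */
static void mnt_model_make_peer(struct mnt_model *m, struct mnt_model *peer)
{
	node_add_tail(&m->mnt_share, &peer->mnt_share);
	m->mnt_master = peer->mnt_master;
}

/* Make 's' a slave of 'master' by linking it into master's slave list. */
static void mnt_model_make_slave(struct mnt_model *s, struct mnt_model *master)
{
	assert(s != master);
	node_add_tail(&s->mnt_slave, &master->mnt_slave_list);
	s->mnt_master = master;
}

/*
 * Count the immediate slaves of the peer group containing 'm' by walking
 * the peer ring and, for each peer, its own mnt_slave_list.
 */
static int mnt_model_count_group_slaves(struct mnt_model *m)
{
	struct mnt_model *peer = m;
	int count = 0;

	do {
		struct list_node *p;

		for (p = peer->mnt_slave_list.next;
		     p != &peer->mnt_slave_list; p = p->next)
			count++;
		peer = share_to_mnt(peer->mnt_share.next);
	} while (peer != m);

	return count;
}
```

   For instance, with peers A, B and C, where F and G are slaves of A and J
   is a slave of C, the count is 3 whichever of A, B or C the walk starts
   from, even though B's own slave list is empty.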

   An example propagation tree is shown in the figure below.

   .. note::
      Though it looks like a forest, if we consider all the shared
      mounts as a conceptual entity called 'pnode', it becomes a tree.

   ::


                        A <--> B <--> C <---> D
                       /|\            /|      |\
                      / F G          J K      H I
                     /
                    E<-->K
                        /|\
                       M L N

   In the above figure, A, B, C and D are all shared and propagate to
   each other.  'A' has three slave mounts, 'E', 'F' and 'G'; 'C' has two
   slave mounts, 'J' and 'K'; and 'D' has two slave mounts, 'H' and 'I'.
   'E' is also shared with 'K' and they propagate to each other, and
   'K' has three slaves, 'M', 'L' and 'N'.

   A's ->mnt_share links with the ->mnt_share of 'B', 'C' and 'D'.

   A's ->mnt_slave_list links with the ->mnt_slave of 'E', 'K', 'F' and 'G'.

   E's ->mnt_share links with the ->mnt_share of 'K'.

   'E', 'K', 'F' and 'G' have their ->mnt_master point to the struct
   vfsmount of 'A'.

   'M', 'L' and 'N' have their ->mnt_master point to the struct vfsmount
   of 'K'.

   K's ->mnt_slave_list links with the ->mnt_slave of 'M', 'L' and 'N'.

   C's ->mnt_slave_list links with the ->mnt_slave of 'J' and 'K'.

   'J' and 'K' have their ->mnt_master point to the struct vfsmount of 'C'.

   Finally, D's ->mnt_slave_list links with the ->mnt_slave of 'H' and 'I'.

   'H' and 'I' have their ->mnt_master point to the struct vfsmount of 'D'.


   .. note::
      The propagation tree is orthogonal to the mount tree.

B) Locking:

   ->mnt_share, ->mnt_slave, ->mnt_slave_list and ->mnt_master are protected
   by namespace_sem (exclusive for modifications, shared for reading).

   Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
   There are two exceptions: do_add_mount() and clone_mnt().
   The former modifies a vfsmount that has not been visible in any shared
   data structures yet.
   The latter holds namespace_sem and the only references to the vfsmount
   are in lists that can't be traversed without namespace_sem.

C) Algorithm:

   The crux of the implementation resides in the rbind/move operation.

   The overall algorithm breaks the operation into 3 phases (see
   attach_recursive_mnt() and propagate_mnt()):

   1. Prepare phase.

      For each mount in the source tree:

      a) Create the necessary number of mount trees to
         be attached to each of the mounts that receive
         propagation from the destination mount.
      b) Do not attach any of the trees to its destination.
         However, note down its ->mnt_parent and ->mnt_mountpoint.
      c) Link all the new mounts to form a propagation tree that
         is identical to the propagation tree of the destination
         mount.

      If this phase is successful, there should be 'n' new
      propagation trees, where 'n' is the number of mounts in the
      source tree, and 'm' new mount trees, where 'm' is the number
      of mounts to which the destination mount propagates.  Go to
      the commit phase.

      If any memory allocation fails, go to the abort phase.

   2. Commit phase.

      Attach each of the mount trees to their corresponding
      destination mounts.

   3. Abort phase.

      Delete all the newly created trees.
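
   The three phases follow a common "allocate everything, then attach"
   pattern, so that the step that changes visible state cannot fail.  The
   following is a minimal user-space sketch of that pattern only, not of
   attach_recursive_mnt() or propagate_mnt(): ``struct tree_copy``,
   ``alloc_budget`` and all the function names are hypothetical, and a
   "tree" is reduced to a single node per receiving mount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one new mount tree awaiting attachment. */
struct tree_copy {
	int dest_id;            /* receiver noted during prepare */
	int attached;           /* set only by the commit phase */
	struct tree_copy *next;
};

/*
 * Hook for simulating allocation failure: number of allocations still
 * allowed to succeed, or a negative value for no limit.
 */
static int alloc_budget = -1;

static struct tree_copy *copy_alloc(int dest_id)
{
	struct tree_copy *c;

	if (alloc_budget == 0)
		return NULL;
	if (alloc_budget > 0)
		alloc_budget--;

	c = malloc(sizeof(*c));
	if (!c)
		return NULL;
	c->dest_id = dest_id;
	c->attached = 0;
	c->next = NULL;
	return c;
}

/* Abort phase (also used for teardown): delete all the created copies. */
static void free_copies(struct tree_copy *head)
{
	while (head) {
		struct tree_copy *next = head->next;

		free(head);
		head = next;
	}
}

/* Prepare phase: create one unattached copy per receiving mount. */
static struct tree_copy *prepare_copies(int nr_dests)
{
	struct tree_copy *head = NULL;
	int i;

	for (i = 0; i < nr_dests; i++) {
		struct tree_copy *c = copy_alloc(i);

		if (!c) {
			free_copies(head);      /* abort phase */
			return NULL;
		}
		c->next = head;
		head = c;
	}
	return head;
}

/* Commit phase: attach every prepared copy; nothing here can fail. */
static int commit_copies(struct tree_copy *head)
{
	int attached = 0;

	for (; head; head = head->next) {
		assert(!head->attached);
		head->attached = 1;
		attached++;
	}
	return attached;
}

/* Returns the number of copies attached, or -1 if the operation aborted. */
static int propagate_model(int nr_dests)
{
	struct tree_copy *copies = prepare_copies(nr_dests);
	int attached;

	if (!copies && nr_dests > 0)
		return -1;

	attached = commit_copies(copies);
	free_copies(copies);    /* the model keeps nothing after the demo */
	return attached;
}
```

   With no allocation limit, ``propagate_model(3)`` attaches three copies.
   With ``alloc_budget`` set to 2 the third allocation fails, the two
   copies already created are deleted, and nothing is attached.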

   .. note::
      All the propagation-related functionality resides in the file pnode.c.


------------------------------------------------------------------------

version 0.1  (created the initial document, Ram Pai linuxram@us.ibm.com)

version 0.2  (Incorporated comments from Al Viro)