1 .\" ----------------------------------------------------------------------------
2 .\" "THE BEER-WARE LICENSE" (Revision 42):
5 .\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
6 .\" ----------------------------------------------------------------------------
13 - or -
17 Poul-Henning Kamp <phk@FreeBSD.org>
they left the file-system layer were logical sub-disk implementations
and operating system have done it their own way. As FreeBSD migrates to
other platforms it needs to understand these local conventions to be
able to co-exist with other operating systems on the same disk.
Both of these factors indicate the need for a structured approach to
This paper contains the road-map for a stackable "BIO" system in
of struct buf, it is a most enlightening case of not exactly bit-rot
but more appropriately design-rot.
and to transport I/O operations to device drivers. For the purpose of
In addition to this, the av_forw and av_back elements are
used by the disk device drivers to put requests on a linked list.
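As a rough illustration of that technique, a driver-side queue built on av_forw
could look like the sketch below; the structure is trimmed down and the queue
head and helper are invented for the example, they are not the historic code:
.DS
.ft CW
#include <stddef.h>

/* Trimmed-down request: only the driver-queue linkage is shown. */
struct buf_sketch {
        struct buf_sketch *av_forw;     /* next request in the driver's queue */
        struct buf_sketch *av_back;     /* previous request */
        /* ... I/O and cache fields ... */
};

static struct buf_sketch *dk_queue;     /* hypothetical per-device queue head */

static void
dk_enqueue(struct buf_sketch *bp)
{
        struct buf_sketch *tail;

        bp->av_forw = NULL;
        if (dk_queue == NULL) {         /* queue empty: this request starts it */
                bp->av_back = NULL;
                dk_queue = bp;
                return;
        }
        for (tail = dk_queue; tail->av_forw != NULL; tail = tail->av_forw)
                ;                       /* real drivers would sort the queue here */
        tail->av_forw = bp;
        bp->av_back = tail;
}
.ft P
.DE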
aspect and only a few fields relate exclusively to the cache aspect.
If we step forward to the BSD 4.4-Lite-2 release, struct buf has grown
/* Function to call upon completion. */
but otherwise there is no change to the fields which
represent the I/O aspect. All the new fields relate to the cache
aspect, link buffers to the VM system, provide hacks for file-systems
By the time we get to FreeBSD 3.0 more stuff has grown on struct buf:
/* Function to call upon completion. */
…nce unchanged. A couple of fields have been added which allow the driver to hang local data off …
seems to continue to grow and grow.
of code have emerged which need to do non-trivial translations of
a logical space to a physical space, and the mappings they perform
It is interesting to note that Lions in his comments to the \f(CWrkaddr\fP
routine (p. 16-2) writes \fIThe code in this procedure incorporates
usefulness seems to be restricted.\fP This more than hints at the
presence already then of various hacks to stripe/span multiple devices.
this resulted in, leaving but one: Reads or writes to the magic "disklabel"
some cases modified before being passed on to the device driver. This need
ability to stack I/O operations;
these and pass them on to other device drivers.
Apart from it being inefficient to lug about a 348-byte data structure
when 80 bytes would have done, it also leads to significant code rot
when programmers don't know what to do about the remaining fields or
\(bu Struct buf currently has several users, vinum, ccd and to
will have for future work on struct buf, I will try to address
a legitimate and valid requirement to be able to do I/O operations
Without doubt, the I/O request has to be tuned to fit the needs
any future changes in struct buf are likely to affect the I/O request
One particular change which has been proposed is to drop the present
physical address DMA to transfer the data maintaining such a mapping
Of course some drivers will still need to be able to access the
The answer to this is ``no''.
Anything that could be added to or done with
the I/O aspect of struct buf can also be added to or done
The first decision to be made was who got to use the name "struct buf",
for instance, the function to signal completion of an I/O request is
and by using cpp(1) macros to alias the fields to the legacy struct buf
opposed to "hi-jacked" fields).
One field was found to have "dual-use": the b_flags field.
For historic reasons B_WRITE was defined to be zero, which led to
confusion and bugs; this pushed the decision to have a separate
only the action to be performed.
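A small sketch of both the pitfall and the cure; the flag values and the
trimmed-down structure below are illustrative only, not the kernel's actual
definitions:
.DS
.ft CW
#define B_READ     0x0001       /* legacy flag: this is a read */
#define B_WRITE    0x0000       /* legacy "flag": write is merely the absence of B_READ */

#define BIO_READ   1            /* explicit commands in the new scheme */
#define BIO_WRITE  2
#define BIO_DELETE 3

struct bio_sketch {
        int     bio_cmd;        /* what to do: exactly one of the BIO_* values */
        int     bio_flags;      /* how to do it: ordering, error state, ... */
};

/* The classic pitfall: this test can never succeed because B_WRITE is zero. */
static int
looks_like_write(int b_flags)
{
        return (b_flags & B_WRITE);     /* always 0 */
}

/* With a dedicated command field the test is unambiguous. */
static int
is_write(const struct bio_sketch *bp)
{
        return (bp->bio_cmd == BIO_WRITE);
}
.ft P
.DE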
dev_t bio_dev; /* Device to do I/O on. */
After adding a struct bio to struct buf and the fields aliased into it
period allows a pointer to either to be cast to a pointer of the other,
which means that certain pieces of code can be left un-converted with the
The ccd and vinum modules have been left un-converted like this for now.
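The transitional arrangement can be sketched roughly as follows; the field
lists are abridged and the helper functions are invented for the example,
but the idea is the one described above: struct bio sits first in struct buf
and cpp(1) aliases keep the legacy b_* names working:
.DS
.ft CW
#include <sys/types.h>
#include <stdint.h>

struct bio {                            /* the pure I/O aspect */
        int       bio_cmd;              /* BIO_READ, BIO_WRITE, ... */
        dev_t     bio_dev;              /* device to do I/O on */
        int64_t   bio_blkno;            /* sector number on that device */
        long      bio_bcount;           /* transfer length in bytes */
        void     *bio_data;             /* data buffer */
        void    (*bio_done)(struct bio *); /* function to call upon completion */
};

struct buf {
        struct bio b_io;                /* must remain the FIRST member */
        /* ... cache/VM aspect: vnode, free-list linkage, dirty range ... */
};

/* cpp(1) aliases: legacy code keeps using the b_* names unchanged. */
#define b_dev     b_io.bio_dev
#define b_blkno   b_io.bio_blkno
#define b_bcount  b_io.bio_bcount

/* Placeholder for a driver entry point that takes the pure I/O request. */
static void
hand_to_driver(struct bio *bip)
{
        bip->bio_done(bip);             /* a real driver would queue it instead */
}

/*
 * While struct bio is the first member, a struct buf pointer may be cast
 * to a struct bio pointer, which is what lets ccd and vinum stay
 * un-converted for the time being.
 */
static void
legacy_strategy(struct buf *bp)
{
        hand_to_driver((struct bio *)bp);
}
.ft P
.DE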
This is basically where FreeBSD-current stands today.
The next step is to substitute struct bio for struct buf in all the code
The patch to do this is up for review. \**
be trivial to do so, but not by definition worthwhile.
Retaining the aliasing for the b_* and bio_* name-spaces this way
into the drivers to fulfill the I/O request and this provides us
with a single isolated location for performing non-trivial translations.
As an example of this flexibility: it has been proposed to essentially
drop the b_blkno field and use the b_offset field to communicate the
on-disk location of the data. b_blkno is a 32-bit offset in DEV_BSIZE
(512) byte sectors, which allows us to address two terabytes worth
of data. Using b_offset as a 64-bit byte address would not only allow
us to address disks 8 million times larger, it would also make it
possible to accommodate disks which use non-power-of-two sector sizes,
audio CD-ROMs for instance.
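The arithmetic, and the one-time translation a gateway such as DEV_STRATEGY()
could perform, can be sketched as follows; apart from the standard 512-byte
unit every name below is invented for the illustration:
.DS
.ft CW
#include <sys/types.h>
#include <stdint.h>

#define SECTOR_SIZE     512     /* 2^9 bytes, the traditional DEV_BSIZE unit */

/*
 * Capacity arithmetic:
 *   32-bit blkno * 512-byte sectors = 2^32 * 2^9 = 2^41 bytes = 2 TB.
 *   A 64-bit byte offset covers 2^64 bytes, i.e. 2^23 (about 8.4 million)
 *   times as much, and is not tied to any particular sector size.
 */
struct io_sketch {
        uint32_t blkno;         /* legacy: number of a 512-byte sector */
        off_t    offset;        /* proposed: byte offset on the device */
};

/* One-time translation, done in a single place on the way to the driver. */
static void
populate_offset(struct io_sketch *io)
{
        io->offset = (off_t)io->blkno * SECTOR_SIZE;
}

/* A byte offset also addresses odd-sized sectors, e.g. 2352-byte audio CD frames. */
static off_t
audio_frame_offset(uint32_t frame)
{
        return ((off_t)frame * 2352);
}
.ft P
.DE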
\(bu Add code to DEV_STRATEGY() to populate b_offset from b_blkno in the
\(bu Change diskslice/label, ccd, vinum and device drivers to use b_offset
\(bu Remove the bio_blkno field from struct bio, add it to struct buf as
Another possible transition could be to not have a "built-in" struct bio
to struct buf it might be cheaper to remove struct bio from struct buf,
un-alias the fields and have DEV_STRATEGY() allocate a struct bio and populate
This would also be entirely transparent to both users of struct buf and
diskslice/label, ccd and vinum, it is not unreasonable to start
In traditional UNIX semantics a "disk" is a one-dimensional array of
…lemented with a sort of "don't ask-don't tell" policy where the system administrator would specify a l…
A truly generalised concept of a disk needs to be more flexible and more
expressive. For instance, a user of a disk will want to know:
\(bu What is the sector size? Sector sizes these days may not be a power
devices where a wear-distribution software or hardware function uses
the information about which sectors are actually in use to keep the
usage of the slow erase function to a minimum.
\(bu Is opening this device in a specific mode (read-only or read-write)
allowed? The VM system and the file-systems generally assume that nobody
writes to "their storage" under their feet, and therefore opens which
This is useful for staying compatible with badly designed on-disk formats
at any time. While some devices like CD-ROMs can lock the media in
…ng "strategy/biodone" model of I/O scheduling and decide to use a modular or stackable approach to
Mirroring, RAID, striping, interleaving, disk-labels and sub-disks, all of
these techniques would get a common framework to operate in.
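Purely as a thought experiment, and not as a committed interface, a geometry
module in such a framework might expose hooks along these lines; every name
below is hypothetical and simply mirrors the strategy/biodone model:
.DS
.ft CW
struct bio;                             /* the pure I/O request */

struct geom_module_sketch {
        const char *name;               /* "MBR", "BSD", "mirror", "stripe", ... */
        struct geom_module_sketch *below;  /* where we send translated requests */

        /* Look at a new disk and attach an instance if the on-disk data is ours. */
        int  (*recognise)(struct geom_module_sketch *self, struct bio *probe);

        /* Start an I/O request: translate it and pass it downwards. */
        void (*start)(struct geom_module_sketch *self, struct bio *bp);

        /* May this consumer open us in the given mode? (read/write counts) */
        int  (*access)(struct geom_module_sketch *self, int read, int write);

        /* The provider below us disappeared: clean up, notify our consumers. */
        void (*gone)(struct geom_module_sketch *self);
};
.ft P
.DE
Mirroring, striping, partitioning and sub-disk schemes would then differ only
in how their start and recognise hooks translate requests, not in how they
are plugged into the tree.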
installation of FreeBSD. The code will have to act and react exactly
all hard to emulate, so implementation-wise this is a non-issue.
But let's look at some drawings to see what this means in practice.
arrow up from Ad0.n
arrow up from Ad1.n
arrow up from Ad2.n
arrow up from Ad3.n
DML: box dashed width 4i height .9i with .sw at SL0.sw + (-.2,-.2)
"Disk-mini-layer" with .n at DML.s + (0, .1)
A0A: arrow up from 1/4 <SL0.nw, SL0.ne>
A0B: arrow up from 2/4 <SL0.nw, SL0.ne>
A0E: arrow up from 3/4 <SL0.nw, SL0.ne>
A1C: arrow up from 2/4 <SL1.nw, SL1.ne>
arrow to 1/3 <V.sw, V.se>
A2C: arrow up from 2/4 <SL2.nw, SL2.ne>
arrow to 2/3 <V.sw, V.se>
A3A: arrow up from 1/4 <SL3.nw, SL3.ne>
A3E: arrow up from 2/4 <SL3.nw, SL3.ne>
A3F: arrow up from 3/4 <SL3.nw, SL3.ne>
V1: arrow up from 1/4 <V.nw, V.ne>
V2: arrow up from 2/4 <V.nw, V.ne>
V3: arrow up from 3/4 <V.nw, V.ne>
arrow up from Ad0.n
arrow up
A0A: arrow up from 1/4 <B0.nw, B0.ne>
A0B: arrow up from 2/4 <B0.nw, B0.ne>
A0E: arrow up from 3/4 <B0.nw, B0.ne>
arrow up from Ad3.n
arrow up
arrow from Ad1.n to 1/3 <V.sw, V.se>
arrow from Ad2.n to 2/3 <V.sw, V.se>
V1: arrow up from 1/4 <V.nw, V.ne>
V2: arrow up from 2/4 <V.nw, V.ne>
V3: arrow up from 3/4 <V.nw, V.ne>
The first thing we notice is that the disk mini-layer is gone, instead
needs to go through the BSD/MBR layers if it wants access to the entire
Now, imagine that a ZIP drive is connected to the machine, and the
A number of the geometry modules have registered as "auto-discovering"
and will be polled sequentially to see if any of them recognise what
an instance of itself to the disk:
arrow up from D.n
M1: arrow up from 1/3 <M.nw, M.ne>
M2: arrow up from 2/3 <M.nw, M.ne>
arrow "O" up from D.n
M1: line up .3i from 1/3 <M.nw, M.ne>
M2: arrow "O" up from 2/3 <M.nw, M.ne>
B1: arrow "O" up from 1/4 <B.nw, B.ne>
B2: arrow "O" up from 2/4 <B.nw, B.ne>
B3: arrow "O" up from 3/4 <B.nw, B.ne>
the UFS superblock and extract from there the path to mount the disk
on, but this is probably better implemented in a general "device-daemon"
in user-land.
accessed from user-land or kernel. The VM and file-systems generally
prefer to have exclusive write access to the disk sectors they use,
so we need to enforce this policy. Since we cannot know what transformation
a particular module implements, we need to ask the modules if the open
is OK, and they may need to ask their neighbours before they can answer.
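A minimal sketch of such a recursive permission check, assuming each module
keeps only a pointer to the module below it and a record of what it has
already granted (all names are hypothetical):
.DS
.ft CW
#include <stddef.h>

struct node_sketch {
        struct node_sketch *below;      /* next module towards the driver */
        int   writers;                  /* write opens already granted here */
        int   exclusive;                /* some consumer demanded exclusivity */
};

/* Non-zero if the open may proceed; each module checks itself, then asks below. */
static int
open_ok(struct node_sketch *np, int want_write)
{
        if (want_write && np->exclusive)
                return (0);             /* somebody holds this range exclusively */
        if (np->below != NULL && !open_ok(np->below, want_write))
                return (0);             /* a module closer to the disk objects */
        if (want_write)
                np->writers++;          /* record the grant on the way back up */
        return (1);
}
.ft P
.DE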
We decide to mount a filesystem on one of the BSD partitions at the very top.
The open request is passed to the BSD module, which finds that none of
objections. It then passes the open to the MBR module, which goes through
basically the same procedure, finds no objections and passes the request to
in R/W mode so it does not need to ask for permission for that again
We now try to open the other slice for writing, the one which already has the
BSD module attached. The open is passed to the MBR module which
code to implement in a prototype implementation.
of intent to eject, a call-up from the driver can try to get devices synchronised
an unplugged parallel ZIP drive, a floppy disk or a PC-card, we have no
choice but to dismantle the setup. The device driver sends a "gone" notification to the MBR module…
in the buf/vm system and returns. The BSD module replicates the "gone" to
the two mounted file-systems which in turn unmount forcefully, invalidate
arrow "O" up from D0.n
M01: line up .3i from 1/3 <M0.nw, M0.ne>
M02: arrow "O" up from 2/3 <M0.nw, M0.ne>
arrow "O" up from D1.n
M11: line up .3i from 1/3 <M1.nw, M1.ne>
M11a: arrow up .2i
arrow "O" up
BB1: arrow "O" up from 1/4 <BB.nw, BB.ne>
BB2: arrow "O" up from 2/4 <BB.nw, BB.ne>
BB3: arrow "O" up from 3/4 <BB.nw, BB.ne>
M12: arrow "O" up from 2/3 <M1.nw, M1.ne>
B1: arrow "O" up from 1/4 <B.nw, B.ne>
B2: arrow "O" up from 2/4 <B.nw, B.ne>
B3: arrow "O" up from 3/4 <B.nw, B.ne>
Now assuming that we lose disk da0, the notification goes up like before
doesn't propagate the "gone" notification further up and the three
mounted file-systems are not affected.
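A sketch of how such a notification might travel, and how a module that still
has redundancy can absorb it; the structure and names are purely illustrative:
.DS
.ft CW
#include <stddef.h>

struct layer_sketch {
        const char *name;
        struct layer_sketch *above;     /* consumer to notify, NULL at the top */
        int healthy_copies;             /* >1 means a mirror can still serve I/O */
};

static void
media_gone(struct layer_sketch *lp)
{
        if (lp->healthy_copies > 1) {
                lp->healthy_copies--;   /* mirror: drop the lost copy, carry on */
                return;                 /* nothing above us needs to know */
        }
        /* No redundancy left: invalidate our own state and tell the consumer. */
        if (lp->above != NULL)
                media_gone(lp->above);
}
.ft P
.DE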
It is possible to modify the graph while in action, as long as the
to output. Next we can add another copy to the mirror, give the
mirror time to sync the two copies. Detach the first mirror copy
from one disk to another transparently.
Most of the infrastructure is in place now to implement stackable
information about devices can be put. This enabled us to get rid
driver/device, and significantly cleaned up the vnode aliasing for
\(bu The disk-mini-layer has
majority of the disk-drivers, saving on average 100 lines of code per
\(bu changes to struct bio to make it more
stackable. This mostly relates to the handling of the biodone()
event (one possible approach is sketched below), something which will be transparent to all current users
\(bu code to stitch modules together and to pass events and notifications
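One way to make the biodone() handling stackable is for a module to clone the
request it receives and point the clone's completion handler back at itself;
the sketch below illustrates the idea and is not the committed interface:
.DS
.ft CW
#include <stdlib.h>

/* Abridged, hypothetical request with just enough fields for the example. */
struct bio_sk {
        void  (*bio_done)(struct bio_sk *);  /* completion callback */
        void   *bio_caller;                  /* private pointer for the module */
        int     bio_error;
};

static void
child_done(struct bio_sk *child)
{
        struct bio_sk *parent = child->bio_caller;

        parent->bio_error = child->bio_error;   /* propagate the result upwards */
        free(child);
        parent->bio_done(parent);               /* "biodone" the original request */
}

/* A pass-through module: clone the request and send the clone downwards. */
static void
module_start(struct bio_sk *parent, void (*lower_start)(struct bio_sk *))
{
        struct bio_sk *child = calloc(1, sizeof(*child));

        /* Error handling omitted for brevity. */
        *child = *parent;                       /* copy command, offset, length ... */
        child->bio_done = child_done;           /* come back here on completion */
        child->bio_caller = parent;
        lower_start(child);
}
.ft P
.DE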
My plan for implementing stackable BIO layers is to first complete
The next step is to re-implement the monolithic disk-mini-layer so
other consumers should not be able to tell the difference between
the current and the new disk-mini-layer. The new implementation
will initially use a static stacking to remain compatible with the
The next step is to make the stackable layers configurable,
to provide the means to initialise the stacking and to subsequently
BIO system: CCD can be re-implemented as a mirror module and a stripe
module. Vinum can be integrated either as one "macro-module" or
purposes can be added, sub-disk handling for Solaris, MacOS, etc