.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd March 27, 2002
.Os FreeBSD 5.0
.Dt GEOM 4
.Sh NAME
.Nm GEOM
.Nd modular disk I/O request transformation framework
.Sh DESCRIPTION
The GEOM framework provides an infrastructure in which modules
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a GEOM context range from the simple geometric
displacement performed in typical disklabel modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional "volume management", GEOM differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
GEOM is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
GEOM is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these pairwise, resulting in a much more complex configuration.
GEOM, on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph will not be allowed.
.Sh "TERMINOLOGY and TOPOLOGY"
GEOM is quite object oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A "class", represented by the data structure "g_class", implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel or RAID5 classes.
.Pp
An instance of a class is called a "geom" and is represented by the
data structure "g_geom".
In a typical i386 FreeBSD system, there
will be one geom of class MBR for each disk.
.Pp
A "provider", represented by the data structure "g_provider", is
the front gate at which a geom offers service.
A provider is "a disk-like thing which appears in /dev" - a logical
disk in other words.
All providers have three main properties: name, sectorsize and size.
.Pp
A "consumer" is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank-number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Pp
.Em TASTING
is a process which happens whenever a new class or new provider
is created, and it is the class' chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class, which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by GEOM, but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like sectorsize or mediasize for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.Pp
.Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
"bounce" on the provider with an error code set by the geom.
Any consumers attached to the provider will receive notification about
the orphanization and need to take appropriate action.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning.
Geoms like disklabels and stripes should therefore self-destruct, whereas
RAID5 or mirror geoms can continue to function as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not result in any immediate
change in the topology: any attached consumers are still attached and
any opened paths are still open.
It is the responsibility of the
geoms above to close and detach as soon as this can happen.
.Pp
The typical scenario is that a device driver notices a disk has
gone and orphans the provider for it.
The geoms on top receive the orphanization event and orphan all
their providers in turn.
Providers which are not attached to are destroyed right away.
Eventually, at the top level, the geom which interfaces
to DEVFS receives an orphan event on its consumer; it
calls destroy_dev(9), does an explicit close if the
device was open, and then detaches its consumer.
The provider below is now no longer attached to and can be
destroyed.
If the geom has no more providers it can detach
its consumer and self-destruct, and so the carnage passes back
down the tree until the original provider is detached from
and can be destroyed by the geom serving the device driver.
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility in handling disappearing devices.
.Pp
.Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk, "da0", on top of which an MBR geom provides
"da0s1" and "da0s2", and on top of "da0s1" a BSD geom provides
"da0s1a" through "da0s1e".
Both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where "da0" is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
To avoid this situation, when the open of "da0" for write happens,
all attached consumers are told about this, and geoms like
MBR and BSD will self-destruct as a result.
When "da0" is closed again, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, rendering it
impossible to open "da0" for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while "da0" is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done through their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting only when the write count goes
back to zero.
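.Pp
As a rough illustration of the write-count rule above, the following
user-space C sketch models a provider with a single counter.
The names provider_model, open_for_write() and close_for_write() are
hypothetical and are not part of GEOM; the sketch only shows on which
transitions spoiling and retasting would be triggered.

```c
#include <assert.h>

/* Hypothetical model, not a GEOM structure: one write counter per provider. */
struct provider_model {
	int write_count;
};

/* What the (assumed) framework would do in response to an open or close. */
enum spoil_event {
	EV_NONE,	/* nothing to announce */
	EV_SPOIL,	/* tell attached consumers their metadata may go stale */
	EV_RETASTE	/* offer the provider for tasting again */
};

/* Register one more writer; only the 0 -> 1 transition spoils. */
enum spoil_event
open_for_write(struct provider_model *p)
{
	return (p->write_count++ == 0) ? EV_SPOIL : EV_NONE;
}

/* Drop one writer; only the transition back to 0 triggers a retaste. */
enum spoil_event
close_for_write(struct provider_model *p)
{
	assert(p->write_count > 0);
	return (--p->write_count == 0) ? EV_RETASTE : EV_NONE;
}
```

In this model a second writer opening "da0" causes no further spoiling,
and the retaste is deferred until the last writer has closed.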
.Pp
.Em INSERT/DELETE
is a very special operation which allows a new geom
to be instantiated between a consumer and a provider attached to
each other, and to be removed again.
.Pp
To understand the utility of this, imagine a provider
being mounted as a filesystem.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure yet another mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now in essence moved a mounted filesystem from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
.Pp
.Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider can be
specified with a level of override, forcing for instance a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
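.Pp
The INSERT/DELETE maneuver described earlier can be pictured with a
small user-space C sketch.
It is only a model under simplifying assumptions: every geom is a
pass-through with exactly one provider and one consumer, and all names
(geom_model, insert_geom(), delete_geom(), path_length()) are
hypothetical rather than taken from GEOM.

```c
#include <assert.h>
#include <stddef.h>

struct geom_model;

/* A consumer points at the provider it is attached to, or NULL. */
struct consumer_model {
	struct provider_model *attached;
};

/* A provider belongs to a geom; owner is NULL for the bottom-most disk. */
struct provider_model {
	const char *name;
	struct geom_model *owner;
};

/* Simplified pass-through geom: one provider on top, one consumer below. */
struct geom_model {
	struct provider_model prov;
	struct consumer_model cons;
};

void
geom_init(struct geom_model *g, const char *name)
{
	g->prov.name = name;
	g->prov.owner = g;
	g->cons.attached = NULL;
}

/* Splice g between top and the provider top is currently attached to. */
void
insert_geom(struct consumer_model *top, struct geom_model *g)
{
	g->cons.attached = top->attached;	/* new geom consumes the old provider */
	top->attached = &g->prov;		/* upper consumer now sees the new geom */
}

/* Take g out of the path again; top must be attached to g's provider. */
void
delete_geom(struct consumer_model *top, struct geom_model *g)
{
	assert(top->attached == &g->prov);
	top->attached = g->cons.attached;
	g->cons.attached = NULL;
}

/* Count the providers on the path below a consumer. */
int
path_length(const struct consumer_model *c)
{
	int n = 0;
	const struct provider_model *p = c->attached;

	while (p != NULL) {
		n++;
		p = (p->owner != NULL) ? p->owner->cons.attached : NULL;
	}
	return n;
}
```

With a consumer attached directly to a disk provider, insert_geom()
lengthens the path by one provider and delete_geom() restores the
original attachment, which mirrors the mounted-filesystem example.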
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.Pp
.Em "I/O REQUESTS" ,
represented by struct bio, originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the struct bio which
enters through the provider of a particular geom does not "come
out on the other side".
Even simple transformations like MBR and BSD will clone the
struct bio, modify the clone and schedule the clone on their
own consumer.
Note that cloning the struct bio does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total six different I/O requests exist in GEOM: read, write,
delete, format, get attribute and set attribute.
.Pp
Read and write are pretty self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If "secure delete" semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
Get attribute and set attribute support inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by ASCII strings and they will be discussed in
a separate section below.
.Pp
(stay tuned while the author rests his brain and fingers: more to come.)
.Sh HISTORY
This software was developed for the FreeBSD Project by Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.
under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
DARPA CHATS research program.
.Pp
The first precursor for GEOM was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme in FreeBSD never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org