1.\" 2.\" Copyright (c) 2002 Poul-Henning Kamp 3.\" Copyright (c) 2002 Networks Associates Technology, Inc. 4.\" All rights reserved. 5.\" 6.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp 7.\" and NAI Labs, the Security Research Division of Network Associates, Inc. 8.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the 9.\" DARPA CHATS research program. 10.\" 11.\" Redistribution and use in source and binary forms, with or without 12.\" modification, are permitted provided that the following conditions 13.\" are met: 14.\" 1. Redistributions of source code must retain the above copyright 15.\" notice, this list of conditions and the following disclaimer. 16.\" 2. Redistributions in binary form must reproduce the above copyright 17.\" notice, this list of conditions and the following disclaimer in the 18.\" documentation and/or other materials provided with the distribution. 19.\" 3. The names of the authors may not be used to endorse or promote 20.\" products derived from this software without specific prior written 21.\" permission. 22.\" 23.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 24.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 25.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 26.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 27.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 28.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 29.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 30.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 31.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 32.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 33.\" SUCH DAMAGE. 34.\" 35.\" $FreeBSD$ 36.\" 37.Dd March 27, 2002 38.Os 39.Dt GEOM 4 40.Sh NAME 41.Nm GEOM 42.Nd modular disk I/O request transformation framework. 43.Sh DESCRIPTION 44The GEOM framework provides an infrastructure in which "classes" 45can perform transformations on disk I/O requests on their path from 46the upper kernel to the device drivers and back. 47.Pp 48Transformations in a GEOM context range from the simple geometric 49displacement performed in typical disk partitioning modules over RAID 50algorithms and device multipath resolution to full blown cryptographic 51protection of the stored data. 52.Pp 53Compared to traditional "volume management", GEOM differs from most 54and in some cases all previous implementations in the following ways: 55.Bl -bullet 56.It 57GEOM is extensible. 58It is trivially simple to write a new class 59of transformation and it will not be given stepchild treatment. 60If 61someone for some reason wanted to mount IBM MVS diskpacks, a class 62recognizing and configuring their VTOC information would be a trivial 63matter. 64.It 65GEOM is topologically agnostic. 66Most volume management implementations 67have very strict notions of how classes can fit together, very often 68one fixed hierarchy is provided for instance subdisk - plex - 69volume. 70.El 71.Pp 72Being extensible means that new transformations are treated no differently 73than existing transformations. 74.Pp 75Fixed hierarchies are bad because they make it impossible to express 76the intent efficiently. 
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these two by two, resulting in a much more complex configuration.
GEOM, on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY and TOPOLOGY"
GEOM is quite object oriented, and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A "class", represented by the data structure "g_class", implements one
particular kind of transformation.
Typical examples are MBR disk partition, BSD disklabel, and RAID5
classes.
.Pp
An instance of a class is called a "geom", represented by the
data structure "g_geom".
In a typical i386 FreeBSD system, there
will be one geom of class MBR for each disk.
.Pp
A "provider", represented by the data structure "g_provider", is
the front gate at which a geom offers service.
A provider is "a disk-like thing which appears in /dev" - a logical
disk, in other words.
All providers have three main properties: name, sectorsize, and size.
.Pp
A "consumer" is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationship between these entities is as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom is derived from exactly one class.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops so that the directed graph stays acyclic.
This rank number is assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
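.Pp
To make these rules concrete, the following is a deliberately
simplified C model of the topology and of the rank assignment.
The structures and the "assign_rank" function are illustrative
stand-ins only; they are not the kernel's actual "g_geom",
"g_provider", or "g_consumer" data structures.
.Bd -literal -offset indent
#include <stddef.h>

struct geom;

struct provider {                   /* front gate */
	struct geom	*geom;      /* the geom offering the service */
};

struct consumer {                   /* backdoor */
	struct provider	*provider;  /* NULL while unattached */
	struct consumer	*next;      /* this geom's other consumers */
};

struct geom {
	struct consumer	*consumers; /* zero or more consumers */
	int		rank;
};

/*
 * Rank rule from the list above: rank 1 with no attached
 * consumers, otherwise one more than the highest rank among
 * the geoms of the providers this geom's consumers are
 * attached to.  Assumes the geoms below are already ranked.
 */
static int
assign_rank(struct geom *gp)
{
	struct consumer *cp;
	int max = 0;

	for (cp = gp->consumers; cp != NULL; cp = cp->next)
		if (cp->provider != NULL &&
		    cp->provider->geom->rank > max)
			max = cp->provider->geom->rank;
	gp->rank = max + 1;
	return (gp->rank);
}
.Ed
.Pp
Assigning ranks bottom-up in this fashion is well defined precisely
because the graph is acyclic; an attach which would create a cycle
can be detected and refused by comparing rank numbers.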
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Pp
.Em TASTING
is a process that happens whenever a new class or new provider
is created; it gives the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class, which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn, and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by GEOM, but sensible options include:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like sectorsize or mediasize of the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.Pp
.Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
"bounce" on the provider with an error code set by the geom.
Any consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning without
the orphaned provider.
Geoms like disk slicers should therefore self-destruct, whereas
RAID5 or mirror geoms will be able to continue, as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, and any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the
provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which have no attached consumers will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call destroy_dev(9) to stop any more requests from
coming in.
It will sleep until all (if any) outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now unattached will destroy the provider,
detach and destroy its consumer, and destroy its geom.
.It
This process percolates all the way down through the mesh until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.Pp
.Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk, "da0", on top of which an MBR geom provides
"da0s1" and "da0s2", and on top of "da0s1" a BSD geom provides
"da0s1a" through "da0s1e"; both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where "da0" is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when "da0" is opened for writing,
all attached consumers are told about this, and geoms like
MBR and BSD will self-destruct as a result.
When "da0" is closed again, it will be offered for tasting again,
and if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have been opened downwards with an exclusive bit, rendering it
impossible to open "da0" for writing in that case; conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while "da0" is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: spoiling only happens when the write count goes from
zero to non-zero, and retasting happens only when the write count
goes from non-zero to zero.
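.Pp
The transition rule can be sketched in a few lines of C.
The names below are hypothetical stand-ins intended only to show
when spoiling and retasting fire, not how the kernel implements
the notifications.
.Bd -literal -offset indent
struct provider {
	int	acw;	/* current write-open count */
};

/* Hypothetical stand-in for the spoiling notification. */
static void
spoil_attached_consumers(struct provider *pp)
{
	/* tell each attached consumer its metadata may be stale */
	(void)pp;
}

/* Hypothetical stand-in for re-offering the provider. */
static void
offer_for_tasting(struct provider *pp)
{
	/* offer the provider to all classes in turn */
	(void)pp;
}

/* Apply a change "dw" to the write-open count of a provider. */
static void
change_write_count(struct provider *pp, int dw)
{
	int oldw = pp->acw;

	pp->acw += dw;
	if (oldw == 0 && pp->acw > 0)
		spoil_attached_consumers(pp);	/* zero -> non-zero */
	else if (oldw > 0 && pp->acw == 0)
		offer_for_tasting(pp);		/* non-zero -> zero */
}
.Ed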
.Pp
.Em INSERT/DELETE
are very special operations which allow a new geom
to be instantiated between a consumer and a provider attached to
each other, and to be removed again.
.Pp
To understand the utility of this, imagine a provider
being mounted as a file system.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure yet another mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now, in essence, moved a mounted file system from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
.Pp
.Em CONFIGURE
is the process whereby the administrator issues instructions
for a particular class to instantiate itself.
There are multiple ways to express intent in this case; for instance,
a particular provider can be specified with a level of override,
forcing a BSD disklabel module to attach to a provider which was
not found palatable during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.Pp
.Em "I/O REQUESTS" ,
represented by struct bio, originate at a consumer,
are scheduled on its attached provider, and, when processed, are
returned to the consumer.
It is important to realize that the struct bio which
enters through the provider of a particular geom does not "come
out on the other side".
Even simple transformations like MBR and BSD will clone the
struct bio, modify the clone, and schedule the clone on their
own consumer.
Note that cloning the struct bio does not involve cloning the
actual data area specified in the I/O request; a sketch of this
clone-and-modify pattern appears below.
.Pp
In total, four different I/O requests exist in GEOM: read, write,
delete, and get attribute.
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request, and consequently there is no guarantee that the data
actually will be erased or made unavailable unless guaranteed by
specific geoms in the graph.
If "secure delete" semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
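.Pp
The following sketch shows the clone-and-modify pattern described
above, as a slicing geom might apply it.
The "bio_x" structure and the "schedule_on_consumer" helper are
hypothetical stand-ins; they are not the kernel's actual struct bio
or GEOM API.
.Bd -literal -offset indent
#include <sys/types.h>
#include <stdlib.h>

struct bio_x {
	int	cmd;	/* read, write, delete or get attribute */
	off_t	offset;	/* byte offset on the provider */
	off_t	length;	/* byte count */
	void	*data;	/* data buffer */
};

/* Hypothetical: pass the clone to this geom's own consumer. */
static void
schedule_on_consumer(struct bio_x *bp)
{
	(void)bp;
}

/*
 * Start routine of a slicing geom: clone the incoming request,
 * apply the geometric displacement of the slice, and schedule
 * the clone on the consumer below.
 */
static void
slice_start(struct bio_x *bp, off_t slice_offset)
{
	struct bio_x *cb;

	cb = malloc(sizeof(*cb));
	if (cb == NULL)
		return;		/* error handling elided */
	*cb = *bp;		/* clone the request structure */
	cb->offset += slice_offset;	/* shift into the slice */
	/*
	 * cb->data still points at the original buffer: the
	 * data area itself is not cloned.
	 */
	schedule_on_consumer(cb);
}
.Ed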
.Pp
Get attribute supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by ASCII strings and they will be discussed in
a separate section below.
.Pp
(stay tuned while the author rests his brain and fingers: more to come.)
.Sh HISTORY
This software was developed for the FreeBSD Project by Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.
under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
DARPA CHATS research program.
.Pp
The first precursor for GEOM was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in FreeBSD never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org