.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd May 25, 2006
.Os
.Dt GEOM 4
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules over RAID
algorithms and device multipath resolution to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most
and in some cases all previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these two and two, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank-number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e.: zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em INSERT/DELETE
are very special operations which allow a new geom
to be instantiated between a consumer and a provider attached to
each other and to remove it again.
.Pp
To understand the utility of this, imagine a provider
being mounted as a file system.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure yet another mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now, in essence, moved a mounted file system from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case - a particular provider may be
specified with a level of override forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org