.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd March 14, 2013
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_AES
.Cd options GEOM_BDE
.Cd options GEOM_BSD
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_FOX
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MBR
.Cd options GEOM_MIRROR
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_PC98
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_PC98
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_SUNLABEL
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_VOL
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper layers of the kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed by typical disk partitioning modules, through
RAID algorithms and device multipath resolution, to full-blown
cryptographic protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation, and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
a single fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
them in pairs, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented, and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk, in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize ,
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom is derived from exactly one class.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one provider.
.It
A provider can have zero or more consumers attached.
.El
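.Pp
The following simplified C sketch illustrates how these entities point
at each other; it is illustrative only, and the real declarations in
.In geom/geom.h
carry many more fields:
.Bd -literal -offset indent
/* Simplified sketch; NOT the actual kernel declarations. */
#include <sys/types.h>
#include <sys/queue.h>

struct g_class {
	const char		*name;		/* e.g. "MBR" */
	LIST_HEAD(, g_geom)	 geom;		/* geoms of this class */
};

struct g_geom {
	struct g_class		*class;		/* exactly one class */
	LIST_HEAD(, g_consumer)	 consumer;	/* zero or more */
	LIST_HEAD(, g_provider)	 provider;	/* zero or more */
	int			 rank;		/* see below */
};

struct g_consumer {
	struct g_geom		*geom;		/* owning geom */
	LIST_ENTRY(g_consumer)	 consumer;	/* on geom's list */
	struct g_provider	*provider;	/* zero or one */
	LIST_ENTRY(g_consumer)	 consumers;	/* on provider's list */
};

struct g_provider {
	char			*name;		/* appears in /dev */
	struct g_geom		*geom;		/* owning geom */
	LIST_ENTRY(g_provider)	 provider;	/* on geom's list */
	LIST_HEAD(, g_consumer)	 consumers;	/* zero or more */
	off_t			 mediasize;
	u_int			 sectorsize;
};
.Ed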
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
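.Pp
In terms of the simplified structures sketched above, the rule amounts
to the following illustrative (not the kernel's actual) code:
.Bd -literal -offset indent
/*
 * Rank rule: a geom with no attached consumers has rank 1;
 * otherwise its rank is one higher than the highest rank
 * among the geoms it consumes from.
 */
static int
geom_rank(struct g_geom *gp)
{
	struct g_consumer *cp;
	int rank = 1;

	LIST_FOREACH(cp, &gp->consumer, consumer) {
		if (cp->provider == NULL)	/* not attached */
			continue;
		if (cp->provider->geom->rank >= rank)
			rank = cp->provider->geom->rank + 1;
	}
	return (rank);
}
.Ed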
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it gives the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class, which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn, and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize whether it should accept the offered
provider is not defined by
.Nm ,
but sensible options include:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
of the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
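.Pp
The skeleton of a taste method, sketched here for a hypothetical
.Dq example
class, might look as follows.
The helpers used are the real kernel GEOM API (see
.Xr g_attach 9 ,
.Xr g_access 9 ,
and
.Xr g_geom 9 ) ,
but the class, its on-disk magic, and the error handling are
invented and abbreviated:
.Bd -literal -offset indent
/* Sketch only; assumes the usual kernel and <geom/geom.h> context. */
static struct g_geom *
g_example_taste(struct g_class *mp, struct g_provider *pp,
    int flags)
{
	struct g_geom *gp;
	struct g_consumer *cp;
	u_char *buf;
	int error;

	gp = g_new_geomf(mp, "%s.example", pp->name);
	cp = g_new_consumer(gp);
	g_attach(cp, pp);
	if (g_access(cp, 1, 0, 0) == 0) {
		/* Peek at the first sector for our (made-up) magic. */
		buf = g_read_data(cp, 0, pp->sectorsize, &error);
		g_access(cp, -1, 0, 0);
		if (buf != NULL) {
			if (memcmp(buf, "EXAMPLE", 7) == 0) {
				g_free(buf);
				/* ... create provider(s) ... */
				return (gp);
			}
			g_free(buf);
		}
	}
	/* Not ours: undo everything. */
	g_detach(cp);
	g_destroy_consumer(cp);
	g_destroy_geom(gp);
	return (NULL);
}
.Ed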
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning while
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct, whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, and any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the
provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers with no attached consumers will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer, and destroy its geom.
.It
This process percolates all the way down through the mesh until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
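.Pp
For a geom which cannot survive the loss of its provider, the orphan
method can be as simple as the following sketch; the class name is
hypothetical, while
.Fn g_wither_geom
and
.Fn g_topology_assert
are real kernel helpers:
.Bd -literal -offset indent
/*
 * Sketch of an orphan method for a geom which cannot keep
 * functioning without the orphaned provider: mark its own
 * providers with ENXIO and let the geom self-destruct once
 * all references have been released.
 */
static void
g_example_orphan(struct g_consumer *cp)
{

	g_topology_assert();
	g_wither_geom(cp->geom, ENXIO);
}
.Ed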
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when
.Pa da0
is opened for writing,
all attached consumers are told about this, and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again,
and, if the data structures for MBR and BSD are still there, new
geoms will be instantiated.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: spoiling happens only when the write count goes from
zero to non-zero, and retasting happens only when the write count goes
from non-zero to zero.
.It Em INSERT/DELETE
are very special operations which allow a new geom
to be instantiated between a consumer and a provider which are attached
to each other, and to be removed again afterwards.
.Pp
To understand the utility of this, imagine a provider
being mounted as a file system.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure a second mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now, in essence, moved a mounted file system from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider, and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
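.Pp
In sketch form, this clone-and-forward pattern in a geom's start
method looks as follows; the helpers are the real API described in
.Xr g_bio 9 ,
while the example geom and its one-sector displacement are invented
stand-ins for a real transformation:
.Bd -literal -offset indent
static void
g_example_start(struct bio *bp)
{
	struct bio *bp2;

	bp2 = g_clone_bio(bp);
	if (bp2 == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	/* Finish the parent bio when the clone completes. */
	bp2->bio_done = g_std_done;
	/* The transformation proper: shift by one sector. */
	bp2->bio_offset += 512;
	/* Hand the clone to our (single) consumer. */
	g_io_request(bp2,
	    LIST_FIRST(&bp->bio_to->geom->consumer));
}
.Ed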
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed if the underlying technology
supports it.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill the range with random bits
to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request, and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 Pq unused
.It 0x10 Pq allow foot shooting
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This flag is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr disk 9 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org