1.\" 2.\" Copyright (c) 2002 Poul-Henning Kamp 3.\" Copyright (c) 2002 Networks Associates Technology, Inc. 4.\" All rights reserved. 5.\" 6.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp 7.\" and NAI Labs, the Security Research Division of Network Associates, Inc. 8.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the 9.\" DARPA CHATS research program. 10.\" 11.\" Redistribution and use in source and binary forms, with or without 12.\" modification, are permitted provided that the following conditions 13.\" are met: 14.\" 1. Redistributions of source code must retain the above copyright 15.\" notice, this list of conditions and the following disclaimer. 16.\" 2. Redistributions in binary form must reproduce the above copyright 17.\" notice, this list of conditions and the following disclaimer in the 18.\" documentation and/or other materials provided with the distribution. 19.\" 3. The names of the authors may not be used to endorse or promote 20.\" products derived from this software without specific prior written 21.\" permission. 22.\" 23.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 24.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 25.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 26.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 27.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 28.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 29.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 30.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 31.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 32.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 33.\" SUCH DAMAGE. 34.\" 35.\" $FreeBSD$ 36.\" 37.Dd June 8, 2015 38.Dt GEOM 4 39.Os 40.Sh NAME 41.Nm GEOM 42.Nd "modular disk I/O request transformation framework" 43.Sh SYNOPSIS 44.Cd options GEOM_AES 45.Cd options GEOM_BDE 46.Cd options GEOM_BSD 47.Cd options GEOM_CACHE 48.Cd options GEOM_CONCAT 49.Cd options GEOM_ELI 50.Cd options GEOM_FOX 51.Cd options GEOM_GATE 52.Cd options GEOM_JOURNAL 53.Cd options GEOM_LABEL 54.Cd options GEOM_LINUX_LVM 55.Cd options GEOM_MAP 56.Cd options GEOM_MBR 57.Cd options GEOM_MIRROR 58.Cd options GEOM_MULTIPATH 59.Cd options GEOM_NOP 60.Cd options GEOM_PART_APM 61.Cd options GEOM_PART_BSD 62.Cd options GEOM_PART_BSD64 63.Cd options GEOM_PART_EBR 64.Cd options GEOM_PART_EBR_COMPAT 65.Cd options GEOM_PART_GPT 66.Cd options GEOM_PART_LDM 67.Cd options GEOM_PART_MBR 68.Cd options GEOM_PART_PC98 69.Cd options GEOM_PART_VTOC8 70.Cd options GEOM_PC98 71.Cd options GEOM_RAID 72.Cd options GEOM_RAID3 73.Cd options GEOM_SHSEC 74.Cd options GEOM_STRIPE 75.Cd options GEOM_SUNLABEL 76.Cd options GEOM_UZIP 77.Cd options GEOM_VIRSTOR 78.Cd options GEOM_VOL 79.Cd options GEOM_ZERO 80.Sh DESCRIPTION 81The 82.Nm 83framework provides an infrastructure in which 84.Dq classes 85can perform transformations on disk I/O requests on their path from 86the upper kernel to the device drivers and back. 87.Pp 88Transformations in a 89.Nm 90context range from the simple geometric 91displacement performed in typical disk partitioning modules over RAID 92algorithms and device multipath resolution to full blown cryptographic 93protection of the stored data. 
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer ,
represented by the data structure
.Vt g_consumer ,
is the backdoor through which a geom connects to a provider
of another geom and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
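.Pp
Expressed as code, the rank rule amounts to the following.
This is an illustrative sketch only: the
.Ql g_example_rank
name is invented, the field names are those of
.In geom/geom.h ,
and the fragment is assumed to live in a source file set up like the
skeleton shown under DESCRIPTION; the kernel recomputes ranks internally
whenever consumers attach to or detach from providers.
.Bd -literal -offset indent
/*
 * A geom's rank is one higher than the highest rank among the
 * geoms of the providers its consumers are attached to, or 1
 * if it has no attached consumers.
 */
static int
g_example_rank(struct g_geom *gp)
{
	struct g_consumer *cp;
	int rank;

	rank = 1;
	LIST_FOREACH(cp, &gp->consumer, consumer) {
		if (cp->provider == NULL)
			continue;	/* consumer not attached */
		if (cp->provider->geom->rank + 1 > rank)
			rank = cp->provider->geom->rank + 1;
	}
	return (rank);
}
.Ed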
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize whether it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning while
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
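.Pp
A common pattern for classes which cannot survive the loss of their
underlying provider is to let the geom wither away in their orphan
method.
The following sketch is hypothetical (the
.Ql g_example_orphan
name is invented and the fragment belongs in a source file like the
skeleton shown under DESCRIPTION), but it mirrors what simple
slicer-like classes do with the
.Fn g_wither_geom
function:
.Bd -literal -offset indent
static void
g_example_orphan(struct g_consumer *cp)
{

	g_topology_assert();
	/*
	 * The provider below us is gone; mark the whole geom for
	 * destruction.  It is torn down once all outstanding I/O
	 * requests have been returned and all access counts have
	 * dropped to zero.
	 */
	g_wither_geom(cp->geom, ENXIO);
}
.Ed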
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case; for instance, a particular provider
may be specified with a level of override forcing a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
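.Pp
A sketch of such a start method for a hypothetical class is shown below;
it clones the incoming request, applies a fixed offset (the geometric
displacement of a slicer-like transformation), and schedules the clone
on the geom's single consumer.
The
.Ql g_example_start
name and the offset are invented for this manual page:
.Bd -literal -offset indent
static void
g_example_start(struct bio *bp)
{
	struct g_geom *gp;
	struct bio *cbp;

	gp = bp->bio_to->geom;	/* provider the request arrived on */
	cbp = g_clone_bio(bp);	/* clones the request, not the data */
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	cbp->bio_offset += 63 * 512;	/* hypothetical displacement */
	cbp->bio_done = g_std_done;	/* completes the parent request */
	g_io_request(cbp, LIST_FIRST(&gp->consumer));
}
.Ed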
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill the range with random bits to
reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless that is guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org