1.\" 2.\" Copyright (c) 2002 Poul-Henning Kamp 3.\" Copyright (c) 2002 Networks Associates Technology, Inc. 4.\" All rights reserved. 5.\" 6.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp 7.\" and NAI Labs, the Security Research Division of Network Associates, Inc. 8.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the 9.\" DARPA CHATS research program. 10.\" 11.\" Redistribution and use in source and binary forms, with or without 12.\" modification, are permitted provided that the following conditions 13.\" are met: 14.\" 1. Redistributions of source code must retain the above copyright 15.\" notice, this list of conditions and the following disclaimer. 16.\" 2. Redistributions in binary form must reproduce the above copyright 17.\" notice, this list of conditions and the following disclaimer in the 18.\" documentation and/or other materials provided with the distribution. 19.\" 3. The names of the authors may not be used to endorse or promote 20.\" products derived from this software without specific prior written 21.\" permission. 22.\" 23.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 24.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 25.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 26.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 27.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 28.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 29.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 30.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 31.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 32.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 33.\" SUCH DAMAGE. 34.\" 35.\" $FreeBSD$ 36.\" 37.Dd April 20, 2017 38.Dt GEOM 4 39.Os 40.Sh NAME 41.Nm GEOM 42.Nd "modular disk I/O request transformation framework" 43.Sh SYNOPSIS 44.Cd options GEOM_AES 45.Cd options GEOM_BDE 46.Cd options GEOM_BSD 47.Cd options GEOM_CACHE 48.Cd options GEOM_CONCAT 49.Cd options GEOM_ELI 50.Cd options GEOM_FOX 51.Cd options GEOM_GATE 52.Cd options GEOM_JOURNAL 53.Cd options GEOM_LABEL 54.Cd options GEOM_LINUX_LVM 55.Cd options GEOM_MAP 56.Cd options GEOM_MBR 57.Cd options GEOM_MIRROR 58.Cd options GEOM_MOUNTVER 59.Cd options GEOM_MULTIPATH 60.Cd options GEOM_NOP 61.Cd options GEOM_PART_APM 62.Cd options GEOM_PART_BSD 63.Cd options GEOM_PART_BSD64 64.Cd options GEOM_PART_EBR 65.Cd options GEOM_PART_EBR_COMPAT 66.Cd options GEOM_PART_GPT 67.Cd options GEOM_PART_LDM 68.Cd options GEOM_PART_MBR 69.Cd options GEOM_PART_VTOC8 70.Cd options GEOM_RAID 71.Cd options GEOM_RAID3 72.Cd options GEOM_SHSEC 73.Cd options GEOM_STRIPE 74.Cd options GEOM_SUNLABEL 75.Cd options GEOM_UZIP 76.Cd options GEOM_VIRSTOR 77.Cd options GEOM_VOL 78.Cd options GEOM_ZERO 79.Sh DESCRIPTION 80The 81.Nm 82framework provides an infrastructure in which 83.Dq classes 84can perform transformations on disk I/O requests on their path from 85the upper kernel to the device drivers and back. 86.Pp 87Transformations in a 88.Nm 89context range from the simple geometric 90displacement performed in typical disk partitioning modules over RAID 91algorithms and device multipath resolution to full blown cryptographic 92protection of the stored data. 
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most, and in some cases all, previous implementations
in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class of transformation,
and it will not be given stepchild treatment.
If someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations have very strict notions of
how classes can fit together; very often a single fixed hierarchy is
provided, for instance, subdisk - plex - volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead,
one is forced to make subdisks on the physical volumes and to mirror
these two and two, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one particular kind of transformation.
Typical examples are MBR disk partition, BSD disklabel, and RAID5
classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical disk, in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize ,
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to the provider of
another geom and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom is derived from exactly one class.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
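.Pp
These relationships can be sketched in C.
The following is a much-simplified illustration; the authoritative
definitions, with many more fields, live in
.In geom/geom.h :
.Bd -literal -offset indent
/* Simplified sketch only; not the real kernel definitions. */
struct g_class {
	const char		*name;	    /* kind of transformation */
	LIST_HEAD(, g_geom)	 geom;	    /* zero or more instances */
};

struct g_geom {
	char			*name;
	struct g_class		*class;	    /* derived from exactly one */
	LIST_HEAD(, g_consumer)	 consumer;  /* zero or more consumers */
	LIST_HEAD(, g_provider)	 provider;  /* zero or more providers */
	int			 rank;	    /* position in the DAG */
};

struct g_consumer {
	struct g_geom		*geom;	    /* geom it belongs to */
	struct g_provider	*provider;  /* attached to zero or one */
};

struct g_provider {
	char			*name;	    /* appears in /dev */
	struct g_geom		*geom;	    /* geom it belongs to */
	LIST_HEAD(, g_consumer)	 consumers; /* zero or more attached */
	off_t			 mediasize; /* the "size" property */
	u_int			 sectorsize;
};
.Ed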
202.El 203.Sh "SPECIAL TOPOLOGICAL MANEUVERS" 204In addition to the straightforward attach, which attaches a consumer 205to a provider, and detach, which breaks the bond, a number of special 206topological maneuvers exists to facilitate configuration and to 207improve the overall flexibility. 208.Bl -inset 209.It Em TASTING 210is a process that happens whenever a new class or new provider 211is created, and it provides the class a chance to automatically configure an 212instance on providers which it recognizes as its own. 213A typical example is the MBR disk-partition class which will look for 214the MBR table in the first sector and, if found and validated, will 215instantiate a geom to multiplex according to the contents of the MBR. 216.Pp 217A new class will be offered to all existing providers in turn and a new 218provider will be offered to all classes in turn. 219.Pp 220Exactly what a class does to recognize if it should accept the offered 221provider is not defined by 222.Nm , 223but the sensible set of options are: 224.Bl -bullet 225.It 226Examine specific data structures on the disk. 227.It 228Examine properties like 229.Dq sectorsize 230or 231.Dq mediasize 232for the provider. 233.It 234Examine the rank number of the provider's geom. 235.It 236Examine the method name of the provider's geom. 237.El 238.It Em ORPHANIZATION 239is the process by which a provider is removed while 240it potentially is still being used. 241.Pp 242When a geom orphans a provider, all future I/O requests will 243.Dq bounce 244on the provider with an error code set by the geom. 245Any 246consumers attached to the provider will receive notification about 247the orphanization when the event loop gets around to it, and they 248can take appropriate action at that time. 249.Pp 250A geom which came into being as a result of a normal taste operation 251should self-destruct unless it has a way to keep functioning whilst 252lacking the orphaned provider. 253Geoms like disk slicers should therefore self-destruct whereas 254RAID5 or mirror geoms will be able to continue as long as they do 255not lose quorum. 256.Pp 257When a provider is orphaned, this does not necessarily result in any 258immediate change in the topology: any attached consumers are still 259attached, any opened paths are still open, any outstanding I/O 260requests are still outstanding. 261.Pp 262The typical scenario is: 263.Pp 264.Bl -bullet -offset indent -compact 265.It 266A device driver detects a disk has departed and orphans the provider for it. 267.It 268The geoms on top of the disk receive the orphanization event and 269orphan all their providers in turn. 270Providers which are not attached to will typically self-destruct 271right away. 272This process continues in a quasi-recursive fashion until all 273relevant pieces of the tree have heard the bad news. 274.It 275Eventually the buck stops when it reaches geom_dev at the top 276of the stack. 277.It 278Geom_dev will call 279.Xr destroy_dev 9 280to stop any more requests from 281coming in. 282It will sleep until any and all outstanding I/O requests have 283been returned. 284It will explicitly close (i.e.: zero the access counts), a change 285which will propagate all the way down through the mesh. 286It will then detach and destroy its geom. 287.It 288The geom whose provider is now detached will destroy the provider, 289detach and destroy its consumer and destroy its geom. 290.It 291This process percolates all the way down through the mesh, until 292the cleanup is complete. 
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely, the requested exclusive bit would render it impossible
to open a path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple ways to express intent in this case; for example,
a particular provider may be specified with a level of override,
forcing a BSD disklabel module to attach to a provider which was not
found palatable during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
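.Pp
As an illustration, the start method of a simple offset-shifting
transformation could clone and reschedule a request roughly as in the
following sketch; the names
.Fn g_example_start
and
.Vt "struct g_example_softc"
are invented for this example, while
.Fn g_clone_bio ,
.Fn g_io_request
and friends are the framework's cloning and dispatch routines:
.Bd -literal -offset indent
struct g_example_softc {
	off_t	offset;			/* where our slice begins */
};

static void
g_example_start(struct bio *bp)
{
	struct g_geom *gp;
	struct g_example_softc *sc;
	struct bio *cbp;

	gp = bp->bio_to->geom;		/* the provider's geom */
	sc = gp->softc;
	cbp = g_clone_bio(bp);		/* clones metadata, not data */
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	cbp->bio_offset += sc->offset;	/* the geometric transformation */
	cbp->bio_done = g_std_done;	/* completes the original bio */
	g_io_request(cbp, LIST_FIRST(&gp->consumer));
}
.Ed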
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed if the underlying technology
supports it.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they become reassigned, and
cryptographic devices may want to fill the range with random bits
to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request, and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 Pq unused
.It 0x10 Pq allow foot shooting
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org