.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd August 9, 2017
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_AES
.Cd options GEOM_BDE
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules over RAID
algorithms and device multipath resolution to full blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most
and in some cases all previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead,
one is forced to make subdisks on the physical volumes and to mirror
these two and two, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph will not be allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
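.Pp
As an illustration only, the four entities described above can be pictured
with a small user-space model in C.
This is a sketch under stated assumptions, not the kernel's code: every
.Li model_
name below is hypothetical, the structures carry only the properties
mentioned in this section, the provider size is assumed to be in bytes,
and the real
.Vt g_class ,
.Vt g_geom ,
.Vt g_provider ,
and
.Vt g_consumer
structures contain considerably more.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; NOT the kernel's g_* definitions. */
struct model_class {
	const char *name;		/* kind of transformation, e.g. "MBR" */
};

struct model_geom {
	const struct model_class *cls;	/* class this geom is an instance of */
	const char *name;
};

struct model_provider {
	const struct model_geom *geom;	/* geom offering this service */
	const char *name;		/* e.g. "da0s1" */
	uint32_t sectorsize;		/* bytes per sector */
	uint64_t size;			/* total size, assumed to be in bytes */
};

struct model_consumer {
	const struct model_geom *geom;	/* geom that owns this back door */
	struct model_provider *provider; /* NULL while unattached */
};

/*
 * Attach a consumer to a provider.  Returns 0 on success, or -1 if the
 * consumer is already attached or the provider belongs to the consumer's
 * own geom (the simplest possible cycle; a simplification of this model).
 */
static int
model_attach(struct model_consumer *cp, struct model_provider *pp)
{
	assert(cp != NULL && pp != NULL);
	if (cp->provider != NULL || cp->geom == pp->geom)
		return (-1);
	cp->provider = pp;
	return (0);
}

/* Whole sectors offered by the attached provider; 0 if unattached. */
static uint64_t
model_sectors(const struct model_consumer *cp)
{
	if (cp->provider == NULL || cp->provider->sectorsize == 0)
		return (0);
	return (cp->provider->size / cp->provider->sectorsize);
}
```
.Ed
.Pp
In this model, a consumer owned by an MBR geom would be attached to the
provider
.Pa da0
with
.Fn model_attach ,
after which
.Fn model_sectors
reports how many sectors that provider offers.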
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override that forces, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh OBSOLETE OPTIONS
.Pp
The following options have been deprecated and will be removed in
.Fx 12 :
.Cd GEOM_BSD ,
.Cd GEOM_FOX ,
.Cd GEOM_MBR ,
.Cd GEOM_SUNLABEL ,
and
.Cd GEOM_VOL .
.Pp
Use the
.Cd GEOM_PART_BSD ,
.Cd GEOM_MULTIPATH ,
.Cd GEOM_PART_MBR ,
.Cd GEOM_PART_VTOC8 ,
and
.Cd GEOM_LABEL
options, respectively, instead.
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org