.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd April 9, 2018
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_BDE
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these two and two, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
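.Pp
As a rough illustration of the four entities described above, the following
C sketch models them with toy structures.
Every
.Vt toy_
name is hypothetical and chosen only for this example; the sketch does not
reproduce the kernel's
.Vt g_class ,
.Vt g_geom
or
.Vt g_provider .
The refusal to attach a geom to its own provider is an assumption made here
to show the simplest possible cycle.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All toy_* names are hypothetical; they are not the kernel structures. */
struct toy_class {
	const char	*name;		/* e.g. "MBR" */
};

struct toy_geom {
	const struct toy_class *cls;	/* the one class this geom instantiates */
	const char	*name;
};

struct toy_provider {
	struct toy_geom	*geom;		/* geom offering this provider */
	const char	*name;		/* e.g. "da0s1" */
	uint32_t	 sectorsize;	/* bytes per sector */
	uint64_t	 size;		/* total size in bytes */
};

struct toy_consumer {
	struct toy_geom	*geom;		/* geom owning this consumer */
	struct toy_provider *provider;	/* NULL while unattached */
};

/*
 * Attach a consumer to a provider.  Returns 0 on success, or -1 if the
 * consumer is already attached or the provider belongs to the same geom
 * (assumed here to be the simplest form of a cycle).
 */
static int
toy_attach(struct toy_consumer *cp, struct toy_provider *pp)
{
	assert(cp != NULL && pp != NULL);
	if (cp->provider != NULL || pp->geom == cp->geom)
		return (-1);
	cp->provider = pp;
	return (0);
}

/* Number of whole sectors a provider exposes. */
static uint64_t
toy_sector_count(const struct toy_provider *pp)
{
	assert(pp != NULL && pp->sectorsize != 0);
	return (pp->size / pp->sectorsize);
}
```
.Ed
.Pp
In this model, attaching the consumer of an MBR geom to the provider
.Pa da0
succeeds once, and a second attach of the same consumer is refused.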
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, and any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the provider
for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which have no consumers attached will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when
.Pa da0
is opened for writing,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different kinds of I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
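.Pp
The clone-and-modify step described above can be sketched in C as follows.
This is only an illustration under stated assumptions:
.Vt "struct toy_bio" ,
.Fn toy_clone_for_slice
and the
.Dv TOY_
constants are hypothetical names invented for this example, and the
kernel's real interface is the one documented in
.Xr g_bio 9 .
The sketch shows a slicing geom copying the request descriptor, shifting
the offset by the start of the slice, and leaving the data area shared.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request kinds, mirroring the four named in the text. */
enum toy_cmd { TOY_READ, TOY_WRITE, TOY_DELETE, TOY_GETATTR };

/* Hypothetical stand-in for an I/O request; not the kernel's struct bio. */
struct toy_bio {
	enum toy_cmd	 cmd;
	uint64_t	 offset;	/* byte offset on the provider */
	uint64_t	 length;	/* byte count */
	void		*data;		/* data area, shared with any clone */
	struct toy_bio	*parent;	/* request this one was cloned from */
};

/*
 * Clone a request that arrived on a slice provider so that it can be
 * scheduled on the consumer below.  Only the descriptor is copied; the
 * data pointer is shared.  Returns 0 on success, or -1 if the request
 * does not fit inside the slice.
 */
static int
toy_clone_for_slice(struct toy_bio *parent, struct toy_bio *clone,
    uint64_t slice_start, uint64_t slice_size)
{
	assert(parent != NULL && clone != NULL);
	if (parent->offset > slice_size ||
	    parent->length > slice_size - parent->offset)
		return (-1);
	*clone = *parent;		/* copies the descriptor only */
	clone->parent = parent;
	clone->offset = parent->offset + slice_start;
	return (0);
}
```
.Ed
.Pp
With a slice that starts at byte 32256, a request at offset 1024 on the
slice is scheduled at offset 33280 on the underlying provider, and both
descriptors point at the same data area.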
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh OBSOLETE OPTIONS
The following options have been deprecated and will be removed in
.Fx 12 :
.Cd GEOM_BSD ,
.Cd GEOM_FOX ,
.Cd GEOM_MBR ,
.Cd GEOM_SUNLABEL ,
and
.Cd GEOM_VOL .
.Pp
Use the
.Cd GEOM_PART_BSD ,
.Cd GEOM_MULTIPATH ,
.Cd GEOM_PART_MBR ,
.Cd GEOM_PART_VTOC8 ,
and
.Cd GEOM_LABEL
options, respectively, instead.
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org