.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd June 8, 2015
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_AES
.Cd options GEOM_BDE
.Cd options GEOM_BSD
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_FOX
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MBR
.Cd options GEOM_MIRROR
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_PC98
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_PC98
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_SUNLABEL
.Cd options GEOM_UNCOMPRESS
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_VOL
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most, and in some cases all, previous implementations
in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these in pairs, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
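.Pp
As a rough illustration of the three terms introduced so far, the
following user-space C sketch models a class, a geom and a provider.
All struct and function names in it are hypothetical simplifications
made up for this example; they are not the kernel's
.Vt g_class ,
.Vt g_geom
and
.Vt g_provider
definitions.
.Bd -literal -offset indent
#include <assert.h>

/* Hypothetical stand-ins for illustration; not the kernel structures. */
struct model_class {
    const char *name;               /* kind of transformation, e.g. "MBR" */
};

struct model_geom {
    const char *name;               /* this instance, e.g. "da0" */
    const struct model_class *cls;  /* the class this geom instantiates */
};

struct model_provider {
    const char *name;               /* as seen in /dev, e.g. "da0s1" */
    unsigned int sectorsize;        /* bytes per sector */
    unsigned long long size;        /* total size in bytes */
    const struct model_geom *geom;  /* geom offering this provider */
};

/* Number of whole sectors the provider exposes. */
static unsigned long long
provider_sectors(const struct model_provider *pp)
{
    assert(pp->sectorsize != 0);
    return (pp->size / pp->sectorsize);
}
.Ed
.Pp
In this model, a provider named
.Pa da0s1
with a sectorsize of 512 and a size of 1048576 bytes exposes 2048 sectors.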
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to decide whether it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, and any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the
provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which have no consumers attached will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer, and finally destroy itself.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when
.Pa da0
is opened for writing,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally, spoiling happens only when the write count goes from
zero to non-zero, and retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different types of I/O request exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org