.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd October 6, 2023
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_BDE
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules over RAID
algorithms and device multipath resolution to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most
and in some cases all previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these two and two, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph will not be allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
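.Pp
The vocabulary above can be illustrated with a small userland model.
The following Python sketch is not kernel code and does not reflect the
layout of the real data structures; every name in it (Geom, Provider,
Consumer, new_provider, new_consumer, depends_on, attach) is hypothetical
and chosen only to mirror the terms used here.
It shows a consumer attaching to a provider, and an attach being refused
when it would create a cycle in the graph.
.Bd -literal -offset indent
```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical userland model of the GEOM vocabulary; these are not the
# kernel's g_class, g_geom, g_provider or g_consumer structures.
@dataclass(eq=False)
class Geom:
    name: str
    class_name: str  # the class this geom is an instance of
    providers: List["Provider"] = field(default_factory=list)
    consumers: List["Consumer"] = field(default_factory=list)


@dataclass(eq=False)
class Provider:
    geom: Geom
    name: str
    sectorsize: int
    size: int
    consumers: List["Consumer"] = field(default_factory=list)


@dataclass(eq=False)
class Consumer:
    geom: Geom
    provider: Optional[Provider] = None


def new_provider(geom: Geom, name: str, sectorsize: int, size: int) -> Provider:
    provider = Provider(geom, name, sectorsize, size)
    geom.providers.append(provider)
    return provider


def new_consumer(geom: Geom) -> Consumer:
    consumer = Consumer(geom)
    geom.consumers.append(consumer)
    return consumer


def depends_on(geom: Geom, target: Geom) -> bool:
    """Return True if geom is target or reaches it through its consumers."""
    if geom is target:
        return True
    return any(
        c.provider is not None and depends_on(c.provider.geom, target)
        for c in geom.consumers
    )


def attach(consumer: Consumer, provider: Provider) -> None:
    """Attach consumer to provider unless that would create a cycle."""
    if consumer.provider is not None:
        raise ValueError("consumer is already attached")
    if depends_on(provider.geom, consumer.geom):
        raise ValueError("attach would create a cycle")
    consumer.provider = provider
    provider.consumers.append(consumer)


# A disk geom offers da0; an MBR geom consumes da0 and offers da0s1.
disk = Geom("da0", "DISK")
da0 = new_provider(disk, "da0", sectorsize=512, size=1 << 30)
mbr = Geom("da0", "MBR")
mbr_consumer = new_consumer(mbr)
attach(mbr_consumer, da0)
da0s1 = new_provider(mbr, "da0s1", sectorsize=512, size=1 << 29)
```
.Ed
.Pp
In this model the MBR geom sits on top of the disk geom, so an attempt to
let the disk geom consume da0s1 would close a loop and is rejected by
attach with an error.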
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.Pp
Tasting is controlled by the
.Va kern.geom.notaste
sysctl.
To disable tasting, set the sysctl to 1; to
re-enable tasting, set the sysctl to 0.
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case - a particular provider may be
specified with a level of override forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they will become reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr geom 8 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was initially developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The following obsolete
.Nm
components were removed in
.Fx 13.0 :
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_BSD ,
.It
.Cd GEOM_FOX ,
.It
.Cd GEOM_MBR ,
.It
.Cd GEOM_SUNLABEL ,
and
.It
.Cd GEOM_VOL .
.El
.Pp
Use
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_PART_BSD ,
.It
.Cd GEOM_MULTIPATH ,
.It
.Cd GEOM_PART_MBR ,
and
.It
.Cd GEOM_LABEL
.El
options, respectively, instead.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org