.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd July 26, 2023
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_BDE
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric displacement performed by
typical disk partitioning modules, through RAID algorithms and device
multipath resolution, to full-blown cryptographic protection of the
stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most, and in some cases all, previous implementations in
the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class of transformation, and it
will not be given stepchild treatment.
If someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations have very strict notions of how
classes can fit together; very often one fixed hierarchy is provided,
for instance, subdisk - plex - volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead,
one is forced to create subdisks on the physical volumes and to mirror
those subdisks pairwise, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented, and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one particular kind of transformation.
Typical examples are MBR disk partition, BSD disklabel, and RAID5
classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - in other words, a logical disk.
All providers have three main properties:
.Dq name ,
.Dq sectorsize ,
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom is derived from exactly one class.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one provider.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
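.Pp
To make these relationships concrete, the following minimal sketch shows
roughly what the skeleton of a class looks like to its author.
The
.Dq EXAMPLE
class and the g_example_* names are purely illustrative and do not
correspond to any class shipped with the system; only the shape of the
declarations is of interest here.
.Bd -literal -offset indent
/*
 * Illustrative skeleton of a GEOM class.  The "EXAMPLE" class and the
 * g_example_* names are hypothetical; real classes fill in more
 * methods and real logic.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/bio.h>
#include <geom/geom.h>

static struct g_geom *
g_example_taste(struct g_class *mp, struct g_provider *pp, int flags)
{
        /*
         * Examine pp (read metadata, check "sectorsize", ...) and, if
         * it is recognized, create a geom, attach a consumer to pp and
         * expose new providers.  This sketch never accepts a provider.
         */
        return (NULL);
}

static void
g_example_orphan(struct g_consumer *cp)
{
        /* Called when the provider that cp is attached to goes away. */
}

static int
g_example_access(struct g_provider *pp, int dr, int dw, int de)
{
        struct g_consumer *cp;

        /* Propagate open, write and exclusive counts downwards. */
        cp = LIST_FIRST(&pp->geom->consumer);
        return (g_access(cp, dr, dw, de));
}

static void
g_example_start(struct bio *bp)
{
        /* Transform I/O requests arriving at this geom's providers. */
        g_io_deliver(bp, EOPNOTSUPP);
}

static struct g_class g_example_class = {
        .name = "EXAMPLE",
        .version = G_VERSION,
        .taste = g_example_taste,
        .orphan = g_example_orphan,
        .access = g_example_access,
        .start = g_example_start,
};

DECLARE_GEOM_CLASS(g_example_class, g_example);
.Ed
.Pp
Writing a new transformation thus largely amounts to filling in such
methods; the framework takes care of offering providers to the class
for tasting and of steering I/O requests to the start method.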
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class, which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly how a class decides whether it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which have no attached consumers will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
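.Pp
For a geom which exists only because its class tasted a provider, the
orphan method can be very small.
The following sketch fills in the
.Fn g_example_orphan
stub from the skeleton shown earlier; calling
.Fn g_wither_geom
is merely one convenient way to arrange the self-destruction:
.Bd -literal -offset indent
/*
 * Illustrative orphan method for the hypothetical g_example class: a
 * geom which only exists because it tasted a provider cannot function
 * without that provider, so it schedules its own destruction.
 */
static void
g_example_orphan(struct g_consumer *cp)
{
        g_topology_assert();    /* the topology lock is held here */
        g_wither_geom(cp->geom, ENXIO);
}
.Ed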
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally, spoiling only happens when the write count goes from
zero to non-zero, and retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process by which the administrator issues instructions
for a particular class to instantiate itself.
There are multiple ways to express intent in this case; for instance,
a particular provider may be specified with a level of override,
forcing a BSD disklabel module to attach to a provider which was not
found palatable during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
A sketch of this clone-and-forward pattern is shown after this list.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
permits.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they will become reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request, and consequently there is no guarantee that the data will
actually be erased or made unavailable unless specific geoms in the
graph guarantee it.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
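.Pp
The clone-and-forward pattern referred to above can be illustrated by
filling in the
.Fn g_example_start
stub from the skeleton shown earlier.
This is only a sketch: it assumes a hypothetical geom with exactly one
consumer and performs no transformation of its own.
.Bd -literal -offset indent
/*
 * Illustrative start method for the hypothetical g_example class:
 * clone each request arriving at one of its providers and hand the
 * clone to its single consumer, without transforming anything.
 */
static void
g_example_start(struct bio *bp)
{
        struct g_consumer *cp;
        struct bio *cbp;

        cp = LIST_FIRST(&bp->bio_to->geom->consumer);
        switch (bp->bio_cmd) {
        case BIO_READ:
        case BIO_WRITE:
        case BIO_DELETE:
        case BIO_GETATTR:
                /* The clone shares the data area; it is not copied. */
                cbp = g_clone_bio(bp);
                if (cbp == NULL) {
                        g_io_deliver(bp, ENOMEM);
                        return;
                }
                /*
                 * A real class would adjust cbp->bio_offset and
                 * friends, or answer BIO_GETATTR itself, at this
                 * point; g_std_done() completes the original bio
                 * once the clone is done.
                 */
                cbp->bio_done = g_std_done;
                g_io_request(cbp, cp);
                break;
        default:
                g_io_deliver(bp, EOPNOTSUPP);
                break;
        }
}
.Ed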
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr geom 8 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was initially developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The following obsolete
.Nm
components were removed in
.Fx 13.0 :
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_BSD ,
.It
.Cd GEOM_FOX ,
.It
.Cd GEOM_MBR ,
.It
.Cd GEOM_SUNLABEL ,
and
.It
.Cd GEOM_VOL .
.El
.Pp
Use
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_PART_BSD ,
.It
.Cd GEOM_MULTIPATH ,
.It
.Cd GEOM_PART_MBR ,
and
.It
.Cd GEOM_LABEL
.El
options, respectively, instead.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org