.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd March 27, 2002
.Os FreeBSD 5.0
.Dt GEOM 4
.Sh NAME
.Nm GEOM
.Nd modular disk I/O request transformation framework
.Sh DESCRIPTION
The GEOM framework provides an infrastructure in which modules
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a GEOM context range from the simple geometric
displacement performed in typical disklabel modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional "volume management", GEOM differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
GEOM is extensible.  It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.  If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
GEOM is topologically agnostic.  Most volume management implementations
have very strict notions of how classes can fit together; very often
only one fixed hierarchy is provided, for instance subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above it is not possible to mirror two
physical disks and then partition the mirror into subdisks.
Instead, one is forced to make subdisks on the physical volumes and to
mirror these two by two, resulting in a much more complex configuration.
GEOM, on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
GEOM is quite object oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A "class", represented by the data structure "g_class", implements one
particular kind of transformation.  Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a "geom" and is represented by the
data structure "g_geom".  In a typical i386 FreeBSD system, there
will be one geom of class MBR for each disk.
.Pp
A "provider", represented by the data structure "g_provider", is
the front gate at which a geom offers service.
A provider is "a disk-like thing which appears in /dev" - a logical
disk in other words.
All providers have three main properties: name, sectorsize and size.
.Pp
A "consumer" is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
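.Pp
The relationships listed above can be modelled in a few lines of C.
This is a user-space sketch only: the toy_ names and their fields are
invented for illustration and are not the kernel's g_class, g_geom or
g_provider definitions.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; not the kernel data structures. */
struct toy_class {
	const char *name;               /* e.g. "MBR" */
};

struct toy_geom {
	const char *name;
	struct toy_class *cls;          /* exactly one class */
};

struct toy_provider {
	const char *name;               /* e.g. "da0s1" */
	unsigned sectorsize;
	unsigned long long size;        /* bytes */
	struct toy_geom *geom;          /* geom offering this provider */
	int nattached;                  /* zero or more consumers */
};

struct toy_consumer {
	struct toy_geom *geom;          /* geom owning this consumer */
	struct toy_provider *provider;  /* NULL or exactly one provider */
};

/* Attach a consumer; returns 0, or -1 if it is already attached. */
static int
toy_attach(struct toy_consumer *cp, struct toy_provider *pp)
{
	assert(cp != NULL && pp != NULL);
	if (cp->provider != NULL)
		return (-1);
	cp->provider = pp;
	pp->nattached++;
	return (0);
}

/* Detach breaks the bond; an unattached consumer is left alone. */
static void
toy_detach(struct toy_consumer *cp)
{
	assert(cp != NULL);
	if (cp->provider == NULL)
		return;
	cp->provider->nattached--;
	cp->provider = NULL;
}
```
.Ed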
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.  This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
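.Pp
The two rules amount to a short recursive computation.
The following is a sketch under stated assumptions, not the kernel
code: toy_geom and its below[] array, which records the geoms whose
providers this geom's consumers are attached to, are hypothetical.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXBELOW 8

/* Hypothetical toy geom: below[] holds the geoms whose providers
 * this geom's consumers are attached to. */
struct toy_geom {
	struct toy_geom *below[TOY_MAXBELOW];
	int nbelow;
};

/* Rank 1 with no attached consumers, otherwise one more than the
 * highest rank among the geoms below. */
static int
toy_rank(const struct toy_geom *gp)
{
	int i, r, max = 0;

	assert(gp != NULL);
	for (i = 0; i < gp->nbelow; i++) {
		r = toy_rank(gp->below[i]);
		if (r > max)
			max = r;
	}
	return (max + 1);
}
```
.Ed
.Pp
With a disk geom at the bottom, an MBR geom on top of it and a BSD
geom on top of that, this yields ranks 1, 2 and 3.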
.Sh "SPECIAL TOPOLOGICAL MANEUVRES"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvres exist to facilitate configuration and to
improve the overall flexibility.
.Pp
.Em TASTING
is a process that happens whenever a new class or new provider
is created, and it is the class's chance to automatically configure an
instance on providers that it recognizes as its own.
A typical example is the MBR disk-partition class, which will look for
the MBR table in the first sector and, if it is found and validated,
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by GEOM, but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like sectorsize or mediasize for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the class name of the provider's geom.
.El
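.Pp
As an illustration of the first two options, an MBR-style taste could
check the sector size and then the boot signature.
This is a sketch only: toy_taste_mbr is a hypothetical helper, not a
GEOM entry point, and a real class would also validate the partition
entries.
The bytes 0x55 0xAA at offsets 510 and 511 of the first sector are the
conventional MBR signature.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

#define TOY_MBR_SIG_OFFSET 510

/* Returns 1 if the first sector looks like an MBR, else 0. */
static int
toy_taste_mbr(const unsigned char *sector0, size_t sectorsize)
{
	assert(sector0 != NULL);
	/* Property check: the signature must fit in the sector. */
	if (sectorsize < 512)
		return (0);
	/* On-disk data structure check: the 0x55 0xAA signature. */
	if (sector0[TOY_MBR_SIG_OFFSET] != 0x55 ||
	    sector0[TOY_MBR_SIG_OFFSET + 1] != 0xAA)
		return (0);
	return (1);
}
```
.Ed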
.Pp
.Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom makes a provider an orphan, all future I/O requests will
"bounce" on the provider with an error code set by the geom.  Any
consumers attached to the provider will receive notification about
the orphanization and need to take appropriate action.
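.Pp
The "bounce" behaviour can be pictured with a small user-space model.
Everything here is an assumption made for illustration: toy_provider,
toy_orphan and toy_request are hypothetical names, and the real
notification of attached consumers is left out.
.Bd -literal
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy provider; error stays 0 until it is orphaned. */
struct toy_provider {
	int error;
	int nrequests;          /* requests passed on to the geom */
};

/* The geom orphans its provider and picks the error code. */
static void
toy_orphan(struct toy_provider *pp, int error)
{
	assert(pp != NULL && error != 0);
	pp->error = error;
}

/* Returns 0 if the request was passed on, else the bounce error. */
static int
toy_request(struct toy_provider *pp)
{
	assert(pp != NULL);
	if (pp->error != 0)
		return (pp->error);
	pp->nrequests++;
	return (0);
}
```
.Ed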
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning.  Geoms
like disklabels and stripes should therefore self-destruct, whereas
RAID5 or mirror geoms can continue to function as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not result in any immediate
change in the topology: any attached consumers are still attached and
any opened paths are still open.
It is the responsibility of the
geoms above to close and detach as soon as this can happen.
.Pp
The typical scenario is that a device driver notices a disk has
gone and orphans the provider for it.
The geoms on top receive the orphanization event and orphan all
their providers in turn.
Providers which are not attached to are destroyed right away.
Eventually, at the top level, the geom which interfaces
to DEVFS receives an orphan event on its consumer; it
calls destroy_dev(9), does an explicit close if the
device was open, and then detaches its consumer.
The provider below is now no longer attached to and can be
destroyed.
If the geom has no more providers, it can detach
its consumer and self-destruct, and so the carnage passes back
down the tree until the original provider is detached from
and can be destroyed by the geom serving the device driver.
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility in handling disappearing devices.
.Pp
.Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk, "da0", on top of which an MBR geom provides
"da0s1" and "da0s2", and on top of "da0s1" a BSD geom provides
"da0s1a" through "da0s1e".
Both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where "da0" is opened for writing and those
data structures are modified or overwritten: the geoms would now
be operating on stale metadata unless some notification system
can inform them otherwise.
To avoid this situation, when the open of "da0" for write happens,
all attached consumers are told about this, and geoms like
MBR and BSD will self-destruct as a result.
When "da0" is closed again, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, rendering it
impossible to open "da0" for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while "da0" is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done through their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting only when the write count goes
back to zero.
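.Pp
The write-count rule can be expressed as a small state transition.
This is a minimal sketch, assuming a hypothetical toy_provider that
tracks only its write-open count; toy_access and the event names are
invented for illustration.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

enum toy_event { TOY_NONE, TOY_SPOIL, TOY_RETASTE };

/* Hypothetical toy provider tracking only its write-open count. */
struct toy_provider {
	int wcount;
};

/* Apply a change dw to the write count and report which event,
 * if any, the transition triggers. */
static enum toy_event
toy_access(struct toy_provider *pp, int dw)
{
	int before;

	assert(pp != NULL);
	before = pp->wcount;
	assert(before + dw >= 0);
	pp->wcount = before + dw;
	if (before == 0 && pp->wcount > 0)
		return (TOY_SPOIL);
	if (before > 0 && pp->wcount == 0)
		return (TOY_RETASTE);
	return (TOY_NONE);
}
```
.Ed
.Pp
A second writer opening the same provider changes the count from one
to two and therefore triggers nothing.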
.Pp
.Em INSERT/DELETE
are very special operations which allow a new geom
to be instantiated between a consumer and a provider attached to
each other, and to be removed again.
.Pp
To understand the utility of this, imagine a provider
being mounted as a filesystem.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure yet another mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now in essence moved a mounted filesystem from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
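.Pp
Seen purely as pointer manipulation, insert and delete splice a geom
into an existing consumer-provider bond and take it out again.
The sketch below assumes a hypothetical pass-through toy_geom with one
provider on top and one consumer below; synchronization of the mirror
copies is not modelled.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

struct toy_provider { const char *name; };
struct toy_consumer { struct toy_provider *provider; };

/* A pass-through geom: one provider on top, one consumer below. */
struct toy_geom {
	struct toy_provider top;
	struct toy_consumer bottom;
};

/* Splice gp between cp and the provider cp is attached to. */
static void
toy_insert(struct toy_consumer *cp, struct toy_geom *gp)
{
	assert(cp->provider != NULL && gp->bottom.provider == NULL);
	gp->bottom.provider = cp->provider;
	cp->provider = &gp->top;
}

/* Remove gp again, reattaching cp to whatever gp sits on now. */
static void
toy_delete(struct toy_consumer *cp, struct toy_geom *gp)
{
	assert(cp->provider == &gp->top);
	cp->provider = gp->bottom.provider;
	gp->bottom.provider = NULL;
}
```
.Ed
.Pp
If the inserted geom is repointed at a second disk before it is
deleted, the consumer ends up attached to that second disk, which
mirrors the migration described above.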
.Pp
.Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.  There are multiple
ways to express intent in this case: a particular provider can be
specified with a level of override, forcing for instance a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.Pp
.Em "I/O REQUESTS" ,
represented by struct bio, originate at a consumer,
are scheduled on its attached provider, and when processed, are returned
to the consumer.
It is important to realize that the struct bio which
enters through the provider of a particular geom does not "come
out on the other side".
Even simple transformations like MBR and BSD will clone the
struct bio, modify the clone, and schedule the clone on their
own consumer.
Note that cloning the struct bio does not involve cloning the
actual data area specified in the I/O request.
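.Pp
The clone-and-modify step can be sketched as follows.
The toy_bio structure and its fields are hypothetical stand-ins and
not the kernel's struct bio; the point is only that the header is
copied while the data pointer is shared.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy request; not the kernel's struct bio. */
struct toy_bio {
	long long offset;        /* byte offset on the target */
	size_t length;           /* bytes to transfer */
	void *data;              /* data area, shared by clones */
	struct toy_bio *parent;  /* request this one was cloned from */
};

/* Clone bp into cbp: copy the header, share the data area. */
static void
toy_clone(struct toy_bio *bp, struct toy_bio *cbp)
{
	assert(bp != NULL && cbp != NULL);
	*cbp = *bp;
	cbp->parent = bp;
}

/* A slice beginning at byte "start" of the disk below only shifts
 * the offset of the clone; the original request is untouched. */
static void
toy_slice_start(struct toy_bio *bp, struct toy_bio *cbp, long long start)
{
	toy_clone(bp, cbp);
	cbp->offset += start;
}
```
.Ed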
.Pp
In total six different I/O requests exist in GEOM: read, write,
delete, format, get attribute, and set attribute.
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill the range with random
bits to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.  If "secure delete" semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
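.Pp
Such a conversion might look like the sketch below.
It is a user-space model under stated assumptions: toy_disk is a
hypothetical in-memory stand-in for the geom below, and overwriting
with zeros in 64-byte chunks is an arbitrary choice made for the
example, not something GEOM prescribes.
.Bd -literal
```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory "disk" standing in for the geom below. */
struct toy_disk {
	unsigned char *media;
	size_t size;
};

/* Write request; returns 0, or -1 if the range is out of bounds. */
static int
toy_write(struct toy_disk *dp, size_t offset, const void *buf, size_t len)
{
	assert(dp != NULL && buf != NULL);
	if (offset > dp->size || len > dp->size - offset)
		return (-1);
	memcpy(dp->media + offset, buf, len);
	return (0);
}

/* Turn a delete indication into a sequence of zero-filled writes. */
static int
toy_secure_delete(struct toy_disk *dp, size_t offset, size_t len)
{
	static const unsigned char zeros[64];
	size_t chunk;

	while (len > 0) {
		chunk = len < sizeof(zeros) ? len : sizeof(zeros);
		if (toy_write(dp, offset, zeros, chunk) != 0)
			return (-1);
		offset += chunk;
		len -= chunk;
	}
	return (0);
}
```
.Ed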
.Pp
Get attribute and set attribute support inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by ASCII strings and they will be discussed in
a separate section below.
.Pp
(stay tuned while the author rests his brain and fingers: more to come.)
.Sh HISTORY
This software was developed for the FreeBSD Project by Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.
under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
DARPA CHATS research program.
.Pp
The first precursor for GEOM was a gruesome hack to Minix 1.2 and was
never distributed.  An earlier attempt to implement a less general scheme
in FreeBSD never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org