1.\"
2.\" Copyright (c) 2002 Poul-Henning Kamp
3.\" Copyright (c) 2002 Networks Associates Technology, Inc.
4.\" All rights reserved.
5.\"
6.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
7.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
8.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
9.\" DARPA CHATS research program.
10.\"
11.\" Redistribution and use in source and binary forms, with or without
12.\" modification, are permitted provided that the following conditions
13.\" are met:
14.\" 1. Redistributions of source code must retain the above copyright
15.\"    notice, this list of conditions and the following disclaimer.
16.\" 2. Redistributions in binary form must reproduce the above copyright
17.\"    notice, this list of conditions and the following disclaimer in the
18.\"    documentation and/or other materials provided with the distribution.
19.\" 3. The names of the authors may not be used to endorse or promote
20.\"    products derived from this software without specific prior written
21.\"    permission.
22.\"
23.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
24.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
25.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
26.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
27.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
28.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
29.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
30.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
31.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
32.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
33.\" SUCH DAMAGE.
34.\"
35.\" $FreeBSD$
36.\"
.Dd March 27, 2002
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
a single fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these in pairs, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk, in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
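.Pp
As an illustration only, the following sketch shows how a class could make
itself known to
.Nm ;
the
.Dq EXAMPLE
class name and the method names are hypothetical placeholders and do not
correspond to any existing class:
.Bd -literal -offset indent
#include <sys/param.h>
#include <geom/geom.h>

/* Hypothetical methods of the sketched EXAMPLE class. */
static g_taste_t	g_example_taste;
static g_orphan_t	g_example_orphan;

static struct g_class g_example_class = {
	.name = "EXAMPLE",		/* class name shown in the topology */
	.version = G_VERSION,
	.taste = g_example_taste,	/* offered every new provider */
	.orphan = g_example_orphan,	/* called when a used provider departs */
};

DECLARE_GEOM_CLASS(g_example_class, g_example);
.Ed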
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom is derived from exactly one class.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to (see the sketch below).
.El
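.Pp
The rule above can be restated as the following sketch; this is merely an
illustration of the rank rule in C, not the actual kernel routine:
.Bd -literal -offset indent
#include <sys/param.h>
#include <geom/geom.h>

/* Sketch: compute the rank a geom should have from the geoms below it. */
static int
example_calc_rank(struct g_geom *gp)
{
	struct g_consumer *cp;
	int rank;

	rank = 1;			/* no attached consumers: rank 1 */
	LIST_FOREACH(cp, &gp->consumer, consumer) {
		if (cp->provider == NULL)	/* unattached consumer */
			continue;
		if (cp->provider->geom->rank + 1 > rank)
			rank = cp->provider->geom->rank + 1;
	}
	return (rank);
}
.Ed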
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers it recognizes as its own.
A typical example is the MBR disk-partition class, which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options includes:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
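.Pp
The following sketch shows the general shape of a taste method for the
hypothetical EXAMPLE class from the earlier sketch; the actual check for
the class's own on-disk magic is elided:
.Bd -literal -offset indent
#include <sys/param.h>
#include <sys/errno.h>
#include <geom/geom.h>

static struct g_geom *
g_example_taste(struct g_class *mp, struct g_provider *pp, int flags)
{
	struct g_geom *gp;
	struct g_consumer *cp;
	u_char *buf;
	int error, found;

	g_topology_assert();
	gp = g_new_geomf(mp, "%s.example", pp->name);
	cp = g_new_consumer(gp);
	found = 0;
	if (g_attach(cp, pp) == 0) {
		if (g_access(cp, 1, 0, 0) == 0) {
			g_topology_unlock();
			buf = g_read_data(cp, 0, pp->sectorsize, &error);
			g_topology_lock();
			if (buf != NULL) {
				/* ... look for this class's magic in buf ... */
				g_free(buf);
			}
			g_access(cp, -1, 0, 0);
		}
		if (!found)
			g_detach(cp);
	}
	if (found) {
		/* Create providers here, e.g. with g_new_providerf(). */
		return (gp);
	}
	g_destroy_consumer(cp);
	g_destroy_geom(gp);
	return (NULL);
}
.Ed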
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning without
the orphaned provider.
Geoms like disk slicers should therefore self-destruct, whereas
RAID5 or mirror geoms will be able to continue, as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until all (if any) outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now no longer attached will destroy the provider,
detach and destroy its consumer, and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
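.Pp
Continuing the hypothetical EXAMPLE class, an orphan method for a geom
which has no way of functioning without the orphaned provider can be as
simple as the following sketch:
.Bd -literal -offset indent
#include <sys/param.h>
#include <sys/errno.h>
#include <geom/geom.h>

static void
g_example_orphan(struct g_consumer *cp)
{

	g_topology_assert();
	/*
	 * Orphan our own providers in turn and schedule the geom for
	 * destruction once all outstanding references have gone away.
	 */
	g_wither_geom(cp->geom, ENXIO);
}
.Ed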
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ;
both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this, and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed again, it will be offered for tasting again
and if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, rendering it
impossible to open
.Pa da0
for writing in that case; conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero, and the retasting happens only when the write count goes
from non-zero to zero.
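.Pp
A class which autoconfigures from on-disk metadata reacts to spoiling
through the spoiled method of its geoms.
As a sketch, again using the hypothetical EXAMPLE class, such a method can
simply self-destruct and rely on the subsequent retaste to rebuild the
geom if the metadata is still valid:
.Bd -literal -offset indent
#include <sys/param.h>
#include <sys/errno.h>
#include <geom/geom.h>

static void
g_example_spoiled(struct g_consumer *cp)
{

	g_topology_assert();
	/* Our metadata may now be stale; go away and let retaste rebuild. */
	g_wither_geom(cp->geom, ENXIO);
}
.Ed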
.It Em INSERT/DELETE
are very special operations which allow a new geom
to be instantiated between a consumer and a provider attached to
each other, and to remove it again.
.Pp
To understand the utility of this, imagine a provider
being mounted as a file system.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure a second mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now in essence moved a mounted file system from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case; a particular provider can be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
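.Pp
The following sketch illustrates this cloning for a simple displacing
transformation; the
.Vt g_example_softc
structure and the offset stored in it are hypothetical stand-ins for
whatever a real class would have gleaned from its on-disk metadata:
.Bd -literal -offset indent
#include <sys/param.h>
#include <sys/bio.h>
#include <sys/errno.h>
#include <geom/geom.h>

struct g_example_softc {		/* hypothetical per-geom state */
	off_t	offset;			/* displacement in bytes */
};

static void
g_example_start(struct bio *bp)
{
	struct g_geom *gp;
	struct g_example_softc *sc;
	struct bio *bp2;

	gp = bp->bio_to->geom;		/* provider the request arrived on */
	sc = gp->softc;
	switch (bp->bio_cmd) {
	case BIO_READ:
	case BIO_WRITE:
	case BIO_DELETE:
		bp2 = g_clone_bio(bp);	/* the data area is not copied */
		if (bp2 == NULL) {
			g_io_deliver(bp, ENOMEM);
			return;
		}
		bp2->bio_offset += sc->offset;	/* geometric displacement */
		bp2->bio_done = g_std_done;	/* completes the original bio */
		g_io_request(bp2, LIST_FIRST(&gp->consumer));
		return;
	default:
		g_io_deliver(bp, EOPNOTSUPP);
		return;
	}
}
.Ed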
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill the range with random bits
to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
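.Pp
On the provider side such queries arrive as
.Dv BIO_GETATTR
requests in the start method, where helpers like
.Fn g_handleattr_int
can answer them.
On the consumer side a query can be issued as in the following sketch;
the function name is hypothetical and the attribute is one of the
conventional firmware geometry attributes:
.Bd -literal -offset indent
#include <sys/param.h>
#include <geom/geom.h>

/* Sketch: ask the provider below us how many sectors per track it claims. */
static int
example_query_fwsectors(struct g_consumer *cp, int *fwsectors)
{
	int error, len;

	len = sizeof(*fwsectors);
	error = g_io_getattr("GEOM::fwsectors", cp, &len, fwsectors);
	return (error);
}
.Ed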
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x20 Pq Dv G_T_DETAILS
This appears to be unused at this time.
.It 0x40 Pq Dv G_F_DISKIOCTL
This appears to be unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org