.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd September 10, 2013
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_AES
.Cd options GEOM_BDE
.Cd options GEOM_BSD
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_FOX
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MBR
.Cd options GEOM_MIRROR
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_PC98
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_PC98
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_SUNLABEL
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_VOL
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together.
Very often only one fixed hierarchy is provided, for instance,
subdisk - plex - volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks.
Instead, one is forced to make subdisks on the physical volumes and to mirror
these pairwise, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc ,
in other words a logical disk.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
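.Pp
The relationships listed above can be sketched as a small user-space model.
This is only an illustration under stated assumptions: the
.Vt model_*
types and functions are hypothetical names invented here, and the real
kernel structures carry many more fields and locking rules.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types, not the kernel's own structures. */
struct model_class {
    const char *name;
    int ngeoms;                       /* a class has zero or more geoms */
};

struct model_geom {
    struct model_class *cls;          /* exactly one class per geom */
};

struct model_provider {
    const char *name;                 /* "name" property */
    unsigned sectorsize;              /* "sectorsize" property, bytes */
    unsigned long long size;          /* "size" property, bytes */
    struct model_geom *geom;          /* geom offering this provider */
    int nconsumers;                   /* zero or more attached consumers */
};

struct model_consumer {
    struct model_geom *geom;          /* geom owning this consumer */
    struct model_provider *provider;  /* NULL while detached */
};

/* Attach a consumer; fails with -1 if it is already attached. */
int
model_attach(struct model_consumer *cp, struct model_provider *pp)
{
    assert(cp != NULL && pp != NULL);
    if (cp->provider != NULL)
        return (-1);                  /* zero or one providers only */
    cp->provider = pp;
    pp->nconsumers++;
    return (0);
}

/* Break the bond again; detaching twice is harmless in this model. */
void
model_detach(struct model_consumer *cp)
{
    assert(cp != NULL);
    if (cp->provider == NULL)
        return;
    cp->provider->nconsumers--;
    cp->provider = NULL;
}
```
.Ed
In this model several consumers may share one provider, while a second
attach on an already attached consumer is refused.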
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
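.Pp
The two ranking rules can be written down as a short function.
The sketch below is a user-space illustration, not the kernel's code:
.Vt rank_geom ,
.Fn rank_compute
and the fixed-size
.Va below
array are assumptions made for the example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

#define RANK_MAXCONS 4    /* arbitrary limit for this sketch */

/* Hypothetical model of a geom, reduced to what ranking needs. */
struct rank_geom {
    int rank;
    int ncons;                              /* attached consumers */
    struct rank_geom *below[RANK_MAXCONS];  /* geoms of the providers
                                               they are attached to */
};

/*
 * Rule 1: no attached consumers gives rank 1.
 * Rule 2: otherwise one more than the highest rank below.
 */
int
rank_compute(const struct rank_geom *gp)
{
    int i, highest = 0;

    assert(gp != NULL);
    assert(gp->ncons >= 0 && gp->ncons <= RANK_MAXCONS);
    for (i = 0; i < gp->ncons; i++) {
        if (gp->below[i]->rank > highest)
            highest = gp->below[i]->rank;
    }
    return (highest + 1);
}
```
.Ed
With these rules a disk geom gets rank 1, an MBR geom on top of it gets
rank 2, and a mirror consuming that MBR slice and a second bare disk gets
rank 3.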
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the class name of the provider's geom.
.El
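.Pp
As an illustration of the first two options, a taste check for an MBR-like
class might look roughly like the following.
This is a minimal sketch and not the code of any real class:
.Fn sketch_mbr_taste
is a hypothetical name, and it only tests the sector size and the
well-known 0x55 0xAA signature in the last two bytes of the first
512 bytes, whereas a real class would validate the partition table as well.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MBR_SIZE 512   /* the classic MBR occupies 512 bytes */

/*
 * Return 1 if the first sector looks like an MBR, 0 otherwise.
 * sector0 must hold at least sectorsize bytes read from the provider.
 */
int
sketch_mbr_taste(const unsigned char *sector0, size_t sectorsize)
{
    assert(sector0 != NULL);
    /* Property check: the sector must be able to hold an MBR. */
    if (sectorsize < SKETCH_MBR_SIZE)
        return (0);
    /* On-disk data check: boot signature at offsets 510 and 511. */
    return (sector0[510] == 0x55 && sector0[511] == 0xAA);
}
```
.Ed
A class that gets 0 back would simply decline the offered provider and
leave it to the other classes.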
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which have no consumers attached will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
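.Pp
The
.Dq bounce
behaviour described above can be modelled in a few lines.
The sketch below is a user-space illustration only:
.Vt orph_provider ,
.Fn orph_orphan
and
.Fn orph_request
are hypothetical names, and the event delivery to consumers is left out.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a provider that can be orphaned. */
struct orph_provider {
    int orphaned;   /* set once the geom orphans the provider */
    int error;      /* error code chosen by the geom */
    int nstarted;   /* requests that were actually passed on */
};

/* Mark the provider as orphaned with the given error code. */
void
orph_orphan(struct orph_provider *pp, int error)
{
    assert(pp != NULL && error != 0);
    pp->orphaned = 1;
    pp->error = error;
}

/*
 * Submit one request.  Returns 0 if it was passed on, or the
 * error code set at orphanization if it bounced.
 */
int
orph_request(struct orph_provider *pp)
{
    assert(pp != NULL);
    if (pp->orphaned)
        return (pp->error);
    pp->nstarted++;
    return (0);
}
```
.Ed
Note that orphaning does not touch requests that were already started;
in the model, as in the text, only future requests bounce.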
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
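.Pp
The final rule is an edge-triggered condition on the write count, which
can be sketched as follows.
The names
.Vt spoil_provider ,
.Vt spoil_event
and
.Fn spoil_access
are hypothetical and exist only in this example; the sketch ignores the
read and exclusive counts entirely.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

enum spoil_event { SPOIL_NONE, SPOIL_SPOIL, SPOIL_RETASTE };

/* Hypothetical model: only the write count matters here. */
struct spoil_provider {
    int writers;
};

/*
 * Apply a change dw to the write count and report which event,
 * if any, the transition triggers.
 */
enum spoil_event
spoil_access(struct spoil_provider *pp, int dw)
{
    int before, after;

    assert(pp != NULL);
    before = pp->writers;
    after = before + dw;
    assert(after >= 0);          /* cannot close more than was opened */
    pp->writers = after;
    if (before == 0 && after > 0)
        return (SPOIL_SPOIL);    /* first writer: spoil consumers */
    if (before > 0 && after == 0)
        return (SPOIL_RETASTE);  /* last writer gone: taste again */
    return (SPOIL_NONE);
}
```
.Ed
A second writer opening or closing while another writer is still present
therefore triggers neither spoiling nor retasting.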
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
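.Pp
The clone-and-modify step that a slicing geom performs on an I/O request
can be sketched in user space as follows.
This is an assumption-laden illustration rather than kernel code:
.Vt sketch_bio
is a hypothetical stand-in for
.Vt "struct bio"
with only four fields, and
.Fn sketch_clone
is an invented helper.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct bio. */
struct sketch_bio {
    long long offset;           /* byte offset on the target provider */
    long long length;           /* byte count */
    unsigned char *data;        /* data area, shared and never copied */
    struct sketch_bio *parent;  /* request this one was cloned from */
};

/*
 * Clone bp for a slice that starts slice_start bytes into the
 * lower provider and is slice_size bytes long.  Returns 0 on
 * success, -1 if the request does not fit inside the slice.
 */
int
sketch_clone(struct sketch_bio *bp, struct sketch_bio *clone,
    long long slice_start, long long slice_size)
{
    assert(bp != NULL && clone != NULL);
    if (bp->offset < 0 || bp->length < 0 ||
        bp->offset > slice_size ||
        bp->length > slice_size - bp->offset)
        return (-1);
    *clone = *bp;                /* copies the pointer, not the data */
    clone->offset = bp->offset + slice_start;
    clone->parent = bp;
    return (0);
}
```
.Ed
The original request is left untouched, so it can be completed towards
the consumer once the clone comes back from the lower provider.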
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
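.Pp
Because the flags form a bit mask, several of them can be combined in one
value.
The sketch below shows the arithmetic only; it does not read or write the
sysctl.
The
.Dv DBG_*
names and the helper functions are defined locally for this example and
are assumptions, with the numeric values copied from the list above.
.Bd -literal -offset indent
```c
#include <assert.h>

/* Values from the list above; names are local to this sketch. */
#define DBG_T_TOPOLOGY  0x01u
#define DBG_T_BIO       0x02u
#define DBG_T_ACCESS    0x04u
#define DBG_FOOTSHOOT   0x10u   /* hypothetical name for 0x10 */
#define DBG_F_DISKIOCTL 0x40u
#define DBG_F_CTLDUMP   0x80u

#define DBG_KNOWN (DBG_T_TOPOLOGY | DBG_T_BIO | DBG_T_ACCESS | \
    DBG_FOOTSHOOT | DBG_F_DISKIOCTL | DBG_F_CTLDUMP)

/* Return flags with one documented bit (or bit set) switched on. */
unsigned
dbg_enable(unsigned flags, unsigned bits)
{
    assert((bits & ~DBG_KNOWN) == 0);  /* refuse undocumented bits */
    return (flags | bits);
}

/* Does this value permit writing to rank 1 providers? */
int
dbg_allows_rank1_write(unsigned flags)
{
    return ((flags & DBG_FOOTSHOOT) != 0);
}

/* Return the bits in flags that the list above does not describe. */
unsigned
dbg_unknown_bits(unsigned flags)
{
    return (flags & ~DBG_KNOWN);
}
```
.Ed
For example, topology tracing combined with the foot-shooting override
gives the value 0x11, that is 17 in decimal.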
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr disk 9 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org