.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd June 8, 2015
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_AES
.Cd options GEOM_BDE
.Cd options GEOM_BSD
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_FOX
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MBR
.Cd options GEOM_MIRROR
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_PC98
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_PC98
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_SUNLABEL
.Cd options GEOM_UNCOMPRESS
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_VOL
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most, and in some cases all, previous implementations
in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these pairwise, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank-number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
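.Pp
The relationships and the rank rule above can be illustrated with a small
model.
The following Python sketch is not kernel code: the classes
.Li Geom ,
.Li Provider
and
.Li Consumer
and the helpers
.Li rank ,
.Li reaches
and
.Li attach
are hypothetical names chosen for this example, and the cycle check here
walks the graph directly instead of comparing rank numbers.
.Bd -literal -offset indent
```python
# Illustrative model only; these are not the kernel's data structures.
class Geom:
    def __init__(self, name):
        self.name = name
        self.providers = []   # providers offered by this geom
        self.consumers = []   # consumers owned by this geom


class Provider:
    def __init__(self, geom, name):
        self.geom = geom
        self.name = name
        geom.providers.append(self)


class Consumer:
    def __init__(self, geom):
        self.geom = geom
        self.provider = None  # attached to zero or one provider
        geom.consumers.append(self)


def rank(geom):
    """Rank 1 without attached consumers, else one above the highest below."""
    below = [c.provider.geom for c in geom.consumers if c.provider is not None]
    if not below:
        return 1
    return 1 + max(rank(g) for g in below)


def reaches(start, target):
    """True if target is start itself or lies below start in the graph."""
    if start is target:
        return True
    return any(c.provider is not None and reaches(c.provider.geom, target)
               for c in start.consumers)


def attach(consumer, provider):
    """Attach unless the provider's geom already depends on the consumer's."""
    if reaches(provider.geom, consumer.geom):
        raise ValueError("attach would create a cycle")
    consumer.provider = provider
```
.Ed
.Pp
Under these assumptions a disk geom with no consumers has rank 1, an MBR
geom attached to its provider has rank 2, and a BSD geom attached to one of
the MBR geom's providers has rank 3.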
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize whether it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
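.Pp
The two directions of tasting can be sketched as follows.
This Python model is only an illustration: the
.Li Registry
class, its methods and the dictionary layout of a provider are hypothetical,
and the taste function checks nothing but the 0x55 0xAA signature at the end
of the first 512-byte sector, which is far less validation than a real MBR
class would perform.
.Bd -literal -offset indent
```python
def mbr_taste(provider):
    """Accept a provider whose first 512-byte sector ends in 0x55 0xAA."""
    sector0 = provider["data"][:512]
    return len(sector0) == 512 and sector0[510:512] == bytes([0x55, 0xAA])


class Registry:
    """Hypothetical bookkeeping of classes, providers and resulting geoms."""

    def __init__(self):
        self.classes = {}     # class name -> taste function
        self.providers = []   # every provider seen so far
        self.geoms = []       # (class name, provider name) pairs

    def _offer(self, class_name, provider):
        # The class decides for itself whether the provider is its own.
        if self.classes[class_name](provider):
            self.geoms.append((class_name, provider["name"]))

    def add_class(self, name, taste):
        self.classes[name] = taste
        for provider in self.providers:   # new class: taste all providers
            self._offer(name, provider)

    def add_provider(self, provider):
        self.providers.append(provider)
        for name in list(self.classes):   # new provider: offer to all classes
            self._offer(name, provider)
```
.Ed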
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
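.Pp
The first effect of orphanization, bouncing new requests while leaving the
topology untouched, can be modelled in a few lines.
The Python sketch below is an illustration under stated assumptions: the class
.Li OrphanableProvider
and its members are hypothetical, and
.Er ENXIO
is only assumed here as a plausible error code for a departed device.
.Bd -literal -offset indent
```python
import errno


class OrphanableProvider:
    """Hypothetical provider that bounces I/O once it has been orphaned."""

    def __init__(self, name):
        self.name = name
        self.error = 0            # 0 while healthy, errno value once orphaned
        self.consumers = []       # consumers stay attached after orphaning
        self.pending_events = []  # notifications for a later event loop pass

    def orphan(self, error=errno.ENXIO):
        # Record the error and queue a notification for each consumer;
        # nothing is detached or destroyed at this point.
        self.error = error
        self.pending_events.extend(self.consumers)

    def start_io(self, request):
        """Return 0 if the request is accepted, else the orphan error."""
        return self.error
```
.Ed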
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
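.Pp
The write count rule can be made concrete with a tiny state machine.
This Python sketch is a hypothetical model of a single provider: the class
.Li SpoilTracker
and its event strings are invented for the example and say nothing about how
the kernel records these transitions.
.Bd -literal -offset indent
```python
class SpoilTracker:
    """Tracks one provider's write count and logs spoil/retaste events."""

    def __init__(self):
        self.write_count = 0
        self.events = []

    def open_for_write(self):
        if self.write_count == 0:
            # zero -> non-zero: attached consumers are told about the write
            self.events.append("spoil")
        self.write_count += 1

    def close_write(self):
        if self.write_count == 0:
            raise RuntimeError("no writer to close")
        self.write_count -= 1
        if self.write_count == 0:
            # non-zero -> zero: the provider is offered for tasting again
            self.events.append("retaste")
```
.Ed
.Pp
In this model a second writer opening the provider while the first is still
active produces no further event, and retasting is logged only after the last
writer has closed.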
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
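.Pp
The clone-and-modify step of a simple slicing transformation can be sketched
as follows.
This Python model is illustrative only: the
.Li Bio
dataclass and the helper
.Li clone_for_slice
are hypothetical stand-ins and do not mirror the fields of the kernel's
.Vt "struct bio" .
.Bd -literal -offset indent
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bio:
    cmd: str                 # for example "read" or "write"
    offset: int              # byte offset on the provider it is scheduled on
    length: int
    data: bytearray          # data area, shared between request and clones
    parent: Optional["Bio"] = None


def clone_for_slice(bio, slice_start, slice_size):
    """Clone a request entering a slice and translate it to the disk below."""
    if bio.offset < 0 or bio.offset + bio.length > slice_size:
        raise ValueError("request outside the slice")
    # Only the bookkeeping is copied; the data area is the same object.
    return Bio(bio.cmd, bio.offset + slice_start, bio.length, bio.data,
               parent=bio)
```
.Ed
.Pp
The original request keeps its own offset, so the geom can complete it
towards the consumer once the clone has been returned from below.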
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
426(Stay tuned while the author rests his brain and fingers: more to come.)
427.Sh DIAGNOSTICS
428Several flags are provided for tracing
429.Nm
430operations and unlocking
431protection mechanisms via the
432.Va kern.geom.debugflags
433sysctl.
434All of these flags are off by default, and great care should be taken in
435turning them on.
436.Bl -tag -width indent
437.It 0x01 Pq Dv G_T_TOPOLOGY
438Provide tracing of topology change events.
439.It 0x02 Pq Dv G_T_BIO
440Provide tracing of buffer I/O requests.
441.It 0x04 Pq Dv G_T_ACCESS
442Provide tracing of access check controls.
443.It 0x08 (unused)
444.It 0x10 (allow foot shooting)
445Allow writing to Rank 1 providers.
446This would, for example, allow the super-user to overwrite the MBR on the root
447disk or write random sectors elsewhere to a mounted disk.
448The implications are obvious.
449.It 0x40 Pq Dv G_F_DISKIOCTL
450This is unused at this time.
451.It 0x80 Pq Dv G_F_CTLDUMP
452Dump contents of gctl requests.
453.El
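.Pp
Because the flags are single bits, a value of
.Va kern.geom.debugflags
is the bitwise OR of the entries above.
The Python sketch below decodes such a value; the bit values and names are
taken from the list above, while the table
.Li DEBUG_FLAGS
and the function
.Li decode_debugflags
are hypothetical helpers written for this example.
.Bd -literal -offset indent
```python
# Bit values as listed above; 0x08 is unused and therefore omitted.
DEBUG_FLAGS = {
    0x01: "G_T_TOPOLOGY",
    0x02: "G_T_BIO",
    0x04: "G_T_ACCESS",
    0x10: "allow foot shooting",
    0x40: "G_F_DISKIOCTL",
    0x80: "G_F_CTLDUMP",
}


def decode_debugflags(value):
    """Return the names of the documented bits set in value, lowest first."""
    return [name for bit, name in sorted(DEBUG_FLAGS.items()) if value & bit]
```
.Ed
.Pp
For instance, a value of 0x11 decodes to topology tracing combined with the
foot-shooting override.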
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org
488