.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd April 9, 2018
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_BDE
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS disk packs, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks.
Instead, one is forced to make subdisks on the physical volumes and to mirror
these pairwise, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc ,
in other words a logical disk.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the back door through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank 1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
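.Pp
As an illustration only, the two rank rules can be modelled in a few
lines of userland C.
The
.Vt model_geom
structure and the
.Fn model_rank
function are hypothetical names invented for this sketch; they are not
part of the kernel interface, and the kernel does not necessarily
compute ranks this way.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CONSUMERS 4

/* Hypothetical userland model of a geom; not the kernel's g_geom. */
struct model_geom {
	const char *name;
	/* Geoms owning the providers this geom's consumers attach to. */
	const struct model_geom *below[MODEL_MAX_CONSUMERS];
	size_t nbelow;
};

/*
 * Rule 1: a geom with no attached consumers has rank 1.
 * Rule 2: otherwise the rank is one higher than the highest rank
 * found among the geoms below it.
 */
static int
model_rank(const struct model_geom *gp)
{
	int highest = 0;
	size_t i;

	assert(gp->nbelow <= MODEL_MAX_CONSUMERS);
	for (i = 0; i < gp->nbelow; i++) {
		int r = model_rank(gp->below[i]);

		if (r > highest)
			highest = r;
	}
	/* With nbelow == 0 the loop is skipped and this returns 1. */
	return (highest + 1);
}
```
.Ed
.Pp
For example, a mirror geom whose two consumers are attached to two
rank 1 disks gets rank 2, and a partitioning geom on top of that
mirror gets rank 3.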
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the class name of the provider's geom.
.El
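.Pp
The first two options can be sketched in userland C for an MBR-style
class.
The sketch assumes the conventional MBR layout: a 0x55 0xAA signature
at byte offsets 510 and 511, and four 16-byte partition entries
starting at offset 446 with the type byte at offset 4 of each entry.
.Fn model_taste_mbr
is a hypothetical name, and a real class may validate more than this.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MBR_MINSIZE	512	/* smallest usable sector size */
#define MODEL_MBR_SIG_OFF	510	/* offset of the 0x55 0xAA signature */
#define MODEL_MBR_TABLE_OFF	446	/* offset of the partition table */
#define MODEL_MBR_ENTRY_SIZE	16
#define MODEL_MBR_NENTRIES	4
#define MODEL_MBR_TYPE_OFF	4	/* type byte within one entry */

/*
 * Taste the first sector of a provider.
 * Returns the number of partition entries in use, or -1 if the
 * sector does not look like an MBR and should be declined.
 */
static int
model_taste_mbr(const uint8_t *sector, size_t sectorsize)
{
	int i, used = 0;

	/* Examine a property of the provider: the sector size. */
	if (sector == NULL || sectorsize < MODEL_MBR_MINSIZE)
		return (-1);
	/* Examine a data structure on the disk: the signature. */
	if (sector[MODEL_MBR_SIG_OFF] != 0x55 ||
	    sector[MODEL_MBR_SIG_OFF + 1] != 0xAA)
		return (-1);
	for (i = 0; i < MODEL_MBR_NENTRIES; i++) {
		const uint8_t *entry = sector + MODEL_MBR_TABLE_OFF +
		    i * MODEL_MBR_ENTRY_SIZE;

		/* A type byte of zero marks an unused entry. */
		if (entry[MODEL_MBR_TYPE_OFF] != 0)
			used++;
	}
	return (used);
}
```
.Ed
.Pp
A zero-filled 512-byte sector is declined; once bytes 510 and 511 hold
0x55 and 0xAA it is accepted with zero entries in use.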
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning while
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers to which nothing is attached will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally, spoiling happens only when the write count goes from
zero to non-zero, and retasting happens only when the write count goes
from non-zero to zero.
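.Pp
The write-count rule can be sketched as a small userland state
machine.
The
.Vt model_provider
structure,
the event names and
.Fn model_write_access
are hypothetical and exist only for this illustration; the sketch
tracks a single write count and ignores the read and exclusive counts.
.Bd -literal -offset indent
```c
enum model_event {
	MODEL_NONE,	/* write count changed, nothing else happens */
	MODEL_SPOIL,	/* count went from zero to non-zero */
	MODEL_RETASTE,	/* count went from non-zero to zero */
	MODEL_ERROR	/* request would make the count negative */
};

/* Hypothetical model of a provider; only the write count is kept. */
struct model_provider {
	int write_count;
};

/*
 * Apply a change (positive for opens, negative for closes) to the
 * write count and report which event the transition triggers.
 */
static enum model_event
model_write_access(struct model_provider *pp, int delta)
{
	int before = pp->write_count;
	int after = before + delta;

	if (after < 0)
		return (MODEL_ERROR);	/* count is left unchanged */
	pp->write_count = after;
	if (before == 0 && after > 0)
		return (MODEL_SPOIL);
	if (before > 0 && after == 0)
		return (MODEL_RETASTE);
	return (MODEL_NONE);
}
```
.Ed
.Pp
Starting from a write count of zero, two opens for writing followed by
two closes yield spoil, none, none and retaste, in that order.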
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
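.Pp
The clone-and-modify pattern described for I/O requests can be
sketched in userland C.
Everything here is a hypothetical model:
.Vt model_bio
carries only a few illustrative fields and is not the kernel's
.Vt "struct bio" ,
and
.Fn model_clone
and
.Fn model_slice_start
are invented names for a slice-style geom that shifts offsets by the
start of the slice.
.Bd -literal -offset indent
```c
#include <stdint.h>
#include <stdlib.h>

/* The four request types named in the text. */
enum model_cmd { MODEL_READ, MODEL_WRITE, MODEL_DELETE, MODEL_GETATTR };

/* Hypothetical, much reduced model of an I/O request. */
struct model_bio {
	enum model_cmd cmd;
	uint64_t offset;		/* byte offset on the provider */
	uint64_t length;		/* byte count */
	void *data;			/* data area, shared with clones */
	struct model_bio *parent;	/* request this one was cloned from */
};

/* Clone the request descriptor only; the data area is not copied. */
static struct model_bio *
model_clone(struct model_bio *bp)
{
	struct model_bio *cbp = malloc(sizeof(*cbp));

	if (cbp == NULL)
		return (NULL);
	*cbp = *bp;
	cbp->parent = bp;
	return (cbp);
}

/*
 * Translate a request arriving on a slice that begins at byte
 * "start" of the lower provider and is "size" bytes long.
 * Returns a clone to schedule on the consumer, or NULL if the
 * request falls outside the slice or memory is exhausted.
 */
static struct model_bio *
model_slice_start(struct model_bio *bp, uint64_t start, uint64_t size)
{
	struct model_bio *cbp;

	if (bp->offset > size || bp->length > size - bp->offset)
		return (NULL);
	cbp = model_clone(bp);
	if (cbp != NULL)
		cbp->offset += start;	/* only the clone is modified */
	return (cbp);
}
```
.Ed
.Pp
With a slice starting at byte 32256, a request for offset 1024 is
cloned into a request for offset 33280 on the lower provider, while
the original request and its data buffer are left untouched.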
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
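.Pp
Because the flags are independent bits, a value read from
.Va kern.geom.debugflags
can be decoded with a simple table lookup.
The following userland sketch uses the bit values listed above;
.Vt model_flag
and
.Fn model_decode_debugflags
are hypothetical names, and the label used for 0x10 is taken from the
description above rather than from a kernel header.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_flag {
	unsigned int bit;
	const char *name;
};

/* Bit values as listed in the table above; 0x08 is unused. */
static const struct model_flag model_flags[] = {
	{ 0x01, "G_T_TOPOLOGY" },
	{ 0x02, "G_T_BIO" },
	{ 0x04, "G_T_ACCESS" },
	{ 0x10, "allow foot shooting" },
	{ 0x40, "G_F_DISKIOCTL" },
	{ 0x80, "G_F_CTLDUMP" },
};

/*
 * Write the comma-separated names of the known flags set in "value"
 * into "buf" and return the bits that are not in the table.
 * Names that do not fit into the buffer are skipped.
 */
static unsigned int
model_decode_debugflags(unsigned int value, char *buf, size_t buflen)
{
	size_t i, used = 0;

	assert(buf != NULL && buflen > 0);
	buf[0] = 0;
	for (i = 0; i < sizeof(model_flags) / sizeof(model_flags[0]); i++) {
		const char *name = model_flags[i].name;
		size_t len = strlen(name);

		if ((value & model_flags[i].bit) == 0)
			continue;
		value &= ~model_flags[i].bit;
		if (used + len + (used > 0 ? 1 : 0) >= buflen)
			continue;
		if (used > 0)
			buf[used++] = ',';
		memcpy(buf + used, name, len + 1);
		used += len;
	}
	return (value);
}
```
.Ed
.Pp
Decoding 0x11 produces the string
.Dq "G_T_TOPOLOGY,allow foot shooting"
and returns 0, while decoding 0x08 produces an empty string and
returns 0x08.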
.Sh OBSOLETE OPTIONS
The following options have been deprecated and will be removed in
.Fx 12 :
.Cd GEOM_BSD ,
.Cd GEOM_FOX ,
.Cd GEOM_MBR ,
.Cd GEOM_SUNLABEL ,
and
.Cd GEOM_VOL .
.Pp
Use the
.Cd GEOM_PART_BSD ,
.Cd GEOM_MULTIPATH ,
.Cd GEOM_PART_MBR ,
.Cd GEOM_PART_VTOC8 ,
and
.Cd GEOM_LABEL
options, respectively, instead.
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org