.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd July 8, 2024
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these in pairs, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer ,
represented by the data structure
.Vt g_consumer ,
is the backdoor through which a geom connects to another
geom's provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the class name of the provider's geom.
.El
.Pp
Tasting is controlled by the
.Va kern.geom.notaste
sysctl.
To disable tasting, set the sysctl to 1; to
re-enable tasting, set the sysctl to 0.
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the provider
for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers with no consumers attached will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case - a particular provider may be
specified with a level of override forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr geom 8 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was initially developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The following obsolete
.Nm
components were removed in
.Fx 13.0 :
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_BSD ,
.It
.Cd GEOM_FOX ,
.It
.Cd GEOM_MBR ,
.It
.Cd GEOM_SUNLABEL ,
and
.It
.Cd GEOM_VOL .
.El
.Pp
Use
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_PART_BSD ,
.It
.Cd GEOM_MULTIPATH ,
.It
.Cd GEOM_PART_MBR ,
and
.It
.Cd GEOM_LABEL
.El
options, respectively, instead.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org