.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd July 26, 2023
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_BDE
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_ELI
.Cd options GEOM_GATE
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MAP
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_NOP
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_RAID
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_UZIP
.Cd options GEOM_VIRSTOR
.Cd options GEOM_ZERO
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks.
Instead, one is forced to make subdisks on the physical volumes and to mirror
these pairwise, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc ,
a logical disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
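.Pp
The two ranking rules can be illustrated with a minimal C sketch.
It is not the kernel implementation: the
.Vt sketch_geom ,
.Vt sketch_provider
and
.Vt sketch_consumer
types and the
.Fn sketch_rank
function are hypothetical, simplified stand-ins that model only the
relationships listed above, and the sketch assumes the graph is already
free of cycles.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_geom;

struct sketch_provider {
	struct sketch_geom *geom;		/* geom offering this provider */
};

struct sketch_consumer {
	struct sketch_provider *provider;	/* NULL while detached */
};

struct sketch_geom {
	struct sketch_consumer *consumers;	/* array of consumers */
	size_t nconsumers;
	int rank;				/* filled in by sketch_rank() */
};

/*
 * Rule 1: a geom with no attached consumers has rank 1.
 * Rule 2: otherwise the rank is one higher than the highest rank among
 * the geoms whose providers its consumers are attached to.
 * Assumes an acyclic graph; a cycle would recurse forever.
 */
int
sketch_rank(struct sketch_geom *gp)
{
	int highest = 0;

	for (size_t i = 0; i < gp->nconsumers; i++) {
		struct sketch_provider *pp = gp->consumers[i].provider;

		if (pp == NULL)
			continue;	/* detached consumers do not count */
		int below = sketch_rank(pp->geom);
		if (below > highest)
			highest = below;
	}
	gp->rank = highest + 1;
	return (gp->rank);
}
```
.Ed
.Pp
In this sketch a disk geom without consumers gets rank 1, a partitioning
geom attached to it gets rank 2, and a geom attached to one of the
resulting partitions gets rank 3.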
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it gives the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible options are:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
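.Pp
As an illustration of the first two options, the following C sketch shows
the kind of check an MBR-style taste could make on the first sector of an
offered provider.
It is an assumption-laden example rather than the code of any
.Nm
class: the function name
.Fn sketch_taste_mbr
and the constants are hypothetical, and the validation is limited to the
0x55 0xAA signature at the end of the sector and the boot-indicator byte
of each of the four partition entries.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MBR_SIZE		512	/* bytes inspected in sector 0 */
#define SKETCH_MBR_TABLE	446	/* offset of the partition table */
#define SKETCH_MBR_ENTRY	16	/* size of one partition entry */
#define SKETCH_MBR_ENTRIES	4

/*
 * Return 1 if the buffer looks like a valid MBR, 0 otherwise.
 * sector0 holds the first sector of the offered provider and
 * sectorsize is the provider's "sectorsize" property.
 */
int
sketch_taste_mbr(const uint8_t *sector0, size_t sectorsize)
{
	if (sector0 == NULL || sectorsize < SKETCH_MBR_SIZE)
		return (0);

	/* The last two bytes of the MBR carry the signature. */
	if (sector0[510] != 0x55 || sector0[511] != 0xAA)
		return (0);

	/* Each entry starts with a flag byte: 0x00 or 0x80 (active). */
	for (int i = 0; i < SKETCH_MBR_ENTRIES; i++) {
		uint8_t flag = sector0[SKETCH_MBR_TABLE + i * SKETCH_MBR_ENTRY];

		if (flag != 0x00 && flag != 0x80)
			return (0);
	}
	return (1);
}
```
.Ed
.Pp
A class that gets a positive answer from such a check would go on to
instantiate a geom on the provider; one that gets a negative answer simply
declines the offer.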
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
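.Pp
The transition rule in the last paragraph can be written down as a small
C sketch.
The enumeration and the function
.Fn sketch_write_count_change
are hypothetical names used only for illustration; the sketch models a
single provider's write count and nothing else.
.Bd -literal -offset indent
```c
/* What, if anything, a change of the write count triggers. */
enum sketch_event {
	SKETCH_NONE,	/* no topological consequence */
	SKETCH_SPOIL,	/* write count went from zero to non-zero */
	SKETCH_RETASTE	/* write count went from non-zero to zero */
};

/*
 * Apply delta to the write count (positive for an open for writing,
 * negative for the matching close) and report the resulting event.
 */
enum sketch_event
sketch_write_count_change(int *write_count, int delta)
{
	int before = *write_count;
	int after = before + delta;

	*write_count = after;
	if (before == 0 && after > 0)
		return (SKETCH_SPOIL);
	if (before > 0 && after == 0)
		return (SKETCH_RETASTE);
	return (SKETCH_NONE);
}
```
.Ed
.Pp
A second writer opening the same provider therefore triggers nothing new,
and retasting waits until the last writer has closed.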
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case.
For instance, a particular provider may be
specified with a level of override, forcing a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
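.Pp
The clone-and-modify step described for I/O requests can be sketched in C
as follows.
This is a simplified model, not the kernel's
.Vt "struct bio"
or its cloning routine: the
.Vt sketch_bio
type, its fields and
.Fn sketch_clone_for_slice
are hypothetical, and the only transformation shown is the offset
displacement a partitioning geom would apply.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal model of an I/O request. */
struct sketch_bio {
	int64_t offset;			/* byte offset on the provider */
	int64_t length;			/* number of bytes requested */
	void *data;			/* data area, shared with clones */
	struct sketch_bio *parent;	/* request this one was cloned from */
};

/*
 * Clone a request that arrived on a slice's provider into a request
 * for the underlying disk.  The slice starts slice_start bytes into
 * the disk and is slice_size bytes long.  Only the descriptor is
 * copied; the data pointer is shared.  Returns 0 on success and -1
 * if the request does not fit inside the slice.
 */
int
sketch_clone_for_slice(struct sketch_bio *parent, struct sketch_bio *clone,
    int64_t slice_start, int64_t slice_size)
{
	if (parent->offset < 0 || parent->length < 0 ||
	    parent->offset > slice_size ||
	    parent->length > slice_size - parent->offset)
		return (-1);

	*clone = *parent;		/* copies the pointer, not the data */
	clone->offset += slice_start;	/* the geometric displacement */
	clone->parent = parent;		/* needed to complete the original */
	return (0);
}
```
.Ed
.Pp
The original request is left untouched, so it can be completed and handed
back to its consumer once the clone returns from the geom below.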
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
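.Pp
Because the flags are single bits, a value of
.Va kern.geom.debugflags
is the bitwise OR of the entries above.
The following C sketch decodes such a value into flag names.
The macro values are copied from the list above so that the fragment is
self-contained; the label
.Dq FOOTSHOOTING
for 0x10, the table and the function
.Fn sketch_describe_debugflags
are hypothetical and are not taken from the kernel headers.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

/* Values as listed in this section. */
#define G_T_TOPOLOGY		0x01
#define G_T_BIO			0x02
#define G_T_ACCESS		0x04
#define SKETCH_FOOTSHOOTING	0x10	/* hypothetical name for 0x10 */
#define G_F_DISKIOCTL		0x40
#define G_F_CTLDUMP		0x80

static const struct {
	unsigned bit;
	const char *name;
} sketch_flags[] = {
	{ G_T_TOPOLOGY, "G_T_TOPOLOGY" },
	{ G_T_BIO, "G_T_BIO" },
	{ G_T_ACCESS, "G_T_ACCESS" },
	{ SKETCH_FOOTSHOOTING, "FOOTSHOOTING" },
	{ G_F_DISKIOCTL, "G_F_DISKIOCTL" },
	{ G_F_CTLDUMP, "G_F_CTLDUMP" },
};

/*
 * Write a comma-separated list of the flags set in value into buf
 * (truncated if buf is too small) and return how many were set.
 */
int
sketch_describe_debugflags(unsigned value, char *buf, size_t buflen)
{
	int count = 0;

	if (buf == NULL || buflen == 0)
		return (0);
	buf[0] = 0;
	for (size_t i = 0; i < sizeof(sketch_flags) / sizeof(sketch_flags[0]);
	    i++) {
		if ((value & sketch_flags[i].bit) == 0)
			continue;
		if (count > 0)
			strncat(buf, ",", buflen - strlen(buf) - 1);
		strncat(buf, sketch_flags[i].name, buflen - strlen(buf) - 1);
		count++;
	}
	return (count);
}
```
.Ed
.Pp
With these definitions the value 0x11 decodes to topology tracing plus the
foot-shooting override, and the unused bit 0x08 decodes to nothing.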
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr geom 8 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr disk 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was initially developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The following obsolete
.Nm
components were removed in
.Fx 13.0 :
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_BSD ,
.It
.Cd GEOM_FOX ,
.It
.Cd GEOM_MBR ,
.It
.Cd GEOM_SUNLABEL ,
and
.It
.Cd GEOM_VOL .
.El
.Pp
Use
.Bl -bullet -offset indent -compact
.It
.Cd GEOM_PART_BSD ,
.It
.Cd GEOM_MULTIPATH ,
.It
.Cd GEOM_PART_MBR ,
and
.It
.Cd GEOM_LABEL
.El
options, respectively, instead.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org