xref: /freebsd/sys/contrib/openzfs/man/man7/zpoolconcepts.7 (revision b1c1ee4429fcca8f69873a8be66184e68e1b19d7)
.\" SPDX-License-Identifier: CDDL-1.0
.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or https://opensource.org/licenses/CDDL-1.0.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd April 7, 2023
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices,
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing, without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A distributed-parity layout, similar to RAID-5/6, with improved distribution of
parity, and which does not suffer from the RAID-5/6
.Qq write hole ,
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group, though not
necessarily in a consistent stripe width.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
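.Pp
For example, the following creates a double-parity raidz group of six disks:
.Dl # Nm zpool Cm create Ar pool Sy raidz2 Ar sda sdb sdc sdd sde sdf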
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering, while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes (zvols) and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
In regards to I/O, performance is similar to raidz since, for any read, all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
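.Pp
For example, a dRAID vdev with
.Em N=11 No children , Em S=1 No spare , Em D=4 No data disks, and Em P=2
parity disks delivers approximately
.Sy floor((11-1)/(4+2)) No = Em 1 No times the random IOPS of a single drive.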
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
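.Pp
For example, a dRAID with
.Em N=11 No disks of size Em X No = Em 10 TiB , No with Em D=4 , P=2 , No and Em S=1
can hold approximately
.Em (11-1)*(4/(4+2))*10 TiB No \(ap Em 66.7 TiB
and can withstand
.Em 2
devices failing without losing data.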
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
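.Pp
For example, the following creates a double-parity dRAID vdev of eleven disks
with four data disks per redundancy group and one distributed hot spare:
.Dl # Nm zpool Cm create Ar pool Sy draid2:4d:1s:11c Ar sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk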
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device solely dedicated for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
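.Pp
For example, the following creates a mirrored pool with a mirrored dedup
device:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy dedup mirror Ar sdc sdd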
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
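.Pp
For example, the following creates a raidz pool with a mirrored special
device:
.Dl # Nm zpool Cm create Ar pool Sy raidz Ar sda sdb sdc Sy special mirror Ar sdd sde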
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested arbitrarily.
A mirror, raidz or draid virtual device can only be created with files or disks.
Mirrors of mirrors or other such combinations are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy, when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors or slow I/Os exceeds acceptable levels and the
device is degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool; when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again, if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which could lead to
data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be canceled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
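.Pp
For example, if hot spare
.Ar sdd
has replaced a failed device, the in-progress replacement can be canceled
with:
.Dl # Nm zpool Cm detach Ar pool sdd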
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
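.Pp
For instance, a mirrored pair of log devices can be specified at pool creation
time:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log mirror Ar sdc sdd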
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read-workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously when importing the pool in L2ARC (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
ZFS does not write the metadata structures
required for rebuilding the L2ARC, to conserve space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful in case of memory pressure,
where the contents of the cache device are not fully restored in L2ARC.
The user can off- and online the cache device when there is less memory
pressure, to fully restore its contents to L2ARC.
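.Pp
For example, the cache device
.Ar sdc
can be cycled offline and back online to fully restore its contents to L2ARC:
.Dl # Nm zpool Cm offline Ar pool sdc
.Dl # Nm zpool Cm online Ar pool sdc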
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks or zvol blocks
with per-dataset granularity.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file or zvol blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.
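.Pp
For example, the following stores file blocks of up to 32 KiB for a dataset in
the special class:
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 32K pool/dataset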