.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or http://www.opensolaris.org/os/licensing.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd June 2, 2021
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A variation on RAID-5 that allows for better distribution of parity and
eliminates the RAID-5
.Qq write hole
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data.
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
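.Pp
For example, a
.Sy raidz2
group of six equally-sized disks can withstand the failure of any two of them
and provides roughly four disks' worth of usable capacity:
.Dl # Nm zpool Cm create Ar pool Sy raidz2 Ar sda sdb sdc sdd sde sdf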
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices.
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4kB No disk sectors, the minimum allocation size is Em 32kB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant number of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to I/O, performance is similar to raidz, since for any read all
.Em D No data disks must be accessed.
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
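For example, assuming disks that each deliver 250 random IOPS, a dRAID vdev
with
.Em N=32 , D=8 , P=2 , No and Em S=2
can be expected to deliver approximately
.Em floor((32-2)/(8+2))*250 No = 750 random IOPS.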
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group, Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
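.Pp
For example, one possible non-default configuration is a single-parity dRAID
built from six disks, with four data devices per redundancy group and one
distributed spare:
.Dl # Nm zpool Cm create Ar pool Sy draid1:4d:6c:1s Ar sda sdb sdc sdd sde sdf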
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device dedicated solely for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
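.Pp
For example, a mirrored dedup vdev might be paired with mirrored normal vdevs:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy dedup mirror Ar sdc sdd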
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs,
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to online the device automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
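For example, to add a spare to an existing pool and later remove it:
.Dl # Nm zpool Cm add Ar pool Sy spare Ar sde
.Dl # Nm zpool Cm remove Ar pool sde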
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
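For example, a mirrored log can be requested at pool creation time:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log mirror Ar sdc sdd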
.Pp
Log devices can be added, replaced, attached, detached and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read-workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots, and is restored
asynchronously into L2ARC when the pool is imported (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1GB ,
we do not write the metadata structures
required for rebuilding the L2ARC in order not to waste space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent L2ARC).
If a cache device is added with
.Nm zpool Cm add
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online
its contents will be restored in L2ARC.
This is useful in case of memory pressure
where the contents of the cache device are not fully restored in L2ARC.
The user can offline and online the cache device when there is less memory
pressure, in order to fully restore its contents to L2ARC.
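.Pp
For example, the contents of cache device
.Ar sdc
can be manually restored to L2ARC by cycling it:
.Dl # Nm zpool Cm offline Ar pool sdc
.Dl # Nm zpool Cm online Ar pool sdc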
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.
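.Pp
For example, a pool might be created with a mirrored
.Sy special
vdev, and a dataset then configured to also store blocks of up to 32 KiB in
the special class:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy special mirror Ar sdc sdd
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 32K pool/dataset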