.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or https://opensource.org/licenses/CDDL-1.0.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd April 7, 2023
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices,
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing, without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A distributed-parity layout, similar to RAID-5/6, with improved distribution of
parity, and which does not suffer from the RAID-5/6
.Qq write hole ,
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group, though not
necessarily in a consistent stripe width.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering, while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes (zvols) and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
In regard to I/O, performance is similar to raidz since, for any read, all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device solely dedicated for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data are checksummed, and ZFS automatically repairs bad data
from a good copy, when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
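The capacity and IOPS approximations given above for raidz and dRAID can be
checked numerically; the following is a minimal sketch, where the disk counts,
disk sizes, and per-drive IOPS figures are illustrative values, not taken from
this page:

```python
import math

def raidz_capacity(n, p, x):
    """Approximate usable bytes of a raidz group:
    N disks of size X with P parity disks hold about (N-P)*X."""
    return (n - p) * x

def draid_capacity(n, s, d, p, x):
    """Approximate usable bytes of a dRAID vdev:
    (N-S)*(D/(D+P))*X, with S distributed spares and
    D data disks per redundancy group of parity P."""
    return (n - s) * (d / (d + p)) * x

def draid_random_iops(n, s, d, p, single_drive_iops):
    """Delivered random IOPS: floor((N-S)/(D+P))*single_drive_IOPS."""
    return math.floor((n - s) / (d + p)) * single_drive_iops

# A 6-wide raidz2 of 4 TB disks holds about (6-2)*4 = 16 TB.
print(raidz_capacity(6, 2, 4_000_000_000_000))
# An 11-disk draid2 with D=4 and one spare: (11-1)*(4/6)*4 TB, about 26.7 TB.
print(draid_capacity(11, 1, 4, 2, 4_000_000_000_000))
# The same dRAID with 250-IOPS drives: floor(10/6)*250 = 250.
print(draid_random_iops(11, 1, 4, 2, 250))
```

Note how little the dRAID random-read figure improves with width until another
full redundancy group of D+P disks fits: the floor() in the approximation is
why the fixed stripe width affects IOPS, not just capacity.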
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path, since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again, if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
potential data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices, such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low-latency media.
Using cache devices provides the greatest performance improvement for random
read workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously when importing the pool in L2ARC (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
ZFS does not write the metadata structures
required for rebuilding the L2ARC, to conserve space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful in case of memory pressure,
where the contents of the cache device are not fully restored in L2ARC.
The user can off- and online the cache device when there is less memory
pressure, to fully restore its contents to L2ARC.
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.
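The class-selection rules above can be summarized as a conceptual sketch.
This is not the actual ZFS allocator logic; the function name, parameters, and
simplifications (a single boolean for "class full", sizes in bytes) are
illustrative only:

```python
def allocation_class(block_size, is_metadata=False, is_ddt=False,
                     special_small_blocks=0, special_full=False,
                     ddt_data_is_special=True):
    """Model which class a block lands in when a pool has a special vdev.

    Metadata and, by default, deduplication-table (DDT) blocks go to the
    special class, as do file blocks no larger than the dataset's
    special_small_blocks property (0 means small-block inclusion is off).
    When the special class is full, allocations spill to the normal class.
    """
    wants_special = (
        is_metadata
        or (is_ddt and ddt_data_is_special)
        or (0 < block_size <= special_small_blocks)
    )
    if wants_special and not special_full:
        return "special"
    return "normal"

print(allocation_class(128 * 1024, is_metadata=True))            # special
print(allocation_class(16 * 1024, special_small_blocks=32768))   # special
print(allocation_class(128 * 1024, special_small_blocks=32768))  # normal
print(allocation_class(4096, is_metadata=True, special_full=True))  # normal
```

The last case models the spill-back behavior: even metadata falls through to
the normal class once the special class is full, which is why sizing the
special vdev generously matters for dRAID pools holding many small blocks.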