/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2012, 2020 by Delphix. All rights reserved.
 * Copyright (c) 2017, Intel Corporation.
 */

/*
 * Virtual Device Labels
 * ---------------------
 *
 * The vdev label serves several distinct purposes:
 *
 *	1. Uniquely identify this device as part of a ZFS pool and confirm its
 *	   identity within the pool.
 *
 *	2. Verify that all the devices given in a configuration are present
 *	   within the pool.
 *
 *	3. Determine the uberblock for the pool.
 *
 *	4. In case of an import operation, determine the configuration of the
 *	   toplevel vdev of which it is a part.
 *
 *	5. If an import operation cannot find all the devices in the pool,
 *	   provide enough information to the administrator to determine which
 *	   devices are missing.
 *
 * It is important to note that while the kernel is responsible for writing the
 * label, it only consumes the information in the first three cases. The
 * latter information is only consumed in userland when determining the
 * configuration to import a pool.
 *
 *
 * Label Organization
 * ------------------
 *
 * Before describing the contents of the label, it's important to understand how
 * the labels are written and updated with respect to the uberblock.
 *
 * When the pool configuration is altered, either because it was newly created
 * or a device was added, we want to update all the labels such that we can deal
 * with fatal failure at any point. To this end, each disk has two labels which
 * are updated before and after the uberblock is synced. Assuming we have
 * labels and an uberblock with the following transaction groups:
 *
 *	   L1          UB          L2
 *	+------+    +------+    +------+
 *	|      |    |      |    |      |
 *	| t10  |    | t10  |    | t10  |
 *	|      |    |      |    |      |
 *	+------+    +------+    +------+
 *
 * In this stable state, the labels and the uberblock were all updated within
 * the same transaction group (10). Each label is mirrored and checksummed, so
 * that we can detect when we fail partway through writing the label.
 *
 * In order to identify which labels are valid, the labels are written in the
 * following manner:
 *
 *	1. For each vdev, update 'L1' to the new label
 *	2. Update the uberblock
 *	3. For each vdev, update 'L2' to the new label
 *
 * Given arbitrary failure, we can determine the correct label to use based on
 * the transaction group. If we fail after updating L1 but before updating the
 * UB, we will notice that L1's transaction group is greater than the uberblock,
 * so L2 must be valid. If we fail after writing the uberblock but before
 * writing L2, we will notice that L2's transaction group is less than L1, and
 * therefore L1 is valid.
 *
 * Another added complexity is that not every label is updated when the config
 * is synced. If we add a single device, we do not want to have to re-write
 * every label for every device in the pool. This means that both L1 and L2 may
 * be older than the pool uberblock, because the necessary information is stored
 * on another vdev.
 *
 *
 * On-disk Format
 * --------------
 *
 * The vdev label consists of two distinct parts, and is wrapped within the
 * vdev_label_t structure. The label includes 8k of padding to permit legacy
 * VTOC disk labels, but is otherwise ignored.
 *
 * The first half of the label is a packed nvlist which contains pool wide
 * properties, per-vdev properties, and configuration information. It is
 * described in more detail below.
 *
 * The latter half of the label consists of a redundant array of uberblocks.
 * These uberblocks are updated whenever a transaction group is committed,
 * or when the configuration is updated.
 * When a pool is loaded, we scan each vdev for the 'best' uberblock.
 *
 *
 * Configuration Information
 * -------------------------
 *
 * The nvlist describing the pool and vdev contains the following elements:
 *
 *	version		ZFS on-disk version
 *	name		Pool name
 *	state		Pool state
 *	txg		Transaction group in which this label was written
 *	pool_guid	Unique identifier for this pool
 *	vdev_tree	An nvlist describing vdev tree.
 *	features_for_read
 *			An nvlist of the features necessary for reading the MOS.
 *
 * Each leaf device label also contains the following:
 *
 *	top_guid	Unique ID for top-level vdev in which this is contained
 *	guid		Unique ID for the leaf vdev
 *
 * The 'vs' configuration follows the format described in 'spa_config.c'.
 */

#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/dmu.h>
#include <sys/zap.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
#include <sys/uberblock_impl.h>
#include <sys/metaslab.h>
#include <sys/metaslab_impl.h>
#include <sys/zio.h>
#include <sys/dsl_scan.h>
#include <sys/abd.h>
#include <sys/fs/zfs.h>

/*
 * Basic routines to read and write from a vdev label.
 * Used throughout the rest of this file.
 */
uint64_t
vdev_label_offset(uint64_t psize, int l, uint64_t offset)
{
	ASSERT(offset < sizeof (vdev_label_t));
	ASSERT(P2PHASE_TYPED(psize, sizeof (vdev_label_t), uint64_t) == 0);

	return (offset + l * sizeof (vdev_label_t) + (l < VDEV_LABELS / 2 ?
	    0 : psize - VDEV_LABELS * sizeof (vdev_label_t)));
}

/*
 * Returns the vdev label associated with the passed-in offset.
 */
int
vdev_label_number(uint64_t psize, uint64_t offset)
{
	int l;

	if (offset >= psize - VDEV_LABEL_END_SIZE) {
		offset -= psize - VDEV_LABEL_END_SIZE;
		offset += (VDEV_LABELS / 2) * sizeof (vdev_label_t);
	}
	l = offset / sizeof (vdev_label_t);
	return (l < VDEV_LABELS ? l : -1);
}

static void
vdev_label_read(zio_t *zio, vdev_t *vd, int l, abd_t *buf, uint64_t offset,
    uint64_t size, zio_done_func_t *done, void *private, int flags)
{
	ASSERT(
	    spa_config_held(zio->io_spa, SCL_STATE, RW_READER) == SCL_STATE ||
	    spa_config_held(zio->io_spa, SCL_STATE, RW_WRITER) == SCL_STATE);
	ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);

	zio_nowait(zio_read_phys(zio, vd,
	    vdev_label_offset(vd->vdev_psize, l, offset),
	    size, buf, ZIO_CHECKSUM_LABEL, done, private,
	    ZIO_PRIORITY_SYNC_READ, flags, B_TRUE));
}

void
vdev_label_write(zio_t *zio, vdev_t *vd, int l, abd_t *buf, uint64_t offset,
    uint64_t size, zio_done_func_t *done, void *private, int flags)
{
	ASSERT(
	    spa_config_held(zio->io_spa, SCL_STATE, RW_READER) == SCL_STATE ||
	    spa_config_held(zio->io_spa, SCL_STATE, RW_WRITER) == SCL_STATE);
	ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);

	zio_nowait(zio_write_phys(zio, vd,
	    vdev_label_offset(vd->vdev_psize, l, offset),
	    size, buf, ZIO_CHECKSUM_LABEL, done, private,
	    ZIO_PRIORITY_SYNC_WRITE, flags, B_TRUE));
}

/*
 * Generate the nvlist representing this vdev's stats
 */
void
vdev_config_generate_stats(vdev_t *vd, nvlist_t *nv)
{
	nvlist_t *nvx;
	vdev_stat_t *vs;
	vdev_stat_ex_t *vsx;

	vs = kmem_alloc(sizeof (*vs), KM_SLEEP);
	vsx = kmem_alloc(sizeof (*vsx), KM_SLEEP);

	vdev_get_stats_ex(vd, vs, vsx);
	fnvlist_add_uint64_array(nv, ZPOOL_CONFIG_VDEV_STATS,
	    (uint64_t *)vs, sizeof (*vs) / sizeof (uint64_t));

	/*
	 * Add extended stats into a special extended stats nvlist. This keeps
	 * all the extended stats nicely grouped together. The extended stats
	 * nvlist is then added to the main nvlist.
	 */
	nvx = fnvlist_alloc();

	/* ZIOs in flight to disk */
	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SYNC_R_ACTIVE_QUEUE,
	    vsx->vsx_active_queue[ZIO_PRIORITY_SYNC_READ]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SYNC_W_ACTIVE_QUEUE,
	    vsx->vsx_active_queue[ZIO_PRIORITY_SYNC_WRITE]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_ASYNC_R_ACTIVE_QUEUE,
	    vsx->vsx_active_queue[ZIO_PRIORITY_ASYNC_READ]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_ASYNC_W_ACTIVE_QUEUE,
	    vsx->vsx_active_queue[ZIO_PRIORITY_ASYNC_WRITE]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SCRUB_ACTIVE_QUEUE,
	    vsx->vsx_active_queue[ZIO_PRIORITY_SCRUB]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_TRIM_ACTIVE_QUEUE,
	    vsx->vsx_active_queue[ZIO_PRIORITY_TRIM]);

	/* ZIOs pending */
	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SYNC_R_PEND_QUEUE,
	    vsx->vsx_pend_queue[ZIO_PRIORITY_SYNC_READ]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SYNC_W_PEND_QUEUE,
	    vsx->vsx_pend_queue[ZIO_PRIORITY_SYNC_WRITE]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_ASYNC_R_PEND_QUEUE,
	    vsx->vsx_pend_queue[ZIO_PRIORITY_ASYNC_READ]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_ASYNC_W_PEND_QUEUE,
	    vsx->vsx_pend_queue[ZIO_PRIORITY_ASYNC_WRITE]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SCRUB_PEND_QUEUE,
	    vsx->vsx_pend_queue[ZIO_PRIORITY_SCRUB]);

	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_TRIM_PEND_QUEUE,
	    vsx->vsx_pend_queue[ZIO_PRIORITY_TRIM]);

	/* Histograms */
	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_TOT_R_LAT_HISTO,
	    vsx->vsx_total_histo[ZIO_TYPE_READ],
	    ARRAY_SIZE(vsx->vsx_total_histo[ZIO_TYPE_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_TOT_W_LAT_HISTO,
	    vsx->vsx_total_histo[ZIO_TYPE_WRITE],
	    ARRAY_SIZE(vsx->vsx_total_histo[ZIO_TYPE_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_DISK_R_LAT_HISTO,
	    vsx->vsx_disk_histo[ZIO_TYPE_READ],
	    ARRAY_SIZE(vsx->vsx_disk_histo[ZIO_TYPE_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_DISK_W_LAT_HISTO,
	    vsx->vsx_disk_histo[ZIO_TYPE_WRITE],
	    ARRAY_SIZE(vsx->vsx_disk_histo[ZIO_TYPE_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SYNC_R_LAT_HISTO,
	    vsx->vsx_queue_histo[ZIO_PRIORITY_SYNC_READ],
	    ARRAY_SIZE(vsx->vsx_queue_histo[ZIO_PRIORITY_SYNC_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SYNC_W_LAT_HISTO,
	    vsx->vsx_queue_histo[ZIO_PRIORITY_SYNC_WRITE],
	    ARRAY_SIZE(vsx->vsx_queue_histo[ZIO_PRIORITY_SYNC_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_ASYNC_R_LAT_HISTO,
	    vsx->vsx_queue_histo[ZIO_PRIORITY_ASYNC_READ],
	    ARRAY_SIZE(vsx->vsx_queue_histo[ZIO_PRIORITY_ASYNC_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_ASYNC_W_LAT_HISTO,
	    vsx->vsx_queue_histo[ZIO_PRIORITY_ASYNC_WRITE],
	    ARRAY_SIZE(vsx->vsx_queue_histo[ZIO_PRIORITY_ASYNC_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SCRUB_LAT_HISTO,
	    vsx->vsx_queue_histo[ZIO_PRIORITY_SCRUB],
	    ARRAY_SIZE(vsx->vsx_queue_histo[ZIO_PRIORITY_SCRUB]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_TRIM_LAT_HISTO,
	    vsx->vsx_queue_histo[ZIO_PRIORITY_TRIM],
	    ARRAY_SIZE(vsx->vsx_queue_histo[ZIO_PRIORITY_TRIM]));

	/* Request sizes */
	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SYNC_IND_R_HISTO,
	    vsx->vsx_ind_histo[ZIO_PRIORITY_SYNC_READ],
	    ARRAY_SIZE(vsx->vsx_ind_histo[ZIO_PRIORITY_SYNC_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SYNC_IND_W_HISTO,
	    vsx->vsx_ind_histo[ZIO_PRIORITY_SYNC_WRITE],
	    ARRAY_SIZE(vsx->vsx_ind_histo[ZIO_PRIORITY_SYNC_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_ASYNC_IND_R_HISTO,
	    vsx->vsx_ind_histo[ZIO_PRIORITY_ASYNC_READ],
	    ARRAY_SIZE(vsx->vsx_ind_histo[ZIO_PRIORITY_ASYNC_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_ASYNC_IND_W_HISTO,
	    vsx->vsx_ind_histo[ZIO_PRIORITY_ASYNC_WRITE],
	    ARRAY_SIZE(vsx->vsx_ind_histo[ZIO_PRIORITY_ASYNC_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_IND_SCRUB_HISTO,
	    vsx->vsx_ind_histo[ZIO_PRIORITY_SCRUB],
	    ARRAY_SIZE(vsx->vsx_ind_histo[ZIO_PRIORITY_SCRUB]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_IND_TRIM_HISTO,
	    vsx->vsx_ind_histo[ZIO_PRIORITY_TRIM],
	    ARRAY_SIZE(vsx->vsx_ind_histo[ZIO_PRIORITY_TRIM]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SYNC_AGG_R_HISTO,
	    vsx->vsx_agg_histo[ZIO_PRIORITY_SYNC_READ],
	    ARRAY_SIZE(vsx->vsx_agg_histo[ZIO_PRIORITY_SYNC_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_SYNC_AGG_W_HISTO,
	    vsx->vsx_agg_histo[ZIO_PRIORITY_SYNC_WRITE],
	    ARRAY_SIZE(vsx->vsx_agg_histo[ZIO_PRIORITY_SYNC_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_ASYNC_AGG_R_HISTO,
	    vsx->vsx_agg_histo[ZIO_PRIORITY_ASYNC_READ],
	    ARRAY_SIZE(vsx->vsx_agg_histo[ZIO_PRIORITY_ASYNC_READ]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_ASYNC_AGG_W_HISTO,
	    vsx->vsx_agg_histo[ZIO_PRIORITY_ASYNC_WRITE],
	    ARRAY_SIZE(vsx->vsx_agg_histo[ZIO_PRIORITY_ASYNC_WRITE]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_AGG_SCRUB_HISTO,
	    vsx->vsx_agg_histo[ZIO_PRIORITY_SCRUB],
	    ARRAY_SIZE(vsx->vsx_agg_histo[ZIO_PRIORITY_SCRUB]));

	fnvlist_add_uint64_array(nvx, ZPOOL_CONFIG_VDEV_AGG_TRIM_HISTO,
	    vsx->vsx_agg_histo[ZIO_PRIORITY_TRIM],
	    ARRAY_SIZE(vsx->vsx_agg_histo[ZIO_PRIORITY_TRIM]));

	/* IO delays */
	fnvlist_add_uint64(nvx, ZPOOL_CONFIG_VDEV_SLOW_IOS, vs->vs_slow_ios);

	/* Add extended stats nvlist to main nvlist */
	fnvlist_add_nvlist(nv, ZPOOL_CONFIG_VDEV_STATS_EX, nvx);

	fnvlist_free(nvx);
	kmem_free(vs, sizeof (*vs));
	kmem_free(vsx, sizeof (*vsx));
}

static void
root_vdev_actions_getprogress(vdev_t *vd, nvlist_t *nvl)
{
	spa_t *spa = vd->vdev_spa;

	if (vd != spa->spa_root_vdev)
		return;

	/* provide either current or previous scan information */
	pool_scan_stat_t ps;
	if (spa_scan_get_stats(spa, &ps) == 0) {
		fnvlist_add_uint64_array(nvl,
		    ZPOOL_CONFIG_SCAN_STATS, (uint64_t *)&ps,
		    sizeof (pool_scan_stat_t) / sizeof (uint64_t));
	}

	pool_removal_stat_t prs;
	if (spa_removal_get_stats(spa, &prs) == 0) {
		fnvlist_add_uint64_array(nvl,
		    ZPOOL_CONFIG_REMOVAL_STATS, (uint64_t *)&prs,
		    sizeof (prs) / sizeof (uint64_t));
	}

	pool_checkpoint_stat_t pcs;
	if (spa_checkpoint_get_stats(spa, &pcs) == 0) {
		fnvlist_add_uint64_array(nvl,
		    ZPOOL_CONFIG_CHECKPOINT_STATS, (uint64_t *)&pcs,
		    sizeof (pcs) / sizeof (uint64_t));
	}
}

static void
top_vdev_actions_getprogress(vdev_t *vd, nvlist_t *nvl)
{
	if (vd == vd->vdev_top) {
		vdev_rebuild_stat_t vrs;
		if (vdev_rebuild_get_stats(vd, &vrs) == 0) {
			fnvlist_add_uint64_array(nvl,
			    ZPOOL_CONFIG_REBUILD_STATS, (uint64_t *)&vrs,
			    sizeof (vrs) / sizeof (uint64_t));
		}
	}
}

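/*
 * For orientation only: the config nvlist stored in the first half of each
 * label (see the "Configuration Information" section at the top of this
 * file) looks roughly like the sketch below when dumped with a tool such as
 * 'zdb -l'; the vdev_tree portion is the kind of nvlist that
 * vdev_config_generate() below builds.  Every name and numeric value shown
 * here is hypothetical, and the exact set of pairs varies with the vdev
 * type and the flags passed in.
 *
 *	version: 5000
 *	name: 'tank'
 *	state: 0
 *	txg: 4
 *	pool_guid: 6941626320093103754
 *	top_guid: 2086120285044540867
 *	guid: 2086120285044540867
 *	vdev_tree:
 *	    type: 'disk'
 *	    id: 0
 *	    guid: 2086120285044540867
 *	    path: '/dev/sda1'
 *	    whole_disk: 1
 *	    metaslab_array: 68
 *	    metaslab_shift: 31
 *	    ashift: 12
 *	    asize: 1995774951424
 *	    is_log: 0
 *	    create_txg: 4
 *	features_for_read:
 *	    com.delphix:hole_birth
 *	    com.delphix:embedded_data
 */
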
/*
 * Generate the nvlist representing this vdev's config.
 */
nvlist_t *
vdev_config_generate(spa_t *spa, vdev_t *vd, boolean_t getstats,
    vdev_config_flag_t flags)
{
	nvlist_t *nv = NULL;
	vdev_indirect_config_t *vic = &vd->vdev_indirect_config;

	nv = fnvlist_alloc();

	fnvlist_add_string(nv, ZPOOL_CONFIG_TYPE, vd->vdev_ops->vdev_op_type);
	if (!(flags & (VDEV_CONFIG_SPARE | VDEV_CONFIG_L2CACHE)))
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_ID, vd->vdev_id);
	fnvlist_add_uint64(nv, ZPOOL_CONFIG_GUID, vd->vdev_guid);

	if (vd->vdev_path != NULL)
		fnvlist_add_string(nv, ZPOOL_CONFIG_PATH, vd->vdev_path);

	if (vd->vdev_devid != NULL)
		fnvlist_add_string(nv, ZPOOL_CONFIG_DEVID, vd->vdev_devid);

	if (vd->vdev_physpath != NULL)
		fnvlist_add_string(nv, ZPOOL_CONFIG_PHYS_PATH,
		    vd->vdev_physpath);

	if (vd->vdev_enc_sysfs_path != NULL)
		fnvlist_add_string(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH,
		    vd->vdev_enc_sysfs_path);

	if (vd->vdev_fru != NULL)
		fnvlist_add_string(nv, ZPOOL_CONFIG_FRU, vd->vdev_fru);

	if (vd->vdev_nparity != 0) {
		ASSERT(strcmp(vd->vdev_ops->vdev_op_type,
		    VDEV_TYPE_RAIDZ) == 0);

		/*
		 * Make sure someone hasn't managed to sneak a fancy new vdev
		 * into a crufty old storage pool.
		 */
		ASSERT(vd->vdev_nparity == 1 ||
		    (vd->vdev_nparity <= 2 &&
		    spa_version(spa) >= SPA_VERSION_RAIDZ2) ||
		    (vd->vdev_nparity <= 3 &&
		    spa_version(spa) >= SPA_VERSION_RAIDZ3));

		/*
		 * Note that we'll add the nparity tag even on storage pools
		 * that only support a single parity device -- older software
		 * will just ignore it.
		 */
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_NPARITY, vd->vdev_nparity);
	}

	if (vd->vdev_wholedisk != -1ULL)
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
		    vd->vdev_wholedisk);

	if (vd->vdev_not_present && !(flags & VDEV_CONFIG_MISSING))
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT, 1);

	if (vd->vdev_isspare)
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_SPARE, 1);

	if (!(flags & (VDEV_CONFIG_SPARE | VDEV_CONFIG_L2CACHE)) &&
	    vd == vd->vdev_top) {
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_ARRAY,
		    vd->vdev_ms_array);
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT,
		    vd->vdev_ms_shift);
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT, vd->vdev_ashift);
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE,
		    vd->vdev_asize);
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_LOG, vd->vdev_islog);
		if (vd->vdev_removing) {
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVING,
			    vd->vdev_removing);
		}

		/* zpool command expects alloc class data */
		if (getstats && vd->vdev_alloc_bias != VDEV_BIAS_NONE) {
			const char *bias = NULL;

			switch (vd->vdev_alloc_bias) {
			case VDEV_BIAS_LOG:
				bias = VDEV_ALLOC_BIAS_LOG;
				break;
			case VDEV_BIAS_SPECIAL:
				bias = VDEV_ALLOC_BIAS_SPECIAL;
				break;
			case VDEV_BIAS_DEDUP:
				bias = VDEV_ALLOC_BIAS_DEDUP;
				break;
			default:
				ASSERT3U(vd->vdev_alloc_bias, ==,
				    VDEV_BIAS_NONE);
			}
			fnvlist_add_string(nv, ZPOOL_CONFIG_ALLOCATION_BIAS,
			    bias);
		}
	}

	if (vd->vdev_dtl_sm != NULL) {
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_DTL,
		    space_map_object(vd->vdev_dtl_sm));
	}

	if (vic->vic_mapping_object != 0) {
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_OBJECT,
		    vic->vic_mapping_object);
	}

	if (vic->vic_births_object != 0) {
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_BIRTHS,
		    vic->vic_births_object);
	}

	if (vic->vic_prev_indirect_vdev != UINT64_MAX) {
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_PREV_INDIRECT_VDEV,
		    vic->vic_prev_indirect_vdev);
	}

	if (vd->vdev_crtxg)
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_CREATE_TXG, vd->vdev_crtxg);

	if (vd->vdev_expansion_time)
		fnvlist_add_uint64(nv, ZPOOL_CONFIG_EXPANSION_TIME,
		    vd->vdev_expansion_time);

	if (flags & VDEV_CONFIG_MOS) {
		if (vd->vdev_leaf_zap != 0) {
			ASSERT(vd->vdev_ops->vdev_op_leaf);
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_VDEV_LEAF_ZAP,
			    vd->vdev_leaf_zap);
		}

		if (vd->vdev_top_zap != 0) {
			ASSERT(vd == vd->vdev_top);
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_VDEV_TOP_ZAP,
			    vd->vdev_top_zap);
		}

		if (vd->vdev_resilver_deferred) {
			ASSERT(vd->vdev_ops->vdev_op_leaf);
			ASSERT(spa->spa_resilver_deferred);
			fnvlist_add_boolean(nv, ZPOOL_CONFIG_RESILVER_DEFER);
		}
	}

	if (getstats) {
		vdev_config_generate_stats(vd, nv);

		root_vdev_actions_getprogress(vd, nv);
		top_vdev_actions_getprogress(vd, nv);

		/*
		 * Note: this can be called from open context
		 * (spa_get_stats()), so we need the rwlock to prevent
		 * the mapping from being changed by condensing.
		 */
		rw_enter(&vd->vdev_indirect_rwlock, RW_READER);
		if (vd->vdev_indirect_mapping != NULL) {
			ASSERT(vd->vdev_indirect_births != NULL);
			vdev_indirect_mapping_t *vim =
			    vd->vdev_indirect_mapping;
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_SIZE,
			    vdev_indirect_mapping_size(vim));
		}
		rw_exit(&vd->vdev_indirect_rwlock);
		if (vd->vdev_mg != NULL &&
		    vd->vdev_mg->mg_fragmentation != ZFS_FRAG_INVALID) {
			/*
			 * Compute approximately how much memory would be used
			 * for the indirect mapping if this device were to
			 * be removed.
			 *
			 * Note: If the frag metric is invalid, then not
			 * enough metaslabs have been converted to have
			 * histograms.
			 */
			uint64_t seg_count = 0;
			uint64_t to_alloc = vd->vdev_stat.vs_alloc;

			/*
			 * There are the same number of allocated segments
			 * as free segments, so we will have at least one
			 * entry per free segment. However, small free
			 * segments (smaller than vdev_removal_max_span)
			 * will be combined with adjacent allocated segments
			 * as a single mapping.
			 */
			for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) {
				if (1ULL << (i + 1) < vdev_removal_max_span) {
					to_alloc +=
					    vd->vdev_mg->mg_histogram[i] <<
					    (i + 1);
				} else {
					seg_count +=
					    vd->vdev_mg->mg_histogram[i];
				}
			}

			/*
			 * The maximum length of a mapping is
			 * zfs_remove_max_segment, so we need at least one entry
			 * per zfs_remove_max_segment of allocated data.
			 */
			seg_count += to_alloc / spa_remove_max_segment(spa);

			fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_SIZE,
			    seg_count *
			    sizeof (vdev_indirect_mapping_entry_phys_t));
		}
	}

	if (!vd->vdev_ops->vdev_op_leaf) {
		nvlist_t **child;
		int c, idx;

		ASSERT(!vd->vdev_ishole);

		child = kmem_alloc(vd->vdev_children * sizeof (nvlist_t *),
		    KM_SLEEP);

		for (c = 0, idx = 0; c < vd->vdev_children; c++) {
			vdev_t *cvd = vd->vdev_child[c];

			/*
			 * If we're generating an nvlist of removing
			 * vdevs then skip over any device which is
			 * not being removed.
			 */
			if ((flags & VDEV_CONFIG_REMOVING) &&
			    !cvd->vdev_removing)
				continue;

			child[idx++] = vdev_config_generate(spa, cvd,
			    getstats, flags);
		}

		if (idx) {
			fnvlist_add_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
			    child, idx);
		}

		for (c = 0; c < idx; c++)
			nvlist_free(child[c]);

		kmem_free(child, vd->vdev_children * sizeof (nvlist_t *));

	} else {
		const char *aux = NULL;

		if (vd->vdev_offline && !vd->vdev_tmpoffline)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_OFFLINE, B_TRUE);
		if (vd->vdev_resilver_txg != 0)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_RESILVER_TXG,
			    vd->vdev_resilver_txg);
		if (vd->vdev_rebuild_txg != 0)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_REBUILD_TXG,
			    vd->vdev_rebuild_txg);
		if (vd->vdev_faulted)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_FAULTED, B_TRUE);
		if (vd->vdev_degraded)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_DEGRADED, B_TRUE);
		if (vd->vdev_removed)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVED, B_TRUE);
		if (vd->vdev_unspare)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_UNSPARE, B_TRUE);
		if (vd->vdev_ishole)
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_HOLE, B_TRUE);

		/* Set the reason why we're FAULTED/DEGRADED. */
		switch (vd->vdev_stat.vs_aux) {
		case VDEV_AUX_ERR_EXCEEDED:
			aux = "err_exceeded";
			break;

		case VDEV_AUX_EXTERNAL:
			aux = "external";
			break;
		}

		if (aux != NULL && !vd->vdev_tmpoffline) {
			fnvlist_add_string(nv, ZPOOL_CONFIG_AUX_STATE, aux);
		} else {
			/*
			 * We're healthy - clear any previous AUX_STATE values.
			 */
			if (nvlist_exists(nv, ZPOOL_CONFIG_AUX_STATE))
				nvlist_remove_all(nv, ZPOOL_CONFIG_AUX_STATE);
		}

		if (vd->vdev_splitting && vd->vdev_orig_guid != 0LL) {
			fnvlist_add_uint64(nv, ZPOOL_CONFIG_ORIG_GUID,
			    vd->vdev_orig_guid);
		}
	}

	return (nv);
}

/*
 * Generate a view of the top-level vdevs.  If we currently have holes
 * in the namespace, then generate an array which contains a list of holey
 * vdevs.
 * Additionally, add the number of top-level children that currently
 * exist.
 */
void
vdev_top_config_generate(spa_t *spa, nvlist_t *config)
{
	vdev_t *rvd = spa->spa_root_vdev;
	uint64_t *array;
	uint_t c, idx;

	array = kmem_alloc(rvd->vdev_children * sizeof (uint64_t), KM_SLEEP);

	for (c = 0, idx = 0; c < rvd->vdev_children; c++) {
		vdev_t *tvd = rvd->vdev_child[c];

		if (tvd->vdev_ishole) {
			array[idx++] = c;
		}
	}

	if (idx) {
		VERIFY(nvlist_add_uint64_array(config, ZPOOL_CONFIG_HOLE_ARRAY,
		    array, idx) == 0);
	}

	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
	    rvd->vdev_children) == 0);

	kmem_free(array, rvd->vdev_children * sizeof (uint64_t));
}

/*
 * Returns the configuration from the label of the given vdev. For vdevs
 * which don't have a txg value stored on their label (i.e. spares/cache)
 * or have not been completely initialized (txg = 0) just return
 * the configuration from the first valid label we find. Otherwise,
 * find the most up-to-date label that does not exceed the specified
 * 'txg' value.
 */
nvlist_t *
vdev_label_read_config(vdev_t *vd, uint64_t txg)
{
	spa_t *spa = vd->vdev_spa;
	nvlist_t *config = NULL;
	vdev_phys_t *vp;
	abd_t *vp_abd;
	zio_t *zio;
	uint64_t best_txg = 0;
	uint64_t label_txg = 0;
	int error = 0;
	int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_SPECULATIVE;

	ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);

	if (!vdev_readable(vd))
		return (NULL);

	vp_abd = abd_alloc_linear(sizeof (vdev_phys_t), B_TRUE);
	vp = abd_to_buf(vp_abd);

retry:
	for (int l = 0; l < VDEV_LABELS; l++) {
		nvlist_t *label = NULL;

		zio = zio_root(spa, NULL, NULL, flags);

		vdev_label_read(zio, vd, l, vp_abd,
		    offsetof(vdev_label_t, vl_vdev_phys),
		    sizeof (vdev_phys_t), NULL, NULL, flags);

		if (zio_wait(zio) == 0 &&
		    nvlist_unpack(vp->vp_nvlist, sizeof (vp->vp_nvlist),
		    &label, 0) == 0) {
			/*
			 * Auxiliary vdevs won't have txg values in their
			 * labels and newly added vdevs may not have been
			 * completely initialized so just return the
			 * configuration from the first valid label we
			 * encounter.
			 */
			error = nvlist_lookup_uint64(label,
			    ZPOOL_CONFIG_POOL_TXG, &label_txg);
			if ((error || label_txg == 0) && !config) {
				config = label;
				break;
			} else if (label_txg <= txg && label_txg > best_txg) {
				best_txg = label_txg;
				nvlist_free(config);
				config = fnvlist_dup(label);
			}
		}

		if (label != NULL) {
			nvlist_free(label);
			label = NULL;
		}
	}

	if (config == NULL && !(flags & ZIO_FLAG_TRYHARD)) {
		flags |= ZIO_FLAG_TRYHARD;
		goto retry;
	}

	/*
	 * We found a valid label but it didn't pass txg restrictions.
	 */
	if (config == NULL && label_txg != 0) {
		vdev_dbgmsg(vd, "label discarded as txg is too large "
		    "(%llu > %llu)", (u_longlong_t)label_txg,
		    (u_longlong_t)txg);
	}

	abd_free(vp_abd);

	return (config);
}

/*
 * Determine if a device is in use.  The 'spare_guid' parameter will be filled
 * in with the device guid if this spare is active elsewhere on the system.
 */
static boolean_t
vdev_inuse(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason,
    uint64_t *spare_guid, uint64_t *l2cache_guid)
{
	spa_t *spa = vd->vdev_spa;
	uint64_t state, pool_guid, device_guid, txg, spare_pool;
	uint64_t vdtxg = 0;
	nvlist_t *label;

	if (spare_guid)
		*spare_guid = 0ULL;
	if (l2cache_guid)
		*l2cache_guid = 0ULL;

	/*
	 * Read the label, if any, and perform some basic sanity checks.
	 */
	if ((label = vdev_label_read_config(vd, -1ULL)) == NULL)
		return (B_FALSE);

	(void) nvlist_lookup_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
	    &vdtxg);

	if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
	    &state) != 0 ||
	    nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
	    &device_guid) != 0) {
		nvlist_free(label);
		return (B_FALSE);
	}

	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID,
	    &pool_guid) != 0 ||
	    nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_TXG,
	    &txg) != 0)) {
		nvlist_free(label);
		return (B_FALSE);
	}

	nvlist_free(label);

	/*
	 * Check to see if this device indeed belongs to the pool it claims to
	 * be a part of.  The only way this is allowed is if the device is a hot
	 * spare (which we check for later on).
	 */
	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    !spa_guid_exists(pool_guid, device_guid) &&
	    !spa_spare_exists(device_guid, NULL, NULL) &&
	    !spa_l2cache_exists(device_guid, NULL))
		return (B_FALSE);

	/*
	 * If the transaction group is zero, then this is an initialized (but
	 * unused) label.
	 * This is only an error if the create transaction
	 * on-disk is the same as the one we're using now, in which case the
	 * user has attempted to add the same vdev multiple times in the same
	 * transaction.
	 */
	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    txg == 0 && vdtxg == crtxg)
		return (B_TRUE);

	/*
	 * Check to see if this is a spare device.  We do an explicit check for
	 * spa_has_spare() here because it may be on our pending list of spares
	 * to add.  We also check if it is an l2cache device.
	 */
	if (spa_spare_exists(device_guid, &spare_pool, NULL) ||
	    spa_has_spare(spa, device_guid)) {
		if (spare_guid)
			*spare_guid = device_guid;

		switch (reason) {
		case VDEV_LABEL_CREATE:
		case VDEV_LABEL_L2CACHE:
			return (B_TRUE);

		case VDEV_LABEL_REPLACE:
			return (!spa_has_spare(spa, device_guid) ||
			    spare_pool != 0ULL);

		case VDEV_LABEL_SPARE:
			return (spa_has_spare(spa, device_guid));
		default:
			break;
		}
	}

	/*
	 * Check to see if this is an l2cache device.
	 */
	if (spa_l2cache_exists(device_guid, NULL))
		return (B_TRUE);

	/*
	 * We can't rely on a pool's state if it's been imported
	 * read-only.  Instead we look to see if the pool is marked
	 * read-only in the namespace and set the state to active.
	 */
	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    (spa = spa_by_guid(pool_guid, device_guid)) != NULL &&
	    spa_mode(spa) == SPA_MODE_READ)
		state = POOL_STATE_ACTIVE;

	/*
	 * If the device is marked ACTIVE, then this device is in use by another
	 * pool on the system.
	 */
	return (state == POOL_STATE_ACTIVE);
}

/*
 * Initialize a vdev label.  We check to make sure each leaf device is not in
 * use, and writable.  We put down an initial label which we will later
 * overwrite with a complete label.  Note that it's important to do this
 * sequentially, not in parallel, so that we catch cases of multiple use of the
 * same leaf vdev in the vdev we're creating -- e.g. mirroring a disk with
 * itself.
 */
int
vdev_label_init(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason)
{
	spa_t *spa = vd->vdev_spa;
	nvlist_t *label;
	vdev_phys_t *vp;
	abd_t *vp_abd;
	abd_t *bootenv;
	uberblock_t *ub;
	abd_t *ub_abd;
	zio_t *zio;
	char *buf;
	size_t buflen;
	int error;
	uint64_t spare_guid = 0, l2cache_guid = 0;
	int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;

	ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);

	for (int c = 0; c < vd->vdev_children; c++)
		if ((error = vdev_label_init(vd->vdev_child[c],
		    crtxg, reason)) != 0)
			return (error);

	/* Track the creation time for this vdev */
	vd->vdev_crtxg = crtxg;

	if (!vd->vdev_ops->vdev_op_leaf || !spa_writeable(spa))
		return (0);

	/*
	 * Dead vdevs cannot be initialized.
	 */
	if (vdev_is_dead(vd))
		return (SET_ERROR(EIO));

	/*
	 * Determine if the vdev is in use.
	 */
	if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPLIT &&
	    vdev_inuse(vd, crtxg, reason, &spare_guid, &l2cache_guid))
		return (SET_ERROR(EBUSY));

	/*
	 * If this is a request to add or replace a spare or l2cache device
	 * that is in use elsewhere on the system, then we must update the
	 * guid (which was initialized to a random value) to reflect the
	 * actual GUID (which is shared between multiple pools).
	 */
	if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_L2CACHE &&
	    spare_guid != 0ULL) {
		uint64_t guid_delta = spare_guid - vd->vdev_guid;

		vd->vdev_guid += guid_delta;

		for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
			pvd->vdev_guid_sum += guid_delta;

		/*
		 * If this is a replacement, then we want to fallthrough to the
		 * rest of the code.  If we're adding a spare, then it's already
		 * labeled appropriately and we can just return.
		 */
		if (reason == VDEV_LABEL_SPARE)
			return (0);
		ASSERT(reason == VDEV_LABEL_REPLACE ||
		    reason == VDEV_LABEL_SPLIT);
	}

	if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPARE &&
	    l2cache_guid != 0ULL) {
		uint64_t guid_delta = l2cache_guid - vd->vdev_guid;

		vd->vdev_guid += guid_delta;

		for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
			pvd->vdev_guid_sum += guid_delta;

		/*
		 * If this is a replacement, then we want to fallthrough to the
		 * rest of the code.  If we're adding an l2cache, then it's
		 * already labeled appropriately and we can just return.
		 */
		if (reason == VDEV_LABEL_L2CACHE)
			return (0);
		ASSERT(reason == VDEV_LABEL_REPLACE);
	}

	/*
	 * Initialize its label.
	 */
	vp_abd = abd_alloc_linear(sizeof (vdev_phys_t), B_TRUE);
	abd_zero(vp_abd, sizeof (vdev_phys_t));
	vp = abd_to_buf(vp_abd);

	/*
	 * Generate a label describing the pool and our top-level vdev.
	 * We mark it as being from txg 0 to indicate that it's not
	 * really part of an active pool just yet.  The labels will
	 * be written again with a meaningful txg by spa_sync().
	 */
	if (reason == VDEV_LABEL_SPARE ||
	    (reason == VDEV_LABEL_REMOVE && vd->vdev_isspare)) {
		/*
		 * For inactive hot spares, we generate a special label that
		 * identifies as a mutually shared hot spare.  We write the
		 * label if we are adding a hot spare, or if we are removing an
		 * active hot spare (in which case we want to revert the
		 * labels).
		 */
		VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);

		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
		    spa_version(spa)) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
		    POOL_STATE_SPARE) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
		    vd->vdev_guid) == 0);
	} else if (reason == VDEV_LABEL_L2CACHE ||
	    (reason == VDEV_LABEL_REMOVE && vd->vdev_isl2cache)) {
		/*
		 * For level 2 ARC devices, add a special label.
		 */
		VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);

		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
		    spa_version(spa)) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
		    POOL_STATE_L2CACHE) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
		    vd->vdev_guid) == 0);
	} else {
		uint64_t txg = 0ULL;

		if (reason == VDEV_LABEL_SPLIT)
			txg = spa->spa_uberblock.ub_txg;
		label = spa_config_generate(spa, vd, txg, B_FALSE);

		/*
		 * Add our creation time.  This allows us to detect multiple
		 * vdev uses as described above, and automatically expires if we
		 * fail.
		 */
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
		    crtxg) == 0);
	}

	buf = vp->vp_nvlist;
	buflen = sizeof (vp->vp_nvlist);

	error = nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP);
	if (error != 0) {
		nvlist_free(label);
		abd_free(vp_abd);
		/* EFAULT means nvlist_pack ran out of room */
		return (SET_ERROR(error == EFAULT ? ENAMETOOLONG : EINVAL));
	}

	/*
	 * Initialize uberblock template.
	 */
	ub_abd = abd_alloc_linear(VDEV_UBERBLOCK_RING, B_TRUE);
	abd_zero(ub_abd, VDEV_UBERBLOCK_RING);
	abd_copy_from_buf(ub_abd, &spa->spa_uberblock, sizeof (uberblock_t));
	ub = abd_to_buf(ub_abd);
	ub->ub_txg = 0;

	/* Initialize the 2nd padding area. */
	bootenv = abd_alloc_for_io(VDEV_PAD_SIZE, B_TRUE);
	abd_zero(bootenv, VDEV_PAD_SIZE);

	/*
	 * Write everything in parallel.
	 */
retry:
	zio = zio_root(spa, NULL, NULL, flags);

	for (int l = 0; l < VDEV_LABELS; l++) {

		vdev_label_write(zio, vd, l, vp_abd,
		    offsetof(vdev_label_t, vl_vdev_phys),
		    sizeof (vdev_phys_t), NULL, NULL, flags);

		/*
		 * Skip the 1st padding area.
		 * Zero out the 2nd padding area where it might have
		 * left over data from previous filesystem format.
1157*eda14cbcSMatt Macy */ 1158*eda14cbcSMatt Macy vdev_label_write(zio, vd, l, bootenv, 1159*eda14cbcSMatt Macy offsetof(vdev_label_t, vl_be), 1160*eda14cbcSMatt Macy VDEV_PAD_SIZE, NULL, NULL, flags); 1161*eda14cbcSMatt Macy 1162*eda14cbcSMatt Macy vdev_label_write(zio, vd, l, ub_abd, 1163*eda14cbcSMatt Macy offsetof(vdev_label_t, vl_uberblock), 1164*eda14cbcSMatt Macy VDEV_UBERBLOCK_RING, NULL, NULL, flags); 1165*eda14cbcSMatt Macy } 1166*eda14cbcSMatt Macy 1167*eda14cbcSMatt Macy error = zio_wait(zio); 1168*eda14cbcSMatt Macy 1169*eda14cbcSMatt Macy if (error != 0 && !(flags & ZIO_FLAG_TRYHARD)) { 1170*eda14cbcSMatt Macy flags |= ZIO_FLAG_TRYHARD; 1171*eda14cbcSMatt Macy goto retry; 1172*eda14cbcSMatt Macy } 1173*eda14cbcSMatt Macy 1174*eda14cbcSMatt Macy nvlist_free(label); 1175*eda14cbcSMatt Macy abd_free(bootenv); 1176*eda14cbcSMatt Macy abd_free(ub_abd); 1177*eda14cbcSMatt Macy abd_free(vp_abd); 1178*eda14cbcSMatt Macy 1179*eda14cbcSMatt Macy /* 1180*eda14cbcSMatt Macy * If this vdev hasn't been previously identified as a spare, then we 1181*eda14cbcSMatt Macy * mark it as such only if a) we are labeling it as a spare, or b) it 1182*eda14cbcSMatt Macy * exists as a spare elsewhere in the system. Do the same for 1183*eda14cbcSMatt Macy * level 2 ARC devices. 1184*eda14cbcSMatt Macy */ 1185*eda14cbcSMatt Macy if (error == 0 && !vd->vdev_isspare && 1186*eda14cbcSMatt Macy (reason == VDEV_LABEL_SPARE || 1187*eda14cbcSMatt Macy spa_spare_exists(vd->vdev_guid, NULL, NULL))) 1188*eda14cbcSMatt Macy spa_spare_add(vd); 1189*eda14cbcSMatt Macy 1190*eda14cbcSMatt Macy if (error == 0 && !vd->vdev_isl2cache && 1191*eda14cbcSMatt Macy (reason == VDEV_LABEL_L2CACHE || 1192*eda14cbcSMatt Macy spa_l2cache_exists(vd->vdev_guid, NULL))) 1193*eda14cbcSMatt Macy spa_l2cache_add(vd); 1194*eda14cbcSMatt Macy 1195*eda14cbcSMatt Macy return (error); 1196*eda14cbcSMatt Macy } 1197*eda14cbcSMatt Macy 1198*eda14cbcSMatt Macy /* 1199*eda14cbcSMatt Macy * Done callback for vdev_label_read_bootenv_impl. If this is the first 1200*eda14cbcSMatt Macy * callback to finish, store our abd in the callback pointer. Otherwise, we 1201*eda14cbcSMatt Macy * just free our abd and return. 1202*eda14cbcSMatt Macy */ 1203*eda14cbcSMatt Macy static void 1204*eda14cbcSMatt Macy vdev_label_read_bootenv_done(zio_t *zio) 1205*eda14cbcSMatt Macy { 1206*eda14cbcSMatt Macy zio_t *rio = zio->io_private; 1207*eda14cbcSMatt Macy abd_t **cbp = rio->io_private; 1208*eda14cbcSMatt Macy 1209*eda14cbcSMatt Macy ASSERT3U(zio->io_size, ==, VDEV_PAD_SIZE); 1210*eda14cbcSMatt Macy 1211*eda14cbcSMatt Macy if (zio->io_error == 0) { 1212*eda14cbcSMatt Macy mutex_enter(&rio->io_lock); 1213*eda14cbcSMatt Macy if (*cbp == NULL) { 1214*eda14cbcSMatt Macy /* Will free this buffer in vdev_label_read_bootenv. 
*/ 1215*eda14cbcSMatt Macy *cbp = zio->io_abd; 1216*eda14cbcSMatt Macy } else { 1217*eda14cbcSMatt Macy abd_free(zio->io_abd); 1218*eda14cbcSMatt Macy } 1219*eda14cbcSMatt Macy mutex_exit(&rio->io_lock); 1220*eda14cbcSMatt Macy } else { 1221*eda14cbcSMatt Macy abd_free(zio->io_abd); 1222*eda14cbcSMatt Macy } 1223*eda14cbcSMatt Macy } 1224*eda14cbcSMatt Macy 1225*eda14cbcSMatt Macy static void 1226*eda14cbcSMatt Macy vdev_label_read_bootenv_impl(zio_t *zio, vdev_t *vd, int flags) 1227*eda14cbcSMatt Macy { 1228*eda14cbcSMatt Macy for (int c = 0; c < vd->vdev_children; c++) 1229*eda14cbcSMatt Macy vdev_label_read_bootenv_impl(zio, vd->vdev_child[c], flags); 1230*eda14cbcSMatt Macy 1231*eda14cbcSMatt Macy /* 1232*eda14cbcSMatt Macy * We just use the first label that has a correct checksum; the 1233*eda14cbcSMatt Macy * bootloader should have rewritten them all to be the same on boot, 1234*eda14cbcSMatt Macy * and any changes we made since boot have been the same across all 1235*eda14cbcSMatt Macy * labels. 1236*eda14cbcSMatt Macy * 1237*eda14cbcSMatt Macy * While grub supports writing to all four labels, other bootloaders 1238*eda14cbcSMatt Macy * don't, so we only use the first two labels to store boot 1239*eda14cbcSMatt Macy * information. 1240*eda14cbcSMatt Macy */ 1241*eda14cbcSMatt Macy if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) { 1242*eda14cbcSMatt Macy for (int l = 0; l < VDEV_LABELS / 2; l++) { 1243*eda14cbcSMatt Macy vdev_label_read(zio, vd, l, 1244*eda14cbcSMatt Macy abd_alloc_linear(VDEV_PAD_SIZE, B_FALSE), 1245*eda14cbcSMatt Macy offsetof(vdev_label_t, vl_be), VDEV_PAD_SIZE, 1246*eda14cbcSMatt Macy vdev_label_read_bootenv_done, zio, flags); 1247*eda14cbcSMatt Macy } 1248*eda14cbcSMatt Macy } 1249*eda14cbcSMatt Macy } 1250*eda14cbcSMatt Macy 1251*eda14cbcSMatt Macy int 1252*eda14cbcSMatt Macy vdev_label_read_bootenv(vdev_t *rvd, nvlist_t *command) 1253*eda14cbcSMatt Macy { 1254*eda14cbcSMatt Macy spa_t *spa = rvd->vdev_spa; 1255*eda14cbcSMatt Macy abd_t *abd = NULL; 1256*eda14cbcSMatt Macy int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL | 1257*eda14cbcSMatt Macy ZIO_FLAG_SPECULATIVE | ZIO_FLAG_TRYHARD; 1258*eda14cbcSMatt Macy 1259*eda14cbcSMatt Macy ASSERT(command); 1260*eda14cbcSMatt Macy ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL); 1261*eda14cbcSMatt Macy 1262*eda14cbcSMatt Macy zio_t *zio = zio_root(spa, NULL, &abd, flags); 1263*eda14cbcSMatt Macy vdev_label_read_bootenv_impl(zio, rvd, flags); 1264*eda14cbcSMatt Macy int err = zio_wait(zio); 1265*eda14cbcSMatt Macy 1266*eda14cbcSMatt Macy if (abd != NULL) { 1267*eda14cbcSMatt Macy vdev_boot_envblock_t *vbe = abd_to_buf(abd); 1268*eda14cbcSMatt Macy if (vbe->vbe_version != VB_RAW) { 1269*eda14cbcSMatt Macy abd_free(abd); 1270*eda14cbcSMatt Macy return (SET_ERROR(ENOTSUP)); 1271*eda14cbcSMatt Macy } 1272*eda14cbcSMatt Macy vbe->vbe_bootenv[sizeof (vbe->vbe_bootenv) - 1] = '\0'; 1273*eda14cbcSMatt Macy fnvlist_add_string(command, "envmap", vbe->vbe_bootenv); 1274*eda14cbcSMatt Macy /* abd was allocated in vdev_label_read_bootenv_impl() */ 1275*eda14cbcSMatt Macy abd_free(abd); 1276*eda14cbcSMatt Macy /* If we managed to read any successfully, return success. 
*/ 1277*eda14cbcSMatt Macy return (0); 1278*eda14cbcSMatt Macy } 1279*eda14cbcSMatt Macy return (err); 1280*eda14cbcSMatt Macy } 1281*eda14cbcSMatt Macy 1282*eda14cbcSMatt Macy int 1283*eda14cbcSMatt Macy vdev_label_write_bootenv(vdev_t *vd, char *envmap) 1284*eda14cbcSMatt Macy { 1285*eda14cbcSMatt Macy zio_t *zio; 1286*eda14cbcSMatt Macy spa_t *spa = vd->vdev_spa; 1287*eda14cbcSMatt Macy vdev_boot_envblock_t *bootenv; 1288*eda14cbcSMatt Macy int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL; 1289*eda14cbcSMatt Macy int error = ENXIO; 1290*eda14cbcSMatt Macy 1291*eda14cbcSMatt Macy if (strlen(envmap) >= sizeof (bootenv->vbe_bootenv)) { 1292*eda14cbcSMatt Macy return (SET_ERROR(E2BIG)); 1293*eda14cbcSMatt Macy } 1294*eda14cbcSMatt Macy 1295*eda14cbcSMatt Macy ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL); 1296*eda14cbcSMatt Macy 1297*eda14cbcSMatt Macy for (int c = 0; c < vd->vdev_children; c++) { 1298*eda14cbcSMatt Macy int child_err = vdev_label_write_bootenv(vd->vdev_child[c], 1299*eda14cbcSMatt Macy envmap); 1300*eda14cbcSMatt Macy /* 1301*eda14cbcSMatt Macy * As long as any of the disks managed to write all of their 1302*eda14cbcSMatt Macy * labels successfully, return success. 1303*eda14cbcSMatt Macy */ 1304*eda14cbcSMatt Macy if (child_err == 0) 1305*eda14cbcSMatt Macy error = child_err; 1306*eda14cbcSMatt Macy } 1307*eda14cbcSMatt Macy 1308*eda14cbcSMatt Macy if (!vd->vdev_ops->vdev_op_leaf || vdev_is_dead(vd) || 1309*eda14cbcSMatt Macy !vdev_writeable(vd)) { 1310*eda14cbcSMatt Macy return (error); 1311*eda14cbcSMatt Macy } 1312*eda14cbcSMatt Macy ASSERT3U(sizeof (*bootenv), ==, VDEV_PAD_SIZE); 1313*eda14cbcSMatt Macy abd_t *abd = abd_alloc_for_io(VDEV_PAD_SIZE, B_TRUE); 1314*eda14cbcSMatt Macy abd_zero(abd, VDEV_PAD_SIZE); 1315*eda14cbcSMatt Macy bootenv = abd_borrow_buf_copy(abd, VDEV_PAD_SIZE); 1316*eda14cbcSMatt Macy 1317*eda14cbcSMatt Macy char *buf = bootenv->vbe_bootenv; 1318*eda14cbcSMatt Macy (void) strlcpy(buf, envmap, sizeof (bootenv->vbe_bootenv)); 1319*eda14cbcSMatt Macy bootenv->vbe_version = VB_RAW; 1320*eda14cbcSMatt Macy abd_return_buf_copy(abd, bootenv, VDEV_PAD_SIZE); 1321*eda14cbcSMatt Macy 1322*eda14cbcSMatt Macy retry: 1323*eda14cbcSMatt Macy zio = zio_root(spa, NULL, NULL, flags); 1324*eda14cbcSMatt Macy for (int l = 0; l < VDEV_LABELS / 2; l++) { 1325*eda14cbcSMatt Macy vdev_label_write(zio, vd, l, abd, 1326*eda14cbcSMatt Macy offsetof(vdev_label_t, vl_be), 1327*eda14cbcSMatt Macy VDEV_PAD_SIZE, NULL, NULL, flags); 1328*eda14cbcSMatt Macy } 1329*eda14cbcSMatt Macy 1330*eda14cbcSMatt Macy error = zio_wait(zio); 1331*eda14cbcSMatt Macy if (error != 0 && !(flags & ZIO_FLAG_TRYHARD)) { 1332*eda14cbcSMatt Macy flags |= ZIO_FLAG_TRYHARD; 1333*eda14cbcSMatt Macy goto retry; 1334*eda14cbcSMatt Macy } 1335*eda14cbcSMatt Macy 1336*eda14cbcSMatt Macy abd_free(abd); 1337*eda14cbcSMatt Macy return (error); 1338*eda14cbcSMatt Macy } 1339*eda14cbcSMatt Macy 1340*eda14cbcSMatt Macy /* 1341*eda14cbcSMatt Macy * ========================================================================== 1342*eda14cbcSMatt Macy * uberblock load/sync 1343*eda14cbcSMatt Macy * ========================================================================== 1344*eda14cbcSMatt Macy */ 1345*eda14cbcSMatt Macy 1346*eda14cbcSMatt Macy /* 1347*eda14cbcSMatt Macy * Consider the following situation: txg is safely synced to disk. We've 1348*eda14cbcSMatt Macy * written the first uberblock for txg + 1, and then we lose power. 
When we 1349*eda14cbcSMatt Macy * come back up, we fail to see the uberblock for txg + 1 because, say, 1350*eda14cbcSMatt Macy * it was on a mirrored device and the replica to which we wrote txg + 1 1351*eda14cbcSMatt Macy * is now offline. If we then make some changes and sync txg + 1, and then 1352*eda14cbcSMatt Macy * the missing replica comes back, then for a few seconds we'll have two 1353*eda14cbcSMatt Macy * conflicting uberblocks on disk with the same txg. The solution is simple: 1354*eda14cbcSMatt Macy * among uberblocks with equal txg, choose the one with the latest timestamp. 1355*eda14cbcSMatt Macy */ 1356*eda14cbcSMatt Macy static int 1357*eda14cbcSMatt Macy vdev_uberblock_compare(const uberblock_t *ub1, const uberblock_t *ub2) 1358*eda14cbcSMatt Macy { 1359*eda14cbcSMatt Macy int cmp = TREE_CMP(ub1->ub_txg, ub2->ub_txg); 1360*eda14cbcSMatt Macy 1361*eda14cbcSMatt Macy if (likely(cmp)) 1362*eda14cbcSMatt Macy return (cmp); 1363*eda14cbcSMatt Macy 1364*eda14cbcSMatt Macy cmp = TREE_CMP(ub1->ub_timestamp, ub2->ub_timestamp); 1365*eda14cbcSMatt Macy if (likely(cmp)) 1366*eda14cbcSMatt Macy return (cmp); 1367*eda14cbcSMatt Macy 1368*eda14cbcSMatt Macy /* 1369*eda14cbcSMatt Macy * If MMP_VALID(ub) && MMP_SEQ_VALID(ub) then the host has an MMP-aware 1370*eda14cbcSMatt Macy * ZFS, e.g. zfsonlinux >= 0.7. 1371*eda14cbcSMatt Macy * 1372*eda14cbcSMatt Macy * If one ub has MMP and the other does not, they were written by 1373*eda14cbcSMatt Macy * different hosts, which matters for MMP. So we treat no MMP/no SEQ as 1374*eda14cbcSMatt Macy * a 0 value. 1375*eda14cbcSMatt Macy * 1376*eda14cbcSMatt Macy * Since timestamp and txg are the same if we get this far, either is 1377*eda14cbcSMatt Macy * acceptable for importing the pool. 1378*eda14cbcSMatt Macy */ 1379*eda14cbcSMatt Macy unsigned int seq1 = 0; 1380*eda14cbcSMatt Macy unsigned int seq2 = 0; 1381*eda14cbcSMatt Macy 1382*eda14cbcSMatt Macy if (MMP_VALID(ub1) && MMP_SEQ_VALID(ub1)) 1383*eda14cbcSMatt Macy seq1 = MMP_SEQ(ub1); 1384*eda14cbcSMatt Macy 1385*eda14cbcSMatt Macy if (MMP_VALID(ub2) && MMP_SEQ_VALID(ub2)) 1386*eda14cbcSMatt Macy seq2 = MMP_SEQ(ub2); 1387*eda14cbcSMatt Macy 1388*eda14cbcSMatt Macy return (TREE_CMP(seq1, seq2)); 1389*eda14cbcSMatt Macy } 1390*eda14cbcSMatt Macy 1391*eda14cbcSMatt Macy struct ubl_cbdata { 1392*eda14cbcSMatt Macy uberblock_t *ubl_ubbest; /* Best uberblock */ 1393*eda14cbcSMatt Macy vdev_t *ubl_vd; /* vdev associated with the above */ 1394*eda14cbcSMatt Macy }; 1395*eda14cbcSMatt Macy 1396*eda14cbcSMatt Macy static void 1397*eda14cbcSMatt Macy vdev_uberblock_load_done(zio_t *zio) 1398*eda14cbcSMatt Macy { 1399*eda14cbcSMatt Macy vdev_t *vd = zio->io_vd; 1400*eda14cbcSMatt Macy spa_t *spa = zio->io_spa; 1401*eda14cbcSMatt Macy zio_t *rio = zio->io_private; 1402*eda14cbcSMatt Macy uberblock_t *ub = abd_to_buf(zio->io_abd); 1403*eda14cbcSMatt Macy struct ubl_cbdata *cbp = rio->io_private; 1404*eda14cbcSMatt Macy 1405*eda14cbcSMatt Macy ASSERT3U(zio->io_size, ==, VDEV_UBERBLOCK_SIZE(vd)); 1406*eda14cbcSMatt Macy 1407*eda14cbcSMatt Macy if (zio->io_error == 0 && uberblock_verify(ub) == 0) { 1408*eda14cbcSMatt Macy mutex_enter(&rio->io_lock); 1409*eda14cbcSMatt Macy if (ub->ub_txg <= spa->spa_load_max_txg && 1410*eda14cbcSMatt Macy vdev_uberblock_compare(ub, cbp->ubl_ubbest) > 0) { 1411*eda14cbcSMatt Macy /* 1412*eda14cbcSMatt Macy * Keep track of the vdev in which this uberblock 1413*eda14cbcSMatt Macy * was found. 
We will use this information later 1414*eda14cbcSMatt Macy * to obtain the config nvlist associated with 1415*eda14cbcSMatt Macy * this uberblock. 1416*eda14cbcSMatt Macy */ 1417*eda14cbcSMatt Macy *cbp->ubl_ubbest = *ub; 1418*eda14cbcSMatt Macy cbp->ubl_vd = vd; 1419*eda14cbcSMatt Macy } 1420*eda14cbcSMatt Macy mutex_exit(&rio->io_lock); 1421*eda14cbcSMatt Macy } 1422*eda14cbcSMatt Macy 1423*eda14cbcSMatt Macy abd_free(zio->io_abd); 1424*eda14cbcSMatt Macy } 1425*eda14cbcSMatt Macy 1426*eda14cbcSMatt Macy static void 1427*eda14cbcSMatt Macy vdev_uberblock_load_impl(zio_t *zio, vdev_t *vd, int flags, 1428*eda14cbcSMatt Macy struct ubl_cbdata *cbp) 1429*eda14cbcSMatt Macy { 1430*eda14cbcSMatt Macy for (int c = 0; c < vd->vdev_children; c++) 1431*eda14cbcSMatt Macy vdev_uberblock_load_impl(zio, vd->vdev_child[c], flags, cbp); 1432*eda14cbcSMatt Macy 1433*eda14cbcSMatt Macy if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) { 1434*eda14cbcSMatt Macy for (int l = 0; l < VDEV_LABELS; l++) { 1435*eda14cbcSMatt Macy for (int n = 0; n < VDEV_UBERBLOCK_COUNT(vd); n++) { 1436*eda14cbcSMatt Macy vdev_label_read(zio, vd, l, 1437*eda14cbcSMatt Macy abd_alloc_linear(VDEV_UBERBLOCK_SIZE(vd), 1438*eda14cbcSMatt Macy B_TRUE), VDEV_UBERBLOCK_OFFSET(vd, n), 1439*eda14cbcSMatt Macy VDEV_UBERBLOCK_SIZE(vd), 1440*eda14cbcSMatt Macy vdev_uberblock_load_done, zio, flags); 1441*eda14cbcSMatt Macy } 1442*eda14cbcSMatt Macy } 1443*eda14cbcSMatt Macy } 1444*eda14cbcSMatt Macy } 1445*eda14cbcSMatt Macy 1446*eda14cbcSMatt Macy /* 1447*eda14cbcSMatt Macy * Reads the 'best' uberblock from disk along with its associated 1448*eda14cbcSMatt Macy * configuration. First, we read the uberblock array of each label of each 1449*eda14cbcSMatt Macy * vdev, keeping track of the uberblock with the highest txg in each array. 1450*eda14cbcSMatt Macy * Then, we read the configuration from the same vdev as the best uberblock. 1451*eda14cbcSMatt Macy */ 1452*eda14cbcSMatt Macy void 1453*eda14cbcSMatt Macy vdev_uberblock_load(vdev_t *rvd, uberblock_t *ub, nvlist_t **config) 1454*eda14cbcSMatt Macy { 1455*eda14cbcSMatt Macy zio_t *zio; 1456*eda14cbcSMatt Macy spa_t *spa = rvd->vdev_spa; 1457*eda14cbcSMatt Macy struct ubl_cbdata cb; 1458*eda14cbcSMatt Macy int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL | 1459*eda14cbcSMatt Macy ZIO_FLAG_SPECULATIVE | ZIO_FLAG_TRYHARD; 1460*eda14cbcSMatt Macy 1461*eda14cbcSMatt Macy ASSERT(ub); 1462*eda14cbcSMatt Macy ASSERT(config); 1463*eda14cbcSMatt Macy 1464*eda14cbcSMatt Macy bzero(ub, sizeof (uberblock_t)); 1465*eda14cbcSMatt Macy *config = NULL; 1466*eda14cbcSMatt Macy 1467*eda14cbcSMatt Macy cb.ubl_ubbest = ub; 1468*eda14cbcSMatt Macy cb.ubl_vd = NULL; 1469*eda14cbcSMatt Macy 1470*eda14cbcSMatt Macy spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); 1471*eda14cbcSMatt Macy zio = zio_root(spa, NULL, &cb, flags); 1472*eda14cbcSMatt Macy vdev_uberblock_load_impl(zio, rvd, flags, &cb); 1473*eda14cbcSMatt Macy (void) zio_wait(zio); 1474*eda14cbcSMatt Macy 1475*eda14cbcSMatt Macy /* 1476*eda14cbcSMatt Macy * It's possible that the best uberblock was discovered on a label 1477*eda14cbcSMatt Macy * that has a configuration which was written in a future txg. 1478*eda14cbcSMatt Macy * Search all labels on this vdev to find the configuration that 1479*eda14cbcSMatt Macy * matches the txg for our uberblock. 1480*eda14cbcSMatt Macy */ 1481*eda14cbcSMatt Macy if (cb.ubl_vd != NULL) { 1482*eda14cbcSMatt Macy vdev_dbgmsg(cb.ubl_vd, "best uberblock found for spa %s. 
" 1483*eda14cbcSMatt Macy "txg %llu", spa->spa_name, (u_longlong_t)ub->ub_txg); 1484*eda14cbcSMatt Macy 1485*eda14cbcSMatt Macy *config = vdev_label_read_config(cb.ubl_vd, ub->ub_txg); 1486*eda14cbcSMatt Macy if (*config == NULL && spa->spa_extreme_rewind) { 1487*eda14cbcSMatt Macy vdev_dbgmsg(cb.ubl_vd, "failed to read label config. " 1488*eda14cbcSMatt Macy "Trying again without txg restrictions."); 1489*eda14cbcSMatt Macy *config = vdev_label_read_config(cb.ubl_vd, UINT64_MAX); 1490*eda14cbcSMatt Macy } 1491*eda14cbcSMatt Macy if (*config == NULL) { 1492*eda14cbcSMatt Macy vdev_dbgmsg(cb.ubl_vd, "failed to read label config"); 1493*eda14cbcSMatt Macy } 1494*eda14cbcSMatt Macy } 1495*eda14cbcSMatt Macy spa_config_exit(spa, SCL_ALL, FTAG); 1496*eda14cbcSMatt Macy } 1497*eda14cbcSMatt Macy 1498*eda14cbcSMatt Macy /* 1499*eda14cbcSMatt Macy * For use when a leaf vdev is expanded. 1500*eda14cbcSMatt Macy * The location of labels 2 and 3 changed, and at the new location the 1501*eda14cbcSMatt Macy * uberblock rings are either empty or contain garbage. The sync will write 1502*eda14cbcSMatt Macy * new configs there because the vdev is dirty, but expansion also needs the 1503*eda14cbcSMatt Macy * uberblock rings copied. Read them from label 0 which did not move. 1504*eda14cbcSMatt Macy * 1505*eda14cbcSMatt Macy * Since the point is to populate labels {2,3} with valid uberblocks, 1506*eda14cbcSMatt Macy * we zero uberblocks we fail to read or which are not valid. 1507*eda14cbcSMatt Macy */ 1508*eda14cbcSMatt Macy 1509*eda14cbcSMatt Macy static void 1510*eda14cbcSMatt Macy vdev_copy_uberblocks(vdev_t *vd) 1511*eda14cbcSMatt Macy { 1512*eda14cbcSMatt Macy abd_t *ub_abd; 1513*eda14cbcSMatt Macy zio_t *write_zio; 1514*eda14cbcSMatt Macy int locks = (SCL_L2ARC | SCL_ZIO); 1515*eda14cbcSMatt Macy int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL | 1516*eda14cbcSMatt Macy ZIO_FLAG_SPECULATIVE; 1517*eda14cbcSMatt Macy 1518*eda14cbcSMatt Macy ASSERT(spa_config_held(vd->vdev_spa, SCL_STATE, RW_READER) == 1519*eda14cbcSMatt Macy SCL_STATE); 1520*eda14cbcSMatt Macy ASSERT(vd->vdev_ops->vdev_op_leaf); 1521*eda14cbcSMatt Macy 1522*eda14cbcSMatt Macy spa_config_enter(vd->vdev_spa, locks, FTAG, RW_READER); 1523*eda14cbcSMatt Macy 1524*eda14cbcSMatt Macy ub_abd = abd_alloc_linear(VDEV_UBERBLOCK_SIZE(vd), B_TRUE); 1525*eda14cbcSMatt Macy 1526*eda14cbcSMatt Macy write_zio = zio_root(vd->vdev_spa, NULL, NULL, flags); 1527*eda14cbcSMatt Macy for (int n = 0; n < VDEV_UBERBLOCK_COUNT(vd); n++) { 1528*eda14cbcSMatt Macy const int src_label = 0; 1529*eda14cbcSMatt Macy zio_t *zio; 1530*eda14cbcSMatt Macy 1531*eda14cbcSMatt Macy zio = zio_root(vd->vdev_spa, NULL, NULL, flags); 1532*eda14cbcSMatt Macy vdev_label_read(zio, vd, src_label, ub_abd, 1533*eda14cbcSMatt Macy VDEV_UBERBLOCK_OFFSET(vd, n), VDEV_UBERBLOCK_SIZE(vd), 1534*eda14cbcSMatt Macy NULL, NULL, flags); 1535*eda14cbcSMatt Macy 1536*eda14cbcSMatt Macy if (zio_wait(zio) || uberblock_verify(abd_to_buf(ub_abd))) 1537*eda14cbcSMatt Macy abd_zero(ub_abd, VDEV_UBERBLOCK_SIZE(vd)); 1538*eda14cbcSMatt Macy 1539*eda14cbcSMatt Macy for (int l = 2; l < VDEV_LABELS; l++) 1540*eda14cbcSMatt Macy vdev_label_write(write_zio, vd, l, ub_abd, 1541*eda14cbcSMatt Macy VDEV_UBERBLOCK_OFFSET(vd, n), 1542*eda14cbcSMatt Macy VDEV_UBERBLOCK_SIZE(vd), NULL, NULL, 1543*eda14cbcSMatt Macy flags | ZIO_FLAG_DONT_PROPAGATE); 1544*eda14cbcSMatt Macy } 1545*eda14cbcSMatt Macy (void) zio_wait(write_zio); 1546*eda14cbcSMatt Macy 1547*eda14cbcSMatt Macy spa_config_exit(vd->vdev_spa, locks, 
FTAG); 1548*eda14cbcSMatt Macy 1549*eda14cbcSMatt Macy abd_free(ub_abd); 1550*eda14cbcSMatt Macy } 1551*eda14cbcSMatt Macy 1552*eda14cbcSMatt Macy /* 1553*eda14cbcSMatt Macy * On success, increment root zio's count of good writes. 1554*eda14cbcSMatt Macy * We only get credit for writes to known-visible vdevs; see spa_vdev_add(). 1555*eda14cbcSMatt Macy */ 1556*eda14cbcSMatt Macy static void 1557*eda14cbcSMatt Macy vdev_uberblock_sync_done(zio_t *zio) 1558*eda14cbcSMatt Macy { 1559*eda14cbcSMatt Macy uint64_t *good_writes = zio->io_private; 1560*eda14cbcSMatt Macy 1561*eda14cbcSMatt Macy if (zio->io_error == 0 && zio->io_vd->vdev_top->vdev_ms_array != 0) 1562*eda14cbcSMatt Macy atomic_inc_64(good_writes); 1563*eda14cbcSMatt Macy } 1564*eda14cbcSMatt Macy 1565*eda14cbcSMatt Macy /* 1566*eda14cbcSMatt Macy * Write the uberblock to all labels of all leaves of the specified vdev. 1567*eda14cbcSMatt Macy */ 1568*eda14cbcSMatt Macy static void 1569*eda14cbcSMatt Macy vdev_uberblock_sync(zio_t *zio, uint64_t *good_writes, 1570*eda14cbcSMatt Macy uberblock_t *ub, vdev_t *vd, int flags) 1571*eda14cbcSMatt Macy { 1572*eda14cbcSMatt Macy for (uint64_t c = 0; c < vd->vdev_children; c++) { 1573*eda14cbcSMatt Macy vdev_uberblock_sync(zio, good_writes, 1574*eda14cbcSMatt Macy ub, vd->vdev_child[c], flags); 1575*eda14cbcSMatt Macy } 1576*eda14cbcSMatt Macy 1577*eda14cbcSMatt Macy if (!vd->vdev_ops->vdev_op_leaf) 1578*eda14cbcSMatt Macy return; 1579*eda14cbcSMatt Macy 1580*eda14cbcSMatt Macy if (!vdev_writeable(vd)) 1581*eda14cbcSMatt Macy return; 1582*eda14cbcSMatt Macy 1583*eda14cbcSMatt Macy /* If the vdev was expanded, need to copy uberblock rings. */ 1584*eda14cbcSMatt Macy if (vd->vdev_state == VDEV_STATE_HEALTHY && 1585*eda14cbcSMatt Macy vd->vdev_copy_uberblocks == B_TRUE) { 1586*eda14cbcSMatt Macy vdev_copy_uberblocks(vd); 1587*eda14cbcSMatt Macy vd->vdev_copy_uberblocks = B_FALSE; 1588*eda14cbcSMatt Macy } 1589*eda14cbcSMatt Macy 1590*eda14cbcSMatt Macy int m = spa_multihost(vd->vdev_spa) ? 
MMP_BLOCKS_PER_LABEL : 0; 1591*eda14cbcSMatt Macy int n = ub->ub_txg % (VDEV_UBERBLOCK_COUNT(vd) - m); 1592*eda14cbcSMatt Macy 1593*eda14cbcSMatt Macy /* Copy the uberblock_t into the ABD */ 1594*eda14cbcSMatt Macy abd_t *ub_abd = abd_alloc_for_io(VDEV_UBERBLOCK_SIZE(vd), B_TRUE); 1595*eda14cbcSMatt Macy abd_zero(ub_abd, VDEV_UBERBLOCK_SIZE(vd)); 1596*eda14cbcSMatt Macy abd_copy_from_buf(ub_abd, ub, sizeof (uberblock_t)); 1597*eda14cbcSMatt Macy 1598*eda14cbcSMatt Macy for (int l = 0; l < VDEV_LABELS; l++) 1599*eda14cbcSMatt Macy vdev_label_write(zio, vd, l, ub_abd, 1600*eda14cbcSMatt Macy VDEV_UBERBLOCK_OFFSET(vd, n), VDEV_UBERBLOCK_SIZE(vd), 1601*eda14cbcSMatt Macy vdev_uberblock_sync_done, good_writes, 1602*eda14cbcSMatt Macy flags | ZIO_FLAG_DONT_PROPAGATE); 1603*eda14cbcSMatt Macy 1604*eda14cbcSMatt Macy abd_free(ub_abd); 1605*eda14cbcSMatt Macy } 1606*eda14cbcSMatt Macy 1607*eda14cbcSMatt Macy /* Sync the uberblocks to all vdevs in svd[] */ 1608*eda14cbcSMatt Macy static int 1609*eda14cbcSMatt Macy vdev_uberblock_sync_list(vdev_t **svd, int svdcount, uberblock_t *ub, int flags) 1610*eda14cbcSMatt Macy { 1611*eda14cbcSMatt Macy spa_t *spa = svd[0]->vdev_spa; 1612*eda14cbcSMatt Macy zio_t *zio; 1613*eda14cbcSMatt Macy uint64_t good_writes = 0; 1614*eda14cbcSMatt Macy 1615*eda14cbcSMatt Macy zio = zio_root(spa, NULL, NULL, flags); 1616*eda14cbcSMatt Macy 1617*eda14cbcSMatt Macy for (int v = 0; v < svdcount; v++) 1618*eda14cbcSMatt Macy vdev_uberblock_sync(zio, &good_writes, ub, svd[v], flags); 1619*eda14cbcSMatt Macy 1620*eda14cbcSMatt Macy (void) zio_wait(zio); 1621*eda14cbcSMatt Macy 1622*eda14cbcSMatt Macy /* 1623*eda14cbcSMatt Macy * Flush the uberblocks to disk. This ensures that the odd labels 1624*eda14cbcSMatt Macy * are no longer needed (because the new uberblocks and the even 1625*eda14cbcSMatt Macy * labels are safely on disk), so it is safe to overwrite them. 1626*eda14cbcSMatt Macy */ 1627*eda14cbcSMatt Macy zio = zio_root(spa, NULL, NULL, flags); 1628*eda14cbcSMatt Macy 1629*eda14cbcSMatt Macy for (int v = 0; v < svdcount; v++) { 1630*eda14cbcSMatt Macy if (vdev_writeable(svd[v])) { 1631*eda14cbcSMatt Macy zio_flush(zio, svd[v]); 1632*eda14cbcSMatt Macy } 1633*eda14cbcSMatt Macy } 1634*eda14cbcSMatt Macy 1635*eda14cbcSMatt Macy (void) zio_wait(zio); 1636*eda14cbcSMatt Macy 1637*eda14cbcSMatt Macy return (good_writes >= 1 ? 0 : EIO); 1638*eda14cbcSMatt Macy } 1639*eda14cbcSMatt Macy 1640*eda14cbcSMatt Macy /* 1641*eda14cbcSMatt Macy * On success, increment the count of good writes for our top-level vdev. 1642*eda14cbcSMatt Macy */ 1643*eda14cbcSMatt Macy static void 1644*eda14cbcSMatt Macy vdev_label_sync_done(zio_t *zio) 1645*eda14cbcSMatt Macy { 1646*eda14cbcSMatt Macy uint64_t *good_writes = zio->io_private; 1647*eda14cbcSMatt Macy 1648*eda14cbcSMatt Macy if (zio->io_error == 0) 1649*eda14cbcSMatt Macy atomic_inc_64(good_writes); 1650*eda14cbcSMatt Macy } 1651*eda14cbcSMatt Macy 1652*eda14cbcSMatt Macy /* 1653*eda14cbcSMatt Macy * If there weren't enough good writes, indicate failure to the parent. 
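 * The good_writes counter was allocated by vdev_label_sync_list(); free it
 * here now that all label writes for this top-level vdev have completed.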
1654*eda14cbcSMatt Macy */ 1655*eda14cbcSMatt Macy static void 1656*eda14cbcSMatt Macy vdev_label_sync_top_done(zio_t *zio) 1657*eda14cbcSMatt Macy { 1658*eda14cbcSMatt Macy uint64_t *good_writes = zio->io_private; 1659*eda14cbcSMatt Macy 1660*eda14cbcSMatt Macy if (*good_writes == 0) 1661*eda14cbcSMatt Macy zio->io_error = SET_ERROR(EIO); 1662*eda14cbcSMatt Macy 1663*eda14cbcSMatt Macy kmem_free(good_writes, sizeof (uint64_t)); 1664*eda14cbcSMatt Macy } 1665*eda14cbcSMatt Macy 1666*eda14cbcSMatt Macy /* 1667*eda14cbcSMatt Macy * We ignore errors for log and cache devices, simply free the private data. 1668*eda14cbcSMatt Macy */ 1669*eda14cbcSMatt Macy static void 1670*eda14cbcSMatt Macy vdev_label_sync_ignore_done(zio_t *zio) 1671*eda14cbcSMatt Macy { 1672*eda14cbcSMatt Macy kmem_free(zio->io_private, sizeof (uint64_t)); 1673*eda14cbcSMatt Macy } 1674*eda14cbcSMatt Macy 1675*eda14cbcSMatt Macy /* 1676*eda14cbcSMatt Macy * Write all even or odd labels to all leaves of the specified vdev. 1677*eda14cbcSMatt Macy */ 1678*eda14cbcSMatt Macy static void 1679*eda14cbcSMatt Macy vdev_label_sync(zio_t *zio, uint64_t *good_writes, 1680*eda14cbcSMatt Macy vdev_t *vd, int l, uint64_t txg, int flags) 1681*eda14cbcSMatt Macy { 1682*eda14cbcSMatt Macy nvlist_t *label; 1683*eda14cbcSMatt Macy vdev_phys_t *vp; 1684*eda14cbcSMatt Macy abd_t *vp_abd; 1685*eda14cbcSMatt Macy char *buf; 1686*eda14cbcSMatt Macy size_t buflen; 1687*eda14cbcSMatt Macy 1688*eda14cbcSMatt Macy for (int c = 0; c < vd->vdev_children; c++) { 1689*eda14cbcSMatt Macy vdev_label_sync(zio, good_writes, 1690*eda14cbcSMatt Macy vd->vdev_child[c], l, txg, flags); 1691*eda14cbcSMatt Macy } 1692*eda14cbcSMatt Macy 1693*eda14cbcSMatt Macy if (!vd->vdev_ops->vdev_op_leaf) 1694*eda14cbcSMatt Macy return; 1695*eda14cbcSMatt Macy 1696*eda14cbcSMatt Macy if (!vdev_writeable(vd)) 1697*eda14cbcSMatt Macy return; 1698*eda14cbcSMatt Macy 1699*eda14cbcSMatt Macy /* 1700*eda14cbcSMatt Macy * Generate a label describing the top-level config to which we belong. 
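	 * The same packed nvlist is written to label slots l and l + 2 below,
	 * i.e. to either both even or both odd labels of this leaf.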
1701*eda14cbcSMatt Macy */ 1702*eda14cbcSMatt Macy label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE); 1703*eda14cbcSMatt Macy 1704*eda14cbcSMatt Macy vp_abd = abd_alloc_linear(sizeof (vdev_phys_t), B_TRUE); 1705*eda14cbcSMatt Macy abd_zero(vp_abd, sizeof (vdev_phys_t)); 1706*eda14cbcSMatt Macy vp = abd_to_buf(vp_abd); 1707*eda14cbcSMatt Macy 1708*eda14cbcSMatt Macy buf = vp->vp_nvlist; 1709*eda14cbcSMatt Macy buflen = sizeof (vp->vp_nvlist); 1710*eda14cbcSMatt Macy 1711*eda14cbcSMatt Macy if (!nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP)) { 1712*eda14cbcSMatt Macy for (; l < VDEV_LABELS; l += 2) { 1713*eda14cbcSMatt Macy vdev_label_write(zio, vd, l, vp_abd, 1714*eda14cbcSMatt Macy offsetof(vdev_label_t, vl_vdev_phys), 1715*eda14cbcSMatt Macy sizeof (vdev_phys_t), 1716*eda14cbcSMatt Macy vdev_label_sync_done, good_writes, 1717*eda14cbcSMatt Macy flags | ZIO_FLAG_DONT_PROPAGATE); 1718*eda14cbcSMatt Macy } 1719*eda14cbcSMatt Macy } 1720*eda14cbcSMatt Macy 1721*eda14cbcSMatt Macy abd_free(vp_abd); 1722*eda14cbcSMatt Macy nvlist_free(label); 1723*eda14cbcSMatt Macy } 1724*eda14cbcSMatt Macy 1725*eda14cbcSMatt Macy static int 1726*eda14cbcSMatt Macy vdev_label_sync_list(spa_t *spa, int l, uint64_t txg, int flags) 1727*eda14cbcSMatt Macy { 1728*eda14cbcSMatt Macy list_t *dl = &spa->spa_config_dirty_list; 1729*eda14cbcSMatt Macy vdev_t *vd; 1730*eda14cbcSMatt Macy zio_t *zio; 1731*eda14cbcSMatt Macy int error; 1732*eda14cbcSMatt Macy 1733*eda14cbcSMatt Macy /* 1734*eda14cbcSMatt Macy * Write the new labels to disk. 1735*eda14cbcSMatt Macy */ 1736*eda14cbcSMatt Macy zio = zio_root(spa, NULL, NULL, flags); 1737*eda14cbcSMatt Macy 1738*eda14cbcSMatt Macy for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd)) { 1739*eda14cbcSMatt Macy uint64_t *good_writes; 1740*eda14cbcSMatt Macy 1741*eda14cbcSMatt Macy ASSERT(!vd->vdev_ishole); 1742*eda14cbcSMatt Macy 1743*eda14cbcSMatt Macy good_writes = kmem_zalloc(sizeof (uint64_t), KM_SLEEP); 1744*eda14cbcSMatt Macy zio_t *vio = zio_null(zio, spa, NULL, 1745*eda14cbcSMatt Macy (vd->vdev_islog || vd->vdev_aux != NULL) ? 1746*eda14cbcSMatt Macy vdev_label_sync_ignore_done : vdev_label_sync_top_done, 1747*eda14cbcSMatt Macy good_writes, flags); 1748*eda14cbcSMatt Macy vdev_label_sync(vio, good_writes, vd, l, txg, flags); 1749*eda14cbcSMatt Macy zio_nowait(vio); 1750*eda14cbcSMatt Macy } 1751*eda14cbcSMatt Macy 1752*eda14cbcSMatt Macy error = zio_wait(zio); 1753*eda14cbcSMatt Macy 1754*eda14cbcSMatt Macy /* 1755*eda14cbcSMatt Macy * Flush the new labels to disk. 1756*eda14cbcSMatt Macy */ 1757*eda14cbcSMatt Macy zio = zio_root(spa, NULL, NULL, flags); 1758*eda14cbcSMatt Macy 1759*eda14cbcSMatt Macy for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd)) 1760*eda14cbcSMatt Macy zio_flush(zio, vd); 1761*eda14cbcSMatt Macy 1762*eda14cbcSMatt Macy (void) zio_wait(zio); 1763*eda14cbcSMatt Macy 1764*eda14cbcSMatt Macy return (error); 1765*eda14cbcSMatt Macy } 1766*eda14cbcSMatt Macy 1767*eda14cbcSMatt Macy /* 1768*eda14cbcSMatt Macy * Sync the uberblock and any changes to the vdev configuration. 1769*eda14cbcSMatt Macy * 1770*eda14cbcSMatt Macy * The order of operations is carefully crafted to ensure that 1771*eda14cbcSMatt Macy * if the system panics or loses power at any time, the state on disk 1772*eda14cbcSMatt Macy * is still transactionally consistent. The in-line comments below 1773*eda14cbcSMatt Macy * describe the failure semantics at each stage. 
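 *
 * In outline, each pass through this function:
 *
 * 1. Flushes the write cache of every disk written in this txg.
 * 2. Writes and flushes the even labels (L0, L2) of all dirty vdevs.
 * 3. Writes and flushes the new uberblock to the vdevs in svd[].
 * 4. Writes and flushes the odd labels (L1, L3) of all dirty vdevs.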
1774*eda14cbcSMatt Macy * 1775*eda14cbcSMatt Macy * Moreover, vdev_config_sync() is designed to be idempotent: if it fails 1776*eda14cbcSMatt Macy * at any time, you can just call it again, and it will resume its work. 1777*eda14cbcSMatt Macy */ 1778*eda14cbcSMatt Macy int 1779*eda14cbcSMatt Macy vdev_config_sync(vdev_t **svd, int svdcount, uint64_t txg) 1780*eda14cbcSMatt Macy { 1781*eda14cbcSMatt Macy spa_t *spa = svd[0]->vdev_spa; 1782*eda14cbcSMatt Macy uberblock_t *ub = &spa->spa_uberblock; 1783*eda14cbcSMatt Macy int error = 0; 1784*eda14cbcSMatt Macy int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL; 1785*eda14cbcSMatt Macy 1786*eda14cbcSMatt Macy ASSERT(svdcount != 0); 1787*eda14cbcSMatt Macy retry: 1788*eda14cbcSMatt Macy /* 1789*eda14cbcSMatt Macy * Normally, we don't want to try too hard to write every label and 1790*eda14cbcSMatt Macy * uberblock. If there is a flaky disk, we don't want the rest of the 1791*eda14cbcSMatt Macy * sync process to block while we retry. But if we can't write a 1792*eda14cbcSMatt Macy * single label out, we should retry with ZIO_FLAG_TRYHARD before 1793*eda14cbcSMatt Macy * bailing out and declaring the pool faulted. 1794*eda14cbcSMatt Macy */ 1795*eda14cbcSMatt Macy if (error != 0) { 1796*eda14cbcSMatt Macy if ((flags & ZIO_FLAG_TRYHARD) != 0) 1797*eda14cbcSMatt Macy return (error); 1798*eda14cbcSMatt Macy flags |= ZIO_FLAG_TRYHARD; 1799*eda14cbcSMatt Macy } 1800*eda14cbcSMatt Macy 1801*eda14cbcSMatt Macy ASSERT(ub->ub_txg <= txg); 1802*eda14cbcSMatt Macy 1803*eda14cbcSMatt Macy /* 1804*eda14cbcSMatt Macy * If this isn't a resync due to I/O errors, 1805*eda14cbcSMatt Macy * and nothing changed in this transaction group, 1806*eda14cbcSMatt Macy * and the vdev configuration hasn't changed, 1807*eda14cbcSMatt Macy * then there's nothing to do. 1808*eda14cbcSMatt Macy */ 1809*eda14cbcSMatt Macy if (ub->ub_txg < txg) { 1810*eda14cbcSMatt Macy boolean_t changed = uberblock_update(ub, spa->spa_root_vdev, 1811*eda14cbcSMatt Macy txg, spa->spa_mmp.mmp_delay); 1812*eda14cbcSMatt Macy 1813*eda14cbcSMatt Macy if (!changed && list_is_empty(&spa->spa_config_dirty_list)) 1814*eda14cbcSMatt Macy return (0); 1815*eda14cbcSMatt Macy } 1816*eda14cbcSMatt Macy 1817*eda14cbcSMatt Macy if (txg > spa_freeze_txg(spa)) 1818*eda14cbcSMatt Macy return (0); 1819*eda14cbcSMatt Macy 1820*eda14cbcSMatt Macy ASSERT(txg <= spa->spa_final_txg); 1821*eda14cbcSMatt Macy 1822*eda14cbcSMatt Macy /* 1823*eda14cbcSMatt Macy * Flush the write cache of every disk that's been written to 1824*eda14cbcSMatt Macy * in this transaction group. This ensures that all blocks 1825*eda14cbcSMatt Macy * written in this txg will be committed to stable storage 1826*eda14cbcSMatt Macy * before any uberblock that references them. 1827*eda14cbcSMatt Macy */ 1828*eda14cbcSMatt Macy zio_t *zio = zio_root(spa, NULL, NULL, flags); 1829*eda14cbcSMatt Macy 1830*eda14cbcSMatt Macy for (vdev_t *vd = 1831*eda14cbcSMatt Macy txg_list_head(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)); vd != NULL; 1832*eda14cbcSMatt Macy vd = txg_list_next(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg))) 1833*eda14cbcSMatt Macy zio_flush(zio, vd); 1834*eda14cbcSMatt Macy 1835*eda14cbcSMatt Macy (void) zio_wait(zio); 1836*eda14cbcSMatt Macy 1837*eda14cbcSMatt Macy /* 1838*eda14cbcSMatt Macy * Sync out the even labels (L0, L2) for every dirty vdev. 
If the 1839*eda14cbcSMatt Macy * system dies in the middle of this process, that's OK: all of the 1840*eda14cbcSMatt Macy * even labels that made it to disk will be newer than any uberblock, 1841*eda14cbcSMatt Macy * and will therefore be considered invalid. The odd labels (L1, L3), 1842*eda14cbcSMatt Macy * which have not yet been touched, will still be valid. We flush 1843*eda14cbcSMatt Macy * the new labels to disk to ensure that all even-label updates 1844*eda14cbcSMatt Macy * are committed to stable storage before the uberblock update. 1845*eda14cbcSMatt Macy */ 1846*eda14cbcSMatt Macy if ((error = vdev_label_sync_list(spa, 0, txg, flags)) != 0) { 1847*eda14cbcSMatt Macy if ((flags & ZIO_FLAG_TRYHARD) != 0) { 1848*eda14cbcSMatt Macy zfs_dbgmsg("vdev_label_sync_list() returned error %d " 1849*eda14cbcSMatt Macy "for pool '%s' when syncing out the even labels " 1850*eda14cbcSMatt Macy "of dirty vdevs", error, spa_name(spa)); 1851*eda14cbcSMatt Macy } 1852*eda14cbcSMatt Macy goto retry; 1853*eda14cbcSMatt Macy } 1854*eda14cbcSMatt Macy 1855*eda14cbcSMatt Macy /* 1856*eda14cbcSMatt Macy * Sync the uberblocks to all vdevs in svd[]. 1857*eda14cbcSMatt Macy * If the system dies in the middle of this step, there are two cases 1858*eda14cbcSMatt Macy * to consider, and the on-disk state is consistent either way: 1859*eda14cbcSMatt Macy * 1860*eda14cbcSMatt Macy * (1) If none of the new uberblocks made it to disk, then the 1861*eda14cbcSMatt Macy * previous uberblock will be the newest, and the odd labels 1862*eda14cbcSMatt Macy * (which had not yet been touched) will be valid with respect 1863*eda14cbcSMatt Macy * to that uberblock. 1864*eda14cbcSMatt Macy * 1865*eda14cbcSMatt Macy * (2) If one or more new uberblocks made it to disk, then they 1866*eda14cbcSMatt Macy * will be the newest, and the even labels (which had all 1867*eda14cbcSMatt Macy * been successfully committed) will be valid with respect 1868*eda14cbcSMatt Macy * to the new uberblocks. 1869*eda14cbcSMatt Macy */ 1870*eda14cbcSMatt Macy if ((error = vdev_uberblock_sync_list(svd, svdcount, ub, flags)) != 0) { 1871*eda14cbcSMatt Macy if ((flags & ZIO_FLAG_TRYHARD) != 0) { 1872*eda14cbcSMatt Macy zfs_dbgmsg("vdev_uberblock_sync_list() returned error " 1873*eda14cbcSMatt Macy "%d for pool '%s'", error, spa_name(spa)); 1874*eda14cbcSMatt Macy } 1875*eda14cbcSMatt Macy goto retry; 1876*eda14cbcSMatt Macy } 1877*eda14cbcSMatt Macy 1878*eda14cbcSMatt Macy if (spa_multihost(spa)) 1879*eda14cbcSMatt Macy mmp_update_uberblock(spa, ub); 1880*eda14cbcSMatt Macy 1881*eda14cbcSMatt Macy /* 1882*eda14cbcSMatt Macy * Sync out odd labels for every dirty vdev. If the system dies 1883*eda14cbcSMatt Macy * in the middle of this process, the even labels and the new 1884*eda14cbcSMatt Macy * uberblocks will suffice to open the pool. The next time 1885*eda14cbcSMatt Macy * the pool is opened, the first thing we'll do -- before any 1886*eda14cbcSMatt Macy * user data is modified -- is mark every vdev dirty so that 1887*eda14cbcSMatt Macy * all labels will be brought up to date. We flush the new labels 1888*eda14cbcSMatt Macy * to disk to ensure that all odd-label updates are committed to 1889*eda14cbcSMatt Macy * stable storage before the next transaction group begins. 
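	 * After this step, both the even and the odd labels of every dirty
	 * vdev are once again valid with respect to the new uberblock.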
1890*eda14cbcSMatt Macy */ 1891*eda14cbcSMatt Macy if ((error = vdev_label_sync_list(spa, 1, txg, flags)) != 0) { 1892*eda14cbcSMatt Macy if ((flags & ZIO_FLAG_TRYHARD) != 0) { 1893*eda14cbcSMatt Macy zfs_dbgmsg("vdev_label_sync_list() returned error %d " 1894*eda14cbcSMatt Macy "for pool '%s' when syncing out the odd labels of " 1895*eda14cbcSMatt Macy "dirty vdevs", error, spa_name(spa)); 1896*eda14cbcSMatt Macy } 1897*eda14cbcSMatt Macy goto retry; 1898*eda14cbcSMatt Macy } 1899*eda14cbcSMatt Macy 1900*eda14cbcSMatt Macy return (0); 1901*eda14cbcSMatt Macy } 1902
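
/*
 * Illustrative sketch (not part of the driver and excluded from the build):
 * a minimal userland mock-up of the selection rule implemented by
 * vdev_uberblock_compare() above -- prefer the higher txg, then the later
 * timestamp, then the higher MMP sequence number. The structure, names,
 * and values below are simplified stand-ins, not the real uberblock_t or
 * MMP macros.
 */
#if 0
#include <stdio.h>
#include <stdint.h>

struct ub_sketch {
	uint64_t txg;		/* transaction group that wrote this copy */
	uint64_t timestamp;	/* seconds since the epoch */
	uint64_t mmp_seq;	/* 0 when the writer was not MMP-aware */
};

/* Same ordering as vdev_uberblock_compare(): txg, then timestamp, then seq. */
static int
ub_sketch_compare(const struct ub_sketch *a, const struct ub_sketch *b)
{
	if (a->txg != b->txg)
		return (a->txg < b->txg ? -1 : 1);
	if (a->timestamp != b->timestamp)
		return (a->timestamp < b->timestamp ? -1 : 1);
	if (a->mmp_seq != b->mmp_seq)
		return (a->mmp_seq < b->mmp_seq ? -1 : 1);
	return (0);
}

int
main(void)
{
	/*
	 * The power-loss scenario described in the comment above
	 * vdev_uberblock_compare(): two on-disk copies share txg 42,
	 * so the later timestamp wins.
	 */
	struct ub_sketch stale = { .txg = 42, .timestamp = 1000, .mmp_seq = 0 };
	struct ub_sketch fresh = { .txg = 42, .timestamp = 1007, .mmp_seq = 0 };

	(void) printf("keep the %s copy\n",
	    ub_sketch_compare(&fresh, &stale) > 0 ? "fresh" : "stale");
	return (0);
}
#endif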