/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2018, 2019 by Delphix. All rights reserved.
 */

#include <sys/dmu_objset.h>
#include <sys/metaslab.h>
#include <sys/metaslab_impl.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/spa_log_spacemap.h>
#include <sys/vdev_impl.h>
#include <sys/zap.h>

/*
 * Log Space Maps
 *
 * Log space maps are an optimization in ZFS metadata allocations for pools
 * whose workloads are primarily random writes. Random-write workloads are
 * also typically random-free, meaning that they are freeing from locations
 * scattered throughout the pool. This means that each TXG we will have to
 * append some FREE records to almost every metaslab. With log space maps, we
 * hold the metaslabs' changes in memory and log them altogether in one
 * pool-wide space map on disk for persistence. As more blocks are
 * accumulated in the log space maps and more unflushed changes are accounted
 * in memory, we flush a selected group of metaslabs every TXG to relieve
 * memory pressure and potential overheads when loading the pool. Flushing a
 * metaslab to disk relieves memory as we flush any unflushed changes from
 * memory to disk (i.e. the metaslab's space map) and saves import time by
 * making old log space maps obsolete and eventually destroying them. [A log
 * space map is said to be obsolete when all its entries have made it to
 * their corresponding metaslab space maps].
 *
 * == On disk data structures used ==
 *
 * - The pool has a new feature flag and a new entry in the MOS. The feature
 *   is activated when we create the first log space map and remains active
 *   for the lifetime of the pool. The new entry in the MOS Directory [refer
 *   to DMU_POOL_LOG_SPACEMAP_ZAP] is populated with a ZAP whose key-value
 *   pairs are of the form <key: txg, value: log space map object for that
 *   txg> (see the illustrative layout at the end of this section). This
 *   entry is our on-disk reference of the log space maps that exist in the
 *   pool for each TXG and it is used during import to load all the metaslab
 *   unflushed changes in memory. To see how this structure is first created
 *   and later populated refer to spa_generate_syncing_log_sm(). To see how
 *   it is used during import time refer to spa_ld_log_sm_metadata().
 *
 * - Each vdev has a new entry in its vdev_top_zap (see field
 *   VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS) which holds the msp_unflushed_txg
 *   of each metaslab in this vdev. This field is the on-disk counterpart of
 *   the in-memory field ms_unflushed_txg which tells us from which TXG and
 *   onwards the metaslab hasn't had its changes flushed. During import, we
 *   use this to ignore any entries in the space map log that are for this
 *   metaslab but from a TXG before msp_unflushed_txg. At that point, we
 *   also populate its in-memory counterpart and from there both fields are
 *   updated every time we flush that metaslab.
 *
 * - A space map is created every TXG and, during that TXG, it is used to log
 *   all incoming changes (the log space map). When created, the log space
 *   map is referenced in memory by spa_syncing_log_sm and its object ID is
 *   inserted to the space map ZAP mentioned above. The log space map is
 *   closed at the end of the TXG and will be destroyed when it becomes fully
 *   obsolete. We know when a log space map has become obsolete by looking at
 *   the oldest (and smallest) ms_unflushed_txg in the pool. If that value is
 *   bigger than the log space map's TXG, then no metaslab is missing the
 *   changes from that log and we can therefore destroy it
 *   [see spa_cleanup_old_sm_logs()].
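 *
 * As an illustration of the structures above (all numbers are hypothetical),
 * the space map ZAP of a pool could hold the pairs:
 *
 *	<txg 634, object 4521>
 *	<txg 635, object 4530>
 *	<txg 636, object 4542>
 *
 * A metaslab whose msp_unflushed_txg is 636 would then have the entries of
 * objects 4521 and 4530 ignored during import and only the entries of
 * object 4542 (and any later logs) replayed into its unflushed trees.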
 *
 * == Important in-memory structures ==
 *
 * - The per-spa field spa_metaslabs_by_flushed sorts all the metaslabs in
 *   the pool by their ms_unflushed_txg field. It is primarily used for three
 *   reasons. First of all, it is used during flushing where we try to flush
 *   metaslabs in-order from the oldest-flushed to the most recently flushed
 *   every TXG. Secondly, it helps us to lookup the ms_unflushed_txg of the
 *   oldest flushed metaslab to distinguish which log space maps have become
 *   obsolete and which ones are still relevant. Finally it tells us which
 *   metaslabs have unflushed changes in a pool where this feature was just
 *   enabled, as we don't immediately add all of the pool's metaslabs but we
 *   add them over time as they go through metaslab_sync(). The reason that
 *   we do that is to ease these pools into the behavior of the flushing
 *   algorithm (described later on).
 *
 * - The per-spa field spa_sm_logs_by_txg can be thought of as the in-memory
 *   counterpart of the space map ZAP mentioned above. It's an AVL tree whose
 *   nodes represent the log space maps in the pool. This in-memory
 *   representation of log space maps in the pool sorts the log space maps by
 *   the TXG that they were created (which is also the TXG of their unflushed
 *   changes). It also contains the following extra information for each
 *   space map:
 *   [1] The number of metaslabs that were last flushed on that TXG. This is
 *       important because if that counter is zero and this is the oldest
 *       log then it means that it is also obsolete.
 *   [2] The number of blocks of that space map. This field is used by the
 *       block heuristic of our flushing algorithm (described later on).
 *       It represents how many blocks of metadata changes ZFS had to write
 *       to disk for that TXG.
 *
 * - The per-spa field spa_log_summary is a list of entries that summarizes
 *   the metaslab and block counts of all the nodes of the spa_sm_logs_by_txg
 *   AVL tree mentioned above. The reason this exists is that our flushing
 *   algorithm (described later) tries to estimate how many metaslabs to
 *   flush in each TXG by iterating over all the log space maps and looking
 *   at their block counts. Summarizing that information means that we don't
 *   have to iterate through each space map, minimizing the runtime overhead
 *   of the flushing algorithm which would be induced in syncing context. In
 *   terms of implementation the log summary is used as a queue (sketched
 *   below):
 *   * we modify or pop entries from its head when we flush metaslabs
 *   * we modify or append entries to its tail when we sync changes.
 *
 * - Each metaslab has two new range trees that hold its unflushed changes,
 *   ms_unflushed_allocs and ms_unflushed_frees. These are always disjoint.
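 *
 * A sketch of the queue discipline mentioned above (all ranges and counts
 * are hypothetical):
 *
 *	head -> [txgs 630-639: 12 metaslabs, 450 blocks] <- popped at flush
 *		[txgs 640-649:  9 metaslabs, 380 blocks]
 *	tail -> [txgs 650-651:  3 metaslabs, 120 blocks] <- appended at sync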
 *
 * == Flushing algorithm ==
 *
 * The decision of how many metaslabs to flush in a given TXG is guided by
 * two heuristics:
 *
 * [1] The memory heuristic -
 * We keep track of the memory used by the unflushed trees from all the
 * metaslabs [see sus_memused of spa_unflushed_stats] and we ensure that it
 * stays below a certain threshold which is determined by an arbitrary hard
 * limit and an arbitrary percentage of the system's memory [see
 * spa_log_exceeds_memlimit()]. When we see that the memory usage of the
 * unflushed changes is passing that threshold, we flush metaslabs, which
 * empties their unflushed range trees, reducing the memory used.
 *
 * [2] The block heuristic -
 * We try to keep the total number of blocks in the log space maps in check
 * so the log doesn't grow indefinitely and we don't induce a lot of overhead
 * when loading the pool. At the same time we don't want to flush a lot of
 * metaslabs too often as this would defeat the purpose of the log space map.
 * As a result we set a limit on the number of blocks that we consider
 * acceptable for the log space maps to have and try not to cross it
 * [see sus_blocklimit from spa_unflushed_stats].
 *
 * In order to stay below the block limit every TXG we have to estimate how
 * many metaslabs we need to flush based on the current rate of incoming
 * blocks and our history of log space map blocks. The main idea here is to
 * answer the question of how many metaslabs we need to flush in order to
 * get rid of at least X log space map blocks. We can answer this question
 * by iterating backwards from the oldest log space map to the newest one
 * and looking at their metaslab and block counts. At this point the log
 * summary mentioned above comes in handy as it reduces the amount of things
 * that we have to iterate (even though it may reduce the preciseness of our
 * estimates due to its aggregation of data). So with that in mind, we
 * project the incoming rate of the current TXG into the future and attempt
 * to approximate how many metaslabs we would need to flush from now in
 * order to avoid exceeding our block limit at different points in the
 * future (granted that we would keep flushing the same number of metaslabs
 * for every TXG). Then we take the maximum number from all these estimates
 * to be on the safe side. For the exact implementation details of the
 * algorithm, refer to spa_estimate_metaslabs_to_flush().
 */

/*
 * This is used as the block size for the space maps used for the
 * log space map feature. These space maps benefit from a bigger
 * block size as we expect to be writing a lot of data to them at
 * once.
 */
static const unsigned long zfs_log_sm_blksz = 1ULL << 17;

/*
 * Percentage of the overall system's memory that ZFS allows to be
 * used for unflushed changes (e.g. the sum of size of all the nodes
 * in the unflushed trees).
 *
 * Note that this value is calculated over 1000000 for finer granularity
 * (thus the _ppm suffix; reads as "parts per million"). As an example,
 * the default of 1000 allows 0.1% of memory to be used.
 */
static unsigned long zfs_unflushed_max_mem_ppm = 1000;
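
/*
 * Illustrative arithmetic for the ppm tunable above (the machine size is
 * hypothetical): on a system with 64GB of physical memory, the default of
 * 1000 ppm works out to 64GB * 1000 / 1000000 = 64MB of unflushed changes.
 * Since the hard limit below (zfs_unflushed_max_mem_amt) defaults to 1GB,
 * the effective threshold is the smaller of the two, i.e. 64MB
 * [see spa_log_exceeds_memlimit()].
 */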

/*
 * Specific hard-limit in memory that ZFS allows to be used for
 * unflushed changes.
 */
static unsigned long zfs_unflushed_max_mem_amt = 1ULL << 30;

/*
 * The following tunable determines the number of blocks that can be used for
 * the log space maps. It is expressed as a percentage of the total number of
 * metaslabs in the pool (i.e. the default of 400 means that the number of
 * log blocks is capped at 4 times the number of metaslabs).
 *
 * This value exists to tune our flushing algorithm, with higher values
 * flushing metaslabs less often (doing less I/Os) per TXG versus lower
 * values flushing metaslabs more aggressively with the upside of saving
 * overheads when loading the pool. Another factor in this tradeoff is that
 * flushing less often can potentially lead to better utilization of the
 * metaslab space map's block size as we accumulate more changes per flush.
 *
 * Since this tunable indirectly controls the flush rate (metaslabs flushed
 * per txg), expressing it as a percentage of the number of metaslabs in the
 * pool makes sense here.
 *
 * As a rule of thumb we default this tunable to 400% based on the following:
 *
 * 1] Assuming a constant flush rate and a constant incoming rate of log
 *    blocks it is reasonable to expect that the amount of obsolete entries
 *    changes linearly from txg to txg (e.g. the oldest log should have the
 *    most obsolete entries, and the most recent one the least). With this
 *    we could say that, at any given time, about half of the entries in the
 *    whole space map log are obsolete. Thus for every two entries for a
 *    metaslab in the log space map, only one of them is valid and actually
 *    makes it to the metaslab's space map.
 *    [factor of 2]
 * 2] Each entry in the log space map is guaranteed to be two words while
 *    entries in metaslab space maps are generally single-word.
 *    [an extra factor of 2 - 400% overall]
 * 3] Even if [1] and [2] are slightly less than 2 each, we haven't taken
 *    into account any consolidation of segments from the log space map to
 *    the unflushed range trees nor their history (e.g. a segment being
 *    allocated, then freed, then allocated again means 3 log space map
 *    entries but 0 metaslab space map entries). Depending on the workload,
 *    we've seen ~1.8 non-obsolete log space map entries per metaslab entry,
 *    for a total of ~600%. Since most of these estimates though are
 *    workload dependent, we default to 400% to be conservative.
 *
 * Thus we could say that even in the worst case of [1] and [2], the factor
 * should end up being 4.
 *
 * That said, regardless of the number of metaslabs in the pool we need to
 * provide upper and lower bounds for the log block limit
 * [see zfs_unflushed_log_block_{min,max}].
 */
static unsigned long zfs_unflushed_log_block_pct = 400;

/*
 * If the number of metaslabs is small and our incoming rate is high, we
 * could get into a situation where we are flushing all our metaslabs every
 * TXG. Thus we always allow at least this many log blocks.
 */
static unsigned long zfs_unflushed_log_block_min = 1000;
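
/*
 * Illustrative arithmetic for the block limit (the pool size is
 * hypothetical): with 1000 metaslabs carrying unflushed changes and the
 * default zfs_unflushed_log_block_pct of 400, the limit comes out to
 * 1000 * 400 / 100 = 4000 log blocks, which is then clamped to the
 * [zfs_unflushed_log_block_min, zfs_unflushed_log_block_max] range
 * [see spa_log_sm_set_blocklimit()].
 */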

/*
 * If the log becomes too big, the import time of the pool can take a hit in
 * terms of performance. Thus we have a hard limit in the size of the log in
 * terms of blocks.
 */
static unsigned long zfs_unflushed_log_block_max = (1ULL << 17);

/*
 * Also we have a hard limit in the size of the log in terms of dirty TXGs.
 */
static unsigned long zfs_unflushed_log_txg_max = 1000;

/*
 * Max # of rows allowed for the log_summary. The tradeoff here is accuracy
 * and stability of the flushing algorithm (longer summary) vs its runtime
 * overhead (smaller summary is faster to traverse).
 */
static unsigned long zfs_max_logsm_summary_length = 10;

/*
 * Tunable that sets the lower bound on the metaslabs to flush every TXG.
 *
 * Setting this to 0 has no effect since if the pool is idle we won't even be
 * creating log space maps and therefore we won't be flushing. On the other
 * hand if the pool has any incoming workload our block heuristic will start
 * flushing metaslabs anyway.
 *
 * The point of this tunable is to be used in extreme cases where we really
 * want to flush more metaslabs than our adaptable heuristic plans to flush.
 */
static unsigned long zfs_min_metaslabs_to_flush = 1;

/*
 * Tunable that specifies how far in the past we want to look when trying to
 * estimate the incoming log blocks for the current TXG.
 *
 * Setting this too high may not only increase runtime but also dilute the
 * effect of the incoming rates from the most recent TXGs, since we take the
 * average over all the blocks that we walk
 * [see spa_estimate_incoming_log_blocks].
 */
static unsigned long zfs_max_log_walking = 5;

/*
 * This tunable exists solely for testing purposes. It ensures that the log
 * spacemaps are not flushed and destroyed during export in order for the
 * relevant log spacemap import code paths to be tested (effectively
 * simulating a crash).
 */
int zfs_keep_log_spacemaps_at_export = 0;

static uint64_t
spa_estimate_incoming_log_blocks(spa_t *spa)
{
	ASSERT3U(spa_sync_pass(spa), ==, 1);
	uint64_t steps = 0, sum = 0;
	for (spa_log_sm_t *sls = avl_last(&spa->spa_sm_logs_by_txg);
	    sls != NULL && steps < zfs_max_log_walking;
	    sls = AVL_PREV(&spa->spa_sm_logs_by_txg, sls)) {
		if (sls->sls_txg == spa_syncing_txg(spa)) {
			/*
			 * Skip the log created in this TXG as this would
			 * make our estimations inaccurate.
			 */
			continue;
		}
		sum += sls->sls_nblocks;
		steps++;
	}
	return ((steps > 0) ? DIV_ROUND_UP(sum, steps) : 0);
}
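
/*
 * For instance (the block counts are hypothetical): if the last five closed
 * logs hold 12, 9, 14, 10 and 8 blocks, the function above estimates the
 * incoming rate as DIV_ROUND_UP(53, 5) = 11 blocks per TXG. Rounding up
 * biases the estimate slightly towards flushing more.
 */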

uint64_t
spa_log_sm_blocklimit(spa_t *spa)
{
	return (spa->spa_unflushed_stats.sus_blocklimit);
}

void
spa_log_sm_set_blocklimit(spa_t *spa)
{
	if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP)) {
		ASSERT0(spa_log_sm_blocklimit(spa));
		return;
	}

	uint64_t msdcount = 0;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e; e = list_next(&spa->spa_log_summary, e))
		msdcount += e->lse_msdcount;

	uint64_t limit = msdcount * zfs_unflushed_log_block_pct / 100;
	spa->spa_unflushed_stats.sus_blocklimit = MIN(MAX(limit,
	    zfs_unflushed_log_block_min), zfs_unflushed_log_block_max);
}

uint64_t
spa_log_sm_nblocks(spa_t *spa)
{
	return (spa->spa_unflushed_stats.sus_nblocks);
}

/*
 * Ensure that the in-memory log space map structures and the summary
 * have the same block and metaslab counts.
 */
static void
spa_log_summary_verify_counts(spa_t *spa)
{
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP));

	if ((zfs_flags & ZFS_DEBUG_LOG_SPACEMAP) == 0)
		return;

	uint64_t ms_in_avl = avl_numnodes(&spa->spa_metaslabs_by_flushed);

	uint64_t ms_in_summary = 0, blk_in_summary = 0;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e; e = list_next(&spa->spa_log_summary, e)) {
		ms_in_summary += e->lse_mscount;
		blk_in_summary += e->lse_blkcount;
	}

	uint64_t ms_in_logs = 0, blk_in_logs = 0;
	for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
	    sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
		ms_in_logs += sls->sls_mscount;
		blk_in_logs += sls->sls_nblocks;
	}

	VERIFY3U(ms_in_logs, ==, ms_in_summary);
	VERIFY3U(ms_in_logs, ==, ms_in_avl);
	VERIFY3U(blk_in_logs, ==, blk_in_summary);
	VERIFY3U(blk_in_logs, ==, spa_log_sm_nblocks(spa));
}

static boolean_t
summary_entry_is_full(spa_t *spa, log_summary_entry_t *e, uint64_t txg)
{
	if (e->lse_end == txg)
		return (0);
	if (e->lse_txgcount >= DIV_ROUND_UP(zfs_unflushed_log_txg_max,
	    zfs_max_logsm_summary_length))
		return (1);
	uint64_t blocks_per_row = MAX(1,
	    DIV_ROUND_UP(spa_log_sm_blocklimit(spa),
	    zfs_max_logsm_summary_length));
	return (blocks_per_row <= e->lse_blkcount);
}
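
/*
 * A worked example of the capacity checks above (the block limit is
 * hypothetical): with the defaults of zfs_unflushed_log_txg_max = 1000 and
 * zfs_max_logsm_summary_length = 10, an entry is considered full once it
 * spans DIV_ROUND_UP(1000, 10) = 100 TXGs. Similarly, with a block limit
 * of 4000 it is considered full once it accounts for
 * DIV_ROUND_UP(4000, 10) = 400 blocks.
 */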

/*
 * Update the log summary information to reflect the fact that a metaslab
 * was flushed or destroyed (e.g. due to device removal or pool
 * export/destroy).
 *
 * We typically flush the oldest flushed metaslab so the first (and oldest)
 * entry of the summary is updated. However if that metaslab is getting
 * loaded we may flush the second oldest one which may be part of an entry
 * later in the summary. Moreover, if we call into this function from
 * metaslab_fini() the metaslabs probably won't be ordered by
 * ms_unflushed_txg. Thus we ask for a txg as an argument so we can locate
 * the appropriate summary entry for the metaslab.
 */
void
spa_log_summary_decrement_mscount(spa_t *spa, uint64_t txg, boolean_t dirty)
{
	/*
	 * We don't track summary data for read-only pools and this function
	 * can be called from metaslab_fini(). In that case return
	 * immediately.
	 */
	if (!spa_writeable(spa))
		return;

	log_summary_entry_t *target = NULL;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e != NULL; e = list_next(&spa->spa_log_summary, e)) {
		if (e->lse_start > txg)
			break;
		target = e;
	}

	if (target == NULL || target->lse_mscount == 0) {
		/*
		 * We didn't find a summary entry for this metaslab. We must
		 * be at the teardown of a spa_load() attempt that got an
		 * error while reading the log space maps.
		 */
		VERIFY3S(spa_load_state(spa), ==, SPA_LOAD_ERROR);
		return;
	}

	target->lse_mscount--;
	if (dirty)
		target->lse_msdcount--;
}

/*
 * Update the log summary information to reflect the fact that we destroyed
 * old log space maps. Since we can only destroy the oldest log space maps,
 * we decrement the block count of the oldest summary entry and potentially
 * destroy it when that count hits 0.
 *
 * This function is called after a metaslab is flushed and typically that
 * metaslab is the oldest flushed, which means that this function will
 * typically decrement the block count of the first entry of the summary and
 * potentially free it if the block count gets to zero (its metaslab count
 * should be zero too at that point).
 *
 * There are certain scenarios though that don't work exactly like that so
 * we need to account for them:
 *
 * Scenario [1]: It is possible that after we flushed the oldest flushed
 * metaslab and we destroyed the oldest log space map, more recent logs had 0
 * metaslabs pointing to them so we got rid of them too. This can happen due
 * to metaslabs being destroyed through device removal, or because the oldest
 * flushed metaslab was loading but we kept flushing more recently flushed
 * metaslabs due to the memory pressure of unflushed changes. Because of
 * that, we always iterate from the beginning of the summary and if
 * blocks_gone is bigger than the block_count of the current entry we free
 * that entry (we expect its metaslab count to be zero), decrement
 * blocks_gone, and move on to the next entry, repeating this procedure
 * until blocks_gone gets decremented to 0. Doing this also works for the
 * typical case mentioned above.
 *
 * Scenario [2]: The oldest flushed metaslab isn't necessarily accounted by
 * the first (and oldest) entry in the summary. If the first few entries of
 * the summary were only accounting metaslabs from a device that was just
 * removed, then the current oldest flushed metaslab could be accounted by
 * an entry somewhere in the middle of the summary. Moreover flushing that
 * metaslab will destroy all the log space maps older than its
 * ms_unflushed_txg because they became obsolete after the removal. Thus,
 * iterating as we did for scenario [1] works out for this case too.
 *
 * Scenario [3]: At times we decide to flush all the metaslabs in the pool
 * in one TXG (either because we are exporting the pool or because our
 * flushing heuristics decided to do so). When that happens all the log
 * space maps get destroyed except the one created for the current TXG which
 * doesn't have any log blocks yet. As log space maps get destroyed with
 * every metaslab that we flush, entries in the summary are also destroyed.
 * This brings a weird corner-case when we flush the last metaslab and the
 * log space map of the current TXG is in the same summary entry with other
 * log space maps that are older. When that happens we are eventually left
 * with this one last summary entry whose blocks are gone (blocks_gone
 * equals the entry's block count) but its metaslab count is non-zero
 * (because it accounts all the metaslabs in the pool as they all got
 * flushed). Under this scenario we can't free this last summary entry as
 * it's referencing all the metaslabs in the pool and its block count will
 * get incremented at the end of this sync (when we close the syncing log
 * space map). Thus we just decrement its current block count and leave it
 * alone. In the case that the pool gets exported, its metaslab count will
 * be decremented over time as we call metaslab_fini() for all the metaslabs
 * in the pool and the entry will be freed at spa_unload_log_sm_metadata().
 */
void
spa_log_summary_decrement_blkcount(spa_t *spa, uint64_t blocks_gone)
{
	log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	if (e->lse_txgcount > 0)
		e->lse_txgcount--;
	for (; e != NULL; e = list_head(&spa->spa_log_summary)) {
		if (e->lse_blkcount > blocks_gone) {
			e->lse_blkcount -= blocks_gone;
			blocks_gone = 0;
			break;
		} else if (e->lse_mscount == 0) {
			/* remove obsolete entry */
			blocks_gone -= e->lse_blkcount;
			list_remove(&spa->spa_log_summary, e);
			kmem_free(e, sizeof (log_summary_entry_t));
		} else {
			/* Verify that this is scenario [3] mentioned above. */
			VERIFY3U(blocks_gone, ==, e->lse_blkcount);

			/*
			 * Assert that this is scenario [3] further by ensuring
			 * that this is the only entry in the summary.
			 */
			VERIFY3P(e, ==, list_tail(&spa->spa_log_summary));
			ASSERT3P(e, ==, list_head(&spa->spa_log_summary));

			blocks_gone = e->lse_blkcount = 0;
			break;
		}
	}

	/*
	 * Ensure that there is no way we are trying to remove more blocks
	 * than the # of blocks in the summary.
	 */
	ASSERT0(blocks_gone);
}
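
/*
 * A short trace of the loop above for scenario [1] (all counts are
 * hypothetical): say the summary holds entries with block counts
 * {50, 30, 20} and blocks_gone is 70. The first entry has a metaslab count
 * of 0, so its 50 blocks are consumed and the entry is freed, leaving
 * blocks_gone = 20; the second entry then absorbs the remainder (30 > 20),
 * ending up with a block count of 10 and blocks_gone = 0.
 */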

void
spa_log_sm_decrement_mscount(spa_t *spa, uint64_t txg)
{
	spa_log_sm_t target = { .sls_txg = txg };
	spa_log_sm_t *sls = avl_find(&spa->spa_sm_logs_by_txg,
	    &target, NULL);

	if (sls == NULL) {
		/*
		 * We must be at the teardown of a spa_load() attempt that
		 * got an error while reading the log space maps.
		 */
		VERIFY3S(spa_load_state(spa), ==, SPA_LOAD_ERROR);
		return;
	}

	ASSERT(sls->sls_mscount > 0);
	sls->sls_mscount--;
}

void
spa_log_sm_increment_current_mscount(spa_t *spa)
{
	spa_log_sm_t *last_sls = avl_last(&spa->spa_sm_logs_by_txg);
	ASSERT3U(last_sls->sls_txg, ==, spa_syncing_txg(spa));
	last_sls->sls_mscount++;
}

static void
summary_add_data(spa_t *spa, uint64_t txg, uint64_t metaslabs_flushed,
    uint64_t metaslabs_dirty, uint64_t nblocks)
{
	log_summary_entry_t *e = list_tail(&spa->spa_log_summary);

	if (e == NULL || summary_entry_is_full(spa, e, txg)) {
		e = kmem_zalloc(sizeof (log_summary_entry_t), KM_SLEEP);
		e->lse_start = e->lse_end = txg;
		e->lse_txgcount = 1;
		list_insert_tail(&spa->spa_log_summary, e);
	}

	ASSERT3U(e->lse_start, <=, txg);
	if (e->lse_end < txg) {
		e->lse_end = txg;
		e->lse_txgcount++;
	}
	e->lse_mscount += metaslabs_flushed;
	e->lse_msdcount += metaslabs_dirty;
	e->lse_blkcount += nblocks;
}

static void
spa_log_summary_add_incoming_blocks(spa_t *spa, uint64_t nblocks)
{
	summary_add_data(spa, spa_syncing_txg(spa), 0, 0, nblocks);
}

void
spa_log_summary_add_flushed_metaslab(spa_t *spa, boolean_t dirty)
{
	summary_add_data(spa, spa_syncing_txg(spa), 1, dirty ? 1 : 0, 0);
}

void
spa_log_summary_dirty_flushed_metaslab(spa_t *spa, uint64_t txg)
{
	log_summary_entry_t *target = NULL;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e != NULL; e = list_next(&spa->spa_log_summary, e)) {
		if (e->lse_start > txg)
			break;
		target = e;
	}
	ASSERT3P(target, !=, NULL);
	ASSERT3U(target->lse_mscount, !=, 0);
	target->lse_msdcount++;
}

/*
 * This function attempts to estimate how many metaslabs we should
 * flush to satisfy our block heuristic for the log spacemap
 * for the upcoming TXGs.
 *
 * Specifically, it first tries to estimate the number of incoming
 * blocks in this TXG. Then by projecting that incoming rate to
 * future TXGs and using the log summary, it figures out how many
 * flushes we would need to do for future TXGs individually to
 * stay below our block limit and returns the maximum number of
 * flushes from those estimates.
 */
static uint64_t
spa_estimate_metaslabs_to_flush(spa_t *spa)
{
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP));
	ASSERT3U(spa_sync_pass(spa), ==, 1);
	ASSERT(spa_log_sm_blocklimit(spa) != 0);

	/*
	 * This variable contains the incoming rate that will be projected
	 * and used for our flushing estimates in the future.
	 */
	uint64_t incoming = spa_estimate_incoming_log_blocks(spa);

	/*
	 * At any point in time this variable tells us how many
	 * TXGs in the future we are so we can make our estimations.
	 */
	uint64_t txgs_in_future = 1;

	/*
	 * This variable tells us how much room we have until we hit
	 * our limit. When it goes negative, it means that we've exceeded
	 * our limit and we need to flush.
	 *
	 * Note that since we start at the first TXG in the future (i.e.
	 * txgs_in_future starts from 1) we already decrement this
	 * variable by the incoming rate.
	 */
	int64_t available_blocks =
	    spa_log_sm_blocklimit(spa) - spa_log_sm_nblocks(spa) - incoming;

	int64_t available_txgs = zfs_unflushed_log_txg_max;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e; e = list_next(&spa->spa_log_summary, e))
		available_txgs -= e->lse_txgcount;

	/*
	 * This variable tells us the total number of flushes needed to
	 * keep the log size within the limit when we reach txgs_in_future.
	 */
	uint64_t total_flushes = 0;

	/* Holds the current maximum of our estimates so far. */
	uint64_t max_flushes_pertxg = zfs_min_metaslabs_to_flush;

	/*
	 * For our estimations we only look as far in the future
	 * as the summary allows us.
	 */
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e; e = list_next(&spa->spa_log_summary, e)) {

		/*
		 * If there is still room before we exceed our limit
		 * then keep skipping TXGs accumulating more blocks
		 * based on the incoming rate until we exceed it.
		 */
		if (available_blocks >= 0 && available_txgs >= 0) {
			uint64_t skip_txgs = MIN(available_txgs + 1,
			    (available_blocks / incoming) + 1);
			available_blocks -= (skip_txgs * incoming);
			available_txgs -= skip_txgs;
			txgs_in_future += skip_txgs;
			ASSERT3S(available_blocks, >=, -incoming);
			ASSERT3S(available_txgs, >=, -1);
		}

		/*
		 * At this point we're far enough into the future where
		 * the limit was just exceeded and we flush metaslabs
		 * based on the current entry in the summary, updating
		 * our available_blocks.
		 */
		ASSERT(available_blocks < 0 || available_txgs < 0);
		available_blocks += e->lse_blkcount;
		available_txgs += e->lse_txgcount;
		total_flushes += e->lse_msdcount;

		/*
		 * Keep the running maximum of the total_flushes that
		 * we've done so far over the number of TXGs in the
		 * future that we are. The idea here is to estimate
		 * the average number of flushes that we should do
		 * every TXG so that when we are that many TXGs in the
		 * future we stay under the limit.
		 */
		max_flushes_pertxg = MAX(max_flushes_pertxg,
		    DIV_ROUND_UP(total_flushes, txgs_in_future));
	}
	return (max_flushes_pertxg);
}
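
/*
 * A compressed trace of the loop above (all numbers are hypothetical and
 * the TXG bound is ignored): with a block limit of 4000, 3800 blocks in
 * the log and an incoming rate of 100 blocks/TXG, available_blocks starts
 * at 4000 - 3800 - 100 = 100. Given a first summary entry of {500 blocks,
 * 10 dirty metaslabs}, we can skip 2 more TXGs before exceeding the limit
 * (txgs_in_future becomes 3); "flushing" that entry then adds its 500
 * blocks back and yields an estimate of DIV_ROUND_UP(10, 3) = 4 flushes
 * per TXG. Subsequent entries are processed the same way and the maximum
 * of all the estimates is returned.
 */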

uint64_t
spa_log_sm_memused(spa_t *spa)
{
	return (spa->spa_unflushed_stats.sus_memused);
}

static boolean_t
spa_log_exceeds_memlimit(spa_t *spa)
{
	if (spa_log_sm_memused(spa) > zfs_unflushed_max_mem_amt)
		return (B_TRUE);

	uint64_t system_mem_allowed = ((physmem * PAGESIZE) *
	    zfs_unflushed_max_mem_ppm) / 1000000;
	if (spa_log_sm_memused(spa) > system_mem_allowed)
		return (B_TRUE);

	return (B_FALSE);
}

boolean_t
spa_flush_all_logs_requested(spa_t *spa)
{
	return (spa->spa_log_flushall_txg != 0);
}

void
spa_flush_metaslabs(spa_t *spa, dmu_tx_t *tx)
{
	uint64_t txg = dmu_tx_get_txg(tx);

	if (spa_sync_pass(spa) != 1)
		return;

	if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
		return;

	/*
	 * If we don't have any metaslabs with unflushed changes
	 * return immediately.
	 */
	if (avl_numnodes(&spa->spa_metaslabs_by_flushed) == 0)
		return;

	/*
	 * During SPA export we leave a few empty TXGs to go by [see
	 * spa_final_dirty_txg() to understand why]. For this specific
	 * case, it is important to not flush any metaslabs as that
	 * would dirty this TXG.
	 *
	 * That said, during one of these dirty TXGs that is less than or
	 * equal to spa_final_dirty(), spa_unload() will request that we
	 * try to flush all the metaslabs for that TXG before exporting
	 * the pool. Thus we check whether a request to flush everything
	 * was made before we attempt to return immediately.
	 */
	if (spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
	    !dmu_objset_is_dirty(spa_meta_objset(spa), txg) &&
	    !spa_flush_all_logs_requested(spa))
		return;

	/*
	 * We need to generate a log space map before flushing because this
	 * will set up the in-memory data (i.e. node in spa_sm_logs_by_txg)
	 * for this TXG's flushed metaslab count (aka sls_mscount which is
	 * manipulated in many ways down the metaslab_flush() codepath).
	 *
	 * Note that this does not mean we may generate a log space map
	 * when we don't need one: if we are flushing metaslabs, we were
	 * going to write changes to disk anyway, so a log space map would
	 * have been created in metaslab_sync() regardless.
	 */
	spa_generate_syncing_log_sm(spa, tx);

	/*
	 * This variable tells us how many metaslabs we want to flush based
	 * on the block-heuristic of our flushing algorithm (see block comment
	 * of log space map feature). We also decrement this as we flush
	 * metaslabs and attempt to destroy old log space maps.
	 */
	uint64_t want_to_flush;
	if (spa_flush_all_logs_requested(spa)) {
		ASSERT3S(spa_state(spa), ==, POOL_STATE_EXPORTED);
		want_to_flush = UINT64_MAX;
	} else {
		want_to_flush = spa_estimate_metaslabs_to_flush(spa);
	}

	/* Used purely for verification purposes */
	uint64_t visited = 0;

	/*
	 * Ideally we would only iterate through spa_metaslabs_by_flushed
	 * using only one variable (curr). We can't do that because
	 * metaslab_flush() mutates the position of curr in the AVL tree
	 * when it flushes that metaslab by moving it to the end of the
	 * tree. Thus we always keep track of the original next node of the
	 * current node (curr) in another variable (next).
	 */
	metaslab_t *next = NULL;
	for (metaslab_t *curr = avl_first(&spa->spa_metaslabs_by_flushed);
	    curr != NULL; curr = next) {
		next = AVL_NEXT(&spa->spa_metaslabs_by_flushed, curr);

		/*
		 * If this metaslab has been flushed this txg then we've done
		 * a full circle over the metaslabs.
		 */
		if (metaslab_unflushed_txg(curr) == txg)
			break;

		/*
		 * If we are done flushing for the block heuristic and the
		 * unflushed changes don't exceed the memory limit just stop.
		 */
		if (want_to_flush == 0 && !spa_log_exceeds_memlimit(spa))
			break;

		if (metaslab_unflushed_dirty(curr)) {
			mutex_enter(&curr->ms_sync_lock);
			mutex_enter(&curr->ms_lock);
			metaslab_flush(curr, tx);
			mutex_exit(&curr->ms_lock);
			mutex_exit(&curr->ms_sync_lock);
			if (want_to_flush > 0)
				want_to_flush--;
		} else
			metaslab_unflushed_bump(curr, tx, B_FALSE);

		visited++;
	}
	ASSERT3U(avl_numnodes(&spa->spa_metaslabs_by_flushed), >=, visited);

	spa_log_sm_set_blocklimit(spa);
}

/*
 * Close the log space map for this TXG and update the block counts
 * for the log's in-memory structure and the summary.
 */
void
spa_sync_close_syncing_log_sm(spa_t *spa)
{
	if (spa_syncing_log_sm(spa) == NULL)
		return;
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP));

	spa_log_sm_t *sls = avl_last(&spa->spa_sm_logs_by_txg);
	ASSERT3U(sls->sls_txg, ==, spa_syncing_txg(spa));

	sls->sls_nblocks = space_map_nblocks(spa_syncing_log_sm(spa));
	spa->spa_unflushed_stats.sus_nblocks += sls->sls_nblocks;

	/*
	 * Note that we can't assert that sls_mscount is not 0,
	 * because there is the case where the first metaslab
	 * in spa_metaslabs_by_flushed is loading and we were
	 * not able to flush any metaslabs in the current TXG.
	 */
	ASSERT(sls->sls_nblocks != 0);

	spa_log_summary_add_incoming_blocks(spa, sls->sls_nblocks);
	spa_log_summary_verify_counts(spa);

	space_map_close(spa->spa_syncing_log_sm);
	spa->spa_syncing_log_sm = NULL;

	/*
	 * At this point we tried to flush as many metaslabs as we
	 * can as the pool is getting exported. Reset the "flush all"
	 * request so the last few TXGs before closing the pool can be
	 * empty (e.g. not dirty).
	 */
	if (spa_flush_all_logs_requested(spa)) {
		ASSERT3S(spa_state(spa), ==, POOL_STATE_EXPORTED);
		spa->spa_log_flushall_txg = 0;
	}
}

void
spa_cleanup_old_sm_logs(spa_t *spa, dmu_tx_t *tx)
{
	objset_t *mos = spa_meta_objset(spa);

	uint64_t spacemap_zap;
	int error = zap_lookup(mos, DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1,
	    &spacemap_zap);
	if (error == ENOENT) {
		ASSERT(avl_is_empty(&spa->spa_sm_logs_by_txg));
		return;
	}
	VERIFY0(error);

	metaslab_t *oldest = avl_first(&spa->spa_metaslabs_by_flushed);
	uint64_t oldest_flushed_txg = metaslab_unflushed_txg(oldest);

	/* Free all log space maps older than the oldest_flushed_txg. */
	for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
	    sls && sls->sls_txg < oldest_flushed_txg;
	    sls = avl_first(&spa->spa_sm_logs_by_txg)) {
		ASSERT0(sls->sls_mscount);
		avl_remove(&spa->spa_sm_logs_by_txg, sls);
		space_map_free_obj(mos, sls->sls_sm_obj, tx);
		VERIFY0(zap_remove_int(mos, spacemap_zap, sls->sls_txg, tx));
		spa_log_summary_decrement_blkcount(spa, sls->sls_nblocks);
		spa->spa_unflushed_stats.sus_nblocks -= sls->sls_nblocks;
		kmem_free(sls, sizeof (spa_log_sm_t));
	}
}

static spa_log_sm_t *
spa_log_sm_alloc(uint64_t sm_obj, uint64_t txg)
{
	spa_log_sm_t *sls = kmem_zalloc(sizeof (*sls), KM_SLEEP);
	sls->sls_sm_obj = sm_obj;
	sls->sls_txg = txg;
	return (sls);
}

void
spa_generate_syncing_log_sm(spa_t *spa, dmu_tx_t *tx)
{
	uint64_t txg = dmu_tx_get_txg(tx);
	objset_t *mos = spa_meta_objset(spa);

	if (spa_syncing_log_sm(spa) != NULL)
		return;

	if (!spa_feature_is_enabled(spa, SPA_FEATURE_LOG_SPACEMAP))
		return;

	uint64_t spacemap_zap;
	int error = zap_lookup(mos, DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1,
	    &spacemap_zap);
	if (error == ENOENT) {
		ASSERT(avl_is_empty(&spa->spa_sm_logs_by_txg));

		error = 0;
		spacemap_zap = zap_create(mos,
		    DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx);
		VERIFY0(zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
		    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1,
		    &spacemap_zap, tx));
		spa_feature_incr(spa, SPA_FEATURE_LOG_SPACEMAP, tx);
	}
	VERIFY0(error);

	uint64_t sm_obj;
	ASSERT3U(zap_lookup_int_key(mos, spacemap_zap, txg, &sm_obj),
	    ==, ENOENT);
	sm_obj = space_map_alloc(mos, zfs_log_sm_blksz, tx);
	VERIFY0(zap_add_int_key(mos, spacemap_zap, txg, sm_obj, tx));
	avl_add(&spa->spa_sm_logs_by_txg, spa_log_sm_alloc(sm_obj, txg));

	/*
	 * We pass UINT64_MAX as the space map's representation size
	 * and SPA_MINBLOCKSHIFT as the shift, to make the space map
	 * accept any sorts of segments since there's no real advantage
	 * to being more restrictive (given that we're already going
	 * to be using 2-word entries).
	 */
	VERIFY0(space_map_open(&spa->spa_syncing_log_sm, mos, sm_obj,
	    0, UINT64_MAX, SPA_MINBLOCKSHIFT));

	spa_log_sm_set_blocklimit(spa);
}

/*
 * Find all the log space maps stored in the space map ZAP and sort
 * them by their TXG in spa_sm_logs_by_txg.
 */
static int
spa_ld_log_sm_metadata(spa_t *spa)
{
	int error;
	uint64_t spacemap_zap;

	ASSERT(avl_is_empty(&spa->spa_sm_logs_by_txg));

	error = zap_lookup(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1,
	    &spacemap_zap);
	if (error == ENOENT) {
		/* the space map ZAP doesn't exist yet */
		return (0);
	} else if (error != 0) {
		spa_load_failed(spa, "spa_ld_log_sm_metadata(): failed at "
		    "zap_lookup(DMU_POOL_DIRECTORY_OBJECT) [error %d]",
		    error);
		return (error);
	}

	zap_cursor_t zc;
	zap_attribute_t za;
	for (zap_cursor_init(&zc, spa_meta_objset(spa), spacemap_zap);
	    (error = zap_cursor_retrieve(&zc, &za)) == 0;
	    zap_cursor_advance(&zc)) {
		uint64_t log_txg = zfs_strtonum(za.za_name, NULL);
		spa_log_sm_t *sls =
		    spa_log_sm_alloc(za.za_first_integer, log_txg);
		avl_add(&spa->spa_sm_logs_by_txg, sls);
	}
	zap_cursor_fini(&zc);
	if (error != ENOENT) {
		spa_load_failed(spa, "spa_ld_log_sm_metadata(): failed at "
		    "zap_cursor_retrieve(spacemap_zap) [error %d]",
		    error);
		return (error);
	}

	for (metaslab_t *m = avl_first(&spa->spa_metaslabs_by_flushed);
	    m; m = AVL_NEXT(&spa->spa_metaslabs_by_flushed, m)) {
		spa_log_sm_t target = { .sls_txg = metaslab_unflushed_txg(m) };
		spa_log_sm_t *sls = avl_find(&spa->spa_sm_logs_by_txg,
		    &target, NULL);

		/*
		 * At this point if sls is NULL it means that a bug occurred
		 * in ZFS the last time the pool was open or earlier in the
		 * import code path. In general, we would have placed a
		 * VERIFY() here or in this case just let the kernel panic
		 * with NULL pointer dereference when incrementing
		 * sls_mscount, but since this is the import code path we can
		 * be a bit more lenient. Thus, for DEBUG bits we always
		 * cause a panic, while in production we log the error and
		 * just fail the import.
		 */
		ASSERT(sls != NULL);
		if (sls == NULL) {
			spa_load_failed(spa, "spa_ld_log_sm_metadata(): bug "
			    "encountered: could not find log spacemap for "
			    "TXG %llu [error %d]",
			    (u_longlong_t)metaslab_unflushed_txg(m), ENOENT);
			return (ENOENT);
		}
		sls->sls_mscount++;
	}

	return (0);
}

typedef struct spa_ld_log_sm_arg {
	spa_t *slls_spa;
	uint64_t slls_txg;
} spa_ld_log_sm_arg_t;

static int
spa_ld_log_sm_cb(space_map_entry_t *sme, void *arg)
{
	uint64_t offset = sme->sme_offset;
	uint64_t size = sme->sme_run;
	uint32_t vdev_id = sme->sme_vdev;

	spa_ld_log_sm_arg_t *slls = arg;
	spa_t *spa = slls->slls_spa;

	vdev_t *vd = vdev_lookup_top(spa, vdev_id);

	/*
	 * If the vdev has been removed (i.e. it is indirect or a hole)
	 * skip this entry. The contents of this vdev have already moved
	 * elsewhere.
	 */
	if (!vdev_is_concrete(vd))
		return (0);

	metaslab_t *ms = vd->vdev_ms[offset >> vd->vdev_ms_shift];
	ASSERT(!ms->ms_loaded);

	/*
	 * If we have already flushed entries for this TXG to this
	 * metaslab's space map, then ignore it. Note that we flush
	 * before processing any allocations/frees for that TXG, so
	 * the metaslab's space map only has entries from *before*
	 * the unflushed TXG.
	 */
	if (slls->slls_txg < metaslab_unflushed_txg(ms))
		return (0);

	switch (sme->sme_type) {
	case SM_ALLOC:
		range_tree_remove_xor_add_segment(offset, offset + size,
		    ms->ms_unflushed_frees, ms->ms_unflushed_allocs);
		break;
	case SM_FREE:
		range_tree_remove_xor_add_segment(offset, offset + size,
		    ms->ms_unflushed_allocs, ms->ms_unflushed_frees);
		break;
	default:
		panic("invalid maptype_t");
		break;
	}
	if (!metaslab_unflushed_dirty(ms)) {
		metaslab_set_unflushed_dirty(ms, B_TRUE);
		spa_log_summary_dirty_flushed_metaslab(spa,
		    metaslab_unflushed_txg(ms));
	}
	return (0);
}
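
/*
 * To illustrate the xor-add replay in the callback above (the segment is
 * hypothetical): if a log entry allocates [0x1000, 0x2000), the segment
 * ends up in ms_unflushed_allocs; if a later log entry frees that same
 * range, the segment is simply removed from ms_unflushed_allocs instead of
 * being added to ms_unflushed_frees. This cancellation is what keeps the
 * two unflushed trees disjoint.
 */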

static int
spa_ld_log_sm_data(spa_t *spa)
{
	spa_log_sm_t *sls, *psls;
	int error = 0;

	/*
	 * If we are not going to do any writes there is no need
	 * to read the log space maps.
	 */
	if (!spa_writeable(spa))
		return (0);

	ASSERT0(spa->spa_unflushed_stats.sus_nblocks);
	ASSERT0(spa->spa_unflushed_stats.sus_memused);

	hrtime_t read_logs_starttime = gethrtime();

	/* Prefetch log spacemaps dnodes. */
	for (sls = avl_first(&spa->spa_sm_logs_by_txg); sls;
	    sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
		dmu_prefetch(spa_meta_objset(spa), sls->sls_sm_obj,
		    0, 0, 0, ZIO_PRIORITY_SYNC_READ);
	}

	uint_t pn = 0;
	uint64_t ps = 0;
	psls = sls = avl_first(&spa->spa_sm_logs_by_txg);
	while (sls != NULL) {
		/*
		 * Prefetch up to 16 log spacemaps ahead, as long as their
		 * combined length stays below 2 * dmu_prefetch_max bytes
		 * (the first two are prefetched regardless of size).
		 */
		if (psls != NULL && pn < 16 &&
		    (pn < 2 || ps < 2 * dmu_prefetch_max)) {
			error = space_map_open(&psls->sls_sm,
			    spa_meta_objset(spa), psls->sls_sm_obj, 0,
			    UINT64_MAX, SPA_MINBLOCKSHIFT);
			if (error != 0) {
				spa_load_failed(spa, "spa_ld_log_sm_data(): "
				    "failed at space_map_open(obj=%llu) "
				    "[error %d]",
				    (u_longlong_t)sls->sls_sm_obj, error);
				goto out;
			}
			dmu_prefetch(spa_meta_objset(spa), psls->sls_sm_obj,
			    0, 0, space_map_length(psls->sls_sm),
			    ZIO_PRIORITY_ASYNC_READ);
			pn++;
			ps += space_map_length(psls->sls_sm);
			psls = AVL_NEXT(&spa->spa_sm_logs_by_txg, psls);
			continue;
		}

		/* Load TXG log spacemap into ms_unflushed_allocs/frees. */
		kpreempt(KPREEMPT_SYNC);
		ASSERT0(sls->sls_nblocks);
		sls->sls_nblocks = space_map_nblocks(sls->sls_sm);
		spa->spa_unflushed_stats.sus_nblocks += sls->sls_nblocks;
		summary_add_data(spa, sls->sls_txg,
		    sls->sls_mscount, 0, sls->sls_nblocks);

		struct spa_ld_log_sm_arg vla = {
			.slls_spa = spa,
			.slls_txg = sls->sls_txg
		};
		error = space_map_iterate(sls->sls_sm,
		    space_map_length(sls->sls_sm), spa_ld_log_sm_cb, &vla);
		if (error != 0) {
			spa_load_failed(spa, "spa_ld_log_sm_data(): failed "
			    "at space_map_iterate(obj=%llu) [error %d]",
			    (u_longlong_t)sls->sls_sm_obj, error);
			goto out;
		}

		pn--;
		ps -= space_map_length(sls->sls_sm);
		space_map_close(sls->sls_sm);
		sls->sls_sm = NULL;
		sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls);

		/*
		 * Update the log block limit to account for the blocks
		 * we just loaded.
		 */
		spa_log_sm_set_blocklimit(spa);
	}

	hrtime_t read_logs_endtime = gethrtime();
	spa_load_note(spa,
	    "read %llu log space maps (%llu total blocks - blksz = %llu bytes) "
	    "in %lld ms", (u_longlong_t)avl_numnodes(&spa->spa_sm_logs_by_txg),
	    (u_longlong_t)spa_log_sm_nblocks(spa),
	    (u_longlong_t)zfs_log_sm_blksz,
	    (longlong_t)((read_logs_endtime - read_logs_starttime) / 1000000));

out:
	if (error != 0) {
		for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
		    sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
			if (sls->sls_sm) {
				space_map_close(sls->sls_sm);
				sls->sls_sm = NULL;
			}
		}
	} else {
		ASSERT0(pn);
		ASSERT0(ps);
	}
	/*
	 * Now that the metaslabs contain their unflushed changes:
	 * [1] recalculate their actual allocated space
	 * [2] recalculate their weights
	 * [3] sum up the memory usage of their unflushed range trees
	 * [4] optionally load them, if debug_load is set
	 *
	 * Note that even in the case where we get here because of an
	 * error (e.g. error != 0), we still want to update the fields
	 * below in order to have a proper teardown in spa_unload().
	 */
	for (metaslab_t *m = avl_first(&spa->spa_metaslabs_by_flushed);
	    m != NULL; m = AVL_NEXT(&spa->spa_metaslabs_by_flushed, m)) {
		mutex_enter(&m->ms_lock);
		m->ms_allocated_space = space_map_allocated(m->ms_sm) +
		    range_tree_space(m->ms_unflushed_allocs) -
		    range_tree_space(m->ms_unflushed_frees);

		vdev_t *vd = m->ms_group->mg_vd;
		metaslab_space_update(vd, m->ms_group->mg_class,
		    range_tree_space(m->ms_unflushed_allocs), 0, 0);
		metaslab_space_update(vd, m->ms_group->mg_class,
		    -range_tree_space(m->ms_unflushed_frees), 0, 0);

		ASSERT0(m->ms_weight & METASLAB_ACTIVE_MASK);
		metaslab_recalculate_weight_and_sort(m);

		spa->spa_unflushed_stats.sus_memused +=
		    metaslab_unflushed_changes_memused(m);

		if (metaslab_debug_load && m->ms_sm != NULL) {
			VERIFY0(metaslab_load(m));
			metaslab_set_selected_txg(m, 0);
		}
		mutex_exit(&m->ms_lock);
	}

	return (error);
}
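
/*
 * A note on the prefetch window used above (the dmu_prefetch_max value is
 * hypothetical): if dmu_prefetch_max were 64MB, then up to 16 log space
 * maps would be opened and prefetched ahead of the map currently being
 * iterated, as long as their combined length stayed below 128MB. This
 * keeps reads pipelined during import without prefetching an unbounded
 * amount of metadata.
 */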

static int
spa_ld_unflushed_txgs(vdev_t *vd)
{
	spa_t *spa = vd->vdev_spa;
	objset_t *mos = spa_meta_objset(spa);

	if (vd->vdev_top_zap == 0)
		return (0);

	uint64_t object = 0;
	int error = zap_lookup(mos, vd->vdev_top_zap,
	    VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS,
	    sizeof (uint64_t), 1, &object);
	if (error == ENOENT)
		return (0);
	else if (error != 0) {
		spa_load_failed(spa, "spa_ld_unflushed_txgs(): failed at "
		    "zap_lookup(vdev_top_zap=%llu) [error %d]",
		    (u_longlong_t)vd->vdev_top_zap, error);
		return (error);
	}

	for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
		metaslab_t *ms = vd->vdev_ms[m];
		ASSERT(ms != NULL);

		metaslab_unflushed_phys_t entry;
		uint64_t entry_size = sizeof (entry);
		uint64_t entry_offset = ms->ms_id * entry_size;

		error = dmu_read(mos, object,
		    entry_offset, entry_size, &entry, 0);
		if (error != 0) {
			spa_load_failed(spa, "spa_ld_unflushed_txgs(): "
			    "failed at dmu_read(obj=%llu) [error %d]",
			    (u_longlong_t)object, error);
			return (error);
		}

		ms->ms_unflushed_txg = entry.msp_unflushed_txg;
		ms->ms_unflushed_dirty = B_FALSE;
		ASSERT(range_tree_is_empty(ms->ms_unflushed_allocs));
		ASSERT(range_tree_is_empty(ms->ms_unflushed_frees));
		if (ms->ms_unflushed_txg != 0) {
			mutex_enter(&spa->spa_flushed_ms_lock);
			avl_add(&spa->spa_metaslabs_by_flushed, ms);
			mutex_exit(&spa->spa_flushed_ms_lock);
		}
	}
	return (0);
}
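
/*
 * The object read above is a flat array of metaslab_unflushed_phys_t
 * entries indexed by metaslab id. As a hypothetical example, with an entry
 * size of 8 bytes (a single uint64_t TXG), the entry for metaslab 37 would
 * live at byte offset 37 * 8 = 296 of the
 * VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS object.
 */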

/*
 * Read all the log space map entries into their respective
 * metaslab unflushed trees and keep them sorted by TXG in the
 * SPA's metadata. In addition, setup all the metadata for the
 * memory and the block heuristics.
 */
int
spa_ld_log_spacemaps(spa_t *spa)
{
	int error;

	spa_log_sm_set_blocklimit(spa);

	for (uint64_t c = 0; c < spa->spa_root_vdev->vdev_children; c++) {
		vdev_t *vd = spa->spa_root_vdev->vdev_child[c];
		error = spa_ld_unflushed_txgs(vd);
		if (error != 0)
			return (error);
	}

	error = spa_ld_log_sm_metadata(spa);
	if (error != 0)
		return (error);

	/*
	 * Note: we don't actually expect anything to change at this point
	 * but we grab the config lock so we don't fail any assertions
	 * when using vdev_lookup_top().
	 */
	spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
	error = spa_ld_log_sm_data(spa);
	spa_config_exit(spa, SCL_CONFIG, FTAG);

	return (error);
}

/* BEGIN CSTYLED */
ZFS_MODULE_PARAM(zfs, zfs_, unflushed_max_mem_amt, ULONG, ZMOD_RW,
	"Specific hard-limit in memory that ZFS allows to be used for "
	"unflushed changes");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_max_mem_ppm, ULONG, ZMOD_RW,
	"Percentage of the overall system memory that ZFS allows to be "
	"used for unflushed changes (value is calculated over 1000000 for "
	"finer granularity)");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_block_max, ULONG, ZMOD_RW,
	"Hard limit (upper-bound) in the size of the space map log "
	"in terms of blocks.");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_block_min, ULONG, ZMOD_RW,
	"Lower-bound limit for the maximum amount of blocks allowed in "
	"log spacemap (see zfs_unflushed_log_block_max)");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_txg_max, ULONG, ZMOD_RW,
	"Hard limit (upper-bound) in the size of the space map log "
	"in terms of dirty TXGs.");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_block_pct, ULONG, ZMOD_RW,
	"Tunable used to determine the number of blocks that can be used for "
	"the spacemap log, expressed as a percentage of the total number of "
	"metaslabs in the pool (e.g. 400 means the number of log blocks is "
	"capped at 4 times the number of metaslabs)");

ZFS_MODULE_PARAM(zfs, zfs_, max_log_walking, ULONG, ZMOD_RW,
	"The number of past TXGs that the flushing algorithm of the log "
	"spacemap feature uses to estimate incoming log blocks");

ZFS_MODULE_PARAM(zfs, zfs_, keep_log_spacemaps_at_export, INT, ZMOD_RW,
	"Prevent the log spacemaps from being flushed and destroyed "
	"during pool export/destroy");
/* END CSTYLED */

ZFS_MODULE_PARAM(zfs, zfs_, max_logsm_summary_length, ULONG, ZMOD_RW,
	"Maximum number of rows allowed in the summary of the spacemap log");

ZFS_MODULE_PARAM(zfs, zfs_, min_metaslabs_to_flush, ULONG, ZMOD_RW,
	"Minimum number of metaslabs to flush per dirty TXG");